In a previous tutorial, we showed you how to setup Galera replication for MariaDB so you could have multi-node, multi-master clusters. We also saw how MariaDB painlessly replicates data. In this tutorial, we’re doing to forcibly interject a little pain into the process by seeing how Galera deals with down loads and replaced nodes.
Once again, we’re using the following cluster setup:
In this design, I’m using the following cluster and anonymized IPs:
- db1.lowend.party on 1.1.1.1
- db2.lowend.party on 2.2.2.2
- db3.lowend.party on 3.3.3.3
Bigger Data
Let’s start by making some bigger data. We’re still using our Buffy the Vampire Slayer-themed table created in the previous tutorial, but if you didn’t follow that one, you can set it up easily:
root@dbsecure:~# mysql Welcome to the MariaDB monitor. Commands end with ; or \g. Your MariaDB connection id is 47 Server version: 10.3.22-MariaDB-0+deb10u1 Debian 10 Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. MariaDB [(none)]> create database gtest; Query OK, 1 row affected (0.000 sec) MariaDB [(none)]> use gtest; Database changed MariaDB [gtest]> create table buffy ( name varchar(30), team_role varchar(100) ); Query OK, 0 rows affected (0.012 sec) MariaDB [gtest]>
Now I’d like to load a bigger quantity of data. Start by shutting down db2 and db3. I’m doing this because we’re in synchronous mode and the initial data load will be much faster. On db2 and db3:
systemctl stop mariadb
On db1, I’m going to load 250,000 rows. Here’s the perl script I used to generate the random data. Place this in /root/make_load.pl:
#!/usr/bin/perl -w @lets = qw ( a b c d e f g h i j k l m n o p q r s t u v w x y z ); sub random_word { my $word = ""; for ( my $i = 1; $i<=30; $i++ ) { $c = int(rand($#lets)); $word .= $lets[$c]; } return $word; } open (OUT,">load.sql"); print OUT "use gtest;\n"; for ( $i = 1; $i <=250000; $i++ ) { $name = random_word(); $tr = random_word(); print OUT "insert into buffy (name, team_role) values ( '$name', '$tr');\n"; if ( $i % 10000 == 0 ) { print "$i done\n"; } }
Now run it:
root@db1:~# chmod 755 make_load.pl root@db1:~# time ./make_load.pl 10000 done 20000 done 30000 done 40000 done 50000 done 60000 done 70000 done 80000 done 90000 done 100000 done 110000 done 120000 done 130000 done 140000 done 150000 done 160000 done 170000 done 180000 done 190000 done 200000 done 210000 done 220000 done 230000 done 240000 done 250000 done real 0m5.541s user 0m5.472s sys 0m0.052s
And now load this data into our table:
root@db1:~# time mysql < ./load.sql real 3m28.575s user 0m3.124s sys 0m2.008s
And verifying that we now have 250,009 rows:
MariaDB [gtest]> select count(*) from buffy; +----------+ | count(*) | +----------+ | 250009 | +----------+ 1 row in set (0.111 sec)
Now startup nodes db2 and db3 with:
systemctl start mysql
Let’s see where db2 is at:
MariaDB [gtest]> select count(*) from buffy; ERROR 1047 (08S01): WSREP has not yet prepared node for application use
This is because Galera has discovered that its table is out of sync. Take a look in /var/log/mysql/error.log. You’ll see messages about “IST” which is an “incremental state transfer” (see below):
2020-07-22 9:03:55 0 [Note] WSREP: Receiving IST... 9.9% ( 26832/269926 events) complete. 2020-07-22 9:04:05 0 [Note] WSREP: Receiving IST... 13.9% ( 37456/269926 events) complete. 2020-07-22 9:04:15 0 [Note] WSREP: Receiving IST... 17.8% ( 48032/269926 events) complete. 2020-07-22 9:04:25 0 [Note] WSREP: Receiving IST... 23.3% ( 62864/269926 events) complete. 2020-07-22 9:04:35 0 [Note] WSREP: Receiving IST... 28.7% ( 77488/269926 events) complete. 2020-07-22 9:04:45 0 [Note] WSREP: Receiving IST... 33.8% ( 91344/269926 events) complete. 2020-07-22 9:04:55 0 [Note] WSREP: Receiving IST... 38.5% (103840/269926 events) complete. 2020-07-22 9:05:05 0 [Note] WSREP: Receiving IST... 42.2% (113888/269926 events) complete. 2020-07-22 9:05:15 0 [Note] WSREP: Receiving IST... 46.0% (124192/269926 events) complete. 2020-07-22 9:05:25 0 [Note] WSREP: Receiving IST... 50.9% (137360/269926 events) complete. 2020-07-22 9:05:35 0 [Note] WSREP: Receiving IST... 56.0% (151024/269926 events) complete. 2020-07-22 9:05:45 0 [Note] WSREP: Receiving IST... 61.5% (165952/269926 events) complete. 2020-07-22 9:05:55 0 [Note] WSREP: Receiving IST... 67.2% (181328/269926 events) complete. 2020-07-22 9:06:05 0 [Note] WSREP: Receiving IST... 72.2% (194880/269926 events) complete. 2020-07-22 9:06:16 0 [Note] WSREP: Receiving IST... 77.1% (208192/269926 events) complete. 2020-07-22 9:06:26 0 [Note] WSREP: Receiving IST... 82.3% (222160/269926 events) complete. 2020-07-22 9:06:36 0 [Note] WSREP: Receiving IST... 88.0% (237424/269926 events) complete. 2020-07-22 9:06:46 0 [Note] WSREP: Receiving IST... 93.9% (253344/269926 events) complete. 2020-07-22 9:06:51 0 [Warning] WSREP: Could not find peer: 2020-07-22 9:06:51 0 [Note] WSREP: 1.0 (db2): State transfer from -1.-1 (left the group) complete. 2020-07-22 9:06:51 0 [Note] WSREP: Member 1.0 (db2) synced with group. 2020-07-22 9:06:56 0 [Note] WSREP: Receiving IST... 99.6% (268880/269926 events) complete. 2020-07-22 9:06:56 0 [Note] WSREP: Receiving IST...100.0% (269926/269926 events) complete. 2020-07-22 9:06:56 1 [Note] WSREP: IST received: 0757dbf9-cc32-11ea-8f14-7fd9119b8ded:276248 2020-07-22 9:06:57 0 [Note] WSREP: 2.0 (db3): State transfer from 0.0 (db1) complete. 2020-07-22 9:06:57 0 [Note] WSREP: Shifting JOINER -> JOINED (TO: 276248) 2020-07-22 9:06:57 0 [Note] WSREP: Member 2.0 (db3) synced with group. 2020-07-22 9:06:57 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 276248) 2020-07-22 9:06:57 1 [Note] WSREP: Synchronized with group, ready for connections 2020-07-22 9:06:57 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Once you see the “Synchronized with group” message, the node is in sync:
MariaDB [gtest]> select count(*) from buffy; +----------+ | count(*) | +----------+ | 250009 | +----------+ 1 row in set (0.095 sec)
Replication Modes
A Galera node can pull from any other node – there is no concept of a “master” node.
Galera has two replication modes:
- Incremental State Transfer (IST) – in this mode, only missing transactions are sent
- State Snapshot Transfer (SST) – in this mode, a full data copy is done
As you see above, the messages about “Receiving IST” indicate that IST is being used.
Let’s look at an SST example and then we’ll discuss how the choice between IST and SST is made.
Simulating a Node Failure
I took the db3 node and reinitialized it on the provider panel (back to a Debian 10 base). This simulates the complete failure of a node and the need to replace it.
After reinstalling mariadb-server and rsync, I added the galera.cnf and restarted MariaDB. It joined itself to the cluster. I didn’t need to touch db1 or db2 because their cluster knowledge hasn’t changed. The cluster definition is still the same.
Here is some of what I saw in /var/log/mysql/error.log:
2020-07-22 13:48:03 1 [Note] WSREP: New cluster view: global state: b270c869-cc53-11ea-9e49-7ec801f4f158:250011, view# 9: Primary, number of nodes: 3, my index: 1, protocol version 3 2020-07-22 13:48:03 1 [Warning] WSREP: Gap in state sequence. Need state transfer. 2020-07-22 13:48:03 0 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '3.3.3.3' --datadir '/var/lib/mysql/' --parent '10989' --mysqld-args --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1' 2020-07-22 13:48:03 1 [Note] WSREP: Prepared SST request: rsync|3.3.3.3:4444/rsync_sst 2020-07-22 13:48:03 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. 2020-07-22 13:48:03 1 [Note] WSREP: REPL Protocols: 9 (4, 2) 2020-07-22 13:48:03 1 [Note] WSREP: Assign initial position for certification: 250011, protocol version: 4 2020-07-22 13:48:03 0 [Note] WSREP: Service thread queue flushed. 2020-07-22 13:48:03 1 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (b270c869-cc53-11ea-9e49-7ec801f4f158): 1 (Operation not permitted) at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable. 2020-07-22 13:48:03 0 [Note] WSREP: Member 1.0 (db3) requested state transfer from '*any*'. Selected 0.0 (db2)(SYNCED) as donor. 2020-07-22 13:48:03 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 250011) 2020-07-22 13:48:03 1 [Note] WSREP: Requesting state transfer: success, donor: 0 2020-07-22 13:48:03 1 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> b270c869-cc53-11ea-9e49-7ec801f4f158:250011 2020-07-22 13:48:05 0 [Note] WSREP: (a2649e0d, 'tcp://0.0.0.0:4567') connection to peer a2649e0d with addr tcp://3.3.3.3:4567 timed out, no messages seen in PT3S 2020-07-22 13:48:05 0 [Note] WSREP: (a2649e0d, 'tcp://0.0.0.0:4567') turning message relay requesting off WSREP_SST: [INFO] Joiner cleanup. rsync PID: 11046 (20200722 13:48:10.514) 2020-07-22 13:48:10 0 [Note] WSREP: 0.0 (db2): State transfer to 1.0 (db3) complete. 2020-07-22 13:48:10 0 [Note] WSREP: Member 0.0 (db2) synced with group. WSREP_SST: [INFO] Joiner cleanup done. (20200722 13:48:11.024) 2020-07-22 13:48:11 0 [Note] WSREP: SST complete, seqno: 250011
As you can see, SST was used, which makes sense since it’s initializing from scratch.
How is the choice between IST and SST made? It mainly depends on the gcache_size value, which is 128MB by default. If the data requested (the missing transactions) is in this cache, MariaDB will use IST. If it is not, then SST is required. So on a busy DB server with a lot of change, often data will come and go from the cache quickly. Keep in mind that our 250,000 rows are only about 17MB of space:
root@db1:~# mysqldump gtest > /tmp/dump.sql root@db1:~# du -sh /tmp/dump.sql 17M /tmp/dump.sql
You can see your gcache_size value by
MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE ‘wsrep_provider_options’\G
You will see all the options, which is a big dump of variables. Here is an extract showing gcache_size:
gcache.page_size = 128M;
gcache.recover = no;
gcache.size = 128M;
gcomm.thread_prio = ;
gcs.fc_debug = 0;
You can increase this if you wish in the galera.cnf:
wsrep_provider_options="gcache.size=1G"
Related Posts:
- LowEndBoxTV: Are All Hetzner Locations Alike?No!And Piotr Has the Proof: “This is Clearly Not the Same Processor” - November 22, 2024
- One Week From Tomorrow…THE WORLD WILL LOSE THEIR MINDS!Lines Are Already Forming! - November 21, 2024
- Crunchbits Discontinuing Popular Annual Plans – The Community Mourns! - November 20, 2024
Leave a Reply