LowEndBox - Cheap VPS, Hosting and Dedicated Server Deals

Nodes and Modes: Maintaining Galera Replication for MariaDB

Tags: , , , , Date/Time: September 23, 2021 @ 12:00 am, by raindog308

GaleraIn a previous tutorial, we showed you how to setup Galera replication for MariaDB so you could have multi-node, multi-master clusters.  We also saw how MariaDB painlessly replicates data.  In this tutorial, we’re doing to forcibly interject a little pain into the process by seeing how Galera deals with down loads and replaced nodes.

Once again, we’re using the following cluster setup:

In this design, I’m using the following cluster and anonymized IPs:

  • db1.lowend.party on 1.1.1.1
  • db2.lowend.party on 2.2.2.2
  • db3.lowend.party on 3.3.3.3

Bigger Data

Let’s start by making some bigger data.  We’re still using our Buffy the Vampire Slayer-themed table created in the previous tutorial, but if you didn’t follow that one, you can set it up easily:

root@dbsecure:~# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 47
Server version: 10.3.22-MariaDB-0+deb10u1 Debian 10
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]> create database gtest;
Query OK, 1 row affected (0.000 sec)

MariaDB [(none)]> use gtest;
Database changed

MariaDB [gtest]> create table buffy ( name varchar(30), team_role varchar(100) );
Query OK, 0 rows affected (0.012 sec)

MariaDB [gtest]> 

Now I’d like to load a bigger quantity of data.  Start by shutting down db2 and db3.  I’m doing this because we’re in synchronous mode and the initial data load will be much faster.  On db2 and db3:

systemctl stop mariadb

On db1, I’m going to load 250,000 rows.  Here’s the perl script I used to generate the random data.  Place this in /root/make_load.pl:

#!/usr/bin/perl -w

@lets = qw ( a b c d e f g h i j k l m n o p q r s t u v w x y z );

sub random_word {
  my $word = "";
  for ( my $i = 1; $i<=30; $i++ ) {
    $c = int(rand($#lets));
    $word .= $lets[$c];
  }
  return $word;
}

open (OUT,">load.sql");
print OUT "use gtest;\n";
for ( $i = 1; $i <=250000; $i++ ) {
  $name = random_word();
  $tr = random_word();
  print OUT "insert into buffy (name, team_role) values ( '$name', '$tr');\n";
  if ( $i % 10000 == 0 ) { print "$i done\n"; }
}

Now run it:

root@db1:~# chmod 755 make_load.pl
root@db1:~# time ./make_load.pl 
10000 done
20000 done
30000 done
40000 done
50000 done
60000 done
70000 done
80000 done
90000 done
100000 done
110000 done
120000 done
130000 done
140000 done
150000 done
160000 done
170000 done
180000 done
190000 done
200000 done
210000 done
220000 done
230000 done
240000 done
250000 done

real 0m5.541s
user 0m5.472s
sys 0m0.052s

And now load this data into our table:

root@db1:~# time mysql < ./load.sql 
real 3m28.575s
user 0m3.124s
sys 0m2.008s

And verifying that we now have 250,009 rows:

MariaDB [gtest]> select count(*) from buffy;
+----------+
| count(*) |
+----------+
|   250009 |
+----------+
1 row in set (0.111 sec)

Now startup nodes db2 and db3 with:

systemctl start mysql

Let’s see where db2 is at:

MariaDB [gtest]> select count(*) from buffy;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use

This is because Galera has discovered that its table is out of sync.  Take a look in /var/log/mysql/error.log.  You’ll see messages about “IST” which is an “incremental state transfer” (see below):

2020-07-22  9:03:55 0 [Note] WSREP: Receiving IST...  9.9% ( 26832/269926 events) complete.
2020-07-22  9:04:05 0 [Note] WSREP: Receiving IST... 13.9% ( 37456/269926 events) complete.
2020-07-22  9:04:15 0 [Note] WSREP: Receiving IST... 17.8% ( 48032/269926 events) complete.
2020-07-22  9:04:25 0 [Note] WSREP: Receiving IST... 23.3% ( 62864/269926 events) complete.
2020-07-22  9:04:35 0 [Note] WSREP: Receiving IST... 28.7% ( 77488/269926 events) complete.
2020-07-22  9:04:45 0 [Note] WSREP: Receiving IST... 33.8% ( 91344/269926 events) complete.
2020-07-22  9:04:55 0 [Note] WSREP: Receiving IST... 38.5% (103840/269926 events) complete.
2020-07-22  9:05:05 0 [Note] WSREP: Receiving IST... 42.2% (113888/269926 events) complete.
2020-07-22  9:05:15 0 [Note] WSREP: Receiving IST... 46.0% (124192/269926 events) complete.
2020-07-22  9:05:25 0 [Note] WSREP: Receiving IST... 50.9% (137360/269926 events) complete.
2020-07-22  9:05:35 0 [Note] WSREP: Receiving IST... 56.0% (151024/269926 events) complete.
2020-07-22  9:05:45 0 [Note] WSREP: Receiving IST... 61.5% (165952/269926 events) complete.
2020-07-22  9:05:55 0 [Note] WSREP: Receiving IST... 67.2% (181328/269926 events) complete.
2020-07-22  9:06:05 0 [Note] WSREP: Receiving IST... 72.2% (194880/269926 events) complete.
2020-07-22  9:06:16 0 [Note] WSREP: Receiving IST... 77.1% (208192/269926 events) complete.
2020-07-22  9:06:26 0 [Note] WSREP: Receiving IST... 82.3% (222160/269926 events) complete.
2020-07-22  9:06:36 0 [Note] WSREP: Receiving IST... 88.0% (237424/269926 events) complete.
2020-07-22  9:06:46 0 [Note] WSREP: Receiving IST... 93.9% (253344/269926 events) complete.
2020-07-22  9:06:51 0 [Warning] WSREP: Could not find peer: 
2020-07-22  9:06:51 0 [Note] WSREP: 1.0 (db2): State transfer from -1.-1 (left the group) complete.
2020-07-22  9:06:51 0 [Note] WSREP: Member 1.0 (db2) synced with group.
2020-07-22  9:06:56 0 [Note] WSREP: Receiving IST... 99.6% (268880/269926 events) complete.
2020-07-22  9:06:56 0 [Note] WSREP: Receiving IST...100.0% (269926/269926 events) complete.
2020-07-22  9:06:56 1 [Note] WSREP: IST received: 0757dbf9-cc32-11ea-8f14-7fd9119b8ded:276248
2020-07-22  9:06:57 0 [Note] WSREP: 2.0 (db3): State transfer from 0.0 (db1) complete.
2020-07-22  9:06:57 0 [Note] WSREP: Shifting JOINER -> JOINED (TO: 276248)
2020-07-22  9:06:57 0 [Note] WSREP: Member 2.0 (db3) synced with group.
2020-07-22  9:06:57 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 276248)
2020-07-22  9:06:57 1 [Note] WSREP: Synchronized with group, ready for connections
2020-07-22  9:06:57 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

Once you see the “Synchronized with group” message, the node is in sync:

MariaDB [gtest]> select count(*) from buffy;
+----------+
| count(*) |
+----------+
|   250009 |
+----------+
1 row in set (0.095 sec)

Replication Modes

A Galera node can pull from any other node – there is no concept of a “master” node.

Galera has two replication modes:

  • Incremental State Transfer (IST) – in this mode, only missing transactions are sent
  • State Snapshot Transfer (SST) – in this mode, a full data copy is done

As you see above, the messages about “Receiving IST” indicate that IST is being used.

Let’s look at an SST example and then we’ll discuss how the choice between IST and SST is made.

Simulating a Node Failure

I took the db3 node and reinitialized it on the provider panel (back to a Debian 10 base).  This simulates the complete failure of a node and the need to replace it.

After reinstalling mariadb-server and rsync, I added the galera.cnf and restarted MariaDB.  It joined itself to the cluster.  I didn’t need to touch db1 or db2 because their cluster knowledge hasn’t changed.  The cluster definition is still the same.

Here is some of what I saw in /var/log/mysql/error.log:

2020-07-22 13:48:03 1 [Note] WSREP: New cluster view: global state: b270c869-cc53-11ea-9e49-7ec801f4f158:250011, view# 9: Primary, number of nodes: 3, my index: 1, protocol version 3
2020-07-22 13:48:03 1 [Warning] WSREP: Gap in state sequence. Need state transfer.
2020-07-22 13:48:03 0 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '3.3.3.3' --datadir '/var/lib/mysql/' --parent '10989' --mysqld-args --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1'
2020-07-22 13:48:03 1 [Note] WSREP: Prepared SST request: rsync|3.3.3.3:4444/rsync_sst
2020-07-22 13:48:03 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-07-22 13:48:03 1 [Note] WSREP: REPL Protocols: 9 (4, 2)
2020-07-22 13:48:03 1 [Note] WSREP: Assign initial position for certification: 250011, protocol version: 4
2020-07-22 13:48:03 0 [Note] WSREP: Service thread queue flushed.
2020-07-22 13:48:03 1 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (b270c869-cc53-11ea-9e49-7ec801f4f158): 1 (Operation not permitted) at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2020-07-22 13:48:03 0 [Note] WSREP: Member 1.0 (db3) requested state transfer from '*any*'. Selected 0.0 (db2)(SYNCED) as donor.
2020-07-22 13:48:03 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 250011)
2020-07-22 13:48:03 1 [Note] WSREP: Requesting state transfer: success, donor: 0
2020-07-22 13:48:03 1 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> b270c869-cc53-11ea-9e49-7ec801f4f158:250011
2020-07-22 13:48:05 0 [Note] WSREP: (a2649e0d, 'tcp://0.0.0.0:4567') connection to peer a2649e0d with addr tcp://3.3.3.3:4567 timed out, no messages seen in PT3S
2020-07-22 13:48:05 0 [Note] WSREP: (a2649e0d, 'tcp://0.0.0.0:4567') turning message relay requesting off WSREP_SST: [INFO] Joiner cleanup. rsync PID: 11046 (20200722 13:48:10.514)
2020-07-22 13:48:10 0 [Note] WSREP: 0.0 (db2): State transfer to 1.0 (db3) complete.
2020-07-22 13:48:10 0 [Note] WSREP: Member 0.0 (db2) synced with group. WSREP_SST: [INFO] Joiner cleanup done. (20200722 13:48:11.024)
2020-07-22 13:48:11 0 [Note] WSREP: SST complete, seqno: 250011

As you can see, SST was used, which makes sense since it’s initializing from scratch.

How is the choice between IST and SST made?  It mainly depends on the gcache_size value, which is 128MB by default.  If the data requested (the missing transactions) is in this cache, MariaDB will use IST.  If it is not, then SST is required.   So on a busy DB server with a lot of change, often data will come and go from the cache quickly.  Keep in mind that our 250,000 rows are only about 17MB of space:

root@db1:~# mysqldump gtest > /tmp/dump.sql
root@db1:~# du -sh /tmp/dump.sql
17M /tmp/dump.sql

You can see your gcache_size value by

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE ‘wsrep_provider_options’\G

You will see all the options, which is a big dump of variables.  Here is an extract showing gcache_size:

gcache.page_size = 128M; 
gcache.recover = no; 
gcache.size = 128M; 
gcomm.thread_prio = ; 
gcs.fc_debug = 0; 

You can increase this if you wish in the galera.cnf:

wsrep_provider_options="gcache.size=1G"

 

 

I'm Andrew, techno polymath and long-time LowEndTalk community Moderator. My technical interests include all things Unix, perl, python, shell scripting, and relational database systems. I enjoy writing technical articles here on LowEndBox to help people get more out of their VPSes.

No Comments

    Leave a Reply

    Some notes on commenting on LowEndBox:

    • Do not use LowEndBox for support issues. Go to your hosting provider and issue a ticket there. Coming here saying "my VPS is down, what do I do?!" will only have your comments removed.
    • Akismet is used for spam detection. Some comments may be held temporarily for manual approval.
    • Use <pre>...</pre> to quote the output from your terminal/console, or consider using a pastebin service.

    Your email address will not be published. Required fields are marked *