How to Recover an Out-of-Sync MariaDB Galera OpenStack Database Cluster


Introduction

This process can be used whenever your databases are out of sync, for example when someone reboots all of the nodes in a cluster without first shutting down the databases.

Resolution

Place all cluster hosts into standby mode and clean up any failed resources (substitute the name of the failed resource for ${resource} below).

I suggest making a backup of MariaDB on each controller node first, just in case.

root@controller3 # mysqldump --all-databases > mariadb_dump_06152015
root@controller3 # pcs cluster standby --all
root@controller3 # pcs status
root@controller3 # pcs resource cleanup ${resource}

Then, on each controller node, verify that mariadb (mysql) has stopped, as shown below. If any instance was not stopped properly by PCS, stop it manually (see the example after the status output).

# systemctl status mariadb.service
mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled)
   Active: inactive (dead)
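
If an instance is still running at this point, stopping it by hand is usually just the following (a sketch; if systemd cannot stop it cleanly, check for leftover mysqld_safe/mysqld processes before doing anything more drastic):

# systemctl stop mariadb.service
# systemctl status mariadb.service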

Find the node with the most recent copy of the database, i.e. the one with the largest seqno in /var/lib/mysql/grastate.dat, or pick any node if they all report the same value. In my case, controller3 had the highest seqno, so we will be doing most of the recovery work on that node.

root@controller3 # cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    2fb3bbe0-eed6-11e4-ac79-4b77a337d598
seqno:   12175047
cert_index:
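
A quick way to compare the saved state on all of the controllers from a single node is a small loop such as the one below. This is only a sketch: it assumes root SSH access between the controllers, and the hostnames are placeholders for your own.

root@controller3 # for host in controller1 controller2 controller3; do
>   echo "== ${host} =="
>   ssh ${host} "grep -E 'uuid|seqno' /var/lib/mysql/grastate.dat"
> done

Note that a seqno of -1 means that node was not shut down cleanly, so its saved position cannot be compared directly.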

Stop puppet if it is running, then edit the /etc/my.cnf.d/galera.cnf file. If you are not running puppet, or puppet is not managing your Galera configuration, skip stopping it and edit the file directly.

root@controller3 # systemctl stop puppet

Set wsrep_cluster_address manually in /etc/my.cnf.d/galera.cnf, making a note of the original value, and then start mariadb manually.

For example, the default configuration should look something like the example below. Each IP address listed is the address of a MariaDB instance in your cluster.

# Group communication system handle
wsrep_cluster_address="gcomm://172.17.9.23,172.17.9.24,172.17.9.22"

Modify it by commenting out the default value and adding the empty "gcomm://" address shown in the example below; this tells the node to bootstrap a new cluster rather than join an existing one. Note that in this example we are working on controller3.

# Group communication system handle
#wsrep_cluster_address="gcomm://172.17.9.23,172.17.9.24,172.17.9.22"
wsrep_cluster_address="gcomm://"
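
If you would rather script the change than edit the file by hand, a GNU sed one-liner along these lines does the same thing (a sketch; it keeps a copy of the original file, and you should adjust the path if your galera.cnf lives elsewhere):

root@controller3 # cp -p /etc/my.cnf.d/galera.cnf /etc/my.cnf.d/galera.cnf.orig
root@controller3 # sed -i 's|^wsrep_cluster_address=.*|#&\nwsrep_cluster_address="gcomm://"|' /etc/my.cnf.d/galera.cnf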

Now start MariaDB manually on controller3.

root@controller3 # systemctl start mariadb
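
Before starting the other nodes, it is worth confirming that this node has bootstrapped a new primary component. The standard Galera status variables can be checked with the mysql client (add credentials if your root account requires them):

root@controller3 # mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"   # expect: Primary
root@controller3 # mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"     # expect: 1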

Now start mariadb on one of the remaining controllers.

root@controller2 # systemctl start mariadb

In the journal output below, you can see the newly started MariaDB instance recover its position and join the cluster.

Jun 15 10:08:02 controller2 mysqld_safe[24999]: 150615 10:08:02 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Jun 15 10:08:02 controller2 mysqld_safe[24999]: 150615 10:08:02 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Jun 15 10:08:02 controller2 mysqld_safe[24999]: 150615 10:08:02 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsr...ver.pid'
Jun 15 10:08:04 controller2 mysqld_safe[24999]: 150615 10:08:04 mysqld_safe WSREP: Recovered position 2fb3bbe0-eed6-11e4-ac79-4b77a337d598:12175047
Jun 15 10:08:06 controller2 systemd[1]: Started MariaDB database server.


Once it is running, you can take the nodes out of standby and monitor the status while the remaining resources are started.

root@controller3 # pcs cluster unstandby --all
root@node # pcs status
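
One convenient way to watch the resources come back, assuming the watch utility is installed, is:

root@node # watch -n 5 pcs status

Press Ctrl-C once all resources report Started.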

Finally, restore wsrep_cluster_address in /etc/my.cnf.d/galera.cnf to its original value and restart the service, checking that the resource remains active.

root@controller3 # vi /etc/my.cnf.d/galera.cnf
root@controller3 # systemctl restart mariadb
root@controller3 # pcs status

Now check your databases to make sure that they are in sync. The uuid in the following file should be the same on each controller.

[root@lppcldiuctl01 ~(openstack_admin)]# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    2fb3bbe0-eed6-11e4-ac79-4b77a337d598
seqno:   -1
cert_index:
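
A running Galera node keeps its position in memory rather than in grastate.dat (which is why seqno shows -1 above), so a more direct runtime check is to query the wsrep status variables on each controller (a sketch; add credentials if needed):

root@controller3 # mysql -e "SHOW STATUS WHERE Variable_name IN ('wsrep_cluster_status','wsrep_cluster_size','wsrep_local_state_comment');"

Every node should report wsrep_cluster_status as Primary, wsrep_cluster_size equal to the number of controllers (3 here), and wsrep_local_state_comment as Synced.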

Reference:
https://access.redhat.com/solutions/1283463

One thought on “How to Recover an Out-of-Sync MariaDB Galera OpenStack Database Cluster”

  1. Howdy! I just got through recovering from a nasty set of self-induced problems. In my case, memcached all of a sudden stopped working with Keystone auth tokens (I haven't debugged why), so I switched the token backend to MySQL (I use a Galera MariaDB cluster for HA) and, of course, Keystone tokens are not cleaned up automatically. I ended up with 900K expired token records, and my cleanup scripts were failing because the max replication size of 1GB wasn't enough.

    I solved the problem by:

    1. Stopping mariadb on both primary / secondary (also puppet).

    2. On master: Start shell as mysql user (requires /etc/passwd mod) and start db manually: /usr/bin/mysqld_safe --basedir=/usr

    3. On secondary: set recovery variables in /etc/my.cnf:
    innodb_force_recovery=5
    innodb_purge_threads=0
    port=8881
    The port change prevents secondary from being hammered in recovery. See http://blackbird.si/mysql-corrupted-innodb-tables-recovery-step-by-step-guide/ for other notes.

    4. On secondary: start shell as mysql user (again, /etc/passwd mod required) and then start database instance normally (/usr/bin/mysqld_safe --basedir=/usr)

    5. Now some magic happens: The secondary first uses rsync to get the entire /var/lib/mysql/ibdata1 file – which effectively brings the databases back in sync. However, the innodb_force_recovery option placed the secondary in read-only mode – so you will see an “error” on the command line (and in /var/log/mariadb/mariadb.log). Not to worry! Just remove the recovery variables from /etc/my.cnf and restart the secondary.

    6. After another rsync (faster this time) you should see the cluster is properly synchronized. In my case, this is where I had to fix the underlying problem: deleting the 900K expired rows from Keystone token table 1000 rows at a time (to keep transactions small and manageable). That took 2 hours to complete. After that – I stopped both primary and secondary (mysqladmin [-u -p -h] shutdown) and then started normally (systemctl start mariadb) and verified that all came back online. Finally, I verified I could login to OpenStack and that I could issue Nova / Glance / Neutron queries as expected (remember that MySQL is required in my case as Keystone caches auth tokens in the database).

    Hope this helps others!
