Introduction
This process can be used whenever your Galera databases are out of sync, for example, when someone reboots all nodes in a cluster without shutting down the databases first.
Resolution
Place all cluster hosts into standby mode and clean up any failed resources.
I suggest making a backup of MariaDB on each controller node, just in case.
root@controller3 # mysqldump --all-databases > mariadb_dump_06152015
root@controller3 # pcs cluster standby --all
root@controller3 # pcs status
root@controller3 # pcs resource cleanup ${resource}
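If you are not sure which resource names to pass to the cleanup, pcs status lists any failed actions; on recent pcs versions, running the cleanup with no resource name will clean up all resources:

root@controller3 # pcs resource cleanup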
Then, on each controller node, verify that MariaDB (mysqld) has stopped. If any instance has not stopped properly via PCS, as shown below, stop it manually.
# systemctl status mariadb.service
mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled)
   Active: inactive (dead)
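If a node still shows MariaDB as active, stopping it manually is usually enough (a minimal example; repeat on each affected node):

root@controller2 # systemctl stop mariadb
root@controller2 # systemctl status mariadb.service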
Find the node with the latest (i.e., largest) database version (seqno), or choose any node if they all have the same value. In my case, controller3 had the highest seqno, so we will be doing most of the recovery work on that node.
root@controller3 # cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 2fb3bbe0-eed6-11e4-ac79-4b77a337d598
seqno: 12175047
cert_index:
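To compare the values quickly across the cluster, a small loop like the one below can be run from any host with SSH access to the controllers (the hostnames here are placeholders; substitute your own):

for node in controller1 controller2 controller3; do
    echo "== ${node} =="
    ssh root@${node} "grep -E 'uuid|seqno' /var/lib/mysql/grastate.dat"
done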
Stop Puppet if it is running, then edit the /etc/my.cnf.d/galera.cnf file. If you are not running Puppet, or if Puppet is not managing your Galera configuration, skip the Puppet step and go straight to editing the file.
root@controller3 # systemctl stop puppet
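If the Puppet agent runs from cron or a timer rather than as a daemon, you may also want to disable agent runs so they do not revert your edits while you work (remember to re-enable it once recovery is complete):

root@controller3 # puppet agent --disable "galera recovery"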
Set wsrep_cluster_address manually in /etc/my.cnf.d/galera.cnf, making a note of the original value, and then restart MariaDB manually.
For example, your default configuration should look something like the snippet below. Each IP address listed is the address of a MariaDB instance in your cluster.
# Group communication system handle
wsrep_cluster_address="gcomm://172.17.9.23,172.17.9.24,172.17.9.22"
Modify it by commenting out the default wsrep_cluster_address line and adding the empty gcomm:// string shown in the example below. Note that in this example we are working on controller3.
# Group communication system handle
#wsrep_cluster_address="gcomm://172.17.9.23,172.17.9.24,172.17.9.22"
wsrep_cluster_address="gcomm://"
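If you prefer a non-interactive edit, a sketch along these lines (GNU sed, with a backup copy taken first) makes the same change; verify the result before starting MariaDB:

root@controller3 # cp /etc/my.cnf.d/galera.cnf /etc/my.cnf.d/galera.cnf.orig
root@controller3 # sed -i 's|^wsrep_cluster_address=.*|#&\nwsrep_cluster_address="gcomm://"|' /etc/my.cnf.d/galera.cnf
root@controller3 # grep wsrep_cluster_address /etc/my.cnf.d/galera.cnf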
Now start MariaDB manually on controller3.
root@controller3 # systemctl start mariadb
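Before starting the remaining nodes, it is worth confirming that the bootstrapped node has formed a one-node cluster; wsrep_cluster_size should report 1 at this point (add -u/-p credentials if your setup requires them):

root@controller3 # mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size'"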
Now start MariaDB on one of the remaining controllers.
root@controller2 # systemctl start mariadb
Below you can see the newly started MariaDB instance recovering its position as it requests to sync with the cluster.
Jun 15 10:08:02 controller2 mysqld_safe[24999]: 150615 10:08:02 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Jun 15 10:08:02 controller2 mysqld_safe[24999]: 150615 10:08:02 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Jun 15 10:08:02 controller2 mysqld_safe[24999]: 150615 10:08:02 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsr...ver.pid'
Jun 15 10:08:04 controller2 mysqld_safe[24999]: 150615 10:08:04 mysqld_safe WSREP: Recovered position 2fb3bbe0-eed6-11e4-ac79-4b77a337d598:12175047
Jun 15 10:08:06 controller2 systemd[1]: Started MariaDB database server.
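Once the position is recovered you can confirm that the node has actually joined and synced; wsrep_local_state_comment should read Synced, and wsrep_cluster_size should grow by one for each node that joins:

root@controller2 # mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
root@controller2 # mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size'"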
Once MariaDB is running, you can take the nodes out of standby and monitor the status as the remaining resources are started.
root@controller3 # pcs cluster unstandby --all
root@node # pcs status
Finally, restore wsrep_cluster_address to its original value in /etc/my.cnf.d/galera.cnf and restart the service, monitoring that the resource remains active.
root@controller3 # vi /etc/my.cnf.d/galera.cnf
root@controller3 # systemctl restart mariadb
root@controller3 # pcs status
Now check your databases to make sure that they are in sync. The uuid in the following file should be the same on each controller (while MariaDB is running, the seqno will typically show -1).
[root@lppcldiuctl01 ~(openstack_admin)]# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 2fb3bbe0-eed6-11e4-ac79-4b77a337d598
seqno: -1
cert_index:
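In addition to grastate.dat, the live wsrep status can be compared on each controller; the cluster state UUID should match everywhere and each node should report Synced (again, add credentials as your installation requires):

root@controller3 # mysql -e "SHOW STATUS LIKE 'wsrep_cluster_state_uuid'"
root@controller3 # mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"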
Reference:
https://access.redhat.com/solutions/1283463
Howdy – I just got through recovering from a nasty set of problems (self-induced). In my case, memcached all of a sudden stopped working with Keystone auth tokens (I haven't debugged why). So I switched the token backend to MySQL (I use a Galera MariaDB cluster for HA) and, of course, Keystone tokens are not cleaned up automatically. So I had 900K expired token records, and my cleanup scripts were failing because the 1GB maximum replication size wasn't enough.
I solved the problem by:
1. Stopping MariaDB on both the primary and the secondary (and Puppet as well).
2. On the master: start a shell as the mysql user (requires an /etc/passwd mod) and start the database manually: /usr/bin/mysqld_safe --basedir=/usr
3. On secondary: set recovery variables in /etc/my.cnf:
innodb_force_recovery=5
innodb_purge_threads=0
port=8881
The port change prevents the secondary from being hammered while it is in recovery. See http://blackbird.si/mysql-corrupted-innodb-tables-recovery-step-by-step-guide/ for other notes.
4. On the secondary: start a shell as the mysql user (again, an /etc/passwd mod is required) and then start the database instance normally (/usr/bin/mysqld_safe --basedir=/usr).
5. Now some magic happens: The secondary first uses rsync to get the entire /var/lib/mysql/ibdata1 file – which effectively brings the databases back in sync. However, the innodb_force_recovery option placed the secondary in read-only mode – so you will see an “error” on the command line (and in /var/log/mariadb/mariadb.log). Not to worry! Just remove the recovery variables from /etc/my.cnf and restart the secondary.
6. After another rsync (faster this time) you should see that the cluster is properly synchronized. In my case, this is where I had to fix the underlying problem: deleting the 900K expired rows from the Keystone token table 1000 rows at a time (to keep the transactions small and manageable); a rough sketch of that loop is included below. That took 2 hours to complete. After that, I stopped both primary and secondary (mysqladmin [-u -p -h] shutdown), then started them normally (systemctl start mariadb) and verified that everything came back online. Finally, I verified I could log in to OpenStack and issue Nova / Glance / Neutron queries as expected (remember that MySQL is required in my case, as Keystone caches auth tokens in the database).
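For anyone who wants to reproduce the cleanup loop mentioned in step 6, here is roughly what it looked like. This is a sketch that assumes the stock Keystone SQL token backend (a keystone database with a token table and an expires column); adjust names, credentials, and the batch size for your own deployment:

# Delete expired Keystone tokens in batches of 1000 until none remain
while true; do
    deleted=$(mysql keystone -sN -e "DELETE FROM token WHERE expires < UTC_TIMESTAMP() LIMIT 1000; SELECT ROW_COUNT();")
    [ "${deleted}" -eq 0 ] && break
done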
Hope this helps others!