How to Recover an Out of Sync MariaDB Galera OpenStack Database Cluster


Introduction

This process can be used whenever your databases are out of sync. For example, when someone, without thinking, reboots all the nodes in a cluster without shutting down the databases first.

Resolution

Place all cluster hosts into standby mode and clean up any failed resources.

I suggest making a backup of MariaDB on each controller node – just in case.

root@controller3 # mysqldump --all-databases > mariadb_dump_06152015
root@controller3 # pcs cluster standby --all
root@controller3 # pcs status
root@controller3 # pcs resource cleanup ${resource}

Then, on each controller node, verify that MariaDB (mysql) has stopped. If any instance has not stopped properly via PCS (example output shown below), stop it manually.

# systemctl status mariadb.service
mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled)
   Active: inactive (dead)
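
If one of the controllers still shows the service as active, stop it by hand before moving on. For example:

# systemctl stop mariadb.service
# systemctl status mariadb.service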

Find the latest (i.e., largest) version of the DB (seqno), or choose a node if all have the same version. In my case, controller3 had the highest seqno, so we will be doing most of the recovery work on that node.

root@controller3 # cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    2fb3bbe0-eed6-11e4-ac79-4b77a337d598
seqno:   12175047
cert_index:
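
If you would rather compare all of the controllers at once, a quick loop like the one below works. This is just a sketch: it assumes your controllers are named controller1 through controller3 and that you can SSH between them, so adjust the hostnames to match your environment.

root@controller3 # for h in controller1 controller2 controller3; do echo "== $h =="; ssh $h "grep seqno /var/lib/mysql/grastate.dat"; done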

Stop puppet if it is running, then edit the /etc/my.cnf.d/galera.cnf file. If you are not running puppet, or if puppet is not managing your Galera cluster, you can skip stopping puppet and go straight to editing the file.

root@controller3 # systemctl stop puppet

Set wsrep_cluster_address manually in /etc/my.cnf.d/galera.cnf, making a note of the original value, and then restart MariaDB manually.

For example, your default configuration should look something like what is shown below. Each IP address listed is the IP address of a MariaDB instance in your cluster.

# Group communication system handle
wsrep_cluster_address="gcomm://172.17.9.23,172.17.9.24,172.17.9.22"

Modify it by commenting out the default string and adding the line shown in the example below. Note that in this example we are working on controller3.

# Group communication system handle
#wsrep_cluster_address="gcomm://172.17.9.23,172.17.9.24,172.17.9.22"
wsrep_cluster_address="gcomm://"

Now start MariaDB manually on controller3.

root@controller3 # systemctl start mariadb
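
Before starting the other nodes, it is worth confirming that controller3 actually bootstrapped a new primary component. The wsrep_cluster_% status variables are standard Galera counters; on a freshly bootstrapped node you should see wsrep_cluster_status reporting Primary and wsrep_cluster_size reporting 1 (add the usual -u/-p options to the mysql command if your root account requires credentials).

root@controller3 # mysql -e "SHOW STATUS LIKE 'wsrep_cluster_%';"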

Now start MariaDB on one of the remaining controllers.

root@controller2 # systemctl start mariadb

Below you can see the newly started MariaDB instance as it recovers its position and requests to sync.

Jun 15 10:08:02 controller2 mysqld_safe[24999]: 150615 10:08:02 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Jun 15 10:08:02 controller2 mysqld_safe[24999]: 150615 10:08:02 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Jun 15 10:08:02 controller2 mysqld_safe[24999]: 150615 10:08:02 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsr...ver.pid'
Jun 15 10:08:04 controller2 mysqld_safe[24999]: 150615 10:08:04 mysqld_safe WSREP: Recovered position 2fb3bbe0-eed6-11e4-ac79-4b77a337d598:12175047
Jun 15 10:08:06 controller2 systemd[1]: Started MariaDB database server.

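You can also verify that the joining node has finished its state transfer before moving on; once it is caught up, the Galera status variable wsrep_local_state_comment should report Synced.

root@controller2 # mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"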

Once running, you can take the nodes out of standby and monitor the status as the remaining resources are started.

root@controller3 # pcs cluster unstandby --all
root@node # pcs status

Finally, set wsrep_cluster_address back to its original value in /etc/my.cnf.d/galera.cnf and restart the service, monitoring that the resource remains active.

root@controller3 # vi /etc/my.cnf.d/galera.cnf
root@controller3 # systemctl restart mariadb
root@controller3 # pcs status

Now check your databases to make sure that they are in sync. The following file should look the same on each controller – in particular, the uuid should match on every node (while MariaDB is running, the seqno will normally show -1, as in the example below).

[root@lppcldiuctl01 ~(openstack_admin)]# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    2fb3bbe0-eed6-11e4-ac79-4b77a337d598
seqno:   -1
cert_index:
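
As a final sanity check, ask any one of the nodes how many members it sees. With all three controllers back in the cluster, wsrep_cluster_size should report 3 and wsrep_incoming_addresses should list the address of every MariaDB instance (again, add credentials to the mysql command if your setup requires them).

root@controller3 # mysql -e "SHOW STATUS WHERE Variable_name IN ('wsrep_cluster_size','wsrep_incoming_addresses');"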

Reference:
https://access.redhat.com/solutions/1283463

XenServer 6: There Was an Error Connecting to the Server. The Service Did Not Reply Properly

Wow, this is a really overly complicated error for such a simple problem to resolve. Allow me to give you some background.

I am currently building my first production-ready (well, non-production really) XenServer cluster and ran into this issue when attempting to add my second host to the cluster. I hit Google and found out that this was actually just a DNS issue.

A quick check of /etc/resolv.conf on two of my nodes shows nothing but the following line.

; generated by /etc/sysconfig/network-scripts/ifup-post

Well, great. On a standard Linux box I would have just added my name server and been halfway to the bar, but judging by the contents of resolv.conf, I figured I was supposed to add it another way.

Well, after a bit of poking around in XenCenter, I found it. Click on the hostname of the XenServer, then click on the "Networking" tab, and from there click on "Configure…" below the "Management Interfaces" section. You will then be presented with a pop-up window where you can enter your nameservers.

Once you have configured DNS properly you can then add the host to the cluster.

Note that you can also do this from the command line; however, you basically have to reconfigure your entire management interface: IP, gateway, and everything that goes with it.

First, run the command below.

# xe pif-list host-name-label=xen01 management=true

Then using the UUID of the management interface, run the command below. Replace my IP addresses and uuid with yours.

# xe pif-reconfigure-ip mode=static IP=10.120.72.11 uuid=dc6b6651-6067-9a52-2011-6ba102da39e1 DNS=10.120.69.1 netmask=255.255.255.0 gateway=10.120.72.1

Seeing how fickle XenServer clustering is regarding DNS, it's probably not a bad idea to add /etc/hosts entries on your XenServer nodes for each server that will be in your cluster, as shown below. You never know when DNS might go out to lunch, and you don't want your HA cluster affected.
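
A minimal sketch of what that might look like is below. The second host and its address are made up for illustration (xen01 and 10.120.72.11 come from the example above), so substitute the real management IPs and hostnames of your own pool members.

# cat >> /etc/hosts << 'EOF'
# XenServer pool members
10.120.72.11   xen01
10.120.72.12   xen02
EOF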

For future reference you can check all the configuration parameters of your management interface with the following commands.

First get the UUID of your management interface.

xe pif-list management=true host-name-label=xen01

Then check the configuration via the UUID.

xe pif-param-list uuid=f61b8d4d-67ec-e262-3e16-4348baaed076

And for example if you need to configure the DNS search domain, you can run the following.

xe pif-param-set uuid=f61b8d4d-67ec-e262-3e16-4348baaed076 other-config:domain=MYDOMAIN

Cisco UCS Failover Via 6248 CLI

Need to fail over Cisco UCS Manager from one 6248 to another? Oh, did I mention that you are in a hurry, and that you are going to have to sift through pages and pages of PDFs to get the information? I found this out the hard way. Anyway, out of the kindness of my heart, I have documented the process below.

First log into one of your 6248s and figure out which one is the primary by running the command below.

ucs01-A# show cluster state
Cluster Id: 0x4b05e6042b6111e1-0x8c9e547fee4bbf24

A: UP, SUBORDINATE
B: UP, PRIMARY

Note that ucs01-B is our primary, so log into ucs01-B and issue the command below.

ucs01-B# connect local-mgmt

Then run the command below to make ucs01-A the primary.

ucs01-B(local-mgmt)# cluster lead a
Cluster Id: 0x4b05e6042b6111e1-0x8c9e547fee4bbf24

Note that you must issue the failover command on the node that is the primary, otherwise this happens.

ucs01-A(local-mgmt)# cluster lead a
Cluster Id: 0x4b05e6042b6111e1-0x8c9e547fee4bbf24
request failed: local node is subordinate

During the failover process you will see the output below when checking the cluster state.

ucs01-B(local-mgmt)# show cluster state
Cluster Id: 0x4b05e6042b6111e1-0x8c9e547fee4bbf24

B: UP, PRIMARY, (Management services: SWITCHOVER IN PROGRESS)
A: UP, SUBORDINATE, (Management services: SWITCHOVER IN PROGRESS)

HA NOT READY
Management services: switchover in progress on local Fabric Interconnect

How Not to Assign KVM IP Addresses Via Cisco UCS Manager

After a few hours poking around a newly deployed UCS cluster trying to get some basic profiles created and assigned, I realized that I actually had no idea how the KVM is supposed to work inside a UCS cluster. Which is funny, as this was a subject that we touched on during my DCUDC class. Apparently we did not touch on it enough.

Anyway, before I get ahead of myself, let's review the gear in this cluster.

2 5108 chassis
7 B200 M2 blades with 2104 IOMs
2 6248 Fabric Interconnects

Now, in my network all lights-out management IPs (iLOs, IPMI, etc.) are on one particular VLAN, which for the purpose of this post we will call VLAN 100. Non-application-related infrastructure equipment (servers, virtual hosts) is on another VLAN, which we will call VLAN 200. So when the Fabric Interconnects were deployed, I gave them each an IP address on VLAN 200. And once UCS Manager was up and running, I created a KVM IP address pool of unused IP addresses on VLAN 100. Well, guess what: this is wrong.

Routing for the KVM addresses is done through the management interfaces on the Fabric Interconnects, so unless you are using VLAN tagging, your KVM pool must be on the same VLAN as the IP addresses assigned to your Fabric Interconnects.

But wait, why is this?

I thought that I could even assign private 192.168.x.x IP addresses to the KVMs, as they were only supposed to be managed via UCS Manager, but this is also incorrect.

Navigate to one of your working KVM IP addresses in a web browser and you can access the KVM of the blade outside of UCS Manager. Nice, which is how I actually expected this to work.


Note that I find it rather dumb to have my KVM management IPs and Fabric Interconnects on the same VLAN as a rule; however, since this is how it's supposed to work, I am going to have to let that one go.

Now, the fact that you can navigate to a specific KVM IP address via a web browser also makes the idea of using a pool of IP addresses seem silly. Would you not want to hard-code the KVM IP address in the service profile so that you always know which server's console you are logging into? Dunno, I am still working on figuring that one out.


How to Disable DRS for one VM in a DRS Enabled Cluster

VMware DRS (Distributed Resource Scheduler) is a feature of ESX that balances computing workloads with available resources in a virtualized environment.

When you enable a cluster for DRS, VirtualCenter continuously monitors the distribution of CPU and memory resources for all hosts and virtual machines in the cluster. DRS compares these metrics to what resource utilization ideally should be, given the attributes of the resource pools and virtual machines in the cluster and the current load. Note that DRS is only available in ESX Enterprise or above.

When DRS is enabled in a cluster, ESX will automagically vMotion guest VMs to other hosts in your cluster in an attempt to balance out the load evenly across the cluster. However, this behavior is not always desired, for example if you have a large VM that you want to stay pinned to a particular host.

In order to override the default DRS cluster settings for a VM, you need to do the following.

  1. Right-click on your cluster and then click on "Edit Settings".
  2. Under DRS, click on "Virtual Machine Options".
  3. Locate the particular VM and find the drop-down box under "Automation Level".
  4. Change "Default (Fully Automated)" to "Manual".