RHEV 3.5: Recovering from a Catastrophic Host Failure

 

Cat-day.jpg

A while ago, I needed to tear down one of the nodes in my RHEV cluster, as I wanted to re-purpose it as a RHEL7 host. A few hours before a planned to shut it down, I logged into my RHEV-M console and put the node in “Maintenance” status.

Later, when I came home I powered off the node and rebuilt it, not once checking to ensure that the VMs running on the host had migrated properly. They had not, and this was a problem. I unknowingly fubared my cluster.

When I logged back into RHEV-M, I found the several VMs with a status of “?” or “Unknown State“, and a physical host that I was unable to remove from the cluster, despite the fact that it had been rebuilt.

So now what….

Fix the VMs first…

There were several VMs that were running on the now retired RHEV-H host when the server was powered off. I needed to fix these before I could remove the host. I tried powering them off, on, or migrating them from the WebUI, but this did not work. So I had to delete them manually from the RHEV-M database. Note that its possible that I could have figured out how to save the VMs instead of deleting them, but this was not a priority since this is my homelab environment.

First lets SSH to the RHEV-M server and log into the database.

# source /etc/ovirt-engine/engine.conf.d/10-setup-database.conf
# export PGPASSWORD=$ENGINE_DB_PASSWORD
#psql -h localhost -U engine engine

Now we figure out the vm_guids for each VM.  Below I am starting with my logging server (log.lab.localdomain).

Below we are telling RHEV to mark each VM as Powered Off. Nothing here will delete your VM.

engine=> select vm_guid from vm_static where vm_name = ‘log.lab.localdomain’;
vm_guid
————————————–
f2c43e33-fd02-4b74-b86c-9e9ff9b8c51b
(1 row)

engine=> update vm_dynamic set status = 0 where vm_guid = ‘f2c43e33-fd02-4b74-b86c-9e9ff9b8c51b’;
UPDATE 1

I needed to run through this process a total of four times, once for each VM that was orphaned, scared, and alone.

Now put the node in Maintenance Mode…

First open up another ssh session to your RHEV-M host, and stop Jboss

# service jbossas stop

Now back in your other window, which should still be connected to the engine database….

My node is named titan.lab.localdomain – we need to get ids vds_id.

engine=# select vds_id, storage_pool_name from vds where vds_name = ‘titan.lab.localdomain’ ;
vds_id | storage_pool_name
————————————–+——————-
e59198b0-fc75-4c5d-b31e-1ab639a1f708 | Auburn_Datacenter
(1 row)

Now lets put it in maintenance mode.

engine=# update vds_dynamic set status = 4 where vds_id = ‘e59198b0-fc75-4c5d-b31e-1ab639a1f708’;
UPDATE 1

Now lets start jboss in our other window.

# service jbossas start

At this point I was able to log back into the WebUI and remove the host.

2 thoughts on “RHEV 3.5: Recovering from a Catastrophic Host Failure

  1. There are easier way to achieve this than performing changes to the database.
    Did you try to consult with the ovirt users mailing list about this?

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.