This week we’re going to cover how to configure our RHEV system for High Availability (HA), specifically how to handle VMs that need to be restarted automatically should something happen to the underlying hardware. Keep in mind, not every VM needs HA, and not every VM with an HA configuration needs the same priority. We’re going to cover that, as well as some optional features that may or may not be needed for your VMs, depending on the scenario.
Let’s get started.
If you’ve been following along in the series or already have RHEV up and running, this will be a short but important lesson. We’ll work top-down: first at the cluster level, then the host level, then the VM level. After logging in, go straight to the “Clusters” tab, select the RHEV cluster that needs to be configured, and click “Edit”. Select the “Resilience Policy” tab, and choose either the “Migrate” option or the “Migrate only HA VMs” option. Most of the time the default “Migrate” option is what you will want.
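If you’d rather script this than click through the UI, the same setting can be applied through the RHEV-M REST API. Here’s a rough sketch, assuming a RHEV 3.x engine reachable at `rhevm.example.com` (hypothetical hostname) and a placeholder `CLUSTER_ID` — list your clusters first to find the real ID:

```shell
# List clusters to find the ID of the one you want to edit
curl -s -k -u 'admin@internal:password' \
  https://rhevm.example.com/api/clusters

# Set the resilience policy to "Migrate".
# on_error accepts: migrate, migrate_highly_available, do_not_migrate
curl -s -k -u 'admin@internal:password' \
  -X PUT -H 'Content-Type: application/xml' \
  -d '<cluster><error_handling><on_error>migrate</on_error></error_handling></cluster>' \
  https://rhevm.example.com/api/clusters/CLUSTER_ID
```

The three `on_error` values map one-to-one to the GUI’s “Migrate”, “Migrate only HA VMs”, and “Do Not Migrate” options.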
Next, go to the “Scheduling Policy” tab. Optionally, select the “Enable HA Reservation” checkbox. This allows RHEV to monitor the cluster for HA VMs and guarantee that spare capacity exists to restart them. You don’t need anything else on this tab. Then go to the “Fencing Policy” tab and ensure that fencing is enabled; it is by default.
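The HA reservation flag is also exposed on the cluster resource in the REST API. A minimal sketch, again assuming a RHEV 3.4+ engine and a placeholder `CLUSTER_ID`:

```shell
# Turn on HA reservation for the cluster, so capacity is
# guaranteed for restarting HA VMs after a host failure
curl -s -k -u 'admin@internal:password' \
  -X PUT -H 'Content-Type: application/xml' \
  -d '<cluster><ha_reservation>true</ha_reservation></cluster>' \
  https://rhevm.example.com/api/clusters/CLUSTER_ID
```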
Next up, we need to configure our hosts for power management. This is how we properly “fence” them in case of hardware failure. In short, if a server hangs but doesn’t actually reboot or shut down, its VMs will also hang without a way to restart elsewhere. Fencing by way of power management ensures a clean and fast way of power-cycling a server, thereby freeing its VMs to restart on another host. Select each host in the cluster, click “Edit”, and enable power management. You’ll need to add a fence agent, which of course requires a supported device. Typically this is either a “smart power strip” from APC or WTI, or onboard server management such as HPE iLO, Dell DRAC, etc.
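Power management can be scripted per host the same way. A hedged sketch, assuming an iLO 4 fence device — the agent type string, address, and credentials below are all placeholders, and the exact XML shape varies a bit between RHEV versions, so check your engine’s API documentation:

```shell
# Enable power management on a host using an iLO fence agent.
# Other agent types include ilo, drac5, ipmilan, apc, wti, ...
curl -s -k -u 'admin@internal:password' \
  -X PUT -H 'Content-Type: application/xml' \
  -d '<host>
        <power_management type="ilo4">
          <enabled>true</enabled>
          <address>ilo-host1.example.com</address>
          <username>fenceuser</username>
          <password>fencepass</password>
        </power_management>
      </host>' \
  https://rhevm.example.com/api/hosts/HOST_ID
```

Whichever route you take, use the “Test” button in the GUI (or fence the host once deliberately) to confirm the agent actually works before you rely on it.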
If this post were more complete, I’d also highlight bonding the physical network interfaces, but that will be a separate demo and blog post. (HINT: I’ve already cut the demo, I just need to write the post…)
Finally, we configure the VM(s). For each VM that needs HA enabled, select that VM, click “Edit”, select the “High Availability” tab, check the box for HA, and use the drop-down to select the priority (High, Medium, or Low). The priority is relative to the other HA VMs in the same cluster: when capacity is tight, higher-priority VMs are restarted first.
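On the API side, HA is a property of the VM resource. A sketch with a placeholder `VM_ID`; note that the API takes priority as an integer rather than the GUI’s Low/Medium/High labels:

```shell
# Enable HA on a VM. Priority is numeric; the GUI's
# Low/Medium/High roughly correspond to 1/50/100.
curl -s -k -u 'admin@internal:password' \
  -X PUT -H 'Content-Type: application/xml' \
  -d '<vm><high_availability><enabled>true</enabled><priority>50</priority></high_availability></vm>' \
  https://rhevm.example.com/api/vms/VM_ID
```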
Optionally, you can also add a watchdog timer to the VM. This is a virtual card that continually counts down to zero but is reset by the guest OS before it actually gets there. If something causes the VM to hang, it can’t reset the watchdog, the counter reaches zero, and a watchdog event fires. The available actions include none (simply logs the hang), reset (reboots the VM), poweroff (shuts down the VM), dump (forces a core dump, then pauses), and pause (simply pauses the VM). Again, the watchdog timer is optional.
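The watchdog hangs off the VM as a sub-collection in the REST API. A sketch using the common `i6300esb` virtual watchdog model and a placeholder `VM_ID`:

```shell
# Attach a virtual i6300esb watchdog card with the "reset" action.
# Supported actions: none, reset, poweroff, dump, pause
curl -s -k -u 'admin@internal:password' \
  -X POST -H 'Content-Type: application/xml' \
  -d '<watchdog><model>i6300esb</model><action>reset</action></watchdog>' \
  https://rhevm.example.com/api/vms/VM_ID/watchdogs
```

Remember that the guest also has to run a watchdog daemon to pet the timer; a bare card with nothing resetting it will fire on a perfectly healthy VM.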
Check out the demo; the setup is faster than the explanation (best viewed in full screen):
As always, comments and questions are welcome.
Hope this helps,