vSphere HA is one of the great features that simplifies high availability protection for any supported guest OS running in a VMware environment. If a vSphere host fails, HA will kick in and restart the VMs that were running on it on another available host. HA has gone through continual enhancement over the years and was greatly improved in vSphere 5. But some of you may be wondering exactly what is going to happen when your clusters fail for real and you have a total outage. Well, I have now experienced a complete outage a couple of times. So I’d like to share my experience with HA in vSphere 5 and how it compares to vSphere 4.x.
Before I get started: Duncan Epping and Frank Denneman wrote a great book about vSphere 5 clustering, including HA, titled vSphere 5.0 Clustering Technical Deepdive. I highly recommend you buy this book. It covers vSphere 5 clustering and HA in a lot of detail and is basically the bible on the subject. I also have links to Duncan’s (Yellow Bricks) and Frank’s blogs on my sidebar.
Those of you who have read about My Lab Environment know I run some pretty serious kit. Unfortunately it is not housed in a robust commercial data center; it is located in my home office. That makes it subject to unplanned downtime on a scale that most commercial environments would never face. Firstly, I don’t have swipe card access controls on my office, so my two young sons, who just love the flashing lights and the buttons on the front of the servers, can get at them. Then there is the fact that until recently all of the UPSs came off the same circuit. This mix means I regularly and unexpectedly test HA on either a partial or full-scale failure basis, depending on whether it’s a server that’s been shut down by a button press or a circuit breaker that has tripped because too many appliances have been turned on in the kitchen (no kidding).
I know by now you are probably rolling on the floor laughing, and that is good as this is meant to be lighthearted. But trust me, there is a serious side to this post. With vSphere 4.x, if I had a power outage, whether by button press or power failure, it would take considerable time to recover, and there would be quite a lot of manual intervention (a single host going down wasn’t a problem, as HA would kick in and recover provided storage was available). All Paths Down events would occur, and situations such as the one described in my post The Achilles Heel of the VMware Distributed Switch could happen. As the majority of my storage (before the CX500) as well as all of my systems are virtualized, it would take a little while to get everything back up and running, and potentially a few reboots. When I upgraded to vSphere 5 I was hopeful that the vast improvements in technology I had experienced over the years would continue. I was more than pleased with the actual results.
This is how the story goes:
One night while I was working late, sometime after I had upgraded my lab to vSphere 5, I got a call from my wife. She was a little upset. She had been in the middle of cooking dinner and now nothing in the kitchen was working (of course this was my fault). There was also some loud beeping coming from my office, but she didn’t know what it was. I’m sure you can imagine. It was all of my UPSs running on battery power, which only lasts for 10 minutes, and I don’t have integrated shutdown (does anyone make integrated shutdown compatible with vSphere 5? Please tell me). There wasn’t anything I could do about it, so we agreed to leave it until I got home. Fortunately we have a gas oven and stove top, so dinner wasn’t completely ruined.
When I got home I was expecting the worst: a long time not only to get power restored, but also to restore all of my systems. Little did I know I was about to experience one of the best new enhancements to vSphere HA (in my opinion).
I found and reset the circuit breaker that had tripped, and my office sprang back into life. I decided to leave it for 10 minutes and go and make a coffee. When I returned I expected to find the virtual equivalent of the aftermath of a nuclear meltdown. However, upon my return all of the hosts had restarted, along with all of the virtual machines that had been running on them before they powered down (crashed). This included my HP P4000 iSCSI storage, which is local to each host but replicated across the network.
This is one of the little features that is easy to gloss over. vSphere HA remembered the state of all of the virtual machines, whether they were powered on or powered off. Therefore, when the power was restored, HA went about powering on all the VMs that had been running. Without any human intervention, within the space of about 15 minutes I had everything back up and running in a crash-consistent state from where it was left. Luckily the loss of power had not caused any other damage.
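To make the behaviour concrete, here is a toy model of the idea: record each VM's power state, then after a full outage restart only those that were powered on. This is purely illustrative and is not how vSphere HA is implemented; the class, method and VM names are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterModel:
    """Toy model only; not the actual vSphere HA implementation."""

    # Hypothetical record: VM name -> True if powered on, False if powered off
    recorded_state: dict = field(default_factory=dict)

    def record(self, vm_name, powered_on):
        """Remember the last known power state of a VM."""
        self.recorded_state[vm_name] = powered_on

    def vms_to_restart(self):
        """After an outage, return only the VMs that were powered on."""
        return sorted(name for name, on in self.recorded_state.items() if on)


cluster = ClusterModel()
cluster.record("vcenter", True)
cluster.record("p4000-vsa-1", True)
cluster.record("old-template", False)  # was powered off, so it stays off

restart_list = cluster.vms_to_restart()
print(restart_list)
```

The VM that was deliberately powered off before the outage is left alone, which is the part that matters: the recovery puts things back the way they were, not simply "everything on".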
So on top of all the great new features in vSphere 5 HA, which include removing the restrictions caused by the five primary nodes by moving to FDM (Fault Domain Manager), enhancements to admission control, and the vSphere HA VM Monitoring APIs, you also get automatic restart of the VMs that were powered on at the time of the crash. vSphere 5 HA, when enabled and configured appropriately, makes recovery from an outage much easier and quicker, whether it is a transient or full outage of a single host or of an entire cluster. This alone is a worthwhile reason to consider upgrading to vSphere 5. vSphere 5 HA will make your life much easier and allow you to focus on other activities that add value to your organization.
I hope you enjoyed my story and I hope you don’t have to experience this type of unplanned downtime, at least not as often as I do in my home lab.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster. Copyright © 2012 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Here’s a tip: You don’t NEED to change anything when you do an ESX 4.1 to ESXi 5 upgrade. I know that sounds crazy, but when you pop in the install CD for ESXi 5, it will ask if you want to preserve your ESX 4.1 settings. And it actually works.
Hi Mike. It will actually work (although some settings from 4.1 are not taken across). But this assumes that the existing settings are optimal and the best for vSphere 5, which in many cases they won't be. vSphere HA is one of the most misconfigured features of vSphere.
My data-shed also shares a circuit breaker with the kitchen.
I've trained my wife and daughters to reset the breaker when the kettle stops working.
Heh, good story dude. We've done a lot around integrating various UPS vendors and shutdown commands with vSphere, whether it be full ESX or ESXi and the vMA. Let me know if you need any info 🙂
Hi Ben, anything on Dell UPSs and APC for vSphere 5 would be great. I used to run the APC Auto Shutdown software for vSphere 4, but I don't know if they've updated it.
[…] Perhaps this is a bit of a legacy that is now probably not needed with the changes to VMware HA in vSphere 5? The reason I say it’s a bit of a legacy is that in previous versions of vSphere HA would not record the power state of each VM and power everything back on when the hosts boot up in the case of a complete power outage. In vSphere 5 this is exactly what happens. You can find out more by reading my article vSphere 5 HA Complete Failure Experience. […]
We are currently using a Dell UPS to shut down 2 ESXi 5 hosts and 1 vCenter Server and it's working great. The only things you need are the vMA on the ESXi hosts and auto shutdown configured so that the vMA goes first, followed by the other virtual machines, when there is a power cut. Works a treat.
Oops, I forgot that the Dell UPS Local Node Manager has to be installed on the ESXi hosts as well.
Thanks David. I'll give that a try. I have made my power supply more diverse now and am running it off two circuits so it's much less likely to fail just because my wife is cooking in the kitchen :). The next failure scenario regarding my kids pushing the power buttons in turn is a little harder to solve without some more physical security. I will definitely give the Dell UPS Local Node Manager a shot though.
What I'm confused about, and experiencing in a newly implemented VM environment with an HP P4000 VSA, is that the VSA host powers on after a full power failure, but as designed it stops at the login prompt of the SAN/iQ software, awaiting the START command. The vSAN is therefore not available, so my VMs fail to start even though VMware ESXi 5 is remembering which hosts were running so it restarts them. Where have we gone wrong?
Hi Chuck, as Ben mentioned, that is expected. The SAN/iQ software is still running even though only the prompt is displayed, so there is no need to issue the start command. You will most likely need to rescan the datastores, and if the VSAs are on a Distributed Switch you will likely have to restart the management agents on the host. The problem is that the storage would not have been available when the host booted.
That is normal behaviour for the VSA. The console is used for performing initial configuration tasks and resetting the management group and passwords. You'll probably find that it is because your VSA hasn't started in time for the host to see the iSCSI datastores as being available. We had similar issues and found that a manual rescan of the iSCSI adapter brought everything back to life. I suppose you'd need to put a restart priority on the VSA and delay the remaining VMs, possibly doing an HBA rescan scripted from a vMA. Anyway, I hope this helps.
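The ordering suggested above (wait for the VSA storage, rescan, then start the remaining VMs) could be sketched along these lines. This is only an outline: `storage_ready`, `rescan` and `start_vms` are hypothetical callables that you would have to implement yourself, for example by wrapping whatever commands you run from the vMA. Nothing here calls a real vSphere API.

```python
import time


def wait_then_start(storage_ready, rescan, start_vms,
                    attempts=10, delay_s=30.0, sleep=time.sleep):
    """Poll until storage is reachable, then rescan and start the VMs.

    All three callables are placeholders supplied by the caller.
    Returns the attempt number that succeeded, or None if storage
    never became available within the allowed attempts.
    """
    for attempt in range(1, attempts + 1):
        if storage_ready():
            rescan()      # make the host see the iSCSI datastores
            start_vms()   # only now bring up the dependent VMs
            return attempt
        sleep(delay_s)    # give the VSA more time to come up
    return None


# Dry run with fake callables: storage becomes ready on the third check.
checks = iter([False, False, True])
log = []
result = wait_then_start(
    storage_ready=lambda: next(checks),
    rescan=lambda: log.append("rescan"),
    start_vms=lambda: log.append("start"),
    sleep=lambda seconds: None,  # skip real waiting in the dry run
)
print(result, log)
```

Passing the steps in as callables keeps the ordering logic testable without touching a host, and the attempt limit stops the script looping forever if the VSA never comes back.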
So maybe you have an idea on this. I have a VM I want to have come back up automatically, always. It's in HA and that’s all good, except when there is a power outage and the UPS gracefully shuts things down. HA sees its state as shut down, so it won’t start if there is a host still up, and it won’t start when the power comes back on, as of 5.0u1.
I suppose I could hack together a script to check its state periodically and restart it if it is shut down, but that seems inelegant with all this fancy HA hanging about.
Am I missing a setting somewhere?
There is a bit of a gap there at the moment. Scripting is one way of solving it, but you'd have to have another system handle bringing it back up. The other option is to have the UPS not shut down that particular VM and then HA will automatically power it back up when everything comes back to life. To protect the application and data integrity you'd want to get the OS to not power off the VM when the OS has shut itself down, so basically a halt. So you might have to disable or remove the advanced power management from the OS, or tweak some advanced settings. I've described the situation with 5.0 U1 in my article about Auto Start being broken. You can refer to http://longwhiteclouds.com/2012/03/28/auto-start-….
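For the scripting route, the core check is small. The sketch below is a rough outline only: `get_power_state` and `power_on` are hypothetical callables you would back with your own tooling, and it assumes the state is reported as the string "poweredOff" when the VM is down. Something outside the cluster would still need to run it on a schedule.

```python
def ensure_powered_on(vm_name, get_power_state, power_on):
    """Power the VM on if it is reported as off.

    get_power_state and power_on are placeholders supplied by the caller.
    Returns True if a power-on was requested, False otherwise.
    """
    state = get_power_state(vm_name)
    if state == "poweredOff":  # assumed state string
        power_on(vm_name)
        return True
    return False


# Dry run against a fake inventory instead of a real environment.
states = {"critical-vm": "poweredOff"}


def fake_power_on(name):
    states[name] = "poweredOn"


first = ensure_powered_on("critical-vm", states.get, fake_power_on)
second = ensure_powered_on("critical-vm", states.get, fake_power_on)
print(first, second, states["critical-vm"])
```

The second call does nothing because the VM is already on, so running the check repeatedly from a scheduler is harmless.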
Wow, and I just learned something I didn't know. Thanks.