vSphere HA is one of the great features that simplifies high availability protection for any supported guest OS running in a VMware environment. If a vSphere host fails, HA will kick in and restart the VMs that were running on it on another available host. HA has gone through a continual cycle of enhancements over the years and was greatly improved in vSphere 5. But some of you may be wondering exactly what is going to happen when your clusters fail for real and you have a total outage. Well, I have now experienced a complete outage a couple of times. So I’d like to share my experience with HA in vSphere 5, and how it compares to vSphere 4.x.
Before I get started, I should mention that Duncan Epping and Frank Denneman wrote a great book about vSphere 5 clustering, including HA, titled vSphere 5.0 Clustering Technical Deepdive. I highly recommend you buy this book. It covers vSphere 5 clustering and HA in a lot of detail and is basically the bible on the subject. I also have links to Duncan’s (Yellow Bricks) and Frank’s blogs on my sidebar.
Those of you who have read about My Lab Environment know I run some pretty serious kit. Unfortunately it is not housed in a robust commercial data center. It happens to be located in my home office, which means it is subject to unplanned downtime on a scale that most commercial environments would never face. Firstly, I don’t have swipe card access controls on my office, so there are my two young sons, who just love the flashing lights and the buttons on the front of the servers. Then there is the fact that until recently all of the UPSs came off the same circuit. This mix means I regularly and unexpectedly test HA on either a partial or full scale failure basis, depending on whether it’s a server that’s been shut down by a button press, or a circuit breaker that has tripped because too many appliances have been turned on in the kitchen (no kidding).
I know by now you are probably rolling on the floor laughing, and that is good, as this is meant to be lighthearted. But trust me, there is a serious side to this post. With vSphere 4.x, if I had an outage, whether by button press or power failure, it would take considerable time to recover and there would be quite a lot of manual intervention (a single host going down wasn’t a problem, as HA would kick in and recover provided storage was available). All Paths Down events would occur, and situations such as the one described in my post The Achilles Heel of the VMware Distributed Switch could happen. As the majority of my storage (before the CX500) as well as all of my systems are virtualized, it would take a little while, and potentially a few reboots, to get everything back up and running. When I upgraded to vSphere 5 I was hopeful that the vast improvements in technology I had experienced over the years would continue. I was more than pleased with the actual results.
This is how the story goes:
One night while I was working late, sometime after I had upgraded my lab to vSphere 5, I got a call from my wife. She was a little upset. She had been in the middle of cooking dinner and now nothing in the kitchen was working (of course this was my fault). There was also some loud beeping coming from my office, but she didn’t know what it was. I’m sure you can imagine. It was all of my UPSs running on battery power, which only lasts for 10 minutes, and I don’t have integrated shutdown (Does anyone make integrated shutdown compatible with vSphere 5? Please tell me). There wasn’t anything I could do about it, so we agreed to leave it till I got home. Fortunately we have a gas oven and stove top, so dinner wasn’t completely ruined.
When I got home I was expecting the worst: a long wait not only to get power restored, but also to restore all of my systems. Little did I know I was about to experience one of the best new enhancements to vSphere HA (in my opinion).
I found and reset the circuit breaker that had tripped, and my office sprang back to life. I decided to leave it for 10 minutes and go and make a coffee. I expected to come back to the virtual equivalent of the aftermath of a nuclear meltdown. However, when I returned, all of the hosts had restarted, along with all of the virtual machines that were running on them before they powered down (crashed). This included my HP P4000 iSCSI storage, which is local to each host but replicated across the network.
This is one of those little features that is easy to gloss over. But vSphere HA remembered the state of all of the virtual machines, whether they were powered on or powered off. Therefore, when the power was restored, HA went about powering on all the VMs that had been running. Without any human intervention, within the space of about 15 minutes I had everything back up and running in a crash-consistent state from where it was left. Luckily the loss of power had not caused any other damage.
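To make the idea concrete, here is a minimal conceptual sketch in Python of "remember which VMs were powered on, then restart only those after an outage". This is not how vSphere HA is implemented internally; the state file, its JSON format, and the function names `record_power_state` and `vms_to_restart` are all hypothetical and exist only to illustrate the behaviour I observed.

```python
import json
import tempfile
from pathlib import Path


def record_power_state(state_file: Path, vm_name: str, powered_on: bool) -> None:
    """Persist the last known power state of a VM (hypothetical helper)."""
    states = json.loads(state_file.read_text()) if state_file.exists() else {}
    states[vm_name] = powered_on
    state_file.write_text(json.dumps(states))


def vms_to_restart(state_file: Path) -> list[str]:
    """Return, in sorted order, the VMs recorded as powered on before the outage."""
    if not state_file.exists():
        return []
    states = json.loads(state_file.read_text())
    return sorted(name for name, powered_on in states.items() if powered_on)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Assumed location of the persisted state; purely illustrative.
        state_file = Path(tmp) / "vm_power_states.json"
        record_power_state(state_file, "storage-node-1", True)
        record_power_state(state_file, "domain-controller", True)
        record_power_state(state_file, "old-test-vm", False)
        # After power returns, only the VMs that were running are restarted.
        print(vms_to_restart(state_file))
```

The point of the sketch is simply that the power state is recorded somewhere that survives the outage, so a VM that was deliberately powered off stays off while everything that was running comes back without anyone touching it.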
So on top of all the great new features in vSphere 5 HA, which include removing the restrictions caused by the 5 primary nodes by moving to the Fault Domain Manager (FDM), enhancements to admission control, and the vSphere HA VM Monitoring APIs, you also get automatic restart of the VMs that were powered on at the time of the crash. vSphere 5 HA, when enabled and configured appropriately, makes recovery from an outage, be it a transient or full outage of a host or of an entire cluster, much easier and quicker. This alone is a worthwhile reason to consider upgrading to vSphere 5. vSphere 5 HA will make your life much easier and allow you to focus on other activities that add value to your organization.
I hope you enjoyed my story and I hope you don’t have to experience this type of unplanned downtime, at least not as often as I do in my home lab.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster. Copyright © 2012 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.