High Availability is a key consideration for any VMware vSphere design. VMware HA is an easy and effective feature that you should always enable to improve VM availability. vSphere 5 introduces a considerably enhanced mechanism for achieving high availability that removes the limitations of previous versions, making it much easier to build clusters that contain far more hosts.

With the enhancements to VMware HA in vSphere 5 there are some considerations that are important to take into account, especially in blade environments, to achieve adequate availability in different failure scenarios. With much larger clusters, and with clusters that will contain business critical workloads, it's important that you consider HA not just in terms of N+1 hosts, but also the cases where N+1 does not equal 1 host.
This will not be a deep dive into VMware HA and Admission Control, but rather a brief overview of the design considerations that apply in all environments and some differences between vSphere 4.x and 5.x. The best reference for VMware HA and DRS is Duncan Epping and Frank Denneman's book VMware vSphere 5 Clustering Technical Deep Dive. I highly recommend it; every VMware admin should own a copy.
First, some brief background on VMware HA in vSphere 4.x and 5 blade server environments.
VMware HA in a vSphere 4.x Blade Environment
In vSphere 4.x VMware HA used a concept of primaries and secondaries. There were up to 5 primary nodes in a cluster, responsible for ensuring VMs were restarted in case of host failure. In a blade server environment this meant you should deploy no more than 4 hosts per chassis to guarantee that, in the case of a chassis failure, at least one primary would survive to coordinate VM restarts. There was no mechanism to control where the primaries would live within the cluster, which somewhat limited flexibility and cluster sizes. Most organizations split clusters across multiple blade chassis and kept fewer than 4 hosts per chassis; in most cases only 1 or 2 hosts per cluster were deployed per chassis to reduce the fault domain. Most clusters were limited to 8 hosts to limit the number of chassis required.
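The 4-hosts-per-chassis rule is really just the pigeonhole principle: with 5 primaries and no more than 4 hosts in any one chassis, no single chassis can hold every primary. A small sketch (the chassis layouts are hypothetical) can brute-force the worst case:

```python
# Worst-case check of the vSphere 4.x primary-node rule: with 5 HA
# primaries placed anywhere, does every single-chassis failure still
# leave at least one primary alive? The layouts below are hypothetical.
from itertools import combinations

def survives_chassis_failure(hosts_per_chassis, num_primaries=5):
    """True if every possible primary placement leaves at least one
    primary alive after any single chassis fails."""
    # Enumerate hosts as (chassis_index, host_index) pairs.
    hosts = [(c, h) for c, n in enumerate(hosts_per_chassis) for h in range(n)]
    for placement in combinations(hosts, num_primaries):
        chassis_used = {c for c, _ in placement}
        for failed in chassis_used:
            if all(c == failed for c, _ in placement):
                return False  # all primaries sat in the failed chassis
    return True

print(survives_chassis_failure([4, 4, 4, 4]))  # max 4 per chassis -> True
print(survives_chassis_failure([5, 4, 4, 3]))  # 5 in one chassis -> False
```

As soon as any chassis holds 5 or more cluster hosts, there is a placement in which all primaries die with that chassis, which is exactly why vSphere 4.x designs capped hosts per chassis.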
VMware HA in a vSphere 5 Blade Environment
VMware HA in vSphere 5 replaces the previous AAM (Automated Availability Management) agent with FDM (Fault Domain Manager). FDM does away with the multiple-primaries concept entirely, replacing it with a master and slave arrangement: if the master fails, a new master is elected. Purely from a VMware HA perspective, this makes it much less of a problem to have more than 4 hosts per chassis in the same cluster. In the case of a chassis failure, a new master will be elected on a host on another blade chassis and any failed VMs will be restarted. This allows cluster sizes to be much larger, potentially with fewer blade chassis, and provides much more design flexibility. You may well now design environments with 4 or more hosts per chassis in the same cluster. Here, though, I would like to introduce one of my design maxims: just because you can doesn't mean you immediately should.
When N+1 is not Equal to 1 Host
I have recently completed a couple of engagements designing environments for very large financial institutions where the clusters will start off with between 10 and 16 hosts each from day one. Both environments will use blade servers, and both have fairly strict high availability requirements. These designs highlighted a situation where N+1 availability in a cluster may not equal just 1 host. In both cases the customers required that the cluster continue to operate without major performance impact not just after a single host failure, but also after an entire chassis failure. So here N+1 equals N+1 chassis, not just N+1 hosts. HA Admission Control then needs to be configured to ensure sufficient resources are available to restart all necessary VMs in the case of a chassis failure. The diagram below provides an overview of an example layout.
In the above diagram you can see the Management Cluster consists of 4 hosts, each deployed into a separate chassis. The Resource Cluster contains 16 hosts, of which 4 could fail without a major impact on availability or performance. In this design I chose to set HA Admission Control to a percentage of cluster resources reserved for failover, sized to two hosts, and reserved another two hosts through capacity planning and performance management processes. This allows sufficient maintenance and growth capacity while guaranteeing availability. Depending on the number of hosts you can afford to lose per cluster, you may still want to deploy fewer hosts per chassis and use more chassis. If you have a large environment with multiple clusters and a large number of chassis, it may be possible to have a large cluster and still only have 2 hosts per chassis.
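The arithmetic behind this sizing is simple, and worth making explicit. A quick sketch, using the example numbers above (16 identically sized hosts across 4 chassis):

```python
# Sizing HA Admission Control's percentage-based policy so that a full
# chassis failure is covered. Numbers match the example design:
# 16 hosts across 4 chassis, 4 hosts per chassis, identical host sizes.
def reserved_percentage(total_hosts, hosts_reserved):
    """Percentage of cluster resources to reserve for failover,
    assuming identically sized hosts."""
    return round(100 * hosts_reserved / total_hosts, 1)

total_hosts = 16
chassis = 4
hosts_per_chassis = total_hosts // chassis   # 4 hosts per chassis

# Two hosts reserved via Admission Control...
ac_reserve = reserved_percentage(total_hosts, 2)        # 12.5%
# ...and two more kept free via capacity planning / performance management.
planning_reserve = reserved_percentage(total_hosts, 2)  # 12.5%

print(f"Admission Control reservation: {ac_reserve}%")
print(f"Total effective headroom: {ac_reserve + planning_reserve}% "
      f"(= {hosts_per_chassis} hosts, one full chassis)")
```

The combined headroom equals one full chassis worth of hosts (25% here), so a whole-chassis failure can be absorbed even though Admission Control itself only enforces half of it.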
The above should not just be a consideration for very large environments with many hosts, chassis and clusters; it also applies if you have a more moderately sized environment and want to consolidate existing clusters into fewer, larger clusters when you upgrade to vSphere 5.
Another Example Where N+1 is not Equal to N+1 Hosts
In the same project the customers also wanted to ensure that if a single rack became unavailable, through a power fault or maintenance, the cluster would survive with limited impact on performance. To achieve this the blade chassis were split across different racks within the datacentre, ensuring any single rack issue would have the same failure domain as a single blade chassis failure. In this case N+1 equals N+1 racks, not just N+1 hosts or blade chassis.
DRS Affinity Rule Considerations with Blade Environments
When you have multiple hosts per chassis in the same cluster in a blade environment, without additional configuration it is possible that VMs covered by a DRS Anti-Affinity Rule will be separated only across hosts within a single chassis. This is a potential problem: there is no real guarantee the VMs are kept apart, and availability will be impacted by a blade chassis failure. The same problem occurs with Fault Tolerant VMs, whose primary and secondary may run on two hosts within the same blade chassis; it is not possible to assign the primary and secondary of a fault tolerant pair to different DRS groups. VMware DRS and HA also do not currently have any chassis, rack, or site awareness or tagging capability, which would be very useful for addressing advanced availability requirements.
To guarantee separation across chassis you should specify one or more DRS Host Groups, VM Groups and VM Group to Host Group rules in addition to the Anti-Affinity rules. The DRS Host Groups should be defined horizontally across the cluster, each group drawing its hosts from different blade chassis. This guarantees that VMs that need to be kept separate will not reside on two different hosts in the same chassis. The diagram below provides an overview of this configuration.
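One way to realize this "horizontal" layout: build each host group from one host slot in every chassis. An anti-affinity rule then forces the VMs pinned to that group onto different hosts, and since the group contains at most one host per chassis, different hosts means different chassis. A sketch with hypothetical host and chassis names (in practice the groups would be created with PowerCLI or vCenter Orchestrator):

```python
# Sketch of laying out DRS Host Groups "horizontally" across chassis.
# Host/chassis names are hypothetical illustrations only.
from collections import defaultdict

# chassis -> hosts: 4 chassis with 4 hosts each, e.g. esx-11 .. esx-44
chassis_hosts = {
    f"chassis{c}": [f"esx-{c}{h}" for h in range(1, 5)] for c in range(1, 5)
}

def horizontal_host_groups(chassis_hosts):
    """Slot i of every chassis goes into host group i, so each group
    contains exactly one host per chassis."""
    groups = defaultdict(list)
    for hosts in chassis_hosts.values():
        for slot, host in enumerate(hosts):
            groups[f"HostGroup-{slot + 1}"].append(host)
    return dict(groups)

groups = horizontal_host_groups(chassis_hosts)
for name, hosts in sorted(groups.items()):
    print(name, hosts)
# VMs pinned to one group (via a VM Group -> Host Group rule) plus an
# anti-affinity rule between them can never share a chassis, because
# each group holds only one host from any given chassis.
```

This is only a placement model; the actual separation is enforced by DRS once the groups and rules are configured in vCenter.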
Configuring one or more DRS Host Groups means you can assign the relevant VMs to the correct grouping, but it adds slightly to management overhead. This could be automated at provisioning time with workflows in vCenter Orchestrator or PowerShell scripts, or executed manually by an administrator. It should be the exception and not the rule, as it limits the placement options for VMware DRS and may reduce cluster efficiency and flexibility. It also does not work well with vCloud Director, as vCloud has no knowledge of the host groups; in theory this could be mitigated by using blocking tasks and, again, integration with vCenter Orchestrator.
What about Rack Server Environments?
Similar considerations apply to rack server environments in terms of rack failure and the distribution of hosts and clusters across multiple racks. Outages and maintenance of racks should be taken into account when you design your clusters. You may also want to define DRS Host Groups that span racks, so you can ensure VMs are separated across physically different racks, not just across hosts within a rack. It all comes down to your business and availability requirements.
Don't consider cluster failure scenarios solely in terms of N+1 hosts; ensure you consider chassis and rack failure and maintenance scenarios too. This is particularly important in large enterprise and cloud service provider environments. Always consider the three maxims of cloud computing when you are designing your environment: hardware fails, software has bugs, and people make mistakes. Try to limit the failure domain for each failure scenario. Configuration problems and problems during upgrades are more likely to impact availability than random hardware failure. Just because your vendor says the upgrade is non-disruptive doesn't mean it will be; trust but verify, and always limit risk where possible!
Don't design an elaborate and complex HA cluster just because you can; always consider business requirements and cost objectives and weigh them against availability and risk. There is no one-size-fits-all approach.
Lastly, buy Duncan and Frank's book, VMware vSphere 5 Clustering Technical Deep Dive. It is very reasonably priced and you won't regret it. Before you ask: no, I don't get any kickbacks or commission from recommending it. It is simply a reference I don't think any admin should be without.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster +. Copyright © 2012 – IT Solutions 2000 Ltd and Michael Webster +. All rights reserved. Not to be reproduced for commercial purposes without written permission.
But… Seriously, how many times have you seen a blade chassis fail?
Sorry, I should elaborate. I understand the necessity of designing for failures of all types, but personally, while I have seen total site failure occur several times with multiple customers, I have never seen a blade chassis fail…
I have seen it 3 times so far… Not a pretty sight, I can tell you that. It is uncommon indeed, though, and you will need to decide for yourself whether you want to take a failure like that into account.
Agreed, not every environment will justify or require this level of resiliency; it should be driven by business requirements and also cost considerations. Hopefully the infrastructure can become more location aware, and in a future release HA and DRS rules may take some of this location awareness into consideration. As we drive more efficiency and consolidation through cloud, the number of environments that need to consider chassis failure will undoubtedly increase, especially as clusters become ever larger.
Every major customer (1000+ users) I have worked for has experienced at least one chassis failure. Chassis introduce an ugly single point of failure issue that needs to be catered for. If you haven't had this experience then I would say you have been very lucky so far.
A number of times, and for a few different reasons. In the worst case a single management domain was created across two C Class HP Blade Chassis and an upgrade of the management firmware caused both chassis to become unavailable, as well as all hosts within the two chassis. I have also seen serious configuration errors on blade chassis cause complete blade unavailability within the chassis. So although it should be rare, I have seen it quite a few times. I have also seen multiple instances just in the last 6 months where a single rack lost power due to maintenance or configuration issues. It's also not just about hardware failure, but configuration issues, and also maintenance and upgrades. In an environment that requires high availability these considerations will be important. Bottom line, it's not as rare as it should be and we have to plan and design to eliminate as many single points of failure as possible within reasonable cost objectives dependent on customer requirements.
If the rack could expose the chassis information, that would make things easier…
Great writeup Mike, well done. It is amazing how you seem to put pen to paper on stuff that is currently floating around in my head.
Thanks for this article. Can you elaborate at all on the scrutiny you gave to the storage layer with regards to it also being a failure point? I.e., we can arrange a layout on the servers to reduce risk from a chassis failure, but below that we may have a storage layout that undermines or contradicts efforts at the chassis layer.
Hi Mark, ideally the same consideration would be given to the storage layer to ensure availability. In the case of one of my customers there are multiple storage arrays for different parts of the infrastructure. In many cases there may only be separation of storage between management and resource clusters. It will depend on the type of storage array selected, but in a lot of cases the storage layer already has multiple levels of redundancy and resiliency built in. In one customer use case they are deploying their VMs with in-guest disk mirroring between two storage arrays to mitigate the risk of failure of one array; this is for an application that doesn't scale horizontally. Configuration risks on the storage still exist and also need to be mitigated. Availability and scalability at all levels of the infrastructure and applications need to be considered and driven by business requirements.
The bottom line is it's rare that you can address every failure scenario. We could go on and on, and at some point there will be a single point of failure. In most cases, however, that doesn't mean you don't mitigate risk at each layer.
In my environment I have two Dell M1000e blade chassis, and have spread my MGMT/Resource clusters across both enclosures, with two stacked Cisco 3750s and a couple of Brocade FC switches. However, they connect to a single VNX5500, albeit with dual controllers. Even at the storage layer I have my pools spread across my DAEs, have hot spares, and my EFDs for FAST are spread across DAEs as well.
This doesn't mean that I'm completely protected from a storage failure, but I took the necessary steps where I could to mitigate risk.
Great article, and it's exactly where we are at the moment. In reference to chassis failure, we have experienced one: the chassis didn't fail outright, but the backplane needed a swap-out due to a bent pin. We also have the same problem in another chassis but are living with that issue at the moment.
My main concern for chassis outage comes down to firmware upgrades on the VC cards. If the upgrade doesn't go well, or you have a design flaw in your architecture, you can suffer an outage and end up with split brain going on in your farm. Also, the three maxims highlighted are important considerations; people do make mistakes and software may have bugs. To add to this, consider your application software and make sure the application architecture fits the DRS design and is also tolerant of vMotion. We have been in the position where an application cluster was tweaked too aggressively and did not survive the vMotion. Application experts will from time to time make changes without applying any logic to what has been implemented at the hardware layer.