The Status of Microsoft Failover Clustering Support on VMware vSphere 5.1

The number of enquiries I’ve been receiving regarding Microsoft Failover Clustering, especially for Microsoft SQL Server Databases has skyrocketed in the past few weeks. I have been receiving a number of enquiries from customers and also from partners including cloud service providers. As a result I thought I’d write this article to help you understand what the current status is of support for Microsoft Failover Clustering on VMware vSphere 5.1 (GA) and with regard to some VMware products.

Background Reading

Firstly there are two main VMware knowledge base article that outline the support statements of Microsoft Failover Clustering and Microsoft Cluster Services on VMware vSphere. They are as follows:

Microsoft Cluster Service (MSCS) support on ESXi/ESX (1004617)

Microsoft Clustering on VMware vSphere: Guidelines for Supported Configurations (1037959)

This article only applies to vSphere 5.1. The rule book has been rewritten with vSphere 5.5, check out my article on vSphere 5.5 Windows Failover Clustering Support.

Clustering and VMware Solutions

In addition to the above there are specific mention of clustering configurations for the VMware technologies that support it, such as for the vCloud Director SQL Database, which was introduced in vCD 5.1 and covered in my article Clustering Support on vCloud Director and vCenter Databases. The golden rule is this. If VMware does not specifically document a clustering solution as being supported then it is NOT supported. vCenter Server from version 4.0 to current 5.1 GA does not support a clustered Database, be it Oracle RAC or SQL Server. It has not been tested by VMware and is therefore not supported. This may well change in the future as VMware recognises the need to provide alternative high availability solutions for the vCenter Database and I will update this article accordingly. However currently the supported high availability solutions for the vCenter and its database are VMware HA, and vCenter Server Heartbeat. Clustering of the vCenter Server itself is also not supported by VMware but is covered by KB article 1024051 – Supported vCenter Server high availability options.

Customers with production support who wish to run Oracle RAC for the DB for vCenter (not SSO, as that doesn’t work) can get support from the VMware Oracle Support Team under VMware’s Expanded Oracle Support Policy. But they will be limited by the capabilities of vCenter itself, if any. I do know a number of customers running vCenter DB (Not SSO) on Oracle RAC in an active/passive service configuration and it has been fine for years. Also I expect the official support statement to change in the future as the testing for vCenter and RAC is completed.

Not supported does not always mean something doesn’t work. But it does mean it hasn’t been tested by VMware and therefore VMware can’t stand behind the configuration as a supported solution. If it’s not documented as supported, then it’s not supported.

The Status of Microsoft Failover Clustering Support on VMware vSphere 5.1

VMware has done a lot of work to enhance support for Microsoft Failover Clustering and its predecessor Microsoft Cluster Services on VMware vSphere 5.1 to support larger cluster sizes. You can now support up to 5 nodes in a virtual Microsoft Failover Cluster on vSphere 5.1. This is great news for environments where two nodes was not enough, even when combined with the additional availability of VMware HA. I’ve implemented a number of solutions where Microsoft Failover Clustering was used successfully in the cases where it was justified and within the limits that were supported. Strong justification and support constraints are two things I’d like you to think about as you read further.

You can still do hybrid Physical and Virtual clusters, and you can also still do cluster-in-a-box with VMDK’s (dev / test of cluster functionality itself not for high availability). VMware Site Recovery Manager is also supported to protect Microsoft Failover Clusters from a DR perspective and there are a number of different configurations you can use such as multi-node to single node, or multi-node to multi-node. This really does make DR for the cluster easy, less error prone, and of the recovery plan itself once it is initiated is automated and provides audit reporting. VMware HA is fully supported, however VMware recommends you implement anti-affinity rules to ensure cluster nodes are prevented from start up on the same physical host.

So what are the gotcha’s or caveats I hear you ask? Well there are a few gaps in support that you should be aware of when developing your solution architecture. I’ll also cover some of the other valid options you have for high availability later as well and some of the impacts of using Microsoft Failover Clustering. This list is in no particular order.

Clustering Across Boxes (i.e. traditional clustering for high availability purposes) is not supported with the use of VMDK’s or Virtual Mode RDM (vRDM). You must use Physical Mode RDM’s (pRDM) due to the requirement of persistent SCSI reservations.
Due to the requirement to use pRDM’s there is no support for doing backups with vSphere API’s for Data Protection (vADP). So you must use in guest agents for backup.
The is no support for vMotion or DRS with Microsoft Failover Clusters as they use shared disks and a shared SCSI bus. Any attempt to migrate a cluster node will be met with an error message. This doesn’t mean you can’t deploy a Microsoft Failover Cluster inside a VMware DRS cluster, because you can and it’s fine, it just means that DRS can’t automatically migrate the Microsoft Failover Cluster nodes automatically because vMotion isn’t supported.
Windows Server 2012 Failover Clustering is not supported currently. ~~Period. Not even with in-guest iSCSI.~~ [Updated 21/06/2013] Except with non-shared disk access, in-guest iSCSI, or in-guest SMB storage access. MS SQL Server 2012 on top of Windows Server 2012 with AlwaysOn Availability Groups is supported as it does not require shared disk. See the VMware KB Microsoft Clustering on VMware vSphere: Guidelines for Supported Configurations (1037959) and my article Windows Server 2012 Failover Clustering Now Supported By VMware With Some Caveats.
There is no support for Native iSCSI (where an RDM is presented via the host iSCSI initiator or iSCSI HBA to a guest)
There is no support for Fibre Channel over Ethernet (FCoE). Even if the FCoE Converged Network Adapter (CNA) presents itself as a normal HBA to the host, the use of this configuration with Microsoft Failover Clustering is not supported. With one exception – Two node cluster configuration with Cisco CNA cards (VIC 1240/1280) and driver version 1.5.0.8 is supported on Windows 2008 R2 SP1 64-bit Guest OS in vSphere 5.1 Update 1.
The use of Round Robin Multipathing for your Path Selection Policy (PSP) is not supported.
If you are deploying a hybrid physical node – virtual node Microsoft Failover Cluster the Physical Node can’t use Multipathing software.
No support for VM snapshots, which is one of the reasons that vADP backups don’t work.
No support for Storage vMotion due to the use of pRDM’s.

Some of the above restrictions, especially lack of vMotion and DRS support make it very difficult for cloud service providers that are using vSphere to offer Microsoft Failover Clusters as a service. The reason is obvious. One of the main benefits of having an Infrastructure as a Service is completely non-disruptive hardware upgrades and maintenance. This is not possible with Microsoft Failover Clusters with the current constraints. If cloud service providers wanted to offer a Failover Clustering option they would need to notify customers to shut down their cluster nodes each time the firmware, drivers, or hypervisor version needs to be updated on their hosts. This of course also applies in your private cloud. Downtime would be required on the nodes each time the hosts need to be updated due to the lack of vMotion and DRS capabilities.

Even with the limitation though the advantages of virtualizing your clusters still outweigh the drawbacks. You still benefit from VMware HA, and the performance and reliability you’ve come to expect. You also get the benefit of being able to use VMware SRM for Disaster Recovery.

Options and Alternatives for High Availability

Failover Clustering is inherently complex. It doesn’t always provide high availability either. There are scenarios where downtime is still required and that downtime might be as much as would be expected just using VMware HA. Because the underlying disks are shared any storage loss or corruption will affect the entire cluster. A cluster if very static and hard to move about between hosts or from your private cloud to a cloud provider if you wished. These are some of the reasons it’s not always the best option.

When considering clustering and you think you’re protecting against Guest OS or Host failure think about the last time you saw a Blue Screen of Death (BSOD) from a VM, or you had a host fail. Hardware reliability is greatly improved and most BSOD’s are caused by drivers. With the standard drivers used when you virtualize your servers you are very unlikely to get a BSOD, at least based on the VMware drivers. There will always be exceptions but this is the case based on my experience for the vast majority of workloads, including those with high availability requirements. Microsoft Failover Clustering is not a DR mechanism (generally), so you need additional measures to provide DR, however some of the alternatives can provide HA and DR in the one solution.

Microsoft Failover Clustering can provide flexibility at times around OS patching. But even this use case has alternatives that provide the same level of availability. I give you an option around rolling patch upgrades below.

If you want to provide high availability to vCenter and the vCenter components then the options are VMware HA and vCenter Server Heartbeat. If you are looking to provide high availability to SQL Databases (ones that are not being used for VMware products that don’t support database clustering) then you have a number of options and alternatives, again these are in no particular order.

In guest iSCSI initiation. This is fully supported and will still allow vMotion and DRS migration to occur. Please Refer to the VMware Guide Titled – Setup for Failover Clustering and Microsoft Cluster Service – Update 1, ESXi 5.1, vCenter 5.1. The guide reads as follows on page 9 “Use of software iSCSI initiators within guest operating systems configured with MSCS, in any configuration supported by Microsoft, is transparent to ESXi hosts and there is no need for explicit support statements from VMware.” Although VMware hasn’t specifically tested this with Windows Server 2012 Failover Clustering there aren’t the same restrictions as this is relying on standard in guest support for the clustering, which as per the guide is transparent to ESXi and does not require any specific support statements from VMware. Provided Microsoft Supports it (Direct Guest Initiated iSCSI for Windows Failover Clustering), which they do, then it’s fine. I would still recommend in guest agents for backups in this case. Incidentally Cisco has a great guide on how to deploy this configuration – Microsoft SQL Server 2012 Failover Cluster on Cisco UCS with iSCSI-Based Storage Access Deployment Guide. This option allows more than the 5 cluster nodes supported by vSphere normally and in fact you could configure a cluster up to the maximum number of nodes supported by Microsoft. This option is not supported with the use of VM snapshots. You can read more about Snapshot Limitations Here.
If you’re wanting high availability above 99.9% for an application such as Exchange or SQL Server you can use the built-in replication technologies such as DAG’s, Database Mirroring, Log Shipping, or Always On Availability Groups depending on the version. These are fully supported by VMware, have full support for vMotion, DRS and HA, and can also provide a DR mechanism. They also are supported with use of VMware vSphere API’s for Data Protection (vADP) for backups. DAG’s and Mirroring or Always On Availability Groups can be used for high availability as well as disaster recovery. The failover can also be completely automated. They also provide additional protection against disk based corruption where clustering would completely fail. You should check if your software vendor supports the Microsoft SQL Client (if using SQL Server) and these automated failover options. Unfortunately VMware doesn’t support Database Mirroring or Always On Availability Groups at this time for SSO or vCenter databases.
You could choose to keep things simple and just rely on VMware HA. This is a very viable solution for up to 99.9% availability. This is a great solution for the vast majority of cases. I know of a single VM with 528GB RAM and 32 vCPU’s being protected by VMware HA and it runs the entire SAP system and Oracle DB for a very large organization and has done so reliably and performed exceptionally well meeting their SLA’s. When at all possible I recommend keeping things simple. Unjustified and unnecessary complexity adds the risk of downtime and higher probability of human error.
If you need more availability than VMware HA alone can provide you could add VM and Application Monitoring and Application HA. This will cover the cases of individual application services failing within the guest.
To cover the use case of failover during in guest patching you can use vCenter Orchestrator in combination with hot add and hot remove and clone operations of virtual disks to patch the OS disks or application disks of a single VM while it’s still running and then fail over to the patched version. This requires some advanced understandings about how the OS and hypervisor work together and would best be done along side VMware PSO, but it is possible. This would achieve very similar availability profile during the rolling patch process as Failover Clustering would.
If you wanted to build a Microsoft Failover Cluster inside of a VMware vCloud Director vApp you could also achieve this. You would need to create the cluster nodes connect them to the shared storage by using in guest iSCSI initiators. They could either connect through an external network out to the iSCSI storage or you could make an iSCSI Target VM as part of the vApp. This would give you Microsoft Failover Clusters as a Service inside an Infrastructure as a Service environment running on top of vCloud Director, with self service, and on demand. There would definitely be some scripting involved but this could be a viable solution, and with orchestration you could also set appropriate anti-affinity rules each time one of these clusters was deployed. No restrictions on HA, vMotion or DRS, it would just work. This option allows more than the 5 cluster nodes supported by vSphere normally and in fact you could configure a cluster up to the maximum number of nodes supported by Microsoft. This option is fully supported by VMware. The idea of using an iSCSI VM inside a vApp came from Andrew Mitchell (@amitchell01 also a VCDX) a colleague from the VMware APJ CoE. This option would not support VM Snapshots. Supportability of vADP for backups is unclear as vADP does get around some of the same limitations for snapshots. But this is not recommended as an option for Production vApps, only development and testing.

Final Word

This article covered the current status as of vSphere 5.1 GA. It is very likely that improvements will be made in future releases to address some of the limitations highlighted above. VMware understands what it needs to do in order to deliver the Software Defined Datacenter and to support Business Critical Applications. I’m sure they’re already working hard to improve platform support for Microsoft Failover Clusters, even if it is only needed in a very small minority of cases. In the meantime my recommendations are to use VMware HA and VM Monitoring or App HA unless there is a very strong justification for something in addition to this. If you have that strong justification then leverage the built in application high availability and protection options. Failover Clustering due to it’s complexities and risks is a last resort and inferior in most cases to application level HA.

[Updated 10/09/2013] As of vSphere 5.5 Microsoft Windows 2012 Failover Clustering is fully supported using Fibre Channel, FCoE, iSCSI or any of the in-guest storage IO access methods. Failover Clustering is also supported for the vCenter Database as of vCenter 5.5. I cover the enhancements to clustering in more detail in a separate article – vSphere 5.5 Windows Failover Clustering Support.

I’d be very interested to get your feedback on this article and hear some of your experiences running Failover Clusters in VMware vSphere.

—

This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster +. Copyright © 2013 – IT Solutions 2000 Ltd and Michael Webster +. All rights reserved. Not to be reproduced for commercial purposes without written permission.

all things Nutanix, VMware, cloud and virtualizing business critical applications

The Status of Microsoft Failover Clustering Support on VMware vSphere 5.1

Like this:

Share this:

Like this: