A little while ago Duncan Epping posted a great article titled Distributed vSwitches and vCenter Outage What's the Deal, which generated some good debate. A few people who commented on the article, including myself, have experienced situations where, after a failure, the host loses the VMware vNetwork Distributed Switch (vDS) networking for its VMs and will not allow the VMs to connect to the network. As you can imagine, this complicates recovery quite a lot. The reason the VMs couldn't connect to the network after the failures mentioned in the post comments is a little-known problem that is currently the Achilles' heel of vDS networking. Fortunately there is an easy solution. (As of build 716794 this problem is meant to be fixed.)
The following is good background for anyone running vSphere prior to build 716794. The problem described is fixed in that build, so if you are on the latest patches of vSphere 5 it should no longer be an issue.
Each vSphere Host relies on data stored on the datastore (in a .dvsData directory) where the VM's working directory is located in order to connect VMs to a vDS when vCenter is unavailable. This is what allows the vSphere Hosts to work completely independently of vCenter: if vCenter is unavailable, VM networking is not impacted. However, if .dvsData is not available at boot time, or when the management services are started, the VMs cannot connect to the network. In these situations, if you connect the vSphere Client directly to the host and go into Edit Settings for a VM, you may see "Invalid Backing" reported. This situation should be rare; it basically requires a perfect storm of problems, such as a full site power outage or a storage network failure. My article titled When Management NICs Go Down is a good example of the type of failure, other than a full power outage, that could cause it.
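To make the dependency concrete, here is a minimal sketch of what the host is looking for: per-port files under a .dvsData directory at the root of the datastore holding the VM's working directory. The function and the datastore path layout below are illustrative assumptions for auditing a mounted datastore, not a VMware tool.

```python
import os

def find_dvs_port_files(datastore_root):
    """Return per-port files under a datastore's .dvsData directory.

    On ESXi, .dvsData/<dvs-uuid>/<port-id> files at the root of the
    datastore holding a VM's working directory are what let the host
    reconnect that VM to its vDS port when vCenter is unavailable.
    If the directory is unreachable when the management services start
    (e.g. storage came back too late), those VMs cannot be connected
    to the network.
    """
    dvs_dir = os.path.join(datastore_root, ".dvsData")
    if not os.path.isdir(dvs_dir):
        return {}  # no vDS port data on this datastore
    ports = {}
    for dvs_uuid in sorted(os.listdir(dvs_dir)):
        uuid_dir = os.path.join(dvs_dir, dvs_uuid)
        if os.path.isdir(uuid_dir):
            # each file is named for the vDS port the VM is bound to
            ports[dvs_uuid] = sorted(os.listdir(uuid_dir))
    return ports
```

On an ESXi host you would point this at a path like /vmfs/volumes/&lt;datastore&gt;; an empty result for a datastore that hosts vDS-connected VMs is the warning sign described above.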
This problem does not affect the vSphere Host VMkernel ports used for management, as the data for those is stored in the ESXi state, in a file called dvsdata.db. As it turns out, the problem will only manifest, or be visible, when vCenter is unavailable (though it is not caused by vCenter). So what happens if one of the VMs impacted by this problem is vCenter? It is recommended that you protect vCenter with available and supported high availability solutions such as vCenter Heartbeat, in a WAN configuration, to protect against a local site power outage. Another option may be a stretched cluster design. Be aware that if a host experiences this problem while vCenter is available (say in a Management Cluster or via an HA solution), the symptoms will be masked and your VMs will continue working.
Duncan has posted a follow-up to his original article titled Digging Deeper into the vDS Construct. It outlines the problem again and also a workaround. In the past the workaround I implemented was to reboot the host. Provided the storage had been restored before the host booted, any VMs could then be connected back to the network (including vCenter, if it was one of the VMs impacted). But a reboot of a host can take quite a while. Fortunately Duncan has found it is as simple as restarting the management services on the host by executing "services.sh restart" at the ESXi Shell (5.0), or Tech Support Mode (4.x).
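The recovery step can be wrapped in a small shell function, run at the ESXi Shell (5.0) or Tech Support Mode (4.x) after storage is back so the .dvsData directories are reachable again. This is a sketch: the /sbin/services.sh path is the usual ESXi location, and the guard is simply so the function fails safely if run somewhere else.

```shell
# Restart the ESXi management agents so the host re-reads .dvsData
# and can reconnect VMs to their vDS ports (much faster than a reboot).
restart_mgmt_agents() {
    if [ -x /sbin/services.sh ]; then
        /sbin/services.sh restart
    else
        # Not an ESXi host (or a non-standard path): do nothing harmful.
        echo "services.sh not found - run this on the affected ESXi host"
        return 1
    fi
}
```

Only run this once storage connectivity has been restored; restarting the agents before .dvsData is reachable will not help.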
It should be noted that this problem only manifests with vDS Port Groups configured with Static Binding, not Ephemeral. Ephemeral by its very nature means no port binding, and therefore no dependency on .dvsData. So you may wish to consider putting the vCenter Server and its DB on a Port Group with Ephemeral Binding if you are using the VMware vDS and only have two NICs in your hosts. This will prevent the problem from affecting your critical management infrastructure. Be aware that Ephemeral port binding is not as scalable as Static port binding, so it is not a perfect solution.
It is very likely that this problem will be fixed in upcoming versions of vSphere. Even though it's a rare problem, its impact can be very high. Fortunately Duncan has now found and told us about a very simple recovery method. But in my opinion the best solution is to not risk having the problem in the first place. Consider the use of Ephemeral Port Binding for your vCenter and vCenter DB and any other dependent VMs (think AD DCs), or alternatively keep them on Standard vSwitches (if you have enough NIC ports in your hosts). Be aware of the management overheads and scalability limits of both of these options. If you have a management cluster (and you should), then either Ephemeral Binding or the vNetwork Standard Switch will be manageable.
A rare problem should not be seen as a reason not to use the vDS for the vast majority of VMs in your environment, but you should be conscious of it when you think about your design and implementation. In most cases the vDS is the best solution.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster +. Copyright © 2012 – IT Solutions 2000 Ltd and Michael Webster +. All rights reserved. Not to be reproduced for commercial purposes without written permission.