A little while ago Duncan Epping posted a great article titled Distributed vSwitches and vCenter Outage What’s the Deal, which generated some good debate. A few people who commented on the article, myself included, have experienced situations where, after a failure, the host drops the VMware vNetwork Distributed Switch (vDS) networking for the VMs on the ESXi Host and does not allow the VMs to connect to the network. As you can imagine, this complicated the recovery quite a lot. The reason the VMs couldn’t connect to the network after the failures mentioned in the post comments is a little known problem, which is currently the Achilles Heel of vNetwork Distributed Switch networking. Fortunately there is an easy solution. (As of build 716794 this problem is meant to be fixed.)
The following is good background for anyone running vSphere prior to build 716794. The problem described was fixed in build 716794, so if you are on the latest patches of vSphere 5 it should no longer be an issue.
Each vSphere Host relies on data stored on the datastore (in a hidden .dvsData folder) where the VM’s working directory is located in order to connect VMs to a vDS when vCenter is unavailable. This is what allows the vSphere Hosts to work completely independently of vCenter: if vCenter is unavailable, VM networking will not be impacted. However, if .dvsData is not available at boot time, or when the management services are started, the VMs cannot connect to the network. In these situations, if you connect the vSphere Client directly to the host and go into Edit Settings for an affected VM, you may see “Invalid Backing” reported. This situation should be rare; it basically takes a perfect storm of problems for it to occur, such as a full site power outage or a storage network failure. My article titled When Management NIC’s Go Down is a good example of the type of failure, other than a full power outage, that could cause these situations.
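If you want to see this dependency for yourself, you can look at the folder directly from the ESXi Shell. A minimal sketch is below; the datastore name is a placeholder, and the layout described in the comments is how .dvsData has appeared on the hosts I have looked at, so verify it on your own build.

    # From the ESXi Shell (Tech Support Mode on 4.x), check that the hidden
    # .dvsData folder is reachable on the datastore holding the VM's working
    # directory. "MyDatastore" is a placeholder - use your own datastore name.
    ls /vmfs/volumes/MyDatastore/.dvsData/

    # Each sub-directory is named after a vDS UUID and contains one small file
    # per dvPort in use by VMs stored on that datastore. If the datastore (and
    # therefore this folder) is unreachable when the management services start,
    # the host cannot reconnect those VMs to the vDS without vCenter.
    ls /vmfs/volumes/MyDatastore/.dvsData/*/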
This problem does not affect the vSphere Host VMkernel Ports used for management, as the data for those is stored in the ESXi state, in a file called dvsdata.db. As it turns out, this problem will only manifest or be visible if vCenter is unavailable (although it is not caused by vCenter). So what happens if one of the VMs impacted by this problem is vCenter? It is recommended that you protect vCenter with available and supported high availability solutions such as vCenter Heartbeat, in a WAN configuration to protect against a local site power outage. Another option may be to use a stretched cluster design. Be aware that if a host experiences these problems and vCenter is available (say in a Management Cluster or via an HA solution), it will mask the symptoms and your VMs will continue working.
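For the curious, you can also look at this local cache directly. The path and the net-dvs option below are what I have seen on ESXi 5.x builds; net-dvs is an unsupported diagnostic utility, so treat this as a read-only sketch and check it on your own build.

    # The host's own copy of its vDS configuration (used for VMkernel ports)
    # lives in the local ESXi state rather than on a datastore:
    ls -l /etc/vmware/dvsdata.db

    # Dump the cached vDS data in readable form (unsupported diagnostic tool,
    # view only - do not attempt to edit the file):
    net-dvs -f /etc/vmware/dvsdata.db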
Duncan has posted a follow up to his original article titled Digging Deeper into the vDS Construct. This outlines the problem again and also a workaround solution. In the past the workaround I implemented was to reboot the host. Provided the storage had been restored before the host booted, any VMs could then be connected back to the network (including vCenter, if it was one of the VMs impacted). But a reboot of a host can take quite a while. Fortunately Duncan has found it is as simple as restarting the management services on the host by executing “services.sh restart” at the ESXi Shell (5.0), or Tech Support Mode (4.x).
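To put that into a concrete recovery sequence: once the storage holding .dvsData is reachable again, the commands below are all that should be needed. The services.sh command is straight from Duncan’s finding; restarting hostd and vpxa individually is an alternative I am listing as an assumption to test in your own lab, not something confirmed for every build.

    # Restart the management services once the storage holding .dvsData is back,
    # instead of rebooting the whole host:
    services.sh restart          # ESXi Shell on 5.0, Tech Support Mode on 4.x

    # Alternative (assumption - test in a lab first): restart the agents individually
    /etc/init.d/hostd restart
    /etc/init.d/vpxa restart

    # Then reconnect any affected VMs to their dvPortgroup via Edit Settings.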
It should be noted that this problem only manifests with vDS Port Groups configured with Static Binding, not Ephemeral. Ephemeral by its very nature means no port binding and therefore no dependency on .dvsData. So you may wish to consider putting the vCenter Server and its DB on a Port Group with Ephemeral Binding if you are using the VMware vDS and only have 2 NICs in your hosts. This will prevent this problem from affecting your critical management infrastructure. Be aware that Ephemeral port binding is not as scalable as static port binding, so it is not a perfect solution.
It is very likely that this problem will be fixed in upcoming versions of vSphere. Even though it’s a rare problem, its impact can be very high. Fortunately Duncan has now found and told us about a very simple recovery method. But in my opinion the best solution is to not even risk having this problem in the first place. Consider the use of Ephemeral Port Binding for your vCenter and vCenter DB and any other dependent VMs (think AD DCs), or alternatively keep them on Standard vSwitches (if you have enough NIC ports in your hosts). Be aware of the management overheads and scalability limits of both of these options. If you have a management cluster (and you should), then either Ephemeral or vNetwork Standard Switch will be manageable.
A rare problem should not be seen as a reason to not use the vDS for the vast majority of VM’s in your environment, but you should be conscious of this when you think about your design and implementation. In the vast majority of cases the vDS is the best solution.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster. Copyright © 2012 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Very good write up. I am currently working on some designs and standards. I really dislike making one-off configurations, but if it avoids potential downtime it makes it all worth it. I also have to agree that a "workaround" is not really a solution. Even so, I am pretty impressed that only a few things have come to the surface for v5. They really spent a good while making this one of the best versions they have ever pushed out. Thanks for taking the time and writing this up. Time to go update docs. 🙂
I agree. vSphere 5 has been one of the most solid releases. There are very few issues that have come out, and those that have are either issues that have been around for a long time (such as this one), that have workarounds, or are minor. But this is exactly what we expect from Infrastructure software that is underpinning ever more critical systems.
[…] All Paths Down events would occur, and situations such as the one described in my post The Achilles Heel of the VMware Distributed Switch could happen. As the majority of my storage (before the CX500) as well as all of my systems are […]
[…] The Achilles Heel of the Virtual Distributed Switch […]
[…] So before I tell you why this has happened I’m going to tell you why I used Auto Start in the first place in an HA/DRS Cluster. Simple answer, as you can see from the screenshot above, I have HP P4000 VSA’s configured on local storage on each host. I was knowingly using Auto Start even though it wasn’t supported. This was to get my storage back up and running in the case of a host reboot. In an attempt to avoid the problem I described in my article The Achilles Heel of the VMware Distributed Switch. […]
Excellent writeup! Bookmarked too.
I wish I had seen this article earlier. I had the call last night: "We have a Black Hawk Down!!". We run with a VPLEX environment, so an Active/Active data center. There was some power work being done in one DC so we migrated all the VMs to the other site. All good until the power work went wrong and blacked out the site. VPLEX worked fine and carried on at the other site as planned, but the vDS had a complete meltdown, and we then found vCenter Heartbeat was connected to the vDS. Still investigating, but it also looks like the hosts weren't looking at the correct site for the datastore???? and hence we had the perfect storm. Our workaround was to reconnect Heartbeat back to a standard switch in order for it to start, which then got the vDS working again. So all in all a complete cock up.
[…] happened and how to mitigate in the future if it happens again. First via Michael Webster I found this, which is old but still educational, and from Duncan Epping I found this and this. Again very […]