I recently had a situation at a customer site where a physical NIC failure (read the Critical Advisory I blogged about) caused an All Paths Down (APD) condition and a management network failure. This required replacing the NIC, in our case with a different NIC type, which didn’t have a compatible driver already installed on the host. See if you can guess where this is heading. New NIC installed, host BIOS and NIC firmware updated, and then the real trouble started!
This forced us to move the management vmkernel port over to a different NIC temporarily on a standard vSwitch by force through the DCUI (hence blowing away the network config on the host). Because the failed NIC (a dual port 10Gb/s NIC) carried both management and storage access (iSCSI), the host was not able to read its vDS configuration from the SAN. When we finally got the host reconnected to vCenter by relocating its vmkernel port, all of the vDS configuration became unavailable. This was compounded by the fact that the replacement NIC was of a different type, and its driver couldn’t be installed through Update Manager, so we had to install it manually.
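For reference, recovering a management vmkernel port onto a standard vSwitch can also be done from the ESXi Shell with esxcli. This is only a sketch of the general approach; the vSwitch name, uplink, portgroup, VLAN ID, and IP details below are illustrative placeholders, not the values from this environment.

```shell
# Create a standard vSwitch and attach a surviving uplink
# (names, VLAN, and addresses are example values only)
esxcli network vswitch standard add --vswitch-name=vSwitch0
esxcli network vswitch standard uplink add --vswitch-name=vSwitch0 --uplink-name=vmnic2

# Create a management portgroup on the correct VLAN
esxcli network vswitch standard portgroup add --vswitch-name=vSwitch0 --portgroup-name="Management Network"
esxcli network vswitch standard portgroup set --portgroup-name="Management Network" --vlan-id=100

# Re-create the management vmkernel interface with a static IP
esxcli network ip interface add --interface-name=vmk0 --portgroup-name="Management Network"
esxcli network ip interface ipv4 set --interface-name=vmk0 --type=static --ipv4=192.168.100.10 --netmask=255.255.255.0
```

The DCUI's "Restore Network Settings" achieves a similar reset, but doing it by hand lets you keep more of the existing configuration intact.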
So now we have the host back in vCenter running on another NIC, which fortunately had the correct VLANs trunked to it. We updated the NIC driver on the host and the new 10Gb/s NIC was visible. It received the same vmnic numbers for both uplinks because it was installed in the same PCI slot location. The software iSCSI initiator was configured correctly as per the previous state. The host now thought it was attached to the vDS, but for some reason the vDS would not sync with the host no matter what we tried. We were not even able to remove the host from the vDS, as vCenter thought the port previously occupied by the management vmkernel port was still operational, which of course it wasn’t.
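When checking a state like this from the ESXi Shell, a few esxcli commands confirm what the host actually sees. A minimal sketch; the driver name in the grep is an example placeholder, not the driver from this environment:

```shell
# List physical NICs: confirms vmnic numbering, link state, and the driver in use
esxcli network nic list

# Confirm the driver VIB is actually installed (driver name is an example)
esxcli software vib list | grep -i elxnet

# Check that the software iSCSI adapter and storage paths came back
esxcli iscsi adapter list
esxcli storage core path list
```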
So the question at this point was how to get the host back and configured correctly on the network without completely rebuilding it or configuring all the networking from the command line by force? The answer was by using Host Profiles! When we had originally configured this environment we had taken a host profile baseline and had all the hosts in compliance.
Host Profiles to the Rescue! I quickly checked the compliance of the other available hosts just to ensure there was nothing wrong with our baseline host profile. Everything was good. I applied the baseline back to our problem host by putting it into maintenance mode and then applying the profile. After entering the relevant IP details for all the vmkernel ports and waiting a couple of minutes for the configuration to apply, everything was looking good.
Finally we rebooted the host to ensure that it started up correctly and could see all the datastores and paths. The host didn’t start! It got caught in an endless loop trying to boot from its CD-ROM and NICs. By this stage I was pulling out my hair trying to figure out why a host that was previously running after the NIC change would suddenly not boot off the internal SD card.
After considering the options for a couple of minutes I decided to enter the BIOS and check the boot order of the devices. Even though the SD card was in the list of boot devices, it wasn’t first in the order. I updated the order so the SD card was first, which in theory should have had no bearing on the host’s ability to start up, as it was previously working. Then I exited the BIOS and, with fingers crossed, hoped that the host would boot. After a couple of minutes going through POST there was huge relief when the hypervisor started to boot. Once the host had booted we found that all the NICs and the vDS configuration were correct, all the datastores were visible, and finally the host was back up and running.
Lessons learned from this experience:
- This failure would not have happened if there had been another 10Gb/s NIC of a different make/model in the host, with the storage and management ports distributed across the different NICs. This was not possible originally due to the cost of ports and host NICs and the security/separation requirements for the 1Gb/s NICs in the host. Perhaps two single port 10Gb/s NICs would have been better than a single dual port 10Gb/s NIC in this situation. Given the small difference in cost between dual port and single port 10Gb/s NICs, even having two dual port NICs of different types and only using one port on each would have been a better solution, as a single port failure could then also be easily fixed.
- Host profiles and Enterprise Plus licenses are really valuable not only when things are going well but also when things go wrong. This experience demonstrates the value of host profiles in addressing configuration faults and bringing a host back into operation, which would have taken significantly longer otherwise.
- Make sure your internal SD card (if you’re using one) is the first in the boot order. If you do a BIOS update, make sure you check this to ensure it’s still correct after the update.
- Before you replace a NIC, try to get the new driver loaded into the hypervisor. This will save a lot of time and troubleshooting when you find out you don’t have the right driver installed, which can be especially problematic if it’s the NIC that carries your management vmkernel port.
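Pre-loading a driver ahead of the hardware swap can be done with esxcli from the vendor's offline bundle. A hedged sketch; the datastore path and bundle filename below are examples, and the actual depot file comes from the NIC vendor's download page for your ESXi release:

```shell
# Copy the vendor's offline bundle to a datastore first, then install it
# (path and bundle name are illustrative examples)
esxcli software vib install -d /vmfs/volumes/datastore1/driver-offline-bundle.zip

# Confirm the driver VIB is present before shutting down for the swap
esxcli software vib list
```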
I hope you never have to experience a situation quite like this. Hopefully you’ll be able to address these types of scenarios in your designs before they become problems in production, although that is not always possible due to customer constraints and requirements. Hopefully this gives you some ideas of what can be done to address these sorts of problems when they arise, and also demonstrates the value of Enterprise Plus licenses. Let me know some of the more problematic troubleshooting and failure scenarios you’ve come across and what you had to do to get them fixed in a timely manner.
Prior to this problem, all hosts within this environment had experienced NIC failures with the same type of NIC, though not to the same extent as this host. All the hosts had been configured with static power at maximum performance and the fans set to enhanced cooling. This made the problems less frequent, but didn’t really address the root cause.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster. Copyright © 2012 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.