I recently had a situation at a customer site where a physical NIC failure (read the Critical Advisory I blogged about) caused an All Paths Down (APD) condition and a management network failure. This required a replacement of the NIC, and in our case with a different NIC type, which didn’t have a compatible driver already installed on the host. See if you can guess where this is heading. New NIC installed, host BIOS and NIC firmware updated, and then the real trouble started!
This forced us to temporarily move the management vmkernel port over to a different NIC on a standard vSwitch by force through the DCUI (hence blowing away the network config on the host). Because the failed NIC (a dual port 10Gb/s NIC) had both management and storage access (iSCSI) configured on it, the host was not able to read its vDS configuration from the SAN. When we finally got the host reconnected to vCenter by relocating its vmkernel port, all of the vDS configuration became unavailable. This was compounded by the fact that the failed NIC was replaced with a new NIC of a different type, whose driver couldn’t be deployed through Update Manager, so we had to install it manually.
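For reference, what the DCUI does here is roughly equivalent to rebuilding a standard vSwitch and management vmkernel port from the ESXi shell. A minimal sketch, where the names and values (vSwitch0, vmnic2, VLAN 100, the IP details) are hypothetical placeholders, not the actual customer configuration:

```shell
# Recreate a standard vSwitch and attach a surviving uplink to it
esxcli network vswitch standard add --vswitch-name=vSwitch0
esxcli network vswitch standard uplink add --uplink-name=vmnic2 --vswitch-name=vSwitch0

# Recreate the management port group and tag it with the management VLAN
esxcli network vswitch standard portgroup add --portgroup-name="Management Network" --vswitch-name=vSwitch0
esxcli network vswitch standard portgroup set --portgroup-name="Management Network" --vlan-id=100

# Recreate the management vmkernel interface and give it a static IP
esxcli network ip interface add --interface-name=vmk0 --portgroup-name="Management Network"
esxcli network ip interface ipv4 set --interface-name=vmk0 --ipv4=192.168.1.10 --netmask=255.255.255.0 --type=static
```

Having these commands handy (or scripted) can make this kind of emergency recovery much less error-prone than driving it through the DCUI under pressure.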
So now we have the host back in vCenter running on another NIC, which fortunately had the correct VLANs trunked to it. We’ve updated the NIC driver on the host and the new 10Gb/s NIC is visible. It has the same vmnic numbers for both uplinks because it was installed in the same PCI slot location. The software iSCSI initiator is configured correctly as per the previous state. The host now thinks it’s attached to the vDS, but for some reason the vDS will not sync with the host no matter what we try. We were not even able to remove the host from the vDS, as vCenter thought the port previously occupied by the management vmkernel port was still operational, which of course it wasn’t.
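When a host’s vDS state gets out of sync like this, the host-side view of the uplink bindings can at least be inspected, and a physical NIC detached, from the ESXi shell. A rough sketch, where dvSwitch0, dvUplink port 256 and vmnic0 are hypothetical names for illustration:

```shell
# Show the host's local view of standard and distributed vSwitches,
# including which dvUplink port ID each vmnic is bound to
esxcfg-vswitch -l

# Unlink a physical NIC from the dvSwitch, freeing the stale dvUplink port
# (use the dvPort ID reported by the listing above)
esxcfg-vswitch -Q vmnic0 -V 256 dvSwitch0
```

This only changes the host’s local state, so vCenter can still disagree with it, as it did in our case; it is a last-resort tool rather than a fix for the sync problem itself.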
So the question at this point was how were we to get the host back and configured correctly on the network without completely rebuilding it or configuring all the networking from the command line by force? The answer was Host Profiles! When we had originally configured this environment we had taken a host profile baseline and brought all the hosts into compliance.
Host Profiles to the Rescue! I quickly checked the compliance of the other available hosts just to ensure there was nothing wrong with our baseline host profile. Everything was good. I applied the baseline back to our problem host by putting it into maintenance mode and then applying the profile. After entering the relevant IP details for all the vmkernel ports and waiting a couple of minutes for the configuration to apply, everything was looking good.
Finally we rebooted the host to ensure that it started up correctly and could see all the datastores and paths. The host didn’t start! It got caught in an endless loop trying to boot from its CD-ROM and NICs. By this stage I was pulling out my hair trying to figure out why a host that was running after the NIC change would suddenly not boot off the internal SD card.
After considering the options for a couple of minutes I decided to enter the BIOS and check the boot order of the devices. Even though the SD card was in the list of startup devices it wasn’t first in the order. I updated the order so the SD card was first, which in theory should have had no bearing on its ability to start up, as it was previously working. Then I exited the BIOS and, with fingers crossed, hoped the host would boot. After it took a couple of minutes to go through POST there was huge relief when the hypervisor started to boot. Once the host had booted we found all the NICs and the vDS configuration were correct, all the datastores were visible, and finally the host was back up and running.
Lessons learned from this experience:
- This failure would not have happened if there were another 10Gb/s NIC of a different make/model in the host and the storage and management ports were distributed across the different NICs. This was not possible originally due to the cost of ports and host NICs and the security/separation requirements for the 1Gb/s NICs in the host. Perhaps two single port 10Gb/s NICs would have been better than a single dual port 10Gb/s NIC in this situation. Given the small difference in cost between dual port and single port 10Gb/s NICs, even having two dual port NICs of different types and using only one port on each would have been a better solution, as a single port failure could then also be easily fixed.
- Host profiles and Enterprise Plus licenses are really valuable not only when things are going well but also when things go wrong. This experience demonstrates the value of host profiles in addressing configuration faults and bringing a host back into operation, which would have taken significantly longer otherwise.
- Make sure your internal SD card (if you’re using one) is the first in the boot order. If you do a BIOS update, make sure you check this to ensure it’s still correct after the update.
- Before you replace a NIC, try to get the new driver loaded into the hypervisor first. This will save a lot of time and troubleshooting when you discover you don’t have the right driver installed, which is especially problematic if it’s the NIC that carries your management vmkernel port.
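The pre-staging step above can be done from the ESXi shell before the hardware swap. A sketch, where the datastore path and bundle filename are hypothetical examples for the vendor’s offline driver bundle:

```shell
# Confirm which driver module each vmnic is currently using
esxcli network nic list

# Pre-stage the replacement NIC's driver before the hardware swap
# (the path to the vendor's offline bundle below is a placeholder)
esxcli software vib install -d /vmfs/volumes/datastore1/new-nic-driver-offline-bundle.zip
```

Installing the VIB ahead of time means the new NIC is recognised on first boot after the swap, instead of leaving you with a host that has no working management uplink and no supported way to push the driver to it.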
I hope you never have to experience a situation quite like this. Hopefully you’ll be able to address these types of scenarios in your designs before they become problems in production, but that is not always possible due to customer constraints and requirements. Hopefully this will give you some ideas of what can be done to address these sorts of problems when they arise, and also demonstrate the value of Enterprise Plus licenses. Let me know some of the more problematic troubleshooting and failure scenarios you’ve come across and what you had to do to get them fixed in a timely manner.
Prior to this problem, all hosts within this environment had experienced NIC failures with the same type of NIC, though not to the same extent as this host. All the hosts had been configured with static power at maximum performance and the fans set to enhanced cooling. This made the problems less frequent, but didn’t really address the root cause.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster +. Copyright © 2012 – IT Solutions 2000 Ltd and Michael Webster +. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Good read, but in the end this goes back to basics: don't create vSwitches teamed to the same physical NIC in any situation. This would not have been an issue if that best practice had been adhered to. Just because 10GbE allows the bandwidth to get by with just a couple of connections doesn't mean you should do it. A single dual port NIC is a single point of failure regardless of the fact that it has two ports.
I agree in general, but in many situations that is not possible, either because the customer you are working with has already decided to go with just the 2 x 10GbE ports on a dual port NIC, or because the convergence of the system architecture or server platform doesn't allow it (or cost). This article documents quite an extreme case, where a cascading failure occurred because of a fault with the NICs in each host of the cluster. So there is also something to be said for not using identical NICs in every host in a cluster, as that too is a single point of failure. But you still need to keep your hosts as similar as possible in terms of memory, CPUs, etc.
If the host platform allows, it would be ideal to have a minimum of two physical dual port 10GbE NICs, and also two physical dual port HBAs. But this is not possible in many cases, and it also increases the cost of the infrastructure, which in many situations might not be justified. At least by reading this I hope people will be conscious of the situation and can make informed decisions that meet their specific business requirements, as every situation is different.
Interesting article Mike.
First off, I agree with GMAN: you should never team on the same physical NIC (dual port, quad port or not). That is simply poor practice if it was the case (bad customer).
I don't agree with the statement that you should have separate types of NIC in a host, especially as we move towards converged adapters embedded on-board. In some systems (Cisco UCS for example) you only have one choice; you can't have different types. The point of buying enterprise-ready hardware is that you are paying to mitigate these types of issues, and limiting the number and types of NICs has operational benefits going forward. If you were to take that thinking to the next level, you would end up having AMD-based HPs and Intel-based IBMs “just in case”. Yeah, OK, a little extreme, but you get my point.
Interesting about the SD card needing to be first to boot. Weird, as this means if you want to update the BIOS via CD, iLO etc. in the future you will need to change the boot source and then back to SD. Never seen this issue myself, so I'm interested in which HP box you were dealing with.
Good article, good to read about others experiences.
The systems we were dealing with were DL380 G7s.
The risk you take with a converged infrastructure with everything on board is that if there is a firmware or driver problem, you have no solutions available other than relying on the vendor to fix it, and waiting for the vendor could take weeks or months. This has proven time and again to be a very risky proposition, as some of the most popular vendors have had major firmware and driver problems with their devices. This can of course be mitigated by very thorough testing of the infrastructure before going into production, but you won't be able to anticipate and test every eventuality. If there is only the one on-board multi-port card, then you can't team across cards, which isn't ideal, as I think we all agree.
So for the benefit of those operational gains and the lower costs, you are increasing the risk and impact of some failures. I definitely would not consider two different CPU architectures, but it might be worth considering two different brands of server in a very large cluster if you want to mitigate this risk and having multiple different NICs in the server isn't an option. If NIC driver and firmware problems were rare and their consequences weren't so bad, we wouldn't have to worry about it. Unfortunately that's not the reality. If you want to build a robust, always-on infrastructure to support business critical applications, you have to reduce both the risk of failure and the impact of failure.
On the subject of CNAs specifically, if you lose a dual port CNA in the server (assuming there is only one), not only have you lost your network but you've also lost your access to storage. The consequences of this can be catastrophic, so this particular scenario needs careful consideration depending on the workloads being virtualized.
I'm not against any particular architecture or system, as I've deployed most of them, and I've used converged infrastructures where they met the customer's requirements. But you always need to be aware of the risks you are taking and take them knowingly. In many environments the risks or impacts are insignificant compared to the benefits, but that might not hold when you're virtualizing business critical applications. Every environment and situation is different, and our job is to provide the most cost effective and efficient infrastructure possible that meets the unique business requirements.
Have you deployed Cisco UCS yet?
Cisco UCS is gaining quite a bit of traction here in the northern hemisphere, and of course all HP G7 is gearing towards CNA-type stuff, plus there are Cisco Nexus switches for HP (finally) now… UCS is great kit, though more of a "one size fits all" type approach. Unfortunately UCS includes only a single slot for a CNA in its standard blade offering (note that unlike HP, you can't add a 2nd mezzanine HBA or a 2nd mezzanine NIC), thus the issue described here is a very real possibility. Good article, good insight. I'm afraid the bean counters read the glossy "FCoE to solve global warming" type stuff and don't see the value in separate physical management etc. As you say, horses for courses.
PS I've changed my current deployment from Cisco UCS 230M2s to Cisco 440M2s to mitigate single dual port CNA issues…