A number of my customers running HP server hardware have reported that their hosts are constantly being disconnected from the network, including their management NICs (sometimes causing isolation events), and that they are sometimes getting Purple Screens of Death (PSODs). As you can probably guess, this is causing them some major pain. HP has issued an advisory regarding these problems, which you need to review if you have any of the affected NICs: NC522SFP, NC523SFP, NC375T, NC375i, NC522m, CN1000Q.
I have previously written about the problems a few of my customers experienced with the NC522 and NC523 NICs in my article “HP Critical Advisory – NC522 and NC523 10Gb/s Server Adapters”. The customer I was working with when I originally came across this problem (well before the advisory went out) had a particularly serious case, as the NICs were also used for storage access and management. This was part of what led me to write “When Management NIC’s Go Down”. Fortunately, that customer now has a stable environment, but they went through dozens of firmware and driver updates and eventually had to get the cards replaced.
Now there is a new advisory, as of December 2012, covering a broader set of NICs and systems that are having serious problems, causing VMware vSphere hosts to become disconnected from the network and to PSOD. The HP advisory is titled “HP ProLiant and HP StorageWorks Systems: HP NC375i, NC375T, NC522m, NC522SFP, NC523SFP, CN1000Q Network Adapters – FIRMWARE UPGRADE REQUIRED to Avoid the Loss and Automatic Recovery of Ethernet Connectivity or Adapter Unresponsiveness”. The title really says it all. VMware has issued KB 2012455 regarding this problem. Note that this is not a VMware issue; it is a hardware issue, and you should upgrade to the firmware / driver combination that resolves the problem as soon as possible.
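If you keep a hardware inventory, a quick first pass is to filter it for the model names listed in the advisory title. The snippet below is only a minimal sketch of that idea: the function name `find_affected` and the sample adapter descriptions are hypothetical, and it assumes your inventory records the HP model name somewhere in each description string. It does not replace reading the advisory itself, which lists the affected firmware and driver versions.

```python
import re

# Model names copied from the HP advisory title quoted above.
AFFECTED_MODELS = ("NC375i", "NC375T", "NC522m", "NC522SFP", "NC523SFP", "CN1000Q")


def find_affected(descriptions):
    """Return (description, model) pairs for descriptions naming an affected model.

    Hypothetical helper: assumes each description is free text that
    contains the HP model name as a separate token.
    """
    hits = []
    for desc in descriptions:
        for model in AFFECTED_MODELS:
            # Case-insensitive match; the model name must not be embedded
            # inside a longer alphanumeric token.
            pattern = r"(?<![A-Za-z0-9])" + re.escape(model) + r"(?![A-Za-z0-9])"
            if re.search(pattern, desc, re.IGNORECASE):
                hits.append((desc, model))
                break  # one match per adapter is enough
    return hits


if __name__ == "__main__":
    # Made-up inventory lines, for illustration only.
    inventory = [
        "HP NC522SFP Dual Port 10GbE Server Adapter",
        "Some other vendor 10 Gigabit adapter",
        "hp nc375i integrated quad port adapter",
    ]
    for desc, model in find_affected(inventory):
        print(f"{model}: {desc}")
```

Any adapter this flags is just a candidate for closer inspection; whether it actually needs the upgrade depends on the firmware and driver versions it is running.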
I hope that once you upgrade the firmware / drivers your environment will become as stable as you would normally expect. When working with HP on these types of issues, I have found them to be generally responsive once you get to the right people. I would encourage you to work with your account manager and the HP technical support teams to get these issues resolved. If the problems persist after upgrading the firmware as advised, I would strongly recommend that you consider replacing the NICs with an alternative model after discussions with HP.
NIC disconnections and PSODs of this type should be extremely rare in the overall scheme of things. I have not come across many situations like this in the 10 years I’ve been working with VMware solutions. But when you do hit these types of problems, they need to be resolved as soon as possible. The best approach is to log support requests with both VMware and your hardware vendors. Hopefully you strike these types of hardware problems during QA testing, before your infrastructure goes live into production, but that is not always the case. If you don’t have a QA process for your hardware that includes burn-in, I would recommend you consider one.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster. Copyright © 2013 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.