I have had a number of customers running HP server hardware report that their hosts are constantly being disconnected from the network, including their management NICs (sometimes causing isolation events), and that they are sometimes getting Purple Screens of Death (PSODs). As you can probably guess, this is causing them some major pain. HP has issued an advisory regarding these problems that you need to review if you have any of the affected NICs: NC522SFP, NC523SFP, NC375T, NC375i, NC522m, or CN1000Q.
I have previously written about the problems I experienced with a few customers running the NC522 and NC523 NICs in my article HP Critical Advisory – NC522 and NC523 10Gb/s Server Adapters. The customer I was working with when I originally came across this problem (well before the advisory went out) had a particularly serious case, as the NICs were also used for storage access and management. This partly led to me writing When Management NICs Go Down. Fortunately for my customer, they now have a stable environment, but they went through dozens of firmware and driver updates and eventually had to get the cards replaced.
Now there is a new advisory, as of December 2012, covering a broader set of NICs and systems that are causing VMware vSphere hosts to become disconnected from the network and triggering PSODs. You can find the HP advisory here – HP ProLiant and HP StorageWorks Systems: HP NC375i, NC375T, NC522m, NC522SFP, NC523SFP, CN1000Q Network Adapters – FIRMWARE UPGRADE REQUIRED to Avoid the Loss and Automatic Recovery of Ethernet Connectivity or Adapter Unresponsiveness. The title of the advisory really says it all. VMware has issued KB 2012455 regarding this problem. Note that this is not a VMware issue; it's a hardware issue, and you should upgrade to the firmware / driver combination that resolves the problem as soon as possible.
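If you're not sure whether a host is carrying one of the affected adapters, or which nx_nic driver it is running, both are easy to check from the ESXi shell. A quick sketch for ESXi 5.x (the exact VIB name can vary by release, so treat the grep pattern as an assumption):

    # List the host's NICs along with the driver each uplink is using
    esxcli network nic list

    # Confirm which nx_nic driver package is installed
    esxcli software vib list | grep -i nx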
I hope that once you upgrade the firmware and drivers, your environment will be as stable as you would normally expect. When working with HP on these types of issues I have found them to be generally responsive once you get to the right people. I would encourage you to work with your account manager and the HP technical support teams to get these issues resolved. If the problems persist after upgrading the firmware as advised, then I would strongly recommend you consider replacing the NICs with an alternative model after discussions with HP.
Final Word
NIC disconnections and PSODs of this type should be extremely rare in the overall scheme of things. I have not come across many of these situations in the 10 years I've been working with VMware solutions, but when you do come across them they need to be resolved as soon as possible. The best way to approach it is to log support requests with both VMware and your hardware vendors. Hopefully you strike these types of hardware problems during QA testing, before your infrastructure goes live into production, but that is not always the case. If you don't have a QA process for your hardware that includes burn-in, I would recommend you consider one.
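For the network side of a burn-in, even a trivial scripted reachability check left running for the full burn-in period will catch the kind of intermittent disconnects described above. A minimal sketch to run from a monitoring host; esx01.example.com is a placeholder for your own management addresses:

    # Log any failed pings with a timestamp for the duration of the burn-in
    while true; do
      if ! ping -c 1 -W 2 esx01.example.com > /dev/null 2>&1; then
        echo "$(date -u '+%Y-%m-%d %H:%M:%S') esx01.example.com unreachable" >> burnin.log
      fi
      sleep 5
    done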
—
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster. Copyright © 2013 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.
This is not a new issue?
I find it funny, if not bizarre, that you recommend that "every customer" conduct QA and burn-in tests on hardware that's been qualified by VMware and/or HP. Isn't it VMware's task to make sure that they in fact DID run burn-in tests before they qualify such hardware with vSphere? I do know customers who can afford to keep newly bought hardware out of the production cycle for months just to make sure it all works as expected, but this is certainly not for everyone, because time is money, and usually they buy hardware qualified by VMware for a reason.
Hi Jonas, VMware relies heavily on the OEM vendors of the hardware to qualify their products for use with VMware vSphere. The qualification tests may well include burn-in testing, but that does not mean every server gets burned in before it is shipped. Also, customer burn-in tests don't need to take months; 48 hours would be normal.
It's a reality that not all servers that come out of the factory are defect-free or have the most up-to-date firmware or drivers. If you want to ensure your environment is reliable, some measure of QA testing of your hardware is important before putting it into production. Hardware these days is largely software, and anyone who's been in the software business, or is a user of software, knows that it has bugs. So why would modern hardware be any different?
I've been burned too many times by hardware bugs not to do the necessary brief testing prior to production use. Even today (literally today) a batch of new servers for a customer would not perform vMotion due to firmware bugs with their CNAs. The bugs were only fixed in a fairly recent combination of firmware and drivers. So it's up to you whether you follow this advice or not, but I would still recommend it. I agree it shouldn't be required, but in my opinion it is.
I have been dealing with HP engineers on Emulex OneConnect based NICs since Apr 2012. Neither HP nor VMware is responsible for writing the firmware and drivers for them; Emulex is. And so far everything Emulex released up until Nov 2012 had stability issues. The Dec 2012 and more recent Feb 12th releases are more stable (at least on HP infrastructures).
Everyone who picked Emulex as a supplier has been burnt by this. IBM, Dell and HP all use them. At least with HP, their G8 lines no longer force you to take on Emulex and you can actually choose to go back to Broadcom.
The problem is worse if you use these Emulex NICs for IP-based storage.
The QLogic CNAs have also had the same types of problems. It's a big concern when these two big manufacturers both have stability and reliability issues at the same time. My Broadcom and Intel NICs have been flawless though.
Hi, I have a customer with this problem; as a matter of fact they are scared to move from 5 to 5.1. Is a burn-in test recommended in this scenario? Or could this happen again even then? I mean a PSOD.
Hi Jhonny, If their servers are already running in production, and have been for a long time, there is minimal risk and they shouldn't require a burn-in test just to do an upgrade. But what I would suggest as part of any upgrade project is that you test the upgrade process and the new version of the software in a test lab or non-production environment, to make sure it meets your needs and that your design meets all of your requirements. It doesn't have to be long or laborious; it just has to be enough that you understand the changes. It could be as little as a couple of weeks. I've upgraded quite a few customers' environments to 5.1 and they've had no issues. A little bit of QA can go a long way.
Good post, though that HP advisory is a bit of a joke. It still references an ancient version (5.0.601) of the nx_nic driver/firmware combo for ESXi. They are now up to version 5.0.626 on VMware's site and it is still not at all stable. We have the NC522SFP NICs that frequently just go "offline", for lack of a better word. Sometimes that is just really high latency, sometimes it is many dropped packets, and sometimes it is complete loss of connectivity (though we still have a physical link). Only a reboot resolves this. We've got an escalated case open with HP and VMware trying to make some progress, but I'm 100% of the opinion that the issue is with the nx_nic driver, written by QLogic, because the problem also occurs with the integrated NICs on our HP DL580 G7 servers (the NC375i). This is the fourth version of the driver we've used and they have all been awful. Running ESXi 5.0 U2 here. Not sure if it works better with 5.1 or not.
Hi Allen, I have a customer that did manage to get stability out of their NC522SFPs after the latest firmware and driver update, but in the end they still got the cards replaced. Other things I've learned to improve the situation include setting the server BIOS power profile to Static High Performance or Maximum Performance, enabling Enhanced Cooling, and turning off C-states. These are normally recommended settings for vSphere servers anyway, but the enhanced cooling really seemed to make a difference to the NC522SFPs, as they were prone to running very hot, and more problems would then occur.
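If you want evidence of when the ports are degrading, rather than only noticing a full outage, the counters exposed through ethtool in the ESXi shell can help. A rough sketch; vmnic4 is a placeholder for the affected uplink, and the counter names vary by driver:

    # Snapshot the error and drop counters every 30 seconds
    while true; do
      echo "=== $(date -u '+%Y-%m-%d %H:%M:%S') ==="
      ethtool -S vmnic4 | grep -Ei 'drop|err'
      sleep 30
    done

    # Link-state flaps are also recorded in the vmkernel log
    grep -i vmnic4 /var/log/vmkernel.log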
Well, we've got those very same BIOS settings implemented already (though I need to verify "enhanced cooling" I suppose). Unfortunately, they are still not stable for us, and considering the integrated NICs have the same issues, I am struggling to believe it is a hardware issue rather than a poorly written driver.
Hi Allen, I assume you've upgraded the on-board and add-on NICs to the latest firmware as well? You might well be right, or it could just be a manufacturing fault with the batch of cards. I've seen it before, where a series of components manufactured around the same time all had faults. I hope you're able to get this resolved to your satisfaction.
Yep, we're running the latest SPP from HP, though the firmware is included with the VMware driver and loaded at runtime. So the firmware you see during POST will not match the running firmware if you are using the 5.0.626 driver, as it includes a newer version. HP is spot-checking some servers as well for the "bad batch" issue, but we're still in the middle of that process.
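Since the nx_nic firmware is pushed down by the driver at load time, the running version can be read back per uplink from the ESXi shell instead of trusting what POST displays. A quick sketch, with vmnic4 again a placeholder:

    # The Firmware Version field here reflects what the driver loaded
    # at runtime, not what was flashed and shown during POST
    esxcli network nic get -n vmnic4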