Just wanted to give my readers a heads up that there appears to be a bug in vSphere 5.5 U1 that impacts any NFS-connected storage and can cause random disconnects. By coincidence, the timing comes right on the coat tails of my previous article Hardware Fails, Software Has Bugs and People Make Mistakes – Usually You Get All At Once! During the disconnects VMs will appear frozen and the NFS datastores may be greyed out. This appears to impact all vendors and all environments accessing NFS on 5.5 U1. UPDATED: There is a public KB on this problem from VMware: KB 2076392 – Frequent NFS APDs after upgrading ESXi to 5.5 U1, and VMware is working on it. According to the KB you may experience blue screens of death in Windows and read-only file systems in Linux. I’ve also experienced kernel panics and reboots in Linux as a result of this bug.
NetApp was first to report it. The recommendation from most vendors at this point (including Nutanix, per Field Advisory FA-17) is not to upgrade to vSphere 5.5 U1 and to stay on vSphere 5.5 GA. If you have already upgraded to 5.5 U1 then you may need to downgrade back to 5.5 GA. I have tested and been able to reproduce this problem on vSphere 5.5 U1 on two completely different vendors' systems.
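If you want a quick way to see whether any NFS datastores in your environment are currently reporting as inaccessible, something like the following rough sketch will do it. This assumes pyVmomi against vCenter; the hostname and credentials are placeholders, and depending on your Python/pyVmomi versions you may also need to relax SSL certificate verification:

```python
"""Rough sketch: list NFS datastores and flag any reported as inaccessible,
including the per-host mount state. Connection details are placeholders."""
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host='vcenter.example.com',
                  user='administrator@vsphere.local',
                  pwd='VMware1!')
try:
    content = si.RetrieveContent()
    datastores = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Datastore], True)
    for ds in datastores.view:
        if ds.summary.type != 'NFS':          # only interested in NFS mounts
            continue
        state = 'accessible' if ds.summary.accessible else 'INACCESSIBLE'
        print('%-30s %s' % (ds.summary.name, state))
        # per-host mount state, in case only some hosts have lost the mount
        for mount in ds.host:
            print('    %-26s accessible=%s' % (mount.key.name,
                                                mount.mountInfo.accessible))
finally:
    Disconnect(si)
```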
For further information you can see Datacenter Dude Nick Howell’s article – NFS Disconnects in VMware vSphere 5.5 U1 – and keep an eye on the VMware knowledge base article. It is recommended you subscribe to the KB article so that you are made aware of any updates.
[Updated 12/06/2014] There is an updated KB article regarding the NFS disconnection bug – KB 2076392 – and VMware has issued a patch to correct the problem: vSphere 5.5 Patch 04. It is strongly recommended that you apply Patch 04 if you use NFS and wish to leverage the fixes and improvements in vSphere 5.5 Update 1.
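If you manage a lot of hosts, a quick way to see which ones still need the patch is to report each host's ESXi version and build and compare it against the build number listed in the Patch 04 release notes. A rough pyVmomi sketch (connection details are placeholders):

```python
"""Rough sketch: print each host's ESXi version and build so you can confirm
which hosts still need vSphere 5.5 Patch 04 applied."""
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host='vcenter.example.com',
                  user='administrator@vsphere.local',
                  pwd='VMware1!')
try:
    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in hosts.view:
        about = host.config.product     # vim.AboutInfo for the host
        print('%-30s %s (build %s)' % (host.name, about.version, about.build))
finally:
    Disconnect(si)
```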
If you’re using block based storage, i.e. VMFS, and an EMC VMAX, then you might want to read Cormac Hogan’s article with regard to a situation with VMAX, VAAI and the Unmap command.
—
This post appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster. Copyright © 2014 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Interesting: I've noticed dropouts in my lab with iSCSI targets (FreeNAS) since upgrading to 5.5 U1 – not done any real investigating as yet, but it may be related. All was fine on 5.5 GA, and I don't see these dropouts (usually only 3-8 pings worth, but enough to cause concern) when using local storage. Don't have NFS to test with I'm afraid 🙂
Interesting: Probably not related, but when upgrading my small environment I shut down a bunch of servers, including my NTP server, like I always do to conserve memory in my small cluster. Anyway, the NFS started being flaky and not connecting reliably, then finally not connecting at all for a day. Turns out it was a time sync issue, and my hosts were just far enough off to matter. I didn't think NFSv3 would fail just because of time issues, but on ESXi 5.5 it does.
Thanks for the feedback Lee. That is very interesting. I wouldn't have thought NFS would fail due to time issues either.
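For anyone wanting to quickly rule out clock drift across their hosts, something like this rough pyVmomi sketch compares each host's clock with the local machine's UTC time (connection details are placeholders, and it assumes the machine you run it from is itself NTP-synced):

```python
"""Rough sketch: report per-host clock drift relative to this machine's UTC
time. Connection details are placeholders."""
from datetime import datetime
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host='vcenter.example.com',
                  user='administrator@vsphere.local',
                  pwd='VMware1!')
try:
    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in hosts.view:
        # QueryDateTime returns the host's current time in UTC
        host_time = host.configManager.dateTimeSystem.QueryDateTime()
        drift = (host_time.replace(tzinfo=None)
                 - datetime.utcnow()).total_seconds()
        print('%-30s host time %s  drift ~%+.1f s' % (host.name, host_time, drift))
finally:
    Disconnect(si)
```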
FYI, my storage at this point was on SmartOS/ZFS using a global zone NFS mount, but another unrelated hardware NAS unit and a virtual Nexenta NAS also failed to mount NFS on ESXi 5.5 (but iSCSI worked).
I have to say that we have recently updated multiple sites to 5.5 U1 and are using NFS with NetApp at all of them, and we only had this problem at one site. The symptoms were exactly the same as the ones referenced in the KB with NetApp in Nick's article. It only affected attaching to one of our filers, while the second one had no issues. The site that had the issue is a busier site, and even though we are running a version of ONTAP that is bug fixed, when I changed the queue length on the hosts and rebooted, the issues stopped. So even though it is affecting multiple storage vendors, I'm kind of wondering if it could be close to the same issue. Another thing could be that only the affected site has been upgraded to 10GbE and jumbo frames so far.
Thanks for the feedback. I run 10G and Jumbo Frames in my lab, so this is something I can easily test. What value did you change the queue depth to? I'm thinking I might change the Queue Depth down to 256 and then retest in my lab environment to see if it happens again. At the moment I can reproduce it when my system is under load very easily.
I had actually set it down to 64 per the other KB. I haven't had any more problems since, but we will see. I only have NetApp storage though, so had nothing else to test on.
The queue depths issue was lingering from an issue in 5.1/5.5 where they changed the max queue depth from 64 to 4 billion and we weren't ready for that, hence why this has happened before. We're not seeing a relation between queue depths and the current issue. Just wanted to be sure we're not getting our wires crossed between KBs! You should definitely have the NFSQueuesMax setting at 64. Grab the latest version of VSC (4.2.1 or 5.0) and we set these for you on all hosts, if I'm not mistaken.
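For reference, the queue depth value being discussed corresponds to the NFS.MaxQueueDepth advanced setting on each host (the name used in the earlier queue depth KB), and it can be rolled out across hosts with something like the rough pyVmomi sketch below. Connection details are placeholders, a host reboot is needed for the change to take effect, and as noted above this setting is separate from the U1 APD bug itself:

```python
"""Rough sketch: set the NFS.MaxQueueDepth advanced setting to 64 on every
host. On some pyVmomi/Python versions the value may need to be passed as a
long rather than an int."""
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host='vcenter.example.com',
                  user='administrator@vsphere.local',
                  pwd='VMware1!')
try:
    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in hosts.view:
        opt_mgr = host.configManager.advancedOption
        opt_mgr.UpdateOptions(changedValue=[
            vim.option.OptionValue(key='NFS.MaxQueueDepth', value=64)])
        print('Set NFS.MaxQueueDepth=64 on %s (reboot required)' % host.name)
finally:
    Disconnect(si)
```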
I can confirm that I've changed the queue depth and it's made no difference. Under load I just get constant APDs and disconnects. The network stack or NFS client implementation appears broken in 5.5 Update 1. Hopefully it is fixed soon.