I don’t like getting Purple Diagnostic Screens, a.k.a. the Purple Screen of Death, or PSOD for short. Fortunately they are fairly rare. However, I came across a particularly nasty one recently with a customer running vSphere 5.1 U1. The PSOD was caused by TCP heap exhaustion on an ESXi 5.1 U1 host. The host had the recent patches, and the usual search of the knowledge base didn’t turn much up. The customer is running NFS, although the symptoms may not be tied only to NFS; any host-based IP storage protocol (NFS or iSCSI) could be affected. I’ll briefly cover what we have found out, the logs to watch out for, some KBs that will be helpful, and the steps you can take to prevent this from happening.
When the PSOD first occurred we collected all the logs as you usually do. We noticed many log entries just before the PSOD as follows:
cpu35:65532)WARNING: Heap: 3057: Heap_Align(tcpip, 256/256 bytes, 256 align) failed. caller: 0x418ff888cf59
cpu35:75545)WARNING: Heap: 2977: Heap tcpip already at its maximum size. Cannot expand.
cpu3:61689)WARNING: SunRPC: 3583: marshalProc failed for 0xab17fe1f Proc 7
cpu3:41691)WARNING: SunRPC: 4860: SunRPCMarshallRPCData failed: Failure!
These messages were consistent every time there was a PSOD. Yes, there was more than one. At one point there was a semi-cascading failure: once one host went down, another host went down. This was largely due to HA restarting VMs, and those VMs then causing the same problem on another host. Eventually the datastore became unavailable, HA stopped working, and the hosts that hadn’t had a PSOD stopped responding.
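If you want to watch for these warnings before a host falls over, a small script can count them in a vmkernel log. The following is only a minimal sketch: the regular expressions are derived from the four log lines quoted above, and the labels and the function name are my own, not anything from VMware.

```python
import re
from collections import Counter

# Patterns derived from the warnings quoted in this post.
# The labels on the left are hypothetical names chosen for illustration.
PATTERNS = {
    "heap_alloc_failed": re.compile(r"Heap_Align\(tcpip,.*\) failed"),
    "heap_at_max": re.compile(r"Heap tcpip already at its maximum size"),
    "sunrpc_marshal_failed": re.compile(r"SunRPC: \d+: .*failed"),
}


def count_tcpip_heap_warnings(lines):
    """Count WARNING lines that match each pattern above."""
    counts = Counter()
    for line in lines:
        if "WARNING" not in line:
            continue
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# The four log lines from this post, used as sample input.
SAMPLE = [
    "cpu35:65532)WARNING: Heap: 3057: Heap_Align(tcpip, 256/256 bytes, 256 align) failed. caller: 0x418ff888cf59",
    "cpu35:75545)WARNING: Heap: 2977: Heap tcpip already at its maximum size. Cannot expand.",
    "cpu3:61689)WARNING: SunRPC: 3583: marshalProc failed for 0xab17fe1f Proc 7",
    "cpu3:41691)WARNING: SunRPC: 4860: SunRPCMarshallRPCData failed: Failure!",
]

if __name__ == "__main__":
    for label, count in count_tcpip_heap_warnings(SAMPLE).items():
        print(label, count)
```

To run it against a real host, pass the lines of a log file (for example an open file handle on a copy of vmkernel.log) instead of SAMPLE. A non-zero count for the two heap patterns is the signal worth alerting on; the SunRPC failures appeared alongside them in our case but may not on an iSCSI host.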
Before we go any further here are a couple of helpful VMware KB Articles:
Although the KB articles don’t directly relate to our problem, they do mention TCP heap pressure, and there have been documented cases of TCP heap exhaustion and PSODs with NFS in the past. The log messages made it clearer what was happening, and VMware Support agreed with our conclusions.
Recommendations to address these problems:
Upgrade to vSphere 5.5 and set the advanced parameter Net.TcpipHeapMax = 512MB, as 128MB is the maximum in vSphere 5.1. (I also have my NFS Send and Receive Buffers = 512KB.)
Decrease NFS.MaxQueueDepth to 128 (Common NFS Slot Size Max). In this case it didn’t make any difference.
Decrease the number of VMs accessing the same NFS datastore on the same host. (90 VMs on the host were pounding the same NFS datastore under IO stress tests. We have yet to prove this as a contributing factor, but it was the only significant thing that had changed from when the environment was stable.) This is not applicable if you’ve followed at least the first suggestion.
If you must stay on vSphere 5.1, make sure you’re on the latest patches. vSphere 5.1 Update 3 has just been released and the release notes are available here.
[Updated 14/10/2014] Use LSI SAS or LSI Parallel instead of PVSCSI. The latest testing has shown that the problem does not occur unless PVSCSI is being used. This will be raised with VMware support to investigate further.
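For the two advanced parameters named above, the changes can be scripted rather than clicked through. Here is a minimal sketch that only builds the command lines and does not run anything. It assumes the esxcli "system settings advanced set" syntax and assumes that Net.TcpipHeapMax and NFS.MaxQueueDepth map to the option paths /Net/TcpipHeapMax and /NFS/MaxQueueDepth; check both against your own build before using them. The helper name is hypothetical.

```python
# Values taken from the recommendations in this post.
# The option paths are an assumed mapping of the parameter names.
SETTINGS = {
    "/Net/TcpipHeapMax": 512,   # MB, vSphere 5.5 only per the text above
    "/NFS/MaxQueueDepth": 128,
}


def build_esxcli_commands(settings):
    """Return one esxcli argument list per advanced setting (nothing is executed)."""
    return [
        ["esxcli", "system", "settings", "advanced", "set",
         "-o", option, "-i", str(value)]
        for option, value in settings.items()
    ]


if __name__ == "__main__":
    for command in build_esxcli_commands(SETTINGS):
        print(" ".join(command))
```

Printing the commands first gives you something to review and paste into a change record. As far as I know, a change to the TCP/IP heap size only takes effect after a host reboot, so plan for a maintenance window.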
There are apparently a number of fixes coming in vSphere 5.1 Update 3, and this issue may be addressed when that update is released. For now, prevention is the best cure. Hopefully you haven’t suffered this, and if on the rare outside chance you do, hopefully the above suggestions will help. There are still a number of factors that we need to narrow down further. We know the TCP heap was exhausted and that this caused the PSOD. What we don’t yet know for certain is why the TCP heap was exhausted. If we are able to get to a final root cause and resolution I will update this article. In the meantime, your comments and feedback are welcome.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster. Copyright © 2012 – 2014 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.