There is (was) a problem in vSphere 5 where, after a VM connected to a vDS has been svMotioned, HA is unable to restart that VM in case of a failure. This has implications for anyone using svMotion and especially for anyone making use of Storage DRS. If you are currently using vSphere 5 or are planning to use vSphere 5, you need to know about this problem and how to work around it. Now that this bug has been fixed in vCenter 5.0 U1a, you should upgrade as soon as possible.
Updated 15/07/2012: Good news, the problem described below is now fixed. vCenter 5.0 U1a contains the fix and you should upgrade to it. Check out the vCenter 5.0 U1a Release Notes.
Duncan Epping recently published an article on his blog titled “HA fails to initiate restart when a VM is SvMotioned and on a VDS!” about a problem with HA failing to restart a VM if that VM was deployed on a vDS and had been Storage vMotioned to another datastore. Duncan has followed this up with another post, “Clarifying the SvMotion / VDS problem”. This problem affects vSphere 5 environments, including Update 1. It’s a known problem and VMware has released KB 2013639 – HA/FDM fails to restart a virtual machine with the error: Failed to open file /vmfs/volumes/UUID/.dvsData/ID/100 Status (bad0003)= Not found. You should read Duncan’s blog and the associated KB and be very careful when using svMotion with vSphere 5.
The VMware KB article recommends the following workarounds:
- Disable Storage DRS
- Do not perform storage vMotion
I have an alternative workaround approach to that recommended by VMware:
- Still use Storage DRS, but only in Manual Mode.
- Implement an operational process to monitor the environment for SDRS recommendations.
- When SDRS makes a recommendation, evaluate it and consider whether it is necessary to apply.
- When you apply SDRS recommendations or execute Storage vMotion operations, implement a post-migration process as follows:
  - Connect the VM(s) to a vSS port group or another port group on the vDS.
  - Connect the VM(s) back to the original port group on the vDS.
I have tested the above process in my lab and reproduced the problem before applying the solution. The entire process above could be scripted. If you go with the vSS option it does require a vSS be available and have an uplink with all necessary VLANs present. I would recommend you use the option of connecting to another vDS port group instead.
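As a rough illustration of how the post-migration step might be scripted, the sketch below captures only the sequencing: move each migrated VM to a temporary port group, then move it back to the port group it started on. The `reconnect` callable, the `bounce_port_group` name and the port group names are all hypothetical placeholders for whatever tooling you use to change a VM's network (PowerCLI, the vSphere API and so on); nothing here calls a real VMware API.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical hook: reconnect(vm_name, port_group_name) should change the
# VM's network adapter to the named port group using your own tooling.
Reconnect = Callable[[str, str], None]


def bounce_port_group(
    vms: Iterable[Tuple[str, str]],
    temp_port_group: str,
    reconnect: Reconnect,
) -> List[str]:
    """Move each VM to a temporary port group and then back again.

    `vms` holds (vm_name, original_port_group) pairs for the VMs that
    were just Storage vMotioned. Returns the names of the VMs processed.
    """
    processed = []
    for vm_name, original_port_group in vms:
        if original_port_group == temp_port_group:
            # Moving to the same port group would not change anything.
            raise ValueError(
                f"{vm_name}: temporary port group must differ from the original"
            )
        reconnect(vm_name, temp_port_group)      # step 1: move away
        reconnect(vm_name, original_port_group)  # step 2: move back
        processed.append(vm_name)
    return processed


if __name__ == "__main__":
    # Dry run that only records what would be done.
    planned = []
    bounce_port_group(
        [("vm01", "dvPG-Prod")],
        "dvPG-Temp",
        lambda vm, pg: planned.append((vm, pg)),
    )
    for vm, pg in planned:
        print(f"would connect {vm} to {pg}")
```

Passing the reconnect action in as a callable keeps the ordering logic testable without a vCenter, and lets you start with a dry run before wiring in a real implementation.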
Using SDRS in Manual Mode is what I have been recommending in most cases (dependent on customer requirements) even before this problem came to light. In my experience, most customers are not yet comfortable with letting SDRS move VMs between datastores fully automatically, but they do use it heavily for initial placement and to help with capacity monitoring. I understand VMware has a hotfix available for this problem if you log a Support Request, and should hopefully have a public patch available shortly.
Update: A number of people have asked me offline if this impacts customers running the Cisco Nexus 1000v vDS. The answer is yes, this is known to impact customers running the Cisco Nexus 1000v vDS as well as the normal VMware vDS. This is due to the underlying cause: the .dvsData information is not created on the destination datastore when the Storage vMotion completes.
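To make the underlying cause concrete, here is a minimal sketch of the check it implies: given a datastore root, does the port file under `.dvsData` exist? The path layout is taken from the KB error message (`/vmfs/volumes/UUID/.dvsData/ID/100`); I am assuming the `ID` component identifies the vDS and the final component is the port number. The function names and the demo values are hypothetical, and the code only inspects a directory tree it is pointed at.

```python
import tempfile
from pathlib import Path


def dvs_port_file(datastore_root: str, switch_id: str, port_id: str) -> Path:
    """Build the expected port file path: <root>/.dvsData/<switch>/<port>."""
    return Path(datastore_root) / ".dvsData" / switch_id / port_id


def is_port_file_missing(datastore_root: str, switch_id: str, port_id: str) -> bool:
    """Return True if the port file is absent from the given datastore."""
    return not dvs_port_file(datastore_root, switch_id, port_id).is_file()


if __name__ == "__main__":
    # Demo against a throwaway directory standing in for a datastore root.
    with tempfile.TemporaryDirectory() as root:
        switch_id, port_id = "example-switch-id", "100"  # hypothetical values
        print("missing before:", is_port_file_missing(root, switch_id, port_id))

        # Simulate the file being present on the datastore.
        port_file = dvs_port_file(root, switch_id, port_id)
        port_file.parent.mkdir(parents=True)
        port_file.touch()
        print("missing after:", is_port_file_missing(root, switch_id, port_id))
```

In practice you would run a check like this against the datastore that holds the VM after the Storage vMotion, using the switch and port values of that VM's network adapter.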
William Lam and Alan Renouf have each developed a script to detect this issue.
Further information is also available on Duncan Epping’s blog here.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster. Copyright © 2012 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.