I’ve started to see reports recently of I/O errors when running very high I/O workloads on Windows 2008 and Windows 2008 R2 VMs. Mostly this was during artificial benchmark tests run against MS SQL, and against Exchange 2010 with Jetstress. However, it could also impact production workloads. Upon further investigation, it appears these I/O errors are caused by a known defect in a certain version of the PVSCSI driver that comes with VMware Tools, and the defect can affect vSphere 4.0 U1, 4.1 and 5.0. Here I’ll cover more about this potentially serious issue and how to fix it.
This problem is described in VMware KB 2004578 – Windows 2008 R2 virtual machine using a Paravirtual SCSI adapter reports the error: Operating system error 1117 encountered. That article lists the versions of the PVSCSI driver that are impacted and links to the fix. Microsoft has also published a knowledge base article on this; refer to MS KB 2519834 – SQL Server reports “Operating system error 1117 (I/O Device Error)” on VMware ESX environments that are configured to use PVSCSI adapters.
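If you want to check whether a server has already hit this error, the message text quoted in the KB titles can be searched for in an exported log. The following is a minimal Python sketch, not a supported tool: the function name and the pattern are my own assumptions, and the exact wording in your logs may differ from the KB titles.

```python
import re

# Assumed pattern, based only on the message text quoted in the KB titles.
# Real log lines may be worded differently, so adjust as needed.
ERROR_1117 = re.compile(r"operating system error 1117", re.IGNORECASE)


def find_error_1117(lines):
    """Return (line_number, text) pairs for lines that mention OS error 1117."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if ERROR_1117.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

You can pass it any iterable of text lines, such as an open file handle for an exported SQL Server or Windows event log. Depending on how the log was saved, you may need to specify the file encoding when opening it.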
Although the referenced KB articles describe a situation with SQL, it is possible for this to happen under any high I/O workload on the impacted versions of Windows, including for example Exchange. The information available right now doesn’t mention Windows 7 VMs, which are generally not subjected to the same high I/O workloads as Exchange and SQL servers. Even so, the PVSCSI driver in Win7 is still affected by this problem and should be updated. VDI desktops could also be re-provisioned if they experienced this issue.
What makes this issue potentially serious is that in the worst case, which is rare, it could lead to data corruption. This makes it very important that you upgrade or patch your vSphere environment to address this defect. With ESXi 5.0 the patch is included with Update 1. For ESXi 4.1 you should deploy Patch 04, described in VMware KB 2009144 – VMware ESXi 4.1 Patch ESXi410-201201402-BG: Updates VMware Tools.
Once the ESXi patch process is complete, you will need to update VMware Tools (system restart required) on both the affected VMs and your templates. You should then confirm the PVSCSI StorPort device driver version has been updated to 188.8.131.52 or later. For more information about the VMware Paravirtual SCSI adapter and supported VMs, refer to VMware KB 1010398 – Configuring disks to use VMware Paravirtual SCSI (PVSCSI) adapters.
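If you are checking many VMs, the "this version or later" comparison is worth scripting, because comparing dotted version strings as plain text gives wrong answers (for example, "1.10" sorts before "1.9"). Here is a minimal Python sketch, assuming you have already collected the installed driver version from each guest as a dotted string. The function names are hypothetical, and the minimum fixed version should be taken from the VMware KB rather than from this example.

```python
def parse_version(text):
    """Split a dotted version string such as '1.2.3.4' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_fixed(installed, minimum_fixed):
    """Return True when the installed version is at or above the fixed one."""
    have = parse_version(installed)
    need = parse_version(minimum_fixed)
    # Pad the shorter tuple with zeros so '2.0' and '2.0.0.0' compare equal.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    # Tuples compare field by field, numerically, from left to right.
    return have >= need
```

Any VM for which `driver_is_fixed` returns False still needs its VMware Tools updated.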
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster. Copyright © 2012 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Thank you for the helpful post! I'm looking into standardizing a Windows Server 2012 Datacenter template and was curious about this adapter. I may steer clear of this for now. Data integrity comes before performance any day of the week in my book, and I would assume this might not be the only risk from using the PVSCSI adapter.
I agree data integrity comes before performance. In this case the bug has been fixed. I've tested the fix, and I have a number of customers who have deployed PVSCSI successfully without any issues. Given that Windows Server 2012 isn't supported on any version of vSphere that has this bug, you're pretty safe to use it. But as with everything, you should do some testing yourself and also ensure the benefits are what you're expecting. The results I posted on my Fusion IO testing were with Windows 2008 R2 using the PVSCSI driver. Without PVSCSI the latency was about 3x higher, but we are talking about something that was 300 microseconds becoming 1 millisecond. This bug was definitely a surprise from left field, as I'd used PVSCSI quite a bit in the work I do with business critical applications. Fortunately it wasn't a problem for long and VMware got the fix out. I didn't hear of any actual instances of data corruption in production environments caused by this bug either, as it was picked up during customer project test phases.
Actually, we discovered this bug back in June of 2011 and we were one of the original cases reporting it. VMware stated at the time that it was a problem with our configuration, but we proved it was PVSCSI by running Jetstress against a VM on SSD disks and generating the errors; when we switched to LSI Logic SAS, the errors went away. We encountered this issue in production with our Exchange 2007 VMs. Furthermore, the likelihood of the issue increased when we added more than 7 VMDKs to an individual PVSCSI adapter. We ended up using 1 LSI Logic adapter for C: (due to support statements by Microsoft at the time) and 3 PVSCSI adapters with all of our VMDKs spread across them, but that only reduced the frequency of the problem, not the problem itself. We rolled back to LSI on our Exchange servers and haven't switched back yet, but we do have a large number of SQL servers and indexing crawlers using PVSCSI with I/O loads in excess of 2500 IOPS on SSD (as seen by the guest VM) without any issue. We are now comfortable with the PVSCSI adapter, and at our next maintenance window we will be moving our Exchange servers back to it.
Our Oracle database environment (4 instances) runs on a Windows 2008 R2 VMware virtual machine. Data is stored on an HP EVA SAN.
For the past two weeks, one or more Oracle instances have been terminating every other day. At first it seemed to be related to a disk issue. I did some research and found articles suggesting that the VMware Paravirtual adapter could be causing the issue.
We changed the SCSI controllers back from Paravirtual to LSI Logic SAS for the disks that contain the Oracle datafiles, logfiles, controlfiles etc., but not for the OS disk.
However, we are still experiencing the terminations.
Should we also change the controller of the disk where Windows is installed?
What version of vSphere are you running? Not all versions have this problem. In fact only one build has this problem. So if you have the most recent patches it will likely be another cause. Have you looked in the VM and host logs and have you logged a support request with VMware Support?
We don't have the most recent patches. The article says we have to go to 4.1, but then we would have to patch our SAN as well. That's why we chose to change back to LSI Logic SAS instead. It might not be enough.
ESX 4.0.0, 261974
To rule out the vSCSI adapters, I'd recommend you change them all back to LSI Logic SAS. But I think it's likely you're not being impacted by this specific problem and that your processes are being terminated for another reason. Think about when the problems first started and what might have been changing at the time. Given that VMware supports the full stack, including the Oracle components, you should log a support request with VMware Support.