Some of you may be shocked to learn that, by default, many disk devices in Windows have disk write caching enabled (the "Better Performance" policy). This can cause data integrity issues on a sudden loss of power or the sudden removal of the device, as most devices are hot plug. Most well-written applications should not be impacted, because they use Forced Unit Access (FUA) to bypass any write cache and make the operating system flush the IOs to stable storage before acknowledging them back to the application as committed. But experience over the last few years has shown that this can be inconsistently implemented. In my opinion, no amount of data loss is acceptable in exchange for performance, and IO should always be persistent by the time it is acknowledged back to an operating system or application. With this in mind I decided to do some testing to find out the impact of setting the disk devices to Quick Removal, thereby disabling the default write cache.
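The application-side discipline described above has a rough cross-platform analogue that is easy to demonstrate: explicitly flushing after a write so the call only returns once the data has been pushed toward stable storage. A minimal sketch in plain Python (this uses `os.fsync` rather than the Windows FUA / write-through flags themselves, and the function name is my own):

```python
import os
import tempfile

def durable_write(path, data: bytes) -> None:
    """Write data and force it toward stable storage before returning.

    os.fsync() asks the OS to flush its buffers for the file and, where
    the driver honours it, the device write cache -- the same goal an
    application pursues with FUA / write-through flags on Windows.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.write(fd, data)
        os.fsync(fd)  # do not return until the flush completes
    finally:
        os.close(fd)

# Usage: the write is only acknowledged after the flush completes.
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "journal.bin")
    durable_write(p, b"committed record")
    with open(p, "rb") as f:
        assert f.read() == b"committed record"
```

The point is that a well-written application does not trust an acknowledged write until the flush has completed; an application that skips this step is the one exposed by a device-level write cache.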
Firstly, if you don't know how to check or change the disk removal / disk cache policy, there are two places to do it. You can use diskmgmt.msc, open a disk's properties, and go to the Hardware tab, which lets you view all disks on a system and change their properties and policy there, or you can go into Device Manager and do the same thing. Here are examples of these two options:
Here is what the disk policy should look like when the disk write cache is disabled and the Quick Removal policy is selected.
If you have a lot of VMs, or a lot of SQL Server databases or Exchange systems that each have a lot of disks, changing the policy could be quite a major task. Fortunately, Microsoft provides a PowerShell script to change the policy on multiple disks and multiple remote servers at the same time. Take a look at Change write caching policy (enable or disable) for multiple disks remotely on TechNet.
Now for something I found a bit puzzling: when I turned off the disk write cache, i.e. set the policy to Quick Removal, performance improved. My theory as to why is that skipping the cache lookup on reads saves some time, and not having to store and constantly overwrite the write cache saves some more, so the IO path is more efficient when Quick Removal is selected. If anyone knows for sure, I would be interested to hear. As for the performance results, here are some tests I ran on an all-flash setup. The configuration had 8 VMs, each on its own host, using Nutanix NX9060 All Flash nodes (6 x SATA-SSD per node, 4RU in total).
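For the tests I used a synthetic IO generator. As a hedged illustration of the kind of comparison involved (this is a minimal Python sketch of my own, not the tool or methodology used for the results below), here is how you might time buffered writes against writes that are forced out with a flush after every IO:

```python
import os
import time
import tempfile

def timed_writes(path, count=200, size=8192, flush=False):
    """Return the seconds taken to issue `count` writes of `size` bytes.

    With flush=True, every write is followed by os.fsync(), forcing each
    IO toward stable storage -- a rough stand-in for running with the
    device write cache disabled.
    """
    buf = os.urandom(size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    start = time.perf_counter()
    try:
        for _ in range(count):
            os.write(fd, buf)
            if flush:
                os.fsync(fd)
    finally:
        os.close(fd)
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as d:
    cached = timed_writes(os.path.join(d, "a.bin"))
    flushed = timed_writes(os.path.join(d, "b.bin"), flush=True)
    print(f"buffered: {cached:.4f}s  fsync-per-write: {flushed:.4f}s")
```

On most systems the flushed variant is noticeably slower, which is exactly the cost a write cache normally hides; the interesting result in my testing was that disabling the Windows device write cache did not carry that penalty.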
8KB IO Size, 64 Outstanding IO, 100% Random Write CacheOn vs CacheOff
8KB IO Size, 64 Outstanding IO, 100% Random Read CacheOn vs CacheOff
For those who just like big numbers for throughput and IOPS, here is a similar test with the cache off, a 32KB IO size, and 100% random reads on the same 8 nodes.
18GB/s total or just over 2.25GB/s per VM, in just 4RU, without any NVMe.
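As a quick sanity check of those headline numbers (assuming decimal gigabytes for the throughput figure and a 32KiB IO size, which are my assumptions about the units):

```python
# Back-of-the-envelope check of the headline figures.
total_gb_s = 18          # aggregate read throughput from the 32KB test
vms = 8                  # one test VM per node
io_size = 32 * 1024      # 32KB IO size in bytes (assumed 32 KiB)

per_vm_gb_s = total_gb_s / vms          # throughput per VM
iops = total_gb_s * 1e9 / io_size       # implied aggregate IOPS

print(f"per VM: {per_vm_gb_s:.2f} GB/s, aggregate IOPS: {iops:,.0f}")
```

That works out to 2.25 GB/s per VM and roughly 550K aggregate IOPS at the 32KB IO size, consistent with the figures above.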
Changing the default disk policy in Windows to Quick Removal, which disables the Windows disk write cache, is a good idea not just for data integrity but also for performance. Even though the tests I ran are not application-realistic, as they used a synthetic IO generator tool, they were consistent and show the difference between the two settings that I was interested in measuring. The change of disk policy was the only change between the various test runs, of which there were many. Your mileage in your environment may vary, but in my book it is never acceptable to have any chance of data loss or data corruption, and changing the Windows disk policy to Quick Removal just makes sense.

During the read tests no network bandwidth was required, as the Nutanix Data Locality feature keeps all reads local to the server where the application / test VM is running. This allows the platform to scale linearly and to delay investing in higher spec network equipment, even when adding a lot more flash or taking advantage of newer flash technologies in the future. NVMe was not used in any of these tests; when I have some test systems available with NVMe I will rerun some of these tests and share the results. This will save you having to do all the hard work yourselves.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster +. Copyright © 2012 – 2016 – IT Solutions 2000 Ltd and Michael Webster +. All rights reserved. Not to be reproduced for commercial purposes without written permission.