This article is relevant to all virtualization platforms, including traditional Unix virtualization. The reason is that 80% of problems in virtualized and consolidated environments are caused by, or at least partly contributed to by, storage. Storage is the Achilles heel of virtualized and consolidated environments. If you don’t have storage access, nothing else matters, and if your data gets corrupted and can’t be recovered, nothing else matters: your servers are kaput. Hopefully these tips will help you avoid some of the common traps and issues I see in virtualized environments and result in a better overall user experience.
Before I get started I want to give you a suggestion that will help save you countless hours troubleshooting problems if you’re using VMware vSphere 5.x. (Blatant Plug Follows and no I don’t get royalties or commission from this) If you really want to understand storage for a VMware vSphere 5.x environment you should Buy Mostafa’s book – Storage Implementation in vSphere 5.0. I highly recommend this book as an essential tool for all vSphere administrators and architects. Every VMware customer and partner (all 480,000 of you) should have a copy of this book. If you don’t buy it Mostafa might not write another one, and that would be a terrible shame as I want to read his book about the next major release of vSphere. So please support Mostafa Khalil (VCDX-002) and VMware Press by getting a copy of this book.
Tip 1 – Queues are everywhere; understand how they impact performance, especially when multiple VMs are deployed on each LUN
Queues exist at every layer of the IO subsystem, from the Guest OS through the storage network and the storage processors to the storage devices or spindles that store the data. Chad Sakac wrote a great article a while ago that is very relevant to the whole queue discussion. I recommend everyone read it – VMware I/O queues, “micro bursting” and multipathing. The optimization of queues becomes even more important when you are virtualizing business critical applications. Most high IO database systems or high IO applications will, when deployed natively on a server, have many LUNs presented to them. This is not by accident. It is done to increase the number of IO queues available to the OS, thereby increasing the parallelism of IOs to the underlying storage. Assuming the workload actually needs all of those queues, if you virtualize it with the same number of virtual storage devices presented to the guest, but with only a single underlying LUN, you will quickly find that performance is well below acceptable levels. This happens even though the underlying physical storage is not stressed and physical service times are within acceptable ranges, because the IOs are waiting in the single LUN queue rather than at the array.
Having far fewer underlying LUNs is one of the great benefits of virtualization, however, so you want to reduce the number of LUNs to increase manageability, but without sacrificing performance. If you are going to virtualize workloads that require a high degree of IO parallelism, you will need to look at modifying the queue depths in your storage IO path to ensure that the queues can handle the peak required level of parallel IOs. But be careful: setting queue depths too high can have negative consequences (QFULL conditions on your storage). In most cases, if you are not increasing the total number of queues from the hosts through to the storage, and you discuss any changes with your storage team and storage vendor, you should be OK (take all necessary care). An example from a recent customer discussion: we decided to double the queue depth of the HBAs from 32 to 64 (32 was the default at the time, based on vSphere 4.1; see KB 1267 for details across different vSphere versions) and put two virtual disks on each underlying LUN, while keeping the overall number of virtual SCSI devices the same as the system had when it was deployed natively. With two virtual disks on each datastore, and each virtual disk having a queue depth of 32 in the guest OS (LSI Logic), there was overall a 1:1 ratio of virtual disk queue slots in the guest OS to queue slots on the datastore, down to the underlying storage. If your applications or VMs aren’t going to use all the queues at the same time, you can overcommit your queues by having more virtual disks per datastore. Just don’t push it too far with high performance VMs that actually do need a high degree of IO parallelism. Note that the default queue depth for PVSCSI is 64 per device and 255 per adapter; these can be increased to 255 and 1024 respectively, see KB 2053145.
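To see why a single queue can cap performance even when the array itself is idle, a rough sketch using Little's Law (outstanding IOs = IOPS x service time) may help. The LUN count of 8, the queue depth of 32 and the 5 ms service time are illustrative assumptions, not measurements from any real system.

```python
def max_iops(queue_depth: int, service_time_ms: float) -> float:
    """Upper bound on IOPS through one queue, derived from Little's Law."""
    if queue_depth <= 0 or service_time_ms <= 0:
        raise ValueError("queue_depth and service_time_ms must be positive")
    # At most queue_depth IOs can be in flight, each taking service_time_ms.
    return queue_depth * 1000.0 / service_time_ms


# Hypothetical native server: 8 LUNs, each with its own queue of 32 slots.
native_ceiling = 8 * max_iops(32, 5.0)

# Same workload virtualized onto one LUN with a single queue of 32 slots.
virtual_ceiling = 1 * max_iops(32, 5.0)

print(f"native ceiling:  {native_ceiling:.0f} IOPS")
print(f"virtual ceiling: {virtual_ceiling:.0f} IOPS")
```

With these assumed numbers the single LUN tops out at 6,400 IOPS against 51,200 for the eight native LUNs, at exactly the same service time. Once the workload asks for more than that ceiling, the extra IOs simply wait in the queue.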
The underlying reason to modify the queue depth per device at the hypervisor level is that you can then put multiple virtual disks on the datastore, each of which has its own queue in the Guest OS, to make use of the additional queue slots. The Guest OS may not have the ability to modify the default SCSI queue depth per device (usually 32), so by increasing the number of virtual SCSI devices (virtual disks) the Guest OS has access to, you increase the parallelism of IOs the guest can issue without increasing the number of underlying LUNs. The real trick is balancing the various queue depths against the performance and size of the LUNs to find the sweet spot.
Where you have multiple virtual disks and multiple virtual machines per VMFS datastore, I strongly recommend that you enable Storage I/O Control (if you have Enterprise Plus licenses). This will protect you against noisy neighbour VMs and ensure that each VM gets its fair share of the IO queue slots available.
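The balancing act can be sketched as a simple ratio of guest-side queue slots to the queue slots available on the underlying LUN. This is a deliberate simplification: it assumes a single host and ignores array-side port and LUN limits, and the function name is just an illustration.

```python
def queue_overcommit_ratio(vdisks_per_datastore: int,
                           guest_queue_depth: int,
                           lun_queue_depth: int) -> float:
    """Guest queue slots divided by the LUN queue slots backing them."""
    if lun_queue_depth <= 0:
        raise ValueError("lun_queue_depth must be positive")
    return vdisks_per_datastore * guest_queue_depth / lun_queue_depth


# Two virtual disks at queue depth 32 on a LUN with the default depth of 32.
before = queue_overcommit_ratio(2, 32, 32)

# Same layout after doubling the HBA/LUN queue depth to 64.
after = queue_overcommit_ratio(2, 32, 64)

print(f"before: {before:.1f}:1, after: {after:.1f}:1")
```

A ratio of 1.0 means every guest queue slot has a matching slot on the LUN. Anything above 1.0 is overcommitted, which is fine as long as the virtual disks do not all need their full queue at the same time.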
Tip 2 – Know the limits of the hypervisor technology that you’re working with and stay within them
I wrote an article a while ago titled The Case for Larger Than 2TB Virtual Disks and The Gotcha with VMFS about some of the limits within VMware vSphere, some of which are documented in KBs but not in the product documentation. Whatever hypervisor technology you’re working with, you need to understand the limits of its storage stack and stay a safe distance from them in your environments. As soon as you exceed the limits you are bound to run into trouble. This has changed significantly with vSphere 5.5, so you might want to check out my article vSphere 5.5 Jumbo VMDK Deep Dive.
Tip 3 – Size your virtual machine storage not just for performance but also for your RTO and RPO
Size isn’t everything. But making your datastores or virtual machine storage too big can have a big impact on the time it takes to recover a datastore in a disaster, and on the type of protection mechanisms you can use to reduce data loss. You should understand what your recovery time objectives (RTO – how long it takes you to recover) and recovery point objectives (RPO – how much data loss is acceptable) are, and ensure your datastores, your virtual disks and your overall protection strategies match those requirements. You might be able to create a 64TB datastore and put hundreds of VMs on it, but the risk associated with that strategy and the time it would take to protect and recover it might not meet your requirements. When thinking about backup and IO performance, make sure that your sizing calculations take into consideration the amount of IO required for your backups. Backups are a BAU operation and their IO needs to be included in your workload models.
When using auto tiering arrays and array replication, you may find that when you do a failover or failover test your performance suffers and is vastly different from your protected or primary site. This is often because the replication technology doesn’t produce the same IO workload pattern on your recovery site, and therefore the array doesn’t put all of the blocks in the correct tier of storage immediately. This can cause hours of performance issues if you ever need to recover for real, while the array sorts out the different storage tiers. Some new smart replication technologies work with the auto tiering of the arrays and keep the blocks in the correct storage type at both the protected and recovery sites. This is something you should ask your storage vendors about when considering these types of solutions.
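The recovery time side of datastore sizing comes down to simple arithmetic, sketched below. The 500 MB/s sustained restore throughput and the 4 hour RTO are purely hypothetical figures; plug in what your own backup and restore path can actually sustain.

```python
MB_PER_TB = 1024 * 1024


def restore_hours(datastore_tb: float, restore_mb_per_s: float) -> float:
    """Hours needed to restore a full datastore at a sustained throughput."""
    if restore_mb_per_s <= 0:
        raise ValueError("restore_mb_per_s must be positive")
    return datastore_tb * MB_PER_TB / restore_mb_per_s / 3600


def max_datastore_tb(rto_hours: float, restore_mb_per_s: float) -> float:
    """Largest full datastore that can be restored within the RTO."""
    return rto_hours * 3600 * restore_mb_per_s / MB_PER_TB


# Hypothetical: a full 64TB datastore restored at 500 MB/s.
print(f"64TB restore: {restore_hours(64, 500):.1f} hours")

# Hypothetical: largest datastore that fits a 4 hour RTO at 500 MB/s.
print(f"4h RTO limit: {max_datastore_tb(4, 500):.2f} TB")
```

Under these assumptions the 64TB datastore takes roughly a day and a half to restore, while a 4 hour RTO limits a full datastore to a little under 7TB. Real restores rarely sustain their peak throughput, so treat the result as a best case.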
Tip 4 – Beware of the blended peak IO workloads and concurrent IO events
One of the biggest potential impacts of moving many servers to a shared storage environment is the effect that concurrent peak workloads can have on the servers and applications, especially during events such as virus definition updates, backups and database replication activities, or if someone decides to kick off a defrag or sync operation across hundreds of systems. These types of concurrent peak IO workloads can completely kill the performance of your storage and, in the process, the response time of your applications. In very bad cases your whole environment could appear to freeze up. Fortunately, if you’re running a smart hypervisor like VMware vSphere (4.1 or above), you can use Storage IO Control to provide a quality of service mechanism that reduces the impact of these situations. You might also be able to provide QoS from your arrays. But the best solution is to avoid these types of problems as much as possible, or to include them in the planning and have sufficient IO performance capacity to deal with them. For the virus definition update problem you could leverage the new smart hypervisor integrated security products that eliminate the concurrent update problem altogether. As far as defrag goes, don’t do it. Your data is fragmented all over the SAN already. The IO requirements of the defrag operations are in most cases far worse than any gain you’re likely to get from the operation.
Tip 5 – Even sequential IO will appear random, so align your disks and plan for random, write biased IO workload profiles
When you place multiple virtual disks on a shared LUN or datastore, even sequential IO from each of the virtual disks will appear completely random to the underlying storage. If you size your VMs correctly from a memory point of view, you’ll also notice that for a lot of workloads there isn’t much read IO and that your IO patterns are very write biased. This is due to the way most modern OSs use memory as filesystem IO cache when it’s available. Completely random, write biased IO workload patterns will generally require higher performance storage to satisfy them while keeping IO latency within acceptable levels (<10ms service times). Some RAID types (RAID5) carry a particularly high penalty for write IO operations. In many cases it could be cheaper to use RAID10 or another data protection mechanism (RAID DP etc) that has less of a write IO penalty, even if the capacity efficiency is lower, as the performance efficiency is much higher. If you have high performance requirements and you plan for performance, the capacity will generally take care of itself.
You need to make sure that you align your disks and your partitions within your guest OS so that you don’t suffer performance degradation due to split IOs. Split IOs occur when the OS reads or writes a block but, because of where the block sits on the underlying storage, multiple IO operations are actually needed to service it. In very bad cases this can reduce your performance by 50% and have a big negative impact on response times. Partitions on Windows versions prior to 2008 need to be manually aligned, and so do those on most Linux and Unix operating systems. If you think of this in storage investment terms, it’s like spending twice as much to get the performance you require, or getting 50% utilization from your storage performance investments.
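One way to include these events in your planning is to add up per-VM IO profiles by time slot and look at the combined peak, not the individual averages. The sketch below does that with made-up numbers: the VM names, the hourly buckets and the IOPS figures are all hypothetical, with a simulated virus definition update hitting every VM at 02:00.

```python
from collections import defaultdict


def blended_peak(profiles: dict[str, dict[int, int]]) -> tuple[int, int]:
    """Return (hour, total IOPS) for the busiest hour across all VMs."""
    if not profiles:
        raise ValueError("profiles must not be empty")
    totals: dict[int, int] = defaultdict(int)
    for hourly in profiles.values():
        for hour, iops in hourly.items():
            totals[hour] += iops  # sum concurrent demand per time slot
    peak_hour = max(totals, key=lambda h: totals[h])
    return peak_hour, totals[peak_hour]


# Hypothetical IOPS per VM by hour of day; everything updates at 02:00.
profiles = {
    "vm-a": {1: 200, 2: 1500, 3: 200},
    "vm-b": {1: 300, 2: 1500, 3: 250},
    "vm-c": {1: 100, 2: 1500, 3: 900},
}

hour, iops = blended_peak(profiles)
print(f"blended peak: {iops} IOPS at {hour:02d}:00")
```

In this made-up data the three VMs together need 4,500 IOPS at 02:00, several times what they need an hour earlier. That combined figure is the one the shared storage has to be sized for, or the one you remove by staggering the event.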
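Both points can be put into numbers with a short sketch. The write penalties used here (2 for RAID10, 4 for RAID5, 6 for RAID6) are the commonly quoted rules of thumb; real arrays with write caching or full stripe writes will behave differently. The 64 KiB alignment boundary is an assumption, so check what your array actually uses.

```python
# Rule-of-thumb back end IOs generated per front end write.
WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}


def backend_iops(frontend_iops: float, write_pct: float, raid: str) -> float:
    """Back end IOPS the disks must deliver for a given front end load."""
    writes = frontend_iops * write_pct / 100
    reads = frontend_iops - writes
    return reads + writes * WRITE_PENALTY[raid]


def is_aligned(partition_offset_bytes: int, boundary_bytes: int = 64 * 1024) -> bool:
    """True if the partition starts on a multiple of the assumed boundary."""
    return partition_offset_bytes % boundary_bytes == 0


# Hypothetical workload: 10,000 front end IOPS, 70% writes.
for level in ("RAID5", "RAID10"):
    print(f"{level}: {backend_iops(10_000, 70, level):.0f} back end IOPS")

# Legacy MBR default start at sector 63 versus a 1 MiB offset.
print("sector 63 aligned:", is_aligned(63 * 512))
print("1 MiB aligned:    ", is_aligned(1024 * 1024))
```

For this assumed workload RAID5 needs 31,000 back end IOPS where RAID10 needs 17,000, which is why the less capacity efficient option can end up cheaper. The alignment check flags the old 63 sector partition start as misaligned, while the 1 MiB offset used by default from Windows Server 2008 onwards passes.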
I wrote a previous article titled Storage Sizing Considerations when Virtualizing Business Critical Applications that covers some of these topics in more detail. I would highly recommend you review this as well.
Again I would like to repeat that if you really want to understand storage for a VMware vSphere 5.x environment you should Buy Mostafa’s book – Storage Implementation in vSphere 5.0. This book is an essential tool for all vSphere administrators and architects.
I hope this has been of some help and that you can avoid some of the common pitfalls related to storage that can have a dramatic impact on your virtualization environments. I’m not 100% sure if storage is 80% of virtualization problems or if it’s more than that. In any case, I’ve heard that 95% of statistics are made up. In my experience, storage related issues cause the vast majority of preventable problems in virtualized or consolidated environments. Whichever way you cut it, these things need to be considered and dealt with during your design, operations and capacity planning. Do whatever you can to provide quality of service to your environment and set expectations appropriately, and you should be able to concentrate more on innovation and adding value than on just keeping the lights on.
Don’t forget to Buy Mostafa’s book – Storage Implementation in vSphere 5.0. If you’re running VMware vSphere 5.x this is a must have resource.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster. Copyright © 2013 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Appreciate you taking time to write this stuff….
How much of these problems will be solved by all-flash arrays ?
Hi Jayadeep, even all Flash arrays still have limits. But the limits are higher, they can handle more concurrent IO's and their service times are lower. The decisions around RPO and RTO for LUN size are still very relevant. Depending on the actual technology implications the issues around write bias and randomness may be reduced. But some flash based storage has fairly poor characteristics with highly random write workloads, so it will depend on the actual implementation. Even the most powerful system can be overloaded if not designed and operated correctly. Once you hit the wall of performance often the consequences are very quick and very severe.
Hi, Thanks for writing!!!
Please help me with this
"Having far fewer underlying LUN’s is one of the great benefits of virtualization however, so you want to reduce the number of LUNs to increase manageability, but without sacrificing performance. "
Size isn’t everything. But sizing your datastores or virtual machine storage so they are too big can have a big impact on the time it takes to recover a datastore in case of a disaster, and also what type of protection mechanisms you can use to reduce data loss.
Going by the first statement, having fewer LUNs is good for manageability and performance. But if I need fewer LUNs then I need to go for big LUNs, and big LUNs bring me back to the statement about considering RTO. Most sizing is done based on RTO, so we always decide to go with smaller ones, and since the LUN size is smaller, the number of LUNs definitely increases. How do I reconcile this with the first statement on performance and manageability?
So when I have more, smaller LUNs, how do I decide the queue depth? Basically I have multiple RDMs for an MSCS solution; in this case, how do I make a decision on the queue depth?
As you can probably guess, it is a balancing exercise. What you're trying to achieve as the architect of the environment is to find the 'sweet spot' that balances the size of the LUNs with the RPO/RTO requirements and also the performance requirements. This will be different in different environments. Also, many high performance database systems might have a very small standard LUN size when physical, e.g. 50GB or 70GB, to meet performance requirements, whereas in a virtual environment your smallest LUN size might be 300GB or 500GB, which is quite a bit different. In order to match the performance of multiple 50GB LUNs on a physical system when you move the VM to a 500GB LUN, you may need to adjust the queue depth, if the application needs all the queues. To determine what adjustments you need to make, you'll need to monitor your array, monitor your physical system (the source system) and then monitor your virtual environment to determine the appropriate queue depth, all without overloading the back end storage or causing bottlenecks in the hypervisor and Guest OS. There is no point modifying the queue depth of the RDM devices if the guest OS can't make use of the queues. You'd also need to modify the queue depth inside the Guest OS to use the same number of queues, which is possible by making registry changes in Windows, but this would need to be carefully tested.
Great post Michael. Very timely for me as well, as we were scaling up our compute for a very specific use case, and unfortunately the storage had been cut from the budget. I described the results of this at: http://vmpete.com/2013/03/11/vroom-scaling-up-vir… Right now they are paying the price for not investing in the storage.
That book is $44 on Amazon.. expensive 😀
Nowhere near as expensive as having storage related problems in your environments will be. It's half the price of other similar quality technical manuals. It's a very thick and detailed manual. I'd call it the VMware Environment Storage Bible.