The Case for Larger Than 2TB Virtual Disks and The Gotcha with VMFS
Hypervisor competition is really starting to heat up. VMware just released vSphere 5.1 and Microsoft has recently released Windows Server 2012 and the new version of Hyper-V. A significant new feature available now in Hyper-V / Windows 2012 is a new disk format VHDX, which has a maximum size of 64TB. With the new filesystem in Windows Server 2012 (ReFS) the maximum volume size increases to 256TB ( NTFS was limited to 16TB @ 4K cluster size). So how does vSphere 5 and 5.1 compare and what are the key considerations and gotchas? What are the implications for business critical applications? Read on to find out.
Before we get started I’d like to say this article isn’t going to cover performance of large volumes. But rather the argument for supporting larger than 2TB individual virtual disks and large volumes. There are many considerations around performance, and I will cover some of the implications when you start to scale up volume size, but for particular performance design considerations I’d like to recommend you read my article titled Storage Sizing Considerations when Virtualizing Business Critical Applications.
The Case for Larger than 2TB Virtual Disks
Recently I have been having an interesting debate with some of my VCDX peers on the merits and reasons for having larger than 2TB virtual disk support in vSphere. As of vSphere 5 VMware supports 64TB VMFS5 datastores, and 64TB Physical Mode (Pass-through) Raw Device Maps (RDM’s), but the largest single VMDK file supported on a VMFS5 volume is still 2TB-512b (hereon after referred to as 2TB). The same 2TB limit applies to virtual mode RDMs also. In this debate I’ve been suggesting that for now “most” applications can be supported with the 2TB virtual disk limit. If larger than 2TB volumes are required for a VM that is very easily accommodated with in guest volume managers and device concatenation of multiple 2TB disks, or using an alternative to VMFS. However realistically this can only go so far. I plan to cover both the pros and the cons as I see them.
- Support for an individual VM with larger than 120TB storage requirements, which is the theoretical limit with 4 x vSCSI controllers, each with 15 disks (60 disks total) at the maximum size of 2TB each. You’ll find out why it’s a theoretical limit later.
- Easier to manage less devices and less volumes and space can potentially be more efficiently utilised.
- No need to use in guest volume managers for very large volumes.
- Easier to support very large individual files >2TB without the use of in guest volume managers.
- It could be argued that losing one 2TB device from a in guest managed volume has the same risk profile as losing a single large volume of the same size as in both cases the entire volume is potentially lost.
- Larger individual devices and volumes take longer to backup and restore. This may require a major change in data protection architecture.
- Larger volumes will potentially take longer to replicate and recover in a DR scenario.
- The risk profile of losing a large volume or device is significantly higher than losing a smaller device or volume. Losing a single smaller device where no volume manager is being used results in only the small device having to be recovered instead of everything.
- Larger individual devices still have the same number of IO queues to the vSCSI controller which effectively limits their performance. This increases the risk of running out of performance before running out of capacity (until ultra low latency solid state flash storage is of massive capacity and abundantly available anyway).
- Significantly harder to take snapshots. A snapshot could still grow to be equally as large as the original virtual disk. This is probably one of the more significant reasons that VMware hasn’t yet introduced VMDK’s above 2TB.
- Significantly longer to check disk for integrity if there is any type of corruption, how will it be recovered if it’s very large?
- Impact on Storage vMotion times.
In my opinion the arguments are pretty even. But as I always err on the side of performance, and I think having more devices of a smaller size in a lot of cases is a better option as this gives you far more access to more queues and more parallel IO channels. However this is only relevant for some applications, mostly OLTP and messaging type applications. File servers, data warehousing, big data and the like may well benefit greatly from larger volume sizes, and it would make those applications significantly easier to manage. But the requirements will all be driven by the applications and at the moment I only see a very small minority of workloads require storage capacities that would justify very large individual SCSI devices and where the performance tradeoffs from an IO parallelism perspective are acceptable. Most of those corner cases have a suitable alternative for now (discussed below). I agree with my friend Alastair Cooke that I don’t want hypervisor limitations dictating my designs. Yet all designs have constraints we have to work within. Alastair has posted a good article on this topic in response to this titled VM Disks Greater Than 2TB and I recommend you read it.
Options for Larger than 2TB Volumes
So if you’ve looked at the requirements for your application and you decide that you need a volume larger than 2TB, what are your options with vSphere 5.x?
- Using one or more VMFS volumes with virtual disks up to 2TB and in guest volume managers to concatenate them. Implications: The more devices the more storage IO queues and potentially the more performance. Oracle RAC vMotion Supported. Theoretically supports up to 120TB storage per VM.
- Physical Mode RDM – Support up to 64TB individual device, more than 3PB per VM. Implications: No Storage vMotion, No Hypervisor Snapshot Support, No Cloning, No vSphere API’s for Data Protection Support (vADP), No vCloud Director Support, No FT Support, No Oracle RAC vMotion Support, No Clustering vMotion Support.
- In Guest iSCSI – Supports up to 16TB or greater individual devices depending on iSCSI target. Implications: No Storage vMotion (of iSCSI devices), No Hypervisor Snapshot Support (of iSCSI devices), No Cloning (of iSCSI devices), No vSphere API’s for Data Protection Support (vADP) (of iSCSI devices), vCloud Director Supported, FT Supported, vMotion Supported, Clustering vMotion Support, higher CPU utilization.
- In Guest NFS – Supports very large volumes depending on the array. Implications: No Storage vMotion (of NFS devices), No Hypervisor Snapshot Support (of NFS devices), No Cloning (of NFS devices), No vSphere API’s for Data Protection Support (vADP) (of NFS devices), vCloud Director Supported, FT Supported, vMotion Supported, Oracle RAC vMotion Support, higher CPU utilization.
- VMDirectPath/IO – Supports assigning an HBA or NIC directly to a VM and is not impacted by VMFS Heap Size limitations. Implications: No Storage vMotion (of attached LUN’s), No Hypervisor Snapshot Support (of attached LUN’s), No Cloning, No vSphere API’s for Data Protection support (vADP), No FT Support, No vCloud Director Support, No vMotion Support.
You can’t evaluate the alternatives in isolation and to be fair they are workarounds that you wouldn’t even have to consider if larger than 2TB VMDK’s were possible. Physical Mode RDM’s in particular have operational implications, especially as you can’t use hypervisor snapshots, cloning, and no backup API integration, just to name a few. So any alternative you choose needs to be thoroughly considered.
The Gotcha with VMFS
If you are going to have databases or systems with large disk footprints (and have multiple per host) you may need to modify the ESXi VMFS Heap Size by changing the advanced setting VMFS3.MaxHeapSizeMB. Review KB 1004424, Jason Boche’s article Monster VMs & ESX(i) Heap Size: Trouble In Storage Paradise and Virtual Kenneth’s article VMFS3 Heap Size. VMFS5 was limited to a maximum of 25TB of virtual disks open per host (Yes per host). This has increased form when I first published this article as a result of ESXi 5.0 patch ESXi500-201303401-BG. With a old default setting allowing only 8TB of VMDK’s to be open per host, the new default is increased to 60TB per host once you’ve applied the latest patch. This means even if it is acceptable to you for a single VM to have multiple virtual disks of 2TB and using in guest volume managers you would not be able to configure or open more than 60TB total (up from 25TB prior to ESXi patch ESXi500-201303401-BG) maximum on a single host (was 32TB with VMFS3). This is why the limit of 120TB per VM on VMFS is at this point purely theoretical.
If you want to work around this limitation you will need to adopt option 1 (on NFS Datastore Only), 2, 3 or 4 above or use virtual/physical mode RDMs. The reason is this limit is purely with VMFS and doesn’t impact RDM’s (physical or virtual), VMDK’s on an NFS Datastore, or in guest iSCSI or NFS.
[Updated 20/09/2012] A great example where it would be good to be able to support > 25TB VMDK’s per host and > 2TB per VMDK is where a customer has a requirement such as virtualizing 20 x 4TB File Servers. Each fileserver may not need much in the way or RAM or CPU, but does need a decent amount of storage. In theory these 20 VM’s could easily be consolidated on a single host (although wouldn’t be for availability requirements), but because the VMFS limitation this is not possible, and due to the limit of 2TB per VMDK limit you will require a minimum of 2 VMDK’s per VM. It may be more convenient to have a single 4TB VMDK for these types of servers. One option is to design for a consolidation ration of 5:1 and size the physical hosts accordingly, making sure to increase the default VMFS heap size. However this would introduce additional operational costs and effort. This brings us back to option 2, 3 and 4 above again. In this case vRDM may be a better option than pRDM even with the 2TB limit as it allows easy migration to VMFS / VMDK’s in the future. pRDM would have the advantage of reducing the number of LUNs in total required for the VM’s, which might be 60 LUNs in total, not taking into account other VM’s and LUNs in the cluster (which could bring them close to the 256 LUN limit per host), but with a tradeoff of a harder migration path in the future.
[Updated 04/04/2013] On 28th March VMware released patch ESXi500-201303401-BG which increased the default heap size for VMFS to 640MB and the maximum open VMDK storage per host to around 60TB as mentioned in the original KB article. This patch also addresses a problem where a VM configured with 18 or more VMDK’s where the VMDK’s are above 256GB would also report a VMFS heap size issue. This is great news for customers that want to run Monster VM’s with large amounts of storage per host. This new patch is currently for ESXi 5.0 only, not 5.1. But I would expect that when the latest patches for ESXi 5.1 are made available they will also allow up to 60TB per host of open VMDK files. I would like to thank Marcel van den Berg and his excellent article covering this problem titled A Small Adjustment and a New VMware Fix will Prevent Heaps of Issues on vSphere VMFS Heap. This article alerted me to the new adjustment of the default heap size that I initially missed when reviewing the release notes. This is yet another reason to ensure you keep your vSphere environments up to date with patches. Great news from VMware.
Microsoft appears to have put the cat squarely among the pigeons in terms of large virtual disk storage support with their latest release of Windows 2012 and Hyper-V. In this respect VMware is indeed playing catch up. But are greater than 2TB virtual disks really required right now for most applications? In my opinion no. For the majority of applications the existing vSphere hypervisor can adequately cater for their size and performance needs. But this is only going to last so long. There are some good use cases documented in Cormac Hogan’s blog article How Much Storage Can I Present to a Virtual Machine.
Most applications in my experience, especially the performance and latency sensitive messaging and OLTP database applications would benefit more from a greater number of SCSI devices and queues. In their case supporting more than 256 datastores per host would be of benefit, especially if there are multiple of them all grouped in a cluster. The benefits of using VMFS and virtual disks are compelling and not being able to support very large virtual disks is definitely going to be a major problem in the future, considering VMFS5 already supports 64TB volumes. Especially considering the explosive growth of data. But do we want larger virtual disks and to sacrifice functionality, such as snapshots? I don’t think so. I hope that VMware will support larger virtual disks, even if they increase it up to 4TB or 16TB, and without sacrificing functionality. However in the meantime the alternatives such as RDMs and in guest storage access will fill the gap for some of the minority of workloads that need it, with the resulting trade offs in functionality. For those workloads where the workarounds are unacceptable they may not be virtualization candidates, at least on vSphere anyway, till some of these problems are solved.
Just because you can do something doesn’t mean you necessarily should. The back end array architecture needs to be considered and so does the data protection and disaster recovery protection aspects of the solution. It’s not good having a massive volume and a massive amount of storage per VM if you can’t protect that data and recover it in a reasonable timeframe when required. I would like to know of your use cases that require greater than 2TB virtual disks and of your very large data Monster VM’s. Hopefully if there are enough customers that require larger than 2TB VMDK’s VMware will implement the necessary changes.
Here is what I’d like to see from VMware (In no particular order):
- Larger than 2TB VMDK Support
- More than 4 vSCSI Controllers per VM
- More than 256 SCSI Devices per Host
I would be very interested to get your feedback on this.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster +. Copyright © 2012 – IT Solutions 2000 Ltd and Michael Webster +. All rights reserved. Not to be reproduced for commercial purposes without written permission.