
The Case for Larger Than 2TB Virtual Disks and The Gotcha with VMFS

Hypervisor competition is really starting to heat up. VMware just released vSphere 5.1, and Microsoft has recently released Windows Server 2012 and the new version of Hyper-V. A significant new feature now available in Hyper-V / Windows Server 2012 is a new virtual disk format, VHDX, which has a maximum size of 64TB. With the new filesystem in Windows Server 2012 (ReFS) the maximum volume size increases to 256TB (NTFS was limited to 16TB at a 4K cluster size). So how do vSphere 5 and 5.1 compare, and what are the key considerations and gotchas? What are the implications for business critical applications? Read on to find out.

Before we get started I’d like to say this article isn’t going to cover the performance of large volumes, but rather the argument for supporting larger than 2TB individual virtual disks and large volumes. There are many considerations around performance, and I will cover some of the implications when you start to scale up volume size, but for detailed performance design considerations I recommend you read my article titled Storage Sizing Considerations when Virtualizing Business Critical Applications.

The Case for Larger than 2TB Virtual Disks

Recently I have been having an interesting debate with some of my VCDX peers on the merits of and reasons for having larger than 2TB virtual disk support in vSphere. As of vSphere 5 VMware supports 64TB VMFS5 datastores and 64TB physical mode (pass-through) Raw Device Mappings (RDMs), but the largest single VMDK file supported on a VMFS5 volume is still 2TB minus 512 bytes (hereafter referred to as 2TB). The same 2TB limit also applies to virtual mode RDMs. In this debate I’ve been suggesting that, for now, most applications can be supported within the 2TB virtual disk limit. If larger than 2TB volumes are required for a VM, that is easily accommodated with in-guest volume managers and concatenation of multiple 2TB disks, or by using an alternative to VMFS. Realistically, however, this can only go so far. I plan to cover both the pros and the cons as I see them.

Pros:

  • Support for an individual VM with larger than 120TB storage requirements, which is the theoretical limit with 4 vSCSI controllers, each with 15 disks (60 disks total), at the maximum size of 2TB each. You’ll find out why it’s a theoretical limit later.
  • Easier management: fewer devices and fewer volumes, and space can potentially be utilised more efficiently.
  • No need to use in-guest volume managers for very large volumes.
  • Easier to support very large individual files (>2TB) without the use of in-guest volume managers.
  • It could be argued that losing one 2TB device from an in-guest managed volume has the same risk profile as losing a single large volume of the same size, as in both cases the entire volume is potentially lost.

Cons:

  • Larger individual devices and volumes take longer to backup and restore. This may require a major change in data protection architecture.
  • Larger volumes will potentially take longer to replicate and recover in a DR scenario.
  • The risk profile of losing a large volume or device is significantly higher than losing a smaller device or volume. Where no volume manager is being used, losing a single smaller device means only that device has to be recovered, instead of everything.
  • Larger individual devices still have the same number of IO queues to the vSCSI controller, which effectively limits their performance. This increases the risk of running out of performance before running out of capacity (until ultra low latency solid state flash storage is of massive capacity and abundantly available, anyway).
  • Significantly harder to take snapshots. A snapshot could still grow to be as large as the original virtual disk. This is probably one of the more significant reasons VMware hasn’t yet introduced VMDKs above 2TB.
  • Significantly longer to check a disk for integrity if there is any type of corruption, and how will it be recovered if it’s very large?
  • Impact on Storage vMotion times.

In my opinion the arguments are pretty even. But I always err on the side of performance, and I think having more devices of a smaller size is in many cases the better option, as it gives you access to far more queues and more parallel IO channels. However this is only relevant for some applications, mostly OLTP and messaging type applications. File servers, data warehousing, big data and the like may well benefit greatly from larger volume sizes, which would make those applications significantly easier to manage. The requirements will all be driven by the applications, and at the moment I see only a very small minority of workloads requiring storage capacities that would justify very large individual SCSI devices, and where the performance tradeoffs from an IO parallelism perspective are acceptable. Most of those corner cases have a suitable alternative for now (discussed below). I agree with my friend Alastair Cooke that I don’t want hypervisor limitations dictating my designs. Yet all designs have constraints we have to work within. Alastair has posted a good article on this topic in response, titled VM Disks Greater Than 2TB, and I recommend you read it.
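The queues-and-parallelism argument can be sketched with some simple arithmetic. This is illustrative only: the per-device queue depth of 32 is an assumption (a common default for LSI Logic virtual SCSI adapters), and real values vary by adapter type and configuration.

```python
import math

# Assumption: per-virtual-disk queue depth of 32 (typical LSI Logic vSCSI
# default). PVSCSI and tuned configurations can be higher.
QUEUE_DEPTH_PER_DEVICE = 32

def aggregate_queue_depth(total_tb, device_size_tb):
    """Devices needed, and total outstanding IOs available, when a volume
    of total_tb is built from devices of device_size_tb each."""
    devices = math.ceil(total_tb / device_size_tb)
    return devices, devices * QUEUE_DEPTH_PER_DEVICE

# One hypothetical 8TB device vs. the same capacity as 4 x 2TB devices
# concatenated in-guest.
print(aggregate_queue_depth(8, 8))  # (1, 32)
print(aggregate_queue_depth(8, 2))  # (4, 128)
```

Same capacity, four times the outstanding IO capability — which is why smaller devices can suit OLTP and messaging workloads better.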

[Updated 04/09/2013] The good news is that as of vSphere 5.5 we have support for 62TB VMDKs, so you are pretty much free to choose whichever size VMDK you like up to this limit, provided you’re running VMFS5 and vSphere 5.5.

Options for Larger than 2TB Volumes

So if you’ve looked at the requirements for your application and you decide that you need a volume larger than 2TB, what are your options with vSphere 5.x?

  1. Upgrade to vSphere 5.5.
  2. Use one or more VMFS volumes with virtual disks up to 2TB and in-guest volume managers to concatenate them. Implications: the more devices, the more storage IO queues and potentially the more performance. Oracle RAC vMotion supported. Theoretically supports up to 120TB of storage per VM.
  3. Physical Mode RDM – supports up to a 64TB individual device, more than 3PB per VM. Implications: No Storage vMotion, No Hypervisor Snapshot Support, No Cloning, No vSphere APIs for Data Protection (vADP) Support, No vCloud Director Support, No FT Support, No Oracle RAC vMotion Support, No Clustering vMotion Support.
  4. In-Guest iSCSI – supports up to 16TB or greater individual devices depending on the iSCSI target. Implications: No Storage vMotion (of iSCSI devices), No Hypervisor Snapshot Support (of iSCSI devices), No Cloning (of iSCSI devices), No vSphere APIs for Data Protection (vADP) Support (of iSCSI devices), vCloud Director Supported, FT Supported, vMotion Supported, Clustering vMotion Supported, higher CPU utilization.
  5. In-Guest NFS – supports very large volumes depending on the array. Implications: No Storage vMotion (of NFS devices), No Hypervisor Snapshot Support (of NFS devices), No Cloning (of NFS devices), No vSphere APIs for Data Protection (vADP) Support (of NFS devices), vCloud Director Supported, FT Supported, vMotion Supported, Oracle RAC vMotion Supported, higher CPU utilization.
  6. VMDirectPath I/O – supports assigning an HBA or NIC directly to a VM and is not impacted by VMFS heap size limitations. Implications: No Storage vMotion (of attached LUNs), No Hypervisor Snapshot Support (of attached LUNs), No Cloning, No vSphere APIs for Data Protection (vADP) Support, No FT Support, No vCloud Director Support, No vMotion Support.
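The 120TB theoretical ceiling quoted for the VMFS-plus-concatenation option falls straight out of the per-VM virtual SCSI limits mentioned earlier. A quick sanity check of the arithmetic (2TB is rounded for simplicity; the real maximum is 2TB minus 512 bytes):

```python
# Per-VM virtual SCSI limits for vSphere 5.x as quoted in the article.
VSCSI_CONTROLLERS_PER_VM = 4
DEVICES_PER_CONTROLLER = 15
MAX_VMDK_TB = 2  # actually 2TB - 512 bytes, rounded here

max_devices = VSCSI_CONTROLLERS_PER_VM * DEVICES_PER_CONTROLLER
max_storage_tb = max_devices * MAX_VMDK_TB

print(max_devices, max_storage_tb)  # 60 120
```

As the heap size discussion below shows, this 120TB figure is theoretical: the per-host limit on open VMDK storage is reached first.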

You can’t evaluate the alternatives in isolation, and to be fair they are workarounds that you wouldn’t even have to consider if larger than 2TB VMDKs were possible. Physical mode RDMs in particular have operational implications, especially as you can’t use hypervisor snapshots, cloning, or the backup API integration, just to name a few. So any alternative you choose needs to be thoroughly considered.

The Gotcha with VMFS

If you are going to have databases or systems with large disk footprints (and multiple of them per host) you may need to modify the ESXi VMFS heap size by changing the advanced setting VMFS3.MaxHeapSizeMB. Review KB 1004424, Jason Boche’s article Monster VMs & ESX(i) Heap Size: Trouble In Storage Paradise, and Virtual Kenneth’s article VMFS3 Heap Size. VMFS5 was limited to a maximum of 25TB of virtual disks open per host (yes, per host). This has increased from when I first published this article, as a result of ESXi 5.0 patch ESXi500-201303401-BG. The old default setting allowed only 8TB of VMDKs to be open per host; the new default is 60TB per host once you’ve applied the latest patch. This means that even if it is acceptable to you for a single VM to have multiple 2TB virtual disks combined with in-guest volume managers, you would not be able to configure or open more than 60TB in total (up from 25TB prior to ESXi patch ESXi500-201303401-BG) on a single host (it was 32TB with VMFS3). This is why the limit of 120TB per VM on VMFS is at this point purely theoretical.
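The relationship between heap size and open VMDK capacity can be approximated from the figures above. The roughly 100GB of addressable open VMDK storage per MB of VMFS heap is an approximation I have inferred from the published numbers (80MB ≈ 8TB, 256MB ≈ 25TB, 640MB ≈ 60TB), not an official VMware formula:

```python
# Assumption: ~100GB of open VMDK storage addressable per MB of VMFS heap,
# inferred from the KB 1004424 figures quoted above. Indicative only.
GB_ADDRESSABLE_PER_MB_HEAP = 100

def approx_open_vmdk_tb(heap_size_mb):
    """Approximate TB of VMDKs that can be open per host for a given
    VMFS3.MaxHeapSizeMB value."""
    return heap_size_mb * GB_ADDRESSABLE_PER_MB_HEAP / 1024

# Old ESXi 5.0 default, old maximum, and post-patch default heap sizes.
for heap_mb in (80, 256, 640):
    print(heap_mb, round(approx_open_vmdk_tb(heap_mb), 1))
```

The approximation lands close to the published limits (about 7.8TB, 25TB and 62.5TB respectively), which is why raising the heap setting matters so much for storage-heavy hosts.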

If you want to work around this limitation you will need to adopt option 2 (with the VMDKs on an NFS datastore only), 4, 5 or 6 above, or use virtual/physical mode RDMs (option 3). The reason is that this limit is purely a VMFS limit and doesn’t impact RDMs (physical or virtual), VMDKs on an NFS datastore, or in-guest iSCSI or NFS.

[Updated 20/09/2012] A great example of where it would be good to be able to support >25TB of VMDKs per host and >2TB per VMDK is a customer requirement such as virtualizing 20 x 4TB file servers. Each file server may not need much in the way of RAM or CPU, but does need a decent amount of storage. In theory these 20 VMs could easily be consolidated on a single host (although they wouldn’t be, for availability reasons), but because of the VMFS limitation this is not possible, and due to the 2TB per VMDK limit you will require a minimum of 2 VMDKs per VM. It may be more convenient to have a single 4TB VMDK for these types of servers. One option is to design for a consolidation ratio of 5:1 and size the physical hosts accordingly, making sure to increase the default VMFS heap size. However this would introduce additional operational costs and effort. This brings us back to options 3, 4 and 5 above again. In this case vRDM may be a better option than pRDM, even with the 2TB limit, as it allows easy migration to VMFS / VMDKs in the future. pRDM would have the advantage of reducing the total number of LUNs required for the VMs, which might otherwise be 60 LUNs in total, not taking into account other VMs and LUNs in the cluster (which could bring them close to the 256 LUN limit per host), but with the tradeoff of a harder migration path in the future.

[Updated 04/04/2013] On 28th March VMware released patch ESXi500-201303401-BG, which increased the default heap size for VMFS to 640MB and the maximum open VMDK storage per host to around 60TB, as mentioned in the original KB article. This patch also addresses a problem where a VM configured with 18 or more VMDKs, each above 256GB, would report a VMFS heap size issue. This is great news for customers that want to run Monster VMs with large amounts of storage per host. This new patch is currently for ESXi 5.0 only, not 5.1, but I would expect that when the latest patches for ESXi 5.1 are made available they will also allow up to 60TB of open VMDK files per host. I would like to thank Marcel van den Berg for his excellent article covering this problem, titled A Small Adjustment and a New VMware Fix will Prevent Heaps of Issues on vSphere VMFS Heap, which alerted me to the new default heap size adjustment that I initially missed when reviewing the release notes. This is yet another reason to keep your vSphere environments up to date with patches. Great news from VMware.

[Updated 04/09/2013] As of vSphere 5.5 the default heap size allows for up to 64TB of open VMDK files per host, and the maximum setting allows for up to 128TB of open VMDKs per host. The way the heap is used has changed significantly and it is now much more efficient.

Final Word

Microsoft appears to have put the cat squarely among the pigeons in terms of large virtual disk support (prior to vSphere 5.5) with its latest release of Windows Server 2012 and Hyper-V. In this respect VMware is indeed playing catch up. But are greater than 2TB virtual disks really required right now for most applications? In my opinion, no. For the majority of applications the existing vSphere hypervisor can adequately cater for their size and performance needs. But this is only going to last so long. There are some good use cases documented in Cormac Hogan’s blog article How Much Storage Can I Present to a Virtual Machine.

Most applications in my experience, especially the performance and latency sensitive messaging and OLTP database applications, would benefit more from a greater number of SCSI devices and queues. In their case, supporting more than 256 datastores per host would be of benefit, especially when multiple such hosts are grouped in a cluster. The benefits of using VMFS and virtual disks are compelling, and not being able to support very large virtual disks is going to become a major problem in the future, considering VMFS5 already supports 64TB volumes and data is growing explosively. But do we want larger virtual disks if it means sacrificing functionality such as snapshots? I don’t think so. I hope VMware will support larger virtual disks, even if only up to 4TB or 16TB, without sacrificing functionality. In the meantime the alternatives, such as RDMs and in-guest storage access, will fill the gap for the minority of workloads that need it, with the resulting tradeoffs in functionality. Workloads for which the workarounds are unacceptable may not be virtualization candidates, at least on vSphere, until some of these problems are solved.

Just because you can do something doesn’t mean you necessarily should. The back end array architecture needs to be considered, and so do the data protection and disaster recovery aspects of the solution. It’s no good having a massive volume and a massive amount of storage per VM if you can’t protect that data and recover it in a reasonable timeframe when required. I would like to know of your use cases that require greater than 2TB virtual disks and of your very large data Monster VMs. Hopefully if enough customers require larger than 2TB VMDKs, VMware will implement the necessary changes.

Here is what I’d like to see from VMware (In no particular order):

  • Larger than 2TB VMDK Support (supported as of vSphere 5.5)
  • More than 4 vSCSI Controllers per VM (AHCI Controller allows up to 120 Devices as of vSphere 5.5)
  • More than 256 SCSI Devices per Host

I would be very interested to get your feedback on this.

This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com by Michael Webster. Copyright © 2012 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.

  1. Simon Williams
    September 18, 2012 at 9:47 am | #1

    Mike,

    Can you suggest any names / contacts for meetings with my CEO & I in Wellington on the 27th & 28th?

    So far we are meeting BNZ and Weta… Cheers.

    Simon

    Simon Williams

    Sales Director – Australia & New Zealand
    Fusion-io
    Ph. +61 488 488 328
    Twitter: @simwilli
    Email: swilliams@fusionio.com

  2. September 18, 2012 at 9:42 pm | #2

    Alternatives have worked alright for now, but my main concern with this is how it affects performance. Moreover, backups and restores take a ridiculously long time, which could be a problem depending on the company's RTO/RPO requirements.

  3. November 5, 2012 at 1:13 pm | #3

    Thanks Mike. Always enjoyed reading your article. Just adding some points, and do correct me if I'm wrong:

    – doing in-guest means the IP storage traffic is not visible to ESXi (and hence vCenter). So it can't be monitored using the standard (built-in) tools. vCenter Operations will also "miss" this data as it won't classify it as Storage. For example, high workload on this vmnic will not impact the Workload badge of the corresponding VM.

    – doing in-guest means the VM sees the storage network. This creates complexity and security should be incorporated to address this, because VM local admin is typically given to the VM owner (the Sys Admin of that VM). Personally, I'd like to keep the separation clean, so it's easier operationally.

    – I'm not 100% certain whether doing concatenation at the software level (be it hypervisor OS or Guest OS) is a potential bottleneck. I thought it's always the physical spindle. An EMC Resident Engineer told me that's the bottleneck when we were discussing a storage issue at a large client.

    I also agree with Alastair. Well said mate :-)

    • November 6, 2012 at 12:27 am | #4

      Hi Iwan, You've raised some good and valid points that should also be considered when looking at guest storage design. The back end storage isn't always the bottleneck though. Often the Guest OS configuration is also a bottleneck. The bottlenecks will vary greatly between different customers, different workloads and different designs or configurations. For example, if you have SSDs backing the VM that can easily handle a queue depth of 255, and you're using a VM with a single virtual disk and a queue depth of 32, your Guest VM config could be a major bottleneck. But even concatenation isn't a silver bullet if all the IO is happening on a single virtual disk that makes up the larger volume. It all depends on the workload. Most solutions are never perfect, as there are always constraints and compromises that need to be made.

  4. Jim Nickel
    December 11, 2012 at 12:41 am | #5

    I recently had to use in-guest disk managers to build one 20TB file server. Then I also made two 20TB Exchange mailbox servers.

    Both of these were for a fairly large client. While this works today, I can see potential problems with it in the future.

    I would very much like to see >2TB VMDK support soon.

    Jim

  5. Troy MacVay
    December 13, 2012 at 8:30 pm | #6

    Very interesting post. We are a Cloud Provider and had a long-standing issue in our CommVault environment that turned out to be the result of heap size. We run CommVault on stand-alone ESXi hosts and use the HotAdd transport for backup. We started getting random HotAdd failures and spent way too much time troubleshooting without any real resolution, even working with VMware support. In some last-ditch troubleshooting we found the heap size issue, and for us it totally added up.

    We were limited to 8TB of active VMDKs per host. This was an issue for us, as we tend to HotAdd much more than this on the hosts as part of the backup process. We increased the value to the maximum and the HotAdd issues are gone.

    For us it does lead to some questions around the possibility of large RAM hosts and capacity planning. Think of a host that has 1TB of RAM, and I can bet it will need to have more than 25TB of attached VMDKs to support the VM workloads.

    Cheers,

  6. January 8, 2013 at 2:41 am | #7

    We are also a cloud provider and have design issues while trying to stay below the 256 LUN limit per host. When each customer has multiple datastores, the number of datastores (and thus LUNs or NFS mounts) can escalate pretty quickly.

    The 25TB of attached VMDKs seems a bit absurd. I certainly hope some of these scalability issues are taken seriously soon. It seems that ever since day one, whether it be ESX or vCenter, VMware hasn't thought this through carefully, and instead let customers troubleshoot ridiculous issues, while the support staff at VMware has little or no real-world knowledge of larger environments.

    Eric

  7. skyfx
    March 2, 2013 at 11:32 pm | #8

    Great article! Quick question – you suggest the use of physical RDM as one of the approaches of circumventing the 2TB limit. I understand that virtual RDM does not offer the same advantage, but I don't fully understand why. Could you elaborate on the limitations of virtual RDM?

    In our case, we have two disk arrays configured in a RAID 6 array comprising multiple TBs. If we were to take the RDM approach, my understanding is that we would separate the array into two partitions:

    1) A VMFS partition to host the guest OS .vmdk's as well as the RDM mapping .vmdk's

    2) A raw partition

    Assuming partition 2 is greater than 2TB, could we not use it as a virtual RDM?

    Thanks :)

    • @vcdxnz001
      March 3, 2013 at 10:33 pm | #9

      The virtual mode RDM still has the hypervisor sitting in front of it to intercept IO, which is why vSphere snapshots are supported, etc. This limits it to 2TB currently (refer to the vSphere Maximums guide). A raw device mapping in physical mode, however, essentially allows the VM to connect directly to the underlying LUN, so at that point the 2TB limit no longer applies.

  8. September 10, 2013 at 2:20 am | #10

    Been having some fun testing 62TB virtual drives under ESXi 5.5, so far, so (very) good!

    • @vcdxnz001
      September 10, 2013 at 2:30 am | #11

      Hi Paul,

      That's great to hear. It's certainly a feature that was asked for quite a bit. Not necessarily because everyone needs 62TB per VMDK right away, but because there are a lot of use cases for 3TB or 4TB VMDKs. It'll be interesting to see how this feature goes in conjunction with some of the flash technologies, as they have the potential to reduce the performance problems traditionally associated with large VMDK and LUN sizes.

  9. Mike
    September 13, 2013 at 5:28 pm | #12

    Paul, I'm concerned about the 256 SCSI devices per host you mentioned. My 5.0 hosts connect to 40 iSCSI VMFS datastores containing C drive VMDKs and 150 pRDMs containing SQL and Exchange data, so the iSCSI software adapter sees 190 devices. So is 256 my limit if using pRDMs and datastores? If so, what do I do? Convert the pRDMs into VMDK files on additional larger VMFS datastores and just use VMDKs in the future instead of multiple small pRDMs? Do you know if a future release will address this?

    • Mike
      September 13, 2013 at 7:00 pm | #13

      Sorry, this was a question for Mike, not Paul.
