I’ve recently been testing some Monster VMs in my lab and setting up templates to provision them more rapidly. Along the way I’ve run into a few things I wasn’t aware of that could be a potential barrier to Monster VMs, and some interesting behaviour during large storage migrations. When you start provisioning Monster VMs more often and moving them around, you are quite likely to run into the same problems. So hopefully this article will help you work around the problems I hit and avoid them until there is a more permanent solution. I also hope that our friends at VMware are aware of this behaviour and work towards addressing it in an upcoming release, so that Apps can continue to love vSphere as much as they do today, even when the scale increases and Monster VMs become the norm. I would like to thank Duncan Epping, Andrew Mitchell and Frank Denneman for their help with understanding this problem and its possible solutions.
Firstly, let me qualify what I mean by a Monster VM. In this case I mean a VM with 8TB of disk (4 vSCSI controllers and 17 VMDKs), 16 vCPUs and 192GB RAM. So it’s pretty big, but not massive compared to the maximums that vSphere 5.x can support. But it was big enough for the experiments I was going to run with Microsoft Exchange 2010, Exchange 2013 and Oracle 11g R2.
The Problem and Why it’s a Barrier to Monster VMs
So what’s the problem? Well, I was creating templates for these Monster VMs so I could provision them rapidly after my experiments. These templates are thin provisioned so I don’t waste unnecessary space, and the systems I deployed from them to run the experiments on were thin provisioned too. Where the problem comes in is that I had put these VMs on my bulk NAS storage until I needed them, so they wouldn’t take up space on my higher, more valuable tiers of storage (see my lab environment). I hear you saying: so what could possibly cause a problem with that?
Well, that is the interesting part. It all comes down to a thing called a Data Mover. A Data Mover is the component within VMware vSphere that moves data from place to place (funny that) during operations such as Storage vMotion and other types of storage migrations (clones). In vSphere today (4.1 and up) there are three different Data Movers.
The three Data Movers are FSDM, which is the simplest, most compatible and slowest data mover (and sits highest in the IO stack); FS3DM, which is more advanced and faster; and FS3DM Hardware Accelerated, which offloads the work to the array. I won’t go into all the details here, as I think the best way to understand these is to read Duncan and Frank’s book – vSphere 5.1 Clustering Deep Dive – and Frank’s article VAAI hw offload and Storage vMotion between two Storage Arrays.
In my case I was migrating between arrays, going from my NAS, which is NFS, to either another NAS (NFS filesystem) or a VMFS5 datastore on an iSCSI array. My iSCSI array (HP StoreVirtual) and one of my NFS systems (Nutanix NX-3450) have VAAI enabled; my other NAS devices (QNAP 869 Pro, QNAP 469) don’t have VAAI yet. This is where I hit a little snag. Going between arrays, regardless of their VAAI capability, won’t use the FS3DM Hardware Accelerated Data Mover; that’s only used within the same array. But going between a NAS and VMFS5, or from one NAS to another NAS, won’t even use FS3DM. It will only use FSDM, the most compatible, but also the slowest and simplest of the Data Mover species. Why is this so important, I hear you say? Let me explain.
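The selection rules above can be sketched in a few lines of code. This is only a simplified model of the behaviour as I observed it (the function name and inputs are my own invention), not VMware’s actual implementation:

```python
# Simplified model of vSphere's data mover selection, as observed:
#  - hardware-offloaded FS3DM is only used within a single VAAI-capable array
#  - software FS3DM requires VMFS on both the source and the destination
#  - anything involving an NFS datastore falls back to FSDM

def select_data_mover(src_fs: str, dst_fs: str,
                      same_array: bool, vaai: bool) -> str:
    """Return the data mover a storage migration would use."""
    if same_array and vaai:
        return "FS3DM-HW"          # offloaded to the array
    if src_fs == "VMFS" and dst_fs == "VMFS":
        return "FS3DM"             # software data mover, much faster
    return "FSDM"                  # reads and copies every block, even zeros

# My scenario: NFS source to a VMFS5 destination on a different array
print(select_data_mover("NFS", "VMFS", same_array=False, vaai=True))  # FSDM
```

As soon as an NFS datastore is on either end, or the migration crosses arrays without VMFS at both ends, you land on FSDM.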
FSDM will read every block of a VM, even if it has never been written to. You read that correctly: even a block that has never held any data will be read and transferred. This is very bad news if you have a Monster VM or Monster VM template using thin provisioning, and it affects every vendor using NFS with VMware vSphere. One of the reasons it probably isn’t that well known or noticeable is that most VMs have very small storage footprints, relatively speaking, especially when they are initially provisioned. Large production VMs also don’t move around all that much.
So in my case, one of the VMs had 8TB allocated with 5.65TB in use, and another VM had 5TB allocated and only 15GB in use. Guess what it did. It copied over all the real data, and then for all the VMDKs that had zero bytes physically written on disk, it proceeded to copy terabytes’ worth of zeros for every block that had never yet been written to. The storage migration took 3 days to complete! In case you were wondering, I have a 10G network, but the QNAP NAS is dual 1G connected.
In contrast, when I migrated or cloned the template VM with 15GB used and 5TB allocated from VMFS5 to VMFS5 across arrays, it used the FS3DM data mover and took all of 24 minutes. 24 minutes (VMFS5-VMFS5 across arrays) vs 3 days (NAS-VMFS5, NAS-NAS). You see my problem.
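Some back-of-envelope arithmetic shows why the gap is so large: FSDM has to move the full allocated size, while a block-aware mover only needs the blocks actually in use. The 100 MB/s sustained throughput below is an assumption (roughly one saturated 1GbE link), not a measurement from my lab:

```python
# Why FSDM hurts thin provisioned Monster VMs: it moves the full
# allocation, not just the blocks that were ever written.

TB = 1024 ** 4
GB = 1024 ** 3
THROUGHPUT = 100 * 10 ** 6           # assumed sustained rate: 100 MB/s

def transfer_hours(num_bytes: int) -> float:
    """Hours to move num_bytes at the assumed throughput."""
    return num_bytes / THROUGHPUT / 3600

allocated = 5 * TB                   # template's provisioned size
used = 15 * GB                       # blocks actually written

print(f"FSDM (full 5TB allocation): {transfer_hours(allocated):.1f} hours")
print(f"Used blocks only (15GB):    {transfer_hours(used) * 60:.1f} minutes")
```

At that assumed rate the full allocation takes over 15 hours on the wire, versus under three minutes for the data actually written. Add slower links, protocol overhead and array load, and multi-day migrations stop being surprising.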
The Workaround or Temporary Solutions
So I have considered a number of solutions or workarounds.
1. I could mount an iSCSI LUN on my NAS and use VMFS5 to store my templates, and make sure their destination is also VMFS5. This really isn’t ideal, as there are a lot of advantages to NFS, simplicity being one of them. It still wouldn’t help if I needed to deploy the template to an NFS datastore.
2. Only put the OS disks on my templates and leave off all the other VMDKs and SCSI controllers until after the VMs are provisioned. This isn’t ideal either, as I’d have to re-provision all the apps and data each time I want to re-create a VM and run a test from a known good state.
3. I could provision and set up a VM on each of the storage devices I’m going to use for my experiments. This option is also not ideal, as I’ll have to build these templates on each storage device. But at least being thin provisioned they won’t take up much space initially, at least for the templates.
4. Instead of using Storage vMotion I could use backup and restore, with say VMware VDP and VADP (vSphere APIs for Data Protection). This is potentially a good option, and might be quite fast.
5. I could set up vCloud Director and use Fast Provisioning. Provided I set up the template and use the same underlying datastore for all the copies, this would work fine. This is one of the advantages of NFS: you get one huge volume, but VM-level granularity. If I wanted to provision a copy to another datastore, though, it would probably be hit by the same problem.
6. It’s NFS. I could just copy the files using the filesystem rather than using a Storage vMotion or clone operation. This is fine in my lab environment, where the VMs will likely be powered off when I want to move them about. But in a production environment where the VMs need to remain powered on, this is not going to work.
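If you do copy the files yourself as in option 6, it pays to use a sparse-aware tool (rsync --sparse, or cp --sparse=always) so the never-written blocks don’t travel over the wire either. A quick way to see how much of a file is actually allocated, versus its apparent size, is to compare st_size with st_blocks. This sketch assumes Linux, where st_blocks is counted in 512-byte units:

```python
import os

def sparse_report(path: str):
    """Return (apparent_size, allocated_bytes) for a file."""
    st = os.stat(path)
    apparent = st.st_size            # logical size, as a naive copy would see it
    allocated = st.st_blocks * 512   # bytes actually backed by storage (Linux)
    return apparent, allocated

# Demo: a 1GiB sparse file with no data blocks ever written
demo = "/tmp/sparse_demo.bin"
with open(demo, "wb") as f:
    f.truncate(1024 ** 3)            # extend the file without writing any data
apparent, allocated = sparse_report(demo)
print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")
os.remove(demo)
```

On a filesystem that supports sparse files, the allocated figure for the demo file is near zero even though the apparent size is a full gigabyte, which is exactly the gap FSDM wastes time copying.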
In a production environment if I was looking to migrate a Monster VM that was running a database or Exchange or similar I would probably set up a new instance on the destination and use the application level utilities to migrate the data and switch over. Your requirements will vary, so evaluate the best options for your situation.
Putting This in Perspective
In order for you to experience this, you need to be migrating, cloning, or deploying VM templates that have a source or destination on different storage devices, backed by an NFS datastore, with thin provisioned storage and a large storage allocation (1TB+). So right now this might not impact many people, but the numbers will grow over time, especially when you consider that many Cloud environments, particularly those using vCloud Director, are backed by NFS storage, as it’s more efficient for Cloud. Although many environments may still be using block storage for the majority of their workloads right now, I think that will change in the future. Many systems are now running on NFS, and many vSphere environments are now backed by NFS datastores. This will only increase.
I think the real solution is that the process VMware uses for migrating large VMs between arrays where a NAS or NFS datastore is involved, especially when thin provisioning is in play, needs to change. The FSDM Data Mover doesn’t cut it, not for Monster VMs. Copying terabytes’ worth of zeros that have never been written to is of absolutely no value; it can be done much smarter. Perhaps this might come in the form of a new VAAI for NAS primitive, or perhaps VMware will invent a new Data Mover. They have many smart people and are committed to Monster VMs and running these types of workloads in a Software Defined Datacenter or Hybrid Cloud, so I’m sure in due course a viable long term solution will become available. Until then, I’ve probably only thought of a subset of the options, so I’d like to hear your thoughts on this.
This post appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster +. Copyright © 2013 – IT Solutions 2000 Ltd and Michael Webster +. All rights reserved. Not to be reproduced for commercial purposes without written permission.