I've recently been testing some Monster VMs in my lab and setting up templates to provision them more rapidly. Along the way I've run into a few things I wasn't aware of that could be a potential barrier to Monster VMs, as well as some interesting behaviour during large storage migrations. When you start provisioning Monster VMs more often and moving them around, you are quite likely to run into the same problems. So hopefully this article will help you work around the problems I hit and avoid them until there is a more permanent solution. I also hope that our friends at VMware are aware of this behaviour and work towards addressing it in an upcoming release, so that apps can continue to love vSphere as much as they do today, even when the scale increases and Monster VMs become the norm. I would like to thank Duncan Epping, Andrew Mitchell and Frank Denneman for their help with understanding this problem and its possible solutions.
Firstly, let me qualify what I mean by a Monster VM. In this case I mean a VM with 8TB of disk (4 vSCSI controllers and 17 VMDKs), 16 vCPUs and 192GB RAM. So it's pretty big, but not massive compared to the maximums that vSphere 5.x can support. It was big enough, though, for the experiments I was going to run with Microsoft Exchange 2010, Exchange 2013 and Oracle 11g R2.
The Problem and Why It's a Barrier to Monster VMs
So what's the problem? Well, I was creating templates for these Monster VMs so I could provision them rapidly after my experiments. These templates are thin provisioned so I don't waste unnecessary space, and the systems I deployed from them to run the experiments were also thin provisioned. Where the problem comes in is that I had put these VMs on my bulk NAS storage until I needed them, so they didn't take up space on my higher, more valuable tiers of storage (see my lab environment). I hear you saying: so what could possibly cause a problem with that?
Well, that is the interesting part. It all comes down to a thing called a Data Mover. A Data Mover is the component within VMware vSphere that moves data from place to place (funny that) when doing operations such as Storage vMotion and other types of storage migrations (clones). In vSphere today (4.1 and up) there are three different Data Movers.
The three Data Movers are FSDM, which is the simplest, most compatible and slowest data mover (and sits highest in the IO stack); FS3DM, which is more advanced and faster; and FS3DM Hardware Accelerated, which offloads the copy to the array. I won't go into all the details here, as I think the best way to understand these is to read Duncan and Frank's book, vSphere 5.1 Clustering Deep Dive, and Frank's article VAAI hw offload and Storage vMotion between two Storage Arrays.
In my case I was migrating between arrays, going from my NAS, which is NFS, to either another NAS (NFS) or a VMFS5 datastore on an iSCSI array. My iSCSI array (HP StoreVirtual) and one of my NFS systems (Nutanix NX-3450) have VAAI enabled; my other NAS devices (QNAP 869 Pro, QNAP 469) don't have VAAI yet. This is where I hit a little snag. Going between arrays, regardless of their VAAI capability, won't use the FS3DM Hardware Accelerated Data Mover; that is only used within the same array. But going between a NAS and VMFS5, or from one NAS to another NAS, won't even use FS3DM. It will only use FSDM, the most compatible, but also the slowest and simplest of the Data Mover species. Why is this so important, I hear you say? Let me explain.
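To make it easier to reason about which Data Mover you will end up with, here is a rough sketch in Python of the selection rules as I observed them. This is a simplification for illustration only; the real selection logic inside ESXi considers more factors, and the function and parameter names are simply made up for this example.

```python
def expected_data_mover(src_type, dst_type, same_array, vaai_capable):
    """Rough approximation of the data mover a storage migration will use.

    src_type / dst_type: 'VMFS5' or 'NFS' (other cases not modelled here).
    same_array: True if source and destination live on the same array.
    vaai_capable: True if the array supports the relevant VAAI primitives.
    """
    if same_array and vaai_capable:
        # Within one array with VAAI the full copy can be offloaded.
        return 'FS3DM (hardware accelerated)'
    if 'NFS' in (src_type, dst_type):
        # Any NFS involvement without same-array offload falls back to FSDM,
        # which reads every allocated block, written or not.
        return 'FSDM'
    # VMFS5 to VMFS5, including across arrays.
    return 'FS3DM (software)'


print(expected_data_mover('NFS', 'VMFS5', same_array=False, vaai_capable=True))    # FSDM
print(expected_data_mover('VMFS5', 'VMFS5', same_array=False, vaai_capable=True))  # FS3DM (software)
print(expected_data_mover('VMFS5', 'VMFS5', same_array=True, vaai_capable=True))   # FS3DM (hardware accelerated)
```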
FSDM will read every block of a VM, even if it has never been written to. You read that correctly: even if the block has never existed, it will be read and transferred. This is very bad news if you have a Monster VM or Monster VM template using thin provisioning, and it affects every vendor using NFS with VMware vSphere. One of the reasons it probably isn't well known or noticeable is that most VMs have very small storage footprints, relatively speaking, especially when they are initially provisioned. Large production VMs also don't move around all that much.
So in my case one of the VMs had 8TB allocated with 5.65TB in use, and another VM had 5TB allocated with only 15GB in use. Guess what it did. It copied over all the real data, and then for all the VMDKs that had zero bytes physically written on disk it proceeded to copy terabytes' worth of zeros for every block that had never yet been written to. The storage migration took 3 days to complete! In case you were wondering, I have a 10G network, but the QNAP NAS is dual 1G connected.
In contrast to this, when I migrated or cloned the template VM with 15GB used and 5TB allocated from VMFS5 to VMFS5 and it used the FS3DM data mover (I did this across arrays), it took all of 24 minutes. 24 minutes (VMFS5 to VMFS5 across arrays) vs 3 days (NAS to VMFS5, NAS to NAS). You see my problem.
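To put some rough numbers around why this hurts so much, the little calculation below assumes about 100 MB/s of effective single-stream throughput. That figure is purely an assumption for illustration; real throughput depends on the NAS, NFS overhead and link contention, and my totals were spread across several large thin disks, which is how you end up with multi-day migrations.

```python
def copy_hours(size_gb, throughput_mb_per_s=100.0):
    """Hours to copy size_gb of data at the assumed effective throughput."""
    return (size_gb * 1024) / throughput_mb_per_s / 3600

# Template VM: 5TB allocated, only ~15GB actually written.
print(round(copy_hours(15), 2))        # ~0.04 hours (a few minutes) for the real data
print(round(copy_hours(5 * 1024), 1))  # ~14.6 hours just to stream the full allocation, mostly zeros
```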
The Workaround or Temporary Solutions
So I have considered a number of solutions or workarounds:
1. I could mount an iSCSI LUN on my NAS and use VMFS5 to store my templates, and make sure their destination is also VMFS5. This really isn't ideal, as there are a lot of advantages to NFS, simplicity being one of them. It still wouldn't help if I needed to deploy the template to an NFS datastore.
2. Only put the OS disks in my templates and leave off all the other VMDKs and SCSI controllers until after the VMs are provisioned. This isn't ideal either, as I'd have to re-provision all the apps and data each time I want to re-create a VM and run a test from a known good state.
3. I could provision and set up a VM on each of the storage devices I'm going to use for my experiments. This option is also not ideal, as I'd have to build these templates on each storage device. But at least, being thin provisioned, the templates won't take up much space initially.
4. Instead of using Storage vMotion I could use backup and restore, with say VMware VDP and VADP (vSphere APIs for Data Protection). This is potentially a good option, and might be quite fast.
5. I could set up vCloud Director and use Fast Provisioning. Provided I set up the template and use the same underlying datastore for all the copies, this would work fine. This is one of the advantages of NFS: you get one huge volume, but VM-level granularity. If I wanted to provision a copy to another datastore, though, it would probably be hit by the same problem.
6. It's NFS, so I could just copy the files using the filesystem rather than using a Storage vMotion or clone. This is fine in my lab environment, where the VMs will likely be powered off when I want to move them about, but in a production environment where the VMs need to remain powered on this is not going to work.
In a production environment, if I were looking to migrate a Monster VM that was running a database or Exchange or similar, I would probably set up a new instance on the destination and use the application-level utilities to migrate the data and switch over. Your requirements will vary, so evaluate the best options for your situation.
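If you script your migrations, a simple pre-flight check can at least warn you when the FSDM data mover is likely to be in play. The pyVmomi sketch below is illustrative only: the connection details, VM and datastore names, and the simple "is NFS involved" test are my own assumptions, not a supported tool.

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim


def find_by_name(content, vimtype, name):
    """Return the first managed object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.DestroyView()


ctx = ssl._create_unverified_context()  # lab only; validate certificates in production
si = SmartConnect(host='vcenter.lab.local', user='administrator@vsphere.local',
                  pwd='password', sslContext=ctx)  # placeholder connection details
content = si.RetrieveContent()

vm = find_by_name(content, vim.VirtualMachine, 'monster-vm-01')      # hypothetical VM name
target = find_by_name(content, vim.Datastore, 'vmfs5-datastore-01')  # hypothetical datastore

# Datastore summary.type reports 'VMFS' or 'NFS' (among others); if NFS is involved
# on either end, expect FSDM to read every allocated block, written or not.
source_types = {ds.summary.type for ds in vm.datastore}
if 'NFS' in source_types or target.summary.type == 'NFS':
    print('Warning: NFS involved on source or destination - expect the slow FSDM data mover.')

# Kick off the Storage vMotion (relocate) to the target datastore.
task = vm.RelocateVM_Task(vim.vm.RelocateSpec(datastore=target))

Disconnect(si)
```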
Putting This in Perspective
In order to experience this you need to be migrating, cloning, or deploying VMs or VM templates that have a source or destination on different storage devices, are backed by an NFS datastore, are thin provisioned, and have a large storage allocation (1TB+). So right now this might not impact many people, but the numbers will grow over time, especially when you consider that many cloud environments, particularly those using vCloud Director, are backed by NFS storage, as it's more efficient for cloud. Although many environments may still be using block storage for the majority of their workloads today, I think that will change; more and more vSphere environments are backed by NFS datastores, and that will only increase in the future.
Final Word
I think the real solution is that the process VMware uses for migrating large VMs between arrays where NAS or NFS is involved, especially when they are thin provisioned, needs to change. The FSDM Data Mover doesn't cut it, not for Monster VMs. Copying terabytes' worth of zeros that have never been written is of absolutely no value; it can be done much smarter. Perhaps this comes in the form of a new VAAI for NAS primitive, or perhaps VMware invents a new Data Mover. They have many smart people and are committed to Monster VMs and running these types of workloads in a Software Defined Datacenter or Hybrid Cloud, so I'm sure in due course a viable long-term solution will become available. Until then I've probably only thought of a subset of the options, so I'd like to hear your thoughts on this.
—
This post appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster +. Copyright © 2013 – IT Solutions 2000 Ltd and Michael Webster +. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Great article Michael thanks.
Great article. I am also testing Nutanix. I built two separate Nutanix clusters and Storage vMotioned a VM from Nutanix cluster one to Nutanix cluster two, and it was slow. When I mounted the NFS datastore from Nutanix cluster one on cluster two it was faster. If I understood it right, in the first case it uses FSDM and in the second FS3DM. I don't know if hardware offload is used.
Frank Denneman's article was also very helpful: http://frankdenneman.nl/2012/11/06/vaai-hw-offloa…
Also helpful is Birk's comment: if you use multiple NetApp boxes in cluster mode, VAAI will offload the Storage vMotion task.
http://dresxi.blogspot.de/2013/07/storage-vmotion…
• fsdm – This is the legacy 3.0 datamover which is the most basic version and the slowest, as the data moves all the way up the stack and down again.
• fs3dm – This datamover was introduced with vSphere 4.0 and contained some substantial optimizations so that data does not travel through all stacks.
• fs3dm – hardware offload – This is the VAAI hardware offload full copy that is leveraged and was introduced with vSphere 4.1. Maximum performance and minimal host CPU/Memory overhead.
Hi Michael, I had a link to Frank's article. Going between arrays will use FS3DM if the datastores are VMFS5. If the datastores are NFS, it will not use FS3DM, it'll use FSDM. Going between datastores on the same array, again provided they are VMFS5, it'll use FS3DM Hardware Accelerated. FS3DM Hardware Accelerated is not used between arrays.
Could your article mean that we might see a new data mover with version 5.5 Update 1 when VSAN is added? A good question is what happens when we build cluster one with 3 ESXi hosts and cluster two with 3 ESXi hosts, with VSAN enabled on both, and a VM is storage migrated from cluster one to cluster two.
For the old data movers the problem still exists. This would mean VMware needs a new one in ESXi 5.5 Update 1.
I'm not sure if there will be any new data movers or enhancements to the data movers in the vSphere 5.5 U1 timeframe. I'm also not sure what data mover VSAN uses. VSAN might look more like VMFS5 than NFS to the vSphere hypervisor, in which case it might use FS3DM. I've asked Duncan Epping and although he hasn't tested it, he thinks it uses FS3DM. I have not tested this aspect of VSAN. In terms of comparing the VAAI primitives on block and NAS arrays, you should check out Cormac Hogan's article here – http://cormachogan.com/2012/11/08/vaai-comparison…
Thanks Michael, your article is very interesting and I am very interested to hear about VMware's next improvements to their data movers. From reading your post and the comments, it seems that FSDM is always used during NFS to NFS operations, but what about when these volumes are on the same NAS? Won't FS3DM hardware accelerated be used?
The data mover used for NFS on the same array will depend on whether the VAAI plug-in is loaded and available and whether hardware acceleration is supported. But on the same array it should definitely be faster.
Hopefully. Thanks for the clarification Michael.
Michael, thanks for the great article.
Would you answer some questions to clarify some technical details, please?
I would like to clearly understand the prerequisites needed to ensure that no zero bytes are copied if I move a huge, thin provisioned VM from one physical storage array to a second physical storage array (not the same vendor).
You wrote "[….] VMFS5 to VMFS5 and it used the FS3DM [….]"
1. If FS3DM is used to move thin provisioned VMs, then zero bytes are never copied/replicated – correct?
2.1 Does VMware always use FS3DM in the case of VMFS5, regardless of the type and combination of storage?
2.2 NFS to iSCSI?
2.3 NFS to NFS?
2.4 iSCSI to iSCSI?
3.1 Is VMFS5 the only prerequisite to avoid copying never-written blocks/bytes?
3.2 Or does one of the storage arrays need to have VAAI support – or only the target?
3.3 If VAAI support is required, exactly which VAAI primitive is needed to ensure that no zero bytes are copied? (Reference: http://kb.vmware.com/selfservice/microsites/searc… )
Thanks
-AM-
Hi AM, if you're using NFS as either source or destination, my understanding is that it'll copy the non-existent zero bytes because it uses the FSDM data mover. If you're using VMFS5 datastores between different arrays, then it should use FS3DM and not copy the zero bytes. If you are migrating a VM on the same storage array, it will likely use FS3DM Hardware Accelerated if it is supported by the array.
Hi Michael,
Excellent article and very helpful.
It's an old post, so it would be great if you could help.
"FS3DM Hardware Accelerated is used if it is supported by the array."
Considering the above statement, I have a few questions:
1) What process (block replication etc.) does an EMC VNX follow in the background when a VMDK is moved from one datastore to another? In this case both datastores are on LUNs coming from the same array.
2) Are there any licensing requirements on the storage side for VAAI support?
Interesting post.
I am running vCSA 5.5 and ESXi 5.5 u1 with shared iSCSI storage all VMFS5.
If I Storage vMotion a thick provisioned 190GB vmdk it takes about 1 minute.
If I Storage vMotion a thin provisioned 6TB vmdk with only 190GB reported as used by ESXi it takes forever... still waiting... got bored after 1 hour.
Running a dedicated 10Gb storage network (end to end 10Gb with jumbo frames).
Same issue?
Yes, that sounds like the same issue. Does your iSCSI array have VAAI enabled? Is the svMotion going from one datastore on the same array to another datastore on the same array?
I just tried it again and all worked fine. I am unsure what happened. Will do some more testing
I figured it out.
We have SAN latency issues, which were the cause (too much IO with several svMotions)... so ignore this, as svMotion on iSCSI with a thin vmdk works properly.
[…] Really slow. If you have a monster VM, a vMotion can take a looooong time (worth reading: “VMware Storage vMotion, Data Movers, Thin Provisioning, Barriers to Monster VM’s” by Michael […]
Great article; this is exactly the info I was looking for. We're finding "issues" with this as we move hundreds of terabytes of THIN PROVISIONED VMDKs around our NetApp NFS volumes. Even though we have the VAAI NFS primitives working, there are limitations and we're seeing the need to read through ALL of those blocks even though they are completely empty. It's really slowing the process down and putting unneeded stress on the filers. I wish there was a better way, but I guess not until NetApp and/or VMware progress the VAAI NFS primitives.
Could well be something else going on. That should only happen with cloning operations. Not in other cases. Recommend raising a case with VMware Support.
Yeah, I think we're going to open a case with support just to see if they have any better ideas, but I think we're stuck with it working this way. Storage vMotion really is akin to a cloning operation, especially for powered-on VMs on NFS storage, so if the data mover is reading all the blocks, then there may be no way around it. For reference to NetApp NFS VAAI and how it works for powered-on VMs according to https://kb.netapp.com/support/index?page=content&id=3013572: "Storage vMotion on NFS Datastores is NOT offloaded to storage; only cold migration of VMs on NFS is supported for VAAI copy offload. In addition, NFS copy offload between volumes will be significantly slower than SAN copy offload between volumes due to hole punching." So, unless there's a way to force the use of a different mover, I think we're stuck with this process. You might want to check out Duncan's article (even though old) at http://www.yellow-bricks.com/2011/02/18/blocksize-impact/ where he talks about the different movers as well. The difference in his write-up is that he says the FSDM mover "gobbles the zeros" where the FS3DM does not.
Yeah, Storage vMotion and cloning are the same on a traditional array. That's why SnapClone was invented on NetApp. On Nutanix clones are done instantly as well. The only problem that then remains is Storage vMotion with NFS / VAAI NAS. Need VMware to fix the data movers.
The problem with the normal data mover is that it tries to copy the blocks even if they have never been written before. It's just silly. At least on some systems the zeros are ignored as a no-op.
Just found this & was curious… Is this still an issue with vSphere 6+?
To clarify your earlier comment re clones on Nutanix, would any clone (running VM or powered-off VM) done via VMware be instant (and have no bloat from empty data)?
Yes