Over the past few months I’ve been working with some of our talented engineers at Nutanix on a new feature, which will become generally available in a future release of the Nutanix AOS and AHV products. One Saturday afternoon while I was in the US, I had a thought: how much performance could we drive with this feature from a single VM? This is when I enlisted the help of Felipe Franciosi, Nutanix Sr Staff Engineer on the AHV team. Could we do 1 million IOPS in a single VM? So began a tuning process that would lead us to the Nutanix .Next Conference in Nice, France, and a world first. For this article I thought I’d spice things up and add live migration to the mix as well!
As with any major achievement there is a big team involved, and this time was no different. Felipe helped me get a single Ubuntu VM on a Nutanix NX9030 cluster up to 1M IOPS at a 4KB IO size, 100% random read. I ran a series of tests to make sure it could be sustained, and then thought: why not 8KB instead of 4KB? After more work with Felipe and some last minute tuning by Malcolm Crossley (Staff Engineer on the AHV team), we got to 1M IOPS at 8KB, 100% random read, and could sustain it for 24 hours. What was also impressive was that the latency was just 110 microseconds, or 0.11ms.
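For readers who want to try something similar, a tool like fio can generate this kind of workload from a Linux guest. Below is a minimal sketch of an fio run that approximates the 8KB 100% random read pattern; the device path, queue depth, and job count are illustrative assumptions, not the tuned values from our test.

```bash
# Sketch of an fio run approximating the 8KB, 100% random read test.
# The device path, queue depth, and job count are illustrative only.
fio --name=randread-8k --ioengine=libaio --direct=1 \
    --rw=randread --bs=8k \
    --iodepth=32 --numjobs=16 \
    --filename=/dev/sdb \
    --time_based --runtime=86400 \
    --group_reporting
# runtime=86400 runs the job for 24 hours, matching the sustained test.
```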
1 million IOPS in a single VM has been done before; VMware and Microsoft have both demonstrated it. However, neither case was on a hyperconverged infrastructure platform, and neither involved so little complexity. In the VMware case there were hundreds of LUNs, zoning, masking, and two all-flash storage arrays, all for one VM. For this test we had a small cluster of just 10 x NX9030 nodes (which could be scaled out much further), and the VM was configured with 33 locally attached disk volumes (one for the OS, plus the rest for the IO load).
Here is the image I captured while running some of the initial sustained IO tests from this single VM on the Nutanix cluster.
This raises the question: do you really need 1M IOPS in a single VM, and do you want it to be random read? In some cases the answer might be yes, even if the requirement isn’t quite as high as 1M. There are many applications that can benefit from lots of very low latency read operations, such as payment gateways for financial institutions.
Some might also ask why a single VM is that interesting when most deployments these days run many VMs on a scalable platform. I agree, but the challenge for distributed scale-out systems, such as Nutanix, has always been scale-up performance. Scale-out performance for many VMs is trivial, whereas getting a single very large VM to very high performance levels is a much trickier challenge. This is especially important for large data warehouse environments.
To validate the capabilities for a large scale-up data warehouse, I ran a 70% read, 30% write, 100% random workload at a 64KB IO size, similar to what a SQL Server database might generate. The following image is the result of this test.
As you can see, with this configuration a single VM could sustain 13GB/s of throughput at 70% read, 30% write, with a working set of more than 3TB. This proves that this configuration could support a very large data warehouse from an IO throughput perspective. For the writes we use RDMA in the form of RoCE (RDMA over Converged Ethernet) and Mellanox 40GbE CX3 Pro adapters. All uplinks go to Mellanox SN2700 switches to provide the most consistent performance with the lowest latency and a lossless fabric for RDMA.
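If you want to approximate this kind of data warehouse pattern yourself, here is a minimal fio sketch for the 64KB, 70/30 random mix; as before, the device path, queue depth, and job count are illustrative assumptions rather than the values used in the actual test.

```bash
# Sketch of a 64KB, 70% read / 30% write, 100% random fio run,
# approximating the data warehouse style workload described above.
# Device path and parallelism settings are illustrative only.
fio --name=dw-7030 --ioengine=libaio --direct=1 \
    --rw=randrw --rwmixread=70 --bs=64k \
    --iodepth=32 --numjobs=8 \
    --filename=/dev/sdc \
    --time_based --runtime=3600 \
    --group_reporting
```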
Was this all the performance the cluster had? Well, no. The 10 nodes could achieve much higher performance when the number of VMs is scaled out, as the following image demonstrates. With Windows and IOMeter I could achieve 28GB/s of throughput from the 10 node cluster with a 64KB IO size and 100% random reads.
When the IO pattern is changed to 70% random read and 30% random write, the throughput drops to 21GB/s for the 10 node cluster.
What this clearly demonstrates is that there is a benefit to scaling out the number of virtual machines, even when a single VM can harness a good proportion of the cluster’s IO performance. Just for fun, let’s have a look at a 32KB 100% random read result, an IO size that is popular with some all-flash storage vendors.
22GB/s from 10 VMs running a 100% random read workload on a 10 node Nutanix NX9030 cluster. This cluster is quite small and could easily be expanded to 32, 48, or even more nodes, with performance scaling linearly as it grows. This is a benefit of how the Nutanix Acropolis Distributed Storage Fabric works: it scales linearly, predictably, and consistently as nodes are added to the cluster and as workload is added to consume it.
This is of course expected, if you’re not also live migrating and upgrading the environment at the same time. But what happens if you also introduce live migration? I’m glad you asked. During a test where a single VM was doing 1M IOPS at 8KB random read, I live migrated the VM. You can see from the short video below that the VM moved from one host to the next, and that IOPS dropped slightly during this process before returning to normal. You can also see in the video that this wasn’t the only time a live migration had happened.
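For those wondering how a migration like this is triggered: on AHV it can be done with one click in Prism, or from a Controller VM using acli. A sketch along these lines, where the VM name and host address are placeholders and the syntax is assumed to match the acli of this AHV era:

```bash
# Hypothetical example: live migrate the benchmark VM mid-test.
# "perf-vm" and the host address below are placeholders.
acli vm.migrate perf-vm host=10.0.0.12

# Or omit the host and let the scheduler choose a suitable target:
acli vm.migrate perf-vm
```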
The Nutanix AHV hypervisor is smart enough to balance CPU, RAM, and storage IO resources to achieve the best possible quality of service for all VMs running on the platform. As the video demonstrates, there is no loss of performance even after the migration, which is as we would expect. Check out Josh Odgers’ article here, which includes a look at where the bottlenecks will move in future.
How would this change if we introduced writes to the mix as well? With an 8KB IO size at 70% read and 30% write, the expectation is that it would behave the same. The following video shows a single VM being live migrated while doing ~600K IOPS and more than 4GB/s of throughput.
The main feature being demonstrated here is AHV Turbo, which takes the IO path out of the AHV hypervisor kernel and will be available in the upcoming AOS 5.5 release with a simple one-click upgrade. All VMs on AHV will benefit from AHV Turbo without any other changes. However, if you want to maximize performance, you should enable multi-queue block IO in Linux (a sketch of how follows below) and use the latest virtio-scsi drivers for Windows. The single VM tests also make use of another feature, which will be released a little further out, in 5.5.1. That feature allows a single VM to use more storage IO resources to gain more performance when required, and it works seamlessly with the Acropolis Distributed Scheduler. Lastly, the VM utilizes virtual NUMA, which is coming in the AHV version that ships with AOS 5.5. We first demonstrated AHV Turbo in Washington, DC at the Nutanix .Next Conference US. Josh Odgers has some other highlights of the 8KB 70/30 test in his article here.
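As an example of the Linux guest tuning mentioned above, here is a sketch of enabling multi-queue block IO (blk-mq) for the SCSI layer on an Ubuntu guest of this era; the exact steps vary by distribution and kernel version, so treat this as a starting point rather than the definitive procedure.

```bash
# Sketch: enable multi-queue block IO (blk-mq) for the SCSI layer
# on an Ubuntu guest. Steps vary by distro and kernel version.

# 1. Edit /etc/default/grub and append the parameter to GRUB_CMDLINE_LINUX:
#    GRUB_CMDLINE_LINUX="... scsi_mod.use_blk_mq=1"

# 2. Regenerate the grub config and reboot:
sudo update-grub
sudo reboot

# 3. After the reboot, confirm blk-mq is active for the SCSI stack:
cat /sys/module/scsi_mod/parameters/use_blk_mq   # should print "Y"
```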
After an upgrade to AHV we are now able to sustain 1.2M IOPS at an 8KB IO size, 100% random read, at much lower CPU utilization and lower latency. The image is below:
Nutanix software (Acropolis, Prism, and AHV) allows you to unleash the power of modern server hardware and achieve significant performance for even the most demanding applications. As new hardware innovations come to industry standard platforms, the Nutanix software will be able to take advantage of them immediately. This allows a continuous cycle of performance and scalability improvements without ever having to consider a forklift upgrade again. Whether you choose Intel x86 or IBM Power architecture, AHV is a platform that can achieve performance results other hypervisors wish for. We have only just begun to unleash the true power of AHV, and there is much more to come. Stay tuned, and you will see more at the .Next conference in New Orleans in May 2018.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster. Copyright © 2012 – 2017 IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.