Over the past few months I’ve been working with some of our talented engineers at Nutanix on a new feature, which will become generally available in a future release of the Nutanix AOS and AHV products. One Saturday afternoon while I was in the US, I had a thought: how much performance could we drive with this feature from a single VM? This is when I enlisted the help of Felipe Franciosi, Nutanix Sr Staff Engineer on the AHV team. Could we do 1 million IOPS in a single VM? So started a tuning process that would lead us to the Nutanix .Next Conference in Nice, France, and a world first. For this article I thought I’d spice things up and add live migration to the mix as well!
As with any major achievement there is a big team involved, and this time was no different. Felipe helped me get a single Ubuntu VM on a Nutanix NX9030 cluster up to 1M IOPS at 4KB, 100% random read. I ran a series of tests to make sure it could be sustained and then thought, why can’t we do 8KB instead of 4KB? After more work with Felipe and some last minute tuning by Malcolm Crossly (Staff Engineer on the AHV team), we got to 1M IOPS at 8KB 100% random read and could sustain it for 24 hours. What was also impressive was that the latency was just 110 microseconds, or 0.11ms.
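To put those headline numbers in perspective, here is a quick back-of-the-envelope calculation. This is my own sketch, not output from the test harness; it simply treats the 110 microseconds as the average per-IO latency and applies Little’s Law:

```python
# Back-of-the-envelope check of the headline numbers above (my own sketch;
# IOPS, IO size, and latency come from the text, not from the test output).
iops = 1_000_000          # 1M IOPS sustained
io_size_bytes = 8 * 1024  # 8KB per IO
latency_s = 110e-6        # 110 microseconds, i.e. 0.11ms

throughput_gb_s = iops * io_size_bytes / 1e9
print(f"Throughput: {throughput_gb_s:.1f} GB/s")          # ~8.2 GB/s

# Little's Law: average IOs in flight = arrival rate x average latency
outstanding_ios = iops * latency_s
print(f"Average outstanding IOs: {outstanding_ios:.0f}")  # ~110
```

In other words, 1M IOPS at 8KB works out to roughly 8GB/s of read throughput, sustained with only around 110 IOs in flight at any moment.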
1 million IOPS in a single VM has been done before; VMware and Microsoft have both demonstrated it. However, neither case was on a hyperconverged infrastructure platform, and neither involved so little complexity. In the VMware case there were hundreds of LUNs, plus zoning, masking, and two all-flash storage arrays, for 1 VM. For this test we had a small cluster of just 10 x NX9030 nodes (which could be scaled out much further), and the VM used is configured with locally attached disk volumes, 33 in total (1 for the OS, plus the rest for the IO load).
Here is an image I captured while running some of the initial sustained IO tests from this single VM on the Nutanix cluster.
This raises the question: do you really need 1M IOPS in a single VM? Do you want it to be random read? In some cases the answer might be yes, even if it doesn’t need to be quite as high as 1M. There are many applications that can benefit from lots of very low latency read operations, such as payment gateways for financial institutions.
Some might also ask why a single VM is that interesting when most environments these days run many VMs on a scalable platform. I agree, but the challenge for distributed scale-out systems, such as Nutanix, has always been scale-up performance. Scale-out performance for many VMs is trivial, whereas getting a single very large VM to reach very high performance levels is a much trickier challenge. This is especially important for large data warehouse environments.
To validate the capabilities for a large scale-up data warehouse, I ran a 100% random workload of 70% reads and 30% writes at a 64KB IO size. This is similar to what a SQL Server database might generate. The following image shows the result of this test.
As you can see, with this configuration a single VM could sustain 13GB/s of throughput at 70% read, 30% write, with a working set of more than 3TB. This proves that this configuration could support a very large data warehouse from an IO throughput perspective. For the writes we use RDMA in the form of RoCE (RDMA over Converged Ethernet) and Mellanox 40GbE CX3 Pro adapters. All uplinks are to Mellanox SN2700 switches to provide the most consistent performance with the lowest latency and a lossless fabric for RDMA.
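A similar 70/30 random 64KB pattern can be generated with fio. The following Python wrapper is only a minimal sketch: the device names, queue depth, job count, and runtime are illustrative assumptions, not the configuration used in the actual test.

```python
# Minimal sketch of an fio invocation producing a 70/30 random 64KB mix.
# WARNING: this writes directly to the raw devices listed below; only run
# it against disks whose contents you are happy to destroy.
import subprocess

# Assumed data vDisks for illustration only (e.g. /dev/sdb .. /dev/sde).
data_disks = [f"/dev/sd{chr(c)}" for c in range(ord("b"), ord("b") + 4)]

cmd = [
    "fio",
    "--name=dw-7030",
    "--rw=randrw",            # mixed random read/write
    "--rwmixread=70",         # 70% reads, 30% writes
    "--bs=64k",               # 64KB IO size
    "--ioengine=libaio",
    "--direct=1",             # bypass the guest page cache
    "--iodepth=32",           # assumed queue depth per job
    "--numjobs=1",
    "--time_based", "--runtime=300",
    "--group_reporting",
    "--filename=" + ":".join(data_disks),  # spread the job across the vDisks
]
subprocess.run(cmd, check=True)
```

The same approach covers the other patterns in this post by changing `--rw`, `--rwmixread`, and `--bs`.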
Was this all the performance the cluster had? Well, no. The 10 nodes could achieve much higher performance when the number of VMs is scaled out, as the following image demonstrates. With Windows and IOMeter I could achieve 28GB/s of throughput from the 10-node cluster with a 64KB IO size and 100% random reads.
When the IO pattern is changed to 70% random read and 30% random write, the throughput drops to 21GB/s for the 10-node cluster.
What this clearly demonstrates is that there is a benefit to scaling out the number of virtual machines, even when a single VM can harness a good proportion of the cluster’s IO performance. Just for fun, let’s have a look at a 32KB 100% random read result, an IO size that is popular with some all-flash storage vendors.
22GB/s from 10 VMs running a 100% random read workload on a 10-node Nutanix NX9030 cluster. This cluster is quite small and could easily be expanded to 32, 48, or even more nodes, with performance scaling linearly as it grows. This is a benefit of how the Nutanix Acropolis distributed storage fabric works: scaling linearly, predictably, and consistently as nodes are added to the cluster and as workload is added to consume it.
This is of course expected, if you’re not also live migrating and upgrading the environment at the same time. But what happens if you do introduce live migration? I’m glad you asked. During a test where a single VM was doing 1M IOPS at 8KB random read, I live migrated the VM. You can see from the short video below that the VM moved from one host to the next, and IOPS dropped slightly during this process before returning to normal. You can also see in the video that this wasn’t the only time a live migration had happened.
The Nutanix AHV hypervisor is smart enough to balance CPU, RAM, and storage IO resources to achieve the best possible quality of service for all VMs running on the platform. As the video demonstrates, there is no loss of performance even after the migration, which is as we would expect. Check out Josh Odgers’ article here, which includes where bottlenecks will move in the future.
How would this change if we introduced writes to the mix as well? With an 8KB IO size, 70% read and 30% write, the expectation is that it would behave the same. The following video shows a single VM being live migrated while doing ~600K IOPS and > 4GB/s of throughput.
The features being demonstrated here include AHV Turbo (taking the IO path out of the AHV hypervisor kernel), which will be available in the upcoming AOS 5.5 release with a simple one-click upgrade. All VMs on AHV will benefit from AHV Turbo without any other changes. However, if you want to maximize performance you should enable multi-queue block IO in Linux and use the latest virtio-scsi drivers for Windows. The single VM tests also make use of another feature, which will be released further out, in 5.5.1. That feature allows a single VM to use more storage IO resources to gain more performance if required, and it works seamlessly with the Acropolis Distributed Scheduler. Lastly, the VM utilizes virtual NUMA, which is coming in the AHV release that ships with AOS 5.5. We first demonstrated AHV Turbo in Washington, DC at the Nutanix .Next Conference US. Josh Odgers has some other highlights of the 8KB 70/30 test in his article here.
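On the Linux side, multi-queue block IO (blk-mq) was not always enabled by default on older kernels; it is typically switched on with the scsi_mod.use_blk_mq=1 kernel boot parameter. A quick, generic way to check from inside the guest is to inspect sysfs. This is just a sketch of that check, not a Nutanix-specific tool:

```python
# Report, from inside a Linux guest, whether each SCSI block device is using
# the multi-queue (blk-mq) path and how many hardware queues it has.
# Generic sysfs inspection; assumes a reasonably modern kernel layout.
import glob
import os

for dev in sorted(glob.glob("/sys/block/sd*")):
    name = os.path.basename(dev)
    # Each subdirectory under <device>/mq/ is one hardware queue context.
    mq_dirs = glob.glob(os.path.join(dev, "mq", "*"))
    if mq_dirs:
        print(f"{name}: blk-mq enabled, {len(mq_dirs)} hardware queue(s)")
    else:
        print(f"{name}: legacy (single-queue) IO path")
```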
[Updated 14/12/2017]
After an upgrade to AHV we are now able to sustain 1.2M IOPS at an 8KB IO size, 100% random read, at much lower CPU utilization and with lower latency. The image is below:
Final Word
Nutanix software (Acropolis, Prism, and AHV) allows you to unleash the power of modern server hardware and achieve significant performance for even the most demanding applications. As new hardware innovations come to industry standard platforms, the Nutanix software will be able to immediately take advantage of them. This will allow a continuous improvement cycle of performance and scalability without ever having to consider a forklift upgrade again. Whether you choose Intel x86 or IBM Power architecture, AHV is a platform that can achieve performance results that other hypervisors wish for. We have only just begun to unleash the true power of AHV and there is much more to come. Stay tuned and you will see more at the .Next conference in New Orleans in May 2018.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster +. Copyright © 2012 – 2017 – IT Solutions 2000 Ltd and Michael Webster +. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Hi Michael,
It’s refreshing to see an HCI single VM benchmark. I’ve been asking for this out of vSAN, Datrium, and even you guys (Nutanix) for a little while now, for the very reason that you brought up. My DBAs don’t care that an entire cluster can do 1.2 million IOPS @ <1ms latency; they care what their one big server's performance is. So kudos to you guys for taking the first step, and I say that as a person who's not really a fan of HCI (yet).
What I would love to see you guys (and other hyper / "open" converged vendors) show is a more realistic scenario (or scenarios). 33 disks in a single VM, even a large one, is likely not realistic, any more than the VMware benchmarks you outlined were. I'm sure there are edge cases where folks have that many disks in an attempt to squeeze every last IO out (or to show what a single VM can do, like in your case).
IMO, what might be more useful for folks is showing what a single VM with a single disk delivers. That's probably 80% of the VMs out there. Meaning, if I fire up IOMeter on a single vdisk, what kind of IO can you deliver? Maybe even show it scaling up to 33 disks if you wanted: start with one disk, go to two, four, eight, sixteen, and then thirty-two. Then when you're comparing HCI solutions, we can look at which vendor provides the best single VM performance *and* aggregate VM performance.
I also feel it's important for HCI vendors to show what their resiliency settings are, especially when it comes to write benchmarks. It's like being required to show your work in school. If one vendor can only hit "x" write IOPS with their resiliency set to one host failure, how realistic is that configuration versus something like three host failures, which is probably more common?
Anyway, really great article, and thanks for demonstrating what a single VM can drive in your solution.
Hi Michael, I read through your blog. You still have not posted the fio latency. The blog only has “cluster-wide controller latency”, and that is some internal Nutanix thing that is not relevant to apps. Apps care about latency as seen by the app. Care to share the fio latency? Thanks.
Eric, I would acknowledge that 33 disks is not likely to be real world representative, but in all fairness, neither is one disk. High IO servers have for years been configured with multiple virtual disks to get extra performance. A common example: SQL Server with a separate boot partition, DB drive, log drive, and TempDB drive, all on separate paravirtualized interfaces.
The recent work Nutanix is doing in AHV has been impressive, and where queue depth has been a limiting factor in the past, I’m impressed with where they’re going.
Hi Eric, some good points there, thanks for the feedback. 1 million IOPS is not real world or app specific; it's just a number to show a single VM can do an unrealistically high IO workload. There are just too many variables in how apps and DBs work, and no two environments are the same, so the numbers aren't directly comparable. But you hit the point of this, which is to show how a single large VM might behave, such as a DB.

To get to single vDisk performance you can just divide this number by 32, as the OS disk was issuing no IO (roughly 40K IOPS per vDisk). The scalability is linear, which is one of the main benefits of a scale-out platform. There are many problems with synthetic tests of any sort, and IO size, pattern, randomness, IO type, and other factors mean that your mileage will always vary. That's why I included data from multiple tests and multiple scenarios, to give some sort of indication. Also, there is plenty of data already published for single vDisk, multiple VM tests and other scenarios not covered by this post.

For most large databases, especially the critical high performance ones, there will be multiple vDisks. This has been a best practice since before HCI was invented, so single vDisk tests aren't as relevant. For example, a large healthcare SQL Server data warehouse app would usually have 8 drives for data files, another 8 for TempDB, 1 transaction log drive for the data files, and 1 transaction log drive for TempDB. So you'd end up with about 20 vDisks, assuming this is not a system with tens or hundreds of databases on the same instance (which can be the case as well). For the config DBs the number of vDisks is less important.

In the case of that healthcare system specifically, the performance of a single VM on the system I had was 14GB/s for large read IOs, which simulates how the app does reporting, and 6GB/s for the ETL data load portion. This would cover a significantly large proportion of very large data warehouse and OLTP database environments. Also, it should be noted that these numbers aren't maximums, as the cluster I had was limited in terms of node count. We could quite easily keep scaling out the cluster to increase performance, in addition to adding more VMs and spreading the workload across them.
[…] vDisks required” point is also verified in Michael Webster’s post 1 Million IOPS in 1 VM – World First for HCI with Nutanix. Where he states “The VM used is configured with locally attached disk volumes, 33 in total […]
Single vDisk performance is still limited by the OS, and it's just a reality that you need to split data files across multiple disks and controllers to maximise performance for a database. Throughput is bound by IO size, and IOPS at small IO sizes is bound by queue depth and response time. Pretty much every SQL DB out there could significantly improve its performance if configured properly. At scale these improvements matter. Good DBAs know this, and it's very easy to optimise systems with modern orchestration frameworks. You refer to the same Brent Ozar that said you don't allocate data files and TempDB files based on CPU configuration, and then Microsoft made it the default for SQL Server 2016. He doesn't have much experience with high performance and large systems. It's also not required to partition tables; that's optional. But having the databases and data files, including TempDB, split over multiple vDisks is important to get the best performance. You also don't want a situation where a single element of a system can max out the performance, else noisy neighbour will hit and you will get unpredictable performance unless you have built-in QoS. These are all things already addressed by Nutanix and other well designed systems that can scale to any required performance.
[…] to have a rock solid solution. We were the first and still only hyperconverged vendor to achieve 1 Million IOPS in a Single VM. All of which is great, but what if you have a small use case, a small retail shop (or lots of […]