I recently read an article that announced Microsoft and Intel had achieved 1 Million iSCSI IOPS out of a single 10 GbE NIC. This is by no means a small achievement, on the surface at least. When you dig into the article a bit deeper you find out that the headline is not quite what it’s cracked up to be. Want to find out the full story?
Here I’ll compare the results of the Microsoft Windows 2008 R2 and Hyper-V test, as presented on Stephen Foskett’s blog and in the Intel whitepaper, with the 1 Million IOPS test documented by VMware on VMware vSphere 5.
Firstly though let’s acknowledge that 1 Million IOPS is a pretty big achievement. But IOPS by themselves aren’t the only story, and not every IO is created equal. For those IOPS values to be useful they need to be low latency, and each IO needs to be large enough to carry a useful amount of data for an application. You need to know whether they were reads or writes, and what the ratio or bias was. So it’s no use using an artificially low IO size for the purposes of a benchmark to grab a headline. Oh wait, that’s what Intel/Microsoft did!
I also want to say up front I love the Intel X520 10GbE NIC cards and I have them installed in the servers in my lab, so I am very pleased to see these results from a NIC standpoint. It also raises the bar for iSCSI, even if the attention grabbing headline isn’t exactly what it appears at first glance.
Microsoft Windows 2008 R2 and Hyper-V Storage Performance
It’s good that figures were published for a variety of IO sizes, so that we know what some of the performance characteristics are, as you can see from the image below. I’m linking directly to Stephen Foskett’s blog for this. The throughput tops out at 4KB IO size, at just under 600K IOPS and just over 2000 MBytes/s. The 8KB IO size delivers the same throughput, but only around 300K IOPS. IOPS drop dramatically from there, because the test has hit a bottleneck, which in this case was the maximum throughput of a single 10GbE NIC (full duplex). Interestingly the testing was done using a standard MTU size of 1500 bytes. It would have been nice to see the Jumbo Frame equivalent.
You can see from the graph that the magical 1 Million IOPS was reached with a 512 byte IO size. This is very unrealistic for most applications as they will use either 4K (NTFS default block size), 8K, or above. Intel acknowledges this in their whitepaper, which I have linked to at the bottom of this article. The throughput at this level from the graph is a decent 500MBytes/s. What the article and Intel’s whitepaper don’t give us is average IO latency. This test was from a raw Windows 2008 R2 OS. Interestingly the 300K IOPS shown by the 8K IO test matches the tested result for a single VM on VMware vSphere 5, which I’ll get into later.
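To make the relationship between IOPS, IO size and throughput concrete, here is a minimal Python sketch. The IOPS values are the approximate figures quoted above (read off the graph, so not exact measurements), the function and variable names are my own, and the line rate is the raw 10GbE figure with no allowance for Ethernet, TCP or iSCSI overhead.

```python
# Approximate figures quoted in the article text, not exact measurements.
# Raw 10GbE line rate per direction in decimal MB/s (protocol overhead ignored).
LINE_RATE_10GBE_MB_S = 10_000_000_000 / 8 / 1_000_000


def throughput_mb_s(iops: int, io_size_bytes: int) -> float:
    """Throughput in decimal MB/s for a given IOPS rate and IO size."""
    return iops * io_size_bytes / 1_000_000


RESULTS = [
    ("512 B", 1_000_000, 512),
    ("4 KB", 600_000, 4 * 1024),  # the article says "under 600K"
    ("8 KB", 300_000, 8 * 1024),
]

for label, iops, size in RESULTS:
    mb_s = throughput_mb_s(iops, size)
    # Compare against both directions of the link combined (full duplex).
    share = mb_s / (2 * LINE_RATE_10GBE_MB_S)
    print(f"{label:>5}: {iops:>9,} IOPS -> {mb_s:7.1f} MB/s "
          f"({share:.0%} of raw full-duplex line rate)")
```

The 512 byte result works out to roughly a fifth of what the link can carry, while the 4KB and 8KB results sit close to the raw full-duplex ceiling, which is consistent with the NIC being the bottleneck at the larger IO sizes.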
The next great image from Stephen’s article is a comparison between native Windows 2008 R2 and 10 VM’s running inside Hyper-V. You can see the remarkable similarity in performance compared to native at the higher more realistic block sizes. But there is a big divergence between native and Hyper-V at smaller block sizes, even at 4K it is significant. I’ve linked directly to Stephen’s blog again and made the image bigger so it’s easier to read.
So there we have it: high performance from a high performance OS, at least equivalent to the maximum tested for a single VM on VMware vSphere 5. This leads into the VMware 1 Million IOPS test. I’ll give you a bit of detail on the setup; for the rest you can refer to the VMware whitepaper and blogs.
VMware vSphere 5 Storage Performance
VMware conducted benchmark testing at the EMC Lab using a single host configured with the following specifications:
Quad Socket, 10Core E7-4870 CPU’s at 2.4GHz
6 x Emulex LPe12002 8Gb/s HBA’s
The storage was provided by a VMAX that was maxed out. 8 engines each quad socket quad core with 128GB cache, 64 front end 8 Gb/s ports, 64 back end 8Gb/s ports and 960 15K 450GB FC Disks.
The maximum IOPS of the array itself was approximately 1.5 Million IOPS. The array was configured with enough cache that the entire test could be serviced out of cache. The IO tests used a 100% Read/100% Random pattern. Up to 6 x Windows 2008 R2 VM’s were used to generate the IO for the VMware tests. This allows a comparison between the two benchmark tests to be drawn directly. This also demonstrates the performance capability of Windows 2008 R2, even if both benchmarks are artificial and not real world.
Here are the results of the VMware performance test at a high level:
- A single vSphere 5 host is capable of supporting a million+ I/O operations per second.
- 300,000 I/O operations per second can be achieved from a single virtual machine.
- I/O throughput (bandwidth consumption) scales almost linearly as the request size of an I/O operation increases.
- I/O operations on vSphere 5 systems with Paravirtual SCSI (pvSCSI) controllers use less CPU cycles than those with LSI Logic SAS virtual SCSI controllers.
Here is the IOPS graph taken directly from the VMware whitepaper. This graph contains a lot of very important information.
As you can see from the graph the IOPS scale very nicely all the way up to just over 1 Million IOPS. Very importantly, the latency is still just under 2 ms. Just as importantly, this test was done using an 8KB IO size. So while this is not a real world test, it is at least using a real world IO size. At the peak of this test this single host was doing >8GB/s throughput. This is quite a different story compared to the previous Microsoft Windows 2008 R2 and Hyper-V test, which couldn’t reach this point with as few VM’s or with as large an IO size. Perhaps if they had added more 10GbE NIC’s it would have continued to scale; we shall never know.
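As a rough back-of-the-envelope check on the "more 10GbE NIC’s" question, the sketch below estimates how many 10GbE links would be needed just to carry 1 Million 8KB reads per second. It is a sketch under generous assumptions that are mine, not from either whitepaper: raw line rate of 1250 MB/s per direction, no protocol overhead, and perfectly linear scaling across NICs, which neither test demonstrates. The function name is hypothetical.

```python
import math

# Assumption: raw 10GbE line rate in one direction, decimal MB/s, no overhead.
NIC_MB_S = 1250.0


def nics_needed(iops: int, io_size_bytes: int, nic_mb_s: float = NIC_MB_S) -> int:
    """Minimum number of NICs whose combined one-way raw bandwidth covers the load."""
    required_mb_s = iops * io_size_bytes / 1_000_000
    return math.ceil(required_mb_s / nic_mb_s)


# 1 Million IOPS at 8KB, 100% read, as in the VMware test.
print(nics_needed(1_000_000, 8 * 1024))

# 1 Million IOPS at 512 bytes, as in the Intel/Microsoft headline figure.
print(nics_needed(1_000_000, 512))
```

Even under these optimistic assumptions the 8KB workload needs several 10GbE links, whereas the 512 byte headline result fits comfortably within one. Real iSCSI traffic would need more headroom than this estimate allows.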
There are a number of other tests documented in the VMware whitepaper including CPU utilization between different VM SCSI adapter types, IOPS and Latency for different VM SCSI adapter types for different IO sizes. I will let you read the results from the whitepaper as there is no direct comparison available with the Microsoft/Intel Tests.
You will note from the above graph that a single VM is doing just less than 200K IOPS. This is not the maximum performance of a single VM, and the configuration of the VM’s was only designed to saturate the performance of the host. Let’s now look at Single VM performance.
Single VM vSphere 5 Performance
The setup for the Single VM test involved changing the Guest VM configuration. A 16 vCPU VM was used with 16GB RAM. The test results were graphed per virtual SCSI Controller, using the VMware Paravirtual SCSI Controller (pvSCSI). Here is the graph of the results from the VMware whitepaper.
The above graph shows the scalability of IOPS per controller from 1 to 4 pvSCSI controllers. Each controller was configured with 10 virtual disks. The IO pattern was reused from the Multi-VM tests. Just over 300K IOPS was achieved at just over 2ms latency. Given this is again an 8KB IO size test, this is the same as the IOPS result for the native Windows 2008 R2 test published by Intel in their whitepaper.
While I was reading the VMware whitepaper I noticed that they changed a significant setting during some of their tests. The number of Outstanding IO Operations was changed from 32 to 16 in some of the tests. This would have the effect of reducing the load on the back end storage and resulted in less overall performance for pvSCSI during the comparison tests with LSI SAS. Normally when running experiments you want to change as few variables as possible between tests. I decided to find out why this was done, so I contacted Chethan Kumar, the author of the paper, directly. Here is what he said:
“Maximum queue depth supported by LSI driver (in guest) cannot be changed. So to keep the aggregate outstanding I/Os per controller lower than the max queue depth we had to use 16 OIOs per vDisk. To have a fair comparison between LSI and pvSCSI, second test also had 16 OIOs per vDisk for pvSCSI as well. Test case 1 just focused on achieving max IOPs (which happened to be with pvSCSI). Default max queue depth supported by pvSCSI driver can be increased. We raised it to a large enough value such that we could support 32 OIOs per vDisk.”
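The link between outstanding IOs, IOPS and latency can be sketched with Little's Law (average IOs in flight = IOPS × latency), which holds for a system in steady state. The snippet below is only an illustration: it uses the rounded single VM figures from this article, and it assumes the 10 virtual disks per controller from the single VM setup also applied to the LSI versus pvSCSI comparison, which I have not confirmed. The function names are my own.

```python
def ios_in_flight(iops: float, latency_ms: float) -> float:
    """Little's Law: average number of IOs in flight at steady state."""
    return iops * latency_ms / 1000


def per_controller_oio(vdisks: int, oio_per_vdisk: int) -> int:
    """Aggregate outstanding IOs a single virtual SCSI controller must queue."""
    return vdisks * oio_per_vdisk


# Rounded single VM result from the article: ~300K IOPS at ~2 ms.
print(ios_in_flight(300_000, 2.0))

# Assumed 10 virtual disks per controller at each OIO setting.
print(per_controller_oio(10, 16))
print(per_controller_oio(10, 32))
```

Halving the outstanding IOs per virtual disk halves what each controller has to queue, which is why the setting matters when the guest driver's maximum queue depth cannot be raised.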
So there you have it: a comparison between the Intel and Microsoft Windows 2008 R2 and Hyper-V storage performance test, which achieved 1 Million IOPS on iSCSI at a 512 byte IO size, and the VMware test, which achieved 1 Million IOPS on a single vSphere 5 host at 8KB IO size and 2ms latency with a single VMAX array. It would be good to see some more scalability tests from Microsoft on Windows 2008 R2, so hopefully this article prompts them to publish some, or prompts someone to send a full IO scalability test my way to review. It would also be great to see a comparative iSCSI test using vSphere 5 and the same Intel NIC’s.
Neither of the tests is real world by any stretch, and there is a big difference between 6 x 8Gb/s HBA’s and 1 x 10GbE NIC. The point of this article is really about what the claim of 1 Million IOPS amounts to, and that not all IOPS are created equal. The iSCSI performance is still definitely very respectable.
I’m very pleased with the Intel NIC’s and I’ll continue to buy them; they have been rock solid and the performance has been exceptional. I’ll also be sticking with vSphere in terms of performance, not that any real world application I’ve found in an enterprise environment currently needs 300K IOPS per VM or 1 Million IOPS per host. No doubt in the future this will change.
My final thought is this. I wonder what the performance would be of a Hyper-V system that is virtualized on top of vSphere 5 running the 10 Guest VM’s all generating IO as in the Intel test. I wonder if that would benefit from the vSphere storage layer and actually boost performance. I’m not sure we’ll see such a nested Hyper-V test on top of vSphere published by VMware. But it would be interesting nonetheless.
Here are my sources so you can, if you wish, do the comparison for yourselves.
Intel Whitepaper titled Stable Reliable Performance for iSCSI Connectivity.
VMware Whitepaper titled Achieving a Million IO Operations per Second from a Single VMware vSphere 5.0 Host.
Stephen Foskett posted one of the best articles on the Microsoft test results and if nothing else the images in his post made reading it very worthwhile. I would highly recommend you read his post titled Microsoft and Intel Push One Million iSCSI IOPS.
Chethan Kumar posted an article on the VMware VROOM! Blog regarding the performance testing that he conducted to achieve 1 million IOPS from a single vSphere 5 host. The article is titled Single vSphere Host, a Million IO Operations per Second. This was followed up with an article and video titled A Conversation About 1 Million IOPS by Todd Muirhead.