There is no doubt that flash technology is changing the shape of data centers (a.k.a. data centres). Even the rock band Queen knew this more than 30 years ago. Flash changes the economics of deploying high-performance applications and removes the performance bottlenecks of traditional storage systems based on spinning disks. As I demonstrated in Real World Data Reduction from Hybrid SSD + HDD Storage, you don’t need flash to get data reduction, but flash storage with data reduction improves the capacity economics of flash (the performance of flash at or below the cost of disk), in addition to the power and cooling savings and the reliability improvements that come from having no moving parts. Something you may not know is that modern flash devices, such as certain SSDs, can now be more reliable than hard drives. From a physical capacity standpoint, SSD capacity is increasing at a rate similar to processor performance gains, driven by the shrinking feature sizes that pack more transistors onto each die. Some vendors are already shipping 16TB 2.5″ SSDs, which will be cheaper per GB than 2.5″ hard drives, while using less power, less cooling and much less space. The M.2 device from Intel in the image above has a capacity of 3.5TB in an incredibly small form factor. So what does networking have to do with this?
Whether you are using Fibre Channel or Ethernet, networking will play a big role in how well flash technology serves you and, more importantly, your applications. The reason is simple, and it is something database administrators have known for a while: the further your data lives from your application, the higher the latency and the lower the throughput, and consequently the worse the performance and response times for your users. This is even more acute with flash, especially as the capacity and performance of the technology marches ever forward, because it takes only one or a few drives to bring a high-performance network to saturation point. This is also one of the reasons some engineered systems have been using InfiniBand networks and RDMA for some time, but even that is too slow. Here is a graph comparing the throughput of three different flash devices with current Ethernet network technologies.
The common 2.5″ hot-pluggable flash devices we see in enterprise systems today can deliver about 500MB/s of throughput and 50K or more IOPS at low latency. So it would take only 2 drives to saturate a 10GbE network, or 4 drives to saturate a 16Gb/s FC network. Luckily we usually have multiple NIC ports or HBAs per server. But that doesn’t help when a storage array can have tens to hundreds of individual drives, or when a single server or storage shelf holds 12 to 24 drives. Even with the common flash technology of today, any time you connect it to a network you are creating a performance bottleneck and can’t possibly get close to its full performance.
Now consider NVMe, the next generation of flash technology, which is set to become more popular by the end of 2016 and mainstream into 2017. Each device can deliver enough throughput and IOPS to saturate a 40GbE NIC. If you have 2 devices in a system, you can saturate a dual-port 40GbE NIC. This is one of the primary reasons NVMe-based storage systems such as EMC’s DSSD do not use traditional networks to connect the storage to servers; instead they use lots of direct PCIe Gen3 connections. They have realized the network is a major bottleneck and is too slow to deliver the kind of performance that NVMe-based flash can provide. Each individual NVMe device is 6x to 8x faster than the common flash we see in most enterprise storage systems today. How many customers have multiple 40GbE NICs or 32Gb/s FC HBAs per server in their data centers today?
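To make the oversubscription problem concrete, here is a rough back-of-the-envelope sketch in Python. The per-device throughput figures and the drive count are illustrative assumptions (roughly 500MB/s for a SATA/SAS SSD and around 3GB/s for a PCIe Gen3 x4 NVMe device), not measurements of any particular product.

```python
# Rough oversubscription estimate: aggregate device throughput vs. network bandwidth.
# All figures are illustrative assumptions, not vendor specifications.

GB = 1000  # work in MB/s for simplicity

devices = {
    "SATA/SAS SSD (~500MB/s)": 500,
    "NVMe SSD (~3GB/s, PCIe Gen3 x4)": 3 * GB,
}

networks = {
    "Dual 10GbE (~2.5GB/s line rate)": 2 * 1250,
    "Dual 40GbE (~10GB/s line rate)": 2 * 5000,
}

drives_per_server = 24  # e.g. a typical 2U drive shelf

for dev_name, dev_mbps in devices.items():
    aggregate = dev_mbps * drives_per_server
    print(f"{drives_per_server} x {dev_name}: ~{aggregate / GB:.1f} GB/s aggregate")
    for net_name, net_mbps in networks.items():
        ratio = aggregate / net_mbps
        print(f"  vs {net_name}: ~{ratio:.1f}x the available network bandwidth")
```

Even with generous dual-port 40GbE connectivity, a single shelf of NVMe devices can outrun the network several times over, which is the gap the direct PCIe approach described above is trying to close.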
SSDs are fast, NVMe-based SSDs are faster, but 3D XPoint, a joint development between Intel and Micron, is mind-bogglingly fast. 3D XPoint, which was announced in 2015 and is expected to be in enterprise platforms by 2018/2019, is claimed to be up to 1000x faster than the common SSDs used in most enterprise systems today. At the sort of performance 3D XPoint can deliver, motherboards, processors, the memory bus and everything else will need a massive boost as well. Each device could more than saturate a multi-port 400GbE network (400GbE is the next step after 100GbE). As soon as you put this on a network you are waiting an age. 3D XPoint is expected to deliver latency as low as 150ns or less, lower than the port-to-port latency of today’s enterprise 40GbE and 100GbE switches. Even Gen3/Gen4 PCIe is not fast enough to keep up with this sort of performance. Don’t even start thinking about the impact of in-memory databases, which run at DRAM speeds.
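A quick latency budget illustrates why putting a device like this behind a network is so costly. The figures below are assumed orders of magnitude (NAND SSD reads around 100µs, a 3D XPoint-class media access around the 150ns claimed above, and a switched network round trip in the tens of microseconds), not benchmark results.

```python
# Rough latency budget: what a network round trip adds to a storage access.
# All latencies are assumed order-of-magnitude figures in microseconds.

local_media_latency_us = {
    "SATA/SAS NAND SSD read (~100us)": 100.0,
    "NVMe NAND SSD read (~80us)": 80.0,
    "3D XPoint class media (~150ns claimed)": 0.15,
}

network_round_trip_us = 30.0  # assumed: NIC + switch hops + remote stack, both ways

for name, media_us in local_media_latency_us.items():
    remote_us = media_us + network_round_trip_us
    slowdown = remote_us / media_us
    print(f"{name}: local ~{media_us:g}us, over the network ~{remote_us:g}us "
          f"(~{slowdown:.1f}x slower)")
```

For NAND-based SSDs the network adds a noticeable but tolerable penalty; for 3D XPoint-class media the network round trip completely dominates the access time, which is exactly the problem described above.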
As the image from Crehan Research Inc. above shows, 10GbE and 40GbE port shipments are increasing and the cost of 100GbE ports is coming down. But 100GbE still isn’t widely adopted, and neither is 40GbE in servers just yet. Crehan Research expects 100GbE adoption to pick up from 2017, according to their 2015 report, but this will be at the switching/backbone layer and not at the server. With NVMe becoming mainstream and 3D XPoint only a couple of years away from being deployed, network connectivity to each server has no hope of increasing 1000-fold in that short amount of time. We would effectively need dual-port terabit Ethernet to every server.
So we can see from the evidence that if you connect flash to a network you are going to have a bottleneck that impacts performance and limits the usefulness of your investment to some fraction of its potential. At the same time you want data protection, while still getting closer to the performance the flash can actually deliver. How can we do both? How do we get high performance, low latency and the data as close to the applications as possible, while still maintaining data protection?
The simple answer would be to connect SSDs to local RAID cards. This would work with everyday 2.5″ SSDs (although you’d need multiple RAID cards per server for performance), but it doesn’t work with NVMe or 3D XPoint. Multiple local RAID controllers in every server would also create hundreds or thousands of silos of storage capacity that then have to be individually managed. We spent a long time creating architectures that could be centrally managed to eliminate this management overhead; we shouldn’t go backwards to take advantage of new technologies.
The real answer is twofold: firstly virtualization, and secondly investing in system architectures that are distributed in nature and have the concept of data locality at their heart. An architecture with data locality ensures that data is kept local to the applications, on the local server, while being distributed for data protection. The reason we need virtualization is that we now have such an abundance of high-performance storage that few single applications can actually make use of it. By using virtualization we can make use of the compute capacity as well as the storage capacity and performance. We’re fortunate that Intel keeps increasing the power of its processors every year, allowing us to make ever better use of the cores for compute and now also for high-performance storage (no proprietary components required).
The concept of data locality is used by many web-scale applications that need to grow and scale while maintaining data protection and high performance. By building on data locality you reduce the burden on the network, removing it as an acute bottleneck, and future-proof your data center for the new types of flash technology. Data is accessed by the application locally, over the local PCIe bus and through memory, and only the writes or changes are sent across the network to protect them. Architectures based on data locality, when implemented properly, scale linearly with predictable and consistent performance. This eliminates a lot of guesswork and troubleshooting and reduces business risk, because the architecture can adapt to changing requirements quickly by adding or removing systems from the environment.
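As a minimal sketch of the idea (not any vendor’s actual implementation), the read and write paths of a data-locality architecture might look something like this. The node names, the single-replica protection scheme and the method names are all hypothetical, chosen only to illustrate the pattern.

```python
# Minimal sketch of a data-locality I/O path (illustrative only).
# Reads are served from the local node; writes are replicated to one or
# more remote nodes for protection before being considered safe.

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    store: dict = field(default_factory=dict)  # local flash, modeled as a dict

class LocalityAwareStore:
    def __init__(self, local: Node, replicas: list[Node]):
        self.local = local        # node the application/VM is running on
        self.replicas = replicas  # remote nodes holding extra copies

    def write(self, key: str, value: bytes) -> None:
        # Write locally first, then ship only the new/changed data over the network.
        self.local.store[key] = value
        for remote in self.replicas:
            remote.store[key] = value  # the network is used only for writes

    def read(self, key: str) -> bytes:
        # Reads never touch the network when the data is already local.
        if key in self.local.store:
            return self.local.store[key]
        # Fallback: fetch from a replica (e.g. right after a VM has moved hosts).
        for remote in self.replicas:
            if key in remote.store:
                return remote.store[key]
        raise KeyError(key)

# Example: reads stay on node-a; only the write crosses the network to node-b.
store = LocalityAwareStore(Node("node-a"), [Node("node-b")])
store.write("db-page-42", b"...")
assert store.read("db-page-42") == b"..."
```

The key property is that the steady-state read path never crosses the network, so read latency tracks the local flash rather than the slowest switch hop, while writes still gain protection by being replicated.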
You can adopt a distributed architecture with data locality by building it into your custom-written apps, or by implementing some of the new web-scale big data applications (Hadoop and the like). But if you don’t have an army of developers, how can you benefit from this type of architecture, one that will adapt to new storage technologies and is future-proofed for the changes coming in the data center? The answer isn’t a SAN, because as we have covered, if you connect flash at the end of a network you can’t achieve anywhere near its potential. The only current solutions that exist are hyperconverged systems, where server and storage are combined into a single unit and those units are combined into a distributed architecture.
Not all hyperconverged systems have a concept of data locality as part of their architecture, so you need to be careful which vendor you choose. You should evaluate each vendor against your requirements and business needs, and look at who can protect your investment into the future without major architectural disruption. Some vendors are promoting anti-locality and recommending customers go all-flash and just buy more network ports. Unfortunately the network can’t and won’t keep up with flash technology (even 400GbE is too slow), so you are guaranteeing substandard performance and an architecture that won’t be able to seamlessly adapt to rapidly changing flash technologies.
Also note that once you invest in flash and move it closer to your applications, you will find that your overall CPU utilization increases, in some cases dramatically. This is because storage is no longer your bottleneck. Your applications are no longer waiting for IO to complete, so they are much more responsive to users, complete batch jobs more quickly and process more transactions in less time. As a result your CPUs are a lot less idle, and ultimately you get more useful work done in less time. So don’t be alarmed if your CPU utilization suddenly hits 80% when you run on flash. This is expected. After all, isn’t this just good use of your investment, getting the most out of your assets?
Final Word
You can watch this and other topics being discussed over beers with Tony Allen, one of the development engineers at Nutanix, in the Beers with Engineers series below.
Data locality is the only way to future-proof your architecture and get the most out of the continuing evolution of flash technology that is disrupting data centers. Nutanix (where I work) has had data locality as a core part of its architecture since the very first release. This is the primary reason the architecture has been able to keep scaling and adapting to customer environments through different generations of technology over the past 5 years without changing the underlying architecture, and why it is future-proofed for the changes that are coming. We allow our customers to mix and match platforms while keeping the data local to the applications, making the path between the applications and the data as short as possible and thereby lowering application latency.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster. Copyright © 2012 – 2016 IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Your network is too slow…so buy Nutanix 🙂
The data is good, but the conclusion is somewhat biased. The point I wanted to make is that you won’t be using 100GbE in the data centre; much more likely 25 or 50GbE, because the cost of 100GbE is going to be too high, mostly due to vendor licensing and overpriced SFPs, but that is another story.
You might be surprised. I’m seeing some good competition in 100G switching and NICs. 25 and 50 are already too slow.
[…] wrote a great article about the storage area network and the speeds that are involved with the upcoming flash […]
[…] for all flash. I explain why Nutanix has some unique benefits with all flash configurations in Your Network Is Too Slow For Flash And What To Do About It. This article takes a look at some of the highlights from the conference specifically around […]
Will Nutanix start with 3D XPoint samples in Q4 2016?
I don\’t know when 3D Xpoint will be available. But until it\’s enterprise ready there will be NVMe. Both need data locality to be fully leveraged. Else the network will get saturated with remote IO accesses. SAN based AFA and any HCI that doesn\’t have data locality will struggle. Especially as local processing power and flash capacity continues to increase year by year with tech advances.
[…] VMs will need. Nutanix’s resident performance secret agent Michael Webster (NPX007) wrote a wonderful blog about the upcoming performance impacts this new hardware will have on networking so I’d encourage you to read it. The grammar is infinitely better for […]
The other vendors are starting before then.
bandwidth of DDR4-3200 at 25.6 GB/s, 4 million IOPS, sub-2 microsecond latency
Up to 4 TB DIMMs
http://xitore.com/what-is-nvdimm-x/
Thanks for your contribution. The biggest problem with NVDIMM is serviceability. When they need to be replaced there is potentially some downtime of the system. With Optane (3D XPoint) connected through NVMe / PCIe you will be able to hot plug. But we will need PCIe 4 or 5 before we really get decent performance from it. In any case, the network is way too slow for all of this flash. Local to the server is the only place for it.
[…] When it comes to networking you need to consider more than just user access to the database; you also need to consider management workloads, including live migration for maintenance and load balancing, backup, monitoring and out-of-band management. With very large SQL Server VMs of 512GB and above, the live migration network may have some hefty requirements, especially with very active SQL Server VMs. I have seen live migration networks struggle to evacuate a host for maintenance when they were not designed and implemented correctly. If you have hosts with multiple TB of RAM and enough VMs to occupy that RAM, you should consider multiple 10G networks for live migration traffic. Using LACP network configurations can indeed help, as can using Jumbo Frames. As you start adopting 40GbE, 50GbE and faster NICs, the use of Jumbo Frames to increase performance and lower CPU utilization becomes ever more important. You can achieve up to 10% to 15% additional performance by using Jumbo Frames for live migration traffic, depending on CPU type and the bandwidth of your NIC. But take care, as it does need to be implemented properly. It is fortunate that many enterprise-class switches now come with Jumbo Frames enabled by default, but you will still need to enable it in your hypervisor and on the live migration virtual NIC. If you are using Jumbo Frames, why not enable SQL Server to use a network packet size of 8192 bytes instead of the standard 4096 (the same size as a database page, although there is no direct relationship)? 8192 bytes fits nicely into the 8972-byte TCP packet (9000 bytes with overhead included) on the wire. Take into consideration the network impacts of any software-defined storage solution, especially as you adopt modern all-flash systems, because your network may be too slow for flash. […]
Is there a comparison test of Intel Optane SSDs with Nutanix coming?
What do you mean by a comparison test?
[…] References [1] Flash Storage Disaggregation http://csl.stanford.edu/~christos/publications/2016.flash.eurosys.pdf [2] Disaggregation marks an evolution in hyper-convergence http://searchstorage.techtarget.com/opinion/Disaggregation-marks-an-evolution-in-hyper-convergence [3] How Facebook Does Storage https://thenewstack.io/facebook-storage/ [4] Some Food For Thought About Hyper-Converged Infrastructure https://idc-community.com/groups/it_agenda/infrastructureanddatamanagement/some_food_for_thought_about_hyper_converged_infrastructure [5] Your Network Is Too Slow For Flash And What To Do About It (Image) http://longwhiteclouds.com/2016/06/05/your-network-is-too-slow-for-flash-and-what-to-do-about-it/ […]