The hot question at the moment in hyperconvergence (solutions that combine compute and storage into a distributed platform to run VMs) appears to be this: is running storage services in the hypervisor kernel superior to running storage services in user space? I strongly believe the answer is no, in-kernel is not superior, but not for some of the reasons others have given, such as those on Nigel Poulton's blog article on the same topic. There are many reasons why I hold this view, and some of them are why I chose to work for Nutanix even after becoming familiar with VSAN while consulting to VMware for such a long time.

Firstly, let me get one thing clear: performance is not superior just because a storage service is running in the kernel. I've extensively tested multiple VSA solutions and VSAN and found that performance can be similar; in fact, VSAs can outperform VSAN under certain conditions (I will publish the data at a later date once the testing is completed). If in-kernel were the best place to run things, wouldn't the applications run there too? Is running on top of a hypervisor not a good place for high performance applications? We know that is not the case: hypervisors, especially vSphere, are well suited to running very high performance business critical applications. So let's dive into this a bit more.
For years VMware has educated us that vSphere is superior to Hyper-V. One of the principal reasons they used as the basis for this is that the Hyper-V solution is bloated, insecure, and requires regular patching. In my opinion VMware is now bloating the ESXi kernel by putting VSAN inside it. It's not the thin supermodel it once was. Even if you don't use VSAN you can now be impacted.
Because VSAN is in the kernel you get the bugs even if you're not getting any benefit from VSAN storage. If you're running FC, NFS, or even a VSA based storage solution, you can still be hit by kernel bugs in VSAN code that have nothing to do with the areas of the hypervisor you actually use. I found this myself with vSphere 5.5, where putting a host into maintenance mode caused it to try to enter VSAN Maintenance Mode (which of course it couldn't do), even though VSAN was not being used. This bug has since been fixed of course, but it serves to illustrate the point.
Manageability and maintainability are king in my opinion. These factors reduce opex and give your admins and architects their weekends back. To get the real benefits from hyperconvergence you need to be able to maintain and update everything non-disruptively, which means your storage software should be updatable without any disruption to your VMs. By tying VSAN to the kernel you are limiting the ability to update it without updating the entire hypervisor, which ties you to the hypervisor release cycles and causes unnecessary disruption to VMs. Many customers don't update hypervisor versions quickly, having been burnt by bugs in the past, and that means new features and bug fixes are left on the shelf for a long time. Having the storage outside of the kernel, in user mode, offers a lot more flexibility, reduced disruption, and additional isolation.
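To make that contrast a little more concrete, here is a minimal sketch of what a rolling, non-disruptive storage controller upgrade looks like when the controller lives outside the kernel. The helper names below are hypothetical stand-ins for whatever management interface a VSA solution exposes; this is not any vendor's actual API, just the general pattern.

```python
# Hypothetical sketch of a rolling storage-controller upgrade.
# None of these helpers are a real vendor API; they stand in for
# whatever management interface the VSA solution provides.

def rolling_upgrade(cluster, new_version):
    for node in cluster.nodes:
        controller = node.storage_controller
        # Redirect the node's I/O to a healthy peer controller so the
        # local VMs never lose access to their storage.
        cluster.redirect_io(from_node=node, to_node=cluster.healthy_peer(node))

        controller.upgrade(new_version)   # only the controller VM restarts
        controller.wait_until_healthy()

        # Restore the local data path before moving to the next node.
        cluster.restore_io(node)
    # The hypervisor itself was never patched or rebooted, and no VM moved.
```

The point of the pattern is simply that the storage software version can move independently of the hypervisor version.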
I think one of the best summaries of why in-kernel is not the best answer to the hyperconverged storage question comes from the comment posted on Nigel Poulton's blog by Dheeraj Pandey, CEO of Nutanix.
I have published it below with small edits for blog readability.
The whole management argument of integration is being broken apart. Had that been true, Oracle apps would have continued to rule, and people would never have given Salesforce, Workday, ServiceNow, and others a chance. And this has been true for decades. Oracle won the DB war against IBM, even though IBM was a tightly integrated stack, top-to-bottom. After a certain point, even consumers started telling Facebook that their kitchen-sink app is not working, which is why FB started breaking apart that experience into something cleaner, usable, and user-experience-driven.
These are the biggest advantages of running above the kernel:
Fault isolation: If storage has a bug, it won't take compute down with it. If you want to quickly upgrade storage, you don't have to move VMs around. Converging compute and storage should not create a toxic blob of infrastructure; isolation is critical, even when sharing hardware. That is what made virtualization and ESX such a beautiful paradigm.
Pace of Innovation: User-level code for storage has ruled for the last 2 decades for exactly this reason. It's more maintainable, it's more debuggable, and it's faster-paced. Bugs don't bring entire machines down. Exact reason why GFS, HDFS, OneFS, Oracle RDBMS, MySQL, and so on are built in user space. Moore's Law has made user-kernel transitions cheap. Zero-copy buffers, epoll, O_DIRECT IO, etc. make user-kernel transitions seamless [see the sketch after this quote]. Similarly, virtual switching and VT-x technologies in hypervisors make hypervisor-VM transitions seamless.
Extensibility and Ecosystem Integration: User-space code makes it more extensible and lends itself to a pluggable architecture. Imagine connecting to AWS S3, Azure, compression library, security key management code, etc. from the kernel. The ecosystem in user-space thrives, and storage should not lag behind.
Rolling Upgrades: Compute doesn’t blink when storage is undergoing a planned downtime.
Migration complexity (backward compatibility): It is extremely difficult to build next-generation distributed systems without using protobufs and HTTP for self-describing data formats and RPC services. Imagine migrating 1PB of data if your extents are not self-describing. Imagine upgrading a 64-node cluster if your RPC services are not self-describing [see the sketch after this quote]. Porting protobufs and HTTP into the kernel is a nightmare, given the glibc and other user library dependencies.
Performance Isolation: Converging compute and storage doesn't mean storage should run amok with resources. Administrators must be able to bound the CPU, memory, and network resources given to storage [see the sketch after this quote]. Without a sandbox abstraction, in-kernel code is a toxic blob. Users should be able to grow and shrink storage resources, keeping the rest of application and datacenter needs in mind. Performance profiles of storage could be very different even in a hyperconverged architecture because of application nuances, flash-heavy nodes, storage-heavy nodes, GPU-heavy nodes, and so on.
Security Isolation: The trusted computing base of the hypervisor must be kept lean and mean. Heartbleed and Shellshock are the veritable tips of the iceberg. Kernels have to be trusted, not bloated. See T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, and D. Boneh, "Terra: A Virtual Machine-Based Platform for Trusted Computing," in Proceedings of the 19th ACM Symposium on Operating Systems Principles, pp. 193–206, 2003. Also see P. England, B. Lampson, J. Manferdelli, M. Peinado, and B. Willman, "A Trusted Open Platform," IEEE Computer, pp. 55–62, July 2003.
Storage is just a freakin’ app on the server. If we can run databases and ERP systems in a VM, there’s no reason why storage shouldn’t. And if we’re arguing for running storage inside the kernel, let’s port Oracle and SAP to run inside the hypervisor!
In the end, we have to make storage an intelligent service in the datacenter. For too long, it has been a byte-shuttler between the network and the disk. If it needs to be an active system, it needs {fault|performance|security} isolation, speed of innovation, and ecosystem integration.
One more thing: If it can run in a Linux VSA, it will run as a container in Docker as well. It’s future-proof.
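To make the point above about cheap user-kernel transitions a little more concrete, here is a minimal sketch of the kind of mechanism a user-space storage controller can lean on: opening a device with O_DIRECT on Linux so I/O bypasses the host page cache and lands directly in an aligned user buffer. The device path is a placeholder, and this illustrates the general Linux interface rather than any particular product's data path.

```python
import mmap
import os

# O_DIRECT bypasses the page cache, but requires buffers aligned to the
# device's logical block size; mmap hands back page-aligned memory,
# which satisfies that requirement.
fd = os.open("/dev/sdb", os.O_RDONLY | os.O_DIRECT)  # placeholder device path
buf = mmap.mmap(-1, 4096)                            # 4 KiB aligned buffer
os.readv(fd, [buf])                                  # read straight into the user buffer
os.close(fd)
```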
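The self-describing data point can be illustrated in a similar way. The comment above mentions protobufs; the sketch below uses JSON purely to keep the example dependency-free, and the header layout is a hypothetical one of my own, not anyone's actual on-disk format. The idea is that a versioned, length-prefixed header lets any node, old or new, interpret an extent without out-of-band knowledge of the writer's software version.

```python
import json
import struct

# Hypothetical self-describing extent layout: a 4-byte length prefix,
# followed by versioned JSON metadata, followed by the extent data.
def write_extent(f, data, version=2):
    meta = {"format_version": version, "length": len(data), "checksum": "crc32c"}
    blob = json.dumps(meta).encode()
    f.write(struct.pack("<I", len(blob)))  # little-endian header length
    f.write(blob)
    f.write(data)

def read_extent(f):
    (hdr_len,) = struct.unpack("<I", f.read(4))
    meta = json.loads(f.read(hdr_len))     # readable by any version of the reader
    return meta, f.read(meta["length"])
```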
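Finally, on bounding the resources given to storage: when the controller is just a user-space process (or a VM), standard operating system or hypervisor controls apply. The sketch below uses Linux cgroup v2 as one example; the paths, values, and PID are illustrative assumptions, and a controller VM would achieve the same thing with vCPU and memory limits set at the hypervisor layer.

```python
from pathlib import Path

controller_pid = 4242  # placeholder PID of the user-space storage controller

# Cap the controller with Linux cgroup v2: at most 4 CPUs' worth of time
# (400 ms quota per 100 ms period) and 32 GiB of memory.
cg = Path("/sys/fs/cgroup/storage-controller")
cg.mkdir(exist_ok=True)
(cg / "cpu.max").write_text("400000 100000")
(cg / "memory.max").write_text(str(32 * 1024**3))
(cg / "cgroup.procs").write_text(str(controller_pid))
```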
Final Word
In my experience user mode is a great place to run high performance workloads, and storage is one example of a high performance workload. The way some VSAs implement their controllers, they have direct access to the underlying disks for the fastest possible performance, and they leverage all of the high performance capabilities of the hypervisor. By having a controller in user mode you can update it non-disruptively far more easily, and therefore take advantage of new features, capabilities, and bug fixes, all without disruption to the VMs and without having to upgrade the entire hypervisor. The examples of Shellshock and Heartbleed show why putting code that belongs in user space into the kernel can be a bad thing; both caused urgent security patches to core hypervisors.

People have used lock-in as an argument, and there is definitely lock-in at some point, as you eventually have to make a decision. But with hyperconverged solutions you are just a storage vMotion away from another solution stack, so it is much easier to change than it ever has been in the past. The benefit for customers is that this keeps the vendors keen to innovate and keen to ensure that customers have the best possible experience.

You don't have to have your storage controller firmware running in the kernel of your hypervisor to get enough performance and everything else you need to meet the requirements of your applications. VSAs are just as capable of achieving it, and in many cases they will have advantages over other solutions. However, the solution that is right for you will depend entirely on your requirements, and it isn't a one size fits all world.
—
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster +. Copyright © 2012 – 2014 – IT Solutions 2000 Ltd and Michael Webster +. All rights reserved. Not to be reproduced for commercial purposes without written permission.
What's the difference between running something like this in "user space" and a kernel module that can be enabled or disabled at boot?
I think the isolation differences and the consequences around bugs and bloat of the kernel are covered in the body of the article. Because of the tight coupling, even if VSAN is not enabled it can still impact other parts of operations, as I have already experienced. It's not just about enabling or disabling, but isolation, security, and the other aspects covered within the article.
There’s a typo in your article which takes away the main point you are trying to make. You write “By tying VSAN to the kernel you are not limiting the ability to update it without updating the entire hypervisor…”
I’m pretty sure you want to replace ‘not’ with ‘now’.
I would say that some of this is actually incorrect. If storage has an issue it will take down compute with it anyway; the last time I saw an APD the VMs died. While it does restrict the damage, this could be done in-kernel with Docker just as well, which would also reduce attack surfaces by only loading what is needed.
One question not covered here is data. For example:
Data is so important it is the first word in "data center"; it is the sole reason DCs exist and we have jobs. By running it on compute you are putting your most critical part, which is entirely about persistence, on top of a disposable compute layer.
Would you also say that the correct place for a vSwitch is running in a VM? My view is that a hypervisor is an infrastructure virtualizer providing IaaS to all consumers.
Then on scale and innovation, the same can be done with modern SANs, while an AFA will scale higher than a VSA.
Thanks for the comment. In a properly implemented architecture the loss of a local controller on a hypervisor host won't cause an APD. In fact the VMs won't even know anything has happened. However, if the components of the hypervisor that deal with persistent local storage go funky, then you're on a path to an unhappy place. The biggest problem I see is that VMware is now bloating the kernel with this storage stack even though the vast majority of customers won't ever use it. You get the bugs without the benefits. Also, implementing a storage controller in user space means it could run as a container just as easily as it can run as a VM.

Data and the persistence of that data is definitely the top priority; that is why you want an architecture that protects data even if the local compute node is unavailable for whatever reason. Although I would equally argue that compute isn't disposable, as it runs your VMs; the architecture is designed to recover those VMs in the case of a compute failure, and so should the persistent storage.

Whether a vSwitch should run in a VM depends on the function of the VM. If the VM is a Docker container VM running many other containers, then yes, maybe it should; if it's a load balancer or security device that supports multiple virtual security domains, then maybe it should; but if it's none of those, then probably not. The hypervisor would be a better place for the vSwitch.

The innovation could be done on SANs, but that's not where it's happening, mainly due to scalability limitations. AFAs can be a good option, but there is no reason why a well run VSA can't outperform and out-scale an AFA; I see it all the time.
Couldn’t you also argue that keeping the storage in the kernel allows you to keep the components “in sync”, so that upgrades are always done together? I now have one less thing to upgrade, since the storage is built right into the hypervisor.
Thanks for your comment. You could argue that, but there is nothing stopping you from having an automated process to upgrade storage that is outside of the kernel at the same time as the hypervisor. Some companies have this today: a process to upgrade storage with one click while upgrading the hypervisor, so you can do both at the same time. If you are benefiting from the built-in storage capability then it could be an advantage to have it tightly coupled.
We all know that storage performance comes down to "how do I bring the I/O to disk". Real performance is determined by how writes are committed to cache, which block size the solution uses in the background, and which block size my application produces. Are there mechanisms to reduce backend I/Os, and how effective is the solution's tiering algorithm? The discussion of whether it runs in or out of the kernel is so high level that there is no real-world advantage to it. All the other things have much more impact on performance; from my point of view the "in or outside the kernel" performance discussion is the needle in the haystack.
It would be interesting to get insights into the differences between the solutions when compared on levels other than just "kernel".
Hi Daniel, I agree, at the end of the day it's all just about the apps. Also, performance is much less of a differentiator between the two approaches than security, mobility, testability, reliability, and other factors.