Over the past week I was discussing vSphere 5.5 best practices with regards to Nutanix with Josh Odgers (fellow VCDX and member of the Nutanix Solutions and Performance Engineering Team), who is putting together a comprehensive paper on vSphere 5.5 best practices for Nutanix. One of the topics that came up in our discussions was availability and capacity planning: how many node failures to tolerate, and how to make capacity planning easy while ensuring you can always meet your failure tolerance. Unlike many storage architectures, there is no fixed limit to the number of node failures a Nutanix environment can sustain before data protection may become an issue. It all comes down to your design and how much free space there is to re-protect your data. If, for example, you have a 64 node Nutanix cluster that is 80% full, you can potentially survive the loss of more than 10 nodes (from a storage perspective) before your ability to re-protect the data becomes an issue. But even at small scale, say 3 nodes, what is the best way to avoid trouble, plan for capacity and failure, and ensure your data is protected?
Any environment where rapid storage growth could result in an out of space condition requires monitoring and capacity planning. Every environment will at some point experience a failure. Hardware eventually fails, software eventually works. Hardware fails, software has bugs, people make mistakes. Choose whichever cliché you like: as IT professionals we need to plan for capacity and for failure, no matter what architecture we choose for our environments. Fortunately we can make this process very simple in a Nutanix environment.
When planning your Nutanix environment you should plan for maintenance, availability and failure up front. If you want to be able to do hardware or hypervisor maintenance on a node non-disruptively, then you’ll have to have capacity for that built into your environment. If you want to be able to do maintenance and survive a node failure at the same time, then you need to build in that capacity as well. At a minimum your Nutanix environment, just like your hypervisor environment, should be planned for N+1 failure. If you have a total of 3 nodes and N+1 for failure, then you have 2 nodes worth of capacity that can be used before impacting your recovery posture.
My general rule of thumb is similar to vSphere environments where you may wish to have a node for failure for each 12 or 16 nodes in your cluster. So if you have a 32 node cluster you might want to logically have N+2, so 30 nodes of capacity and survive a two node failure. Some environments prefer to have a node for failure in every 8. So a 32 node cluster would be N+4 and 28 nodes worth of capacity before impacting recovery.
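The rule of thumb above is easy to capture in a few lines of code. This is just an illustrative sketch, not Nutanix tooling; the function name and the default of one spare node per 16 are my assumptions, matching the ratios mentioned above:

```python
import math

def spare_nodes(cluster_size, nodes_per_spare=16):
    """Number of node failures to tolerate (the X in N+X),
    assuming one spare node per 'nodes_per_spare' nodes in
    the cluster, with a minimum of N+1."""
    return max(1, math.ceil(cluster_size / nodes_per_spare))

# A 32 node cluster at one spare per 16 nodes is N+2,
# leaving 30 nodes worth of usable capacity:
print(spare_nodes(32))                       # 2
# The more conservative one-spare-per-8 ratio makes it N+4:
print(spare_nodes(32, nodes_per_spare=8))    # 4
# Even a 3 node cluster should plan for at least N+1:
print(spare_nodes(3))                        # 1
```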
This makes sense and it’s what we’ve all done for years. We’ve got HA built into vSphere and we have admission control enabled to ensure we don’t overload our environments to the point that we can’t sustain a failure, risking important VMs not getting protected and recovered. But how do we apply the same admission control at the storage layer, to ensure we can always re-protect the data in the case of a catastrophic node failure?
The easiest and quickest way to do this in a Nutanix environment is by using a Container Reservation and a FreeSpace Container. A FreeSpace Container is an idea Josh and I came up with to make capacity planning simple, and so that administrators don’t over provision storage to the point that an out of space condition could result from a failure. It is essentially like admission control on the Nutanix storage layer. Here’s how it works.
A FreeSpace Container is a container that you’ve provisioned, without mounting it on any hosts, that has a reservation equal to your failure tolerance. The FreeSpace reservation is based on raw capacity, before any Replication Factor or Resiliency Factor is taken into account. So for example in a 10 node cluster you might set the FreeSpace Container reservation to 1 node, or 10% of your raw capacity (e.g. 4TB if using the Nutanix 3000 series), just as vSphere Admission Control might be set to 10% of compute capacity. The available free space shown to your hypervisor will be reduced by the FreeSpace reservation, so you will easily see from the hypervisor side when you’re running out of capacity and need to add nodes. This also ensures you can survive a failure of the number of nodes you’ve designed for.
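The reservation arithmetic can be sketched as follows. The names here are mine for illustration, not a Nutanix API; the reservation is simply the raw capacity of the nodes you want to tolerate losing, before RF is applied:

```python
def freespace_reservation_tb(node_raw_tb, tolerated_node_failures):
    """Raw capacity (TB) to reserve in the unmounted FreeSpace
    Container, before Replication/Resiliency Factor is applied."""
    return node_raw_tb * tolerated_node_failures

# The example above: a 10 node cluster of 4TB (raw) nodes,
# tolerating a single node failure:
reservation = freespace_reservation_tb(4, 1)   # 4 (TB)
total_raw_tb = 10 * 4                          # 40 (TB)
# 10% of raw capacity, analogous to a 10% HA admission control policy:
print(reservation / total_raw_tb)              # 0.1
```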
[Updated 04/12/2018] Note: The reservations in recent versions of AOS use logical space instead of physical space. When this article was originally written 4 years ago the reservations were based on physical space. 4TB Physical at RF2 = ~ 2TB Logical at RF2.
Here is a diagram that shows a 16 node cluster with a failure tolerance of 2 nodes and the setting for the FreeSpace container.
In the case of the example above, using RF2 for data protection, you would have a usable storage capacity of 56TB visible to the hypervisor after the FreeSpace Container has been created. If you were planning a 4 node cluster with a failure tolerance of 1 and 4TB in the FreeSpace Container, then you’d have usable storage visible to the hypervisor of 12TB. If you have a mixed cluster with different node models then you should plan on tolerating a failure of at least the biggest node in the environment.
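The capacity math above, including the mixed-cluster case, can be sketched like this. Again the helper is hypothetical, not Nutanix tooling, and the figures are raw TB before RF is taken into account, matching the worked examples:

```python
def visible_capacity_tb(node_raw_tbs, tolerated_failures):
    """Raw capacity (TB) visible to the hypervisor once the
    FreeSpace reservation is taken out. For a mixed cluster the
    largest nodes are reserved first, so the reservation covers
    the worst-case node failure."""
    by_size = sorted(node_raw_tbs, reverse=True)
    reserved = sum(by_size[:tolerated_failures])
    return sum(node_raw_tbs) - reserved

# 16 node cluster of 4TB nodes with a 2 node failure tolerance:
print(visible_capacity_tb([4] * 16, 2))      # 56 (TB)
# 4 node cluster of 4TB nodes with a 1 node failure tolerance:
print(visible_capacity_tb([4] * 4, 1))       # 12 (TB)
# Mixed cluster (one 8TB node, three 4TB nodes), tolerate 1:
# the 8TB node is the one reserved against.
print(visible_capacity_tb([8, 4, 4, 4], 1))  # 12 (TB)
```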
It is not a good idea to run out of free space in any storage environment. If you run your storage environments too full you risk not only data integrity but also performance. Most storage architectures recommend that you have some sort of free space buffer. Fortunately, by choosing a platform such as Nutanix that makes more efficient use of the underlying storage, easily reclaims free space when VMs or virtual disks are deleted, and has built-in compression and data de-duplication, you’re already better off. By using a FreeSpace Container that is not mounted to any host you can ensure you’ve always got a safety net, a buffer to keep you out of trouble, like admission control in vSphere HA. All you need to do when adding new nodes to the Nutanix cluster is modify the FreeSpace Container’s reservation to match your failure and capacity tolerance. Simple as that.
I hope you’ve found this useful and as always your feedback and comments appreciated.
This post appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster +. Copyright © 2014 – IT Solutions 2000 Ltd and Michael Webster +. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Interesting. Definitely something to be concerned with in a hyperconverged solution, as a node failure now constitutes a loss of both compute and storage resources. Not something I'd have to worry about if I had storage separated from compute (i.e. an all-flash or flash hybrid storage array).
In the latter I don’t have to oversubscribe my storage to facilitate compute maintenance or failure.
Definitely not knocking hyperconverged solutions. Just saying it’s a shift in thinking about properly sizing storage.
Actually overfilling any storage platform is a problem and you still have to worry about it regardless of whether it’s a traditional array or not. So just separating storage from compute doesn’t resolve the out of space concerns or reduce capacity planning; in some ways it makes it harder. So it’s a good practice regardless of where the storage sits. It's just easier to do the capacity planning in a hyperconverged infrastructure because it’s tightly coupled.
I kind of disagree. Of course you have to capacity plan for any solution. My point was that for a non hyperconverged platform I calculate my storage requirements based on my compute requirements. If one of my compute nodes dies, my storage performance doesn’t suffer. If anything, the remaining compute nodes/workloads get more storage resources.
I don't have to overprovision storage to account for compute failure – just rightsize based on my requirements.
With Nutanix a node constitutes both compute and storage and therefore has to be thought about a little differently IMO. The fact that you’re illustrating that we need to look at failure to tolerate conditions as it relates to storage design shows that. That used to be just a compute construct.
That’s true. You don’t have to over provision storage for compute failure. But you still have to over provision storage, and a failure of a storage processor has a higher proportional impact than the loss of a single Nutanix node in terms of performance. You simply have to right size your Nutanix environment for your requirements. If you’re saying you're running your storage environment at more than 80% – 90% full, good luck.
I'm not entirely sure I understand your comment about "you still have to over provision storage". Even with an active/passive storage architecture, you're still only right-sizing your storage. If a storage controller fails, the standby controller takes over the network and storage resources. The impact of the controller failure isn't even felt by the workloads housed on that storage. So, still confused about the mandatory over provision of storage.
Right-sizing a storage design should always take an acceptable level of growth and buffer into account – so you're not running at 80-90% full.
Anyways, this is all getting away from my original point that hyperconverged solutions demand that you now consider storage resource capacity planning in a different manner. Nobody is taking a swipe at Nutanix :). Just nice to know that I now need to account for extra storage for my Nutanix design.
I think the point is you needed to account for that growth and buffer space (over provisioning of storage capacity) regardless if it's in a Nutanix environment or in a traditional storage environment. The difference is in a hyperconverged environment you use that buffer space during maintenance operations as well. The percentage of capacity you need as a buffer decreases the larger the Nutanix cluster becomes. But you want to have a safety net to ensure you never run the risk of an out of space condition.
Very good idea to have a space reservation set! I am however interested to hear what happens if you don't actually keep free space to accommodate re-protection. Let's say you have a three node cluster and you are using 80% of its capacity physically. Now one node fails. What's going to happen? Will all remaining space be filled up when re-protecting the data, or will the system leave some space for newly allocated blocks (let's say from the existing VMs already provisioned which didn't fill all of their space/VMDK yet)? Or does the system refuse to allocate new blocks because it knows it will eventually not be able to replicate all existing blocks (which were not yet copied to the second node)?
If you don't have sufficient space you will run out of space and your VMs will pause until some space is created. This is standard behaviour when a storage system is full. If you are out of space you are out of space and you won't be able to write any more data. When you have fixed the out of space issue your VMs will start to work again. I would not recommend running any storage system near 100% full. You should always have sufficient capacity in a system to recover from failure scenarios.
Thanks for the reply! So you are saying that NOS will eventually "claim" whatever space it can to re-protect? The reason I asked the question is because of the following scenario we have at a customer: They have two clusters with replication in between. All production data is sent to the second cluster for failover capability in case of a site failure. On the primary cluster they currently don't have enough space to accommodate a node failure (assuming the VMs will use nearly all space they can, I know NDFS is thin provisioned underneath). The customer doesn't care so much since he can always fail over to a full copy if necessary. The issue however is that he expects that in case a local node fails NOS will continue writing data even if it can't be protected within the cluster (RF2). Based on your answer however I understand that the remaining nodes will fill up with copies (re-protection) and therefore a failover to the second cluster will be necessary? Do I understand it correctly?
Yes, your understanding is correct. Sounds like it is time to buy another node for that environment. Enabling compression or dedupe may also be a good idea, depending on the type of VMs.