As more and more companies start to look at hyperconverged, web scale (such as Nutanix, where I work) or just converged solutions for their virtualization platform, there is a need to make sure you know what you’re looking at and go in eyes wide open. There are a number of different aspects to evaluate, and no two options are the same. There are no right or wrong answers, as many different solutions may meet your requirements, and one may simply be better suited to them than another. The idea behind this article is to give you a list of questions to consider and to ask any potential vendor. This list is designed to be vendor neutral and is based on my experience; it’s just so you understand what you’re getting for some important aspects.
This list is going to be a starting point and isn’t going to be completely exhaustive. You will have your own priorities and additional questions that need to be answered. Let’s first look at some key aspects of any hyperconverged virtualization platform.
- Data Protection. In my opinion, any system that doesn’t have data protection as the highest priority has no place in the enterprise. Every system should take great steps to protect data and protect against data loss in common and uncommon failure scenarios. This is not backup, but primary data protection and data resiliency of online, in-flight data. Data protection is important through all aspects of operations, including upgrades. Make sure that all upgrade scenarios are non-disruptive and non-data destructive.
- Availability. Making sure the system reduces single points of failure, can continue operations in spite of component failure (including management components), reduces single points of contention, hot spots and denial of service conditions, and includes components that are reliable.
- Manageability. Being able to operate, monitor, troubleshoot, provision new components, scale (up or down), provide security and auditability.
- Performance. Having enough performance to meet your requirements, being able to adapt performance to meet future requirements, having predictability of performance and reducing and containing impacts of runaway workloads or noisy neighbours.
I have listed the above key characteristics of any enterprise hyperconverged solution in what I believe is the most appropriate priority order.
Some additional aspects to consider:
- Architecture Effort. How much time do you need to invest in architecting the solution, sizing the initial solution, and determining how the solution would best fit your requirements? If the solution is engineered from the factory this will reduce your effort. Are you expected to be the platform architect and build all the individual components and do all the integration?
- Purchasing and Acquisition. Local partner, ease of acquisition, initial units or quantity required, cost alignment to consumption and utilization, purchase on demand, pay as you grow, flexibility to change components or mix and match components after initial purchase based on business or technical requirements.
- Delivery and Deployment. Expected delivery timeframes, expected deployment timeframes, expected timeframes and effort required to scale up, or scale out a solution, expected time to productivity.
- Support. Integrated support for full solution stack from single or multiple points, spare parts, availability of parts, location of parts, service and response times, location of support, frequency of software updates, consequences for missed SLAs. Do you really need 24/7/365 with 4 hour response or can you settle for a lower service level due to the resiliency, redundancy and reliability of the solution?
Some questions to consider:
- How is data resiliency and protection achieved?
- Does it support hypervisor storage acceleration or offload and space saving features (VAAI for vSphere, ODX for Hyper-V)?
- Does it support APIs for data protection and backup (such as VMware VADP)?
- Is more than one hypervisor supported, and if so, which ones?
- Is it possible to change hypervisors for an existing running environment and how much effort is involved?
- Are multiple hardware platforms supported and if so which ones?
- How many other environments is the solution deployed into that have similar requirements to yours?
- What happens when hypervisor HA features have to be disabled for troubleshooting or if HA has to be turned off for some other reason?
- What happens if the whole cluster needs to be shut down and it runs your key management components (vCenter or SCCM etc)?
- Is there any ability to encrypt disks or to encrypt communications between components of the architecture?
- Does it support Fault Tolerance (VMware vSphere)?
- Does it support Jumbo Virtual Disks (2TB+ virtual disks, VHDX etc)?
- Is there any way a failure of a single node or component could cause degradation of productive workloads to the point that they become non-responsive?
- Is there any built-in support for DR replication, backup and snapshots?
- What happens in the case of an SSD failure?
- What happens in the case of a hard disk failure?
- Is the upgrade process non-disruptive and can it be completed without a reboot of the physical host and migration of the virtual machines?
- Does the upgrade process in any way impact data availability, data protection or data integrity?
- What happens if a node is unavailable for more than a set period of time (for example, half an hour or an hour)?
- What is the impact on productive end user workloads in the case of a data rebuild / re-protection scenario?
- How is a failure of a hard disk alerted?
- How is health of the environment checked, monitored and alerted?
- What happens if a physical host is rebooted and it has a failed disk?
- How does it integrate with enterprise backup products?
- Can Changed Block Tracking techniques for incremental backups be used?
- How are virtual machine recoveries from backups achieved, and can individual files be restored?
- Can the management infrastructure itself be protected easily for DR, snapshotted and backed up without any third-party components or add-ons?
- How easy is it to restore the management infrastructure in the case of a DR event?
- Does the solution require multicast support on the network for it to work?
- Does the solution require Jumbo Frames (>1500 byte MTU)?
- What happens if management components (vCenter or SCCM) are down or completely destroyed and need to be rebuilt?
- Where are Hypervisor Core Dump and other troubleshooting data stored?
- Are any other storage locations required for troubleshooting data (placement of core dumps, logs, scratch locations etc) outside of the hyperconverged storage provided?
- What happens if different types of management traffic share the same IP subnet?
- If a decision is made to change hardware platforms or hypervisors in the future can it be done and how easy is it to achieve?
- How does the solution provide the balance of resource requirements to fit your needs (CPU, RAM, Storage, Network)?
- What are the units of scale and how is performance impacted when the solution is scaled up or down?
- Can the solution be scaled down as well as being scaled up and if so can it be done non-disruptively?
- If your solution is tightly coupled to the hypervisor is there any chance that environments that don’t leverage this feature or solution could be impacted by the code that is coupled to the hypervisor?
- How many patches have been released for core hypervisor and management components that are a direct result of tight coupling to the hypervisor but not related to core hypervisor functionality?
- What built-in monitoring, management, and alerting capabilities exist out of the box?
- How does the solution integrate with existing monitoring and management platforms?
- Can the solution be extended or information be made available to other systems through plug-ins or APIs?
- If you make a miscalculation or can’t accurately predict the required performance, what are the consequences of changing the solution at a later stage and how costly would it be?
- Does the solution allow you to reduce the dependence on specialised skills and resources and reduce training requirements for your environment?
- How long does it take to recover / rebuild from various component failure scenarios, such as node, hard disk, SSD etc?
- Are there any single points of failure?
- What data services, such as compression and data deduplication, are included in the solution, and what are the expected performance and use cases for them?
- How long does it take to clone or provision new virtual machines and how much space do the clones take up?
- How does the solution prevent a single workload from monopolising all resources?
- How are points of congestion handled and how is data and network congestion dealt with by the solution?
- How does the solution limit the impact of component failure or performance problems?
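If you are asking these questions of more than one vendor, it can help to keep track of which ones are still unanswered, with the highest priority aspects first. The following is only a minimal sketch in Python of one way to do that. The `Question` class, the `open_questions` helper and the tagging of each question with one of the four key aspects are my own illustrative assumptions, not part of any product or tool.

```python
from dataclasses import dataclass
from typing import List, Optional

# The four key aspects, in the priority order given earlier in this article.
PRIORITY = ["Data Protection", "Availability", "Manageability", "Performance"]


@dataclass
class Question:
    text: str
    aspect: str                   # assumption: you tag each question with a key aspect
    answer: Optional[str] = None  # stays None until the vendor has answered


def open_questions(questions: List[Question]) -> List[Question]:
    """Return the unanswered questions, highest priority aspect first."""
    pending = [q for q in questions if not q.answer]
    # sorted() is stable, so questions within one aspect keep their original order.
    return sorted(pending, key=lambda q: PRIORITY.index(q.aspect))


if __name__ == "__main__":
    checklist = [
        Question("Does the solution require Jumbo Frames (>1500 byte MTU)?", "Performance"),
        Question("How is data resiliency and protection achieved?", "Data Protection"),
        Question("Are there any single points of failure?", "Availability",
                 answer="Vendor response recorded here"),
    ]
    for q in open_questions(checklist):
        print(f"[{q.aspect}] {q.text}")
```

This deliberately does not score or rank vendors, since there are no right or wrong answers. It only shows you what you still need to ask.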
Final Word
Although I put this list together for hyperconverged virtualization platforms and solutions, it would apply to many other converged solutions and even traditional architectures. This list is by no means exhaustive and there are many more things to consider. These are just some of the things I could come up with off the top of my head. It would be great if you could contribute to this list by providing feedback and comments below on other aspects that are important to consider and questions to be asked.
—
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster. Copyright © 2012 – 2014 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.