Some vendors in the storage, hyper-converged, and cloud industries may be playing Russian roulette with their customers’ data. Not all solutions are created equal. Some turn off basic data integrity features such as data checksums by default, or when there are performance problems. Some don’t have background consistency checks and scrubbing to protect against silent data corruption or latent sector errors. Others might use consumer-grade devices that may have a higher risk of error and a higher failure rate. In the age of software-defined solutions, the customer has become the storage platform architect. There is enough rope to hang yourself (your data and your platform availability) in any number of different ways. That is why a software foundation and integrated solution that has been properly validated from end to end, and that has data integrity and enterprise data protection features at its core, should be the highest priority. Returning data in the form in which it was originally written, at any scale, while protecting against known data and device risks, is of utmost importance. How important is performance (IOPS, latency, and throughput) if you can’t even read back the data you originally wrote? Here are the top 10 questions you can ask potential vendors to find out whether protecting your data really is their top priority.
Before we get started with the questions, it’s always good to have some science and evidence to back things up. Here is one paper: An Analysis of Data Corruption in the Storage Stack. Here is another: Characterizing Private Clouds: A Large-Scale Empirical Analysis of Enterprise Clusters. Both papers cover large-scale studies. A study across a small population of devices, or with a very small sample size, cannot support conclusions about large fleets. Conclusions drawn from something like a 30-drive study aren’t going to be valid when you have tens of thousands, hundreds of thousands, or millions of devices.
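To illustrate why a small sample says little about device reliability, here is a minimal sketch. It assumes a hypothetical annualized failure rate (AFR) of 1% and independent failures. Both are assumptions made for illustration, not figures from either paper, and the function names are invented here.

```python
def expected_failures(afr: float, drives: int, years: float = 1.0) -> float:
    """Expected number of failures, assuming a constant annualized failure rate."""
    return afr * drives * years


def prob_zero_failures(afr: float, drives: int, years: float = 1.0) -> float:
    """Probability that no drive fails at all, assuming independent failures."""
    return (1.0 - afr) ** (drives * years)


if __name__ == "__main__":
    ASSUMED_AFR = 0.01  # hypothetical 1% per year, for illustration only
    for drives in (30, 10_000, 1_000_000):
        print(
            f"{drives:>9} drives: "
            f"expected failures/year = {expected_failures(ASSUMED_AFR, drives):.1f}, "
            f"P(no failures) = {prob_zero_failures(ASSUMED_AFR, drives):.4g}"
        )
```

Under these assumptions, a 30-drive study would most likely observe no failures at all in a year, so it cannot distinguish a reliable device from an unreliable one. A fleet of tens of thousands of devices produces enough failures to measure a rate.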
Questions to ask your potential solution vendor:
- Does the solution include data checksums to ensure that the data read or returned is the same as the data written? If so, are they on by default or optional?
- Do the checksums have a performance impact on random or sequential IO operations? If so, what is the impact?
- Does the solution include consistency checks or scrubbing to protect against silent data corruption, bit rot, and latent sector errors? If so, are they on by default or optional?
- Does the solution include SMART checks and predictive failure analysis, which could include predictive replacement and automated support case generation?
- What is the annualized return rate or failure rate of the devices used in the solution and over what number of devices and duration has that been measured?
- How does the solution protect data between different components (disks, servers/nodes, clusters) and is this tunable based on different requirements?
- How does the solution protect against multiple concurrent component failures and which type of component failures are protected against?
- Are user defined failure domains supported to protect against situations such as chassis failure, rack failure, multiple storage device failure?
- How does the solution recover from single and multiple component failures (a single storage device failure, multiple device failures on the same shelf or node, or multiple node failures), and what are the expected recovery time and performance impact? Does recovery scale linearly as the solution grows over time?
- Do recovery options rely on a single device, such as a hot spare, or in the case of an object store, a single device holding the replica of a large component, or do recoveries utilize all devices in the system equally and fairly?
There are plenty more questions that could be asked, but the 10 questions above cover the most common areas of risk in terms of data integrity, data protection, and data loss prevention. Some systems do not protect against these risks, or at least not by default.
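The checksum and scrubbing questions can be made concrete with a small sketch. This is a toy in-memory model, not how any particular vendor implements it. The class and method names are hypothetical, and CRC32 from Python's standard `zlib` module stands in for whatever checksum a real product uses.

```python
import zlib


class ChecksummedStore:
    """Toy block store: checksum on write, verify on read, scrub in background."""

    def __init__(self) -> None:
        # block_id -> (data, checksum recorded at write time)
        self._blocks: dict[str, tuple[bytearray, int]] = {}

    def write(self, block_id: str, data: bytes) -> None:
        self._blocks[block_id] = (bytearray(data), zlib.crc32(data))

    def read(self, block_id: str) -> bytes:
        data, stored_crc = self._blocks[block_id]
        if zlib.crc32(data) != stored_crc:
            # A real system would repair from a replica here; this sketch only detects.
            raise IOError(f"checksum mismatch on block {block_id!r}")
        return bytes(data)

    def scrub(self) -> list[str]:
        """Verify every block, including ones nobody has read recently."""
        return [
            block_id
            for block_id, (data, stored_crc) in self._blocks.items()
            if zlib.crc32(data) != stored_crc
        ]

    def flip_bit_for_demo(self, block_id: str) -> None:
        """Simulate silent corruption by flipping one bit of the stored data."""
        self._blocks[block_id][0][0] ^= 0x01


if __name__ == "__main__":
    store = ChecksummedStore()
    store.write("a", b"hello")
    store.write("b", b"world")
    store.flip_bit_for_demo("b")
    print("corrupt blocks found by scrub:", store.scrub())
```

The point of the scrub pass is that it finds the damaged block before an application asks for it, while a healthy copy may still exist elsewhere. Without checksums, the corrupted data would simply be returned as if it were correct.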
From a Nutanix point of view, as a leader in the Gartner Magic Quadrant for Hyperconverged Infrastructure, we take data integrity seriously and it’s our top priority. We protect against all of the areas highlighted in the questions above, and we have a paper that explains the Infrastructure Resiliency of Nutanix Solutions, which complements the research paper on enterprise clusters. We also have many hundreds of thousands of devices in production that are proactively monitored, from which we can draw real-world data, and a very thorough device qualification and QA process, which limits risk. We apply similarly high standards across all hardware platforms that our software supports. Our software is built on the philosophy that hardware will eventually fail, so we must deal with these failures gracefully. Your data deserves better protection!
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster. Copyright © 2012 – 2018 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.