I had heard murmurs through the ether that something might be up, but at the time it was an unsubstantiated rumour. I couldn’t really believe that a tier 1 storage company would ship an array that required complete data migration, destruction and disruption just to upgrade between firmware versions. This isn’t the SDDC you’re looking for. I didn’t see the point in slinging mud over something that might be untrue, or might be corrected in time for a GA release (customers are still hoping). Everyone who’s been in the IT game long enough knows that things go wrong from time to time, despite everyone’s best efforts, but planned data destruction for an upgrade is hard to take in this day and age. This is certainly not the always-on, non-disruptive upgrade experience that at least some of us have become used to.

It appears, however, that the rumours are true, and they’ve been reported by Andrew Dauncey (The Odd Angry Shot: XtremIO Gotcha), Chad Sakac (Virtual Geek on the disruptive upgrade; transparency on this issue is good), El Reg (No Biggie: EMC XtremIO Firmware Upgrade Will Wipe Data), and IT News (Extreme upgrade pain for XtremIO customers). Upgrading XtremIO from the 2.4 line of code to 3.0 involves removing all of the data and putting it back after the upgrade completes. That’s right: anything left on the array during the upgrade will, in effect, be lost. Not to mention the required downtime. What’s my take?
Disclaimer: I work for Nutanix, but I also work in the real world and know that business decisions and IT architecture are based on business requirements. I don’t regularly come across a requirement that says it’s ok to destroy, wipe, remove and restore data during an upgrade. Nutanix goes to great lengths to make non-disruptive, always-on operations a core principle of our systems (part of our Web-scale Converged Infrastructure Platform), even when things change. This is my opinion and doesn’t necessarily represent the opinion of my employer or anyone else.
I spoke to an XtremIO customer about this, and they were well aware of it (and had planned for it). They were also aware of having had to go through a similar process previously when expanding X-Bricks in the XtremIO platform. The latter, I’m told, has since been partially addressed (it no longer requires data destruction); the former is still a data-destructive and disruptive operation. Planning is important for any upgrade, and backups are important and prudent even for a non-disruptive upgrade, but that is just in case the worst happens. We don’t usually go through a process where we have to migrate all the data off a system, have it wiped, and then move all the data back again. That is a much bigger exercise entirely, even when managed properly and with proper support.
There is, of course, a justification given for this: changing the data structures to enable better dedupe and compression and to greatly improve performance. Ok, but there are storage systems available that have delivered great performance improvements and changed dedupe factors between releases without having to wipe data, and their upgrades are still non-disruptive. EMC evidently thought it was ok to disrupt the more than 1000 customers currently running XtremIO systems in production (should they choose to upgrade), which suggests that all concerned couldn’t come up with another way of doing it.
I don’t buy the argument that it’s because they only started cutting code for XtremIO in 2009, as mentioned in a comment on El Reg. Many startups came into existence in 2009, and they haven’t all had to do this. But it’s ok, you’re told: Professional Services and the partners will stand behind the upgrade, and it’ll be at no cost to the customers (as it should be). The actual upgrade process is likely to be a complete array replacement, with migration to a new array and the old one taken away. This is the least disruptive form of an otherwise very disruptive process. Otherwise there would be a loan array to migrate to, and then you’d just migrate back. Either way, this is going to be time consuming.
How exactly are you measuring cost? I would say the cost of the customer’s time and resources needs to be factored in. You’ll need to rack up more equipment, potentially more switches, cables and racks (what if you’re out of space?), and then migrate everything to the new system (thank god for live Storage vMotion, I hear you say, at least for VMs). Maybe it’s not such a drama if you only run non-persistent VDI desktops (as was the case for one customer I spoke to for this article), but what if you’re running persistent desktops or business-critical applications, or perhaps you still have some physical servers lying around? Not everything is simple to live migrate using Storage vMotion (Oracle RAC?).
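To put “time consuming” in rough numbers, here’s a back-of-envelope sketch. All figures (used capacity, sustained migration throughput) are hypothetical assumptions for illustration, not XtremIO specifications, and real migrations add cutover, rescan and validation time on top:

```python
# Back-of-envelope estimate of a full array evacuation.
# The capacity and throughput numbers below are hypothetical assumptions.

def migration_hours(used_tb: float, effective_gbps: float) -> float:
    """Hours to copy `used_tb` terabytes of used capacity at a
    sustained effective throughput of `effective_gbps` gigabytes/sec."""
    seconds = (used_tb * 1024) / effective_gbps  # TB -> GB, then GB / (GB/s)
    return seconds / 3600

# Example: 20 TB used, 0.5 GB/s sustained Storage vMotion rate (assumed).
one_way = migration_hours(20, 0.5)   # evacuate the array
round_trip = 2 * one_way             # ...and migrate everything back again
print(f"One-way: {one_way:.1f} h, round trip: {round_trip:.1f} h")
# e.g. One-way: 11.4 h, round trip: 22.8 h
```

Even with generous assumptions, moving everything off and back again is measured in days of effort once you include change windows, and sustained throughput rarely stays at peak for the whole run.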
I have a fairly black and white view of priorities when it comes to enterprise storage. This upgrade path seems to break priorities #1 and #2.
- Data Protection
- Data Availability
- Performance
If you can’t do 1 and 2, I don’t really care if you can do 3. Now, it’s unfair to say that just because the upgrade process can’t achieve 1 and 2, the system running in production doesn’t; by all accounts it does. But this upgrade does go against those priorities somewhat, and it would make me think twice. I’d be asking: is this really what I signed up for? Was I advised of this during the sales process, or earlier in the planning around a potential future upgrade to 3.0? If the answer to either is no, well, then it’s just as easy to migrate to another array as it is to migrate to a different version of the same array. But you have other options. You could stay on the same old 2.4 firmware and forgo the 3.0 release; you’ll be supported on 2.4 for the foreseeable future. Then, once you’re sick of 2.4 and have had a chance to get a return on your sunk investment (or earlier, if you need to expand), you could easily look to move to something else.
Regardless of the technology, I think storage upgrades should be simple and non-disruptive. The problems highlighted here can be worked around, the disruption minimised and the risks mitigated, but in an always-on world the workarounds might not cut it. Virtualization mitigates a lot of the problems, so now might be a good time to virtualize those last physical servers if they’re running on XtremIO. If you want to learn of a better way to support applications, one that is non-disruptive to upgrade, simple to architect, implement and manage, linearly scalable, and suitable for the vast majority of enterprise IT workloads, talk to someone from Nutanix. It’s not a silver bullet for all business requirements, but it’s at least worthwhile to investigate your options.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster. Copyright © 2012 – 2014 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.