I had heard murmurs through the ether that something might be up, but at the time it was an unsubstantiated rumour. I couldn’t quite believe that a tier 1 storage company would ship an array that required complete data migration, destruction, and disruption just to do a software/firmware upgrade between firmware versions. This isn’t the SDDC you’re looking for. I didn’t see the point in slinging mud over something that might be untrue, or that might be corrected in time for a GA release (customers are still hoping). Everyone who’s been in the IT game long enough knows that things go wrong from time to time despite everyone’s best efforts, but planned data destruction for an upgrade is hard to take in this day and age. This is certainly not the always-on, non-disruptive upgrade experience that at least some of us have gotten used to.

It appears, however, that the rumours are true, and they’ve been reported by Andrew Dauncey – The Odd Angry Shot XtremIO Gotcha, Chad Sakac – Virtual Geek on Disruptive Upgrade (transparency on this issue is good), El Reg – No Biggie: EMC XtremIO Firmware Upgrade Will Wipe Data, and IT News – Extreme upgrade pain for XtremIO Customers. Upgrading XtremIO from the 2.4 line of code to 3.0 involves removing all of the data and putting it back after the upgrade completes. That’s right: anything left on the array during the upgrade will in effect be lost. Not to mention the required downtime. So what’s my take?
Disclaimer: I work for Nutanix, but I also work in the real world and know that business decisions and IT architecture are based on business requirements. I don’t regularly come across a requirement that says it’s ok to destroy data, or to wipe it, move it off, and put it back, during an upgrade. Nutanix goes to great lengths to make non-disruptive, always-on operations a core principle of our systems (part of our Web-scale Converged Infrastructure Platform), even when things change. This is my opinion and doesn’t necessarily represent the opinion of my employer or anyone else.
I spoke to an XtremIO customer about this and they were well aware of it (and had planned for it). They were also aware of having had to go through a similar sort of process previously when expanding X-Bricks in the XtremIO platform. The latter has since been partially addressed, so I’m told (it no longer requires data destruction); the former is still a data-destructive and disruptive operation. Planning for any upgrade is important, and backups are important and prudent for any upgrade, even a non-disruptive one, just in case the worst happens. But we don’t usually go through a process where we have to migrate all the data off a system, have it wiped, and then move all the data back again. That is an entirely bigger exercise, even when managed properly and with proper support.
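For what it’s worth, if you do have to move everything off and back, a basic integrity check is cheap insurance. Here’s a minimal sketch (my own illustration, not any vendor’s procedure) that builds a SHA-256 manifest of a mounted filesystem before the migration so it can be compared once the data has been moved back; the mount path and file names are hypothetical.

```python
# Minimal sketch: checksum a tree before migrating data off an array,
# re-run after the data is moved back, and compare the two manifests.
import hashlib
import json
from pathlib import Path

def build_manifest(root: str) -> dict:
    """Map each file's relative path to its SHA-256 digest."""
    manifest = {}
    root_path = Path(root)
    for f in sorted(root_path.rglob("*")):
        if f.is_file():
            h = hashlib.sha256()
            with f.open("rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    h.update(chunk)
            manifest[str(f.relative_to(root_path))] = h.hexdigest()
    return manifest

# Before migrating off the array (path is a hypothetical mount point):
# json.dump(build_manifest("/mnt/array_lun"), open("before.json", "w"))
# After the data is moved back post-upgrade:
# before = json.load(open("before.json"))
# after = build_manifest("/mnt/array_lun")
# assert before == after, "data changed during the off-and-back migration"
```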
There is of course a justification given for this: changing the data structures to enable better dedupe and compression, and to greatly improve performance. Ok, but there are storage systems available that have delivered great performance improvements and changed dedupe between releases without having to wipe the data, and their upgrades are still non-disruptive. EMC thought it was ok to disrupt the more than 1,000 customers currently running XtremIO systems in production (if they choose to upgrade), which suggests that all concerned couldn’t come up with another way of doing it.
I don’t buy the argument that it’s because they only started cutting code for XtremIO in 2009, as mentioned in a comment on El Reg. Many startups came into existence in 2009 and they haven’t all had to do this. But it’s ok, you’re told: Professional Services and the partners will stand behind the upgrade, and it’ll be at no cost to the customers (as it should be). The actual upgrade process is likely to be a complete array replacement and migration to a new array, with the old one taken away. That is the least disruptive form of an otherwise very disruptive process. Otherwise there would be a loan array to migrate to, and then you’d just migrate back. Either way, this is going to be time consuming.
How exactly are you measuring cost? I would say the cost of the customer’s time and resources needs to be factored in. You’ll need to rack up more equipment, potentially more switches, cables, and racks (what if you’re out of space?), and then proceed to migrate everything to the new system (thank god for live Storage vMotion, I hear you say, at least for VMs). Maybe it’s not such a drama if you only run non-persistent VDI desktops (as was the case for one customer I spoke to for this article), but what if you’re running persistent desktops or business critical applications, or perhaps you still have some physical servers lying around? Not everything is so simple to live migrate using Storage vMotion (Oracle RAC?).
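As a sketch of what that live migration looks like in practice for the VMs that can move, here’s a minimal example using pyVmomi (VMware’s Python SDK) to Storage vMotion a running VM onto a datastore backed by the replacement array. The vCenter address, credentials, VM name, and datastore name are all hypothetical placeholders, and error handling is kept to a minimum; this is an illustration, not EMC’s or anyone’s official procedure.

```python
# Minimal sketch: relocate a running VM's storage with pyVmomi.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_obj(content, vimtype, name):
    """Return the first managed object of the given type with the given name."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vimtype], True)
    try:
        return next((o for o in view.view if o.name == name), None)
    finally:
        view.DestroyView()

ctx = ssl._create_unverified_context()  # lab only; verify certificates in production
si = SmartConnect(host="vcenter.example.com",          # hypothetical vCenter
                  user="administrator@vsphere.local",  # hypothetical credentials
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    vm = find_obj(content, vim.VirtualMachine, "app-vm-01")      # hypothetical VM
    target_ds = find_obj(content, vim.Datastore, "new-array-ds") # hypothetical datastore

    # Storage vMotion: move the VM's disks to the new datastore while it keeps running.
    spec = vim.vm.RelocateSpec(datastore=target_ds)
    task = vm.RelocateVM_Task(spec)
    # ... poll task.info.state until 'success' or 'error' ...
finally:
    Disconnect(si)
```

Multiply that by every VM on the array, add the physical servers and anything Storage vMotion can’t handle, and the real cost of the “free” upgrade starts to become clearer.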
I have a fairly black and white view of priorities when it comes to enterprise storage, and this upgrade path seems to break priorities #1 and #2:
1. Data Protection
2. Data Availability
3. Performance
If you can’t do 1 and 2, I don’t really care if you can do 3. Now, it’s unfair to say that just because the upgrade process can’t achieve 1 and 2, the system running in production doesn’t; by all accounts it does. But this does go against those priorities somewhat and would make me think twice. I’d be asking: is this really what I signed up for? Was I advised of this during the sales process, or earlier in the planning around a potential future upgrade to 3.0? If the answer to either is no, well then it’s just as easy to migrate to another array as it is to migrate to a different version of the same array. But you have other options. You could stay on the same old 2.4 firmware and not take advantage of the 3.0 release; you’ll be supported on 2.4 for the foreseeable future. Then, once you’re sick of 2.4 and have had a chance to get a return on your sunk investment (or earlier, if you need to expand), you could easily look at moving to something else.
Final Word
Regardless of the technology I think storage upgrades should be simple and non-disruptive. The problems highlighted here can be worked around, and the disruption can be minimised and risks mitigated. But in an always on world the workarounds might not cut it. Virtualization mitigates a lot of the problems, so now might be a good time to virtualize those last physical servers if they’re running on XtremIO. If you want to learn of a better way to support applications, one that is non-disruptive to upgrade, simple to architect, implement and manage, linearly scalable, and suitable for the vast majority of enterprise IT workloads, talk to someone from Nutanix. It’s not a silver bullet for all business requirements, but it’s at least worthwhile to investigate your options.
—
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster. Copyright © 2012–2014 IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Great post. I agree firmware upgrades for storage should be free and done for the customer. My biggest fight is trying to convince customers that firmware updates are very important.
A FW upgrade being disruptive in this day and age is a crime. Having said that, XtremIO sales guys should take this on board and get replacement arrays for customers who don't have the time to go through this process; after all, nothing else in the architecture changes, and one can use vMotion to move the workloads. I am working on a project where the client has bought 10 X-Bricks, and I'm having a hard time convincing them not to put any data on them, since moving the workloads is not an option (badly designed VDI).
An unpopular stance, I am sure, but I actually disagree with a core issue in this post and most of the comments here and at other sites. Yes, standard firmware updates should be free and non-disruptive (standard meaning bug fixes and maybe small upgrades). However, major improvements to performance and usability do not fall under standard firmware upgrades. Is Apple swapping out everyone's iPhone for the just-released iPhone 6? Of course not.
I believe the mistake EMC made is that they should have just EOL'd the existing platform and released 3.0 as the next-gen release. Problem solved. I think they are now suffering the old adage – "No good deed goes unpunished".
For the record – I don't work for EMC or any of its competitors. I work for a testing company where I get to work with all the storage vendors. From my perspective, EVERY firmware upgrade has the potential to be disruptive. Best advice: back up and test prior to migrating, no matter which vendor.
I think if we were just talking about downtime and a bit of disruption, then most people could agree. But we're talking about a data-destructive upgrade. Maybe this should have been a new product, but having a software upgrade that destroys data, even to add some features and performance, is not ok. Especially not for a system pitched as an enterprise storage system for mission critical and business critical enterprise applications. Nutanix continues to make major improvements in performance and functionality release to release, all without disruption and without destroying data. Even the AFAs that compete with XtremIO have made major improvements in performance and functionality without destroying data. Yes, things can go wrong with an upgrade process, but knowingly making data destruction part of the upgrade process, in 2014, should not be part of the program. If you want to stick with SAN, I guess disruption and forklift upgrades are something you have to accept, but it doesn't have to be that way.