Today we experienced the second outage to this blog in the last 12 months. The outage today was 15 hours in total, which in my view is unacceptably long for a professional hosting company. It was very inconvenient to me, as I’m sure it was to all of you as well. But it does serve as a good reminder about the maxims of cloud computing. Mind you, the maxims don’t just apply to cloud computing; they are general system design principles that need to be considered and dealt with. But they apply particularly in Cloud Computing, Outsourcing or Hosting, where you are delegating some form of control over your systems to a third party. Maxims are generally established propositions or principles, at least if you’re talking about legal maxims, which I find fascinating to read. In the context of Cloud Computing, the title of this article gives them away: Hardware Fails, Software Has Bugs and People Make Mistakes. It’s up to us to deal with these, as they are guaranteed to happen. This article covers some lessons that can be learned from this recent experience, in the hope that we can all achieve higher availability.
High Availability Doesn’t Seem So High When It Happens At The Wrong Time
Even though this was a 15 hour outage, if it were the only outage this year (and I hope it is), the hosting provider’s availability would still be 99.8%. That’s quite high, isn’t it? Well, it depends. It depends on the impact on your system. The impact will likely depend on the type of business you operate and on the function the system performs for your business. It will also depend on the time of month or year that the outage occurs.
I once had a customer where a three day outage happened at exactly the wrong time. The impact of this outage resulted in a loss in the hundreds of millions of dollars; it worked out to something like $10 million per hour of downtime. Fortunately, although this impact was incredibly significant for the organisation, it didn’t cripple it to the point of extinction. But even with this three day outage at the wrong time, the system availability was still measured at 99%. The point is you can’t just take an annualized availability metric on its own. You must understand the context, the impact, and the other variables.
Annualized system availability is one thing; maximum tolerable downtime might be something completely different. For example, if you have an availability SLA of 99.5%, on an annual basis you are accepting 1.83 days of downtime. But what happens if this falls during the busy Christmas retail period and you’re running an online store expecting a lot of customers? Almost two days of downtime would be catastrophic to sales, and could potentially put you out of business. So it’s unacceptable for all that downtime to happen in one incident; perhaps your maximum tolerable downtime is two hours or four hours. Even that would be a lot, but it might be tolerable. This means your system must be able to recover from any failure and be operational again within four hours of any individual incident, no matter what has failed or why, even if over the course of a year you accept it might be unavailable for almost two days in total. These sorts of metrics in your SLAs are absolutely critical if you are to get from your suppliers what you actually expect.
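The arithmetic behind these numbers is simple enough to sketch. Here is a minimal illustration (function names are my own, and it assumes a 365-day, 8,760-hour year) of converting an availability percentage into a downtime budget, and of the annualized availability implied by a single outage:

```python
HOURS_PER_YEAR = 8760.0  # assumes a 365-day year

def allowed_downtime_hours(availability_pct: float) -> float:
    """Annual downtime budget implied by an availability SLA percentage."""
    return HOURS_PER_YEAR * (1.0 - availability_pct / 100.0)

def availability_after_outage(outage_hours: float) -> float:
    """Annualized availability if this were the only outage in the year."""
    return 100.0 * (HOURS_PER_YEAR - outage_hours) / HOURS_PER_YEAR

print(allowed_downtime_hours(99.5) / 24)   # a 99.5% SLA permits ~1.83 days per year
print(availability_after_outage(15))       # one 15-hour outage still measures ~99.8%
```

This is exactly why an annualized figure on its own is misleading: the 99.5% SLA tells you nothing about whether the 1.83 days arrive as many short blips or one business-killing incident.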
The Maxims of Cloud In Action
In the case of this site I have no such SLA for MTD. However my hosting provider, Bluehost.com, does have a network availability guarantee of 99.9%. They have multiple levels of redundancy, skilled staff, their own datacenters and so on. So what happened to cause a second outage in 12 months, and in this case a 15 hour outage in a single incident?
I suspect there are multiple reasons and hopefully a full root cause analysis is made available to customers, along with a preventative action plan that gives customers confidence that steps are being taken to reduce the risk of another similar incident. This post incident process is standard procedure in many environments and the transparency helps everyone learn from the incident and work together to prevent it happening again.
The information that has been made available suggests that a firmware bug in the networking equipment caused the outage. Because so much of our hardware is now software, or software controlled (software-defined), bugs can have disastrous cascading impacts. Not only is the bug itself a problem (and some only manifest after a period of time, long after an update), but the complexity of the systems involved, and the troubleshooting process you have to go through just to determine that it is a bug rather than something else, is immense. Complex systems in and of themselves can produce a higher probability of downtime, and also increase the duration and impact of downtime.
Usually this level of complex troubleshooting and systems outage has a contributing human error element. It’s only natural and happens in almost all environments at some time or another. We’re all human and we all make mistakes sometimes. This is just something that has to be accepted, and we have to put measures in place to reduce both the risk and the impact of those mistakes. This is where testing and validation in non-production environments become so important.
This 15 hour outage is a breach of the Bluehost.com SLAs, even if the first outage wasn’t. Their network availability from this one incident alone has dropped to roughly 99.8% on an annualized basis.
Lessons and Takeaways
Validation of all system components is important to ensure they function as expected, fail as expected, and that any updates are properly vetted before going into production. Knowing how a system behaves when everything is going well is not enough; you need to thoroughly understand what happens when things go wrong. But even this will not completely eliminate all incidents. Systems must be designed to tolerate incidents and be restored within the SLAs set for them. With critical systems where impacts must be limited, staggered updates may be required to ensure that not all system components can be impacted at the same point in time.
Complex systems are hard to troubleshoot, can increase the probability of downtime, and can increase the duration of downtime. Keep everything as simple as possible, and no more complex than absolutely necessary to meet your requirements.
Understand your availability requirements and impacts, not just annualized downtime, and the impact associated with downtime at different times of your business cycle. A thorough availability impact and requirements analysis should be part of every system design.
Hardware fails, software has bugs and people make mistakes. This is a reality, and you need to plan for it. It’s up to you to put measures in place to ensure your system meets its availability requirements. This means reducing single points of failure, which may include not relying on a single organization to run or host your systems. That could be as simple as having a primary site and a backup site that can be activated by a simple DNS change. The measures appropriate for your situation need to be financially justifiable, and based on a proper risk analysis. Another good saying is that ‘hardware eventually fails, and software eventually works’.
Clearly defined and understood standard operating procedures, troubleshooting procedures, and communications plans can help a lot in a crisis. But you also need well trained staff who know all of these procedures, and the procedures need to be tested regularly. Having them written down but not part of the organisation’s culture is no good.
Your hardware is now software, and it probably has bugs. Now that your hardware contains so much software, or is software controlled or software defined, it probably has lots of bugs. This means you not only have to test your application software, you have to test your hardware as well.
Use the concept of availability domains to limit risk. An availability domain is essentially the same thing as a failure domain: a way to limit the impact of an outage or incident to a subset of system components. The idea is that not all of the system will fail at the same time if it’s split or separated into more than one availability domain. If you design availability domains into your systems you can increase availability. Even with availability domains, in almost all systems there are a few single points of failure that could still impact availability; you just have to ensure the probability is as low as possible and the risks are understood and properly managed.
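The benefit of splitting a system into domains can be sketched with basic probability. This is a simplified model of my own (it assumes equally reliable, independently failing domains, which shared dependencies can undermine), but it shows why the probability of a whole-system outage falls off sharply as domains are added:

```python
def prob_total_outage(domain_availability_pct: float, n_domains: int) -> float:
    """Probability that every availability domain is down simultaneously,
    i.e. a whole-system outage, assuming independent domain failures."""
    p_domain_down = 1.0 - domain_availability_pct / 100.0
    return p_domain_down ** n_domains

# One 99% domain is down about 1% of the time; three independent 99%
# domains are all down together only about 0.0001% of the time.
print(prob_total_outage(99.0, 1))
print(prob_total_outage(99.0, 3))
```

The residual single points of failure mentioned above are exactly what breaks the independence assumption here, which is why they still need to be identified and managed even in a multi-domain design.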
Transparency and owning up to your mistakes is the best policy in a crisis. Transparent, direct and frank communication can buy you a lot of goodwill in a crisis and allow you the time you need to reach resolution. Transparency after the crisis is over is also important. Handling a disaster or a crisis well can enable you to retain customers you would otherwise have lost. Transparent root cause analysis and preventative action plans can build confidence in your customers, so they know you’ve learned from the experience and it is much less likely to happen again. This is not just a technical or IT problem; it is a business problem. The communications plan and transparency need to be organisation-wide.
Although I’m disappointed about the outage, I understand the complexity involved in troubleshooting and resolving something like this. I also take responsibility, as I know the buck stops with me. The decision to host with this provider was mine. I could host with a different provider or with multiple providers, but unfortunately that’s not economical. Sometimes you just have to accept a risk, hopefully one reduced by vetting and good SLAs. But I apologise to you, my readers, for the site being down for 15 hours; this outage was unacceptable. I hope we can all learn some lessons from this experience, my hosting provider included, and that we can provide much better availability to our customers as a result. As always, your feedback is appreciated.
This post appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster. Copyright © 2014 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.