Today we experienced the second outage to this blog in the last 12 months. The outage was 15 hours in total, unacceptably long in my view for a professional hosting company. It was very inconvenient to me, as I’m sure it was to all of you as well. But it does serve as a good reminder about the maxims of cloud computing. Mind you, the maxims don’t just apply to cloud computing; they are general system design principles that need to be considered and dealt with. They apply particularly to Cloud Computing, Outsourcing or Hosting, where you are delegating some form of control over your systems to a third party. Maxims are generally established propositions or principles, at least if you’re talking about legal maxims, which I find fascinating to read. In the context of Cloud Computing the title of this article gives them away: Hardware Fails, Software Has Bugs and People Make Mistakes. It’s up to us to deal with these, as they are guaranteed to happen. This article covers some lessons that can be learned from this recent experience in the hope that we can all have higher availability.
High Availability Doesn’t Seem So High When It Happens At The Wrong Time
Even though this was a 15 hour outage, if it is the only outage this year (and I hope it is), the hosting provider’s availability would still be around 99.8%. That’s quite high, isn’t it? Well, it depends. It depends on what the impact on your system is. The impact will likely depend on the type of business you operate and the type of function the system performs for your business. It will also depend on the time of month that the outage occurs.
I had a customer example once where a 3 day outage happened at exactly the wrong time. The outage resulted in a loss in the hundreds of millions of dollars, which worked out to something like $10 million per hour of downtime. Fortunately, although the impact was incredibly significant for this organisation, it didn’t cripple it to the point of extinction. But even with this three day outage, which happened at the wrong time, the system availability was still measured at 99%. The point is you can’t just take an annualized availability metric on its own. You must understand the context, the impact, and the other variables.
Annualized system availability is one thing; maximum tolerable downtime might be something completely different. For example, if you have an availability SLA of 99.5%, on an annual basis you are accepting 1.83 days of downtime. But what happens if this falls during the busy Christmas retail period and you’re running an online store expecting a lot of customers? Almost 2 days of downtime would be catastrophic to sales, and could potentially put you out of business. So it’s unacceptable for all that downtime to happen in one incident; perhaps your maximum tolerable downtime is 2 hours or 4 hours. Even this would be a lot, but it might be tolerable. This means that your system must be able to recover from any failure and be operational again within 4 hours of any individual incident, no matter what has failed or for what reason, even if over the course of a year you accept it might be unavailable for almost 2 days in total. These sorts of metrics in your SLAs are absolutely critical if your suppliers are to deliver what you actually expect.
The Maxims of Cloud In Action
In the case of this site I have no such SLA for maximum tolerable downtime. However, my hosting provider, Bluehost.com, does have a network availability guarantee of 99.9%. They have multiple levels of redundancy, skilled staff, their own datacenters and so on. So what happened to cause a second outage in 12 months, and in this case a 15 hour outage in a single incident?
I suspect there are multiple reasons, and hopefully a full root cause analysis is made available to customers, along with a preventative action plan that gives customers confidence that steps are being taken to reduce the risk of another similar incident. This post-incident process is standard procedure in many environments, and the transparency helps everyone learn from the incident and work together to prevent it happening again.
The information that has been made available suggests that a firmware bug in the networking equipment caused the outage. Because so much of our hardware is now software, or software controlled (software-defined), bugs can have disastrous cascading impacts. The bug itself is not the only problem (some bugs only surface after a period of time, long after an update); the complexity of the systems involved, and the troubleshooting process you have to go through just to determine that it is a bug rather than something else, is immense. Complex systems in and of themselves can produce a higher probability of downtime and also increase the duration and impact of downtime.
Usually this level of complex troubleshooting and system outage has a contributing human error element. It’s just natural and happens in almost all environments at some time or another. We’re all human and we all make mistakes sometimes. This is just something that has to be accepted, and we have to put measures in place to reduce the risk and the impact of making mistakes. This is where testing and validation in non-production environments become so important.
This 15 hour outage is a breach of the Bluehost.com SLA, even if the first outage wasn’t. From this one incident alone their annualized network availability has dropped to roughly 99.8%, below the 99.9% guarantee.
Lessons and Takeaways
Validation of all system components is important to ensure they function as expected, fail as expected, and that any updates are properly vetted prior to going into production. Knowing how a system behaves when everything is going well is not enough; you need to thoroughly understand what happens when things go wrong. But even this will not completely eliminate all incidents. Systems must be designed to tolerate incidents and be restored within the SLAs set for them. With critical systems where impacts must be limited, staggered updates may be required to ensure that not all system components can be impacted at one point in time.
Complex systems are hard to troubleshoot, can increase the probability of downtime, and can increase the duration of downtime. Keep everything as simple as possible, and no more complex than absolutely necessary to meet your requirements.
Understand your availability requirements and impacts: not just annualized downtime, but the impact associated with downtime at different points in your business cycle. A thorough availability impact analysis and requirements definition should be part of every system design.
Hardware fails, software has bugs and people make mistakes. This is a reality, and you need to plan for it. It’s up to you to put measures in place to ensure your system meets its availability requirements. This means reducing single points of failure, which may include not relying on a single organisation to run or host your systems. It could be as simple as having a primary site and a backup site that can be activated by a simple DNS change. The measures appropriate for your situation need to be financially justifiable and based on a proper risk analysis. Another good saying is that ‘hardware eventually fails, and software eventually works’.
Clearly defined and understood standard operating procedures, troubleshooting procedures, and communications plans can help a lot in a crisis. But you also need well trained staff who know all of these procedures, and the procedures need to be tested regularly. Having them written down but not part of the organisation’s culture is no good.
Your hardware is now software, and it probably has bugs. Now that your hardware contains so much software, and is software controlled or software defined, it probably has lots of bugs. This means you not only have to test your application software, you have to test your hardware as well.
Use the concept of availability domains to limit risk. An availability domain is basically the same thing as a failure domain: conceptually, a way to limit the impact of an outage or incident to a subset of system components. The idea is that not all of the system will fail at the same time if it’s split or separated into more than one availability domain. If you design availability domains into your systems you can increase availability. Even so, in almost all systems there are always a few single points of failure that could still impact availability; you just have to ensure the probability is as low as possible and the risks are understood and properly managed.
Transparency and owning up to your mistakes is the best policy in a crisis. Transparent, direct and frank communication can buy you a lot of goodwill in a crisis and allow you the time you need to get to resolution. Transparency after the crisis is over is also important. Handling a disaster or a crisis well can enable you to retain customers you would otherwise have lost. Transparent root cause analysis and preventative action plans build confidence in your customers, so they know you’ve learned from the experience and it is much less likely to happen again. This is not just a technical or IT problem; it is a business problem. The communications plan and transparency need to be organisation-wide.
Final Word
Although I’m disappointed about the outage, I understand the complexity involved in troubleshooting and resolving something like this. I also take responsibility, as I know the buck stops with me. The decision to host with this provider was mine. I could host with a different provider or with multiple providers, but unfortunately that’s not economical. Sometimes you just have to accept a risk, which is hopefully reduced by vetting and good SLAs. But I apologise to you, my readers, for the site being down for 15 hours; this outage was unacceptable. I hope we can all learn some lessons from this experience, my hosting provider included, and that we can provide much better availability to our customers as a result. As always, your feedback is appreciated.
—
This post appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster. Copyright © 2014 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Comments

There are two interesting bits here:
1) You aren't sure what actually happened and don't know whether a full postmortem will be made available.
2) Your SLA was violated, but will that make you whole?
It's interesting to see how various service providers handle these things differently. So many times I've heard an outsourced IT org say "We are paying for the SLA" while at the same time the provider is saying "Make the SLA 100% for marketing purposes, we're never on the hook for actual damages." In your example of the company losing $10M an hour, I promise that if there was a 3rd party data center or service provider involved, the SLA didn't compensate them for that loss!
Postmortems are the best evidence of transparency between a provider and a customer. Before I sign up, I want to see where they have published all of the communication related to previous outages. I want to see how they communicate with their customers. I want to *know* that if there's an outage I'll know the good, bad and ugly of the event, because that's what I have to gauge risk and whether I want to stay.
Having been on the service provider side, I'd say the idea that customers buy SLAs is the single best piece of misdirection ever marketed.
Absolutely spot on. The SLA doesn't help you when the service has already been down way beyond the limits you've signed up for. But you've got to know what the limits are and manage the risks. Even with a highly professional and competent third party things can still go wrong. The only things that can help you then are transparency, learning from it, and hopefully the compensation clauses in the contract. But as we know they will not usually cover consequential losses. I like the idea of reviewing the previous outage communications and how they were handled. That is a good way of measuring risk. The $10m an hour example wasn't the result of a third party but a combination of hardware failures, software bugs and human error. The organisation concerned did a thorough root cause analysis, created an in-depth preventative action plan, performed a risk analysis, prioritised each risk area and its mitigation steps, and worked methodically over a period of time to reduce and eliminate the causes of the service interruption and to mitigate additional risks. Another outcome of the incident was additional business continuity processes outside of technology to reduce the impact of system outages. Not all solutions have to be technical or be part of IT. The business continuity plan, processes and risk assessments need to drive the requirements for IT solutions.
There's a *great* Google+ (I know, I know) community for nothing but postmortem reports of all kinds. It's fascinating reading:
https://plus.google.com/u/0/communities/115136140…
" hopefully a full root cause analysis is made available to customers, along with a preventative action plan that gives customers confidence that steps are being taken to reduce the risk of another similar incident."
In my experience infrastructure is the best area in IT for adopting this approach. I am quite sure you will find things improve.
<RANT>Unfortunately Software Projects crash and burn -or- succeed and no one works out why! That's how you have project after project failing and costing a fortune.
Trying to get clients to run Post Implementation Reviews seems like the hardest thing in the world. What are the Universities teaching people in Business courses? </RANT>
From a Data Protection perspective, I've felt SLAs are often just numbers used for the service provider's marketing. It's an agreement, and the customer has no way to check whether the SP is actually able to guarantee that value. We agree on an SLA; as a customer I hope the real availability will always be 100%, and if the SP violates the SLA, at least I want some money back.
Also, in your examples there is the other problem with SLAs: they are average values, usually measured on a yearly basis, while an outage impacts a business at every single event. I'd prefer to discuss (and write into the contract) RPO and RTO with a service provider rather than an SLA.
Hi Luca, I agree. I think the RPO and RTO, as well as the maximum tolerable downtime, should form part of the SLA. The system then needs to be designed to meet these SLAs, with the right procedures and processes put in place. Often the problem with the SLAs and the services is that the customer doesn't actually know the risk they're taking and therefore can't adequately plan. They have to base everything on the SLA contract, with no real way to know if the service provider can meet it. That's why I like Jeramiah's way of determining the risk by looking over the old incident root cause and corrective action reports. At least then you can measure how they respond to incidents. Otherwise you may find that the SLA isn't worth the paper it's written on, and is at best a target or aspirational goal rather than something that can be relied upon.