Microsoft Azure outage lessons

Cloud outages can be a serious pain for organisations, and sometimes it takes a high-profile emergency to get providers and customers to change their ways.

After Microsoft Windows Azure’s well-publicised outage this week, I spent some time talking to a handful of Azure customers via phone, email, and Twitter. Here are some observations and important lessons for cloud customers and cloud providers.

Cloud providers continue to track cloud outages/issues based only on availability

Azure’s health dashboard originally stated that only 3.8 per cent of customers were affected by this outage. There was no context around where the 3.8 per cent came from or how it was measured, but I spoke to several customers who suspect they were not included in that figure. The percentages on the dashboard were recently increased. Broken down by region, the latest affected customer percentages are 6.7 per cent, 37 per cent, and 28 per cent (and may still change). Some customers told me that various Azure roles (web, worker, VM) were up and online, but that service performance was degraded to the point of being unusable. Because most provider service-level agreements (SLAs) are based upon uptime and availability, and not performance or response time, customers in this position may not be counted as affected.

You can follow some of my interactions via Twitter (@kylehilgendorf) to see a couple of examples. Providers must start including performance and response SLAs in their standard service. A degraded service is often as damaging as a down service. A great quote came in on Twitter this morning via @qthrul, “…a falling tower is ‘up’ until it is ‘down’.” A falling tower is not very useful for most customers.
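To make the gap between the two kinds of SLA concrete, here is a minimal sketch in Python. It is not how Azure or any other provider measures impact; the `Probe` type, the sample values, and the 2,000 ms threshold are all assumptions made up for illustration. It only shows how the same set of health checks can look healthy when judged on availability and poor when judged on response time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Probe:
    """One hypothetical health-check sample against a customer endpoint."""
    up: bool           # did the endpoint respond at all?
    latency_ms: float  # response time; ignored when up is False


def availability(probes: List[Probe]) -> float:
    """Fraction of probes that got any response (the uptime-only view)."""
    if not probes:
        raise ValueError("need at least one probe")
    return sum(1 for p in probes if p.up) / len(probes)


def usable_fraction(probes: List[Probe], max_latency_ms: float) -> float:
    """Fraction of probes that responded within the latency threshold."""
    if not probes:
        raise ValueError("need at least one probe")
    ok = sum(1 for p in probes if p.up and p.latency_ms <= max_latency_ms)
    return ok / len(probes)


# Invented samples: the service mostly answers, but often far too slowly.
samples = [
    Probe(up=True, latency_ms=120),
    Probe(up=True, latency_ms=9000),
    Probe(up=True, latency_ms=15000),
    Probe(up=True, latency_ms=200),
    Probe(up=False, latency_ms=0),
]

uptime_view = availability(samples)                # 4 of 5 probes answered
performance_view = usable_fraction(samples, 2000)  # 2 of 5 answered in time
```

With these invented numbers, an uptime-only measure counts four of five probes as healthy, while a response-time measure counts only two. That is the "falling tower" in numeric form.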

Service dashboards continue to rely on the underlying cloud service being online

The Azure Service Dashboard has itself been only intermittently available. Throughout the outage, I have had about a 25 to 30 per cent success rate in getting the dashboard to load. I’ve been telling providers frequently that service health systems and dashboards must be hosted independently from the provider’s cloud service. If the cloud service is down or degraded, customers must still be able to see the status at all times.

I recently finished a lengthy document on evaluation criteria for public IaaS providers that will be published in the near future, and one of those criteria specifically states this as a requirement. If the service dashboard is the primary vehicle by which cloud providers communicate outage updates, it must be up while the service is down.

Customers can never get enough information during the outage from the provider

Looking back to the AWS and Microsoft outages of 2011, it became very clear that frequent status updates are paramount during an outage. AWS led the way with updates every 30 to 45 minutes through its painful EBS outage and Ireland issues. While updates don’t solve the problem, they do demonstrate customer advocacy and concern. Some customers told me this morning they feel completely in the dark.

There is no reason why a cloud provider should not have a dedicated communication team providing updates at least every 30 minutes throughout the entire outage. Microsoft seems to have settled into a good cadence of more frequent updates late this morning, but there were large gaps in updates when the outage first occurred.

More important, in my opinion, is a thorough post-mortem on the outage once the service has been restored. This should come within three to four days of the outage and must be very open and honest about the root cause, the fix, and the lessons for the future. Providers, please note: the world is very smart. If a provider tries to mask or hide any of the details, it will reflect negatively on them. Honesty wins.

We all know outages are inevitable, but in the midst of one, the pain is real

I’ve heard from some customers who were heavily affected and, as a result, are very frustrated and disappointed. When a cloud service has a good track record, we all accept that an outage will happen at some point. Yet in the middle of an outage, emotion gets involved, which brings me to my next point.

Customer application design needs to continue to evolve

As with previous cloud outages, this one shows that customer application design must continue to evolve to account for possible (some would say probable) cloud outages and issues. No cloud service is identical to another and each has its own unique design and configuration options.

Most cloud services have the concept of zones and regions from a geographical or hosting location standpoint. In most cloud outages, not every zone or region is affected. Therefore, the best-prepared applications will be those designed cross-zone and cross-region to avoid an outage or degradation in any one area.

However, this comes with extreme complexity and increased cost, as much as 10 times the cost advertised by providers. If you are running a critical application at a cloud provider, expect an outage, design for resiliency, and be prepared to pay for it. This may also mean that you have to hire or retain some very skilled cloud staff.
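As a rough illustration of the cross-region idea, the Python sketch below tries a list of regions in order and returns the first successful response. It is a simplified sketch and not any provider's API: the region names, the `fetch` callable, and `fake_fetch` are hypothetical stand-ins, and a real design would also need timeouts, data replication, and traffic routing, which is where much of the cost and complexity comes from.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(regions: Sequence[str],
                       fetch: Callable[[str], str]) -> str:
    """Try each region in order and return the first successful response.

    `fetch` is a caller-supplied function (hypothetical here) that performs
    the request against one region and raises an exception on failure.
    """
    last_error: Optional[Exception] = None
    for region in regions:
        try:
            return fetch(region)
        except Exception as exc:
            # Remember the failure and fall through to the next region.
            last_error = exc
    raise RuntimeError("all regions failed") from last_error


def fake_fetch(region: str) -> str:
    """Stand-in for a real request: pretend the first region is down."""
    if region == "region-a":
        raise ConnectionError("region-a is unavailable")
    return "response from " + region


result = call_with_failover(["region-a", "region-b"], fake_fetch)
```

Here the call against the made-up `region-a` fails and the result comes from `region-b`. Treating a slow response as a failure (for example, by having `fetch` raise on a timeout) would extend the same pattern to degraded regions as well as ones that are fully down.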

It is always a sad day as a cloud analyst to see these outages. However, it seems that significant change in the industry, at both a provider and customer level, only tends to come after an emergency.

Kyle Hilgendorf works as a principal research analyst in Gartner's IT Professionals service. He covers cloud computing (external and hybrid), as well as application, desktop and server virtualisation.