How Azure Fell Over And How Microsoft Fixed It

Cloud computing services are highly automated, but still ultimately managed by humans — and humans make mistakes. That’s the key lesson from a major outage Microsoft’s Azure service suffered last month.

Microsoft’s Azure blog has published a detailed root cause analysis of the outage in its Azure Storage Service, which fell over on November 18. Because storage is also a key requirement for many virtual machines, plenty of other Azure customers were affected as well.

The main cause of the problem? Azure updates are supposed to be rolled out gradually in “slices” across regions, but that process wasn’t followed: an update designed to improve performance was tested, then immediately pushed into production across the entire infrastructure. In other words, a sensible business process was bypassed:

The engineer fixing the Azure Table storage performance issue believed that because the change had already been flighted on a portion of the production infrastructure for several weeks, enabling this across the infrastructure was low risk. Unfortunately, the configuration tooling did not have adequate enforcement of this policy of incrementally deploying the change across the infrastructure.
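In practice, “enforcement” here simply means the deployment tooling refuses to skip slices. As a rough illustration only (a hypothetical Python sketch, not Microsoft’s actual tooling, with invented stage names and a stand-in health check), staged-rollout enforcement might look something like this:

# Hypothetical sketch of staged-rollout enforcement. Stage names, health
# sign-off and the promote() API are all invented for illustration; the point
# is that the tool itself refuses to skip ahead, rather than relying on an
# engineer's judgement call.

from dataclasses import dataclass, field


@dataclass
class Stage:
    name: str
    deployed: bool = False
    healthy: bool = False  # set True once monitoring signs off on this slice


@dataclass
class StagedRollout:
    # Stages must be promoted strictly in order; there is no
    # "deploy everywhere at once" call.
    stages: list = field(default_factory=lambda: [
        Stage("test flight"),
        Stage("single-region slice"),
        Stage("paired-region slice"),
        Stage("all regions"),
    ])

    def promote(self, stage_name: str) -> None:
        for i, stage in enumerate(self.stages):
            if stage.name == stage_name:
                # Every earlier stage must already be deployed AND verified healthy.
                blockers = [s.name for s in self.stages[:i]
                            if not (s.deployed and s.healthy)]
                if blockers:
                    raise RuntimeError(
                        f"Cannot promote to '{stage_name}': earlier stages "
                        f"not yet verified: {blockers}")
                stage.deployed = True
                return
        raise ValueError(f"Unknown stage: {stage_name}")


rollout = StagedRollout()
rollout.promote("test flight")
rollout.stages[0].healthy = True       # monitoring signs off on the test slice
try:
    rollout.promote("all regions")     # skipping ahead is rejected by the tool
except RuntimeError as err:
    print(err)

The design choice being illustrated: the policy lives in the tool, not in a runbook, so “this has been flighted for weeks, it’s probably fine” is no longer a decision a single engineer can act on.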

Microsoft has now updated its deployment platform so changes can only be made in stages, which means (in theory at least) that the problem won’t recur. It’s a reminder that there’s no such thing as 100 per cent uptime. Hit the link below for the full account.

Final Root Cause Analysis and Improvement Areas: Nov 18 Azure Storage Service Interruption [Azure Blog]

