Cloud computing services are highly automated, but still ultimately managed by humans — and humans make mistakes. That’s the key lesson from a major outage Microsoft’s Azure service suffered last month.
Microsoft’s Azure blog has published a detailed root cause analysis for the outage in its Azure Storage Service, which fell over on November 18. Because storage is also a key requirement for many virtual machines, other Azure customers were also affected.
The main cause of the problem? Azure updates are supposed to be rolled out gradually in “slices” across regions, but that process was bypassed: an update designed to improve performance was tested but then immediately put into production across the entire infrastructure, rather than one slice at a time. Microsoft’s own account explains how that happened:
The engineer fixing the Azure Table storage performance issue believed that because the change had already been flighted on a portion of the production infrastructure for several weeks, enabling this across the infrastructure was low risk. Unfortunately, the configuration tooling did not have adequate enforcement of this policy of incrementally deploying the change across the infrastructure.
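To make the idea of tooling-level enforcement concrete, here is a minimal sketch of a deployment gate that refuses to push a change to more than the next slice in sequence. It is purely illustrative and not Microsoft’s implementation: the `StagedRollout` class, the slice names and the `healthy` check are all hypothetical.

```python
class StagedRollout:
    """Hypothetical gate: a change may only advance one slice at a time."""

    def __init__(self, slices):
        self.slices = list(slices)  # ordered slices, e.g. test ring, then regions
        self.deployed = []          # slices that already carry the change

    def deploy(self, targets, healthy=lambda slice_name: True):
        """Deploy to exactly the next slice, or raise if the request skips ahead."""
        targets = list(targets)
        if len(self.deployed) == len(self.slices):
            raise PermissionError("change is already deployed everywhere")
        expected = self.slices[len(self.deployed)]
        # Reject "enable everywhere" requests and out-of-order slices.
        if targets != [expected]:
            raise PermissionError(
                f"only the next slice {expected!r} may be targeted, got {targets!r}"
            )
        # Assumed health signal: the previous slice must look fine before moving on.
        if self.deployed and not healthy(self.deployed[-1]):
            raise RuntimeError(f"slice {self.deployed[-1]!r} is unhealthy; rollout halted")
        self.deployed.append(expected)
        return expected


rollout = StagedRollout(["test", "pilot", "region-a", "region-b"])
rollout.deploy(["test"])   # allowed: first slice in the sequence
rollout.deploy(["pilot"])  # allowed: next slice in the sequence
```

With a gate like this, a request to enable the change on every slice at once fails regardless of how low-risk the engineer believes it to be, because the policy lives in the tool rather than in a process document.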
Microsoft has now updated its deployment platform so changes can only be made in stages, which means (in theory at least) that the problem won’t re-occur. It’s a reminder that there’s no such thing as 100 per cent uptime. Hit the link below for the full account.