Microsoft's Azure cloud service suffered a lengthy outage this morning in many locations — one which might trigger the right to claim credits under its service level agreements (SLAs) which guarantee uptime of 99.9% or more. Here's what you need to know.
As Microsoft explained on its service update page:
Starting at 18 Aug 2014 17:49 UTC, we are experiencing an interruption to Azure Services, may include Cloud Services, Virtual Machines Websites, Automation, Service Bus, Backup, Site Recovery, HDInsight, Mobile Services, StorSimple and possible other Azure Services in multiple regions. Customers began to experience service restoration as updates were deployed across the affected environment.
The outage didn't hit every region or every service, but it did affect numerous Azure offerings. Particularly concerning would have been outages in Azure's virtual machine capability, as that's one of the most widely used features. Only two regions escaped that outage: US South Central and Asia Pacific Southeast (Singapore). Australian Azure users who had chosen Asia Pacific rather than US sites (while we all wait for the local Azure centres to open) thus may not have been impacted as much. Even customers using Hong Kong should have been automatically failed over to the Singapore centre.
So how long did the outage last? It was almost two hours between when Microsoft first announced the outage and when it said that customers began to experience service restoration. The exact figures are likely to have been different in every case, but outages of 120 minutes or more are certainly possible, and outages of more than 30 minutes for affected customers seem extremely likely. The exact length turns out to be important when we come to consider the SLA.
The exact SLA which applies to Azure services varies; this complete list shows that 99.9% applies to most options. General compute has a 99.95% SLA, and some storage and directory services have a 99.99% SLA.
In a 31-day month (44,640 minutes), a 99.9% uptime guarantee allows roughly 44 minutes of outage before the SLA kicks in. 99.95% gives you just 22 minutes, and 99.99% means under 5 minutes. So if the outage ran for more than 44 minutes, which seems to be the case for many customers, then even the loosest 99.9% threshold has been exceeded and affected customers are entitled to apply for service credits on any of those tiers. Possibly.
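The arithmetic behind those thresholds is easy to check. The following is a minimal sketch, assuming the simple model used above (allowed downtime is the month's total minutes multiplied by the shortfall from 100%). The function names are illustrative and not part of any Microsoft tooling, and the actual SLA terms for a given service may measure uptime differently.

```python
def allowed_downtime_minutes(sla_percent: float, days_in_month: int = 31) -> float:
    """Minutes of downtime a month can absorb before uptime falls below the SLA.

    Assumption: uptime is measured against every minute in the month.
    """
    total_minutes = days_in_month * 24 * 60  # 44,640 for a 31-day month
    return total_minutes * (100.0 - sla_percent) / 100.0


def breached_tiers(outage_minutes: float,
                   tiers=(99.9, 99.95, 99.99),
                   days_in_month: int = 31) -> list:
    """Return the SLA tiers whose downtime allowance the outage exceeds."""
    return [t for t in tiers
            if outage_minutes > allowed_downtime_minutes(t, days_in_month)]


if __name__ == "__main__":
    for tier in (99.9, 99.95, 99.99):
        print(f"{tier}%: {allowed_downtime_minutes(tier):.2f} minutes allowed")
    # A hypothetical 120-minute outage, the upper figure discussed above.
    print("Tiers breached by 120 minutes:", breached_tiers(120))
```

Under this model, a 30-minute outage would breach the 99.95% and 99.99% tiers but not 99.9%, which is why the exact duration matters.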
Again the terms vary between services, but the following extract shows that the process is generally quite convoluted:
To submit a Claim, Customer must contact Customer Support and provide notice of its intention to submit a Claim. Customer must provide to Customer Support all reasonable details regarding the Claim, including but not limited to, detailed descriptions of the Incident(s), the duration of the Incidents, the affected Protected Items and any attempts made by Customer to resolve the Incident. In order for Microsoft to consider a Claim, Customer must submit the Claim, including sufficient evidence to support the Claim, by the end of the billing month following the billing month in which the Incident which is the subject of the Claim occurs. Microsoft will use all information reasonably available to it to validate Claims and make a good faith judgment on whether the SLA and Service Levels apply to the Claim.
In other words: until Microsoft comes out with a more definitive update on what has happened, it's hard to say whether an SLA claim for credits will kick in. Take screenshots and note times, but save ringing customer support until that statement emerges (which will almost certainly be in the next day or so, unless a secondary problem emerges).
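For a sense of how long you have, the SLA extract sets the cut-off at the end of the billing month following the one in which the incident occurred. Here is a rough sketch of that date calculation, assuming (and this is only an assumption) that your billing months line up with calendar months; check your own billing cycle before relying on it. The function name is illustrative.

```python
import calendar
from datetime import date


def claim_deadline(incident: date) -> date:
    """Last day of the month after the incident month.

    Assumption: billing months coincide with calendar months.
    """
    if incident.month == 12:
        year, month = incident.year + 1, 1
    else:
        year, month = incident.year, incident.month + 1
    last_day = calendar.monthrange(year, month)[1]  # number of days in that month
    return date(year, month, last_day)


if __name__ == "__main__":
    # The outage discussed here began on 18 Aug 2014.
    print(claim_deadline(date(2014, 8, 18)))
```

On that assumption, a claim for this outage would need to be lodged by 30 September 2014.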
There's a chance Microsoft will use an "it was an external fault" clause to try to sidestep the SLA, but given that it has said updates were deployed to fix the issue, a software fault seems the more likely cause. (It might potentially be related to the failed patches for Windows released last week.) We'll update as new information emerges.