Google Outage Teaches Us That Uptime Percentages Aren't Helpful

Over the weekend, Google suffered an outage for most of its services. Some users, including paid Google Apps For Business customers, weren't able to access them for close to an hour. But even with that outage, Google's uptime percentage looks fairly healthy.

Picture: Getty Images

According to Google's blog post, around 10 per cent of customers were unable to access Gmail, Google+ or other services for 55 minutes. In a typical 31-day month, if that was the only outage, that represents 99.88 per cent uptime. That's well short of the "four nines" or "five nines" demanded of some IT services, and it's also behind the 99.95 per cent that applies to many others, but it still sounds respectable on paper.

The issue, of course, is that if you're trying to work during those 55 minutes, the uptime percentage means nothing. If an outage happened for 2 minutes a day, the percentage would be about the same, but far fewer people would notice, and by the time they started complaining, the problem would be resolved. The lesson? While service-level agreements will often specify uptime as a percentage, it's how the problem is dealt with that really matters.
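
For anyone who wants to check the arithmetic, here is a minimal Python sketch. The function names are made up for illustration and aren't taken from Google or any particular SLA; it assumes a 31-day month and treats every minute of an outage as full downtime, ignoring how many users were actually affected.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(downtime_minutes: float, days: int = 31) -> float:
    """Percentage of the period during which the service was up."""
    total_minutes = days * MINUTES_PER_DAY
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


def downtime_budget_minutes(target_percent: float, days: int = 31) -> float:
    """Minutes of downtime the period can absorb while still meeting the target."""
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (100.0 - target_percent) / 100.0


if __name__ == "__main__":
    # One 55-minute outage versus 2 minutes of downtime every day for 31 days.
    print(f"single 55-minute outage: {uptime_percent(55):.2f}%")
    print(f"2 minutes every day:     {uptime_percent(2 * 31):.2f}%")

    # Downtime allowed per 31-day month at some commonly quoted targets.
    for target in (99.95, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% allows {budget:.2f} minutes of downtime")
```

The single 55-minute outage works out to 99.88 per cent and the 2-minutes-a-day pattern to 99.86 per cent, even though the two would feel very different to users. For comparison, a 99.95 per cent target leaves about 22 minutes of downtime in a 31-day month.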

Google's explanation of what went wrong also reminds us that debugging a live service can be very difficult:

At 10:55 a.m. PST this morning, an internal system that generates configurations—essentially, information that tells other systems how to behave—encountered a software bug and generated an incorrect configuration. The incorrect configuration was sent to live services over the next 15 minutes, caused users' requests for their data to be ignored, and those services, in turn, generated errors. Users began seeing these errors on affected services at 11:02 a.m., and at that time our internal monitoring alerted Google's Site Reliability Team. Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it; errors subsided rapidly starting at this time.

A self-correcting problem is good, but not having the problem in the first place is still better.

Today's outage for several Google services [Official Google Blog]


Comments

    In a typical 31-day month, if that was the only outage, that represents a 99.88 per cent uptime. That’s not quite the “four nines” or “five nines” demanded of some IT services, but it’s ahead of the 99.95 per cent that applies to many others.

    Am I reading that wrong? 99.95>99.88

    Not sure of the OLAs, but if only 10% of users were affected, it will most likely be classified as a partial outage and not count against the 99.95 availability stats. Typically only a full outage would be counted in the 99.95 average. Correct me if I'm wrong.
