Gmail suffered an outage yesterday, with some users unable to fully use the service for close to ten hours after two unrelated network failures struck at once. While that sucked if you were one of the people affected, it's worth noting that even a ten-hour outage still works out to 98.6 per cent uptime for the month.
Like many IT projects, Gmail aims for 99.9 per cent uptime. As the Google blog explains, the issue began around 6am PST, and some users couldn't fully send messages until 4pm. Ten hours out of a 30-day month (720 hours) equates to roughly 1.4 per cent downtime, which means 98.6 per cent uptime.
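To make that arithmetic easy to check, here is a minimal sketch in Python. The function name and the 30-day default are my own choices for illustration, not anything Google publishes.

```python
HOURS_PER_DAY = 24


def uptime_percent(outage_hours: float, days_in_month: int = 30) -> float:
    """Return monthly uptime as a percentage, given total outage hours."""
    total_hours = days_in_month * HOURS_PER_DAY  # 720 hours in a 30-day month
    return 100.0 * (1 - outage_hours / total_hours)


# A ten-hour outage in a 30-day month.
print(f"{uptime_percent(10):.1f} per cent uptime")  # 98.6 per cent uptime
```

Plug in a 31-day month and the figure barely moves, which is why the 30-day shortcut is good enough for a back-of-the-envelope check.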
Google says only 1.5 per cent of users were affected that badly, and notes that most functions (reading existing email and searching) continued to work, so I guess it can claim the 99.9 per cent if it wants to.
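One way to see how that claim could hold up is to weight the downtime by the share of users affected. The sketch below is my own simplification, not Google's method: it assumes the 1.5 per cent of users lost the full ten hours and everyone else lost nothing, and the function name is made up for illustration.

```python
def user_weighted_uptime_percent(
    outage_hours: float,
    affected_fraction: float,
    days_in_month: int = 30,
) -> float:
    """Monthly uptime averaged across all users (assumed model, see above)."""
    total_hours = days_in_month * 24
    # Fraction of all user-hours lost: only the affected users lose time.
    lost_fraction = affected_fraction * outage_hours / total_hours
    return 100.0 * (1 - lost_fraction)


# 1.5 per cent of users down for ten hours in a 30-day month.
print(f"{user_weighted_uptime_percent(10, 0.015):.3f} per cent")  # 99.979 per cent
```

Under those assumptions the averaged figure lands above 99.9 per cent, though that is cold comfort for the users who actually lost the ten hours.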
The cause of the problem? According to Google:
"The message delivery delays were triggered by a dual network failure. This is a very rare event in which two separate, redundant network paths both stop working at the same time. The two network failures were unrelated, but in combination they reduced Gmail’s capacity to deliver messages to users."
Two lessons here. First: the way complex network services interact means outages aren't always predictable. Second: make your uptime targets realistic. A 99.9 per cent target gives you less than 45 minutes in a 30-day month for everything to go wrong.
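If you want to sanity-check your own target, the downtime budget is simple to work out. This is a rough sketch assuming a 30-day month. The function name and the list of targets are just examples.

```python
def downtime_budget_minutes(target_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime allowed per month at a given uptime target."""
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in 30 days
    return total_minutes * (100.0 - target_percent) / 100.0


for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target} per cent uptime allows {budget:.1f} minutes of downtime")
```

At 99.9 per cent the budget comes to about 43 minutes, so a single ten-hour incident uses up more than a year's worth of allowance at that target.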
More On Gmail’s Delivery Delays [Google Enterprise Blog]