Ten Things We Can Learn From #TelstraOutage

So Telstra suffered a Dodo-induced outage yesterday, millions of Australians were temporarily internet-free, and those who weren't were bombarded with messages on Twitter and Facebook about what went wrong. Normal service has been restored, but there are lessons for everyone from the experience, whether or not you're a Telstra customer.

10. One hour without net is not the end of the world

It is important to keep a sense of perspective. Yes, it's incredibly irritating when your net connection doesn't work, but billions of people get through their day without one. Depending on where you live, high-speed broadband can still seem like a pipe dream. If you really can't find something useful to do for an hour that doesn't involve using your computer, then you're suffering a serious failure of imagination. Go for a walk. Clean your desk. Sort the contacts on your phone.

9. Social media is fast, but not always fast enough

People instantly turned to Twitter to discover what was going on, but despite a hefty presence on social media, Telstra didn't offer any official confirmation until 2:44PM, when the outage had been apparent for nearly an hour. The general speculation was that Telstra itself couldn't update those accounts because no-one had any access. That didn't mean Twitter wasn't helpful; it made it clear there was a problem, but not that Telstra was responding to it.

8. ISP status pages should be updated more often

Here's a screen grab of Telstra's network status page during the outage:

Two things to note. Firstly, there's no mention of what was in fact a nationwide outage. Secondly, the status page apparently hadn't been updated since 9AM. That's really not good enough.

7. We're quick to spread the blame

Before Dodo fessed up to its part in the incident, there was much speculation as to what the cause might have been. Commenters variously mused that it might be a problem with Telstra's international links, a DNS resolution issue, or the result of some strange attempt to block all references to Megaupload. The truth emerged, but so did plenty of misinformation and speculation (along with many, many references to the dangers of trying to Google Google).

6. No ISP is truly independent

The nature of the Internet means that problems with one ISP can often affect another. Dodo's routing tables caused a problem for Telstra, and that in turn caused a problem for iiNet and others. You can choose an ISP based on perceived reliability, but no-one is entirely immune from network problems.

5. Psychics cannot be trusted

Enough said, really.

4. Rebooting doesn't always work

The standard advice whenever there's a computer problem is always: reboot and see if it works. This was one of the cases where that didn't help. Doesn't mean it's not generally a good strategy, just a reminder that it's not the definitive solution.

3. Networking can make your head hurt

Here's the official statement from Dodo on what happened:

Dodo experienced a hardware issue with a Cisco border router. This issue caused Dodo to broadcast network routes to Telstra. In normal circumstances, this would not result in a network outage. However, it appears that these routes were accepted by Telstra and propagated to Telstra's downstream customers rather than Telstra simply filtering the routes. This caused major issues for Telstra and its customers which should have been avoided.

I can follow this, but I can't claim I'm a networking expert -- and neither were many of the people commenting on the issue. As with many technologies, we only notice networks when they don't work.
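For anyone curious what "filtering the routes" means in practice, here is a rough sketch of the idea in Python. It is not Telstra's or Dodo's actual configuration, and real routers do this with BGP prefix lists rather than scripts. The allowed prefixes, the announced routes and the filter_routes function are all made up for illustration (the addresses come from ranges reserved for documentation). The principle is simply that a provider accepts from a customer only the routes that customer is entitled to announce and drops everything else.

```python
import ipaddress

# Hypothetical list of prefixes the customer is entitled to announce.
# These are documentation ranges, not real allocations.
ALLOWED_CUSTOMER_PREFIXES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def filter_routes(announced, allowed):
    """Split announced prefixes into (accepted, rejected).

    A route is accepted only if it sits inside one of the prefixes
    the customer is allowed to announce.
    """
    accepted, rejected = [], []
    for prefix in announced:
        network = ipaddress.ip_network(prefix)
        if any(network.subnet_of(allowed_net) for allowed_net in allowed):
            accepted.append(prefix)
        else:
            rejected.append(prefix)
    return accepted, rejected


# Hypothetical announcement: two legitimate routes plus two that the
# customer has no business advertising (including a default route).
announced = ["203.0.113.0/24", "8.8.8.0/24", "0.0.0.0/0", "198.51.100.128/25"]

accepted, rejected = filter_routes(announced, ALLOWED_CUSTOMER_PREFIXES)
print("accepted:", accepted)
print("rejected:", rejected)
```

In this toy run the two routes inside the customer's own ranges are accepted, while the default route and the unrelated 8.8.8.0/24 are rejected rather than passed on to anyone downstream.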

2. Bundling all your services leaves you exposed

The most visible strategy for phone companies these days is bundling: encouraging you to get your landline and your mobile phone and your internet service and your pay TV from a single supplier. The advantage is that you generally pay less and have fewer bills to handle. The big disadvantage is that if your provider has a major outage, you may not have a backup strategy.

The outage affected not just PCs using ADSL or cable, but mobile phones browsing the net as well. If your ADSL was through Telstra and your phone was through Optus, you could have tethered the phone as a temporary solution. That isn't possible if you're only using one provider. Whether that kind of "insurance" is worthwhile depends on your own circumstances, but it's something to remember when contemplating a bundling deal.

1. Working solely in the cloud can be risky

One of the reasons I've never shifted to Google Docs is that I want to be able to work on documents and spreadsheets when disconnected, and there have been long periods when Google hasn't offered an offline mode. My main motivation for that has been that I often work on planes or trains, but it's also a consideration if there's a network outage. Working exclusively in the cloud isn't possible if you can't connect in the first place.

Again, that's not to say that online storage and backup aren't sensible; they are. But "work local, store global" is still a more flexible strategy.


Comments

    I went to Telstra's status page during this outage and saw nothing to indicate there was a problem. Disappointing.

    Well, I am, (ok, 'was') a networking expert and I knew exactly what was going on at 2:01pm when I suddenly couldn't access Facebook. I did a traceroute to a random IP and hit a routing loop indicating a munted BGP table. BGP is a nasty beast when it goes wrong and can bite you in the butt very quickly.

    Even though Dodo broadcast the routes, I wouldn't blame them. The fact that Telstra were listening for broadcasts on client links is the problem. Nice work Telstra. haha.

    Numbers 3 and 7 are intimately related. I wasn't too sure of what was going on at my data centre until I read the ANOG mailing list. But still laughed uproariously at suggestions that DNS was broken.

    Is #4 thrown in this list just to make the total a nice round number?

    Anyone else waiting with bated breath for the end of the sentence on point #9?

    This wasn't just a one-off issue for an hour; we have had this problem in the Albury area for the last several days and still have a connection that is slow and intermittent, to say the least.

    Isn't it a shame that the technical community (I'm one of them) can't use a language that the regular punter can understand when they communicate? What happened... Someone took some of the direction signs down and swapped them for others, and your internet traffic was sent the wrong way for a short time. We've now put the signs back up in the correct place and your internet now works as it did before. Much simpler.

      Would be good if the technical jargon was widely understood too - knowledge is power.

        While I'd agree that the lambchop's example is probably too simplistic, I can't say I agree that people need to care too much about the technical jargon.

        The way I see it, I pay somebody to know all of the hard core knowledge about how my car works, and they pay me to know all of the hard core technical jargon about keeping their Internet connection running.

    Regarding point one, yes that's true but you never know until it comes back how long you'll have to wait. My company needed to authorise our payroll online during the outage, yes there are other options but they take time to set up through the bank and we weren't sure whether to proceed or not. Luckily Telstra came back in time!

    I remember reading about a large-scale BGP outage a few years ago caused by a Pakistan ISP router advertising bogus BGP info. Seems similar to what happened yesterday - http://www.infoworld.com/t/applications/youtube-outage-underscores-big-internet-problem-702

    Hardcore!

    What I learned (or relearned) is that there's no subject a publication desperate for page impressions can't make the object of a "Top Ten" list.

      OTOH, they did have a considerable amount of time to consider what they might put in their top ten list…
