Could The Telstra Outage Happen To You?


Most businesses have disaster recovery and business continuity plans. Often, they are baked into the design of systems so that, if something goes wrong, another system picks up the load and things keep running, more or less, as designed. Think of a modern passenger jet: there are so many redundant systems that the failure of a single part is highly unlikely to cause a disaster. Telstra's recent season of outages continued yesterday with a problem that knocked out much of their phone and data network. But was the outage avoidable?

It's easy, on the face of it, to say Telstra messed up and is running their network poorly. But here's what one of my former colleagues had to say about yesterday's outage.

Telstra released a statement about the issue, saying a software fault caused a piece of equipment to malfunction. When traffic was meant to fail over to another piece of hardware, there was a further fault, with the redundancy built into the systems not working as intended.

Typically, companies test for all sorts of scenarios to make sure their redundant systems pick up the load as expected - or at least they should. And I have little doubt Telstra does test. But with a network the size and complexity of Telstra's, it's very difficult to test every single possible scenario and potential knock-on effect.
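
To make that concrete, here's a minimal sketch - plain Python, nothing to do with Telstra's actual systems - of why the failover path needs tests of its own. The fallback branch is code that rarely runs in production, so a latent bug there is exactly the kind of thing routine testing misses, and a single fault becomes a cascade.

# A hypothetical sketch of a failover path with its own latent bug.
import random


class ServiceUnavailable(Exception):
    """Raised when a node cannot handle a request."""


def primary_node(request: str) -> str:
    # Simulate the "software issue" on the primary: it always fails here.
    raise ServiceUnavailable("primary node fault")


def secondary_node(request: str) -> str:
    # The redundant node. In a healthy failover it absorbs the load,
    # but if it carries its own latent bug, the failure cascades.
    if random.random() < 0.5:  # latent fault in the redundancy path
        raise ServiceUnavailable("secondary node fault")
    return f"handled by secondary: {request}"


def handle_with_failover(request: str) -> str:
    """Route to the primary, failing over to the secondary on error."""
    try:
        return primary_node(request)
    except ServiceUnavailable:
        # This branch rarely runs in production, which is exactly
        # why it needs explicit testing of its own.
        return secondary_node(request)


if __name__ == "__main__":
    for i in range(5):
        try:
            print(handle_with_failover(f"call-{i}"))
        except ServiceUnavailable as exc:
            print(f"cascading failure, request dropped: {exc}")

Run it a few times and roughly half the requests get dropped: the redundancy exists, but it only helps if the seldom-used path is tested as thoroughly as the happy one.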

So, while it's easy to pick on Telstra for yesterday's failure and the other issues they've faced over recent weeks, we should consider those failures in the context of the systems they have deployed and are managing.

When was the last time your company did some serious business continuity testing? Have you walked into your data centre and randomly pulled cables to see if the redundancy you've designed works? A former CIO of mine used to do exactly that - basically he was a live Chaos Monkey.
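
If pulling cables feels a bit drastic, a rough chaos-test script achieves much the same thing. The sketch below is illustrative only: the systemd service names and the health-check URL are placeholders for whatever your own environment uses, and you'd want to run it somewhere you can afford to break.

# A rough chaos-test sketch: disable one redundant instance at random,
# then check the service still answers. Names and URL are placeholders.
import random
import subprocess
import urllib.request

REDUNDANT_INSTANCES = ["app-node-1", "app-node-2", "app-node-3"]  # hypothetical systemd units
HEALTH_URL = "http://localhost:8080/health"                       # hypothetical health endpoint


def stop_instance(name: str) -> None:
    """Deliberately take one instance down (requires sufficient privileges)."""
    subprocess.run(["systemctl", "stop", name], check=True)


def start_instance(name: str) -> None:
    subprocess.run(["systemctl", "start", name], check=True)


def service_is_healthy() -> bool:
    """True if the front door still answers while one instance is down."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    victim = random.choice(REDUNDANT_INSTANCES)
    print(f"Pulling the (virtual) cable on {victim}...")
    stop_instance(victim)
    try:
        print("Service still healthy:", service_is_healthy())
    finally:
        # Always restore the instance, even if the health check blows up.
        start_instance(victim)

The finally block matters: a chaos test that forgets to restore the instance it killed is just an outage with extra steps.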

Perhaps yesterday's failure by Telstra is a salient reminder to do your own testing.


Comments

    When traffic was meant to fail over to another piece of hardware, there was a further fault, with the redundancy built into the systems not working as intended.
    But that's the issue: the repeated cascading failures.

    It takes one failure in a single location to take out the majority of the network. Not just a single region or service area - the whole nationwide network loses most of its capacity in mere minutes. A lightning strike in Orange took out most of the national network earlier this month, and in 2016 a technician caused one while working on a subnet of 10 mobile nodes, and the flow-on effect did exactly what happened the other day.

    2 years... and still each and every time one system error (human, software, hardware, damage) takes out the WHOLE NETWORK!

