Most businesses have disaster recovery and business continuity plans. Often, they are baked into the design of systems so that, if something goes wrong, another system picks up the load and things keep running more or less as designed. A modern passenger jet, for example, has so many redundant systems that the failure of a single part is highly unlikely to cause a disaster. Telstra's recent season of outages continued yesterday with a problem that knocked out much of its phone and data network. But could the outage have been avoided?
It's easy, on the face of it, to say Telstra messed up and is running its network poorly. But here's what one of my former colleagues had to say about yesterday's outage.
For those complaining about @Telstra outages: Please remember: Humanity has never built a mobile network as fast or as complex as Telstra's 4G network. It is the premiere network of its type globally, and usage of it is exploding. We are pushing humanity into new territory here.
— Renai LeMay (@renailemay) May 21, 2018
Telstra released a statement saying that a software fault caused a piece of equipment to malfunction. When traffic was meant to fail over to another piece of hardware, a further fault meant the redundancy built into the systems did not work as intended.
Typically, companies test all sorts of scenarios to ensure their redundant systems pick up the load as expected, or at least they should. And I have little doubt Telstra does test. But with a network the size and complexity of Telstra's, it's very difficult to test for every possible scenario and potential knock-on effect.
So, while it's easy to pick on Telstra for yesterday's failure and the other issues they've faced over recent weeks, we should consider those failures in the context of the systems they have deployed and are managing.
When was the last time your company did some serious business continuity testing? Have you walked into your data centre and randomly pulled cables to see if the redundancy you've designed works? A former CIO of mine used to do exactly that - basically he was a live Chaos Monkey.
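If pulling real cables feels too risky as a first step, the same idea can be rehearsed in software. The sketch below is a toy illustration only, not how Telstra or any real chaos-testing tool works: the `Node`, `Service`, `pull_random_cable` and `drill` names are all hypothetical, and the "network" is just a list of objects. It takes one randomly chosen node offline, checks whether a request still gets served, then restores the node.

```python
import random


class Node:
    """A hypothetical service node that can be switched off."""

    def __init__(self, name):
        self.name = name
        self.up = True

    def handle(self, request):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Sends each request to the first node that answers."""

    def __init__(self, nodes):
        self.nodes = nodes

    def handle(self, request):
        for node in self.nodes:
            try:
                return node.handle(request)
            except ConnectionError:
                continue  # fail over to the next node in the list
        raise RuntimeError("total outage: no node could serve the request")


def pull_random_cable(service, rng):
    """Take one randomly chosen healthy node offline and return it."""
    victim = rng.choice([n for n in service.nodes if n.up])
    victim.up = False
    return victim


def drill(service, rng):
    """Return True if the service survives losing one random node."""
    victim = pull_random_cable(service, rng)
    try:
        service.handle("ping")
        return True
    except RuntimeError:
        return False
    finally:
        victim.up = True  # plug the cable back in


if __name__ == "__main__":
    rng = random.Random()
    redundant = Service([Node("primary"), Node("standby")])
    print("redundant pair survives:", drill(redundant, rng))
    single = Service([Node("only")])
    print("single node survives:", drill(single, rng))
```

A drill like this only proves that the failover path you thought of works. As the Telstra statement suggests, the faults that hurt are often in the redundancy itself, which is why testing against the real systems still matters.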
Perhaps yesterday's Telstra failure is a timely reminder to do your own testing.