BA Computer Woes: A Lesson In Recovery Vs Resiliency

Over the weekend, British Airways’ IT systems had a meltdown, resulting in the cancellation of all services out of Heathrow and Gatwick airports. An outage like this would be a major problem on a quiet day; the fact that it happened on a long weekend exacerbated the situation. Australian Business Traveller says it could cost the airline a pretty penny, as passengers can claim compensation ranging from €250 (A$375) to €600 (A$900) each under EU law.

The big question in my mind is how a system failure could cripple such a large and well-resourced organisation. BA’s Chairman & CEO, Alex Cruz, stood very earnestly in a high-vis vest and posted a video message saying the root cause was a “power supply issue” and that there was “no evidence of any cyber attack”. To be fair to Cruz, the immediate focus should be on recovering and restoring services, not on completing a forensic investigation of what went wrong.

Over the last few weeks I’ve been talking with a number of IT managers from large government departments and agencies, as well as private enterprise, about system resilience and recovery. One of the things that stands out to me locally is that much of the thinking around system failure is still framed in terms of recovery times you can measure with a clock rather than a stopwatch.

In other words, people are thinking about processes for restoring broken systems rather than designing for failure and making systems resilient. For most of the last three decades of computing, we have been raised on a diet of backup and recovery. And I’ve been an advocate of the 3-2-1 system for protecting data. That’s three copies of your data on at least two different media with at least one copy stored well away from your core data.
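The 3-2-1 rule is simple enough to check mechanically. Here is a minimal sketch in Python, assuming each copy of the data (including the production copy) is described by a hypothetical `BackupCopy` record; the names and fields are illustrative, not taken from any particular backup product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data. All fields are illustrative assumptions."""
    name: str
    media: str      # e.g. "disk", "tape", "cloud"
    offsite: bool   # stored well away from the core data?


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Return True if the copies meet the 3-2-1 rule described above."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical example: production disk, local tape, offsite cloud copy.
copies = [
    BackupCopy("production", "disk", offsite=False),
    BackupCopy("nightly-tape", "tape", offsite=False),
    BackupCopy("cloud-archive", "cloud", offsite=True),
]
print(satisfies_3_2_1(copies))
```

Dropping the cloud copy from the list fails the check on two counts: only two copies remain and none is offsite.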

That model is still valid for protecting data in case things get really bad and you suffer a catastrophic incident where data is at risk. But too much of the thinking is still mired in the backup/recovery world.

The big question businesses need to ask is “How long can I afford for my systems to be offline?”.

When I worked in the energy industry our infrastructure was designed around a corporate KPI. We had a system uptime goal that was calculated on the basis of what impact downtime would have on our stakeholders. We then built the systems to support that KPI.
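To make an uptime goal concrete, it helps to convert the percentage into a downtime budget. The sketch below does that arithmetic in Python. The availability targets shown are illustrative, not the actual KPI we used, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; ignores leap years


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


# Illustrative targets only.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime a year")
```

At 99.9 per cent the budget is a little under nine hours a year; at 99.999 per cent it is a handful of minutes, which is the difference between measuring recovery with a clock and with a stopwatch.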

In that case, it meant every server had redundant power supplies and network interfaces. We had redundant inbound and outbound network connections with different service providers. And we had two co-primary data centres with everything duplicated across both sites and replicated in real time.
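The value of that duplication can be estimated with some simple probability. The following Python sketch assumes component failures are independent, which is a big assumption: a shared power feed or a common software bug breaks it. The availability figures are hypothetical.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components where any one suffices.

    Assumes failures are independent, which real systems only approximate.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - component_availability) ** copies


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain where every component must be working."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical figures: each power supply and each network link is up 99% of the time.
single_server = serial_availability(0.99, 0.99)
redundant_server = serial_availability(
    parallel_availability(0.99, 2),  # dual power supplies
    parallel_availability(0.99, 2),  # dual network interfaces
)
print(f"single: {single_server:.4f}, redundant: {redundant_server:.4f}")
```

Under these assumptions, doubling up each component takes a server that depends on both from roughly 98 per cent to roughly 99.98 per cent availability.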

For some, that might be overkill. But our job was keeping the lights on for most of the country so it was appropriate. And the infrastructure was backed up with processes for regularly testing this through both structured testing and ad hoc simulations where someone would walk through a data centre and pull random cables out of machines.

There were also financial incentives/penalties that were applied corporately so every person in the business had some financial skin in keeping things running. That created a culture of designing everything on the basis something could go wrong.

Justifying this level of investment is difficult in many businesses. But it is possible. The most important thing to do, in my view, is to talk about system resilience in business terms. If we take the BA example, the impact of cancelling flights for 1000 passengers is easy to assess.

There are the refunds that are made. This translates into lost revenue. If passengers seek compensation that’s a further cost. In the aftermath of the incident, it’s likely bookings will take a hit. The amount of further lost revenue can be estimated. And there’s the recovery cost. That covers staff working over the weekend, consultants and contractors that are brought in, and emergency purchases of hardware to replace failed equipment.
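Those components add up in a straightforward way. Here is a rough sketch of the calculation in Python. Only the 1000 passengers and the €250 minimum compensation come from the figures above; the average fare, claim rate, lost bookings and recovery cost are made-up placeholders you would replace with your own numbers.

```python
from dataclasses import dataclass


@dataclass
class OutageEstimate:
    """Inputs to a back-of-the-envelope outage cost. Amounts are in euros."""
    passengers: int
    avg_fare: float                 # hypothetical refund per passenger
    claim_rate: float               # hypothetical fraction who seek compensation
    compensation_per_claim: float   # €250 to €600 under EU law, per the article
    lost_future_revenue: float      # hypothetical estimate of lost bookings
    recovery_cost: float            # hypothetical overtime, contractors, hardware

    def total(self) -> float:
        refunds = self.passengers * self.avg_fare
        compensation = (
            self.passengers * self.claim_rate * self.compensation_per_claim
        )
        return refunds + compensation + self.lost_future_revenue + self.recovery_cost


# Placeholder scenario, not BA's actual figures.
estimate = OutageEstimate(
    passengers=1000,
    avg_fare=300.0,
    claim_rate=0.5,
    compensation_per_claim=250.0,
    lost_future_revenue=50_000.0,
    recovery_cost=20_000.0,
)
print(f"Estimated cost: €{estimate.total():,.0f}")
```

Running the same calculation with the €600 maximum and a higher claim rate gives an upper bound, and the gap between the two is a useful range to put in front of the people who approve budgets.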

Those costs can be used to build a business case that helps put resources into developing systems that are designed for failure.

In BA’s case, this should be a relatively easy sell. After all, they fly aircraft that are designed with multiple, redundant systems so that the failure of a single component doesn’t compromise the safety of the passengers and crew.

In Australia, the new mandatory breach notification legislation can be used similarly to bolster protections around personal data. The threat of a fine that could be levied if a breach is found to be caused by a failure to have adequate protections in place could help make a business case for more resilient data protection.

The good news is that it is possible to build resilient systems today. We use them every day. While Facebook, Amazon, Google and others have had the occasional outage, their track record is very solid. And I can’t recall an incident where an entire service provider has been offline in the same way BA was crippled.

When we build systems, we need to build them with the assumption that things will go wrong. Whether that’s hardware failure, user error, cyber attack or some other incident, we should be designing with appropriate redundancy in mind. And that’s not purely an IT challenge – it’s one the entire business needs to engage in.