The Illusion Of Zero Downtime In IT

I received a press release from a vendor the other day which was riddled with the words "always-on" and "zero downtime" to describe what its offerings bring to IT environments. These terms are being used more frequently by service providers and IT vendors, but is zero downtime even possible?

The concept of improving uptime has become a popular topic as disaster recovery declines in popularity. Disaster recovery is a source of consternation for IT managers, according to Gartner research director for technology and services, Michael Warrilow.

"It's a very capital intensive and frustrating activity," he told Lifehacker Australia. "You're spending a lot of money on the off-chance that something goes wrong with your IT and even when it does go wrong, it might not even work properly. So the conversation is shifting from preparing for disaster to moving towards high levels of availability."

This explains the surge in the use of "zero downtime" in the IT space. You'd think that zero downtime would mean that an IT service or infrastructure will never suffer an outage, effectively providing 100 per cent uptime. Not the case, according to IDC senior market analyst, Prabhitha Sheethal Dcruz.

"Zero downtime frequently translates to 99.999 per cent uptime, which equates to 5.26 minutes of downtime per year," she told us. "While short outages may be acceptable for non-critical workloads, the same is not true for business-critical and mission-critical workloads where the downtime stakes can be very high - consider a stock exchange where a single lost transaction may incur a significant financial cost or a medical system downtime that can cost lives."

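The 5.26-minute figure follows from simple arithmetic on the uptime percentage. Here is a minimal sketch of that conversion; the function name and the choice of a 365.25-day average year are assumptions made for illustration, not something IDC specified.

```python
# Assumption: an average year of 365.25 days (accounts for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the minutes of downtime per year implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    # The share of the year that is NOT covered by the uptime figure.
    downtime_fraction = (100.0 - uptime_percent) / 100.0
    return downtime_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Illustrative uptime levels; only 99.999 is quoted in the article at this point.
    for uptime in (99.5, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(uptime)
        print(f"{uptime}% uptime -> {minutes:.2f} minutes of downtime per year")
```

Run as a script, this prints the yearly downtime for each level; "five nines" comes out at about 5.26 minutes, matching the figure quoted above.
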
IDC has done global research into the financial implication of IT system downtime and found:

  • For Fortune 1000 companies, the average total cost of unplanned application downtime per year is US$1.25 billion to US$2.5 billion.
  • The average cost of an infrastructure failure is US$100,000 per hour.
  • The average cost of a critical application failure is US$500,000 to US$1 million per hour.
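
To put those hourly figures in context, the sketch below multiplies an hourly cost by the downtime implied by a given uptime level. The function name, the 365.25-day year and the 99.9 per cent uptime used in the example are all assumptions chosen for illustration; only the hourly costs come from the IDC figures above, and they are averages rather than predictions for any one organisation.

```python
# Assumption: an average year of 365.25 days.
HOURS_PER_YEAR = 365.25 * 24


def annual_downtime_cost(uptime_percent: float, hourly_cost_usd: float) -> float:
    """Estimate yearly downtime cost in US dollars for a given uptime level."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    if hourly_cost_usd < 0:
        raise ValueError("hourly_cost_usd must not be negative")
    # Hours per year during which the service is expected to be unavailable.
    downtime_hours = (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR
    return downtime_hours * hourly_cost_usd


if __name__ == "__main__":
    example_uptime = 99.9  # hypothetical uptime level, not from the IDC research
    # Hourly costs taken from the IDC averages listed above.
    for hourly_cost in (100_000, 500_000, 1_000_000):
        cost = annual_downtime_cost(example_uptime, hourly_cost)
        print(f"US${hourly_cost:,}/hour at {example_uptime}% uptime -> US${cost:,.0f} per year")
```

Under these assumptions, 99.9 per cent uptime leaves about 8.77 hours of downtime a year, which at US$100,000 per hour comes to roughly US$877,000.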

Downtime can be costly, but if you're an organisation that uses third-party cloud services it's unrealistic to expect 100 per cent uptime from your providers.

"I guess it's technically feasible, but financially impossible," Warrilow said. The term "zero downtime" in the service provider space is essentially marketing speak, he said, and the best you're going to get from them is a quote on a service level agreement (SLA) offering a certain level of uptime.

"They usually offer 99.5 per cent or higher uptime and it's a matter of horses for courses," he said. "Cloud is most often used for general purpose workloads rather than mission-critical traffic, not for important things like key financial systems. You can also add additional layers of resilience to get downtime close to zero per cent."

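Warrilow's point about adding layers of resilience can be illustrated with a rough calculation. The sketch below assumes the layers are fully redundant and fail independently of one another, which is a strong simplification that Warrilow did not state; the function name is hypothetical.

```python
from math import prod
from typing import Sequence


def combined_uptime_percent(layer_uptimes_percent: Sequence[float]) -> float:
    """Estimate uptime when any one of several redundant layers can carry the load.

    Assumes failures are independent, so the service is down only when
    every layer is down at the same time.
    """
    if not layer_uptimes_percent:
        raise ValueError("at least one layer is required")
    for uptime in layer_uptimes_percent:
        if not 0.0 <= uptime <= 100.0:
            raise ValueError("each uptime must be between 0 and 100")
    # Probability that all layers are down simultaneously.
    all_down = prod((100.0 - uptime) / 100.0 for uptime in layer_uptimes_percent)
    return (1.0 - all_down) * 100.0


if __name__ == "__main__":
    # Hypothetical setup: one, two, then three providers at 99.5 per cent each.
    for layers in ([99.5], [99.5, 99.5], [99.5, 99.5, 99.5]):
        print(f"{len(layers)} layer(s) -> {combined_uptime_percent(layers):.6f}% uptime")
```

On these assumptions, two independent 99.5 per cent layers would give 99.9975 per cent. Real deployments often share failure causes such as networks, software bugs or operator error, so the result is better read as an upper bound than a promise.
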
Hardware is a different ballgame. There are products that do pass as "always-on", according to Warrilow, who highlighted IBM's mainframe servers and HP's NonStop systems.

For organisations seeking that zero downtime nirvana, Warrilow and Dcruz have a few tips:

  • Figure out if you actually need zero downtime: A lot of the time you don't. The higher the uptime you seek, the higher the IT cost, so going with a lower uptime offering can save a lot of money.

  • Make service providers and IT vendors stress test their zero downtime claims: Get them to come in and demonstrate their offerings in your environment, and consider negotiating penalties into your contract if they don't live up to their claims.

  • Plan well: Invariably, things will break. Impact analysis should form part of an organisation's greater business continuity and disaster recovery (BCDR) plan.

Does your company strive for 100 per cent uptime with its IT systems? Let us know in the comments.


Comments

    100% uptime and 100% service availability are 2 different things.

    Last year one of my services that I manage had 100% service availability due to the redundant design and capacity planning put in place.

    However, we had multiple upgrades during that time, when we operated on reduced redundancy to keep the environment running well.

    Measuring on service availability makes more sense in my mind.....

    The old tale of 'my server has 725 days uptime' really translates to I'm not patching or maintaining it!

    Cloud is most often used for general purpose workloads rather than mission-critical traffic

    Many companies run mission-critical systems using cloud-based infrastructure. AWS allows me to run mission critical systems across multiple data centres and scale my infrastructure up to handle increases in traffic. It makes it easy to create a production infrastructure with high availability.

    The author uses Downtime and Availability as one and the same thing, but with the words that he chose and the examples that he provided - it did not seem like he was even talking about 100% uptime systems like NonStop. The major reason I say this is because for a requirement like ATM machines where NonStop plays a crucial role, Zero Downtime and 100% Availability are one and the same thing, but for hosting an application like a CHAPS system (an interbank transfer application), you do not require Zero Downtime, do you? You can work with the 100% Service Availability. Which would render the importance of NonStop systems to a lower level. This was even mentioned in a similar manner in one of the other comments. It was very visible from a few points put forth by the author, such as - "consider money, do you really need it, can you make do with a lesser downtime?" These questions never come up with some systems. And I literally mean, never. Like the 911. Or the ATM, as is repeatedly the case with NonStop. Or even the Stock Exchange as pointed out by the article itself! Because it is not just about saving money, is it? It is about a customer, about a billion transactions, about saving lives even. And about the 5.26 minutes of downtime for a year? I've seen far better NonStop systems which do not really fall into that cadre. Yes, of course that is the basic 'risk' or even 'assumption' that is highlighted by HP during the purchase of a machine, but do you really encounter that? For the systems I've maintained - we've had failed disks, power outages, and, worse, CPU failures, but none of them has ever caused downtime for the whole system. Now, what is really his perception of Downtime and Availability?

    P.S: Never ever talk about IBM Mainframes and HP NonStop Systems in the same line. IBM Mainframes are high processing durable systems, whereas HP NonStop, as the name says, are the invincible kind. From the core architecture to the Operating System, the way they are designed is the opposite of the very nature of Mainframes.

    I agree that there is no such thing as "100% uptime" but 99.9% should be good enough for any business, especially if it's a financially backed SLA to ensure the provider puts his money where his mouth is.

    Another thing to consider would be data sovereignty. Choosing a 100% Australian Based Provider (both staff and servers) has great implications down the line, especially if (and when) something goes even a bit wrong.
