I first explored the topic of high system uptime earlier this year in Does Your SMB Really Need 99.999 Percent Uptime? In that post, I questioned the conventional belief that achieving 99.999 percent system uptime, also known as the "five nines," automatically equates to a better level of IT support. Among other factors, I pointed out that attaining better availability costs exponentially more for every additional "9" added behind the decimal place, and suggested that a small amount of downtime might have negligible business impact.
CIO and project turnaround expert Peter Kretzman, in a recent CTO/CIO Perspectives post, also thinks that focusing on the system availability metric by itself isn't terribly useful. Writing in Business impact and transparency: expressing system availability, Kretzman has the following to say about the fascination with raw statistics:
The underlying insight here is that raw outage statistics, whether they're expressed as an uptime percentage or as downtime hours, are nothing but a proxy for business impact. And not a very good one.
What I found interesting was Kretzman's anecdote about how many technicians will insist that a system is up "even though no one is able to access it." This is understandable at one level, given that response time and downtime are often explicitly stated in contractual agreements, and failing to meet them to the letter can result in hefty penalties. Beyond the importance of assigning someone to arbitrate such disputes, the anecdote also hints at how an "operational" system might not yet be capable of delivering acceptable performance to its users (during a RAID rebuild, for example).
Moreover, as most IT professionals will know, system maintenance and upgrade activities, which include security patching and software updates, are often excluded from the uptime computation, another point that Kretzman highlights. As you can imagine, a Web hosting company bringing down its servers for one full day a month as part of "scheduled maintenance" will find little favor with an online merchant in the weeks leading up to Christmas.
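To put rough numbers on the points above, here is a minimal Python sketch of the arithmetic. The function names, the 30-day month, and the assumption of zero unplanned downtime are mine, chosen purely for illustration; they do not come from Kretzman's post.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted in a period at a given availability target."""
    return period_hours * (1 - availability_percent / 100)


def effective_availability(unplanned_hours, scheduled_hours, period_hours):
    """Availability as a customer experiences it, counting scheduled maintenance."""
    total_down = unplanned_hours + scheduled_hours
    return 100 * (1 - total_down / period_hours)


# "Five nines" leaves only a few minutes of downtime per year.
five_nines_minutes = allowed_downtime_hours(99.999) * 60

# Hypothetical host: no unplanned outages, but one full day of scheduled
# maintenance in a 30-day month (720 hours).
reported = effective_availability(0, 0, 720)    # maintenance excluded
experienced = effective_availability(0, 24, 720)  # maintenance included

print(f"Five nines allows about {five_nines_minutes:.2f} minutes per year")
print(f"Reported: {reported:.2f}%  Experienced: {experienced:.2f}%")
```

Under these assumptions the host can report a perfect month while its customers see availability in the mid-90s, which is the gap the Christmas example describes.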
Count the Cost in Monetary Terms
The alternative model of measuring system availability that Kretzman suggests is surprisingly simple while staying relevant. The idea is to use actual dollars to determine the impact of downtime, by estimating the potential monetary loss from the historical revenue stream of a site. Various factors have to be taken into account, of course, since traffic and buying patterns vary throughout the day. Indirect factors such as slowdowns should also be factored into the equation.
By the same token, websites that depend heavily on newsletters and promotional mailers will suffer more should the site experience an outage immediately after a marketing drive. In the same vein, I would venture that a manufacturing business can benefit from measuring its downtime as the sum of the direct downtime cost and the resultant overtime pay for keeping the machines running longer.
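As a rough illustration of the dollar-based model, the sketch below prices an outage against an hour-by-hour revenue profile. Everything here is an assumption for the sake of example: the revenue figures are invented, the function name downtime_cost is hypothetical, and the loss_fraction parameter is simply one way of representing a slowdown as a partial loss.

```python
from datetime import datetime, timedelta

# Hypothetical average revenue per clock hour in dollars (index 0 = midnight).
# A real profile would be built from the site's own historical sales data.
HOURLY_REVENUE = [50] * 6 + [200] * 6 + [400] * 6 + [300] * 6


def downtime_cost(start, end, hourly_revenue=HOURLY_REVENUE, loss_fraction=1.0):
    """Estimate revenue lost between start and end.

    loss_fraction is 1.0 for a full outage; use a smaller value to model
    a slowdown that loses only part of the usual sales.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    cost = 0.0
    cursor = start
    while cursor < end:
        # Walk the incident one clock hour at a time so that each slice is
        # priced at the revenue rate for its own hour of the day.
        next_hour = cursor.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        segment_end = min(next_hour, end)
        hours = (segment_end - cursor).total_seconds() / 3600
        cost += hourly_revenue[cursor.hour] * hours * loss_fraction
        cursor = segment_end
    return cost


# Two one-hour outages that look identical in an uptime percentage.
night = downtime_cost(datetime(2024, 12, 2, 3, 0), datetime(2024, 12, 2, 4, 0))
afternoon = downtime_cost(datetime(2024, 12, 2, 14, 0), datetime(2024, 12, 2, 15, 0))
print(f"3 a.m. outage: ${night:.2f}  2 p.m. outage: ${afternoon:.2f}")
```

With these made-up figures, the afternoon outage costs eight times as much as the overnight one, even though both would contribute the same hour to a raw downtime statistic. A post-mailer spike or a manufacturer's overtime bill could be folded in by adjusting the profile or adding a further term.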
Obviously, counting the cost in monetary terms won't work in all settings, for instance with intranet sites and in-house resources such as a NAS. On the other hand, it can be a superior metric in other situations and will certainly help IT departments state their case when it comes to allocating budgets for technologies geared towards downtime remediation and prevention.