I've been collecting anecdotes from a variety of sources about big IT failures they have been involved in or witnessed. Not everyday, e-mail-went-down-for-30-minutes failures, but the big stuff.
I was involved with a project for a mid-sized on-line payment processor that, in an effort to reduce its capital investments, elected not to acquire the redundant components of an architected solution. To add insult to injury, the same organization attempted to minimize its ongoing power costs by over-subscribing its allocated power to the point where it was unable to restart its infrastructure without manual intervention. This particular organization was processing in excess of $500,000 per hour in on-line gaming payments, yet due to its "frugalness" experienced two outages in 12 months, each of which lasted longer than one hour. In this case, saving $50,000 cost the company $1 million in lost transaction value, plus significant SLA credits that exceeded the original cost savings.
As a line-of-business manager, my initial reaction to this story is, Where the hell was the CFO, particularly if, as Craig says, this was a mid-sized company? Surely somebody with an eye toward risk/reward was monitoring these issues, given they reside primarily at the contractual level, not in the execution of a given technology. It's a vendor contract, after all.
At mid-sized companies, execution of these deals still falls under IT, of course, but my experience has been that Finance maintains a more intimate role in reviewing contracts. And smaller companies are flatter, in general, so some LOB should have been wailing about this before things went south.
However, I am routinely guilty of assigning responsibility for anything but purely technical failure to the business, not IT, and that's not the path folks in IT want to be on as they strive for their "seat at the table" when it comes to strategic growth.
So, I asked Andrea Receveur, our very sharp program manager here at IT Business Edge, what she made of the anecdote. Andrea wrote:
From this information I would say that the project team did not consider the risks of the project well enough. They didn't think through the potential of an unplanned outage and how they would continue to work seamlessly (or as close to it as possible) through the outage. The company should have evaluated the cost of downtime which would have told them that if they were down for 60 minutes it costs them $1 million. This would have enabled them to come up with acceptable SLA times and it would have been a red flag to identify the need for redundancy.
I would ask, where was the technical team during the requirements process? If the requirement was set that the system could not be down for more than five minutes, then the tech team would have been able to justify the need for redundancy. If there was no requirement to address downtime acceptance then at the very least the tech team should have questioned it, especially when the decision to not acquire the redundancy components was being discussed.
Finally, I would say that the system was not subjected to any kind of stress testing. Load balancing services are readily available, as are test plans specifically designed for stress testing. This type of testing doesn't have to be complicated. Even though hardware is becoming so high-powered that one machine can push through a tremendous load amount, companies cannot rely on single machines to do all the work. Scalability should be one of the first things a project team discusses when they begin the requirements/design phase of any new or upgraded system.
As I said, sharp.
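To put rough numbers on Andrea's cost-of-downtime point, here is a minimal Python sketch using only the figures from the anecdote: $500,000 per hour in payments, two outages of at least one hour each, and $50,000 saved by skipping redundancy. The function names and the `sla_credits` parameter are my own illustrative assumptions, and treating all transaction value as lost is a simplification, since a processor's actual revenue is only a slice of what it handles.

```python
def downtime_cost(value_per_hour, outage_hours, sla_credits=0.0):
    """Lost transaction value across all outages, plus any SLA credits paid.

    sla_credits is a hypothetical input; the anecdote gives no exact figure.
    """
    return value_per_hour * sum(outage_hours) + sla_credits


def break_even_minutes(value_per_hour, redundancy_cost):
    """Minutes of downtime at which lost value equals the redundancy cost."""
    return redundancy_cost * 60 / value_per_hour


# Figures taken from the anecdote; outage durations are a lower bound
# ("each of which lasted longer than one hour").
VALUE_PER_HOUR = 500_000
REDUNDANCY_SAVINGS = 50_000
OUTAGE_HOURS = [1.0, 1.0]

loss = downtime_cost(VALUE_PER_HOUR, OUTAGE_HOURS)
minutes = break_even_minutes(VALUE_PER_HOUR, REDUNDANCY_SAVINGS)

print(f"Lost transaction value (at least): ${loss:,.0f}")
print(f"Downtime that wipes out the savings: {minutes:.0f} minutes")
print(f"Loss vs. savings: {loss / REDUNDANCY_SAVINGS:.0f}x")
```

Under those assumptions, the savings are gone after a few minutes of downtime in a year, which is the kind of number that would have raised the red flag during requirements.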
The one question that Andrea's note fires in my little LOB brain is the degree to which what I generally consider backbone issues, like hosting, are included in requirements and close, pan-division "project team" evaluation. Most of the project teams I've been on -- a painful lot, at companies ranging from CNET to an assortment of SMBs -- have concerned themselves with functionality and payoff metrics. Issues like the hosting contract were, candidly, left to IT. After all, it's IT's budget and IT's expertise -- I don't argue lines of code with developers, either.
So, I am curious. To what degree are operational contracts, like hosting, subject to project team review in your company?