I wrote recently about the massive outage experienced by DBS Bank in Singapore for which vendor IBM has taken responsibility. The merits of outsourcing mission-critical systems to a third party aside, I felt that the entire incident makes an excellent case study in dispelling the perceived invulnerability of enterprise systems.
So what went wrong from a technical perspective? And what lessons can small and mid-sized businesses glean from this debacle? Understandably, some critics have charged that the bank has not revealed sufficient details pertaining to the outage. Based on the additional details that were released, my opinion is that there is at least adequate information for us to avoid committing the same mistakes.
According to the official news release, the first signs of trouble appeared on the morning of the day before the outage:
IBM software monitoring tools sent an alert message to IBM's Asia Pacific support centre located outside of Singapore. It indicated there was instability in a communications link in the storage system which was connected to a mainframe. At this point, the storage system was functioning. An IBM field engineer was despatched to the DBS data centre and was given approval by DBS to repair the machine.
What transpired next was a series of repeated - and erroneous - attempts to reseat and replace the affected data cable. The system finally exhausted its tolerance limits the fifth time this happened and promptly shut the entire storage array down. The system behaved this way because it was configured for data reliability; each attempt to reseat the data cable while the system was "live" likely leaked all kinds of extraneous errors into core systems.
If the correct procedures had been used, the storage system would have automatically suspended the communications link, and the machine would have instructed the engineer to replace the cable and both cards together, thereby maintaining the redundancy of the system.
As data integrity is considered a higher priority than availability, the storage system is designed to automatically cease communicating under these conditions. In doing so, the system preserved full data integrity. In spite of the machine's high availability and redundancy, these incorrect procedures caused the outage.
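As a rough mental model of that fail-safe behaviour, the policy might be sketched as follows. This is purely illustrative - the class and method names are hypothetical, the real array firmware is proprietary, and the five-strike limit is an assumption inferred from the report that the fifth reseat triggered the shutdown:

```python
class LinkFaultPolicy:
    """Illustrative sketch: suspend a faulty link, and halt the whole
    array once too many faults occur on a live link, favouring data
    integrity over availability."""

    MAX_LIVE_FAULTS = 5  # assumption: the reported fifth attempt tripped it

    def __init__(self):
        self.live_faults = 0
        self.array_online = True

    def record_live_reseat(self):
        """Called each time the cable is pulled while the link is live."""
        self.live_faults += 1
        if self.live_faults >= self.MAX_LIVE_FAULTS:
            # Stop serving I/O rather than risk corrupt writes
            # reaching core systems.
            self.array_online = False

    def correct_procedure(self):
        """The documented path: suspend the link first, then replace
        the cable and both cards together, preserving redundancy."""
        # No live fault is recorded, so the array stays online.
        return "link suspended; replace cable and both cards, then resume"
```

The point of the sketch is the asymmetry: the "wrong" path silently accumulates strikes toward a total shutdown, while the "right" path never touches the fault counter at all - which is exactly why repeated ad-hoc reseats were so much more dangerous than they appeared.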
As a result, IBM has announced that it will take steps to enhance the training of related personnel on "current procedures", as well as bring in "experts from its global team" to conduct a further incident study and come up with recommended actions.
So what can SMBs learn from this debacle? Simple: Staff training is non-negotiable.
And just because the same systems have worked for you in the past doesn't necessarily mean that they will continue to do so in the face of non-optimal or erroneous procedures. In addition, prior experience counts for nothing if you were doing it wrong all along.
Nor was this an isolated case of one poorly trained professional slipping through the cracks. It was clear that the on-site engineer was in communication with the support center; both parties agreed to simply "yank the cable" to check if the error was transient. Indeed, the timing of the various site visits suggests that on-site engineers and staffers on different shifts committed the same mistake.
The abuse finally resulted in the entire system grinding to a halt due to an aggressive configuration that favors data integrity. Not that I disagree with that design choice - I have a bank account with DBS, and would really hate mistakes where my money is concerned. But this was also clearly not something the staffers were cognizant of, or nobody would have swapped cables around like they did.
Whatever the case, data integrity checks while booting up resulted in lengthy delays in rebooting the system, so much so that normal operations were disrupted when the business day started. So what next for DBS? You can be sure that some drastic changes are taking place right now in the rank and file.
Already, Singapore's central bank, the Monetary Authority of Singapore, has censured the bank and imposed measures to punish the bank for the downtime; the bank's international reputation is also no doubt tarnished as a result.
For now, there is still time for your SMB to stop relegating staff training to the backburner, and do something about it today.