Massive Bank Failure Due to Human Error, IBM Blamed


Details of what happened were finally revealed, one week after a massive failure knocked Singapore's DBS Bank off the banking grid for seven hours. A faulty component in the disk storage subsystem serving the bank's mainframe had been triggering periodic alert messages, so a job to replace it was scheduled for 3 a.m. that fateful day. The situation spiraled out of control as a direct result of human error during the routine operation.


In an announcement to its customers, DBS Group chief executive Piyush Gupta wrote:


"So far, we understand from IBM that an outdated procedure was used to carry out the repair. In short, a procedural error in what was to have been a routine maintenance operation subsequently caused a complete system outage."


The error must have been bad, since a technical command team "comprising DBS and IBM staff was activated by 3.40 am," which went on to restart the systems at 5:20 a.m. Unfortunately, complications arose during the restart, and the bank-wide disaster recovery command center had to be activated at 6:30 a.m. In all, seven hours elapsed before DBS Bank's branches resumed normal operations and ATMs came back online at 10 a.m.


In some ways, this reminded me of Intermedia, which encountered a failure in its storage subsystems in May, and of Intuit's major outage in June that brought down its cloud-based accounting and tax software. Like DBS Bank's, Intuit's downtime appears to have been caused by human error, though I praised Intermedia's proactive approach of keeping its customers informed even as the company rushed to identify and rectify the hardware problem.


While most small and mid-sized businesses don't operate on the same scale as DBS Bank or even Intuit, I feel these recent disasters do offer a couple of timely reminders.


First of all, it is evident that even the best-laid plans can be undone by human error. And the mistake in DBS Bank's case appears to have been a relatively simple one, though fraught with severe repercussions. While there is no foolproof way to keep Murphy from sticking his head in, one way that SMBs can reduce the risk is to exercise appropriate wisdom when it comes to hiring or retaining IT personnel.


I don't have figures to back me up, but my work experience at a number of SMBs has made me uncomfortably aware that experience and specialized skills might not mean much to management bean counters intent on reducing head count. More often than not, employers make the mistake of lumping all IT staffers into the same box. After all, the procedures are already written down for the (lower-paid) IT employees to follow, right? That's usually true, until an inexperienced staffer panics and makes an elementary mistake that brings your system down with a bang.


While there is no evidence to suggest that this was what happened at DBS Bank, IBM was reported to have "taken steps to enhance training of our personnel related to current procedures and brought in experts from our global team to provide further assistance" to prevent a recurrence.


Finally, the other lesson to be gleaned here is the sheer importance of data storage hardware in this information-driven age. Both DBS Bank and Intermedia experienced problems that originated in their storage subsystems, while Intuit needed considerable time to properly restore and validate backed-up data after its outage.


I'll be writing more about this issue. In the meantime, since we are talking about the importance of storage hardware, you might also want to check out my review of the Synology DiskStation DS210+ NAS for SMBs.