Dick Cosby, systems administrator and disaster-recovery manager for Electronic Data Processing Services in Richmond, Va., learned the hard way that disaster has a surprising way of finding you.
The company, an IBM shop with about 100 employees, provides support for trucking companies and other businesses. It is housed in a square building with a central open-air courtyard on the second floor, and its data center was in the center of the first floor, under the courtyard. So the company had prepared for leakage from above, for fire and for power outages, and had bought redundant equipment.
The building is not near any rivers or streams, is not in a low-lying area and has a storm drain on the property, with sump pumps on the ground floor. (For a look at how one medical facility dealt with a "no-man's land" flood, see Iowa Hospital's Disaster Plan Helps Keep Its Head Above Water.)
"We back up to tape every night, seven days a week. We keep the tapes off site," says Cosby. "At that time, we did not have a disaster-recovery center, but that was because we didn't think we'd ever have a loss-of-site event. We had equipment to take over should our primary equipment fail. The trouble is, once you have 4½ feet of water, it doesn't matter what you have."
He's referring to Hurricane Gaston in 2004.
"Gaston came through and they said it stalled over Richmond, but I know better. It stalled right over the top of my data center," Cosby says.
According to a University of North Carolina research paper, the storm dumped more than 12 inches of rain on Richmond.
Downtown Richmond flooded, and in response the city shut off the storm drains, Cosby says. With no place to go, the water kept backing up. Water was getting into the building, including the data center, but Cosby had sump pumps running.
"What I didn't know was that outside, the water had gotten up about 2 feet above the glass windows and about 10 minutes later, two of the windows crashed in. Then it was all over. The whole ground floor was completely flooded," he says.
The data center, about 2,500 square feet, held not only the IT equipment but also the phone equipment. All the breakers for the building were on the first floor, as was the generator, which sat in a vault.
"Everything the building used to run on was under water," Cosby says. "The phone-system outage was very devastating to us. All the backup systems, the tape libraries, all of it was flooded."
In the end, all the hardware had to be replaced. The company placed the order with IBM the following evening from a gas station fax machine. Cosby cites the help of business partner dp Systems and the quick response of IBM and Cisco, the phone-equipment vendor, as important factors in keeping the outage to a week.
"The experts thought we'd be out for a month, but we knew we couldn't last that long," he says of the recovery, which involved working 30 or 40 hours at a stretch.
The company lost more than 16 terabytes of data on site, but all of it had been backed up the night before.
"Thank goodness for LTO and at that time we had some older half-inch cartridges," Cosby says. "Trucking companies typically run 24/7 so we're using some IBM SAN storage using flash copy. ... We had been running this about six months by then, so the process was working very well. It allowed us to have a full backup of our data off site. ... Flash copy made a point-in-time copy of our data available to a different LPAR and then we'd ferry it on that LPAR and back it up to tape without any save window. The production systems do not know that we do this. No user jobs are affected. The storage itself has utilities inside that make this point-in-time copy available.
"We now have a disaster-recovery center in Mesa, Az., with duplicate hardware. The IBM SAN storage, with the hardware utilities, if a sector in Richmond is changed, that sector's immediately sent to Mesa. So the production systems don't have software; it's all hardware replication. IBM calls it PowerHA."
Though Cosby considers the replicated storage a form of cloud storage, he's still a big fan of tape.
"There are two (disaster) problems: One is loss of site, which was what we had. But the other is corruption. But if the data's corrupt in Richmond, it's also corrupt in Mesa, so we need tape backups to go back to the original data. Cloud computing might have ways to some degree to go back to that, but with a tape, I can go get exactly the files and objects that I need. I feel like it's not only better, but much easier and much less expensive to manage."
Since that expensive lesson in disaster planning, the company has moved the data center to the second floor, elevated the generator and bought redundant phone systems.
"We have ways to switch access so if we lose service, customers can still get through. That was a big deal for us," he says.
His advice on disaster-recovery planning?
"Regardless of how well you plan, something else could happen. Don't think you've got it all covered. ... The other thing is practice failovers. Don't say, 'We've got a plan' and never test or practice it. You've got to practice to get good at it."
See also Hurricanes' Wrath Drives Disaster Planning.