Properly managed data centers are highly reliable, a key trait considering that they house large volumes of critical business data required for everyday operations. Because downtime can cost businesses thousands or even millions of dollars, these facilities need to be consistently monitored and managed to protect against severe weather, natural disasters and human error.
As data centers continue to expand their digital footprints with more interconnected solutions, it’s more critical than ever that proper systems and protocols be put in place for any type of emergency that might arise. In this slideshow, Asa Donohugh, director of property operations at Digital Realty, highlights nine steps that data center property managers need to consider to ensure data centers are properly cared for, especially in a culture that promotes transparency and safety.
Data Center Emergency Best Practices
Implement an Event Classification System
Large data centers should develop a standardized response for each type of event scenario. Classifying events as green, yellow or red, for instance, helps prepare the response team and sets expectations for the type of response required. For example, there is a big difference between an event that results in a loss of redundancy and one that directly impacts customers.
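As an illustration, a color-coded classification could be sketched in a few lines of Python. The meaning assigned to each color here (green for no loss of redundancy, yellow for loss of redundancy without customer impact, red for customer impact) is an assumption made for the example, as are the names `Severity`, `Event` and `classify`; a real facility would define its own criteria.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Assumed meanings, for illustration only.
    GREEN = "green"    # no loss of redundancy, no customer impact
    YELLOW = "yellow"  # redundancy lost, customers still unaffected
    RED = "red"        # customer-impacting event


@dataclass
class Event:
    description: str
    redundancy_lost: bool
    customer_impact: bool


def classify(event: Event) -> Severity:
    """Map an event to a color; customer impact outranks loss of redundancy."""
    if event.customer_impact:
        return Severity.RED
    if event.redundancy_lost:
        return Severity.YELLOW
    return Severity.GREEN


ups_fault = Event("UPS module failed, load on redundant module", True, False)
outage = Event("Cooling loss in a customer suite", True, True)
print(classify(ups_fault).value, classify(outage).value)
```

Checking customer impact first means an event that is both a loss of redundancy and customer impacting is always escalated to the higher level.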
Train the Operations Team on Emergency Response Scenarios
Keeping a trained response team on site 24/7, one that understands the critical data center infrastructure, allows staff to respond effectively and efficiently when events arise. Conducting regular mock drills with staff, and sometimes even customers, keeps everyone alert and ready to carry out the emergency response procedures required for issues such as power failures, control system failures or natural disasters.
Identify a Communications Team
Customers housed within a data center need to be kept informed during an event so that they can manage their own risk. A dedicated communications team coordinates internal and external communication, which frees the local team to resolve the issue.
Have a Communications System in Place
In addition to a communications team, data centers should have an Event Management System (EMS) to handle internal and external communication during events. This system lets data center providers give customers transparency around events quickly. An EMS can generate multiple emails, texts and/or calls to customers simultaneously, which helps the communications team keep customers and internal teams informed during an event.
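The fan-out behavior of such a system can be sketched as follows. This is a minimal illustration, not the interface of any particular EMS product: the `Contact` and `EventManagementSystem` classes, the channel names and the example addresses are all hypothetical, and the channels are stubs that record messages instead of sending them. For simplicity the sketch delivers messages one after another rather than in parallel.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A channel is any callable that delivers a message to an address.
Channel = Callable[[str, str], None]


@dataclass
class Contact:
    name: str
    addresses: Dict[str, str]  # channel name -> address


@dataclass
class EventManagementSystem:
    channels: Dict[str, Channel]
    log: List[Tuple[str, str, str]] = field(default_factory=list)

    def notify(self, contacts: List[Contact], message: str) -> int:
        """Send the message to every contact on each channel they registered."""
        sent = 0
        for contact in contacts:
            for channel_name, address in contact.addresses.items():
                send = self.channels.get(channel_name)
                if send is None:
                    continue  # no transport configured for this channel
                send(address, message)
                self.log.append((contact.name, channel_name, address))
                sent += 1
        return sent


# Stub transports that record messages instead of sending them.
outbox: List[Tuple[str, str, str]] = []
ems = EventManagementSystem(channels={
    "email": lambda addr, msg: outbox.append(("email", addr, msg)),
    "sms": lambda addr, msg: outbox.append(("sms", addr, msg)),
})
contacts = [
    Contact("Customer A", {"email": "ops@customer-a.example", "sms": "+15550100"}),
    Contact("Customer B", {"email": "noc@customer-b.example"}),
]
count = ems.notify(contacts, "YELLOW: loss of redundancy on UPS B; no customer impact.")
print(count)
```

Keeping a log of every delivery gives the communications team a record of who was told what, which is useful later when reviewing the incident.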
Understand the Potential Risks
Before implementing a repair in the data center, the response team should evaluate the potential risks to ensure that the repair does not increase the risk to customers or create a hazardous situation for staff. The top priority in an event management scenario is maintaining or restoring uptime in a safe manner.
Prepare for Severe Weather
In cases of severe weather, the data center team needs to protect uptime by making sure that each site has the vendor support and supplies it would need in a power failure for systems such as HVAC, generators and uninterruptible power supplies. The most important precaution is to make sure the generator fuel tanks are full, since utility interruptions are likely. It is also good practice to keep customers apprised of severe weather events so that they can proactively manage their risk.
Prepare for Natural Disasters
Data centers need to be prepared for natural disasters, such as earthquakes, tsunamis, floods and hurricanes. For data centers located in known seismic regions, it is important that the building is designed to maintain operations both during and after a large seismic event. Some data centers have been structurally renovated with a base isolation system to withstand such an event. The objective of seismic isolation is to structurally isolate a building’s frame from its foundation (and the ground) so that horizontal ground motion does not cause the building to shake. In addition to the structural enhancements, it is important to ensure the team has emergency supplies in place, such as adequate water and fuel for critical operations, as well as drinking water and food on the premises to sustain the 24/7 staff for an extended period.
Perform and Share Root Cause Analysis
Human error accounts for most data center incidents. Performing a root-cause analysis on each incident can help identify where the problem originated and mitigate the risk of future problems. A detailed root-cause analysis should be shared internally within the organization and externally with impacted customers, so that the operations team can apply any lessons learned and customers can reduce their risk from future events.
Prevent Human Errors in the Data Center
A detailed change management policy is the best way to proactively prevent human errors in the data center. The policy dictates that any work that may result in a change of state must go through a detailed approval process. As part of the policy, a detailed method of procedure (MOP) must be completed, with precise instructions for staff to follow when performing maintenance or other work that may affect the data center. This step-by-step document must go through an extensive internal approval process to ensure that the instructions are accurate and comply with data center policies.
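The approval gate described above can be sketched as a simple check that blocks work until a MOP has written steps and every required sign-off. The approver roles, the `MethodOfProcedure` class and the rule that work with no change of state may proceed without sign-off are assumptions made for this example; an actual policy would define its own roles and rules.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical approver roles; a real policy defines its own list.
REQUIRED_APPROVERS = {"site_manager", "engineering", "customer_liaison"}


@dataclass
class MethodOfProcedure:
    title: str
    steps: List[str]
    changes_state: bool
    approvals: Set[str] = field(default_factory=set)


def approve(mop: MethodOfProcedure, role: str) -> None:
    """Record a sign-off, rejecting roles that are not on the approver list."""
    if role not in REQUIRED_APPROVERS:
        raise ValueError(f"{role!r} is not a recognized approver")
    mop.approvals.add(role)


def may_proceed(mop: MethodOfProcedure) -> bool:
    """Allow work only when the MOP has steps and all required sign-offs."""
    if not mop.steps:
        return False  # a MOP without written steps is incomplete
    if not mop.changes_state:
        return True  # assumption: no change of state, no approval gate
    return REQUIRED_APPROVERS <= mop.approvals


mop = MethodOfProcedure(
    title="Replace UPS battery string",
    steps=["Transfer load to bypass", "Isolate battery string", "Swap and test"],
    changes_state=True,
)
blocked = may_proceed(mop)  # no approvals yet
for role in sorted(REQUIRED_APPROVERS):
    approve(mop, role)
cleared = may_proceed(mop)  # all approvals recorded
print(blocked, cleared)
```

Encoding the gate as an explicit check means maintenance cannot start on an informal assurance that someone has looked at the procedure.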