RECAP | ROUNDTABLE | What Are You Doing to Assess Data Center Risk?

Roundtable Introduction: The Oxford dictionary defines “risk” as the chance or possibility of danger, loss, injury or other adverse consequences. This definition scares the life out of most data center and IT infrastructure operators, and therefore has become a primary driver in all aspects of running and investing in data centers and IT infrastructure. The purpose of the roundtable is to discuss what operators are doing to assess risks, and what is being done to then mitigate risks, as they pertain to management and operations, facility topology and infrastructure (i.e., uptime and resiliency), and site location.

Scott Good, Uptime Institute Senior Consultant, along with Todd Traver, VP Digital Resiliency, attended the roundtable and provided their experience and expertise in this area of risk management. Some questions to be addressed: Do you have a structure and program in place to specifically assess data center and IT infrastructure risks? Who is responsible and accountable for administering the program? How do you measure the program’s success?

Roundtable discussion highlights:

Attendees discussed how they are presently structured to assess risk. A number of attendees have pre-established data center risk assessment frameworks, and their interest was to hear about what others are doing in this regard.

The attendees discussed the following methods for assessing data center risk:
• Risk severity level assigned to operations tasks
• Data center site risk assessments conducted periodically

Operations task risks:
Assigning a risk severity level to operations tasks is something most data center operators have already implemented in their environments. Using the existing change management process, operations tasks (e.g., maintenance activities, switching and isolation of equipment) are assigned a severity level. This severity level typically determines the approval level required before the activity can be performed, when the task can be performed, and how much oversight is required. A change management document defines the severity levels, usually based on IT recommended practices. Most mature change management systems have risk severity levels already defined, which are then mapped to each activity. Governance concerns typically center on equipment maintenance and failures, as well as operations and safety practices.
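The severity-to-approval pattern described above can be sketched in a few lines. The level names, task names, and approval policy below are illustrative assumptions, not any attendee's actual change management schema:

```python
from enum import IntEnum

class Severity(IntEnum):
    """Illustrative severity scale; real change management systems
    define their own levels and names."""
    LOW = 1       # routine task, single approver
    MEDIUM = 2    # manager approval, normal hours
    HIGH = 3      # change-board approval, maintenance window
    CRITICAL = 4  # executive approval, full onsite oversight

# Hypothetical mapping of operations tasks to severity levels.
TASK_SEVERITY = {
    "filter replacement": Severity.LOW,
    "UPS battery maintenance": Severity.HIGH,
    "switchgear isolation": Severity.CRITICAL,
}

def approvals_required(task: str) -> int:
    """Approval depth grows with severity (assumed policy: one
    approver per severity level; unknown tasks default to MEDIUM)."""
    return int(TASK_SEVERITY.get(task, Severity.MEDIUM))

print(approvals_required("switchgear isolation"))  # 4
```

In a real system the mapping would live in the change management tool itself, with the severity driving scheduling windows and oversight requirements as well as approvals.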

Regarding operations task risks, the question was raised: what triggers a reassessment of risk after an event or change occurs? If an event occurs, an incident report is typically created and logged. This report acts as the tracking mechanism for corrective actions and mitigation, and is not closed until all items are addressed. If a change occurs, such as a change in operating configuration, it should be logged in the change management system so that it is recorded and tracked.
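The incident-tracking rule above (a report stays open until every corrective action is addressed) can be expressed as a minimal sketch. The class and method names are hypothetical, chosen only to illustrate the lifecycle:

```python
from dataclasses import dataclass, field

@dataclass
class IncidentReport:
    """Minimal sketch of the pattern described above: the report
    cannot close until all corrective actions are marked done."""
    description: str
    corrective_actions: dict = field(default_factory=dict)

    def add_action(self, action: str) -> None:
        # New corrective actions start out unaddressed.
        self.corrective_actions[action] = False

    def complete_action(self, action: str) -> None:
        self.corrective_actions[action] = True

    @property
    def closed(self) -> bool:
        # Closed only when actions exist and every one is addressed.
        return bool(self.corrective_actions) and all(
            self.corrective_actions.values()
        )

report = IncidentReport("UPS transfer failure during maintenance")
report.add_action("replace failed battery string")
report.add_action("update switching procedure")
report.complete_action("replace failed battery string")
print(report.closed)  # False: one action still open
```

The same gate (no closure with open items) is what real incident management tools enforce through workflow states rather than code.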

Data center site risk assessment:
In regard to overall site risk assessments, Uptime Institute's experience is that this practice is not as widely used as you might think. A couple of attendees indicated they have established their own risk assessment framework, with one using it to conduct annual assessments and the other not yet using their framework. Highlights from the discussion:
• The site risk assessment framework typically consists of established data center categories. There are components within each category, and the category and the components are measured against a standard that was created.
• One attendee indicated they created an infrastructure risk matrix to establish a risk level (from 1 to 10) based on magnitude and likelihood.
• The site risk assessment is typically conducted as a series of questions, with answers provided by the appropriate subject matter expert at each site. If the answers meet or exceed the standard, full credit is given. If the standard is not met, a lower score is given, the level of risk is established, and a plan to correct and mitigate the deficiency is developed as necessary.
• The output of the assessment is typically a scorecard that highlights areas that met expectations and areas that need improvement.
• The risk assessment is conducted at all data center sites (owned and leased) on a defined frequency. One attendee conducts assessments at all their data centers annually; although labor intensive, it is deemed important enough to do. As an alternative, one company conducts the assessment remotely each year, and then within a three-year period conducts it onsite to add validity to the process.
• The framework needs to be updated and refreshed periodically to keep it current. One attendee refreshes their risk assessment model annually.
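The two scoring mechanisms described in the bullets above (a magnitude-by-likelihood risk matrix mapped to a 1-10 level, and a scorecard giving full credit when an answer meets the standard) can be sketched as follows. The 1-5 input scales, the scaling formula, and the partial-credit rule are all assumptions for illustration, not any attendee's actual methodology:

```python
def risk_level(magnitude: int, likelihood: int) -> int:
    """Map magnitude and likelihood (each assumed 1-5) onto a
    1-10 risk level, as in the attendee's risk matrix. The linear
    scaling here is illustrative only."""
    raw = magnitude * likelihood          # 1..25
    return max(1, min(10, round(raw * 10 / 25)))

def score_category(answers: dict, standard: dict) -> float:
    """Scorecard rule from the discussion: full credit per question
    when the answer meets or exceeds the standard; here, partial
    credit otherwise (an assumed policy). Returns percent."""
    credit = sum(
        1.0 if answers.get(q, 0) >= required else answers.get(q, 0) / required
        for q, required in standard.items()
    )
    return 100.0 * credit / len(standard)

# Worst case on both axes maps to the top of the 1-10 scale.
print(risk_level(5, 5))  # 10
```

A scorecard built this way directly yields the output the attendees described: per-category scores highlighting areas that met expectations and areas needing improvement.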

The question was raised about how risks imposed by the IT stack are monitored. Attendees indicated IT risk is typically tracked separately from facility risk, often by a business continuity function. The concern was then expressed that the teams performing the IT assessment and those performing the facility assessment most likely do not interact enough. This can create a disconnect in what each assessment covers: IT thinks Facilities has an item addressed, and Facilities thinks IT has it addressed. Areas where this could cause an impact include connectivity and resiliency, computer room management practices, and security.

The conversation then shifted to the site risk assessment scorecards. Different companies have different views of what is important, which is reflected in the scorecard and its effectiveness. It was noted that there does not appear to be much industry data or knowledge on risk probability and severity; scorecard criteria appear to be based mostly on internal experience and knowledge, perhaps with assistance from outside consultants.

Overall, having a governance framework around assessing data center risk and resiliency is a vital tool to help avoid critical and embarrassing incidents and outages.
