Event Recap

RECAP | ROUNDTABLE | Fault Avoidance through Smart Infrastructure

Following a scramble to staff data centers effectively during the pandemic, many wary managers are beginning to see remote monitoring and automation systems, including those driven by AI, in a more positive light. An adoption cycle that has been slow and cautious is starting to accelerate. Scott Good, Senior Consultant for Uptime Institute, joined the roundtable, where attendees discussed what they are implementing to make their infrastructure “smart” and more fault tolerant.

What are the primary systems of concern driving more fault tolerance, and how are you addressing these systems? Is “smart” infrastructure being driven by staffing and organization decisions? Are you considering or utilizing artificial intelligence (AI) to manage your infrastructure?

Discussion:

Scott Good introduced the subject, noting that Uptime Institute started looking into fault avoidance, meaning the identification of faults before they occur, years ago. Scott wrote a paper on the subject, referenced as an attachment below.

To achieve Fault Avoidance, systems compare real-time data against known healthy operating values, which is what makes them “smart”. When values drift outside the normal range, but before an actual fault occurs, the “smart” system can autonomously transfer the critical load to the redundant systems, isolate the affected component, and signal engineering personnel to investigate. Automated detection and bypass allow the critical environment to remain stable because systems are maintained at a steady-state condition; this is how Fault Avoidance is achieved. By contrast, in a Tier IV Fault Tolerant environment, the system waits for a fault and then autonomously responds without operator intervention. It was also noted that hyperscalers typically have extensive monitoring and sensing, and that maturity is needed for fault avoidance. They collect large volumes of data, so they now need to identify where the opportunities for fault avoidance are.
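The compare-then-act loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric names, the healthy ranges, the component name "UPS-A", and the functions `out_of_range_metrics` and `plan_actions` are all hypothetical and do not come from the roundtable or from any vendor product. The sketch returns a list of intended actions rather than operating real equipment.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class HealthyRange:
    """Known healthy operating window for one metric (values are assumptions)."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def out_of_range_metrics(readings: Dict[str, float],
                         baseline: Dict[str, HealthyRange]) -> List[str]:
    """Return the names of metrics whose live reading is outside its healthy range.

    Metrics without a baseline entry are ignored, since there is nothing
    to compare them against.
    """
    return sorted(
        name for name, value in readings.items()
        if name in baseline and not baseline[name].contains(value)
    )


def plan_actions(component: str, drifting: List[str]) -> List[str]:
    """Map a pre-fault drift to the three steps from the discussion:
    transfer the load, isolate the component, signal engineering."""
    if not drifting:
        return []  # steady state: nothing to do
    return [
        f"transfer critical load from {component} to redundant system",
        f"isolate {component}",
        f"notify engineering: {component} out of range on {', '.join(drifting)}",
    ]


# Hypothetical baseline and live readings for one UPS.
baseline = {
    "output_voltage_v": HealthyRange(470.0, 490.0),
    "battery_temp_c": HealthyRange(18.0, 27.0),
}
readings = {"output_voltage_v": 481.0, "battery_temp_c": 31.5}

for action in plan_actions("UPS-A", out_of_range_metrics(readings, baseline)):
    print(action)
```

In this sketch the battery temperature has drifted out of range while the unit is still delivering normal voltage, so the actions are planned before any fault occurs. A fault tolerant design would instead react only once the unit had actually failed.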

What are the primary systems of concern?

An attendee noted that from an IT perspective, fault avoidance is about not impacting IT service. You first need to define the priority and criticality of the IT systems, and then look at where those systems are placed. You may need to move IT systems around so that, for example, they are not all in the same rack. Next, you need to get all systems, IT and facilities, talking to each other. The process is really about connecting the dots so a fault doesn’t impact the IT service.
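As a rough illustration of the placement check the attendee described, the following Python sketch flags critical services whose instances all sit in a single rack. The service names, rack identifiers, and the function `single_rack_services` are assumptions made up for this example; a real inventory would come from whatever asset or DCIM data a site already maintains.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def single_rack_services(placements: Iterable[Tuple[str, str]],
                         critical: Set[str]) -> List[str]:
    """Return critical services whose placed instances all share one rack.

    `placements` is a sequence of (service, rack) pairs, one per instance.
    Services with no placements at all are not reported.
    """
    racks_by_service: Dict[str, Set[str]] = defaultdict(set)
    for service, rack in placements:
        racks_by_service[service].add(rack)
    return sorted(
        service for service, racks in racks_by_service.items()
        if service in critical and len(racks) == 1
    )


# Hypothetical inventory: two instances of each critical service.
placements = [
    ("billing", "R1"), ("billing", "R1"),  # both instances in the same rack
    ("auth", "R1"), ("auth", "R2"),        # spread across two racks
    ("wiki", "R3"),                        # single rack, but not critical
]
print(single_rack_services(placements, critical={"billing", "auth"}))
```

Here only "billing" is reported, because a fault affecting rack R1 would take out every instance of that service.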

An attendee from a colocation provider asked what the actual benefit of fault avoidance is over fault tolerance. Why act in advance rather than wait for the event to occur? They have already made the investment in equipment redundancy coupled with fault tolerance, and these smart solutions bring additional cost and risk. They are having a hard time justifying the cost and managing the risk of implementing fault avoidance.

An attendee noted that in enterprise data centers, they still see IT deploying single-corded equipment, which tends to negate fault avoidance efforts. Scott Good stated that this requires Facilities and IT to form more of a partnership to eliminate single-corded equipment.

The fault avoidance benefits mentioned were identifying failures before they occur, reducing maintenance costs (through preventive, predictive and condition-based maintenance), and better estimates of equipment end-of-life. It allows you to make data-driven decisions on equipment and replacement.

Typical fault avoidance installations:
• Rack and equipment sensors
• Point-of-use static transfer switches (STSs) for single-corded equipment
• Power and cooling sensors, to understand what is going on and what the steady state is
• Monitoring of IT workloads, to understand utilization and location

An attendee indicated that his work involves tying IT hardware to power and cooling data within a site and across sites, essentially connecting IT workloads. This is accomplished by having sensors throughout the racks and the infrastructure. He is taking a proactive approach to help teams make intelligent decisions and maximize resiliency.

Is there a correlation between smart infrastructure and staffing?

An attendee who works for a colocation provider indicated they already have minimal onsite staffing. He indicated that smart infrastructure, coupled with condition-based maintenance, will reduce contracted vendor maintenance and therefore lower data center operations costs.
Another attendee indicated smart infrastructure is causing a slight increase in staff (3-4 engineers) as they develop and build the team to support the data analysis.
