Event Recap

RECAP | ROUNDTABLE | Implementation of Multi-Site Resiliency

Digital infrastructure resiliency has been at the forefront, driven by the pandemic, climate change, and IT’s desire for 100% availability regardless of the circumstances. Multi-site resiliency is a strategy, originally adopted largely by hyperscalers, that can provide this availability, but implementation is not easy.

Todd Traver, VP IT Strategy & Optimization for Uptime Institute, joined the roundtable to discuss with attendees what is required to effectively implement a multi-site resiliency approach. The following questions were provided to get the conversation started.

• Is multi-site resiliency utilized in your company?
• How is it implemented (enterprise+colo+cloud)?
• Who leads multi-site resiliency design effort?
• Who participates in multi-site resiliency decision making?
• What factors are considered?
• In your experience, does facilities reliability or application design play the biggest role in service availability?

Attendees indicated that availability and resiliency are always key topics. Most attendees indicated they have hybrid IT environments (on-prem, colocation, and cloud). One attendee said resiliency is an important consideration when reconciling a data center portfolio and sees two areas of potential benefit:
• Application resiliency as the end game
• Potentially allows you to relax redundancy and resiliency in the infrastructure, which would reduce capital outlay

Todd reviewed a slide (see the attachment) that displayed outage causes and where workloads are located. He agreed that digital resiliency is a service that can provide end-to-end uptime, and noted there needs to be more focus on managing the entire IT stack to deliver that true end-to-end uptime service.

The latest Uptime Institute data center industry survey shows 61% of respondents feel that spreading workloads across multiple sites has made their IT more resilient. Software is where resiliency needs to reside, creating less dependency on brick-and-mortar infrastructure. Looking at outage data, one-third of outages are caused by software-related items. If applications and networks are properly developed and deployed, there should be no dependencies on infrastructure. However, power and cooling still represent almost 50% of outages, so power and cooling resiliency remains very important.

For digital resiliency to be implemented effectively, the following challenges need to be understood.
• There needs to be a close working relationship between the IT architects and engineers and the people who manage the data center portfolio. Data center teams typically know where the data center resiliency and reliability issues are.
• At least two data centers need to be assessed at once to determine what common risks exist. Common constraints need to be assessed (e.g., network, power, cooling, design perspectives); see the sketch after this list.
• Look at resiliency from the end-user’s perspective. For the end-user, it is all about whether the service is available or not.
• Define where each application is located and who owns the application.
• Know the impact on each application if one of the constraints has an issue.
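
These mapping and common-risk challenges lend themselves to a simple data model. Below is a minimal sketch in Python; the site names, constraints, and application are hypothetical examples for illustration, not data from the session:

```python
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    # Constraints this site depends on (network carrier, power feed, cooling design, etc.)
    constraints: set[str] = field(default_factory=set)

@dataclass
class Application:
    name: str
    owner: str          # who owns the application
    sites: list[Site]   # where each instance of the application runs

def common_risks(app: Application) -> set[str]:
    """Constraints shared by every site hosting the app: a failure there affects all copies."""
    if not app.sites:
        return set()
    shared = set(app.sites[0].constraints)
    for site in app.sites[1:]:
        shared &= site.constraints
    return shared

# Hypothetical example: two data centers that happen to share a network carrier.
dc_east = Site("dc-east", {"carrier-A", "utility-east", "chiller-design-v1"})
dc_west = Site("dc-west", {"carrier-A", "utility-west", "chiller-design-v2"})
billing = Application("billing", owner="payments-team", sites=[dc_east, dc_west])

print(common_risks(billing))  # {'carrier-A'} -> one carrier dependency spans both sites
```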

An attendee indicated they have multiple sites and they classify each application. They distribute applications across cloud and different Tiers of data centers, and they are also in the process of abstracting the infrastructure away so application owners do not know or care where an application is located.
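
One way to read that abstraction, sketched in Python under assumed names (the tier labels and site lists are illustrative, not the attendee’s actual scheme): owners declare a resiliency classification, and a placement layer, not the owner, decides which sites host the workload.

```python
# Hypothetical placement policy: owners declare a tier, the platform picks sites.
PLACEMENT_POLICY = {
    "mission-critical": ["dc-east", "dc-west", "cloud-region-1"],  # three independent sites
    "business-important": ["dc-east", "cloud-region-1"],
    "best-effort": ["cloud-region-1"],
}

def place(app_name: str, tier: str) -> list[str]:
    """Return hosting sites for an app; the owner never names a site directly."""
    sites = PLACEMENT_POLICY.get(tier)
    if sites is None:
        raise ValueError(f"unknown classification tier: {tier}")
    return sites

print(place("billing", "mission-critical"))  # ['dc-east', 'dc-west', 'cloud-region-1']
```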

Another attendee commented that workload optimization is key to managing flow by geography and time zone, and sees optimization and resiliency dovetailing. A prerequisite to moving a workload to the public cloud is application resiliency.

Another attendee stated that new applications are, for the most part, being built and implemented resiliently. They have an application that needs to pull data between two cloud providers, plus from on-prem, which led them to conclude that on-prem and the network could be the weakest links. The attendee thinks networks are currently their most pervasive cause of downtime.

The session then shifted to who leads the multi-site resiliency design effort. Todd stated there needs to be someone, probably in the IT stack, who is looking at the application end-to-end. There needs to be someone with a senior title, ideally overseeing both IT and facilities, who owns the end-to-end resiliency of the application. Most of the outages Uptime Institute sees indicate people are still siloed to some degree between applications and the data center.

Several attendees indicated their organizations have no single owner of resiliency. One attendee stated business units work collegially but not always in the same way, and that central IT ends up taking the brunt of the ownership. Another attendee said they have many teams and their focus is to give developers full visibility all the way through the infrastructure.

Also, a big part of understanding your resiliency is how you test it. Real-time resiliency testing needs to be conducted. Todd stated you really need an active testing program (e.g., pull the plug, fail a network link). Attendees concurred that testing is a challenge. One attendee indicated their disaster recovery team conducts modeling to test applications virtually. Another indicated they have not gotten to application testing, but they are presently using application mapping to demonstrate resiliency.
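
As an illustration only, here is a minimal sketch in Python of what one "fail a network link" drill might look like. The health-check URL, interface name, and use of the Linux ip command are assumptions, standing in for whatever hooks and tooling an organization actually has:

```python
import subprocess
import urllib.request

SERVICE_URL = "https://service.example.com/health"  # hypothetical health endpoint

def service_is_up(url: str = SERVICE_URL, timeout: float = 5.0) -> bool:
    """The end-user's view of resiliency: is the service answering?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def fail_network_link(interface: str) -> None:
    """Deliberately fail one network link (needs root; run only in a planned test window)."""
    subprocess.run(["ip", "link", "set", interface, "down"], check=True)

def restore_network_link(interface: str) -> None:
    subprocess.run(["ip", "link", "set", interface, "up"], check=True)

# A 'pull the plug' drill: the service should stay up while one link is down.
assert service_is_up(), "service unhealthy before the test even starts"
fail_network_link("eth1")  # hypothetical redundant link
try:
    assert service_is_up(), "service did not survive losing one network link"
finally:
    restore_network_link("eth1")
```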

Todd then summarized the session: with all the hybrid IT, getting your arms around all the pieces that make up end-to-end service delivery is the biggest challenge. How does the service function as a complex system, handle individual failures, and avoid impacting the end-user? What Uptime Institute is seeing across the board is that everyone seems to be thinking about this concept of digital resiliency. End-to-end digital resiliency is starting to be pursued by organizations, but it is not yet a mature discipline. When outages occur, executives are for the most part blindsided. The industry needs to be proactive by having a structure in the company to address how applications truly function end-to-end, testing included. And lastly, there needs to be a defined owner for application resiliency.
