
Organizations should reassess redundancy
However, he pointed out, “the deeper concern is that CME had a secondary data center ready to take the load, yet the failover threshold was set too high, and the activation sequence remained manually gated. The decision to wait for the cooling issue to self-correct rather than trigger the backup site immediately revealed a governance model that had not evolved to keep pace with the operational tempo of modern markets.”
Thermal failures, he said, “do not unfold on the timelines assumed in traditional disaster recovery playbooks. They escalate within minutes and demand automated responses that do not depend on human certainty about whether a facility will recover in time.”
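To make the point about automated responses concrete, the following is a minimal sketch, not a description of CME's or CyrusOne's actual systems. It assumes a hypothetical feed of rack inlet temperatures and triggers failover when the current rate of rise would reach a limit within a set lead time. The names `ThermalReading` and `should_fail_over`, the 32 °C limit, and the 10-minute lead time are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ThermalReading:
    minute: float        # minutes since monitoring began (hypothetical feed)
    inlet_temp_c: float  # rack inlet temperature in degrees Celsius


def should_fail_over(
    readings: Sequence[ThermalReading],
    limit_c: float = 32.0,        # assumed safe inlet limit, illustrative only
    lead_time_min: float = 10.0,  # assumed time needed to activate the backup site
) -> bool:
    """Return True if the limit is already reached, or if the trend between
    the last two readings would reach it within lead_time_min minutes."""
    if not readings:
        return False
    last = readings[-1]
    if last.inlet_temp_c >= limit_c:
        return True
    if len(readings) < 2:
        return False
    prev = readings[-2]
    elapsed = last.minute - prev.minute
    if elapsed <= 0:
        return False  # out-of-order or duplicate samples: no usable trend
    rate = (last.inlet_temp_c - prev.inlet_temp_c) / elapsed  # degrees C per minute
    if rate <= 0:
        return False  # stable or cooling
    minutes_to_limit = (limit_c - last.inlet_temp_c) / rate
    return minutes_to_limit <= lead_time_min


# A fast rise (1.5 degrees per minute) trips the trigger well before the limit.
rising = [ThermalReading(0, 24.0), ThermalReading(1, 25.5)]
# A slow drift (0.1 degrees per minute) does not.
drifting = [ThermalReading(0, 24.0), ThermalReading(1, 24.1)]
print(should_fail_over(rising), should_fail_over(drifting))
```

The design choice worth noting is that the decision depends on projected time to the limit rather than on anyone's judgment about whether the facility will recover, which is the dependency the quote argues against.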
Matt Kimball, VP and principal analyst at Moor Insights & Strategy, said that, to some degree, what happened in Aurora highlights an issue that arises on occasion: “the communications gap that can exist between IT executives and data center operators. Think of ‘rack in versus rack out’ mindsets.”
Often, he said, the operational elements of that data center environment, such as cooling, power, fire hazards, physical security, and so forth, fall outside the realm of an IT executive focused on delivering IT services to the business. “And even if they don’t fall outside the realm, these elements are certainly not a primary focus,” he noted. “This was certainly true when I was living in the IT world.”
Additionally, said Kimball, “this highlights the need for organizations to reassess redundancy and resilience in a new light. Again, in IT, we tend to focus on resilience and redundancy at the app, server, and workload layers. Maybe even cluster level. But as we continue to place more and more of a premium on data, and the terms ‘business critical’ or ‘mission critical’ have real relevance, we have to zoom out and look more at the infrastructure level.”
A lesson in risk management
Data center management tools such as Siemens DCIM, he said, can capture a great deal of telemetry from the equipment that supplies power and cooling to racks and servers. “[There’s] deep down telemetry with some machine learning to predict failures before they happen. So, that chiller [failure] in the CyrusOne data center could have and should have been anticipated. Further, redundant equipment should be in operation to enable failover.”
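As a rough illustration of how telemetry can flag trouble early, the sketch below uses a simple statistical baseline check. It is a stand-in, not machine learning, and it is not how Siemens DCIM or any specific product works. The function name `drift_alerts`, the sample values, and the thresholds are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def drift_alerts(
    samples: Sequence[float],
    baseline_size: int = 5,  # assumed number of known-good samples
    z_limit: float = 3.0,    # assumed alert threshold in standard deviations
) -> List[int]:
    """Return the indices of samples that deviate from the baseline mean
    by more than z_limit baseline standard deviations."""
    if baseline_size < 2 or len(samples) <= baseline_size:
        return []
    baseline = samples[:baseline_size]
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [
        i
        for i in range(baseline_size, len(samples))
        if abs(samples[i] - mu) > z_limit * sigma
    ]


# Hypothetical chilled-water supply temperatures in degrees Celsius.
supply_temps = [6.9, 7.0, 7.1, 7.0, 7.0, 7.1, 7.4, 8.2]
print(drift_alerts(supply_temps))  # indices of the samples that drifted
```

In this made-up series, the last two samples are flagged while the temperature is still close to normal, which is the kind of early signal Kimball describes.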





















