
UPS Failures Are Still Taking Down the Cloud—Here’s What We’ve Learned

April 17, 2025

Updated on May 5th, 2025

On March 20, 2025, a UPS system failure triggered a regional outage for Google Cloud, once again spotlighting a persistent and critical weakness in data center infrastructure. Despite advances in redundancy and monitoring, uninterruptible power supply (UPS) systems remain a leading source of downtime—particularly in highly available, hyperscale environments.


The incident, as reported by Data Center Dynamics, involved a control system fault that prevented the UPS from transferring to battery during a utility failure, ultimately impacting customer workloads in the us-east5 region. Google’s postmortem noted that alarms were generated but were not escalated or acted upon in time, a reminder that power chain reliability depends on both the equipment and the human response to it.
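To make the escalation gap concrete, here is a minimal Python sketch of a rule that flags alarms nobody has acknowledged within a set window. It is an illustration only: the `Alarm` type, the field names, and the five-minute default are assumptions made for this example, not details from Google’s postmortem or from any MCIM product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Alarm:
    """A hypothetical alarm record; field names are illustrative."""
    source: str                               # e.g. an equipment tag
    raised_at: datetime                       # when the alarm fired
    acknowledged_at: Optional[datetime] = None  # None until someone responds


def needs_escalation(alarm: Alarm, now: datetime,
                     ack_window: timedelta = timedelta(minutes=5)) -> bool:
    """True if the alarm is still unacknowledged after ack_window.

    The five-minute default is an assumed value, not a recommendation.
    """
    if alarm.acknowledged_at is not None:
        return False
    return now - alarm.raised_at > ack_window


def alarms_to_escalate(alarms: List[Alarm], now: datetime,
                       ack_window: timedelta = timedelta(minutes=5)) -> List[Alarm]:
    """Return the subset of alarms that should be pushed up the chain."""
    return [a for a in alarms if needs_escalation(a, now, ack_window)]
```

Run on a schedule against the open alarm list, a check like this turns "an alarm was generated" into "someone was told the alarm is still open." The right window and escalation path depend on the site’s own procedures.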

MCIM’s Static UPS System Reliability Benchmarking Report—one of the most comprehensive in the industry—provides context for how widespread this issue really is. The report analyzed over 18,000 static UPS units across 2,000+ facilities and found:

  • 18.1% of all UPS failures are due to control system faults, the same root cause as the Google incident.
  • A full 40.7% of UPS-related outages resulted in at least partial loss of power to critical loads.
  • Despite the mission-critical nature of UPS equipment, only 21% of organizations track component-level reliability metrics that could help predict failures before they occur.

These findings underscore a troubling gap in both visibility and proactive maintenance.
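For teams that do not yet track component-level metrics, the two headline figures above (share of failures by root cause, and share of failures that dropped critical load) reduce to simple counts over a failure log. The sketch below assumes a hypothetical `FailureRecord` structure with illustrative component labels. It does not reproduce the report’s methodology or data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FailureRecord:
    """One logged UPS failure; the fields are assumptions for illustration."""
    unit_id: str
    component: str   # root-cause component, e.g. "control_system", "battery"
    load_lost: bool  # True if critical load lost power, even partially


def failure_share_by_component(records: List[FailureRecord]) -> Dict[str, float]:
    """Fraction of all failures attributed to each component."""
    counts = Counter(r.component for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {component: n / total for component, n in counts.items()}


def load_loss_rate(records: List[FailureRecord]) -> float:
    """Fraction of failures that resulted in loss of power to critical load."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.load_lost) / len(records)
```

Tracked over time and split by model or site, the same counts show whether a given component is failing more often than it does across the rest of the fleet.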

The MCIM platform helps operators close this gap. By digitizing UPS and power system monitoring, maintenance, and incident workflows, MCIM enables:

  • Early warning detection via structured incident tracking and real-time equipment performance analytics.
  • Benchmarking and trend analysis across a client’s entire asset base to identify higher-risk models, configurations, or maintenance practices.
  • Data-driven decision-making for maintenance scheduling, capital planning, and component replacement—reducing both the risk and impact of UPS failure.

The cloud doesn’t fail often—but when it does, the cost is enormous. As Google Cloud’s outage shows, even the most sophisticated providers aren’t immune. The industry must move from reactive incident response to predictive infrastructure management.

With MCIM, the data center industry has the opportunity to learn from past failures—before they happen again.

To learn more or to schedule a demo, please visit: www.mcim24x7.com
