At 18:31 CEST, the on-call engineer was paged due to rapidly increasing temperatures across Rack Room A. Upon receiving the alerts, the engineer escalated the incident internally at 18:33 CEST and, at the same time, departed for the datacenter to respond on-site.
Following the escalation, additional team members joined the incident response. As an immediate mitigation measure, backup air conditioning units were activated at 18:34 CEST. Due to the elevated ambient temperatures, the firewall providing remote access to the datacenter infrastructure shut down automatically.
At 18:39 CEST, the on-call engineer arrived on-site and verified the availability of electrical power and the chilled water generation system. Both systems were operating normally and showed no indications of failure.
At 18:43 CEST, the on-call engineer identified an electrical fault affecting the cold-air circulation within the cold aisle of Rack Room A. Power to the free-cooling infrastructure was restored immediately. In addition, the chilled water supply temperature was reduced to increase cooling capacity and accelerate the recovery of room temperatures.
We will use the findings from this incident to further strengthen the resilience of our infrastructure. The lessons learned will be incorporated into our engineering, monitoring, and operational procedures to reduce the likelihood and impact of similar incidents in the future.