Datacenters play a vital role in today's society. In general, a datacenter room is a complex controlled environment composed of thousands of computing nodes, which consume kilowatts of power. To dissipate this power, forced air or liquid flow is employed, at a cost of millions of euros per year. Reducing this cost involves using free cooling and average-case design, which can create cooling shortages and thermal hazards. When a thermal hazard occurs, the system administrators and the facility manager must stop production to avoid damage to and wear-out of the IT equipment. In this paper, we study the signatures of thermal hazards in the monitored data of a Tier-0 datacenter room over a full year of production. We define a set of rules for detecting thermal hazards based on the inlet and outlet temperatures of all nodes in the room. We then propose a custom Temporal Convolutional Network (TCN) to predict the hazards in advance. The results show that our TCN can predict thermal hazards with an F1-score of 0.98 on a randomly sampled test set. When causality is enforced between the training and validation sets, the F1-score drops to 0.74, which calls for in-place online retraining of the network and motivates further research in this context.
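The gap between the two reported F1-scores comes from how the data are split. As a minimal sketch, and not the authors' actual procedure, the Python below contrasts a random split with a causal (chronological) split over time-ordered sample indices and shows how an F1-score is computed from binary hazard labels. The function names and the 20% hold-out fraction are illustrative assumptions.

```python
import random


def f1_score(y_true, y_pred):
    """F1-score for binary labels (1 = thermal hazard, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def random_split(n_samples, test_fraction=0.2, seed=0):
    """Random split: held-out samples are interleaved in time with training ones."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def causal_split(n_samples, test_fraction=0.2):
    """Causal split: every held-out sample is later in time than all training samples."""
    n_test = int(n_samples * test_fraction)
    cut = n_samples - n_test
    return list(range(cut)), list(range(cut, n_samples))


if __name__ == "__main__":
    # Indices are assumed to be in chronological order (hypothetical data).
    train_r, test_r = random_split(10)
    train_c, test_c = causal_split(10)
    print("random split:", train_r, test_r)
    print("causal split:", train_c, test_c)
    print("example F1:", f1_score([1, 1, 0, 0], [1, 0, 0, 1]))
```

With a random split, the model sees samples recorded both before and after each held-out sample, so it is scored on conditions close to those it was trained on. A causal split scores it only on data from after the training period, which is closer to how a deployed predictor would be used.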
Prediction of Thermal Hazards in a Real Datacenter Room Using Temporal Convolutional Networks / Seyedkazemi Ardebili M.; Zanghieri M.; Burrello A.; Beneventi F.; Acquaviva A.; Benini L.; Bartolini A. - ELECTRONIC. - 2021-:(2021), pp. 9474116.1256-9474116.1259. (Paper presented at the 2021 Design, Automation and Test in Europe Conference and Exhibition, DATE 2021, held in Europe - Virtual in 2021) [10.23919/DATE51398.2021.9474116].
Prediction of Thermal Hazards in a Real Datacenter Room Using Temporal Convolutional Networks
Seyedkazemi Ardebili M.;Zanghieri M.;Burrello A.;Beneventi F.;Acquaviva A.;Benini L.;Bartolini A.
2021