Seyedkazemi Ardebili, M., Acquaviva, A., Benini, L., Bartolini, A. (2026). Elevating Datacenter Resilience with ThermADNet: A Thermal Anomaly Detection System. FUTURE GENERATION COMPUTER SYSTEMS, 179, 179-19 [10.1016/j.future.2025.108311].

Elevating Datacenter Resilience with ThermADNet: A Thermal Anomaly Detection System

Mohsen Seyedkazemi Ardebili; Andrea Acquaviva; Luca Benini; Andrea Bartolini
2026

Abstract

In the era of digital transformation, datacenters and High Performance Computing (HPC) systems have become the backbone of global technology infrastructure, powering essential services across industries such as finance and healthcare. Ensuring their uninterrupted service has therefore become a critical challenge. Thermal anomalies pose a significant risk to datacenter operation, potentially leading to hardware deterioration, system downtime, and catastrophic failures. This threat is exacerbated by the growing number of datacenters, increasing power density, and heat waves fostered by global warming.

Detecting thermal anomalies in datacenters involves several challenges. Large-scale data collection is difficult because it requires diverse monitoring signals from thousands of nodes over long periods. The absence of labeled data complicates the identification of normal and abnormal states. Establishing classification thresholds that minimize false positives and false negatives is another significant hurdle. Traditional statistical methods often fail to capture temporal dependencies and complex correlations in monitoring signals. Finding anomalies at both the system and subsystem levels adds further complexity. Finally, deploying machine learning models in production environments presents technical and operational challenges, making real-time anomaly detection a demanding task.

This paper introduces ThermADNet, a Thermal Anomaly Detection framework that combines statistical rule-based methods with Deep Neural Network (DNN) techniques for thermal anomaly detection in datacenters. ThermADNet uses a semi-supervised learning approach, training on a "semi-normal" dataset, and addresses the challenges of large-scale data collection, semi-normal dataset identification, and classification threshold establishment. The framework is validated by its success in identifying real physical thermal failure events within a Tier-0 datacenter, pinpointing anomalies at both the system and subsystem levels, including compute nodes and datacenter infrastructure. In the critical evaluation window covering the July 28 failure, ThermADNet achieves precision and recall of up to 0.97, with F1-scores as high as 0.97.

By providing detailed information about anomalies, the framework clarifies the characteristics and reasoning behind the DNN outputs, which builds trust in the AI model and helps users understand and rely on the system's decisions. ThermADNet thereby contributes to datacenter reliability and efficiency and supports the uninterrupted operation of critical HPC systems, helping avert considerable economic and societal losses.
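
The abstract mentions training on a "semi-normal" dataset, establishing a classification threshold, and reporting precision, recall, and F1-score, without giving details on this page. The Python below is a minimal illustrative sketch of those three ideas only, and it is not the ThermADNet implementation. It assumes anomaly scores are already available (for example, reconstruction errors from some model), picks the threshold as a high quantile of the scores on mostly-normal data, and computes the metrics. The function names, the quantile rule, and the toy numbers are all assumptions made for illustration.

```python
import math


def fit_threshold(semi_normal_scores, q=0.9):
    """Hypothetical rule: use the q-quantile (nearest rank) of anomaly
    scores from a mostly-normal dataset as the classification threshold."""
    ordered = sorted(semi_normal_scores)
    rank = math.ceil(q * len(ordered))            # 1-based nearest rank
    index = min(len(ordered) - 1, max(0, rank - 1))
    return ordered[index]


def classify(scores, threshold):
    """Flag a sample as anomalous when its score exceeds the threshold."""
    return [score > threshold for score in scores]


def precision_recall_f1(predicted, actual):
    """Standard definitions computed from boolean predictions and labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy "semi-normal" scores: mostly low, with one contaminating outlier.
    semi_normal = [0.10, 0.20, 0.15, 0.12, 0.18, 0.11, 0.14, 0.16, 0.13, 0.90]
    threshold = fit_threshold(semi_normal, q=0.9)

    # Toy evaluation window with made-up ground-truth labels.
    test_scores = [0.12, 0.95, 0.19, 0.85, 0.25, 0.10]
    labels = [False, True, False, True, False, False]

    flags = classify(test_scores, threshold)
    print(threshold, flags, precision_recall_f1(flags, labels))
```

Taking a quantile rather than the maximum score keeps a few contaminated samples in the "semi-normal" set from inflating the threshold. How the paper itself sets its threshold is not stated on this page.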
Files in this item:

File: 1-s2.0-S0167739X25006053-main.pdf
Access: open access
Type: Publisher's version (PDF) / Version of Record
License: Open access license, Creative Commons Attribution (CC BY)
Size: 13.15 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1037985