Seyedkazemi Ardebili, M., Acquaviva, A., Benini, L., Bartolini, A. (2026). Elevating Datacenter Resilience with ThermADNet: A Thermal Anomaly Detection System. FUTURE GENERATION COMPUTER SYSTEMS, 179, 179-19 [10.1016/j.future.2025.108311].
Elevating Datacenter Resilience with ThermADNet: A Thermal Anomaly Detection System
Mohsen Seyedkazemi Ardebili; Andrea Acquaviva; Luca Benini; Andrea Bartolini
2026
Abstract
In the era of digital transformation, datacenters and High-Performance Computing (HPC) systems have emerged as the backbone of global technology infrastructure, powering essential services across industries such as finance and healthcare. Ensuring their uninterrupted service has therefore become a critical challenge. Thermal anomalies pose a significant risk to datacenter operation, potentially leading to hardware deterioration, system downtime, and catastrophic failures. This threat is exacerbated by the growing number of datacenters, increased power density, and heat waves fostered by global warming. Detecting thermal anomalies in datacenters involves several challenges. Large-scale data collection is difficult, requiring diverse monitoring signals from thousands of nodes over long periods. The absence of labeled data complicates the identification of normal and abnormal states. Establishing accurate classification thresholds to minimize false positives and negatives is another significant hurdle. Traditional statistical methods often fail to capture temporal dependencies and complex correlations in monitoring signals. Additionally, finding anomalies at both the system and subsystem levels adds to the complexity. Deploying machine learning models in production environments presents further technical and operational challenges, making real-time anomaly detection a demanding task. This paper introduces ThermADNet, a Thermal Anomaly Detection framework that combines statistical rule-based methods with Deep Neural Network (DNN) techniques. ThermADNet uses a semi-supervised learning approach, training on a "semi-normal" dataset, and addresses the challenges of large-scale data collection, semi-normal dataset identification, and classification threshold establishment.
The framework's efficacy is validated by its success in identifying real physical thermal failure events within a Tier-0 datacenter, pinpointing anomalies at both the system and subsystem levels, including compute nodes and datacenter infrastructure. In the critical evaluation window covering the July 28 failure, ThermADNet achieves precision and recall of up to 0.97, with F1-scores as high as 0.97. By providing detailed information about anomalies, the framework clarifies the characteristics and reasoning behind the DNN outputs, building trust in the AI model and helping users understand and rely on the system's decisions. ThermADNet thereby contributes to datacenter reliability and efficiency, supporting the uninterrupted operation of critical HPC systems and averting considerable economic and societal losses.

| File | Size | Format |
|---|---|---|
| 1-s2.0-S0167739X25006053-main.pdf (open access; type: publisher's version (PDF) / Version of Record; license: Open Access License, Creative Commons Attribution (CC BY)) | 13.15 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


