Today, significant advances in science and technology can not be envisioned without high computing capacity. To solve large problems in science, engineering, and business, data centers provide High-Performance Computing (HPC) systems with aggregation of the computing capacity of thousand of computing nodes with the cost of millions of euros per year [12]. In the datacenter, an anomaly is a suspicious/abnormal pattern in the monitoring signals. The severity of the anomaly can be different, and in extreme conditions, it can yield the outage of the datacenter. By defining complex statistical rules-based anomaly detection methods, this paper investigates the thermal anomaly detection task in one of the most powerful HPC systems in the world, namely Marconi100 hosted at CINECA. The suggested anomaly detection method is successfully validated against real thermal hazard events reported for the studied HPC cluster while in production.

Ardebili, M.S., Bartolini, A., Acquaviva, A., Benini, L. (2022). Rule-Based Thermal Anomaly Detection for Tier-0 HPC Systems. GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND : SPRINGER INTERNATIONAL PUBLISHING AG [10.1007/978-3-031-23220-6_18].

Rule-Based Thermal Anomaly Detection for Tier-0 HPC Systems

Ardebili, Mohsen Seyedkazemi;Bartolini, Andrea;Acquaviva, Andrea;Benini, Luca
2022

Abstract

Today, significant advances in science and technology can not be envisioned without high computing capacity. To solve large problems in science, engineering, and business, data centers provide High-Performance Computing (HPC) systems with aggregation of the computing capacity of thousand of computing nodes with the cost of millions of euros per year [12]. In the datacenter, an anomaly is a suspicious/abnormal pattern in the monitoring signals. The severity of the anomaly can be different, and in extreme conditions, it can yield the outage of the datacenter. By defining complex statistical rules-based anomaly detection methods, this paper investigates the thermal anomaly detection task in one of the most powerful HPC systems in the world, namely Marconi100 hosted at CINECA. The suggested anomaly detection method is successfully validated against real thermal hazard events reported for the studied HPC cluster while in production.
2022
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
262
276
Ardebili, M.S., Bartolini, A., Acquaviva, A., Benini, L. (2022). Rule-Based Thermal Anomaly Detection for Tier-0 HPC Systems. GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND : SPRINGER INTERNATIONAL PUBLISHING AG [10.1007/978-3-031-23220-6_18].
Ardebili, Mohsen Seyedkazemi; Bartolini, Andrea; Acquaviva, Andrea; Benini, Luca
File in questo prodotto:
Eventuali allegati, non sono esposti

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11585/969028
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 1
  • ???jsp.display-item.citation.isi??? 1
social impact