Today, significant advances in science and technology can not be envisioned without high computing capacity. To solve large problems in science, engineering, and business, data centers provide High-Performance Computing (HPC) systems with aggregation of the computing capacity of thousand of computing nodes with the cost of millions of euros per year [12]. In the datacenter, an anomaly is a suspicious/abnormal pattern in the monitoring signals. The severity of the anomaly can be different, and in extreme conditions, it can yield the outage of the datacenter. By defining complex statistical rules-based anomaly detection methods, this paper investigates the thermal anomaly detection task in one of the most powerful HPC systems in the world, namely Marconi100 hosted at CINECA. The suggested anomaly detection method is successfully validated against real thermal hazard events reported for the studied HPC cluster while in production.
Ardebili, M.S., Bartolini, A., Acquaviva, A., Benini, L. (2022). Rule-Based Thermal Anomaly Detection for Tier-0 HPC Systems. GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND : SPRINGER INTERNATIONAL PUBLISHING AG [10.1007/978-3-031-23220-6_18].
Rule-Based Thermal Anomaly Detection for Tier-0 HPC Systems
Ardebili, Mohsen Seyedkazemi;Bartolini, Andrea;Acquaviva, Andrea;Benini, Luca
2022
Abstract
Today, significant advances in science and technology can not be envisioned without high computing capacity. To solve large problems in science, engineering, and business, data centers provide High-Performance Computing (HPC) systems with aggregation of the computing capacity of thousand of computing nodes with the cost of millions of euros per year [12]. In the datacenter, an anomaly is a suspicious/abnormal pattern in the monitoring signals. The severity of the anomaly can be different, and in extreme conditions, it can yield the outage of the datacenter. By defining complex statistical rules-based anomaly detection methods, this paper investigates the thermal anomaly detection task in one of the most powerful HPC systems in the world, namely Marconi100 hosted at CINECA. The suggested anomaly detection method is successfully validated against real thermal hazard events reported for the studied HPC cluster while in production.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.