The main limitation of applying predictive tools to large-scale supercomputers is the complexity of deploying Artificial Intelligence (AI) services in production and modeling heterogeneous data sources while preserving topological information in compact models. This paper proposes GRAAFE, a framework for continuously predicting compute node failures in the Marconi100 supercomputer. The framework consists of (i) an anomaly prediction model based on graph neural networks (GNNs) that leverage nodes’ physical layout in the compute room and (ii) the computationally efficient integration into the Marconi100’s ExaMon holistic monitoring system with Kubeflow, an MLOps Kubernetes framework which enables continuous deployment of AI pipelines. The GRAAFE GNN model achieves an area under the curve (AUC) from 0.91 to 0.78, surpassing state-of-the-art (SoA), achieving AUC between 0.64 and 0.5. GRAAFE sustains the anomaly prediction for all the Marconi100 nodes every 120s, requiring an additional 30% CPU resources and less than 5% more RAM w.r.t. monitoring only.
Molan M., Mohsen Seyedkazemi Ardebili, Khan J.A., Beneventi F., Cesarini D., Borghesi A., et al. (2024). GRAAFE: GRaph anomaly anticipation framework for exascale HPC systems. FUTURE GENERATION COMPUTER SYSTEMS, 160, 644-653 [10.1016/j.future.2024.06.032].
GRAAFE: GRaph anomaly anticipation framework for exascale HPC systems
Molan M.
;Mohsen Seyedkazemi Ardebili;Khan J. A.;Beneventi F.;Borghesi A.;Bartolini A.
2024
Abstract
The main limitation of applying predictive tools to large-scale supercomputers is the complexity of deploying Artificial Intelligence (AI) services in production and modeling heterogeneous data sources while preserving topological information in compact models. This paper proposes GRAAFE, a framework for continuously predicting compute node failures in the Marconi100 supercomputer. The framework consists of (i) an anomaly prediction model based on graph neural networks (GNNs) that leverage nodes’ physical layout in the compute room and (ii) the computationally efficient integration into the Marconi100’s ExaMon holistic monitoring system with Kubeflow, an MLOps Kubernetes framework which enables continuous deployment of AI pipelines. The GRAAFE GNN model achieves an area under the curve (AUC) from 0.91 to 0.78, surpassing state-of-the-art (SoA), achieving AUC between 0.64 and 0.5. GRAAFE sustains the anomaly prediction for all the Marconi100 nodes every 120s, requiring an additional 30% CPU resources and less than 5% more RAM w.r.t. monitoring only.File | Dimensione | Formato | |
---|---|---|---|
GRAAFE.pdf
embargo fino al 26/06/2025
Tipo:
Postprint
Licenza:
Licenza per Accesso Aperto. Creative Commons Attribuzione - Non commerciale - Non opere derivate (CCBYNCND)
Dimensione
1.17 MB
Formato
Adobe PDF
|
1.17 MB | Adobe PDF | Visualizza/Apri Contatta l'autore |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.