The main limitation of applying predictive tools to large-scale supercomputers is the complexity of deploying Artificial Intelligence (AI) services in production and of modeling heterogeneous data sources while preserving topological information in compact models. This paper proposes GRAAFE, a framework for continuously predicting compute node failures in the Marconi100 supercomputer. The framework consists of (i) an anomaly prediction model based on graph neural networks (GNNs) that leverage the nodes' physical layout in the compute room and (ii) a computationally efficient integration into Marconi100's ExaMon holistic monitoring system using Kubeflow, an MLOps framework for Kubernetes that enables continuous deployment of AI pipelines. The GRAAFE GNN model achieves an area under the curve (AUC) between 0.91 and 0.78, surpassing the state of the art (SoA), which achieves an AUC between 0.64 and 0.5. GRAAFE sustains anomaly prediction for all Marconi100 nodes every 120 s, requiring an additional 30% of CPU resources and less than 5% more RAM compared with monitoring alone.
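
The abstract states that the GNN leverages the nodes' physical layout but gives no architectural detail. As a rough illustration of what a layout-aware model can exploit, the Python sketch below builds a hypothetical rack-adjacency graph and runs one unweighted neighbourhood-averaging step, the basic operation that GNN layers extend with learned weights. The rack layout, the adjacency rule, the function names `rack_adjacency` and `mean_aggregate`, and the sensor readings are all assumptions made for illustration; none of them is taken from GRAAFE.

```python
# Illustrative sketch only: this is NOT the GRAAFE model. The rack layout,
# the adjacency rule, and the feature values below are hypothetical.


def rack_adjacency(racks):
    """Link each node to the nodes directly above and below it in its rack."""
    neighbours = {node: [] for rack in racks for node in rack}
    for rack in racks:
        for upper, lower in zip(rack, rack[1:]):
            neighbours[upper].append(lower)
            neighbours[lower].append(upper)
    return neighbours


def mean_aggregate(features, neighbours):
    """One message-passing step: average a node's value with its neighbours'."""
    updated = {}
    for node, value in features.items():
        group = [value] + [features[n] for n in neighbours[node]]
        updated[node] = sum(group) / len(group)
    return updated


# Two hypothetical racks of three nodes each, listed top to bottom.
racks = [["r1n1", "r1n2", "r1n3"], ["r2n1", "r2n2", "r2n3"]]

# Hypothetical per-node sensor reading (for example, a temperature).
features = {
    "r1n1": 40.0, "r1n2": 70.0, "r1n3": 40.0,
    "r2n1": 40.0, "r2n2": 40.0, "r2n3": 40.0,
}

neighbours = rack_adjacency(racks)
smoothed = mean_aggregate(features, neighbours)
```

In this toy example the elevated reading on `r1n2` raises the aggregated value of its rack neighbours `r1n1` and `r1n3` to 55.0, while every node in the second rack stays at 40.0. A real GNN would stack several such steps with trainable weights and richer per-node features.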

Molan M., Mohsen Seyedkazemi Ardebili, Khan J.A., Beneventi F., Cesarini D., Borghesi A., et al. (2024). GRAAFE: GRaph anomaly anticipation framework for exascale HPC systems. Future Generation Computer Systems, 160, 644-653 [10.1016/j.future.2024.06.032].

GRAAFE: GRaph anomaly anticipation framework for exascale HPC systems

Molan M.; Mohsen Seyedkazemi Ardebili; Khan J. A.; Beneventi F.; Borghesi A.; Bartolini A.
2024

Molan M.; Mohsen Seyedkazemi Ardebili; Khan J.A.; Beneventi F.; Cesarini D.; Borghesi A.; Bartolini A.
Files in this item:

GRAAFE.pdf (embargoed until 26/06/2025)
Type: Postprint
License: Open Access license. Creative Commons Attribution - NonCommercial - NoDerivatives (CC BY-NC-ND)
Size: 1.17 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/980974