Molan M., Mohsen Seyedkazemi Ardebili, Khan J.A., Beneventi F., Cesarini D., Borghesi A., et al. (2024). GRAAFE: GRaph anomaly anticipation framework for exascale HPC systems. FUTURE GENERATION COMPUTER SYSTEMS, 160, 644-653 [10.1016/j.future.2024.06.032].

GRAAFE: GRaph anomaly anticipation framework for exascale HPC systems

Molan M.; Mohsen Seyedkazemi Ardebili; Khan J.A.; Beneventi F.; Cesarini D.; Borghesi A.; Bartolini A.

2024

Abstract

The main limitations of applying predictive tools to large-scale supercomputers are the complexity of deploying Artificial Intelligence (AI) services in production and the difficulty of modeling heterogeneous data sources while preserving topological information in compact models. This paper proposes GRAAFE, a framework for continuously predicting compute node failures in the Marconi100 supercomputer. The framework consists of (i) an anomaly prediction model based on graph neural networks (GNNs) that leverage the nodes’ physical layout in the compute room and (ii) its computationally efficient integration into Marconi100’s ExaMon holistic monitoring system using Kubeflow, a Kubernetes-based MLOps framework that enables continuous deployment of AI pipelines. The GRAAFE GNN model achieves an area under the curve (AUC) ranging from 0.91 to 0.78, surpassing the state of the art (SoA), which achieves an AUC between 0.64 and 0.5. GRAAFE sustains anomaly prediction for all Marconi100 nodes every 120 s, requiring 30% more CPU resources and less than 5% more RAM than monitoring alone.
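As a rough illustration of two ideas named in the abstract, the sketch below builds a graph from a physical node layout, applies one neighbour-averaging step of the kind GNN layers generalise, and computes an AUC from anomaly scores. It is not the GRAAFE model: the (rack, slot) layout, the single scalar feature per node, and the function names `layout_edges`, `aggregate`, and `auc` are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def layout_edges(positions: Dict[str, Tuple[int, int]]) -> Dict[str, List[str]]:
    """Link each node to the nodes directly below and above it in the same rack.

    `positions` maps a node name to a hypothetical (rack, slot) coordinate.
    """
    by_pos = {pos: name for name, pos in positions.items()}
    neighbours: Dict[str, List[str]] = {name: [] for name in positions}
    for name, (rack, slot) in positions.items():
        for delta in (-1, 1):
            other = by_pos.get((rack, slot + delta))
            if other is not None:
                neighbours[name].append(other)
    return neighbours


def aggregate(features: Dict[str, float],
              neighbours: Dict[str, List[str]]) -> Dict[str, float]:
    """One mean-aggregation step: average a node's value with its neighbours'."""
    out = {}
    for name, value in features.items():
        group = [value] + [features[n] for n in neighbours[name]]
        out[name] = sum(group) / len(group)
    return out


def auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the fraction of (anomalous, normal) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented): three nodes stacked in rack 0, one node alone in rack 1.
positions = {"n0": (0, 0), "n1": (0, 1), "n2": (0, 2), "n3": (1, 0)}
features = {"n0": 1.0, "n1": 4.0, "n2": 1.0, "n3": 2.0}
graph = layout_edges(positions)
smoothed = aggregate(features, graph)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the values reported in the abstract should be read.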

Year: 2024
Journal: FUTURE GENERATION COMPUTER SYSTEMS
DOI: https://dx.doi.org/10.1016/j.future.2024.06.032
Citation: Molan M., Mohsen Seyedkazemi Ardebili, Khan J.A., Beneventi F., Cesarini D., Borghesi A., et al. (2024). GRAAFE: GRaph anomaly anticipation framework for exascale HPC systems. FUTURE GENERATION COMPUTER SYSTEMS, 160, 644-653 [10.1016/j.future.2024.06.032].
All authors: Molan M.; Mohsen Seyedkazemi Ardebili; Khan J.A.; Beneventi F.; Cesarini D.; Borghesi A.; Bartolini A.
Publication type: 1.01 Journal article

Files in this record:

File	Size	Format
GRAAFE.pdf (postprint; embargoed until 26/06/2025; license: Open Access, Creative Commons Attribution - NonCommercial - NoDerivatives (CC BY-NC-ND))	1.17 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/980974
