

An Explainable Model for Fault Detection in HPC Systems / Molan M.; Borghesi A.; Beneventi F.; Guarrasi M.; Bartolini A. - ELECTRONIC. - 12761:(2021), pp. 378-391. (Paper presented at the International Conference on High Performance Computing, ISC High Performance 2021, held ONLINE in 2021) [10.1007/978-3-030-90539-2_25].

An Explainable Model for Fault Detection in HPC Systems

Molan M.; Borghesi A.; Beneventi F.; Guarrasi M.; Bartolini A.

2021

Abstract

Large supercomputers are composed of numerous components that can break down or behave in unwanted ways. Identifying broken components is a daunting task for system administrators, so an automated tool would be a boon for system resiliency. The wealth of data available in a supercomputer can be used for this task. In this work we propose an approach that combines holistic data center monitoring, node status labeling by system administrators, and an explainable model for fault detection in supercomputing nodes. The proposed model classifies the different states of the computing nodes using labeled data describing the supercomputer's behavior; such data is typically collected by system administrators but not integrated into the holistic monitoring infrastructure used for data center automation. Compared with other methods, the proposed one is robust and provides explainable predictions. The model has been trained and validated on data gathered from a tier-0 supercomputer in production.

Year: 2021
Volume title: High Performance Computing. ISC High Performance 2021
First page: 378
Last page: 391
Series: Lecture Notes in Computer Science
DOI: https://dx.doi.org/10.1007/978-3-030-90539-2_25
All authors: Molan M.; Borghesi A.; Beneventi F.; Guarrasi M.; Bartolini A.
Appears in types: 4.01 Conference proceedings contribution

Files in this product:

File	Size	Format
molan_MODA2021.pdf (open access; type: Postprint; license: free open-access license)	453.33 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/844129
