An Explainable Model for Fault Detection in HPC Systems

Molan M.; Borghesi A.; Beneventi F.; Guarrasi M.; Bartolini A.

2021

Abstract

Large supercomputers are composed of numerous components that can break down or behave in unwanted ways. Identifying broken components is a daunting task for system administrators, so an automated tool would be a boon for the system's resiliency. The wealth of data available in a supercomputer can be used for this task. In this work we propose an approach that combines holistic data center monitoring, node status labels assigned by system administrators, and an explainable model for fault detection in supercomputing nodes. The proposed model aims to classify the different states of the computing nodes using labeled data describing the supercomputer's behaviour; this data is typically collected by system administrators but not integrated into the holistic monitoring infrastructure for data center automation. Compared with other methods, the proposed one is robust and provides explainable predictions. The model has been trained and validated on data gathered from a tier-0 supercomputer in production.

Year	2021
Volume title	High Performance Computing. ISC High Performance 2021
First page	378
Last page	391
Series	Lecture Notes in Computer Science
DOI	https://dx.doi.org/10.1007/978-3-030-90539-2_25
Citation	Molan M., Borghesi A., Beneventi F., Guarrasi M., Bartolini A. (2021). An Explainable Model for Fault Detection in HPC Systems. Cham : Springer [10.1007/978-3-030-90539-2_25].
All authors	Molan M.; Borghesi A.; Beneventi F.; Guarrasi M.; Bartolini A.
Appears in types	4.01 Conference proceedings contribution

Files in this item:

File	Size	Format
molan_MODA2021.pdf (open access; type: postprint; license: free open-access license)	453.33 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/844129
