An Explainable Model for Fault Detection in HPC Systems

Molan M.; Borghesi A.; Beneventi F.; Guarrasi M.; Bartolini A.
2021

Abstract

Large supercomputers are composed of numerous components that may break down or behave in unwanted ways. Identifying broken components is a daunting task for system administrators, so an automated tool would be a boon for the system's resiliency. The wealth of data available in a supercomputer can be used for this task. In this work we propose an approach that combines holistic data center monitoring, node status labeling by system administrators, and an explainable model for fault detection in supercomputing nodes. The proposed model classifies the different states of the computing nodes using labeled data that describes the supercomputer's behaviour; such data is typically collected by system administrators but not integrated into holistic monitoring infrastructure for data center automation. Compared with other methods, the proposed one is robust and provides explainable predictions. The model has been trained and validated on data gathered from a tier-0 supercomputer in production.
High Performance Computing. ISC High Performance 2021
Pages 378-391
Molan M., Borghesi A., Beneventi F., Guarrasi M., Bartolini A. (2021). An Explainable Model for Fault Detection in HPC Systems. Cham : Springer [10.1007/978-3-030-90539-2_25].
Files in this record:

File: molan_MODA2021.pdf
Access: open access
Type: Postprint
License: Free open-access license
Size: 453.33 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/844129
Citations
  • PMC: ND
  • Scopus: 3
  • Web of Science (ISI): 2