Modern High Performance Computing (HPC) systems are complex machines, with major impacts on economy and society. Along with their computational capability, their energy consumption is also steadily raising, representing a critical issue given the ongoing environmental and energetic crisis. Therefore, developing strategies to optimize HPC system management has paramount importance, both to guarantee top-tier performance and to improve energy efficiency. One strategy is to act at the workload level and highlight the jobs that are most likely to fail, prior to their execution on the system. Jobs failing during their execution unnecessarily occupy resources which could delay other jobs, adversely affecting the system performance and energy consumption. In this paper, we study job failure prediction at submit-time using classical machine learning algorithms. Our novelty lies in (i) the combination of these algorithms with Natural Language Processing (NLP) tools to represent jobs and (ii) the design of the approach to work in an online fashion in a real system. The study is based on a dataset extracted from a production machine hosted at the HPC centre CINECA in Italy. Experimental results show that our approach is promising.

Antici, F., Borghesi, A., Kiziltan, Z. (2024). Online Job Failure Prediction in an HPC System. GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND : Springer Science and Business Media Deutschland GmbH [10.1007/978-3-031-48803-0_35].

Online Job Failure Prediction in an HPC System

Antici F.
Primo
;
Borghesi A.
Secondo
;
Kiziltan Z.
Ultimo
2024

Abstract

Modern High Performance Computing (HPC) systems are complex machines, with major impacts on economy and society. Along with their computational capability, their energy consumption is also steadily raising, representing a critical issue given the ongoing environmental and energetic crisis. Therefore, developing strategies to optimize HPC system management has paramount importance, both to guarantee top-tier performance and to improve energy efficiency. One strategy is to act at the workload level and highlight the jobs that are most likely to fail, prior to their execution on the system. Jobs failing during their execution unnecessarily occupy resources which could delay other jobs, adversely affecting the system performance and energy consumption. In this paper, we study job failure prediction at submit-time using classical machine learning algorithms. Our novelty lies in (i) the combination of these algorithms with Natural Language Processing (NLP) tools to represent jobs and (ii) the design of the approach to work in an online fashion in a real system. The study is based on a dataset extracted from a production machine hosted at the HPC centre CINECA in Italy. Experimental results show that our approach is promising.
2024
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
167
179
Antici, F., Borghesi, A., Kiziltan, Z. (2024). Online Job Failure Prediction in an HPC System. GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND : Springer Science and Business Media Deutschland GmbH [10.1007/978-3-031-48803-0_35].
Antici, F.; Borghesi, A.; Kiziltan, Z.
File in questo prodotto:
Eventuali allegati, non sono esposti

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11585/1009722
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 2
  • ???jsp.display-item.citation.isi??? 2
social impact