Modern High-Performance Computing (HPC) systems play a fundamental role in driving scientific research, as they execute computationally intensive jobs originating from diverse domains. However, HPC jobs are characterized by conflicting computational requirements, which may cause inefficiencies in resource usage, system throughput and energy consumption. One approach to tackling this problem is to distinguish between memory-bound and compute-bound jobs at their submission time, with the goal of making informed decisions about their execution. In this paper, we present MCBound, the first online data-driven framework to classify HPC jobs as memory/compute-bound before job execution, without user intervention. We propose a systematic characterization technique to generate a reference dataset from historical data for initial classification model training. Using the proposed characterization technique, we analyze the data of 2.2 million job runs on the Supercomputer Fugaku, a production HPC system installed at the RIKEN Center for Computational Science, in Japan. We implement MCBound for Fugaku and classify the jobs executed during February 2024. Our approach is proven effective, as it obtains an F1-macro average score of at least 0.89 as prediction quality, while incurring a negligible overhead on the system's operations. Our Python-based implementation of MCBound can be seamlessly configured and deployed in other HPC systems.
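The abstract reports prediction quality as an F1-macro average, that is, the unweighted mean of the per-class F1 scores. As a minimal sketch of how such a score is computed for a two-class memory/compute-bound problem, the following Python snippet may help. The function name `f1_macro`, the label strings, and the sample job labels are illustrative assumptions and are not taken from the MCBound implementation.

```python
def f1_macro(y_true, y_pred, labels=("memory-bound", "compute-bound")):
    """Return the unweighted mean of the per-class F1 scores.

    Hypothetical helper for illustration; not part of MCBound.
    """
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for four jobs, purely for illustration.
y_true = ["memory-bound", "memory-bound", "memory-bound", "compute-bound"]
y_pred = ["memory-bound", "memory-bound", "compute-bound", "compute-bound"]
score = f1_macro(y_true, y_pred)
print(round(score, 3))
```

Because each class contributes equally to the mean regardless of how many jobs fall into it, a macro average does not let a majority class hide poor performance on the minority class.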

Antici, F., Bartolini, A., Kiziltan, Z., Babaoglu, O., Kodama, Y. (2024). MCBound: An Online Framework to Characterize and Classify Memory/Compute-bound HPC Jobs. IEEE.

MCBound: An Online Framework to Characterize and Classify Memory/Compute-bound HPC Jobs

Francesco Antici; Andrea Bartolini; Zeynep Kiziltan; Ozalp Babaoglu; Yuetsu Kodama
2024

SC '24: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 2024, pp. 1-15
Antici, Francesco; Bartolini, Andrea; Kiziltan, Zeynep; Babaoglu, Ozalp; Kodama, Yuetsu
Files in this record:
SC41406.2024.00062.pdf
Access: open access
Type: Publisher's version (PDF)
License: Free open-access license
Size: 2.69 MB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/11585/998153