Antici, F., Bartolini, A., Kiziltan, Z., Babaoglu, O., Kodama, Y. (2024). MCBound: An Online Framework to Characterize and Classify Memory/Compute-bound HPC Jobs. IEEE.
MCBound: An Online Framework to Characterize and Classify Memory/Compute-bound HPC Jobs
Francesco Antici; Andrea Bartolini; Zeynep Kiziltan; Ozalp Babaoglu
2024
Abstract
Modern High-Performance Computing (HPC) systems play a fundamental role in driving scientific research, as they execute computationally intensive jobs originating from diverse domains. However, HPC jobs have conflicting computational requirements, which may cause inefficiencies in resource usage, system throughput and energy consumption. One approach to tackling this problem is to distinguish between memory-bound and compute-bound jobs at submission time, so that informed decisions can be made about their execution. In this paper, we present MCBound, the first online data-driven framework to classify HPC jobs as memory-bound or compute-bound before job execution, without user intervention. We propose a systematic characterization technique to generate a reference dataset from historical data for initial classification model training. Using the proposed characterization technique, we analyze the data of 2.2 million job runs on the Supercomputer Fugaku, a production HPC system installed at the RIKEN Center for Computational Science in Japan. We implement MCBound for Fugaku and classify the jobs executed during February 2024. Our approach proves effective, achieving a macro-averaged F1 score of at least 0.89 while incurring negligible overhead on the system's operations. Our Python-based implementation of MCBound can be configured and deployed in other HPC systems.

File | Size | Format | |
---|---|---|---|
SC41406.2024.00062.pdf (open access; type: publisher's version (PDF); license: free open-access license) | 2.69 MB | Adobe PDF | View/Open |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.