Antici, F., Bartolini, A., Domke, J., Kiziltan, Z., Yamamoto, K. (2025). F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in HPC Systems. Scientific Data, 12(1), 1-13 [10.1038/s41597-025-05633-1].
F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in HPC Systems
Antici F.; Bartolini A.; Kiziltan Z.
2025
Abstract
In recent decades, High Performance Computing (HPC) systems have accelerated scientific discoveries and innovations across different domains, from epidemic studies to climate science. For the sustainable development of HPC systems, it is fundamental to address their environmental impact in terms of carbon footprint and energy requirements, while ensuring high system throughput. Analyzing and predicting HPC job execution characteristics is instrumental in developing workload management strategies that simultaneously optimize system throughput and minimize environmental impact. However, the development of accurate predictive models is hindered by the lack of large public datasets. In this paper, we present F-DATA, a public dataset containing information on around 24 million jobs executed on Fugaku, the most powerful supercomputer during the data collection phase. The data contains an extensive set of features, allowing for the prediction of a multitude of job characteristics. Sensitive job data appears in both anonymized and irreversibly encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes without violating privacy concerns.

| File | Access | Description | Type | License | Size | Format |
|---|---|---|---|---|---|---|
| s41597-025-05633-1.pdf | Open access | Publisher's version | Publisher's version (PDF) / Version of Record | Creative Commons | 1.81 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


