Farooq, E., Borghesi, A. (2024). LSTM-Based Unsupervised Anomaly Detection in High-Performance Computing: A Federated Learning Approach. DOI: 10.1109/bigdata62323.2024.10825337.

LSTM-Based Unsupervised Anomaly Detection in High-Performance Computing: A Federated Learning Approach

Farooq, Emmen; Borghesi, Andrea
2024

Abstract

High-Performance Computing (HPC) systems are intricate machines that must run at maximum efficiency to justify their high cost and to minimize their environmental impact. Anomalies that disrupt the operation of supercomputing nodes are a significant problem in modern HPC systems, so automated anomaly detection is a crucial area of research within the HPC domain. Machine Learning (ML) models have been very successful at identifying anomalies on individual nodes, especially because contemporary supercomputers are equipped with advanced monitoring systems that provide large datasets for training. However, the potential to combine data from multiple nodes and to use collective ML models remains largely unexplored. Federated Learning (FL) is a promising approach because it enables individual models to share knowledge and learn from one another. Although FL has been employed in areas such as healthcare and IoT, its application in HPC is still novel. This study explores how FL can be leveraged to enhance anomaly detection in HPC systems. On data from a real-world supercomputer, the approach raises the average F1-score from 0.307 to 0.815 and the average AUC from 0.368 to 0.77. Moreover, FL drastically reduces the time required to gather sufficient training data, allowing faster deployment of detection models. Traditional ML models typically need about 4.5 months of data to perform effectively, whereas FL achieves the same performance with only 1.2 weeks of data, a 15-fold reduction in data requirements.
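
The abstract describes per-node models that "share and learn from one another" but does not state which aggregation rule is used. As an illustration only, the sketch below assumes the common federated averaging scheme (FedAvg), in which a coordinator combines per-node parameters weighted by each node's sample count. The function name `federated_average`, the toy parameter shapes, and the sample counts are hypothetical and are not taken from the paper.

```python
import numpy as np


def federated_average(client_weights, client_sizes):
    """Combine per-node parameter lists into one global list.

    Assumption: FedAvg-style weighting by local sample count. The
    paper's actual aggregation rule is not given in the abstract.
    """
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    averaged = []
    for layer in range(n_layers):
        # Accumulate the weighted sum of this layer across all nodes.
        acc = np.zeros_like(client_weights[0][layer], dtype=float)
        for weights, size in zip(client_weights, client_sizes):
            acc += (size / total) * weights[layer]
        averaged.append(acc)
    return averaged


# Hypothetical toy setup: three compute nodes, each holding a model
# with one 2x2 weight matrix and one length-2 bias vector.
node_models = [
    [np.full((2, 2), 1.0), np.full(2, 0.0)],
    [np.full((2, 2), 2.0), np.full(2, 1.0)],
    [np.full((2, 2), 4.0), np.full(2, 2.0)],
]
node_samples = [100, 100, 200]  # hypothetical local dataset sizes

global_model = federated_average(node_models, node_samples)
```

In a full training loop, each node would train its local model on its own monitoring data, send only the parameters to the coordinator, and receive the averaged parameters back for the next round, so raw node data never leaves the node.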
Published in: 2024 IEEE International Conference on Big Data (BigData), pp. 7735-7744
Files in this item:
File: FL_LSTM.pdf
Access: open access
Type: Postprint / Author's Accepted Manuscript (AAM) - version accepted for publication after peer review
License: Free open-access license
Size: 906.56 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1002586