Farooq, E., Milano, M., Borghesi, A. (2026). Federated LSTM autoencoders for time series anomaly detection in production-scale HPC systems. Knowledge-Based Systems, 334, 1-19 [10.1016/j.knosys.2025.115043].
Federated LSTM autoencoders for time series anomaly detection in production-scale HPC systems
Farooq, Emmen; Milano, Michela; Borghesi, Andrea
2026
Abstract
High-Performance Computing (HPC) systems are becoming increasingly vulnerable to anomalies as their scale and complexity grow. In this work, we propose a federated learning (FL) framework that integrates Long Short-Term Memory (LSTM) autoencoders for time series anomaly detection, allowing decentralized model training without sharing raw data. Using real telemetry from the Marconi100 Tier-0 supercomputer, our approach improves the average F1-score from 0.388 to 0.867 (+123 %) and the AUC from 0.334 to 0.808 (+142 %). It also cuts the training data requirement by a factor of 15, reducing the collection period from 4.5 months to just 1.25 weeks. These improvements are consistent across unsupervised, semi-supervised, and supervised settings, and significance testing with the Wilcoxon signed-rank test confirms they are statistically robust (p < 0.01). To our knowledge, this is the first comprehensive evaluation of FL-based LSTM autoencoders for anomaly detection in real HPC environments.

| File | Size | Format |
|---|---|---|
| Federated-LSTM-KBS.pdf (open access). Type: Publisher's version (PDF) / Version of Record. License: Open Access License, Creative Commons Attribution (CC BY) | 4.35 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.



