Federated LSTM autoencoders for time series anomaly detection in production-scale HPC systems

Farooq, Emmen; Milano, Michela; Borghesi, Andrea
2026

Abstract

High-Performance Computing (HPC) systems are becoming increasingly vulnerable to anomalies as their scale and complexity grow. In this work, we propose a federated learning (FL) framework that integrates Long Short-Term Memory (LSTM) autoencoders for time series anomaly detection, allowing decentralized model training without sharing raw data. Using real telemetry from the Marconi100 Tier-0 supercomputer, our approach improves the average F1-score from 0.388 to 0.867 (+123 %) and the AUC from 0.334 to 0.808 (+142 %). It also cuts the training data requirement by a factor of 15, reducing the collection period from 4.5 months to just 1.25 weeks. These improvements are consistent across unsupervised, semi-supervised, and supervised settings, and significance testing with the Wilcoxon signed-rank test confirms they are statistically robust (p < 0.01). To our knowledge, this is the first comprehensive evaluation of FL-based LSTM autoencoders for anomaly detection in real HPC environments.
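
The headline figures in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `relative_gain_percent` and the months-to-weeks conversion (an average Gregorian month of about 4.35 weeks) are assumptions made here, not details taken from the paper.

```python
def relative_gain_percent(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Scores as reported in the abstract (baseline -> federated approach).
f1_gain = relative_gain_percent(0.388, 0.867)
auc_gain = relative_gain_percent(0.334, 0.808)

# Assumption: one month is an average Gregorian month (365.25 / 12 days).
WEEKS_PER_MONTH = 365.25 / 12 / 7

# Collection period reported in the abstract: 4.5 months -> 1.25 weeks.
data_reduction_factor = (4.5 * WEEKS_PER_MONTH) / 1.25

print(f"F1 relative gain:  {f1_gain:.0f} %")
print(f"AUC relative gain: {auc_gain:.0f} %")
print(f"Data reduction:    {data_reduction_factor:.1f}x")
```

Under this month-length convention the data reduction comes out between 15 and 16, which is consistent with the rounded "factor of 15" quoted above; a different convention (for example, a 4-week month) shifts the ratio slightly.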
Farooq, E., Milano, M., Borghesi, A. (2026). Federated LSTM autoencoders for time series anomaly detection in production-scale HPC systems. Knowledge-Based Systems, 334, 1-19 [10.1016/j.knosys.2025.115043].
Files in this item:
File: Federated-LSTM-KBS.pdf
Access: open access
Type: Publisher's version (PDF) / Version Of Record
License: Open Access license. Creative Commons Attribution (CC BY)
Size: 4.35 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1037167
Citations
  • PMC N/A
  • Scopus 3
  • Web of Science (ISI) 0
  • OpenAlex 0