Farooq, E., Milano, M., Borghesi, A. (2026). Federated LSTM autoencoders for time series anomaly detection in production-scale HPC systems. KNOWLEDGE-BASED SYSTEMS, 334, 1-19 [10.1016/j.knosys.2025.115043].

Federated LSTM autoencoders for time series anomaly detection in production-scale HPC systems

Farooq, Emmen; Milano, Michela; Borghesi, Andrea

2026

Abstract

High-Performance Computing (HPC) systems are becoming increasingly vulnerable to anomalies as their scale and complexity grow. In this work, we propose a federated learning (FL) framework that integrates Long Short-Term Memory (LSTM) autoencoders for time series anomaly detection, allowing decentralized model training without sharing raw data. Using real telemetry from the Marconi100 Tier-0 supercomputer, our approach improves the average F1-score from 0.388 to 0.867 (+123 %) and the AUC from 0.334 to 0.808 (+142 %). It also cuts the training data requirement by a factor of 15, reducing the collection period from 4.5 months to just 1.25 weeks. These improvements are consistent across unsupervised, semi-supervised, and supervised settings, and significance testing with the Wilcoxon signed-rank test confirms they are statistically robust (p < 0.01). To our knowledge, this is the first comprehensive evaluation of FL-based LSTM autoencoders for anomaly detection in real HPC environments.
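The decentralized training described in the abstract can be illustrated with a minimal sketch. It assumes a FedAvg-style aggregation (a sample-size-weighted mean of per-node model parameters) and a fixed reconstruction-error threshold for flagging anomalies; the abstract confirms neither choice. The names `federated_average` and `flag_anomalies`, the flat parameter vectors, and all numbers below are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Sample-size-weighted mean of per-client parameter vectors.

    Assumption: FedAvg-style aggregation. Only model parameters leave each
    client; the raw telemetry used for local training is never shared.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    if any(len(w) != dim for w in client_weights):
        raise ValueError("all clients must send vectors of equal length")
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def flag_anomalies(reconstruction_errors: Sequence[float],
                   threshold: float) -> List[bool]:
    """Mark a time window as anomalous when the autoencoder's
    reconstruction error exceeds the threshold (assumed decision rule)."""
    return [err > threshold for err in reconstruction_errors]


# Hypothetical round: three compute nodes send flattened model parameters.
node_weights = [[0.0, 1.0], [1.0, 1.0], [2.0, 4.0]]
node_sizes = [100, 100, 200]          # local training samples per node
global_weights = federated_average(node_weights, node_sizes)

# Hypothetical reconstruction errors for three telemetry windows.
flags = flag_anomalies([0.1, 0.9, 0.3], threshold=0.5)
```

In a real deployment each node would train its own LSTM autoencoder on local telemetry between aggregation rounds; that local training step is omitted here to keep the sketch dependency-free.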

Year	2026
Journal	KNOWLEDGE-BASED SYSTEMS
DOI	https://dx.doi.org/10.1016/j.knosys.2025.115043
Citation	Farooq, E., Milano, M., Borghesi, A. (2026). Federated LSTM autoencoders for time series anomaly detection in production-scale HPC systems. KNOWLEDGE-BASED SYSTEMS, 334, 1-19 [10.1016/j.knosys.2025.115043].
All authors	Farooq, Emmen; Milano, Michela; Borghesi, Andrea
Publication type	1.01 Journal article

Files in this record:

File	Size	Format
Federated-LSTM-KBS.pdf (open access; type: publisher's PDF / Version of Record; license: Creative Commons Attribution, CC BY)	4.35 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1037167
