Federated transfer learning for anomaly detection in HPC systems: First real-world validation on a tier-0 supercomputer

Farooq, Emmen; Milano, Michela; Borghesi, Andrea

doi:10.1016/j.eswa.2025.129754

High-Performance Computing (HPC) systems increasingly require intelligent, scalable anomaly detection to ensure operational reliability. However, conventional centralized approaches often struggle with data privacy constraints, poor generalization across heterogeneous nodes, and limited scalability. This study presents the first real-world application of federated transfer learning (FTL) for anomaly detection in a production-grade Tier-0 supercomputer. By combining federated learning with transfer learning, the proposed framework enables decentralized model training and personalized adaptation to unseen nodes, without accessing raw data. We validate the approach using two large-scale telemetry datasets collected from 100 nodes of the Marconi100 supercomputer, evaluating its effectiveness across supervised, semi-supervised, and unsupervised learning paradigms. Results show that FTL consistently improves anomaly detection performance on nodes that did not participate in federated training, with F1-score gains reaching up to 0.50. These improvements demonstrate the framework’s ability to generalize across non-identically distributed data and maintain detection accuracy under real-world conditions. This work establishes FTL as a scalable, privacy-preserving solution for fault detection in HPC environments. Its practical deployment on production hardware confirms its readiness for real-time monitoring applications in large-scale, heterogeneous computing systems.
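
The combination of federated learning and transfer learning summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the model, the aggregation rule, or the hyperparameters, so the example assumes a plain logistic-regression classifier, FedAvg-style weighted averaging, and synthetic data standing in for node telemetry. All function names (`local_sgd`, `fed_avg`, `fine_tune`, `make_node`) are hypothetical.

```python
import numpy as np

def local_sgd(w, X, y, lr=0.1, epochs=5):
    """A few epochs of full-batch gradient descent for logistic regression."""
    w = w.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def fed_avg(clients, dim, rounds=20):
    """FedAvg-style training: only weights are shared, raw (X, y) stay local."""
    w_global = np.zeros(dim)
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    for _ in range(rounds):
        # Each client starts from the current global weights and trains locally.
        local_ws = np.stack([local_sgd(w_global, X, y) for X, y in clients])
        # The server averages the local weights, weighted by local sample count.
        w_global = (sizes[:, None] * local_ws).sum(axis=0) / sizes.sum()
    return w_global

def fine_tune(w_global, X, y, lr=0.1, epochs=10):
    """Transfer step: adapt the global weights on an unseen node's own data."""
    return local_sgd(w_global, X, y, lr=lr, epochs=epochs)

def log_loss(w, X, y):
    """Mean binary cross-entropy of the logistic model with weights w."""
    p = np.clip(1.0 / (1.0 + np.exp(-X @ w)), 1e-9, 1 - 1e-9)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def make_node(rng, shift, n=200, dim=4):
    """Synthetic stand-in for one node's labelled telemetry (assumed data)."""
    X = rng.normal(loc=shift, size=(n, dim))
    y = (X.sum(axis=1) > shift * dim).astype(float)
    return X, y

rng = np.random.default_rng(0)
# Three participating nodes with different (non-identical) distributions.
clients = [make_node(rng, shift) for shift in (0.0, 0.5, 1.0)]
# One node that never took part in federated training.
X_new, y_new = make_node(rng, shift=2.0)

w_global = fed_avg(clients, dim=4)
w_personal = fine_tune(w_global, X_new, y_new)
```

Comparing `log_loss(w_global, X_new, y_new)` with `log_loss(w_personal, X_new, y_new)` shows how the fine-tuned weights fit the unseen node relative to the shared model in this toy setting. It says nothing about the F1-score gains reported in the paper.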

Farooq, E., Milano, M., Borghesi, A. (2025). Federated transfer learning for anomaly detection in HPC systems: First real-world validation on a tier-0 supercomputer. EXPERT SYSTEMS WITH APPLICATIONS, 298, 1-15 [10.1016/j.eswa.2025.129754].

Year: 2025
Journal: EXPERT SYSTEMS WITH APPLICATIONS
DOI: https://dx.doi.org/10.1016/j.eswa.2025.129754
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1031730

Warning: the data displayed have not been validated by the university.
