High-Performance Computing (HPC) systems increasingly require intelligent, scalable anomaly detection to ensure operational reliability. However, conventional centralized approaches often struggle with data privacy constraints, poor generalization across heterogeneous nodes, and limited scalability. This study presents the first real-world application of federated transfer learning (FTL) for anomaly detection in a production-grade Tier-0 supercomputer. By combining federated learning with transfer learning, the proposed framework enables decentralized model training and personalized adaptation to unseen nodes, without accessing raw data.We validate the approach using two large-scale telemetry datasets collected from 100 nodes of the Marconi100 supercomputer, evaluating its effectiveness across supervised, semi-supervised, and unsupervised learning paradigms. Results show that FTL consistently improves anomaly detection performance on nodes that did not participate in federated training, with F1-score gains reaching up to 0.50. These improvements demonstrate the framework’s ability to generalize across non-identically distributed data and maintain detection accuracy under real-world conditions. This work establishes FTL as a scalable, privacy-preserving solution for fault detection in HPC environments. Its practical deployment on production hardware confirms its readiness for real-time monitoring applications in large-scale, heterogeneous computing systems.
Farooq, E., Milano, M., Borghesi, A. (2025). Federated transfer learning for anomaly detection in HPC systems: First real-world validation on a tier-0 supercomputer. EXPERT SYSTEMS WITH APPLICATIONS, 298, 1-15 [10.1016/j.eswa.2025.129754].
Federated transfer learning for anomaly detection in HPC systems: First real-world validation on a tier-0 supercomputer
Farooq, Emmen;Milano, Michela;Borghesi, Andrea
2025
Abstract
High-Performance Computing (HPC) systems increasingly require intelligent, scalable anomaly detection to ensure operational reliability. However, conventional centralized approaches often struggle with data privacy constraints, poor generalization across heterogeneous nodes, and limited scalability. This study presents the first real-world application of federated transfer learning (FTL) for anomaly detection in a production-grade Tier-0 supercomputer. By combining federated learning with transfer learning, the proposed framework enables decentralized model training and personalized adaptation to unseen nodes, without accessing raw data.We validate the approach using two large-scale telemetry datasets collected from 100 nodes of the Marconi100 supercomputer, evaluating its effectiveness across supervised, semi-supervised, and unsupervised learning paradigms. Results show that FTL consistently improves anomaly detection performance on nodes that did not participate in federated training, with F1-score gains reaching up to 0.50. These improvements demonstrate the framework’s ability to generalize across non-identically distributed data and maintain detection accuracy under real-world conditions. This work establishes FTL as a scalable, privacy-preserving solution for fault detection in HPC environments. Its practical deployment on production hardware confirms its readiness for real-time monitoring applications in large-scale, heterogeneous computing systems.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.


