Analysing Supercomputer Nodes Behaviour with the Latent Representation of Deep Learning Models

Molan, M; Borghesi, A; Benini, L; Bartolini, A

2022

Abstract

Anomaly detection systems are vital to ensuring the availability of modern High-Performance Computing (HPC) systems, where many components can fail or malfunction. Building a data-driven representation of the computing nodes can help with predictive maintenance and facility management. Fortunately, most current supercomputers are equipped with monitoring frameworks that, in conjunction with Deep Learning (DL) models, can build such representations. In this work, we propose a novel semi-supervised DL approach based on autoencoder networks and clustering algorithms (applied to the latent representation) to build a digital twin of the system's computing nodes. The DL model projects the node features into a lower-dimensional space. Clustering is then applied to capture and reveal underlying, non-trivial correlations between the features. The extracted information provides valuable insights for system administrators and managers, such as anomaly detection and node classification based on behaviour and operating conditions. We validated the approach on 240 nodes of the Marconi 100 system, a Tier-0 supercomputer hosted at CINECA (Italy), over a 10-month period.
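
The pipeline outlined in the abstract (project node features into a latent space with an autoencoder, then cluster the latent codes) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the network architecture, the clustering algorithm, or the features used, so everything below is an assumption made for illustration. The data are synthetic (two artificial groups of nodes with six made-up features, not Marconi 100 telemetry), the autoencoder is a tiny tanh encoder with a linear decoder trained by plain gradient descent, k-means stands in for the unspecified clustering step, and the names `train_autoencoder`, `encode`, `reconstruction_error`, and `kmeans` are hypothetical.

```python
import numpy as np

def train_autoencoder(X, latent_dim=2, epochs=2000, lr=0.01, seed=0):
    """Fit a tiny autoencoder (tanh encoder, linear decoder) by gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, latent_dim))
    W_dec = rng.normal(scale=0.1, size=(latent_dim, d))
    history = []                                   # mean squared error per epoch
    for _ in range(epochs):
        Z = np.tanh(X @ W_enc)                     # latent representation
        err = Z @ W_dec - X                        # reconstruction residual
        history.append(float(np.mean(err ** 2)))
        grad_dec = Z.T @ err / n
        grad_enc = X.T @ ((err @ W_dec.T) * (1.0 - Z ** 2)) / n
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, W_dec, history

def encode(X, W_enc):
    """Project samples into the latent space."""
    return np.tanh(X @ W_enc)

def reconstruction_error(X, W_enc, W_dec):
    """Per-sample mean squared reconstruction error, usable as an anomaly score."""
    return np.mean((encode(X, W_enc) @ W_dec - X) ** 2, axis=1)

def kmeans(Z, k, iters=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [Z[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(Z - c, axis=1) for c in centroids], axis=0)
        centroids.append(Z[np.argmax(dist)])
    centroids = np.array(centroids)
    for _ in range(iters):
        dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = Z[labels == j].mean(axis=0)
    return labels, centroids

# Synthetic stand-in for node telemetry: two behaviour groups, six features each.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=2.0, scale=0.3, size=(20, 6))
group_b = rng.normal(loc=-2.0, scale=0.3, size=(20, 6))
X_raw = np.vstack([group_a, group_b])
mu, sigma = X_raw.mean(axis=0), X_raw.std(axis=0)
X = (X_raw - mu) / sigma                           # standardise each feature

W_enc, W_dec, history = train_autoencoder(X)
labels, _ = kmeans(encode(X, W_enc), k=2)          # cluster in the latent space
train_scores = reconstruction_error(X, W_enc, W_dec)

# A sample whose feature pattern was never seen during training.
anomaly = (np.array([[3.0, -3.0, 3.0, -3.0, 3.0, -3.0]]) - mu) / sigma
anomaly_score = reconstruction_error(anomaly, W_enc, W_dec)[0]
print("cluster labels:", labels)
print("max training score:", train_scores.max(), "anomaly score:", anomaly_score)
```

In this toy setting, the cluster labels group nodes by behaviour, while the reconstruction error serves as a simple anomaly score: samples that do not follow the correlations learned during training are reconstructed poorly. A real deployment would presumably use a deeper network and actual monitoring data.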

Year: 2022
Volume title: Euro-Par 2022: Parallel Processing. Euro-Par 2022. Lecture Notes in Computer Science
First page: 171
Last page: 185
Series: LECTURE NOTES IN ARTIFICIAL INTELLIGENCE
DOI: https://dx.doi.org/10.1007/978-3-031-12597-3_11
Citation: Molan, M., Borghesi, A., Benini, L., Bartolini, A. (2022). Analysing Supercomputer Nodes Behaviour with the Latent Representation of Deep Learning Models. GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND : SPRINGER INTERNATIONAL PUBLISHING AG [10.1007/978-3-031-12597-3_11].
All authors: Molan, M; Borghesi, A; Benini, L; Bartolini, A
Appears in types: 4.01 Conference proceedings contribution

Files in this product:

File	Access	Type	License	Size	Format
postPrint_Analysing_Supercomputer_Nodes_Behaviour_with_the_Latent_Representation_of_Deep_Learning_Models.pdf	Open Access from 1 August 2023	Postprint	Free open access license	1.35 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/894555
