Towards Predictive Maintenance with Machine Learning at the INFN-CNAF computing centre

Giommi, Luca; Bonacorsi, Daniele; Diotalevi, Tommaso; Rinaldi, Lorenzo; Ceccanti, Andrea; Tisbeni, Simone
2019

Abstract

The INFN-CNAF computing center, one of the Worldwide LHC Computing Grid Tier-1 sites, serves a large set of scientific communities, in High Energy Physics and beyond. To increase efficiency and remain competitive in the long run, CNAF is launching several activities aimed at implementing a global predictive maintenance solution for the site. This requires a site-wide effort to collect, clean and structure all potentially useful data from the log files of the various Tier-1 services and systems, a necessary step before designing machine-learning-based approaches to predictive maintenance. Efficient storage systems are one of the key ingredients of Tier-1 operations. CNAF uses the StoRM service as its Grid Storage Resource Manager solution. StoRM operations are logged in a very complex manner: the log content is deeply unstructured and hard to exploit for analytics purposes. Despite this difficulty, the StoRM logs are a precious source of information for operators (e.g. real-time monitoring and anomaly detection), for developers (e.g. debugging, service stability, code improvements) and for site managers (e.g. service optimization, storage usage efficiency, time- and money-saving ways to spot and prevent unwanted behaviors). Building on previous experience with Big Data Analytics and Machine/Deep Learning in the CMS experiment, this work describes how the StoRM logs can be handled and parsed to extract the relevant information, how such log handling can be designed to work automatically, how to define and implement metrics that tag critical states of the service, how to correlate StoRM events with events from external services, and ultimately how to contribute to the future CNAF-wide predictive maintenance system. Initial results of this activity are presented and discussed. Ongoing complementary work at the CNAF center is also mentioned.
2019
Pos Sissa
003
018
Giommi, Luca; Bonacorsi, Daniele; Diotalevi, Tommaso; Rinaldi, Lorenzo; Morganti, Lucia; Falabella, Antonio; Ronchieri, Elisabetta; Ceccanti, Andrea; Martelli, Barbara; Tisbeni, Simone
Use this identifier to cite or link to this document: https://hdl.handle.net/11585/724411