Power and thermal management are critical components of High-Performance-Computing (HPC) systems, due to their high power density and large total power consumption. The assessment of thermal dissipation by means of compact models directly from the thermal response of the final device enables more robust and precise thermal control strategies as well as automated diagnosis. However, when dealing with large scale systems “in production", the accuracy of learned thermal models depends on the dynamics of the power excitation, which depends also on the executed workload, and measurement nonidealities, such as quantization. In this paper we show that, using an advanced system identification algorithm, we are able to generate very accurate thermal models (average error lower than our sensors quantization step of 1∘C) for a large scale HPC system on real workloads for very long time periods. However, we also show that: 1) not all real workloads allow for the identification of a good model; 2) starting from the theory of system identification it is very difficult to evaluate if a trace of data leads to a good estimated model. We then propose and validate a set of techniques based on machine learning and deep learning algorithms for the choice of data traces to be used for model identification. We also show that deep learning techniques are absolutely necessary to correctly choose such traces up to 96% of the times.

Robust identification of thermal models for in-production High-Performance-Computing clusters with machine learning-based data selection / Pittino F.; Diversi R.; Benini L.; Bartolini A.. - In: IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS. - ISSN 0278-0070. - STAMPA. - 39:10(2020), pp. 8887210.2042-8887210.2054. [10.1109/TCAD.2019.2950378]

Robust identification of thermal models for in-production High-Performance-Computing clusters with machine learning-based data selection

Diversi R.;Benini L.;Bartolini A.
Ultimo
2020

Abstract

Power and thermal management are critical components of High-Performance-Computing (HPC) systems, due to their high power density and large total power consumption. The assessment of thermal dissipation by means of compact models directly from the thermal response of the final device enables more robust and precise thermal control strategies as well as automated diagnosis. However, when dealing with large scale systems “in production", the accuracy of learned thermal models depends on the dynamics of the power excitation, which depends also on the executed workload, and measurement nonidealities, such as quantization. In this paper we show that, using an advanced system identification algorithm, we are able to generate very accurate thermal models (average error lower than our sensors quantization step of 1∘C) for a large scale HPC system on real workloads for very long time periods. However, we also show that: 1) not all real workloads allow for the identification of a good model; 2) starting from the theory of system identification it is very difficult to evaluate if a trace of data leads to a good estimated model. We then propose and validate a set of techniques based on machine learning and deep learning algorithms for the choice of data traces to be used for model identification. We also show that deep learning techniques are absolutely necessary to correctly choose such traces up to 96% of the times.
2020
Robust identification of thermal models for in-production High-Performance-Computing clusters with machine learning-based data selection / Pittino F.; Diversi R.; Benini L.; Bartolini A.. - In: IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS. - ISSN 0278-0070. - STAMPA. - 39:10(2020), pp. 8887210.2042-8887210.2054. [10.1109/TCAD.2019.2950378]
Pittino F.; Diversi R.; Benini L.; Bartolini A.
File in questo prodotto:
File Dimensione Formato  
TCAD_2020.pdf

Open Access dal 02/05/2020

Tipo: Postprint
Licenza: Licenza per accesso libero gratuito
Dimensione 1.89 MB
Formato Adobe PDF
1.89 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11585/788591
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 3
  • ???jsp.display-item.citation.isi??? 3
social impact