A semisupervised autoencoder-based approach for anomaly detection in high performance computing systems

Borghesi, A.; Bartolini, A.; Lombardi, M.; Milano, M.; Benini, L.

doi:10.1016/j.engappai.2019.07.008

High Performance Computing (HPC) systems are complex machines with heterogeneous components that can break or malfunction. Automated anomaly detection in these systems is a challenging and critical task, as HPC systems are expected to work 24/7. Most current state-of-the-art methods for this problem are Machine Learning techniques or statistical models that rely on a supervised approach; that is, the detection mechanism is trained to recognize a fixed number of different states (i.e., normal and anomalous conditions). In this paper a novel semi-supervised approach for anomaly detection in supercomputers is proposed, based on a type of neural network called an autoencoder. The approach learns the normal state of the supercomputer nodes and, after the training phase, can be used to discern anomalous conditions from normal behavior; in doing so it relies only on data characterizing the normal state of the system. This differs from supervised methods, which require data sets with many examples of anomalous states, which are in general very rare and/or hard to obtain. The approach was tested on a real-life High Performance Computing system equipped with a monitoring infrastructure capable of generating large amounts of data describing the system state. The proposed approach outperforms the best current techniques for semi-supervised anomaly detection, with an increase in detection accuracy of around 12%. Two different implementations are discussed: one where each supercomputer node has a specific model and one with a single, generalized model for all nodes, in order to explore the trade-off between accuracy and ease of deployment.
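
The general idea described above (train an autoencoder on normal data only, then flag inputs it cannot reproduce well) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic data, the one-hidden-layer tanh network written in plain NumPy, the use of mean squared reconstruction error as the anomaly score, and the 95th-percentile threshold. The class name TinyAutoencoder and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for node metrics (assumption): "normal" samples lie close
# to a 2-dimensional subspace of an 8-dimensional feature space.
mixing = rng.normal(size=(2, 8))

def make_normal(n):
    latent = rng.normal(size=(n, 2))
    return latent @ mixing + 0.05 * rng.normal(size=(n, 8))

train = make_normal(1000)          # training set: normal behavior only
test_normal = make_normal(200)     # held-out normal samples
mean, std = train.mean(axis=0), train.std(axis=0)

def scale(x):
    # Standardize with statistics computed on the normal training data.
    return (x - mean) / std

# Synthetic anomalies (assumption): samples that ignore the normal structure.
test_anomalous = 3.0 * rng.normal(size=(200, 8))   # already in scaled space


class TinyAutoencoder:
    """One tanh hidden layer, linear output, full-batch gradient descent."""

    def __init__(self, n_features, n_hidden):
        self.w_enc = 0.1 * rng.normal(size=(n_features, n_hidden))
        self.b_enc = np.zeros(n_hidden)
        self.w_dec = 0.1 * rng.normal(size=(n_hidden, n_features))
        self.b_dec = np.zeros(n_features)

    def reconstruct(self, x):
        hidden = np.tanh(x @ self.w_enc + self.b_enc)
        return hidden @ self.w_dec + self.b_dec

    def fit(self, x, epochs=3000, lr=0.01):
        for _ in range(epochs):
            hidden = np.tanh(x @ self.w_enc + self.b_enc)
            output = hidden @ self.w_dec + self.b_dec
            # Gradient of the mean squared error over all elements.
            grad_out = 2.0 * (output - x) / x.size
            grad_pre = (grad_out @ self.w_dec.T) * (1.0 - hidden ** 2)
            self.w_dec -= lr * (hidden.T @ grad_out)
            self.b_dec -= lr * grad_out.sum(axis=0)
            self.w_enc -= lr * (x.T @ grad_pre)
            self.b_enc -= lr * grad_pre.sum(axis=0)

    def errors(self, x):
        # Per-sample mean squared reconstruction error (the anomaly score).
        return ((self.reconstruct(x) - x) ** 2).mean(axis=1)


model = TinyAutoencoder(n_features=8, n_hidden=3)
model.fit(scale(train))

# Threshold chosen from normal data only (assumption: 95th percentile).
threshold = np.percentile(model.errors(scale(train)), 95)

false_positive_rate = (model.errors(scale(test_normal)) > threshold).mean()
detection_rate = (model.errors(test_anomalous) > threshold).mean()
print("threshold:", threshold)
print("false positive rate on held-out normal data:", false_positive_rate)
print("detection rate on synthetic anomalies:", detection_rate)
```

In this sketch no anomalous example is used during training or threshold selection, which is the semi-supervised property the abstract emphasizes. The per-node versus single generalized model trade-off would correspond to fitting one such model per node or one model on pooled data from all nodes.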

Borghesi A., Bartolini A., Lombardi M., Milano M., Benini L. (2019). A semisupervised autoencoder-based approach for anomaly detection in high performance computing systems. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 85, 634-644 [10.1016/j.engappai.2019.07.008].