Kubernetes (K8S) is a widely used orchestration solution that helps manage complex IT applications by providing mechanisms for autoscaling, health checking, cluster formation, and replication, which are essential to deploy and manage the multitude of connected microservices. However, these applications may suffer from unexpected faults, which can severely alter the underlying computing infrastructure and lead to service outages, highlighting the need for resilient solutions capable of mitigating the adverse effects of faults. To address this, the TELKA scheduler integrates Chaos Engineering (CE), Reinforcement Learning (RL), and Digital Twin (DT) to reallocate K8S pods evicted due to unexpected faults. While TELKA showed promising results in reallocating evicted pods, its preliminary implementations suffered from scalability issues, as the RL agent could operate effectively only on scenarios with the same number of nodes seen during training. To overcome this limitation, this paper improves TELKA by incorporating a neural network architecture called Deep Sets (DS), which generalizes the operation of TELKA to different numbers of nodes. Experimental results not only demonstrate the validity of the improved TELKA but also show how it can be used to identify good operating conditions.
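To illustrate why a Deep Sets architecture is not tied to a fixed number of nodes, the following is a minimal sketch of the general Deep Sets form, rho(sum of phi(x_i)), in plain Python. It is not the TELKA implementation: the class name `DeepSet`, the three per-node features (free CPU, free memory, health flag), the layer sizes, and the random untrained weights are all assumptions made for illustration only.

```python
import random


def make_linear(n_in, n_out, rng):
    """Return a random weight matrix (n_out rows of n_in weights) and a bias vector."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
    return weights, bias


def linear(layer, x):
    """Apply an affine map to the vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


class DeepSet:
    """Hypothetical toy model: rho(sum_i phi(x_i)) over a set of node feature vectors."""

    def __init__(self, n_features, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.phi = make_linear(n_features, n_hidden, rng)  # shared per-node encoder
        self.rho = make_linear(n_hidden, n_out, rng)       # decoder on the pooled vector

    def __call__(self, nodes):
        # Encode every node independently with the same encoder phi.
        encoded = [relu(linear(self.phi, node)) for node in nodes]
        # Sum pooling does not depend on node order and works for any node count.
        pooled = [sum(column) for column in zip(*encoded)]
        return linear(self.rho, pooled)


# Assumed features per node: [free CPU fraction, free memory fraction, healthy flag].
model = DeepSet(n_features=3, n_hidden=8, n_out=2)
cluster_3 = [[0.5, 0.7, 1.0], [0.2, 0.1, 1.0], [0.9, 0.8, 0.0]]
cluster_5 = cluster_3 + [[0.4, 0.4, 1.0], [0.6, 0.3, 1.0]]

out_3 = model(cluster_3)
out_3_shuffled = model(list(reversed(cluster_3)))
out_5 = model(cluster_5)
```

Because the encoder is shared across nodes and the pooling is a sum, the same weights accept a 3-node and a 5-node cluster and produce an output of the same size, and reordering the nodes leaves the output unchanged up to floating-point rounding. In an RL setting such an output could feed a policy or value head, although how TELKA wires this in is not described in the abstract.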

Zaccarini, M., Poltronieri, F., Borsatti, D., Cerroni, W., Foschini, L., Grabarnik, G.Ya., et al. (2025). Chaos Engineering Based Kubernetes Pod Rescheduling Through Deep Sets and Reinforcement Learning. Piscataway : Institute of Electrical and Electronics Engineers Inc. [10.1109/noms57970.2025.11073590].

Chaos Engineering Based Kubernetes Pod Rescheduling Through Deep Sets and Reinforcement Learning

Borsatti, Davide; Cerroni, Walter; Foschini, Luca; Scotece, Domenico; Stefanelli, Cesare
2025

Proceedings of IEEE/IFIP Network Operations and Management Symposium 2025, NOMS 2025, pp. 1-7
Zaccarini, Mattia; Poltronieri, Filippo; Borsatti, Davide; Cerroni, Walter; Foschini, Luca; Grabarnik, Genady Ya.; Scotece, Domenico; Shwartz, Larisa; ...
Files in this item:
File: Chaos_Engineering_Based_Kubernetes_Pod_Rescheduling_Through_Deep_Sets_and_Reinforcement_Learning (002).pdf
Embargoed until 25/07/2027
Type: Postprint / Author's Accepted Manuscript (AAM) - version accepted for publication after peer review
License: Free open-access license
Size: 316.34 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1025512