Zaccarini, M., Poltronieri, F., Borsatti, D., Cerroni, W., Foschini, L., Grabarnik, G.Ya., et al. (2025). Chaos Engineering Based Kubernetes Pod Rescheduling Through Deep Sets and Reinforcement Learning. Piscataway : Institute of Electrical and Electronics Engineers Inc. [10.1109/noms57970.2025.11073590].
Chaos Engineering Based Kubernetes Pod Rescheduling Through Deep Sets and Reinforcement Learning
Borsatti, Davide; Cerroni, Walter; Foschini, Luca; Scotece, Domenico; Stefanelli, Cesare
2025
Abstract
Kubernetes (K8S) is a widely used orchestration solution that helps manage complex IT applications by providing mechanisms for autoscaling, health checking, cluster formation, and replication, which are essential to deploy and manage the multitude of connected microservices. However, such applications may suffer from unexpected faults that can severely alter the underlying computing infrastructure and lead to service outages, highlighting the need for resilient solutions capable of mitigating the adverse effects of faults. To address this, the TELKA scheduler integrates Chaos Engineering (CE), Reinforcement Learning (RL), and Digital Twin (DT) to reallocate K8S pods evicted due to unexpected faults. While TELKA showed promising results in reallocating evicted pods, its preliminary implementations suffered from scalability issues, as the RL agent could only operate effectively on scenarios with the same number of nodes seen during training. To overcome this limitation, this paper improves TELKA by incorporating a neural network architecture called Deep Sets (DS), which can generalize the operation of TELKA to different numbers of nodes. Experimental results not only demonstrate the validity of the improved TELKA but also show how it can be used to identify good operating conditions.

| File | Size | Format | Access |
|---|---|---|---|
| Chaos_Engineering_Based_Kubernetes_Pod_Rescheduling_Through_Deep_Sets_and_Reinforcement_Learning (002).pdf | 316.34 kB | Adobe PDF | Embargoed until 25/07/2027 |

Type: Postprint / Author's Accepted Manuscript (AAM), the version accepted for publication after peer review
License: Free open-access license
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


