Staffolani, A., Darvariu, V., Bellavista, P., Musolesi, M. (2023). RLQ: Workload Allocation With Reinforcement Learning in Distributed Queues. IEEE Transactions on Parallel and Distributed Systems, 34(3), 856-868. DOI: 10.1109/TPDS.2022.3231981.
RLQ: Workload Allocation With Reinforcement Learning in Distributed Queues
Staffolani, Alessandro; Bellavista, Paolo; Musolesi, Mirco
2023
Abstract
Distributed workload queues are nowadays widely used due to their significant advantages in terms of decoupling, resilience, and scaling. Task allocation to worker nodes in distributed queue systems is typically simplistic (e.g., Least Recently Used) or uses hand-crafted heuristics that require task-specific information (e.g., task resource demands or expected time of execution). When such task information is not available and worker node capabilities are not homogeneous, the existing placement strategies may lead to unnecessarily large execution timings and usage costs. In this work, we formulate the task allocation problem in the Markov Decision Process framework, in which an agent assigns tasks to an available resource and receives a numerical reward signal upon task completion. Our adaptive and learning-based task allocation solution, Reinforcement Learning based Queues (RLQ), is implemented and integrated with the popular Celery task queuing system for Python. We compare RLQ against traditional solutions using both synthetic and real workload traces. On average, using synthetic workloads, RLQ reduces the execution cost by approximately 70%, the execution time by a factor of at least 3×, and the waiting time by almost 7×. Using real traces, we observe an improvement of about 20% for execution cost, around 70% improvement for execution time, and a reduction of approximately 20× in waiting time. We also compare RLQ with a strategy inspired by E-PVM, a state-of-the-art solution used in Google's Borg cluster manager, showing we are able to outperform it in five out of six scenarios.
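To make the abstract's MDP formulation concrete, the following is a minimal, self-contained sketch of how task allocation can be cast as a Markov Decision Process and learned with tabular Q-learning. Everything in it (the `Worker` and `TaskAllocationMDP` classes, the backlog-based state, the negative completion-time reward, and the learning loop) is an illustrative assumption for exposition; it is not RLQ's actual state representation, reward signal, or learning algorithm, which are defined in the paper itself.

```python
# Illustrative sketch only: names, state features, and reward shaping are assumptions,
# not taken from the RLQ implementation.
import random
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker node: a relative speed factor and a FIFO backlog of task sizes."""
    speed: float
    queue: list = field(default_factory=list)


class TaskAllocationMDP:
    """Toy MDP view of task allocation:
    state  = per-worker backlog length,
    action = index of the worker chosen for the incoming task,
    reward = negative estimated completion time (waiting + execution)."""

    def __init__(self, workers):
        self.workers = workers

    def state(self):
        # Observable state: number of queued tasks at each worker.
        return tuple(len(w.queue) for w in self.workers)

    def step(self, action, task_size):
        worker = self.workers[action]
        wait = sum(worker.queue) / worker.speed   # time to drain work already queued
        run = task_size / worker.speed            # execution time of the new task
        worker.queue.append(task_size)
        self._advance_time()
        return self.state(), -(wait + run)        # completion time fed back as negative reward

    def _advance_time(self):
        # Between arrivals, each worker drains one time unit of work (simplifying assumption).
        for w in self.workers:
            budget = w.speed
            while w.queue and budget > 0:
                done = min(w.queue[0], budget)
                w.queue[0] -= done
                budget -= done
                if w.queue[0] <= 0:
                    w.queue.pop(0)


def allocate(env, steps=5000, epsilon=0.1, alpha=0.5, gamma=0.9):
    """Epsilon-greedy tabular Q-learning over the toy MDP (illustrative only)."""
    q = {}
    actions = range(len(env.workers))
    state = env.state()
    for _ in range(steps):
        if random.random() < epsilon:
            action = random.choice(list(actions))
        else:
            action = max(actions, key=lambda a: q.get((state, a), 0.0))
        next_state, reward = env.step(action, task_size=random.uniform(1.0, 5.0))
        best_next = max(q.get((next_state, a), 0.0) for a in actions)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
        state = next_state
    return q


# Example: two heterogeneous workers, one twice as fast as the other.
env = TaskAllocationMDP([Worker(speed=1.0), Worker(speed=2.0)])
q_table = allocate(env)
```

The key design point this sketch mirrors from the abstract is that the reward is only observed once a task completes, so the allocator must learn worker capabilities from delayed feedback rather than from declared task resource demands.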
File | Size | Format | Access | Type | License
---|---|---|---|---|---
RLQ-Staffolani_et_al_2023-accepted.pdf | 2.91 MB | Adobe PDF | Open access | Postprint | Free open-access license
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.