A Scheduling Strategy to Run Hadoop Jobs on Geodistributed Data

Cavallo Marco; Cusmà Lorenzo; Di Modica G; Polito Carmelo; Tomarchio Orazio
2016

Abstract

Internet-of-Things scenarios are typically characterized by huge amounts of data being made available. A challenging task is to manage such data efficiently, analyzing and processing them to extract useful information. Distributed computing frameworks such as Hadoop, based on the MapReduce paradigm, have been used to process such amounts of data by exploiting the computing power of many cluster nodes. However, as long as the computing context consists of clusters of homogeneous nodes interconnected through high-speed links, the benefit brought by such frameworks is clear and tangible. Unfortunately, in many real big data applications the data to be processed reside in many computationally heterogeneous data centers distributed over the planet. In such contexts, Hadoop has been shown to perform very poorly. The proposal presented in this paper addresses this limitation. We designed a context-aware Hadoop framework capable of scheduling and distributing tasks among geographically distant clusters in a way that minimizes the overall job execution time. The proposed scheduler leverages the integer partitioning technique and a priori knowledge of big data application patterns to explore the space of all possible task schedules and estimate the one expected to perform best. Experiments conducted on a scheduler prototype demonstrate the benefit of the approach.
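The abstract's core mechanism, enumerating integer partitions of the task count across clusters and ranking the resulting schedules by an estimated completion time, can be illustrated with a minimal sketch. The code below is not the authors' implementation: the capacity figures and the makespan formula (the slowest cluster dominates) are illustrative assumptions standing in for the paper's actual cost model.

    from itertools import permutations

    def partitions(n, k, max_part=None):
        # Yield every partition of n into at most k parts, as a
        # non-increasing tuple (standard integer partitioning).
        if max_part is None:
            max_part = n
        if n == 0:
            yield ()
            return
        if k == 0:
            return
        for first in range(min(n, max_part), 0, -1):
            for rest in partitions(n - first, k - 1, first):
                yield (first,) + rest

    def best_schedule(num_tasks, capacities):
        # Exhaustively score every way of splitting num_tasks over
        # the clusters and return (estimated_time, assignment).
        # capacities[i] is a hypothetical throughput estimate for
        # cluster i (e.g. map tasks per minute) -- an assumption
        # here, not the paper's cost model.
        k = len(capacities)
        best = None
        for part in partitions(num_tasks, k):
            padded = part + (0,) * (k - len(part))
            # Try each assignment of partition parts to clusters.
            for assign in set(permutations(padded)):
                # The job finishes when the slowest cluster finishes.
                makespan = max(t / c for t, c in zip(assign, capacities))
                if best is None or makespan < best[0]:
                    best = (makespan, assign)
        return best

    # Example: 10 map tasks over three clusters of unequal power.
    print(best_schedule(10, [4.0, 2.0, 1.0]))

For large task counts or many clusters this exhaustive search grows quickly; the paper's use of a priori knowledge of application patterns to prune and estimate the schedule space is what keeps the approach practical.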
Advances in Service-Oriented and Cloud Computing: Workshops of ESOCC 2015, Revised Selected Papers, pp. 5-19
A Scheduling Strategy to Run Hadoop Jobs on Geodistributed Data / Cavallo Marco; Cusmà Lorenzo; Di Modica G; Polito Carmelo; Tomarchio Orazio. - Electronic. - (2016), pp. 5-19. [10.1007/978-3-319-33313-7_1]

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/730314
Warning: the displayed data have not been validated by the university.

Citations
  • PMC: ND
  • Scopus: 8
  • Web of Science (ISI): 5