In this paper we propose a novel cost model for Spark SQL. The model covers the class of Generalized Projection, Selection, Join (GPSJ) queries and takes into account network and I/O costs as well as the most relevant CPU costs. The execution cost is computed starting from a physical plan produced by Spark. The set of operations Spark adopts when executing a GPSJ query is modeled analytically, based on the cluster and application parameters together with a set of database statistics. Experimental results on three benchmarks and on two clusters of different sizes and with different computation features show that our model estimates the actual execution time with an average error of about 20%. This accuracy is good enough to let the system choose the most effective plan even when the differences in execution time are limited. The error can be reduced to 14% if the analytic model is coupled with our straggler handling strategy.

Matteo Golfarelli, Lorenzo Baldacci (2019). A Cost Model for SPARK SQL. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 31(5), 819-832 [10.1109/TKDE.2018.2850339].

A Cost Model for SPARK SQL

Matteo Golfarelli (first author, Conceptualization); Lorenzo Baldacci (second author, Software)
2019

Files in this item:
File: TKDE_2019_Spark_SQL_Cost_model.pdf
Access: open access
Type: Postprint
License: free open-access license
Size: 813.39 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/658373
Citations
  • PMC: not available
  • Scopus: 23
  • Web of Science (ISI): 17