This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It runs text mining queries across large datasets, such as historical newspapers and books, in parallel via Apache Spark. It handles queries against collections that comprise several XML schemas and physical representations. The tool has been successfully evaluated using five different large-scale historical text datasets and two HPC environments, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface and in a consistent way, without any HPC environment-specific requirements.

Filgueira R., Coll Ardanuy M., Colavizza G., Hetherington J., Terras M., Jackson M., et al. (2019). Defoe: A Spark-based toolbox for analysing digital historical textual data. Institute of Electrical and Electronics Engineers Inc. [10.1109/eScience.2019.00033].

Defoe: A Spark-based toolbox for analysing digital historical textual data

Colavizza G.;
2019

Proceedings - IEEE 15th International Conference on eScience, eScience 2019, pp. 235-242
Filgueira R.; Coll Ardanuy M.; Colavizza G.; Hetherington J.; Terras M.; Jackson M.; Roubickova A.; Krause A.; Ahnert R.; Hauswedell T.; Nyhan J.; Bea...

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/948741