BootCaT v. 0.8 (Simple utilities to Bootstrap Corpora and Terms from the Web)

Zanchetta, Eros; Bernardini, Silvia; Ferraresi, Adriano; Lecci, Claudia; Dalan, Erika
2016

Abstract

The BootCaT front-end is a graphical interface for the BootCaT toolkit (Baroni and Bernardini 2004). It automates the process of finding reference texts on the web and collating them into a single corpus. The pipeline allows varying levels of control. In the first step, users provide a list of single- or multi-word terms to be used as seeds for text collection. These are then combined into “tuples” of varying length and sent as queries to a search engine, which returns a list of potentially relevant URLs. At this point the user has the option of inspecting the URLs and discarding unwanted ones; the web pages are then retrieved, converted to plain text and saved as ".txt" files. The corpus can thus be interrogated using most concordancers. Using BootCaT one can build a relatively large quick-and-dirty corpus (typically of about 80 texts, with default parameters and no manual quality checks) in less than half an hour. This flexible approach makes BootCaT a very useful tool for translators and translation students; it has been used in the translation and terminology classroom to build small DIY corpora of varying size and specialization. As of June 2017, the software has been downloaded and installed by over 2,800 individual users from 74 countries.
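
The seed-to-query step described in the abstract can be illustrated with a minimal sketch. This is not BootCaT's own code: the function names `build_tuples` and `to_query`, the parameters `tuple_length`, `n_tuples` and `rng_seed`, the quoting of multi-word terms and the example seed list are all assumptions made for illustration. The sketch only shows how a list of seeds might be combined into random tuples and turned into search-engine query strings.

```python
import itertools
import random


def build_tuples(seeds, tuple_length=3, n_tuples=10, rng_seed=0):
    """Return up to n_tuples distinct combinations of tuple_length seeds.

    Hypothetical helper, not part of BootCaT.
    """
    unique_seeds = sorted(set(seeds))
    if tuple_length > len(unique_seeds):
        raise ValueError("tuple_length exceeds the number of distinct seeds")
    # Enumerate every unordered combination, then draw a random subset.
    all_combinations = list(itertools.combinations(unique_seeds, tuple_length))
    rng = random.Random(rng_seed)
    return rng.sample(all_combinations, min(n_tuples, len(all_combinations)))


def to_query(seed_tuple):
    """Join one tuple into a query string, quoting multi-word terms (an assumption)."""
    return " ".join(f'"{term}"' if " " in term else term for term in seed_tuple)


# Illustrative seeds only; real seeds depend on the domain of the corpus.
seeds = ["corpus", "concordancer", "translation memory", "terminology", "web crawling"]
queries = [to_query(t) for t in build_tuples(seeds, tuple_length=3, n_tuples=4)]
for query in queries:
    print(query)
```

Each printed line is one query that could be submitted to a search engine; the returned URLs would then be inspected, downloaded and converted to plain text as described above.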
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/600367