This article introduces ukWaC, deWaC and itWaC, three very large corpora of English, German, and Italian built by web crawling, and describes the methodology and tools used in their construction. The corpora contain more than a billion words each, and are thus among the largest resources for the respective languages. The paper also provides an evaluation of their suitability for linguistic research, focusing on ukWaC and itWaC. A comparison in terms of lexical coverage with existing resources for the languages of interest produces encouraging results. Qualitative evaluation of ukWaC vs. the British National Corpus was also conducted, so as to highlight differences in corpus composition (text types and subject matters). The article concludes with practical information about format and availability of corpora and tools.

Baroni M., Bernardini S., Ferraresi A., Zanchetta E. (2009). The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation, 43(3), 209-226.

The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora

Baroni, Marco; Bernardini, Silvia; Ferraresi, Adriano; Zanchetta, Eros
2009


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/83043
Note: the data shown here have not been validated by the university.
