Abstract: In this paper we introduce ukWaC, a large corpus of English constructed by crawling the .uk Internet domain. The corpus contains more than 2 billion tokens and is one of the largest freely available linguistic resources for English. The paper describes the tools and methodology used in the construction of the corpus and provides a qualitative evaluation of its contents, carried out through a vocabulary-based comparison with the BNC. We conclude by giving practical information about the availability and format of the corpus.

Introducing and evaluating ukWaC, a very large Web-derived corpus of English / A. Ferraresi; E. Zanchetta; M. Baroni; S. Bernardini. - Electronic. - (2008), pp. 47-54. (Paper presented at the Web as Corpus (WAC-4) Workshop at LREC 2008, held in Marrakech, Morocco, on 1 May 2008).

Introducing and evaluating ukWaC, a very large Web-derived corpus of English

Ferraresi, Adriano; Zanchetta, Eros; Baroni, Marco; Bernardini, Silvia
2008

Proceedings of the 4th Web as Corpus (WAC-4) "Can we beat Google?", pp. 47-54

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/64955