Abstract: In this paper we introduce ukWaC, a large corpus of English constructed by crawling the .uk Internet domain. The corpus contains more than 2 billion tokens and is one of the largest freely available linguistic resources for English. The paper describes the tools and methodology used in the construction of the corpus and provides a qualitative evaluation of its contents, carried out through a vocabulary-based comparison with the British National Corpus (BNC). We conclude by giving practical information about the availability and format of the corpus.
Introducing and evaluating ukWaC, a very large Web-derived corpus of English / A. Ferraresi; E. Zanchetta; M. Baroni; S. Bernardini. - Electronic. - (2008), pp. 47-54. (Paper presented at the Web as Corpus (WAC-4) Workshop at LREC 2008, held in Marrakech, Morocco, on 1 May 2008).
Introducing and evaluating ukWaC, a very large Web-derived corpus of English
FERRARESI, ADRIANO;ZANCHETTA, EROS;BARONI, MARCO;BERNARDINI, SILVIA
2008