Abstract: In this paper we introduce ukWaC, a large corpus of English constructed by crawling the .uk Internet domain. The corpus contains more than 2 billion tokens and is one of the largest freely available linguistic resources for English. The paper describes the tools and methodology used to construct the corpus and provides a qualitative evaluation of its contents, carried out through a vocabulary-based comparison with the BNC. We conclude by giving practical information about the availability and format of the corpus.
A. Ferraresi, E. Zanchetta, M. Baroni, S. Bernardini (2008). Introducing and evaluating ukWaC, a very large Web-derived corpus of English. Marrakech: s.n.
Introducing and evaluating ukWaC, a very large Web-derived corpus of English
FERRARESI, ADRIANO;ZANCHETTA, EROS;BARONI, MARCO;BERNARDINI, SILVIA
2008