
Ciaramita S., Baroni M. (2006). Measuring Web-corpus randomness: A progress report. Bologna: Gedit.

Measuring Web-corpus randomness: A progress report

BARONI, MARCO
2006

Abstract

The Web allows fast and inexpensive construction of general-purpose corpora, i.e., corpora that are not meant to represent a specific sublanguage, but a language as a whole, and thus should be unbiased with respect to domains and genres. In this paper, we present an automated, quantitative, knowledge-poor method to evaluate the randomness (with respect to a number of non-random partitions) of a Web corpus. The method is based on the comparison of the word frequency distributions of the target corpus to word frequency distributions from corpora built in deliberately biased ways. We first show that the measure of randomness we devised gives the expected results when tested on random samples from the whole British National Corpus (BNC) and from biased subsets of BNC documents. We then apply the method to the task of building a corpus via queries to the Google search engine. We obtain very encouraging results, indicating that our approach can be used reliably to distinguish between biased and unbiased document sets. More specifically, the results indicate that medium-frequency query terms might lead to more random results (and thus to a less biased corpus) than either high-frequency terms or terms selected from the whole frequency spectrum.
Published in: Wacky! Working papers on the Web as Corpus
Pages: 127-158
Authors: Ciaramita S.; Baroni M.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/17104