Measuring Web-corpus randomness: A progress report

Ciaramita, S.; Baroni, Marco

The Web allows fast and inexpensive construction of general-purpose corpora, i.e., corpora that are not meant to represent a specific sublanguage but a language as a whole, and that should therefore be unbiased with respect to domains and genres. In this paper, we present an automated, quantitative, knowledge-poor method to evaluate the randomness (with respect to a number of non-random partitions) of a Web corpus. The method compares the word frequency distribution of the target corpus with word frequency distributions from corpora built in deliberately biased ways. We first show that the measure of randomness we devised gives the expected results when tested on random samples from the whole British National Corpus (BNC) and on biased subsets of BNC documents. We then apply the method to the task of building a corpus via queries to the Google search engine. The results are very encouraging: they indicate that our approach can reliably distinguish between biased and unbiased document sets. More specifically, they suggest that medium-frequency query terms might lead to more random results (and thus to a less biased corpus) than either high-frequency terms or terms selected from the whole frequency spectrum.
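As a rough illustration of the comparison described in the abstract, the following Python sketch computes a distance between the word frequency distribution of a target corpus and those of deliberately biased corpora. The abstract does not say which distance measure, tokenization, or smoothing is used, so everything here is an assumption made for illustration: the Jensen-Shannon divergence stands in for the paper's measure, tokens are lowercased whitespace-separated strings, and the function names (`word_distribution`, `js_divergence`, `distances_to_biased`) are hypothetical.

```python
from collections import Counter
from math import log2


def word_distribution(texts):
    """Relative word frequencies over lowercased, whitespace-split texts.

    Assumes the texts contain at least one token in total.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    Chosen here only as a convenient symmetric, bounded distance;
    it is an assumption, not necessarily the measure used in the paper.
    """
    divergence = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = (pw + qw) / 2  # mixture probability of this word
        if pw > 0:
            divergence += 0.5 * pw * log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * log2(qw / mw)
    return divergence


def distances_to_biased(target_texts, biased_corpora):
    """Distance from the target corpus to each deliberately biased corpus.

    `biased_corpora` maps a label (e.g. a domain name) to a list of texts.
    """
    target = word_distribution(target_texts)
    return {
        label: js_divergence(target, word_distribution(texts))
        for label, texts in biased_corpora.items()
    }
```

On real data, `target_texts` would hold the documents of the Web corpus under evaluation and `biased_corpora` the document sets collected in biased ways. How the resulting distances are turned into a single randomness score is left open here, since the abstract does not specify it.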

Ciaramita S., Baroni M. (2006). Measuring Web-corpus randomness: A progress report. Bologna: Gedit.