Comparable Web corpora for bilingual lexicography: a pilot study of English/French collocation extraction and translation

Ferraresi, Adriano; Bernardini, Silvia; Baroni, Marco; Picci, G.

This paper describes two very large (> 1 billion words) Web-derived “reference” corpora of English and French, called ukWaC and frWaC, and reports on a pilot study in which these resources are applied to a bilingual lexicography task focusing on collocation extraction and translation. The two corpora were assembled through automated procedures, and little is known of their actual contents. The study therefore aimed to provide a mainly qualitative evaluation of the corpora by applying them to a practical task, i.e. ascertaining whether resources built automatically from the Web can be profitably applied to lexicographic work, on a par with more costly and carefully built resources such as the British National Corpus (BNC, for English). The lexicographic task simulated part of the revision of an English-French bilingual dictionary. Focusing on the English=>French direction only, the study first compared the coverage of ukWaC with that of the widely used BNC in terms of collocational information for a sample of English source-language (SL) nodewords. The evidence thus assembled was submitted to a professional lexicographer, who evaluated its relevance. The validated collocational complexes selected for inclusion in the revised version were then translated into French drawing on evidence from frWaC, and the translations were validated by a professional translator (a native speaker of French). The results suggest that the two Web corpora provide relevant and comparable linguistic evidence for lexicographic purposes.

A. Ferraresi, S. Bernardini, M. Baroni, G. Picci (2008). Comparable Web corpora for bilingual lexicography: a pilot study of English/French collocation extraction and translation. Hangzhou: s.n.