Web Corpora for Bilingual Lexicography: A Pilot Study of English/French Collocation Extraction and Translation / Ferraresi A.; Bernardini S.; Picci G.; Baroni M.. - STAMPA. - (2010), pp. 337-359.

Web Corpora for Bilingual Lexicography: A Pilot Study of English/French Collocation Extraction and Translation

FERRARESI, ADRIANO;BERNARDINI, SILVIA;BARONI, MARCO
2010

Abstract

This paper describes two very large (> 1 billion words) Web-derived “reference” corpora of English and French, called ukWaC and frWaC, and reports on a pilot study in which these resources are applied to a bilingual lexicography task focusing on collocation extraction and translation. The two corpora were assembled through automated procedures, and little is known of their actual contents. The study therefore aimed to provide a mainly qualitative evaluation of the corpora by applying them to a practical task, i.e. ascertaining whether resources built automatically from the Web can be profitably applied to lexicographic work, on a par with more costly and carefully built resources such as the British National Corpus (BNC, for English). The lexicographic task was set up to simulate part of the revision of an English-French bilingual dictionary. Focusing on the English=>French direction only, the study first compared the coverage of ukWaC with that of the widely used BNC in terms of the collocational information available for a sample of English source-language (SL) nodewords. The evidence thus assembled was submitted to a professional lexicographer, who evaluated its relevance. The validated collocational complexes selected for inclusion in the revised version were then translated into French drawing on evidence from frWaC, and the translations were validated by a professional translator (a native speaker of French). The results suggest that the two Web corpora provide relevant and comparable linguistic evidence for lexicographic purposes. The paper is structured as follows: Section 2 sets the framework for the study, reviewing current approaches to the use of the Web for cross-linguistic tasks and describing the Web corpora used and the applications of corpora in lexicographic work. Section 3 presents the objectives of the pilot investigation, the method followed and its results. In Section 4, we draw conclusions and suggest directions for further work.
Published in: Using Corpora in Contrastive and Translation Studies (2010), pp. 337-359
Ferraresi A.; Bernardini S.; Picci G.; Baroni M.
Use this identifier to cite or link to this document: https://hdl.handle.net/11585/83040