Spotting translationese: A corpus-driven approach using support vector machines

Bernardini, Silvia; Baroni, Marco
2005

Abstract

In this paper we describe an approach to the identification of "translationese" based on monolingual comparable corpora and machine learning techniques for text categorization. Our experiments are based on a corpus of Italian articles in the geopolitical domain. The articles were in part written in Italian (~2M tokens) and in part translated into Italian from various languages (~900K tokens). As we will argue, this corpus is particularly well suited to the study of translationese, since the original and translated articles are extremely similar in all other respects. We use support vector machines (SVMs) as our classifier of choice, and we explore a number of different ways to represent a document as a feature vector by varying both the size (unigrams, bigrams and trigrams) and the type (wordform, lemma, POS, mixed) of the units encoded as features. In a series of 16-fold cross-validation experiments, we find that an ensemble of SVMs combined using a novel recall maximization scheme achieves 86.7% accuracy with 89.3% precision and 83.3% recall. A preliminary analysis of the features used by the SVMs suggests that the distribution of function words (in particular personal pronouns and adverbs) and of morphosyntactic categories is among the most important cues for the discrimination task. A follow-up experiment shows that the performance attained by the SVMs is well above the average performance of ten human subjects, including five professional translators, on the same task. Our results offer solid evidence supporting the translationese hypothesis, and our method seems to have interesting applications, in particular for translation studies, quantitative style analysis, and machine learning/text categorization. In more general terms, this study exemplifies a robust method of corpus comparison, whose applicability to corpus linguistics is still under-explored but, we believe, very promising.
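
The feature representation summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The token dictionaries, their keys (`wordform`, `lemma`, `pos`), the tag names, and all function names are hypothetical. The "mixed" unit type is left out because the abstract does not define it. The abstract also does not spell out the recall maximization scheme, so `combine_any_vote` is only an assumed reading: a document counts as translated if at least one ensemble member says so.

```python
from collections import Counter


def ngram_features(tokens, n=1, unit="wordform"):
    """Count n-grams over one annotation layer of a tokenized document.

    tokens: list of dicts with the hypothetical keys "wordform", "lemma", "pos".
    n:      unit size (1 = unigrams, 2 = bigrams, 3 = trigrams).
    unit:   which layer to read the symbols from.
    """
    symbols = [tok[unit] for tok in tokens]
    # Zip n shifted copies of the symbol list to get consecutive n-grams.
    grams = zip(*(symbols[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def combine_any_vote(member_predictions):
    """ASSUMPTION: label a document as translated (True) if any member does.

    This is one possible recall-oriented combination rule, not necessarily
    the scheme used in the paper.
    """
    return any(member_predictions)


def precision_recall_accuracy(gold, predicted):
    """Compute the three scores, treating True ("translated") as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    tn = sum(1 for g, p in pairs if not g and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


if __name__ == "__main__":
    # Toy document with made-up annotations, for illustration only.
    doc = [
        {"wordform": "Noi", "lemma": "noi", "pos": "PRO"},
        {"wordform": "andiamo", "lemma": "andare", "pos": "VER"},
        {"wordform": "spesso", "lemma": "spesso", "pos": "ADV"},
    ]
    print(ngram_features(doc, n=2, unit="pos"))
    print(combine_any_vote([False, True, False]))
```

In a full pipeline, each combination of size and unit type would yield one count vector per document, and each vector type would train its own SVM. The any-vote rule trades some precision for recall, which is why it is shown here as one plausible reading of "recall maximization".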
Proceedings of Corpus Linguistics Conference Series 2005, pp. 1-12
Use this identifier to cite or link to this document: https://hdl.handle.net/11585/4868