A new approach to the study of translationese: Machine-learning the difference between original and translated text

Baroni, Marco; Bernardini, Silvia

In this paper we describe an approach to the identification of "translationese" based on monolingual comparable corpora and machine learning techniques for text categorization. The paper reports on experiments in which support vector machines (SVMs) are employed to recognize translated text in a corpus of Italian articles from the geopolitical domain. An ensemble of SVMs reaches 86.7% accuracy with 89.3% precision and 83.3% recall on this task. A preliminary analysis of the features used by the SVMs suggests that the distribution of function words and morphosyntactic categories in general, and of personal pronouns and adverbs in particular, are among the cues the SVMs rely on to perform the discrimination task. A follow-up experiment shows that the performance attained by the SVMs is well above the average performance of 10 human subjects, including 5 professional translators, on the same task. Our results offer solid evidence supporting the translationese hypothesis, and our method seems to have promising applications in translation studies and, more generally, in quantitative style analysis. Implications for the machine learning/text categorization community are equally important, because this is a novel application, and especially because we provide explicit evidence that a relatively knowledge-poor machine learning algorithm can outperform human beings in a text classification task.
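
As a reading aid for the three figures reported above, the following minimal sketch shows how accuracy, precision and recall are derived from a binary confusion matrix in which "translated" is the positive class. The class name `Confusion` and the counts in `example` are hypothetical: the abstract does not state the size of the test set, and the counts were chosen only because they happen to reproduce the reported percentages after rounding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion matrix with 'translated' as the positive class."""
    tp: int  # translated texts labelled as translated
    fp: int  # original texts wrongly labelled as translated
    fn: int  # translated texts wrongly labelled as original
    tn: int  # original texts labelled as original

    def accuracy(self) -> float:
        # Share of all texts that received the correct label.
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    def precision(self) -> float:
        # Share of texts labelled "translated" that really are translations.
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        # Share of actual translations that the classifier found.
        return self.tp / (self.tp + self.fn)


# ASSUMPTION: illustrative counts only, not data from the paper.
example = Confusion(tp=25, fp=3, fn=5, tn=27)

print(f"accuracy  = {example.accuracy():.1%}")
print(f"precision = {example.precision():.1%}")
print(f"recall    = {example.recall():.1%}")
```

Precision above recall, as in the reported results, means that the classifier errs on the side of caution: it rarely labels an original text as translated, at the cost of missing some translations.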

Baroni, M., Bernardini, S. (2006). A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3), 259-274.