Corpus Linguistics, Translation and Interpreting

Bernardini, Silvia; Russo, Mariachiara

During the first two decades of the 21st century, the corpus methodology established itself as one of the major paradigms in linguistics. Its fundamental assumption is that language should be studied by looking at genuine text samples stored electronically, rather than by relying on introspection and decontextualised, artificial examples. This view of linguistics as the study of language performance (or E-language, to use Chomsky’s (1986) term) rather than language competence (or I-language) is compatible with a product-oriented approach to the study of translation and interpreting. In this approach, the focus of attention is on the products delivered by translators and interpreters, rather than on their mental processes. While the latter can be studied through questionnaires, interviews, think-aloud protocols, key-logging, eye-tracking, and so forth, corpus-based translation and interpreting studies (hereinafter CTS and CIS) draw the bulk of their evidence from translated and interpreted texts assembled in corpora. A corpus is a collection of texts, including transcriptions of spoken discourse, selected according to pre-defined criteria to be representative of a language variety, and stored in electronic format for consultation through a corpus query tool. In its simplest form, a corpus can consist of a few dozen text files stored in a local folder and searched through a stand-alone concordancer such as AntConc (Anthony 2014) or WordSmith Tools (Scott 2016). However, corpora can also be very large and enriched with contextual metadata (about authors, publication details, intended audience, etc.) and structural and/or linguistic information (about textual subdivisions, graphical emphasis, pauses, hesitations, parts-of-speech, lemmas, etc.).
The former, sometimes referred to as “DIY” or “disposable” corpora, are often constructed by single users (students, language professionals, linguists) for a specific task, while the latter, requiring both linguistic and computational expertise and substantial effort, are constructed by teams of corpus linguists and made available to the research community through client/server systems (see Baroni and Bernardini 2013 for further details on corpus preparation and corpus query systems). Due to the nature of the object of study, positioned at the boundaries of two or more linguacultures, corpora for translation and interpreting research tend to be more complex than those used in other corpus linguistics (CL) fields, such as discourse studies or (monolingual) lexicography. Two main corpus typologies are used in CTS/CIS. The first, monolingual comparable corpora, include a minimum of two subcorpora, i.e., two collections of texts (“text” here subsumes oral language transcripts) in the same language, similar in all respects but for the existence vs. absence of a constraining source text (henceforth ST). The second, (bilingual) parallel corpora, include (transcripts of) STs and corresponding target texts (henceforth TTs) in one or more languages or by one or more translators/interpreters, aligned to each other, usually at the sentence level. Alongside ST–TT alignment in parallel corpora, interpreting corpora and corpora used in audiovisual translation or sign-language research may also include text-to-sound/video alignment, in which case they may be referred to as multimodal corpora. These should not be confused with intermodal corpora, containing interpreted and translated language and/or samples from different interpreting modalities (see further “Current debates and future directions in CL, CTS AND CIS” below).
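For readers unfamiliar with how such resources are consulted, the following minimal Python sketch illustrates two ideas from the description above: a keyword-in-context (KWIC) search of the kind a concordancer performs, and a sentence-aligned parallel text carrying contextual metadata. It is an illustration under stated assumptions, not a description of how AntConc, WordSmith Tools or any existing corpus is implemented. The names `AlignedPair`, `ParallelText`, `kwic` and `aligned_hits`, the metadata keys, and the English–Italian sentences are all hypothetical and invented for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AlignedPair:
    """One source-text (ST) sentence aligned with one target-text (TT) sentence."""
    source: str
    target: str


@dataclass
class ParallelText:
    """An ST-TT pair with illustrative (hypothetical) contextual metadata."""
    metadata: dict
    pairs: list = field(default_factory=list)


def _keyword_pattern(keyword):
    # Whole-word, case-insensitive match; a real query tool offers far more.
    return re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)


def kwic(sentences, keyword, width=30):
    """Return (left context, match, right context) for every hit."""
    pattern = _keyword_pattern(keyword)
    lines = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            left = sentence[max(0, match.start() - width):match.start()]
            right = sentence[match.end():match.end() + width]
            lines.append((left, match.group(0), right))
    return lines


def aligned_hits(text, keyword):
    """Return the aligned pairs whose ST sentence contains the keyword."""
    pattern = _keyword_pattern(keyword)
    return [pair for pair in text.pairs if pattern.search(pair.source)]


# Invented toy data, for illustration only.
toy = ParallelText(
    metadata={"source_language": "en", "target_language": "it", "mode": "written"},
    pairs=[
        AlignedPair("The committee approved the report.",
                    "La commissione ha approvato la relazione."),
        AlignedPair("The report was published in May.",
                    "La relazione fu pubblicata a maggio."),
        AlignedPair("Members thanked the rapporteur.",
                    "I membri hanno ringraziato il relatore."),
    ],
)

if __name__ == "__main__":
    for left, hit, right in kwic([p.source for p in toy.pairs], "report"):
        print(f"{left:>30} [{hit}] {right}")
    for pair in aligned_hits(toy, "report"):
        print(pair.source, "<->", pair.target)
```

Sentence-level alignment is what allows a search on the ST side to retrieve the corresponding TT segments. A monolingual comparable corpus could be sketched along the same lines as two collections of sentences searched separately with `kwic` and then compared.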