M. Russo, C. Bendazzoli, A. Sandrelli, C. Monti (2011). European Parliament Interpreting Corpus (EPIC).
European Parliament Interpreting Corpus (EPIC)
Russo, Mariachiara; Bendazzoli, Claudio; Sandrelli, Annalisa; Monti, Cristina
2011
Abstract
The EPIC corpus is the first parallel corpus of European Parliament speeches and their corresponding simultaneous interpretations. It includes source speeches in Italian, English and Spanish, and interpreted speeches in all possible combinations and directions (from English into Italian and Spanish; from Italian into English and Spanish; and from Spanish into Italian and English). It contains a total of 357 speeches (177,295 words).

The corpus includes video clips of each source-language speaker, audio clips of the corresponding interpreted target speeches, and transcripts of all the clips. The corpus has been orthographically transcribed. Annotation covers paralinguistic features (such as truncated and mispronounced words) and metadata (a header at the beginning of each transcript and information about the speaker and the speech). The transcripts are POS (part-of-speech) tagged and lemmatised. Non-tagged transcripts in text format are also available.

Size of the nine subcorpora in the EPIC corpus:

sub-corpus / number of speeches / total word count / % of EPIC
ORG-EN (source) / 81 / 42,705 / 25
INT-EN-IT (interpretation) / 81 / 35,765 / 20
INT-EN-ES (interpretation) / 81 / 38,066 / 21
ORG-IT (source) / 17 / 6,765 / 4
INT-IT-EN (interpretation) / 17 / 6,708 / 4
INT-IT-ES (interpretation) / 17 / 7,052 / 4
ORG-ES (source) / 21 / 14,406 / 8
INT-ES-IT (interpretation) / 21 / 12,833 / 7
INT-ES-EN (interpretation) / 21 / 12,995 / 7
TOTAL / 357 / 177,295 / 100

The EPIC corpus was developed by a multidisciplinary research group based at the Department of Interdisciplinary Studies in Translation, Languages and Cultures (University of Bologna at Forlì), involving interpreting scholars, corpus linguists and IT technicians: Mariachiara Russo (coordinator), Claudio Bendazzoli, Cristina Monti, Annalisa Sandrelli, Marco Baroni, Silvia Bernardini, Gabriele Mack, Lorenzo Piccioni, Eros Zanchetta, Elio Ballardini, Peter Mead.
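The subcorpus figures above can be cross-checked with a short script. The following is a minimal sketch, not part of the EPIC distribution: the dictionary `SUBCORPORA` and the helpers `totals` and `shares` are hypothetical names introduced here, and the numbers are copied by hand from the size table. It recomputes the speech and word totals and each subcorpus's share of the whole. Note that plain rounding yields 24 for ORG-EN, whereas the table lists 25, presumably so that the percentage column sums to 100.

```python
# Subcorpus sizes copied from the size table: name -> (speeches, words).
# Names and structure are illustrative assumptions, not an EPIC file format.
SUBCORPORA = {
    "ORG-EN": (81, 42705),
    "INT-EN-IT": (81, 35765),
    "INT-EN-ES": (81, 38066),
    "ORG-IT": (17, 6765),
    "INT-IT-EN": (17, 6708),
    "INT-IT-ES": (17, 7052),
    "ORG-ES": (21, 14406),
    "INT-ES-IT": (21, 12833),
    "INT-ES-EN": (21, 12995),
}


def totals(subcorpora):
    """Return (total speeches, total words) across all subcorpora."""
    speeches = sum(s for s, _ in subcorpora.values())
    words = sum(w for _, w in subcorpora.values())
    return speeches, words


def shares(subcorpora):
    """Return each subcorpus's share of the total word count, as a rounded percentage."""
    _, total_words = totals(subcorpora)
    return {
        name: round(100 * words / total_words)
        for name, (_, words) in subcorpora.items()
    }


if __name__ == "__main__":
    total_speeches, total_words = totals(SUBCORPORA)
    print(f"TOTAL / {total_speeches} / {total_words:,}")
    for name, pct in shares(SUBCORPORA).items():
        print(f"{name} / {pct}")
```

The speech and word totals reproduce the TOTAL row of the table (357 speeches, 177,295 words).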
Applications
Existing applications: speech recognition; automatic speech recognition; automatic person recognition

Technical Information
Distribution medium: DVD

Contents
Speech corpus
Language(s): English > Italian; Italian > English; Spanish, Castilian > English; English > Spanish, Castilian; Spanish, Castilian > Italian; Italian > Spanish, Castilian
Quantisation: 8-bit
Clipping rate percentage: 32 kHz
Source channel: Microphone
Sound type annotation: mispronunciation; truncation
Transcription entries: Orthographic
Annotation coverage: Full
Annotation level: Orthographic
Annotation language: XML

Video
Number of languages: Parallel

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.