Transfer learning for historical corpora: An assessment on post-OCR correction and named entity recognition / Todorov Konstantin; Colavizza Giovanni. - ELECTRONIC. - 2723:(2020), pp. 310-339. (Paper presented at the 1st Workshop on Computational Humanities Research, CHR 2020, held in the Netherlands, 18-20 November 2020).

Transfer learning for historical corpora: An assessment on post-OCR correction and named entity recognition

Todorov Konstantin; Colavizza Giovanni
2020

Abstract

Transfer learning in Natural Language Processing, mainly in the form of pre-trained language models, has recently delivered substantial gains across a range of tasks. Scholars and practitioners working with OCRed historical corpora are thus increasingly exploring the use of pre-trained language models. Nevertheless, the specific challenges posed by historical documents, including OCR quality and linguistic change, call for a critical assessment of pre-trained language models in this setting. We consider two shared tasks, ICDAR2019 (post-OCR correction) and CLEF-HIPE-2020 (Named Entity Recognition, NER), and systematically assess the use of pre-trained language models with data in French, German and English. We find that pre-trained language models help with NER but less so with post-OCR correction. Pre-trained language models should therefore be used critically when working with OCRed historical corpora. We release our code base to allow others to replicate our results and test further pre-trained representations.
CEUR Workshop Proceedings, vol. 2723, pp. 310-339
Files in this item:

Todorov and Colavizza - Transfer Learning for Historical Corpora An Asses.pdf

Open access

Description: Contribution
Type: Publisher's version (PDF)
License: Open Access License. Creative Commons Attribution (CC BY)
Size: 797.62 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/11585/948743
Citations
  • Scopus: 6