Neural language models are the backbone of modern-day natural language processing applications. Their use on textual heritage collections which have undergone Optical Character Recognition (OCR) is therefore also increasing. Nevertheless, our understanding of the impact OCR noise could have on language models is still limited. We perform an assessment of the impact OCR noise has on a variety of language models, using data in Dutch, English, French and German. We find that OCR noise poses a significant obstacle to language modelling, with language models increasingly diverging from their noiseless targets as OCR quality lowers. In the presence of small corpora, simpler models including PPMI and Word2Vec consistently outperform transformer-based models in this respect.

Todorov, K., Colavizza, G. (2022). An Assessment of the Impact of OCR Noise on Language Models. In Proceedings of the 14th International Conference on Agents and Artificial Intelligence (ICAART 2022), Volume 2, pp. 674-683. Setúbal, Portugal: SciTePress. doi:10.5220/0010945100003116

An Assessment of the Impact of OCR Noise on Language Models

Colavizza, G
2022

Published in: Proceedings of the 14th International Conference on Agents and Artificial Intelligence (ICAART 2022) - Volume 2
Pages: 674-683
Todorov, K; Colavizza, G

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/948798