A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks — sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning — using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find a consistent impact resulting from OCR errors on our downstream tasks with some tasks more irredeemably harmed by OCR errors. Based on these results, we offer some preliminary guidelines for working with text produced through OCR.

van Strien D., Beelen K., Ardanuy M.C., Hosseini K., McGillivray B., Colavizza G. (2020). Assessing the impact of OCR quality on downstream NLP tasks. SciTePress.

Assessing the impact of OCR quality on downstream NLP tasks

Colavizza G.
2020

Abstract

A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks — sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning — using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find a consistent impact resulting from OCR errors on our downstream tasks with some tasks more irredeemably harmed by OCR errors. Based on these results, we offer some preliminary guidelines for working with text produced through OCR.
2020
ICAART 2020 - Proceedings of the 12th International Conference on Agents and Artificial Intelligence
484
496
van Strien D., Beelen K., Ardanuy M.C., Hosseini K., McGillivray B., Colavizza G. (2020). Assessing the impact of OCR quality on downstream NLP tasks. SciTePress.
van Strien D.; Beelen K.; Ardanuy M.C.; Hosseini K.; McGillivray B.; Colavizza G.
File in questo prodotto:
Eventuali allegati, non sono esposti

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11585/948704
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 97
  • ???jsp.display-item.citation.isi??? 72
social impact