Deep Reference Mining From Scholarly Literature in the Arts and Humanities

Rodrigues Alves, D.; Colavizza, G.; Kaplan, F.

doi:10.3389/frma.2018.00021

We consider the task of reference mining: the detection, extraction and classification of references within the full text of scholarly publications. Reference mining brings forward specific challenges, such as the need to capture the morphology of highly abbreviated words and the dependence among the elements of a reference, both following codified reference styles. This task is particularly difficult, and little explored, with respect to the literature in the arts and humanities, where references are mostly given in footnotes. We apply a deep learning architecture for reference mining from the full text of scholarly publications. We explore and discuss three architectural components: word and character-level word embeddings, different prediction layers (Softmax and Conditional Random Fields) and multi-task versus single-task learning. Our best model uses both pre-trained word embeddings and character embeddings, and a BiLSTM-CRF architecture. We test our solution on a dataset of annotated references from the historiography on Venice and, using a linear-chain CRF classifier as a baseline, we show that this deep learning architecture improves on the baseline by a considerable margin. Furthermore, multi-task learning performs almost on par with a single-task approach. We thus confirm that there are important gains to be had by adopting deep learning for the task of reference mining.

Rodrigues Alves D., Colavizza G., Kaplan F. (2018). Deep Reference Mining From Scholarly Literature in the Arts and Humanities. FRONTIERS IN RESEARCH METRICS AND ANALYTICS, 3, 1-13 [10.3389/frma.2018.00021].



Year: 2018
Journal: FRONTIERS IN RESEARCH METRICS AND ANALYTICS
DOI: https://dx.doi.org/10.3389/frma.2018.00021
Citation: Rodrigues Alves D., Colavizza G., Kaplan F. (2018). Deep Reference Mining From Scholarly Literature in the Arts and Humanities. FRONTIERS IN RESEARCH METRICS AND ANALYTICS, 3, 1-13 [10.3389/frma.2018.00021].
All authors: Rodrigues Alves D.; Colavizza G.; Kaplan F.


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/991263
