CRIS Current Research Information System

Large corpora of textual data such as scientific papers, patents, legal documents, reviews, etc., represent precious unstructured knowledge that needs semantic information retrieval engines to be extracted. Current best information retrieval solutions use supervised deep learning approaches, requiring large labelled training sets of queries and corresponding relevant documents, often unavailable, or their preparation is economically infeasible for most organizations. In this work, we present a new self-supervised method to train a neural solution to model and efficiently search large corpora of documents against arbitrary queries without requiring labelled dataset of queries and associated relevant papers. The core points of our self-supervised approach are (i) a method to self-generate the training set of queries and their relevant documents from the corpus itself, without any kind of human supervision, (ii) a deep metric learning approach to model their semantic space of relationships, and (iii) the incorporation of a multi-dimensional index for this neural semantic space over which running queries efficiently. To better stress the performance of the approach, we applied it to a totally unsupervised corpus with complex contents of over half a million Italian legal documents.

Moro, G., Valgimigli, L., Rossi, A., Casadei, C., Montefiori, A. (2022). Self-supervised Information Retrieval Trained from Self-generated Sets of Queries and Relevant Documents. GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND : SPRINGER INTERNATIONAL PUBLISHING AG [10.1007/978-3-031-17849-8_23].

Self-supervised Information Retrieval Trained from Self-generated Sets of Queries and Relevant Documents

Moro, G;Valgimigli, L;Rossi, A;Casadei, C;Montefiori, A

2022

Abstract

Large corpora of textual data such as scientific papers, patents, legal documents, reviews, etc., represent precious unstructured knowledge that needs semantic information retrieval engines to be extracted. Current best information retrieval solutions use supervised deep learning approaches, requiring large labelled training sets of queries and corresponding relevant documents, often unavailable, or their preparation is economically infeasible for most organizations. In this work, we present a new self-supervised method to train a neural solution to model and efficiently search large corpora of documents against arbitrary queries without requiring labelled dataset of queries and associated relevant papers. The core points of our self-supervised approach are (i) a method to self-generate the training set of queries and their relevant documents from the corpus itself, without any kind of human supervision, (ii) a deep metric learning approach to model their semantic space of relationships, and (iii) the incorporation of a multi-dimensional index for this neural semantic space over which running queries efficiently. To better stress the performance of the approach, we applied it to a totally unsupervised corpus with complex contents of over half a million Italian legal documents.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2022
			
	Titolo del volume
	
				Similarity Search and Applications 15th International Conference, SISAP 2022, Bologna, Italy, October 5–7, 2022, Proceedings
			
	Pagina iniziale
	
				283
			
	Pagina finale
	
				290
			
	Collana/Serie
	
				LECTURE NOTES IN COMPUTER SCIENCE
			
	Codice DOI
	
				https://dx.doi.org/10.1007/978-3-031-17849-8_23
			
	Citazione
	
				Moro, G., Valgimigli, L., Rossi, A., Casadei, C., Montefiori, A. (2022). Self-supervised Information Retrieval Trained from Self-generated Sets of Queries and Relevant Documents. GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND : SPRINGER INTERNATIONAL PUBLISHING AG [10.1007/978-3-031-17849-8_23].
			
	Tutti gli autori
	
						Moro, G; Valgimigli, L; Rossi, A; Casadei, C; Montefiori, A

File in questo prodotto:

Eventuali allegati, non sono esposti

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11585/907739

Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni

ND

1

1

social impact