Large corpora of textual data, such as scientific papers, patents, legal documents, and reviews, hold valuable unstructured knowledge that can only be extracted with semantic information retrieval engines. The best current information retrieval solutions use supervised deep learning, which requires large labelled training sets of queries and corresponding relevant documents; such sets are often unavailable, and preparing them is economically infeasible for most organizations. In this work, we present a new self-supervised method for training a neural solution that models large corpora of documents and searches them efficiently against arbitrary queries, without requiring a labelled dataset of queries and associated relevant documents. The core points of our self-supervised approach are (i) a method to self-generate the training set of queries and their relevant documents from the corpus itself, without any human supervision, (ii) a deep metric learning approach to model their semantic space of relationships, and (iii) the incorporation of a multi-dimensional index over this neural semantic space, on which queries can be run efficiently. To stress-test the approach, we applied it to a completely unlabelled corpus of over half a million Italian legal documents with complex contents.
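The abstract does not say how queries are generated or which metric-learning loss is used, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes, purely for illustration, that a query is a sentence sampled from a document, that the source document is its relevant (positive) example, and that a randomly chosen other document is the negative. A triplet margin loss over a toy bag-of-words embedding stands in for the deep metric learning step. All function names (`embed`, `make_triplets`, `triplet_loss`) and the tiny corpus are hypothetical, and the multi-dimensional index of point (iii) is not covered.

```python
import math
import random
from collections import Counter


def embed(text):
    """Toy stand-in for a neural encoder: L2-normalised bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def distance(u, v):
    """Cosine distance between two normalised sparse vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    return 1.0 - dot


def make_triplets(corpus, rng):
    """Self-generate (query, positive, negative) triples from the corpus.

    Assumed heuristic (not from the paper): the query is one sentence of a
    document, that document is the positive, a different document is the negative.
    """
    triplets = []
    for i, doc in enumerate(corpus):
        sentences = [s.strip() for s in doc.split(".") if s.strip()]
        query = rng.choice(sentences)
        j = rng.choice([k for k in range(len(corpus)) if k != i])
        triplets.append((query, doc, corpus[j]))
    return triplets


def triplet_loss(query, positive, negative, margin=0.2):
    """Triplet margin loss: pull the query towards the positive document
    and push it at least `margin` further away from the negative one."""
    q, p, n = embed(query), embed(positive), embed(negative)
    return max(0.0, distance(q, p) - distance(q, n) + margin)


# Hypothetical miniature corpus, used only to exercise the functions.
corpus = [
    "The tenant must pay rent monthly. Late payment incurs interest.",
    "The court rejected the appeal. Costs were awarded to the defendant.",
    "The contract is void if signed under duress. Damages may be claimed.",
]
rng = random.Random(0)
triplets = make_triplets(corpus, rng)
losses = [triplet_loss(q, p, n) for q, p, n in triplets]
```

In a real system the `embed` function would be a trainable neural encoder and the loss would be minimised by gradient descent; here the loss is only evaluated, to show what signal the self-generated triples provide.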
Moro, G., Valgimigli, L., Rossi, A., Casadei, C., Montefiori, A. (2022). Self-supervised Information Retrieval Trained from Self-generated Sets of Queries and Relevant Documents. Cham, Switzerland: Springer International Publishing AG. doi:10.1007/978-3-031-17849-8_23.
Self-supervised Information Retrieval Trained from Self-generated Sets of Queries and Relevant Documents
Moro, G.; Valgimigli, L.
2022