This work addresses the issue of cross-language high similarity and near-duplicates search, where, for the given document, a highly similar one is to be identified from a large cross-language collection of documents. We propose a concept-based similarity model for the problem which is very light in computation and memory. We evaluate the model on three corpora of different nature and two language pairs English-German and English-Spanish using the Eurovoc conceptual thesaurus. Our model is compared with two state-of-the-art models and we find, though the proposed model is very generic, it produces competitive results and is significantly stable and consistent across the corpora. © 2012 Springer-Verlag.
Gupta P., Barron-Cedeno A., Rosso P. (2012). Cross-language high similarity search using a conceptual thesaurus [10.1007/978-3-642-33247-0_8].
Cross-language high similarity search using a conceptual thesaurus
Barron-Cedeno A.;
2012
Abstract
This work addresses the issue of cross-language high similarity and near-duplicates search, where, for the given document, a highly similar one is to be identified from a large cross-language collection of documents. We propose a concept-based similarity model for the problem which is very light in computation and memory. We evaluate the model on three corpora of different nature and two language pairs English-German and English-Spanish using the Eurovoc conceptual thesaurus. Our model is compared with two state-of-the-art models and we find, though the proposed model is very generic, it produces competitive results and is significantly stable and consistent across the corpora. © 2012 Springer-Verlag.File | Dimensione | Formato | |
---|---|---|---|
BarronCedeno.pdf
accesso aperto
Tipo:
Postprint
Licenza:
Licenza per Accesso Aperto. Creative Commons Attribuzione (CCBY)
Dimensione
182.89 kB
Formato
Adobe PDF
|
182.89 kB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.