Word Length n-Grams for Text Re-use Detection

A. Barron-Cedeno; C. Basile; M. Degli Esposti; P. Rosso
2010

Abstract

The automatic detection of shared content in written documents, which includes text reuse and its unacknowledged commission, plagiarism, has become an important problem in Information Retrieval. This task requires an exhaustive comparison of texts in order to determine how similar they are. However, such a comparison is infeasible when the number of documents is too high. Therefore, we have designed a model for the pre-selection of closely related documents, so that the exhaustive comparison can be performed afterwards. We use a similarity measure based on word-level n-grams, which has proved quite effective in many applications. As this approach normally becomes impracticable for real-world large datasets, we propose a method based on a preliminary word-length encoding of the texts, substituting each word with its length. This provides three important advantages: (i) since the alphabet of the documents is reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times are decreased; and (iii) length n-grams can be represented in a trie, allowing a more flexible and faster comparison. We show experimentally, using the perplexity measure, that the noise introduced by the length encoding does not significantly reduce the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.
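The core idea lends itself to a compact illustration. The Python sketch below is a minimal reading of the abstract, not the authors' reference implementation: it encodes a text as word lengths capped at nine (giving the nine-symbol alphabet), extracts length n-grams, stores the source document's n-grams in a trie, and scores a suspicious document by the fraction of its n-grams found in that trie. The function names, the cap on word lengths, the value of n, and the containment-style score are all illustrative assumptions.

import re

def length_encode(text, max_len=9):
    """Replace each word by its length, clipped to max_len (nine-symbol alphabet).
    How the paper handles words longer than nine characters is an assumption here."""
    return [min(len(w), max_len) for w in re.findall(r"\w+", text.lower())]

def length_ngrams(codes, n=8):
    """All contiguous length n-grams of the encoded document."""
    return [tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)]

class Trie:
    """Dict-based trie; each root-to-depth-n path represents one length n-gram."""
    def __init__(self):
        self.root = {}

    def insert(self, ngram):
        node = self.root
        for symbol in ngram:
            node = node.setdefault(symbol, {})

    def contains(self, ngram):
        node = self.root
        for symbol in ngram:
            if symbol not in node:
                return False
            node = node[symbol]
        return True

def containment(suspicious, source, n=8):
    """Fraction of the suspicious document's length n-grams that occur in the source."""
    trie = Trie()
    for g in length_ngrams(length_encode(source), n):
        trie.insert(g)
    sus = length_ngrams(length_encode(suspicious), n)
    if not sus:
        return 0.0
    return sum(trie.contains(g) for g in sus) / len(sus)

if __name__ == "__main__":
    src = "the automatic detection of shared content in written documents has become important"
    sus = "automatic detection of shared content in written documents has become an important problem"
    print(round(containment(sus, src, n=5), 3))

Note that with a nine-symbol alphabet every trie node has at most nine children, which is what keeps both the storage of n-gram lists and the lookups cheap compared with n-grams over a full word vocabulary.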
Year: 2010
Published in: Computational Linguistics and Intelligent Text Processing
Pages: 687–699
Citation: A. Barron-Cedeno, C. Basile, M. Degli Esposti, P. Rosso (2010). Word Length n-Grams for Text Re-use Detection. Heidelberg: Springer [10.1007/978-3-642-12116-6_58].

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/84409

Citations
  • PMC: not available
  • Scopus: 26
  • Web of Science: 20