Methods for cross-language plagiarism detection

Barrón-Cedeño, Alberto; Gupta, Parth; Rosso, Paolo

doi:10.1016/j.knosys.2013.06.018

Plagiarism across languages is on the rise for three reasons: (i) speakers of under-resourced languages often consult documentation in a foreign language, (ii) people immersed in a foreign country can still consult material written in their native language, and (iii) people are often interested in writing in a language other than their native one. Most approaches to automatic cross-language plagiarism detection depend on a preliminary translation, which is not always available. In this paper we propose a freely available architecture for plagiarism detection across languages covering the entire process: heuristic retrieval, detailed analysis, and post-processing. On top of this architecture we explore the suitability of three cross-language similarity estimation models: Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Character n-Grams (CL-CNG), and Translation plus Monolingual Analysis (T + MA). The three models differ inherently in nature and in the resources they require. They are tested extensively under the same conditions on the different plagiarism detection sub-tasks, something never done before. The experiments show that T + MA produces the best results, closely followed by CL-ASA. Still, CL-ASA obtains higher precision, an important factor in plagiarism detection when less user intervention is desired. © 2013 Elsevier B.V. All rights reserved.
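To give a rough feel for the simplest of the three models named in the abstract, the following is a minimal sketch of a character n-gram similarity in the spirit of CL-CNG. It is not the authors' implementation: the function names (`char_ngrams`, `cosine`, `cl_cng_similarity`), the choice of n = 3, the normalization steps, and the use of raw counts with cosine similarity are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams after a simple normalization.

    Assumed normalization: lowercase, drop punctuation, collapse whitespace.
    """
    normalized = re.sub(r"[^\w\s]", "", text.lower())
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is too short to yield any n-gram
    return dot / (norm_a * norm_b)


def cl_cng_similarity(x, y, n=3):
    """Hypothetical helper: n-gram cosine similarity between two texts."""
    return cosine(char_ngrams(x, n), char_ngrams(y, n))


if __name__ == "__main__":
    english = "international information"
    spanish = "información internacional"
    unrelated = "gato perro"
    print(cl_cng_similarity(english, spanish))
    print(cl_cng_similarity(english, unrelated))
```

A measure of this kind needs no translation resources, but it can only work when the two languages share a script and enough cognates or common character sequences, as in the English and Spanish strings above.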

Methods for cross-language plagiarism detection / Barrón-Cedeño, Alberto and Gupta, Parth and Rosso, Paolo. - In: KNOWLEDGE-BASED SYSTEMS. - ISSN 0950-7051. - ELECTRONIC. - 50:(2013), pp. 211-217. [10.1016/j.knosys.2013.06.018]



Year: 2013
Journal: KNOWLEDGE-BASED SYSTEMS
DOI: https://dx.doi.org/10.1016/j.knosys.2013.06.018
Appears in document types: 1.01 Journal article


Use this identifier to cite or link to this document: https://hdl.handle.net/11585/707694
