CRIS Current Research Information System

The existence of huge volumes of documents written in multiple languages on Internet leads to investigate novel algorithmic approaches to deal with information of this kind. However, most crosslingual natural language processing (NLP) tasks consider a decoupled approach in which monolingual NLP techniques are applied along with an independent translation process. This two-step approach is too sensitive to translation errors, and in general to the accumulative effect of errors. To solve this problem, we propose to use a direct probabilistic crosslingual NLP system which integrates both steps, translation and the specific NLP task, into a single one. In order to perform this integrated approach to crosslingual tasks, we propose to use the statistical IBM 1 word alignment model (M1). The M1 model may show a non-monotonic behaviour when aligning words from a sentence in a source language to words from another sentence in a different, target language. This is the case of languages with different word order. In English, for instance, adjectives appear before nouns, whereas in Spanish it is exactly the opposite. The successful experimental results reported in three different tasks - text classification, information retrieval and plagiarism analysis - highlight the benefits of the statistical integrated approach proposed in this work. extcopyright 2009 Elsevier Inc. All rights reserved.

A statistical approach to crosslingual natural language tasks

Pinto, David;Civera, Jorge;Barrón-Cedeño, Alberto;Juan, Alfons;Rosso, Paolo

2009

Abstract

The existence of huge volumes of documents written in multiple languages on Internet leads to investigate novel algorithmic approaches to deal with information of this kind. However, most crosslingual natural language processing (NLP) tasks consider a decoupled approach in which monolingual NLP techniques are applied along with an independent translation process. This two-step approach is too sensitive to translation errors, and in general to the accumulative effect of errors. To solve this problem, we propose to use a direct probabilistic crosslingual NLP system which integrates both steps, translation and the specific NLP task, into a single one. In order to perform this integrated approach to crosslingual tasks, we propose to use the statistical IBM 1 word alignment model (M1). The M1 model may show a non-monotonic behaviour when aligning words from a sentence in a source language to words from another sentence in a different, target language. This is the case of languages with different word order. In English, for instance, adjectives appear before nouns, whereas in Spanish it is exactly the opposite. The successful experimental results reported in three different tasks - text classification, information retrieval and plagiarism analysis - highlight the benefits of the statistical integrated approach proposed in this work. extcopyright 2009 Elsevier Inc. All rights reserved.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
			2009
		
	Rivista
	
			JOURNAL OF ALGORITHMS
		
	Codice DOI
	
			https://dx.doi.org/10.1016/j.jalgor.2009.02.005
		
	Tutti gli autori
	
			Pinto, David and Civera, Jorge and Barrón-Cedeño, Alberto and Juan, Alfons and Rosso, Paolo
		
	Appare nelle tipologie:
	
			1.01 Articolo in rivista

File in questo prodotto:

Eventuali allegati, non sono esposti

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11585/707803

Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni

ND

44

23

social impact