A comparison of approaches for measuring cross-lingual similarity of Wikipedia articles

Barron-Cedeno, A.; Paramita, M. L.; Clough, P.; Rosso, P.

doi:10.1007/978-3-319-06028-6_36

Wikipedia has been used as a source of comparable texts for a range of tasks, such as Statistical Machine Translation and Cross-Language Information Retrieval. Articles written in different languages on the same topic are often connected through inter-language-links. However, the extent to which these articles are similar is highly variable and this may impact on the use of Wikipedia as a comparable resource. In this paper we compare various language-independent methods for measuring cross-lingual similarity: character n-grams, cognateness, word count ratio, and an approach based on outlinks. These approaches are compared against a baseline utilising MT resources. Measures are also compared to human judgements of similarity using a manually created resource containing 700 pairs of Wikipedia articles (in 7 language pairs). Results indicate that a combination of language-independent models (char-n-grams, outlinks and word-count ratio) is highly effective for identifying cross-lingual similarity and performs comparably to language-dependent models (translation and monolingual analysis). © 2014 Springer International Publishing Switzerland.
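As a rough illustration of two of the language-independent measures named in the abstract, the sketch below computes a character n-gram similarity and a word count ratio for a pair of texts. The abstract does not give the exact formulations, so everything here is an assumption made for illustration: the n-gram size of 3, the lowercasing and punctuation stripping, the use of cosine similarity over raw n-gram counts, and the definition of the ratio as shorter length over longer length. The function names are hypothetical and do not come from the paper.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams (assumed preprocessing: lowercase, punctuation -> space)."""
    normalised = re.sub(r"[\W_]+", " ", text.lower()).strip()
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def char_ngram_similarity(text_a, text_b, n=3):
    """Similarity of two texts based on shared character n-grams (assumed: cosine)."""
    return cosine(char_ngrams(text_a, n), char_ngrams(text_b, n))


def word_count_ratio(text_a, text_b):
    """Assumed definition: word count of the shorter text over that of the longer one."""
    len_a, len_b = len(text_a.split()), len(text_b.split())
    if max(len_a, len_b) == 0:
        return 0.0
    return min(len_a, len_b) / max(len_a, len_b)


if __name__ == "__main__":
    english = "Information retrieval is the task of finding relevant documents."
    spanish = "La recuperación de información consiste en encontrar documentos relevantes."
    print(char_ngram_similarity(english, spanish))
    print(word_count_ratio(english, spanish))
```

Both functions return a value between 0 and 1 and need no translation resources, which is the sense in which such measures are language-independent. How the scores are combined with the outlink-based measure is not described in the abstract and is left out of this sketch.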

Barron-Cedeno A., Paramita M.L., Clough P., Rosso P. (2014). A comparison of approaches for measuring cross-lingual similarity of Wikipedia articles. Springer Verlag [10.1007/978-3-319-06028-6_36].



Year: 2014
Volume title: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
First page: 424
Last page: 429
Series: Lecture Notes in Artificial Intelligence
DOI: https://dx.doi.org/10.1007/978-3-319-06028-6_36
Publication type: 4.01 Contribution in conference proceedings


Use this identifier to cite or link to this document: https://hdl.handle.net/11585/709268
