Elote, Choclo and Mazorca: on the Varieties of Spanish

Cristina España-Bonet (first author); Alberto Barrón-Cedeño (second author)

2024

Abstract

Spanish is one of the most widespread languages: it is the official language in 20 countries and the second most-spoken native language. Its contact with other languages across different regions, together with rich regional and cultural diversity, has produced varieties that diverge from each other, particularly in terms of lexicon. Still, available corpora, and the models trained on them, generally treat Spanish as one monolithic language, which dampens prediction and generation power when dealing with different varieties. To alleviate the situation, we compile and curate datasets in the different varieties of Spanish around the world at an unprecedented scale and create the CEREAL corpus. With this resource at hand, we perform a stylistic analysis to identify and characterise varietal differences. We implement a classifier specifically designed to handle long documents and identify Spanish varieties (and thereby expand CEREAL further). We produce variety-specific embeddings and analyse the cultural differences that they encode. We make data, code and models publicly available.

Year	2024
Volume title	Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
First page	3689
Last page	3711
DOI	https://dx.doi.org/10.18653/v1/2024.naacl-long.204
Citation	España-Bonet, C., Barrón-Cedeño, A. (2024). Elote, Choclo and Mazorca: on the Varieties of Spanish. Kerrville: Association for Computational Linguistics [10.18653/v1/2024.naacl-long.204].
All authors	España-Bonet, Cristina; Barrón-Cedeño, Alberto
Appears in types	4.01 Contribution in conference proceedings

Files in this item:

File	Access	Type	License	Size	Format
2024.naacl-long.204.pdf	Open access	Publisher's version (PDF) / Version of Record	Open access license: Creative Commons Attribution (CC BY)	3.18 MB	Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/972875
