This paper presents the Serbian datasets developed within the project Advancing Novel Textual Similarity-based Solutions in Software Development – AVANTES, intended for the study of Cross-Level Semantic Similarity (CLSS). CLSS measures the level of semantic overlap between texts of different lengths, and it also refers to the problem of establishing such a measure automatically. The problem was first formulated about a decade ago, but research on it has been sparse and limited to English. The AVANTES project aims to change this through the study of CLSS in Serbian, focusing on two different text domains – newswire and software code comments – and on two text length combinations – phrase-sentence and sentence-paragraph. We present and compare two newly created datasets, describing the process of their annotation with fine-grained semantic similarity scores, and outlining a preliminary linguistic analysis. We also give an overview of the ongoing detailed linguistic annotation targeted at detecting the core linguistic indicators of CLSS.

Miličević Petrović, M., Batanović, V., Trnavac, R., Kovačević, B. (2022). Cross-Level Semantic Similarity in Newswire Texts and Software Code Comments: Insights from Serbian Data in the AVANTES Project. Ljubljana : Inštitut za novejšo zgodovino = Institute of Contemporary History.

Cross-Level Semantic Similarity in Newswire Texts and Software Code Comments: Insights from Serbian Data in the AVANTES Project

Miličević Petrović, Maja; Batanović, Vuk; Trnavac, Radoslava; Kovačević, Borko

2022

Year: 2022
Volume title: Zbornik konference Jezikovne tehnologije in digitalna humanistika = Proceedings of the Conference on Language Technologies and Digital Humanities
First page: 124
Last page: 131
Citation: Miličević Petrović, M., Batanović, V., Trnavac, R., Kovačević, B. (2022). Cross-Level Semantic Similarity in Newswire Texts and Software Code Comments: Insights from Serbian Data in the AVANTES Project. Ljubljana : Inštitut za novejšo zgodovino = Institute of Contemporary History.
All authors: Miličević Petrović, Maja; Batanović, Vuk; Trnavac, Radoslava; Kovačević, Borko
Appears in categories: 4.01 Conference proceedings contribution

Files in this item:

File	Size	Format
JTDH2022_Milicevic-Petrovic-et-al_Cross-Level-Semantic-Similarity.pdf (open access; type: publisher's version (PDF) / Version Of Record; license: Open Access License, Creative Commons Attribution (CC BY))	395.25 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/901701
