Cross-Level Semantic Similarity in Newswire Texts and Software Code Comments: Insights from Serbian Data in the AVANTES Project / Miličević Petrović, Maja; Batanović, Vuk; Trnavac, Radoslava; Kovačević, Borko. - Electronic. - (2022), pp. 124-131. (Paper presented at the conference Language Technologies & Digital Humanities, held in Ljubljana, Slovenia, 15-16 September 2022).

Cross-Level Semantic Similarity in Newswire Texts and Software Code Comments: Insights from Serbian Data in the AVANTES Project

Miličević Petrović, Maja (first author)
2022

Abstract

This paper presents the Serbian datasets developed within the project Advancing Novel Textual Similarity-based Solutions in Software Development – AVANTES, intended for the study of Cross-Level Semantic Similarity (CLSS). CLSS measures the level of semantic overlap between texts of different lengths, and it also refers to the problem of establishing such a measure automatically. The problem was first formulated about a decade ago, but research on it has been sparse and limited to English. The AVANTES project aims to change this through the study of CLSS in Serbian, focusing on two different text domains – newswire and software code comments – and on two text length combinations – phrase-sentence and sentence-paragraph. We present and compare two newly created datasets, describing the process of their annotation with fine-grained semantic similarity scores, and outlining a preliminary linguistic analysis. We also give an overview of the ongoing detailed linguistic annotation targeted at detecting the core linguistic indicators of CLSS.
Proceedings of the Conference on Language Technologies and Digital Humanities, pp. 124-131
Miličević Petrović, Maja; Batanović, Vuk; Trnavac, Radoslava; Kovačević, Borko

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/901701