Birds of a feather don’t quite tweet together: An analysis of spelling variation in Slovene, Croatian and Serbian twitterese

Maja Miličević
2017

Abstract

In this paper, we investigate spelling conventions on the Twitter micro-blogging platform. To gain insight into the universalities and specificities of communication on social media, we perform a comparative analysis of three closely related languages: Slovene, Croatian and Serbian. The data collection and annotation protocols were developed jointly for all three languages, allowing for maximum interoperability and comparability of results. The analysis reveals differences in the amount of deviation from the norm in the three languages, with Slovene twitterese being the most inclined to use non-standard spelling, and Serbian the least. Overall, closed word classes, especially interjections and abbreviations, are found to be more non-standard than the open classes. In terms of the types of standard-to-non-standard transformations, character deletions are more frequent than insertions or replacements, and transformations mostly occur in word-final positions. The discrepancies between languages are largely due to the pronounced tendency of Slovene and Croatian to use spoken-like, regional and dialectal forms characterised by vowel omissions, especially at the end of words. This analysis and the resulting datasets can be used to further study the properties of non-standard Slovene, Croatian and Serbian, as well as to develop language technologies for non-standard data in these languages.
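The transformation types named in the abstract (deletion, insertion, replacement) and their position in the word can be illustrated with a minimal sketch. This is not the authors' annotation procedure: the function name `char_transformations`, the position labels, and the use of Python's `difflib` alignment are assumptions made for illustration, and the word pairs are common colloquial Slovene shortenings chosen as examples, not items taken from the datasets.

```python
from difflib import SequenceMatcher


def char_transformations(standard, nonstandard):
    """Align a standard form with a non-standard form and list the edits.

    Returns a list of (operation, position, standard_chars, nonstandard_chars)
    tuples. The position labels are a hypothetical scheme for this sketch.
    """
    edits = []
    matcher = SequenceMatcher(None, standard, nonstandard, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged stretch, nothing to record
        # Locate the edit relative to the standard form.
        if i1 == 0 and i2 == len(standard):
            position = "whole"
        elif i1 == 0:
            position = "initial"
        elif i2 == len(standard):
            position = "final"
        else:
            position = "medial"
        edits.append((tag, position, standard[i1:i2], nonstandard[j1:j2]))
    return edits


# Illustrative pairs (standard form, non-standard form); assumed examples.
for std, nonstd in [("tako", "tak"), ("bilo", "blo")]:
    print(std, ">", nonstd, char_transformations(std, nonstd))
```

In this sketch, a word-final vowel omission such as "tako" > "tak" comes out as a single deletion labelled "final", which is the pattern the abstract reports as most typical. A real analysis would need a more careful alignment than `difflib` provides, since its opcodes are not guaranteed to be a minimal edit script.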
Maja Miličević; Nikola Ljubešić; Darja Fišer (2017). In: Investigating Computer-Mediated Communication: Corpus-based Approaches to Language in the Digital World, pp. 14-43.
Files in this record:
4-Chapter Manuscript-16-1-10-20171010.pdf (open access; publisher's version, PDF; licence: Creative Commons Attribution-ShareAlike, CC BY-SA; 587.67 kB, Adobe PDF)

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/775383