ReLDI-NormTagNER-sr 3.0 is a manually annotated corpus of Serbian tweets. It is intended as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet also carries automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). This version corrects various annotation mistakes and is now encoded in the CoNLL-U-Plus format, like other linguistic training datasets for Croatian and Serbian. The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade.
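For readers unfamiliar with CoNLL-U-Plus, the following is a minimal sketch of how a file in that format could be read. It relies only on the general conventions of the format: a `# global.columns = ...` header naming the columns, `#` comment lines, tab-separated token lines, and blank lines between sentences. The function name `read_conllup`, the column set and the two-token sample are illustrative assumptions; they are not taken from the corpus, whose actual column inventory should be checked in the distributed files.

```python
from typing import Dict, Iterable, Iterator, List


def read_conllup(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dicts.

    Hypothetical helper: column names come from the file's own
    '# global.columns' header, so nothing about the corpus is hard-coded.
    """
    columns = None
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # Header line: whitespace-separated column names after '='.
            columns = line.split("=", 1)[1].split()
            continue
        if line.startswith("#"):
            # Other comment lines (sentence or document metadata) are skipped here.
            continue
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if columns is None:
            raise ValueError("missing '# global.columns' header")
        # Token line: one tab-separated value per declared column.
        sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


# Invented sample with an assumed column set, for illustration only.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS\n"
    "# text = Dobar dan\n"
    "1\tDobar\tdobar\tADJ\n"
    "2\tdan\tdan\tNOUN\n"
    "\n"
)

sentences = list(read_conllup(SAMPLE.splitlines()))
```

With a real file, the same function could be called on an open file handle, for example `read_conllup(open(path, encoding="utf-8"))`, where `path` points at a file from the corpus distribution.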
Ljubešić, N., Erjavec, T., Batanović, V., Miličević, M., Samardžić, T. (2023). Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0.
Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0
Miličević, Maja
2023


