Maja Miličević (2015). Semi-automatic construction of comparable genre-oriented corpora of Serbian in Cyrillic and Latin scripts. ANALI FILOLOSKOG FAKULTETA, 27(2), 285-300 [10.18485/analiff.2015.27.2.14].

Semi-automatic construction of comparable genre-oriented corpora of Serbian in Cyrillic and Latin scripts

Maja Miličević
2015

Abstract

This article deals with methods for the semi-automatic construction of genre-oriented corpora from the web, drawing on the BootCaT toolkit. In particular, it reports the results of two parallel studies on Serbian. The first factor that makes Serbian interesting in this respect is its rich inflectional morphology; the second concerns the use of two alphabets, Cyrillic and Latin. Four different methods for the creation of genre-oriented corpora are compared for each script, based on keywords and sequences of words (n-grams) of different lengths (unigrams, bigrams and trigrams). The genre under scrutiny is that of cooking recipes, a genre that is very formulaic and also highly represented on the web. The analysis of the corpora created using the different methods shows that for the Latin script no single method substantially outperforms the others, with excellent results obtained across the board, while for Cyrillic there is a clear advantage for bigrams and trigrams over keywords and unigrams. As well as further confirming the potential of genre-oriented methods of corpus construction for languages with a rich system of inflectional morphology, the results also point to a functional split between the two scripts of Serbian.
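
As a rough illustration of the seed types compared above, the sketch below shows one way frequent unigrams, bigrams or trigrams could be pulled from a small sample of recipe texts and combined into quoted search queries, in the spirit of BootCaT's seed-to-tuple step. It is not the author's procedure or BootCaT's own code: the function names, the frequency-based selection, and the sample sentences are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
import random
import re


def tokenize(text):
    # Lowercase and keep runs of letters only; in Python 3 this pattern
    # matches both Latin and Cyrillic letters.
    return re.findall(r"[^\W\d_]+", text.lower())


def top_ngrams(texts, n, k):
    # Count every word n-gram across the sample and return the k most frequent.
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return [ngram for ngram, _ in counts.most_common(k)]


def build_queries(seeds, tuple_size, how_many, rng):
    # Combine seeds into tuples, shuffle them, and quote each seed so that a
    # multi-word n-gram would be searched as an exact phrase.
    all_tuples = list(combinations(seeds, tuple_size))
    rng.shuffle(all_tuples)
    return [" ".join(f'"{s}"' for s in t) for t in all_tuples[:how_many]]


if __name__ == "__main__":
    # Invented sample sentences (hypothetical, not taken from the study).
    sample = [
        "Dodati so i biber po ukusu.",
        "Dodati brašno i so po ukusu.",
        "Peći u zagrejanoj rerni.",
    ]
    seeds = top_ngrams(sample, n=2, k=4)
    for query in build_queries(seeds, tuple_size=2, how_many=3, rng=random.Random(0)):
        print(query)
```

Running the same functions on a Cyrillic sample would give Cyrillic seeds, so one parallel seed list per script could be prepared in the same way.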
Files in this item:
File: analiff-2015-27-2-14.pdf
Access: open access
Type: Publisher's version (PDF) / Version of Record
License: Free open access license
Size: 176.81 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/776165