This article deals with methods for the semi-automatic construction of genre-oriented corpora from the web, drawing on the BootCaT toolkit. In particular, it reports the results of two parallel studies on Serbian. The first factor that makes Serbian interesting in this respect is its rich inflectional morphology; the second concerns the use of two alphabets, Cyrillic and Latin. Four different methods for the creation of genre-oriented corpora are compared for each script, based on keywords and sequences of words (n-grams) of different lengths (unigrams, bigrams and trigrams). The genre under scrutiny is that of cooking recipes, a genre that is very formulaic and also highly represented on the web. The analysis of the corpora created using the different methods shows that for the Latin script no single method substantially outperforms the others, with excellent results obtained across the board, while for Cyrillic there is a clear advantage for bigrams and trigrams over keywords and unigrams. As well as further confirming the potential of genre-oriented methods of corpus construction for languages with a rich system of inflectional morphology, the results also point to a functional split between the two scripts of Serbian.
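The seed-based approach summarized above can be illustrated with a minimal sketch. It assumes that seeds are simply the most frequent word n-grams in a small hand-picked sample of recipes, and that queries are random combinations of those seeds. The function names `top_ngrams` and `seed_tuples`, the sample sentences, and all parameter values are hypothetical. The code does not reproduce BootCaT itself or the exact procedure used in the article.

```python
import random
from collections import Counter
from itertools import combinations


def top_ngrams(texts, n, k):
    """Return the k most frequent word n-grams across the sample texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of n tokens over the text and count each n-gram.
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return [gram for gram, _ in counts.most_common(k)]


def seed_tuples(seeds, size, how_many, rng):
    """Draw distinct random combinations of seeds to use as search queries."""
    pool = list(combinations(seeds, size))
    return rng.sample(pool, min(how_many, len(pool)))


# Hypothetical two-sentence sample in Latin-script Serbian (illustration only).
sample = [
    "dodati so i biber pa dobro promešati",
    "dodati so i šećer pa peći u rerni",
]

bigrams = top_ngrams(sample, n=2, k=5)
queries = seed_tuples(bigrams, size=2, how_many=3, rng=random.Random(0))
for query in queries:
    # Quote each seed so a search engine would treat it as an exact phrase.
    print(" ".join(f'"{seed}"' for seed in query))
```

Changing `n` to 1 or 3 yields unigram or trigram seeds, which is one way the methods compared in the article could be contrasted. A Cyrillic run would only need sample texts in that script.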
Maja Miličević (2015). Semi-automatic construction of comparable genre-oriented corpora of Serbian in Cyrillic and Latin scripts. Anali Filološkog fakulteta, 27(2), 285-300. doi:10.18485/analiff.2015.27.2.14


