
This article deals with methods for the semi-automatic construction of genre-oriented corpora from the web, drawing on the BootCaT toolkit. In particular, it reports the results of two parallel studies on Serbian. The first factor that makes Serbian interesting in this respect is its rich inflectional morphology; the second concerns the use of two alphabets, Cyrillic and Latin. Four different methods for the creation of genre-oriented corpora are compared for each script, based on keywords and sequences of words (n-grams) of different lengths (unigrams, bigrams and trigrams). The genre under scrutiny is that of cooking recipes, a genre that is very formulaic and also highly represented on the web. The analysis of the corpora created using the different methods shows that for the Latin script no single method substantially outperforms the others, with excellent results obtained across the board, while for Cyrillic there is a clear advantage for bigrams and trigrams over keywords and unigrams. As well as further confirming the potential of genre-oriented methods of corpus construction for languages with a rich system of inflectional morphology, the results also point to a functional split between the two scripts of Serbian.
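The seed types compared in the abstract (unigrams, bigrams and trigrams) can be illustrated with a minimal sketch. It is not the article's procedure or BootCaT's own code. The function names, the toy recipe snippets and the step of combining seeds into quoted query tuples are all assumptions made for illustration only.

```python
import random
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, n, k):
    """Return the k most frequent word n-grams across a list of texts.

    Tokenisation is naive (lowercase + whitespace split); this is an
    assumption of the sketch, not a claim about the article's method.
    """
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    return [gram for gram, _ in counts.most_common(k)]


def make_queries(seeds, tuple_size, how_many, rng_seed=0):
    """Combine seeds into random tuples and format each as a quoted query.

    Hypothetical helper: each seed is wrapped in double quotes so that a
    multi-word seed would be searched as a phrase.
    """
    rng = random.Random(rng_seed)
    all_tuples = list(combinations(seeds, tuple_size))
    rng.shuffle(all_tuples)
    return [" ".join(f'"{s}"' for s in t) for t in all_tuples[:how_many]]


# Invented toy snippets in Latin-script Serbian, for illustration only.
sample = [
    "dodati so i biber po ukusu",
    "dodati so i ulje",
    "peći u zagrejanoj rerni",
    "peći u zagrejanoj rerni 30 minuta",
]

bigram_seeds = top_ngrams(sample, 2, 5)
trigram_seeds = top_ngrams(sample, 3, 3)
queries = make_queries(bigram_seeds, tuple_size=2, how_many=3)
```

Because the n-grams are taken from inflected surface forms, a sketch like this needs no lemmatisation, which is one reason n-gram seeds are attractive for a morphologically rich language. The same functions work unchanged on Cyrillic input, since Python strings are Unicode.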

Maja Miličević (2015). Semi-automatic construction of comparable genre-oriented corpora of Serbian in Cyrillic and Latin scripts. ANALI FILOLOSKOG FAKULTETA, 27(2), 285-300 [10.18485/analiff.2015.27.2.14].

Semi-automatic construction of comparable genre-oriented corpora of Serbian in Cyrillic and Latin scripts

Maja Miličević

2015



Year: 2015
Journal: ANALI FILOLOSKOG FAKULTETA
DOI: https://dx.doi.org/10.18485/analiff.2015.27.2.14
All authors: Maja Miličević
Publication type: 1.01 Journal article

Files in this item:

File	Size	Format
analiff-2015-27-2-14.pdf (open access; publisher's version (PDF) / Version of Record; free open-access licence)	176.81 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/776165
