Semi-automatic construction of genre-oriented corpora in morphologically rich languages: a comparison between Italian and Serbian

Maja Miličević; Silvia Bernardini; Adriano Ferraresi
2014

Abstract

This article deals with methods for the semi-automatic construction of genre-oriented corpora from the web, drawing on the BootCaT toolkit. In particular, it reports the results of two parallel studies on Italian and Serbian, chosen as examples of languages with very rich inflectional morphology. The two studies compare four different methods to create genre-, rather than topic-oriented corpora, based on keywords and sequences of words (n-grams) of different lengths (unigrams, bigrams and trigrams). The genre under scrutiny is that of cooking recipes, a genre that is very formulaic and also very frequent on the web. The analysis of the corpora created using the four different methods shows that the best results for Italian are achieved by the keyword method, while for Serbian no single method substantially outperforms the others; furthermore, the results obtained for Serbian are consistently better than those for Italian. As well as confirming the potential of genre-oriented methods of corpus construction for languages other than English, these results can be interpreted in a contrastive perspective, as highlighting the importance of morphological differences between the two languages, particularly as concerns the richer nominal morphology of Serbian compared to Italian, as well as the absence vs. presence of articles.
Files in this item:
italbg-2014-1-5.pdf (open access)
Type: Publisher's version (PDF)
License: Free open-access license
Size: 315.04 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/365719