Morph-it! A free corpus-based morphological resource for the Italian language

Zanchetta, Eros; Baroni, Marco
2005

Abstract

Lexicons and morphological analysers are at the core of many NLP applications, such as lemmatisation, POS tagging and morphology generation. Unfortunately, because creating a lexicon tends to be a long and labour-intensive task (especially for highly inflectional languages), there are to date no freely available lexicons for the Italian language. This is why we embarked on the task of creating our own lexicon and then decided to make it freely available. In this paper we describe our method for the rapid creation of a lexicon using a mixture of corpus-based techniques and manual checking. Our main source of linguistic data was the “Repubblica” corpus (Baroni et al. 2004), from which we extracted lemmas and inferred morphological information not present in the original corpus (i.e. gender) using distributional as well as morphological cues. With that information we then generated inflected forms for all extracted lemmas. The project is not yet complete, and a first evaluation of the quality of the resource suggests that more words from everyday language should be added. Proper nouns, loan words, diminutive adjectives and a large number of verb forms with attached clitics are also still missing. So far the project has been carried out by two people working on it part-time, for a total of about 600 person-hours. Although this paper illustrates the process of creating a lexicon of the Italian language, the same methodology can easily be adapted and replicated in other knowledge-poor morphological extraction projects in different languages.
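As a rough illustration of what combining distributional and morphological cues can look like, the sketch below guesses a noun's gender from the determiners that precede it in a token stream, falls back on the noun's ending when the context gives no clear answer, and then generates a regular plural. It is not the procedure used for Morph-it!: the determiner lists, the function names `guess_gender` and `pluralise`, and the simplified plural rules are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical cue lists chosen for this sketch; not taken from the paper.
MASC_DETERMINERS = {"il", "lo", "un", "uno"}
FEM_DETERMINERS = {"la", "una"}


def guess_gender(noun, tokens):
    """Return 'm', 'f' or None for `noun`, given a list of corpus tokens."""
    votes = Counter()
    # Distributional cue: count gendered determiners directly before the noun.
    for prev, word in zip(tokens, tokens[1:]):
        if word != noun:
            continue
        if prev in MASC_DETERMINERS:
            votes["m"] += 1
        elif prev in FEM_DETERMINERS:
            votes["f"] += 1
    if votes["m"] != votes["f"]:
        return "m" if votes["m"] > votes["f"] else "f"
    # Morphological cue: fall back on the typical -o / -a endings.
    if noun.endswith("o"):
        return "m"
    if noun.endswith("a"):
        return "f"
    return None  # no evidence either way; leave for manual checking


def pluralise(noun, gender):
    """Generate a regular plural only; irregular nouns are not handled."""
    if noun.endswith("o"):
        return noun[:-1] + "i"
    if noun.endswith("a"):
        return noun[:-1] + ("e" if gender == "f" else "i")
    if noun.endswith("e"):
        return noun[:-1] + "i"
    return noun  # treat anything else as invariant


tokens = "la mano è sulla tavola e il problema resta".split()
for noun in ("mano", "problema"):
    gender = guess_gender(noun, tokens)
    print(noun, gender, pluralise(noun, gender))
```

The two example nouns show why the context is consulted first: "mano" ends in -o but is feminine, and "problema" ends in -a but is masculine, so the ending alone would mislead. Cases where neither cue applies return `None`, which is where a manual checking step would come in.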
Proceedings of Corpus linguistics Conference Series 2005 (ISSN 1747-9398), pp. 1-12

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/15321