AnIta Morphological Analyser for Italian (v1.3)

TAMBURINI, FABIO
2013

Abstract

Stemming and lemmatisation are fundamental low-level tasks in Natural Language Processing (NLP), in particular for morphologically complex languages with rich inflectional and derivational phenomena. These tasks usually rely on powerful morphological analysers able to handle the complex information and processes involved in successful wordform analysis. Italian is one of the ten most widely spoken languages in the world. It is a highly inflected Romance language in which simple words can be modified by essentially three morphological processes: inflection, derivation and compounding. AnIta is a powerful morphological analyser for Italian implemented within the framework of finite-state automata models. It is equipped with a large lexicon that enables it to cover relevant portions of Italian texts. Its development is based on the Helsinki Finite-State Transducer package. Considering the morphotactic combinations allowed for Italian, we have currently defined about 110,000 lemmas, 21,000 of which are uninflected, 51 continuation classes to handle regular and irregular verb conjugations, and 54 continuation classes for noun and adjective declensions. In Italian, clitic pronouns can be attached to the end of some verbal forms and can be combined to build complex clitic clusters. The analyser handles all these phenomena through specific continuation classes. Nine morphographemic rules handle the transformations between abstract lexical strings and surface strings, mainly to manage velar and glide sounds at the boundary between the base and the inflectional ending. The most interesting feature of AnIta is the complex morphological annotation devised to mark derivational and compounding processes. AnIta can produce wordforms in which the various morphemes (base, prefixes and suffixes, both inflectional and derivational) are clearly marked and segmented. We devised a first level of annotation that marks the internal segmentation of wordforms and a second level that describes the morphological process(es) involved in a specific word formation through the definition of the Derivation Graph structure. When compared with similar tools for Italian, the AnIta Morphological Analyser obtained the best performance, with Recall = 97.21% and Precision = 98.71%. This tool was a fundamental building block for designing a high-performing PoS-tagger and Lemmatiser for Italian that participated in two EVALITA evaluation campaigns, ranking in both cases among the best-performing systems.
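
To make the roles of continuation classes and morphographemic rules more concrete, the following is a minimal Python sketch, not AnIta's actual implementation or lexicon format. The class names (`N_o_i`, `N_a_e`), the tag strings, the three-entry lexicon and the single velar rule are all hypothetical and chosen only for illustration. The real analyser is compiled into a finite-state transducer and uses nine rules.

```python
# Toy illustration only: all names, tags and entries are hypothetical,
# not taken from the AnIta lexicon.

# A continuation class lists the endings (with their tags) that may follow a stem.
CONTINUATION_CLASSES = {
    "N_o_i": [("o", "NOUN+masc+sing"), ("i", "NOUN+masc+plur")],
    "N_a_e": [("a", "NOUN+fem+sing"), ("e", "NOUN+fem+plur")],
}

# Lexicon: stem -> (lemma, continuation class).
LEXICON = {
    "libr": ("libro", "N_o_i"),
    "lag": ("lago", "N_o_i"),
    "amic": ("amica", "N_a_e"),
}


def surface(stem, ending):
    """Map an abstract lexical string (stem + ending) to a surface string.

    One simplified morphographemic rule: insert 'h' between a stem-final
    velar (c, g) and an ending starting with a front vowel (e, i), so the
    velar sound is preserved in spelling (lag + i -> laghi).
    """
    if stem[-1] in "cg" and ending[0] in "ei":
        return stem + "h" + ending
    return stem + ending


def build_index():
    """Generate every surface form and record its analyses."""
    index = {}
    for stem, (lemma, cls) in LEXICON.items():
        for ending, tag in CONTINUATION_CLASSES[cls]:
            form = surface(stem, ending)
            index.setdefault(form, []).append((lemma, f"{stem}+{ending}", tag))
    return index


INDEX = build_index()


def analyse(wordform):
    """Return (lemma, segmentation, tag) triples for a surface wordform."""
    return INDEX.get(wordform.lower(), [])


print(analyse("laghi"))   # [('lago', 'lag+i', 'NOUN+masc+plur')]
print(analyse("amiche"))  # [('amica', 'amic+e', 'NOUN+fem+plur')]
```

The sketch analyses by generating all forms and looking them up. A finite-state transducer encodes the same lexical-to-surface relation compactly and can be run in either direction. The toy rule also overgeneralises: applied to a stem such as amic- with the masculine class it would yield the wrong plural, which is the kind of case a real lexicon has to handle with dedicated classes or rules.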
Use this identifier to cite or link to this document: https://hdl.handle.net/11585/142259