Learning an Italian Categorial Grammar

Bernardi, R.; Bolognesi, A.; Seidenari, C.; Tamburini, Fabio

This paper presents the first results of work in progress on grammar induction for Italian. The work aims at a better understanding of the models behind Italian syntax and at building systems able to create Italian Treebanks automatically. Experiments on the automatic induction of lexical categories, which were the focus of a FIRB project, are our starting point for inducing complex syntactic types, which will then be used to build Categorial Grammar (CG) derivations of a given corpus semi-automatically. The study carried out so far is based on a specific version of CG, namely Categorial Type Logic (CTL, the logical version of CG; Moortgat ‘97). An important aspect of this system is the derivability relation that holds between syntactic categories, which makes it possible to avoid multiple assignments of lexical types when they are unnecessary. Our type induction algorithm (Buszkowski and Penn ‘90) requires as input a Treebank consisting of binary trees. So far we have based our preliminary work on the Turin University Treebank (TUT), which consists of dependency grammar trees for 1500 sentences; we have rewritten these into binary trees so as to extract categorial types by exploiting the dependency relations of the original trees. In this contribution we present the set of categories induced from TUT and illustrate how we intend to exploit the CTL derivability relation to simplify the lexicon and avoid unnecessary multiple assignments. In particular, we plan to use both syntactic and semantic filtering criteria as well as statistical clustering methods. Because categorial grammar types encode rich dependency and semantic information, once a lexicon has been induced as described above it will be possible to run a CTL-based syntactic analyser (parser) in order to study the differences between Italian, English and Dutch, since similar grammar induction work has already been carried out for the latter two languages.
Finally, we draw attention to the need to improve the performance of a CTL parser by taking semantic information into account, so as to discard semantically implausible derivations. Connected to this aim is the need to extend the lexical assignments with semantic meaning representations (see Bos ‘05). It is plausible that integrating CG with semantic resources such as FrameNet could help achieve this ultimate goal.
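
As a rough illustration of the type-induction step mentioned in the abstract, the sketch below labels a binary tree in the general style of Buszkowski and Penn: the root receives the sentence type, each argument daughter receives a fresh type variable, and the functor daughter receives the corresponding slash type. It is not the authors' implementation. The names `Leaf`, `Node`, `induce`, the `functor` field, the toy sentence and the variable names `x1`, `x2` are assumptions made for this example, and the later unification of variables and the conversion from TUT dependency trees are left out.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Tuple, Union


@dataclass
class Leaf:
    word: str


@dataclass
class Node:
    left: "Tree"
    right: "Tree"
    functor: str  # "left" or "right": which daughter is the functor (head)


Tree = Union[Leaf, Node]


def paren(t: str) -> str:
    """Bracket a complex type so that nested slashes stay unambiguous."""
    return f"({t})" if "/" in t or "\\" in t else t


def induce(tree: Tree, root: str = "s") -> List[Tuple[str, str]]:
    """Return (word, type) pairs obtained by top-down labelling of the tree.

    Notation (result-first for '/', result-last for '\\'):
      A/B  combines with a B on its right to give A
      B\\A combines with a B on its left to give A
    """
    fresh = (f"x{i}" for i in count(1))  # supply of fresh type variables
    lexicon: List[Tuple[str, str]] = []

    def label(t: Tree, typ: str) -> None:
        if isinstance(t, Leaf):
            lexicon.append((t.word, typ))
            return
        arg = next(fresh)  # the argument daughter gets a fresh variable
        if t.functor == "left":
            label(t.left, f"{paren(typ)}/{arg}")
            label(t.right, arg)
        else:
            label(t.left, arg)
            label(t.right, f"{arg}\\{paren(typ)}")

    label(tree, root)
    return lexicon


# Toy tree for "Maria legge libri": the verb phrase is the functor of the
# sentence, and the verb is the functor of the verb phrase.
example = Node(
    Leaf("Maria"),
    Node(Leaf("legge"), Leaf("libri"), functor="left"),
    functor="right",
)

for word, typ in induce(example):
    print(word, typ)
```

On the toy tree this assigns `x1` to "Maria", `(x1\s)/x2` to "legge" and `x2` to "libri". In a full pipeline the variables would still have to be unified or replaced by atomic categories, which is the point at which multiple assignments per word arise and a derivability relation could be used to prune them.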

Bernardi R., Bolognesi A., Seidenari C., Tamburini F. (2008). Learning an Italian Categorial Grammar. Bologna: Bononia University Press.