This paper reviews in depth the computational methods and techniques developed within a project aiming to annotate CORIS/CODIS with part-of-speech (PoS) tags. In a large number of studies devoted to automatic PoS annotation, the tagsets tend to be pre-defined and, consequently, theory-oriented. Our aim is to automatically derive an empirically founded PoS classification, making few a priori assumptions about the PoS classes to be distinguished. Early approaches to this problem were based on the hypothesis that if two words are syntactically and semantically different, they will appear in different contexts. A number of studies in both computational linguistics and cognitive science build on this hypothesis to develop automatic or semi-automatic clustering procedures. These studies examine the distributional behaviour of target words by comparing the lexical distribution of their respective collocates and by using quantitative measures of distributional similarity. The main drawback of these techniques is the limited context of analysis: information is collected from a restricted context of, for instance, 3 words, which can conceal syntactic dependencies longer than the context interval. Our approach to solving this problem is to use basic syntactic relations together with distributional information. The algorithm extracts information from loosely labelled dependency structures that encode only basic and broadly accepted syntactic relations, namely Head/Dependent and the distinction of dependents into Argument vs. Adjunct. Such information is exploited to further refine an exclusively distributional classification induced by means of Brill’s algorithm. Three main uncontroversial classes emerge from this broad-range process: nouns, verbs and all the others. Syntactic information extracted from the dependency structures is automatically processed and encoded in lexically anchored formulae (or syntactic types).
The proposed algorithm creates pairs of words and syntactic types and connects the pairs according to the syntactic similarities between them, producing an extensive graph. The algorithm then exploits statistical information from this graph to achieve a first-level breakdown of PoS classes. The final PoS tagset is obtained by further decomposing the classes from the previous phases, using distributional knowledge and a more sophisticated clustering metric. To evaluate the effectiveness of the proposed PoS tagsets, a number of experiments have been carried out; the results obtained in state-of-the-art tagging experiments will be presented.
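The window-based distributional comparison that the abstract attributes to earlier approaches can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `context_vectors` and `cosine`, the toy Italian corpus, the choice of cosine similarity and the window size are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def context_vectors(tokens, window=3):
    """For every word, count the collocates found within `window` tokens
    on either side of each of its occurrences (hypothetical helper)."""
    vectors = {}
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        counts = vectors.setdefault(word, Counter())
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed metric)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus, invented for this example only.
toy_corpus = (
    "il gatto dorme . il cane dorme . il gatto mangia . il cane mangia"
).split()

vectors = context_vectors(toy_corpus, window=1)

# Two nouns share the same collocates; a noun and a verb share none here.
sim_nouns = cosine(vectors["gatto"], vectors["cane"])
sim_noun_verb = cosine(vectors["gatto"], vectors["dorme"])
print(f"gatto ~ cane:  {sim_nouns:.2f}")
print(f"gatto ~ dorme: {sim_noun_verb:.2f}")
```

Because collocates are only counted inside the fixed window, any dependency that spans more tokens than `window` is invisible to this comparison, which is the limitation the abstract cites as the motivation for adding syntactic relations.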
Tamburini F., Seidenari C., Bolognesi A., Bernardi R. (2008). Italian Lexical-Classes Definition Using Automatic Methods. Bologna: Bononia University Press.
Italian Lexical-Classes Definition Using Automatic Methods
TAMBURINI, FABIO;
2008