Introducing the La Repubblica corpus: A large, annotated, TEI(XML)-compliant corpus of newspaper Italian

Baroni, M.; Bernardini, Silvia; Comastri, F.; Piccioni, L.; Volpi, A.; Aston, Christopher Guy; Mazzoleni, Marco

This paper describes the _La Repubblica_ Corpus, currently being constructed at the SSLMIT of the University of Bologna, discusses the techniques used to annotate it, and presents examples of how it can be used. The corpus is a very large collection of newspaper text, currently amounting to 130 million words but expected to grow to 400 million words within the next two years. When completed, it will contain all the articles published between 1985 and 2000 by the national daily _La Repubblica_, the second most widely read Italian newspaper. This resource answers a widely felt need for annotated contemporary Italian language data. While arguably not ideal as a reference corpus, being mono-source, the _La Repubblica_ Corpus is probably the largest freely accessible Italian corpus available to date for research purposes.

We will first describe the procedures adopted in the preparation of the corpus: extraction of articles and meta-textual information from the original databases, tokenization, XML annotation following the TEI guidelines (http://www.tei-c.org/P4X/), and indexing. We will then present two major recent improvements to the corpus annotation, namely part-of-speech (POS) tagging and categorization of each article in terms of its genre (news-report or comment) and topic (church, culture, economics, education, news, politics, science, society, sport, weather). Both tasks were carried out using supervised machine learning techniques. In this respect, annotation of a very large corpus also proved to be an ideal testbed for recent tagging and categorization algorithms.

For POS tagging, we first annotated by hand a training set of 180 randomly selected articles (about 115,000 tokens) using a 52-category tagset. We experimented with single taggers and with combinations of taggers using the ACOPOST suite (Schroeder 2002), running 10-fold cross-validation tests. In general, tagger combinations performed slightly better than the single taggers.
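Combining the output of several taggers, as in the experiments above, can be done by letting the taggers vote on each token. The following is a minimal sketch of per-token majority voting, not the implementation used by the authors. The function name `majority_vote`, the tie-breaking rule (prefer the tagger listed first) and the tag labels in the usage example are assumptions made for illustration; they are not taken from the paper or from its 52-category tagset.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(tag_sequences: Sequence[Sequence[str]]) -> List[str]:
    """Combine the tags proposed by several taggers, token by token.

    tag_sequences[i][j] is the tag that tagger i assigned to token j.
    Ties are resolved in favour of the tagger listed first; this rule is
    an assumption, since the source does not say how ties were handled.
    """
    if not tag_sequences:
        return []
    length = len(tag_sequences[0])
    if any(len(seq) != length for seq in tag_sequences):
        raise ValueError("all taggers must tag the same number of tokens")

    combined = []
    for votes in zip(*tag_sequences):
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first tag, in tagger order, that reaches the top count.
        combined.append(next(tag for tag in votes if counts[tag] == top))
    return combined


if __name__ == "__main__":
    # Hypothetical output of three taggers for a three-token sentence.
    tagger_a = ["ART", "NOUN", "ADJ"]
    tagger_b = ["ART", "VERB", "ADJ"]
    tagger_c = ["PRON", "NOUN", "ADJ"]
    print(majority_vote([tagger_a, tagger_b, tagger_c]))
```

With three taggers, a token receives the tag that at least two of them agree on; the tie-breaking rule only matters when all three disagree.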
The best performance was achieved by combining, via majority voting, a Hidden Markov Model (HMM) tagger, an example-based tagger and an HMM/transformation-based tagger stack. This scheme achieved a mean word-level accuracy of 95.46% in the 10-fold experiments. As far as we know, this is around the state-of-the-art performance level for tagging of Italian (Tamburini 2000 reports an accuracy of 95.19% with a smaller, 32-category tagset). We used the same tagger combination, trained on the full training set, to tag the remainder of the corpus.

For genre and topic categorization, we created a manually annotated training set of 15,000 articles. 10-fold cross-validation tests on this set indicated that a simple unigram TFIDF (term frequency times inverse document frequency) model with no stemming and no preliminary feature selection performed best for both genre and topic detection. In the genre detection experiments, this approach achieved an average accuracy of 90.03% with 90.89% precision and 93.03% recall. In topic detection, it achieved 95.75% average accuracy with 86.05% precision and 73.4% recall (measures micro-averaged across categories). The genre detection results are of particular interest, since the performance we attained is comparable, if not superior, to that reported in genre detection experiments with more sophisticated feature sets (see, e.g., Finn and Kushmerick 2003). Also, it turned out that keeping genre and topic separate worked much better than creating combined genre-topic categories.

We will finish by providing several examples of how the corpus can be used. Besides simple word or phrase searches on the whole corpus, the software and annotation provided allow the user, among other things, to limit searches to specific parts of the corpus (e.g. titles vs. bodies of articles, beginning vs. end of sentences etc.), to select subcorpora on the basis of information contained within certain tags (e.g....
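The unigram TFIDF weighting used for genre and topic categorization can be illustrated with a short sketch. It covers only the weighting step, with no stemming and no feature selection, and it is not the authors' implementation. The abstract does not specify the exact variant, so raw counts for term frequency and `log(N / df)` for inverse document frequency are assumptions, as are the function name `tfidf_vectors` and the toy documents.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each unigram by term frequency times inverse document frequency.

    Assumed variant: tf is the raw count of the term in the document and
    idf is log(N / df), where N is the number of documents and df is the
    number of documents containing the term.
    """
    n_docs = len(documents)
    doc_freq: Counter = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in documents:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


if __name__ == "__main__":
    # Two hypothetical tokenized articles.
    docs = [["gol", "partita", "gol"], ["governo", "partita"]]
    for vector in tfidf_vectors(docs):
        print(vector)
```

Under this variant, a word that occurs in every document receives weight zero, so the words that discriminate between categories dominate the vectors passed to the classifier.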
