Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data / Laura Anderlucci; Cinzia Viroli. - In: ADVANCES IN DATA ANALYSIS AND CLASSIFICATION. - ISSN 1862-5355. - ELECTRONIC. - 14:(2020), pp. 759-770. [10.1007/s11634-020-00399-3]
Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data
Laura Anderlucci; Cinzia Viroli
2020
Abstract
Topic detection in short textual data is a challenging task because such data are represented as a high-dimensional and extremely sparse document-term matrix. In this paper we focus on the problem of classifying textual data on the basis of their (unique) topic. For unsupervised classification, a popular approach called Mixture of Unigrams models the word counts as a mixture of multinomial distributions, each component corresponding to a different topic. Placing a Dirichlet prior on the multinomial parameters extends the model to a mixture of compound Dirichlet-Multinomial distributions, which is preferable for sparse data. We propose a gradient descent estimation method for fitting the model, and investigate supervised and unsupervised classification performance on real empirical problems.
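As an illustration of the model class named in the abstract, the following minimal sketch evaluates the Dirichlet-Multinomial log-probability of a word-count vector and uses it to compute posterior topic probabilities for a single document under a mixture. It is not the authors' implementation: the function names, the toy vocabulary of four words, the mixing weights, and the Dirichlet parameters are all hypothetical, the parameterization used in the paper may differ, and the gradient descent estimation step is not shown.

```python
import math


def dm_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under a Dirichlet-Multinomial.

    counts: non-negative integer word counts for one document.
    alpha:  positive Dirichlet parameters, one per vocabulary word.
    """
    n = sum(counts)
    a0 = sum(alpha)
    # Multinomial coefficient: n! / prod_j x_j!
    log_p = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Gamma(a0) / Gamma(n + a0)
    log_p += math.lgamma(a0) - math.lgamma(n + a0)
    # prod_j Gamma(x_j + alpha_j) / Gamma(alpha_j)
    log_p += sum(math.lgamma(x + a) - math.lgamma(a)
                 for x, a in zip(counts, alpha))
    return log_p


def topic_posteriors(counts, weights, alphas):
    """Posterior probability of each topic for one document (Bayes' rule)."""
    log_joint = [math.log(w) + dm_log_pmf(counts, a)
                 for w, a in zip(weights, alphas)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_joint)
    unnorm = [math.exp(v - m) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical toy setting: two topics over a four-word vocabulary.
weights = [0.5, 0.5]                  # assumed mixing proportions
alphas = [[2.0, 2.0, 0.1, 0.1],       # topic 0 favours words 0 and 1
          [0.1, 0.1, 2.0, 2.0]]       # topic 1 favours words 2 and 3
doc = [3, 1, 0, 0]                    # a short, sparse document
post = topic_posteriors(doc, weights, alphas)
```

In this toy setting the document uses only the words favoured by topic 0, so `post[0]` exceeds `post[1]`; assigning each document to its highest-posterior component is the usual way such a mixture is turned into a classifier.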