Laura Anderlucci, Cinzia Viroli (2020). Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data. Advances in Data Analysis and Classification, 14(4, December), 759-770. DOI: 10.1007/s11634-020-00399-3.

Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data

Laura Anderlucci; Cinzia Viroli
2020

Abstract

Topic detection in short textual data is a challenging task because the data are represented as a high-dimensional and extremely sparse document-term matrix. In this paper we focus on the problem of classifying textual data on the basis of their (unique) topic. For unsupervised classification, a popular approach called Mixture of Unigrams models the word counts with a mixture of multinomial distributions, each component corresponding to a different topic. Placing a Dirichlet prior on the multinomial parameters extends the model to a mixture of compound Dirichlet-Multinomial distributions, which is preferable for sparse data. We propose a gradient descent estimation method for fitting the model, and investigate its supervised and unsupervised classification performance on real data problems.
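
To make the model in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It evaluates the standard Dirichlet-Multinomial log-probability of one document's word counts and uses it to compute posterior topic probabilities under a mixture. The function names (`dm_log_pmf`, `topic_posteriors`) and all parameter values are hypothetical and chosen only for illustration. The gradient descent estimation step proposed in the paper is not reproduced here; the mixture weights and Dirichlet parameters are assumed to be already known.

```python
import math

def dm_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under a Dirichlet-Multinomial.

    counts: non-negative integer word counts for one document.
    alpha:  positive Dirichlet parameters, one per vocabulary term.
    """
    n = sum(counts)
    a = sum(alpha)
    # log n! + log Gamma(a) - log Gamma(n + a)
    out = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for x, al in zip(counts, alpha):
        # log Gamma(x + alpha) - log Gamma(alpha) - log x!
        out += math.lgamma(x + al) - math.lgamma(al) - math.lgamma(x + 1)
    return out

def topic_posteriors(counts, weights, alphas):
    """Posterior probability of each topic for one document.

    weights: mixture proportions (assumed to sum to 1).
    alphas:  one Dirichlet parameter vector per topic.
    """
    logs = [math.log(w) + dm_log_pmf(counts, a)
            for w, a in zip(weights, alphas)]
    m = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical three-word vocabulary and two topics.
doc = [3, 0, 0]
post = topic_posteriors(doc, [0.5, 0.5], [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]])
print(post)
```

In the unsupervised setting, a document would be assigned to the topic with the largest posterior probability. In the supervised setting, the same rule could be applied with parameters estimated from labelled documents.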
Files in this item:
File: manuscript_unblinded_rev.pdf
Open Access from 26/05/2021
Description: AAM (author accepted manuscript)
Type: Postprint
License: Free open-access license
Size: 216.92 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/763446