The abundance of news being generated on a daily basis has made it hard, if not impossible, to monitor all news developments. Thus, there is an increasing need for accurate tools that can organize the news for easier exploration. Typically, this means clustering the news stream, and then connecting the clusters into story lines. Here, we focus on the clustering step, using a local topic graph and a community detection algorithm. Traditionally, news clustering was done using sparse vector representations with TF–IDF weighting, but more recently dense representations have emerged as a popular alternative. Here, we compare these two representations, as well as combinations thereof. The evaluation results on a standard dataset show a sizeable improvement over the state of the art both for the standard F1 as well as for a BCubed version thereof, which we argue is more suitable for the task.

Staykovski T., Barron-Cedeno A., Da San Martino G., Nakov P. (2019). Dense vs. Sparse representations for news stream clustering. CEUR-WS.

Dense vs. Sparse representations for news stream clustering

Barron-Cedeno A.;Da San Martino G.;
2019

Abstract

The abundance of news being generated on a daily basis has made it hard, if not impossible, to monitor all news developments. Thus, there is an increasing need for accurate tools that can organize the news for easier exploration. Typically, this means clustering the news stream, and then connecting the clusters into story lines. Here, we focus on the clustering step, using a local topic graph and a community detection algorithm. Traditionally, news clustering was done using sparse vector representations with TF–IDF weighting, but more recently dense representations have emerged as a popular alternative. Here, we compare these two representations, as well as combinations thereof. The evaluation results on a standard dataset show a sizeable improvement over the state of the art both for the standard F1 as well as for a BCubed version thereof, which we argue is more suitable for the task.
2019
CEUR Workshop Proceedings
47
52
Staykovski T., Barron-Cedeno A., Da San Martino G., Nakov P. (2019). Dense vs. Sparse representations for news stream clustering. CEUR-WS.
Staykovski T.; Barron-Cedeno A.; Da San Martino G.; Nakov P.
File in questo prodotto:
File Dimensione Formato  
paper6.pdf

accesso aperto

Tipo: Versione (PDF) editoriale
Licenza: Licenza per Accesso Aperto. Creative Commons Attribuzione (CCBY)
Dimensione 426.79 kB
Formato Adobe PDF
426.79 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11585/709220
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 10
  • ???jsp.display-item.citation.isi??? ND
social impact