CRIS Current Research Information System

The abundance of news being generated on a daily basis has made it hard, if not impossible, to monitor all news developments. Thus, there is an increasing need for accurate tools that can organize the news for easier exploration. Typically, this means clustering the news stream, and then connecting the clusters into story lines. Here, we focus on the clustering step, using a local topic graph and a community detection algorithm. Traditionally, news clustering was done using sparse vector representations with TF–IDF weighting, but more recently dense representations have emerged as a popular alternative. Here, we compare these two representations, as well as combinations thereof. The evaluation results on a standard dataset show a sizeable improvement over the state of the art both for the standard F1 as well as for a BCubed version thereof, which we argue is more suitable for the task.

Staykovski T., Barron-Cedeno A., Da San Martino G., Nakov P. (2019). Dense vs. Sparse representations for news stream clustering. CEUR-WS.

Dense vs. Sparse representations for news stream clustering

Staykovski T.;Barron-Cedeno A.;Da San Martino G.;Nakov P.

2019

Abstract

The abundance of news being generated on a daily basis has made it hard, if not impossible, to monitor all news developments. Thus, there is an increasing need for accurate tools that can organize the news for easier exploration. Typically, this means clustering the news stream, and then connecting the clusters into story lines. Here, we focus on the clustering step, using a local topic graph and a community detection algorithm. Traditionally, news clustering was done using sparse vector representations with TF–IDF weighting, but more recently dense representations have emerged as a popular alternative. Here, we compare these two representations, as well as combinations thereof. The evaluation results on a standard dataset show a sizeable improvement over the state of the art both for the standard F1 as well as for a BCubed version thereof, which we argue is more suitable for the task.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2019
			
	Titolo del volume
	
				CEUR Workshop Proceedings
			
	Pagina iniziale
	
				47
			
	Pagina finale
	
				52
			
	Collana/Serie
	
				CEUR WORKSHOP PROCEEDINGS
			
	Citazione
	
				Staykovski T.,  Barron-Cedeno A.,  Da San Martino G.,  Nakov P. (2019). Dense vs. Sparse representations for news stream clustering. CEUR-WS.
			
	Tutti gli autori
	
						Staykovski T.; Barron-Cedeno A.; Da San Martino G.; Nakov P.
					
	Appare nelle tipologie:
	
				4.01 Contributo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
paper6.pdf accesso aperto Tipo: Versione (PDF) editoriale Licenza: Licenza per Accesso Aperto. Creative Commons Attribuzione (CCBY) Dimensione 426.79 kB Formato Adobe PDF Visualizza/Apri	426.79 kB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11585/709220

Citazioni

ND

10

ND

social impact