

Domeniconi, G., Moro, G., Pasolini, R., Sartori, C. (2015). A study on term weighting for text categorization: A novel supervised variant of tf.idf. SciTePress [10.5220/0005511900260037].

A study on term weighting for text categorization: A novel supervised variant of tf.idf

Domeniconi, Giacomo; Moro, Gianluca; Pasolini, Roberto; Sartori, Claudio

2015

Abstract

In text categorization and other data mining tasks, suitable term weighting methods can bring a substantial boost in effectiveness. Several term weighting methods have been presented in the literature, based on assumptions commonly derived from observing the distribution of words in documents. For example, the idf assumption states that words appearing in many documents are usually less important than less frequent ones. In contrast to tf.idf and other weighting methods derived from information retrieval, more recently proposed schemes are supervised, i.e. based on knowledge of the membership of training documents in categories. We propose here a supervised variant of the tf.idf scheme, based on computing the usual idf factor without considering documents of the category to be recognized, so that the importance of terms appearing frequently only within that category is not underestimated. A further proposed variant is additionally based on relevance frequency, considering occurrences of words within the category itself. In extensive experiments on two commonly used text collections with several unsupervised and supervised weighting schemes, we show that the proposed schemes generally perform better than or comparably to the others in terms of accuracy, using two different learning methods.
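
The core idea of the first variant, computing idf only over training documents outside the category to be recognized, can be illustrated with a minimal Python sketch. The abstract gives no formula, so everything here is an assumption made for illustration: the function names, the toy documents, and in particular the smoothing (adding 1 to the numerator and flooring the document frequency at 1). The relevance-frequency variant is not sketched.

```python
import math
from collections import Counter


def idf_excluding_category(docs, labels, category):
    """idf computed only over training documents NOT in `category`.

    Assumed smoothing (not taken from the abstract): +1 in the numerator
    and a floor of 1 on the document frequency, which avoids division by
    zero for terms that never occur outside the category.
    """
    outside = [set(doc) for doc, label in zip(docs, labels) if label != category]
    n_outside = len(outside)
    df_outside = Counter(term for doc in outside for term in doc)
    vocabulary = {term for doc in docs for term in doc}
    return {
        term: math.log((n_outside + 1) / max(1, df_outside[term]))
        for term in vocabulary
    }


def standard_idf(docs):
    """Plain unsupervised idf over all documents, for comparison."""
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(len(docs) / count) for term, count in df.items()}


def tf_idf(doc, idf):
    """Raw term frequency times the supplied idf table."""
    tf = Counter(doc)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}


# Hypothetical toy training set: tokenized documents with category labels.
docs = [
    ["goal", "match", "team"],
    ["goal", "team", "coach"],
    ["market", "stock", "team"],
    ["market", "bank"],
]
labels = ["sport", "sport", "finance", "finance"]

idf_sport = idf_excluding_category(docs, labels, "sport")
idf_plain = standard_idf(docs)
weights = tf_idf(["goal", "goal", "market"], idf_sport)
```

In this toy set, "goal" occurs in both "sport" documents and nowhere else. Plain idf lowers its weight because it appears in half of all documents, whereas the category-excluding idf ignores those occurrences, so the term is not underestimated when recognizing "sport".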


Year: 2015
Volume title: DATA 2015 - 4th International Conference on Data Management Technologies and Applications, Proceedings
First page: 26
Last page: 37
DOI: https://dx.doi.org/10.5220/0005511900260037
Citation: Domeniconi, G., Moro, G., Pasolini, R., Sartori, C. (2015). A study on term weighting for text categorization: A novel supervised variant of tf.idf. SciTePress [10.5220/0005511900260037].
All authors: Domeniconi, Giacomo; Moro, Gianluca; Pasolini, Roberto; Sartori, Claudio
Publication type: 4.01 Conference proceedings contribution


Use this identifier to cite or link to this document: https://hdl.handle.net/11585/545299
