CRIS Current Research Information System

Despite their growing capabilities, language models still frequently reproduce content from their training data, generate repetitive text, and favor common grammatical patterns and vocabulary. A possible cause is the decoding strategy: the most common strategies either consider only the most probable tokens, which reduces output diversity, or increase the likelihood of unlikely tokens, compromising output accuracy and correctness. In this paper, we propose DiffSampling, a new decoding method that leverages a mathematical analysis of the token probability distribution to ensure the generation of contextually appropriate text. In particular, the difference between consecutive, sorted probabilities can be used to truncate incorrect tokens. In addition, we also propose two variations of the proposed method that aim to correct the subtle inconsistencies of common sampling strategies. Experiments involving four different text-generation tasks demonstrate that our approach consistently performs at least on par with the existing methods it builds upon in terms of quality, despite sampling from a larger set of tokens.

Franceschelli, G., Musolesi, M. (2025). DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation. TRANSACTIONS ON MACHINE LEARNING RESEARCH, December 2025, 1-48.

DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation

Franceschelli Giorgio^Primo;Musolesi Mirco^Secondo

2025

Abstract

Despite their growing capabilities, language models still frequently reproduce content from their training data, generate repetitive text, and favor common grammatical patterns and vocabulary. A possible cause is the decoding strategy: the most common strategies either consider only the most probable tokens, which reduces output diversity, or increase the likelihood of unlikely tokens, compromising output accuracy and correctness. In this paper, we propose DiffSampling, a new decoding method that leverages a mathematical analysis of the token probability distribution to ensure the generation of contextually appropriate text. In particular, the difference between consecutive, sorted probabilities can be used to truncate incorrect tokens. In addition, we also propose two variations of the proposed method that aim to correct the subtle inconsistencies of common sampling strategies. Experiments involving four different text-generation tasks demonstrate that our approach consistently performs at least on par with the existing methods it builds upon in terms of quality, despite sampling from a larger set of tokens.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2025
			
	Rivista
	
				TRANSACTIONS ON MACHINE LEARNING RESEARCH
			
	Citazione
	
				Franceschelli, G., Musolesi, M. (2025). DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation. TRANSACTIONS ON MACHINE LEARNING RESEARCH, December 2025, 1-48.
			
	Tutti gli autori
	
						Franceschelli, Giorgio; Musolesi, Mirco
					
	Appare nelle tipologie:
	
				1.01 Articolo in rivista

File in questo prodotto:

File	Dimensione	Formato
DiffSampling__Enhancing_Diversity_and_Accuracy_in_Neural_Text_Generation-CameraReady.pdf accesso aperto Tipo: Versione (PDF) editoriale / Version Of Record Licenza: Licenza per Accesso Aperto. Creative Commons Attribuzione (CCBY) Dimensione 1.11 MB Formato Adobe PDF Visualizza/Apri	1.11 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11585/1034165

Citazioni

ND

ND

ND

ND

social impact