Ragazzi, L., Italiani, P., Moro, G., & Panni, M. (2024). What Are You Token About? Differentiable Perturbed Top-k Token Selection for Scientific Document Summarization. In Findings of the Association for Computational Linguistics: ACL 2024. https://doi.org/10.18653/v1/2024.findings-acl.561
What Are You Token About? Differentiable Perturbed Top-k Token Selection for Scientific Document Summarization
Luca Ragazzi; Paolo Italiani; Gianluca Moro
2024
Abstract
Scientific document summarization aims to condense complex and long articles in both technical and plain-language terms to facilitate the accessibility and dissemination of scientific findings. Existing datasets suffer from a deficiency in source heterogeneity, as their data predominantly stem from a single common resource, hindering effective model training and generalizability. First, we introduce SciLay, a novel dataset that includes documents from multiple natural science journals with expert-authored technical and lay summaries. Second, we propose PrunePert, a new transformer-based model that incorporates a differentiable perturbed top-k encoder layer to prune irrelevant tokens in end-to-end learning. Experimental results show that our model achieves a nearly 2x speed-up compared to a state-of-the-art linear transformer, remaining comparable in effectiveness. Additional examinations underscore the importance of employing a training dataset that includes different sources to enhance the generalizability of the models. Code is available at https://github.com/disi-unibo-nlp/sci-lay.
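
The abstract's core mechanism, a differentiable perturbed top-k layer for pruning irrelevant tokens, can be illustrated with a short PyTorch sketch. The following is a minimal, hypothetical implementation in the spirit of perturbed-optimizer top-k (Gaussian-perturbed scores whose hard top-k indicators are averaged in the forward pass and differentiated with a noise-based estimator in the backward pass); it is not the authors' PrunePert code, and names such as PerturbedTopKFunction, PerturbedTopKTokenPruner, num_samples, and sigma are illustrative assumptions.

import torch
import torch.nn.functional as F


class PerturbedTopKFunction(torch.autograd.Function):
    # Differentiable perturbed top-k: the forward pass averages hard top-k
    # indicators over Gaussian-perturbed scores; the backward pass uses the
    # estimator  d/ds E[topk(s + sigma*Z)] ≈ E[topk(s + sigma*Z) Z^T] / sigma.

    @staticmethod
    def forward(ctx, scores, k, num_samples, sigma):
        # scores: (batch, seq_len) -- one relevance score per token
        b, n = scores.shape
        noise = torch.randn(num_samples, b, n, device=scores.device, dtype=scores.dtype)
        perturbed = scores.unsqueeze(0) + sigma * noise        # (m, b, n)
        topk_idx = perturbed.topk(k, dim=-1).indices           # (m, b, k)
        hard = F.one_hot(topk_idx, n).to(scores.dtype)         # (m, b, k, n)
        indicators = hard.mean(dim=0)                          # (b, k, n), soft
        ctx.save_for_backward(noise, hard)
        ctx.sigma, ctx.num_samples = sigma, num_samples
        return indicators

    @staticmethod
    def backward(ctx, grad_output):
        noise, hard = ctx.saved_tensors
        # grad_output: (b, k, n); return gradient w.r.t. scores, shape (b, n)
        grad_scores = torch.einsum("bkn,mbkn,mbd->bd", grad_output, hard, noise)
        grad_scores = grad_scores / (ctx.num_samples * ctx.sigma)
        return grad_scores, None, None, None


class PerturbedTopKTokenPruner(torch.nn.Module):
    # Scores tokens with a linear head, then keeps a soft selection of the
    # top-k token embeddings while remaining differentiable end to end.
    def __init__(self, hidden_size, k, num_samples=100, sigma=0.05):
        super().__init__()
        self.scorer = torch.nn.Linear(hidden_size, 1)
        self.k, self.num_samples, self.sigma = k, num_samples, sigma

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, hidden_size)
        scores = self.scorer(token_embeddings).squeeze(-1)      # (b, n)
        indicators = PerturbedTopKFunction.apply(
            scores, self.k, self.num_samples, self.sigma)       # (b, k, n)
        # Each of the k output slots is a convex combination of token
        # embeddings, concentrated on the highest-scoring tokens.
        return torch.einsum("bkn,bnd->bkd", indicators, token_embeddings)

In an encoder, a layer of this kind could sit between transformer blocks so that only k token representations are passed to the subsequent layers, which is consistent with the speed-up over processing the full sequence that the abstract reports.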