Wang, R., Islamoglu, G., Belano, A., Potocnik, V., Conti, F., Garofalo, A., et al. (2025). VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers. Los Alamitos, CA, USA: IEEE. doi:10.1109/arith64983.2025.00016
VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers
Belano, Andrea; Conti, Francesco; Garofalo, Angelo; Benini, Luca
2025
Abstract
While Transformers are dominated by Floating-Point (FP) Matrix Multiplications, their aggressive acceleration through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions like Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. To address this, we design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph's method, and we integrate it, through custom Instruction Set Architecture (ISA) extensions, into the Floating-Point Unit (FPU) of the RISC-V cores [1] of a compute cluster, with a negligible area overhead of 1%. By optimizing the software kernels to leverage the extension, we execute Softmax with 162.7× less latency and 74.3× less energy compared to the baseline cluster, achieving an 8.2× performance improvement and 4.1× higher energy efficiency for the FlashAttention-2 kernel in a GPT-2 configuration. Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3 and ViT, achieving up to 5.8× and 3.6× reduction in latency and energy consumption, respectively, without requiring re-training and with negligible accuracy loss.
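For context, Schraudolph's method approximates exp(x) by exploiting the floating-point bit layout: since exp(x) = 2^(x/ln 2), scaling x by 2^23/ln 2 and adding the shifted exponent bias places the result directly in the exponent field of an IEEE 754 word, so the whole operation reduces to one multiply-add and a bit reinterpretation. The C sketch below is a generic single-precision illustration of that idea feeding a numerically stable Softmax; it is not the refined Bfloat16 algorithm, the VEXP ISA extension, or the optimized kernels described in the paper, and the function names and the error-tuning constant are assumptions made for illustration only.

    #include <stdint.h>
    #include <string.h>
    #include <math.h>
    #include <stdio.h>

    /* Schraudolph-style exponential: exp(x) = 2^(x / ln 2), so scale x by
     * 2^23 / ln 2, add the exponent bias shifted into the exponent field,
     * and reinterpret the integer result as a float. */
    static float fast_exp(float x) {
        const float scale = 8388608.0f / 0.6931472f; /* 2^23 / ln(2) */
        const float bias  = 127.0f * 8388608.0f;     /* exponent bias << 23 */
        const float corr  = 486411.0f;               /* error-tuning constant (assumed) */
        int32_t bits = (int32_t)(scale * x + (bias - corr));
        float y;
        memcpy(&y, &bits, sizeof y);                 /* bit reinterpretation */
        return y;                                    /* a few % relative error, |x| well below ~87 */
    }

    /* Numerically stable Softmax over n values; the per-element exponential is
     * the step that dominates once the matrix multiplications are accelerated. */
    static void softmax(const float *in, float *out, int n) {
        float max = in[0];
        for (int i = 1; i < n; i++) if (in[i] > max) max = in[i];
        float sum = 0.0f;
        for (int i = 0; i < n; i++) { out[i] = fast_exp(in[i] - max); sum += out[i]; }
        for (int i = 0; i < n; i++) out[i] /= sum;
    }

    int main(void) {
        float logits[4] = {1.0f, 2.0f, 3.0f, 4.0f}, probs[4];
        softmax(logits, probs, 4);
        for (int i = 0; i < 4; i++)
            printf("softmax[%d] = %.4f\n", i, probs[i]);
        printf("fast_exp(1.0) = %.4f, expf(1.0) = %.4f\n", fast_exp(1.0f), expf(1.0f));
        return 0;
    }

The appeal of this family of approximations for hardware is that it needs no table lookups or polynomial evaluation, only a fused multiply-add and an integer reinterpretation, which is why a refined variant can be folded into an FPU at very low area cost.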


