A Flexible Template for Edge Generative AI With High-Accuracy Accelerated Softmax and GELU

Belano, Andrea; Tortorella, Yvan; Garofalo, Angelo; Benini, Luca; Rossi, Davide; Conti, Francesco

doi:10.1109/jetcas.2025.3562734

Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need for sophisticated non-linearities such as softmax and GELU. Even though Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially when dedicated hardware is used to accelerate the MatMul operators. In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256 KiB of shared SRAM, 8 general-purpose RISC-V cores, a 24 × 8 systolic-array MatMul accelerator, and SoftEx, a novel accelerator for the softmax, GELU, and SiLU non-linearities of Transformers. SoftEx introduces an approximate exponentiation algorithm that balances efficiency (121× speedup over glibc’s implementation) with accuracy (mean relative error of 0.14%). In 12 nm technology, SoftEx occupies 0.039 mm², only 3.22% of the cluster area, and the cluster achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx accelerates softmax and GELU computations by up to 10.8× and 5.11×, respectively, while reducing their energy consumption by up to 10.8× and 5.29×. These improvements translate into a 1.5× increase in throughput (310 GOPS at 0.8 V) and a 1.42× improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.
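The abstract does not detail SoftEx's exponentiation algorithm or how it formulates GELU. To illustrate the general idea behind approximate exponentiation in BFloat16, the Python sketch below uses Schraudolph's classic decomposition exp(x) = 2^n · 2^f. Here the integer n fills the exponent field and 2^f is approximated for the mantissa. The sketch then applies this exponential to softmax, SiLU, and GELU. It is a minimal illustration under stated assumptions, not the SoftEx datapath. The quadratic mantissa correction, the sigmoid-based GELU approximation x·σ(1.702x), the round-to-nearest-even BF16 conversion, and all function names are hypothetical choices made for this example.

```python
import math
import struct

LOG2E = 1.0 / math.log(2.0)  # exp(x) == 2 ** (x * LOG2E)


def to_bf16(x: float) -> float:
    """Round to the nearest BFloat16 value (via float32, ties to even; NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> 16) & 1                      # lowest kept bit, for ties-to-even
    bits = (bits + 0x7FFF + lsb) & 0xFFFF0000   # round, then drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def exp_approx(x: float, corrected: bool = True) -> float:
    """Schraudolph-style exp: split 2**t into exponent and mantissa parts."""
    t = x * LOG2E
    n = math.floor(t)        # integer part -> exponent field
    f = t - n                # fraction in [0, 1) -> approximated mantissa
    if corrected:
        # Hypothetical quadratic fit of 2**f, exact at f = 0 and f = 1 (not SoftEx's)
        m = 1.0 + f * (0.6569 + 0.3431 * f)
    else:
        m = 1.0 + f          # plain Schraudolph: linear mantissa
    return math.ldexp(m, n)  # m * 2**n, i.e. assembling exponent and mantissa


def softmax_approx(xs):
    """Numerically stable softmax with approximate exp and BF16 rounding."""
    xs = [to_bf16(x) for x in xs]
    mx = max(xs)                                   # subtract max so exp args <= 0
    es = [to_bf16(exp_approx(x - mx)) for x in xs]
    total = sum(es)                                # accumulated in float64 here
    return [to_bf16(e / total) for e in es]


def sigmoid_approx(x: float) -> float:
    """Logistic function built on exp_approx; the branches avoid overflow."""
    if x >= 0.0:
        return 1.0 / (1.0 + exp_approx(-x))
    e = exp_approx(x)
    return e / (1.0 + e)


def silu_approx(x: float) -> float:
    return to_bf16(x * sigmoid_approx(x))


def gelu_approx(x: float) -> float:
    # Assumption: sigmoid-based GELU approximation x * sigmoid(1.702 * x)
    return to_bf16(x * sigmoid_approx(1.702 * x))


if __name__ == "__main__":
    grid = [-10.0 + i * 0.001 for i in range(10001)]
    for corrected in (False, True):
        errs = [abs(exp_approx(x, corrected) - math.exp(x)) / math.exp(x) for x in grid]
        print(f"corrected={corrected}: mean rel err {sum(errs) / len(errs):.4%}, "
              f"max {max(errs):.4%}")
    print(softmax_approx([1.0, 2.0, 3.0, 0.5]))
    print(gelu_approx(1.0), silu_approx(1.0))
```

Running the script prints the mean and maximum relative error of the linear and corrected exponentials on [-10, 0]. These figures describe only this toy approximation and say nothing about the 0.14% error reported for SoftEx. Subtracting the maximum before exponentiating keeps every softmax argument non-positive, so the exponential never overflows.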

Belano, A., Tortorella, Y., Garofalo, A., Benini, L., Rossi, D., Conti, F. (2025). A Flexible Template for Edge Generative AI With High-Accuracy Accelerated Softmax and GELU. IEEE JOURNAL OF EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 15(2), 200-216 [10.1109/jetcas.2025.3562734].



Year: 2025
Journal: IEEE JOURNAL OF EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS
DOI: https://dx.doi.org/10.1109/jetcas.2025.3562734
			


Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1025801

Warning: the data displayed have not been validated by the university.

