Belano, A., Tortorella, Y., Garofalo, A., Benini, L., Rossi, D., & Conti, F. (2025). SoftEx: A Low Power and Flexible Softmax Accelerator with Fast Approximate Exponentiation. Institute of Electrical and Electronics Engineers Inc. doi:10.23919/date64628.2025.10993043
SoftEx: A Low Power and Flexible Softmax Accelerator with Fast Approximate Exponentiation
Belano, Andrea; Tortorella, Yvan; Garofalo, Angelo; Benini, Luca; Rossi, Davide; Conti, Francesco
2025
Abstract
Transformer-based models excel in NLP, vision, and audio processing, but the softmax operator can be a performance bottleneck, especially alongside optimized matrix-multiplication hardware. We introduce SoftEx, a parametric accelerator for BF16 softmax that uses approximate exponentiation (<0.14% relative error) to speed up softmax computation. Integrated into a 12 nm octa-core RISC-V cluster together with a matrix-multiplication systolic array, SoftEx reduces the time and energy of attention probability computation by up to 10.8x and 26.8x, boosting MobileBERT throughput by 2.17x to 324 GOPS, or 1.30 TOPS/W.
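The abstract does not detail SoftEx's exponentiation algorithm, but one well-known family of fast, hardware-friendly exp approximations works by writing a scaled input directly into the exponent field of a floating-point number (Schraudolph's method). The sketch below is an illustrative stand-in only, not the paper's algorithm: the names fast_exp and softmax_approx are hypothetical, and this basic variant has a larger relative error than SoftEx's reported 0.14%. It shows where an approximate exponential slots into a numerically stable softmax.

import numpy as np

def fast_exp(x: np.ndarray) -> np.ndarray:
    # Schraudolph-style approximation: write a*x + b into the int32 view
    # of a float32, so the exponent field encodes x / ln(2).
    a = (1 << 23) / np.log(2.0)        # scales x into the exponent field
    b = 127.0 * (1 << 23)              # float32 exponent bias, pre-shifted
    i = (a * x + b).astype(np.int32)   # raw float32 bit pattern
    return i.view(np.float32)

def softmax_approx(scores: np.ndarray) -> np.ndarray:
    # Numerically stable softmax built on the approximate exponential.
    x = scores.astype(np.float32)
    x = x - x.max(axis=-1, keepdims=True)   # max subtraction: all x <= 0
    x = np.maximum(x, -87.0)                # keep a*x + b in float32 range
    e = fast_exp(x)
    return e / e.sum(axis=-1, keepdims=True)

row = np.array([1.0, 2.0, 3.0])
print(softmax_approx(row))   # close to the exact softmax [0.090, 0.245, 0.665]

The appeal of this style of approximation for an accelerator is that it replaces a transcendental function with a multiply, an add, and a bit-level reinterpretation, all cheap in fixed hardware; higher accuracy (as in SoftEx) requires additional correction of the mantissa.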


