Prasad, A.S., İslamoğlu, G., Bertaccini, L., Rossi, D., Conti, F., Benini, L. (2025). PACE-Lite: Compact and Efficient Piecewise Polynomial Approximation for Transformer Nonlinearity Acceleration. doi: 10.1109/iccd65941.2025.00022.
PACE-Lite: Compact and Efficient Piecewise Polynomial Approximation for Transformer Nonlinearity Acceleration
Bertaccini, Luca; Rossi, Davide; Conti, Francesco; Benini, Luca
2025
Abstract
The widespread adoption of transformer and Deep Neural Network (DNN) models is driving applications like Generative AI (GenAI) and Contextual AI (ContextualAI). While dominated by matrix operations, these models also rely on complex nonlinear functions beyond ReLU, which are critical for accuracy. As core linear algebra has been heavily optimized in hardware, nonlinear computation is becoming the new bottleneck. With evolving applications introducing new nonlinearities, there is a growing need for a hardware solution that is both efficient and adaptable, as software emulation is insufficient to meet performance demands. To this end, we propose PACE-lite, a lightweight, highly parametric datapath designed to approximate a diverse range of nonlinear functions using Piecewise Polynomial Approximation (PwPA) with configurable degree and partition count. PACE-lite leverages a lightweight integer datapath to achieve 70−81% area savings over FP32 Floating Point Fused Multiply-Add (FP-FMA) implementations, with a tunable tradeoff in accuracy. When evaluated on state-of-the-art pretrained Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and a Large Language Model (LLM), PACE-lite configurations achieve approximation errors ranging from 1% (for minimal-area designs) to as low as 0.01% (for higher-precision variants), all without fine-tuning. PACE-lite, integrated as a low-overhead (5.9%) hardware accelerator in a RISC-V cluster, delivers 7.9/15.6/15.6 GPolyEval/s at an energy efficiency of 5.2/3.8/3.8 pJ/PolyEval for FP32/FP16/BFP16 respectively. This results in system-level performance improvements of 44.1× in throughput and 16.7× in energy efficiency, outperforming existing FP solutions by 3.5× and 3.1× respectively.
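For readers unfamiliar with the technique named in the abstract, the following is a minimal sketch of Piecewise Polynomial Approximation (PwPA): the input range is split into uniform partitions, each storing the coefficients of a low-degree polynomial evaluated in Horner form. All names and parameters here are illustrative assumptions, not PACE-lite's actual datapath; degree 1 is used for brevity, although the same Horner loop handles any configured degree and partition count.

```python
import math

def horner(coeffs, x):
    """Evaluate a polynomial; coefficients ordered highest degree first."""
    y = 0.0
    for c in coeffs:
        y = y * x + c
    return y

def build_pwpa_table(fn, lo, hi, partitions):
    """Fit one linear segment per partition through its endpoints.

    Illustrative only: a real design would fit higher-degree minimax
    polynomials and store fixed-point coefficients.
    """
    step = (hi - lo) / partitions
    table = []
    for i in range(partitions):
        x0 = lo + i * step
        x1 = x0 + step
        y0, y1 = fn(x0), fn(x1)
        a1 = (y1 - y0) / step      # slope
        a0 = y0 - a1 * x0          # intercept
        table.append([a1, a0])     # highest degree first, for horner()
    return table

def pwpa_eval(table, lo, hi, x):
    """Clamp x to the approximated range, select a partition, run Horner."""
    partitions = len(table)
    x = min(max(x, lo), hi)
    step = (hi - lo) / partitions
    i = min(int((x - lo) / step), partitions - 1)
    return horner(table[i], x)

# Example: tanh on [-4, 4] with 64 partitions stays within a few 1e-3
# of the exact function, using only one multiply-add per output.
table = build_pwpa_table(math.tanh, -4.0, 4.0, 64)
max_err = max(abs(pwpa_eval(table, -4.0, 4.0, x / 100.0) - math.tanh(x / 100.0))
              for x in range(-400, 401))
```

Increasing the partition count or the polynomial degree tightens the error at the cost of a larger coefficient table, which is the area/accuracy tradeoff the abstract refers to.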


