Prasad, A.S., İslamoğlu, G., Bertaccini, L., Rossi, D., Conti, F., Benini, L. (2025). PACE-Lite: Compact and Efficient Piecewise Polynomial Approximation for Transformer Nonlinearity Acceleration [10.1109/iccd65941.2025.00022].

PACE-Lite: Compact and Efficient Piecewise Polynomial Approximation for Transformer Nonlinearity Acceleration

Prasad, Arpan Suravi; İslamoğlu, Gamze; Bertaccini, Luca; Rossi, Davide; Conti, Francesco; Benini, Luca
2025

Abstract

The widespread adoption of transformer and Deep Neural Network (DNN) models is driving applications like Generative AI (GenAI) and Contextual AI (ContextualAI). While dominated by matrix operations, these models also rely on complex nonlinear functions beyond ReLU, which are critical for accuracy. As core linear algebra has been heavily optimized in hardware, nonlinear computation is becoming the new bottleneck. With evolving applications introducing new nonlinearities, there is a growing need for a hardware solution that is both efficient and adaptable, as software emulation is insufficient to meet performance demands. To this end, we propose PACE-lite, a lightweight, highly parametric datapath designed to approximate a diverse range of nonlinear functions using Piecewise Polynomial Approximation (PwPA) with configurable degree and partition count. PACE-lite leverages a lightweight integer datapath to achieve 70–81% area savings over FP32 Floating Point Fused Multiply-Add (FP-FMA) implementations, with a tunable tradeoff in accuracy. When evaluated on state-of-the-art pretrained Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and a Large Language Model (LLM), PACE-lite configurations achieve approximation errors ranging from 1% (for minimal-area designs) to as low as 0.01% (for higher-precision variants), all without fine-tuning. PACE-lite, integrated as a low-overhead (5.9%) hardware accelerator in a RISC-V cluster, delivers 7.9/15.6/15.6 GPolyEval/s at an energy efficiency of 5.2/3.8/3.8 pJ/PolyEval for FP32/FP16/BFP16 respectively. This results in system-level performance improvements of 44.1× in throughput and 16.7× in energy efficiency, outperforming existing FP solutions by 3.5× and 3.1× respectively.
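The abstract's core technique, Piecewise Polynomial Approximation (PwPA) with configurable degree and partition count, can be illustrated in software. The following is a minimal NumPy sketch, not the paper's integer datapath: it fits one low-degree polynomial per uniform partition of the input range and evaluates via Horner's scheme. The target function (a tanh-based GELU approximation), the uniform partitioning, the range [−4, 4], and all function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fit_pwpa(fn, lo, hi, partitions, degree):
    """Fit one degree-`degree` polynomial per uniform partition of [lo, hi].

    Illustrative least-squares fit; the paper's coefficient generation
    and number formats may differ.
    """
    edges = np.linspace(lo, hi, partitions + 1)
    coeffs = []
    for a, b in zip(edges[:-1], edges[1:]):
        x = np.linspace(a, b, 64)
        coeffs.append(np.polyfit(x, fn(x), degree))
    return edges, coeffs

def eval_pwpa(x, edges, coeffs):
    """Pick the partition for each input, then evaluate via Horner's scheme."""
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1,
                  0, len(coeffs) - 1)
    out = np.empty_like(x, dtype=float)
    for i, c in enumerate(coeffs):
        mask = idx == i
        out[mask] = np.polyval(c, x[mask])  # polyval uses Horner internally
    return out

# Example nonlinearity: the common tanh-based GELU approximation.
gelu = lambda x: 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi)
                                          * (x + 0.044715 * x**3)))

edges, coeffs = fit_pwpa(gelu, -4.0, 4.0, partitions=16, degree=2)
x = np.linspace(-4.0, 4.0, 1001)
err = np.max(np.abs(eval_pwpa(x, edges, coeffs) - gelu(x)))
```

With 16 partitions and degree 2, the maximum error over the fitted range is already small; raising the degree or partition count tightens it further, which is the area-versus-accuracy knob the abstract describes.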
2025 IEEE 43rd International Conference on Computer Design (ICCD), pp. 111–118

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1040897