Garofalo, A., Tortorella, Y., Perotti, M., Valente, L., Nadalini, A., Benini, L., et al. (2022). Darkside: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training. IEEE Open Journal of the Solid-State Circuits Society, 1, 1-1. doi: 10.1109/OJSSCS.2022.3210082
Darkside: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training
Garofalo, Angelo; Tortorella, Yvan; Valente, Luca; Nadalini, Alessandro; Benini, Luca; Rossi, Davide; Conti, Francesco
2022
Abstract
On-chip DNN inference and training at the Extreme Edge (TinyML) impose strict latency, throughput, accuracy, and flexibility requirements. Heterogeneous clusters are promising solutions to meet this challenge, combining the flexibility of DSP-enhanced cores with the performance and energy boost of dedicated accelerators. We present Darkside, a System-on-Chip with a heterogeneous cluster of 8 RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost performance and efficiency on key compute-intensive Deep Neural Network (DNN) kernels, the cluster is enriched with three digital accelerators: a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); a minimal-overhead datamover to marshal 1-b to 32-b data on the fly; and a 16-b floating-point Tensor Product Engine (TPE) for tiled matrix-multiplication acceleration. Darkside is implemented in 65-nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W when working on 2-b integer DNN kernels. When targeting floating-point tensor operations, the TPE provides up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency, enough to enable on-chip floating-point training at competitive speed coupled with ultra-low-power quantized inference.
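For context on the kind of workload the TPE is described as accelerating, the sketch below shows a plain-C tiled matrix multiplication. It is only a minimal illustration: the tile size, the `fp16_t` stand-in type, and all function names are assumptions for this example and do not reflect the Darkside programming interface.

```c
/*
 * Minimal sketch (NOT the Darkside API): a tiled 16-bit floating-point
 * matrix multiplication, the class of kernel a Tensor Product Engine
 * would accelerate. Tile size and the fp16_t typedef are illustrative
 * assumptions only.
 */
#include <stdio.h>

typedef float fp16_t;   /* stand-in type; real hardware would use IEEE binary16 */

#define TILE 4           /* illustrative tile edge, not a Darkside parameter */

/* C[MxN] += A[MxK] * B[KxN], processed tile by tile so each working set
 * stays small enough for a local scratchpad buffer. */
static void matmul_tiled(int M, int N, int K,
                         const fp16_t *A, const fp16_t *B, fp16_t *C)
{
    for (int i0 = 0; i0 < M; i0 += TILE)
        for (int j0 = 0; j0 < N; j0 += TILE)
            for (int k0 = 0; k0 < K; k0 += TILE)
                /* Inner tile: the unit of work one would offload to a
                 * matrix engine instead of running it on the cores. */
                for (int i = i0; i < i0 + TILE && i < M; i++)
                    for (int j = j0; j < j0 + TILE && j < N; j++) {
                        fp16_t acc = C[i * N + j];
                        for (int k = k0; k < k0 + TILE && k < K; k++)
                            acc += A[i * K + k] * B[k * N + j];
                        C[i * N + j] = acc;
                    }
}

int main(void)
{
    enum { M = 8, N = 8, K = 8 };
    fp16_t A[M * K], B[K * N], C[M * N] = {0};
    for (int i = 0; i < M * K; i++) A[i] = (fp16_t)(i % 3);
    for (int i = 0; i < K * N; i++) B[i] = (fp16_t)(i % 5);
    matmul_tiled(M, N, K, A, B, C);
    printf("C[0]=%.1f C[last]=%.1f\n", (double)C[0], (double)C[M * N - 1]);
    return 0;
}
```

In a cluster of this kind, the inner tile loop would presumably be offloaded to the accelerator while a data-marshaling engine stages the tiles in local memory; the sketch only fixes the loop structure on the host side.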
File | Type | License | Size | Format
---|---|---|---|---
Darkside_A_Heterogeneous_RISC-V_Compute_Cluster_for_Extreme-Edge_On-Chip_DNN_Inference_and_Training.pdf | Publisher's version (PDF), open access | Creative Commons Attribution (CC BY) | 8.93 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.