Specialized coprocessors for Multiply-Accumulate (MAC) intensive workloads such as Deep Learning are becoming widespread in SoC platforms, from GPUs to mobile SoCs. In this paper we revisit NTX (an efficient accelerator developed for training Deep Neural Networks at scale) as a generalized MAC and reduction streaming engine. The architecture consists of a set of 32 bit floating-point streaming co-processors that are loosely coupled to a RISC-V core in charge of orchestrating data movement and computation. Post-layout results of a recent silicon implementation in 22 nm FD-SOI technology show the accelerator’s capability to deliver up to 20 Gflop/s at 1.25 GHz and 168 mW. Based on these results we show that a version of NTX scaled down to 14 nm can achieve a 3× energy efficiency improvement over contemporary GPUs at 10.4× less silicon area, and a compute performance of 1.4 Tflop/s for training large state-of-the-art networks with full floating-point precision. An extended evaluation of MAC-intensive kernels shows that NTX can consistently achieve up to 87% of its peak performance across general reduction workloads beyond machine learning. Its modular architecture enables deployment at different scales ranging from high-performance GPU-class to low-power embedded scenarios

Schuiki F., Schaffner M., Benini L. (2019). NTX: An Energy-efficient Streaming Accelerator for Floating-point Generalized Reduction Workloads in 22 nm FD-SOI. Institute of Electrical and Electronics Engineers Inc. [10.23919/DATE.2019.8715007].

NTX: An Energy-efficient Streaming Accelerator for Floating-point Generalized Reduction Workloads in 22 nm FD-SOI

Benini L.
2019

Abstract

Specialized coprocessors for Multiply-Accumulate (MAC) intensive workloads such as Deep Learning are becoming widespread in SoC platforms, from GPUs to mobile SoCs. In this paper we revisit NTX (an efficient accelerator developed for training Deep Neural Networks at scale) as a generalized MAC and reduction streaming engine. The architecture consists of a set of 32 bit floating-point streaming co-processors that are loosely coupled to a RISC-V core in charge of orchestrating data movement and computation. Post-layout results of a recent silicon implementation in 22 nm FD-SOI technology show the accelerator’s capability to deliver up to 20 Gflop/s at 1.25 GHz and 168 mW. Based on these results we show that a version of NTX scaled down to 14 nm can achieve a 3× energy efficiency improvement over contemporary GPUs at 10.4× less silicon area, and a compute performance of 1.4 Tflop/s for training large state-of-the-art networks with full floating-point precision. An extended evaluation of MAC-intensive kernels shows that NTX can consistently achieve up to 87% of its peak performance across general reduction workloads beyond machine learning. Its modular architecture enables deployment at different scales ranging from high-performance GPU-class to low-power embedded scenarios
2019
Proceedings of the 2019 Design, Automation and Test in Europe Conference and Exhibition (DATE)
662
667
Schuiki F., Schaffner M., Benini L. (2019). NTX: An Energy-efficient Streaming Accelerator for Floating-point Generalized Reduction Workloads in 22 nm FD-SOI. Institute of Electrical and Electronics Engineers Inc. [10.23919/DATE.2019.8715007].
Schuiki F.; Schaffner M.; Benini L.
File in questo prodotto:
File Dimensione Formato  
NTX_An Energy-efficient Streaming Accelerator.pdf

Open Access dal 18/11/2019

Tipo: Postprint
Licenza: Licenza per accesso libero gratuito
Dimensione 1.03 MB
Formato Adobe PDF
1.03 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11585/702263
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 6
  • ???jsp.display-item.citation.isi??? 6
social impact