Hardware-accelerated multicore clusters have recently emerged as a viable approach to deploy advanced digital signal processing (DSP) capabilities in ultra-low-power extreme edge nodes. As a critical building block for DSP, Fast Fourier Transforms (FFTs) are among the best candidates for implementation on a dedicated accelerator core; however, their peculiar memory access patterns make direct integration of an FFT accelerator with a core cluster challenging. In this paper, we compare two approaches to cluster-coupled FFT accelerators: one with a large internal buffer to store and shuffle partial results, and a buffer-less accelerator sharing all memory with the cluster cores. Both versions can work on complex data with 8/16/32-bit real and imaginary parts. We show that, thanks to a newly proposed scheme that reorders data accesses and exploits the full memory bandwidth even for sub-word FFTs, the buffer-less accelerator can be made as fast as the buffered one at only 0.26× the area cost. We report post-layout performance and power results showing that the buffer-less accelerator can provide up to 4/2/1 butterflies/cycle, with an average power consumption of 4.1/5.5/6.8 mW at the 350 MHz, 0.65 V operating point in 22 nm CMOS technology, for complex data with 8/16/32-bit real and imaginary parts, respectively. The buffer-less accelerator is 8× faster than an optimized multicore software implementation working on 16-bit data and compares favorably with FFT accelerators presented in the recent literature.
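To give a feel for the throughput figures quoted in the abstract, the following minimal sketch converts the peak butterfly/cycle rates into an idealized cycle count and run time. It assumes a radix-2 FFT with (N/2)·log2(N) butterflies, which the abstract does not state, and it ignores any setup, twiddle-load, or memory-contention overhead, so the results are optimistic lower bounds rather than measured numbers. The function and constant names are hypothetical; only the 4/2/1 butterflies/cycle, the 4.1/5.5/6.8 mW figures, and the 350 MHz clock come from the abstract.

```python
# Figures taken from the abstract (buffer-less accelerator, 22 nm, 0.65 V).
CLOCK_HZ = 350e6
PEAK_BUTTERFLIES_PER_CYCLE = {8: 4, 16: 2, 32: 1}   # keyed by bits per real/imag part
AVG_POWER_MW = {8: 4.1, 16: 5.5, 32: 6.8}


def radix2_butterflies(n_points: int) -> int:
    """Butterfly count of an N-point FFT, ASSUMING a radix-2 algorithm."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two >= 2")
    stages = n_points.bit_length() - 1          # log2(N) for a power of two
    return (n_points // 2) * stages


def ideal_cycles(n_points: int, bits: int) -> int:
    """Lower bound on cycles if the peak butterfly rate were sustained."""
    rate = PEAK_BUTTERFLIES_PER_CYCLE[bits]
    return -(-radix2_butterflies(n_points) // rate)   # ceiling division


def ideal_time_us(n_points: int, bits: int) -> float:
    """Idealized run time in microseconds at the 350 MHz operating point."""
    return ideal_cycles(n_points, bits) / CLOCK_HZ * 1e6


def ideal_energy_nj(n_points: int, bits: int) -> float:
    """Rough energy estimate: average power (mW) x ideal time (us) = nJ."""
    return AVG_POWER_MW[bits] * ideal_time_us(n_points, bits)


if __name__ == "__main__":
    for bits in (8, 16, 32):
        print(bits, ideal_cycles(1024, bits),
              round(ideal_time_us(1024, bits), 2),
              round(ideal_energy_nj(1024, bits), 1))
```

Under these assumptions, halving the data width doubles the peak butterfly rate and so halves the ideal cycle count; the energy estimate is only indicative, since the reported power values are averages and not tied to a specific FFT size.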
Bertaccini L., Benini L., Conti F. (2021). To buffer, or not to buffer? A case study on FFT accelerators for ultra-low-power multicore clusters. Los Alamitos, CA, USA: Institute of Electrical and Electronics Engineers Inc. [10.1109/ASAP52443.2021.00008].
To buffer, or not to buffer? A case study on FFT accelerators for ultra-low-power multicore clusters
Benini L.; Conti F.
2021