We present Mr. Wolf, a Parallel Ultra Low Power (PULP) SoC featuring a hierarchical architecture with a small (12KG) microcontroller-class RISC-V core augmented with an autonomous IO subsystem for efficient data transfer from a wide set of peripherals. The small core can offload compute-intensive kernels to an 8-core, floating-point-capable processing engine that is available on demand. The proposed SoC, implemented in a 40 nm LP CMOS technology, features a 108 μW fully retentive memory (512 kB). The IO subsystem can transfer up to 1.6 Gbit/s while consuming less than 2.5 mW. The 8-core compute cluster achieves a peak performance of 850 million 32-bit integer multiply-accumulate operations per second (MMAC/s) and 500 million 32-bit floating-point multiply-accumulate operations per second (MFMAC/s), equivalent to 1 GFLOP/s, with an energy efficiency of up to 15 MMAC/s/mW and 9 MFMAC/s/mW. These building blocks are supported by aggressive on-chip power conversion and management, enabling energy-proportional heterogeneous computing for always-on IoT end-nodes and improving performance by several orders of magnitude with respect to traditional single-core MCUs, within a power envelope of 153 mW.
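The headline figures in the abstract can be related to one another with a few unit conversions. The short Python sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, it assumes one floating-point multiply-accumulate counts as two floating-point operations (which is what the 500 MFMAC/s to 1 GFLOP/s equivalence implies), and it assumes 1 kB = 1024 bytes. Peak throughput and peak efficiency may be reached at different operating points, so the sketch does not combine them into a power estimate.

```python
# Figures quoted in the abstract.
PEAK_MFMAC_PER_S = 500.0         # million 32-bit floating-point MACs per second
BEST_MMAC_PER_S_PER_MW = 15.0    # integer efficiency, "up to"
BEST_MFMAC_PER_S_PER_MW = 9.0    # floating-point efficiency, "up to"
IO_GBIT_PER_S = 1.6              # IO subsystem bandwidth, "up to"
IO_POWER_MW = 2.5                # IO subsystem power, quoted as an upper bound
RETENTION_POWER_UW = 108.0       # retentive memory power
MEMORY_KB = 512                  # retentive memory size

FLOPS_PER_FMAC = 2               # assumption: one multiply plus one add


def peak_gflops(mfmac_per_s: float) -> float:
    """Convert million FMAC/s into GFLOP/s."""
    return mfmac_per_s * FLOPS_PER_FMAC / 1000.0


def picojoules_per_op(mops_per_s_per_mw: float) -> float:
    """Energy per operation implied by an efficiency in Mop/s/mW.

    1 mW / (x * 1e6 op/s) = 1e-9 / x J = 1000 / x pJ.
    """
    return 1000.0 / mops_per_s_per_mw


def io_picojoules_per_bit(power_mw: float, gbit_per_s: float) -> float:
    """Energy per transferred bit: 1 mW / (1 Gbit/s) equals 1 pJ/bit."""
    return power_mw / gbit_per_s


def retention_nanowatts_per_byte(power_uw: float, size_kb: int,
                                 bytes_per_kb: int = 1024) -> float:
    """Retention power per byte; bytes_per_kb=1024 is an assumption."""
    return power_uw * 1000.0 / (size_kb * bytes_per_kb)


if __name__ == "__main__":
    print("Peak GFLOP/s:", peak_gflops(PEAK_MFMAC_PER_S))
    print("pJ per integer MAC:", picojoules_per_op(BEST_MMAC_PER_S_PER_MW))
    print("pJ per FMAC:", picojoules_per_op(BEST_MFMAC_PER_S_PER_MW))
    print("IO pJ per bit (upper bound):",
          io_picojoules_per_bit(IO_POWER_MW, IO_GBIT_PER_S))
    print("Retention nW per byte:",
          retention_nanowatts_per_byte(RETENTION_POWER_UW, MEMORY_KB))
```

Under these assumptions, the quoted efficiencies correspond to roughly 67 pJ per integer MAC and 111 pJ per floating-point MAC at the best operating point, the IO subsystem spends less than about 1.56 pJ per transferred bit, and memory retention costs about 0.21 nW per byte.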
Pullini, A., Rossi, D., Loi, I., Di Mauro, A., Benini, L. (2018). Mr. Wolf: A 1 GFLOP/s Energy-Proportional Parallel Ultra Low Power SoC for IOT Edge Processing. Institute of Electrical and Electronics Engineers Inc. [10.1109/ESSCIRC.2018.8494247].
Mr. Wolf: A 1 GFLOP/s Energy-Proportional Parallel Ultra Low Power SoC for IOT Edge Processing
Pullini, Antonio; Rossi, Davide; Loi, Igor; Benini, Luca
2018