Shared L1 memory clusters are a common architectural pattern (e.g., in GPGPUs) for building efficient and flexible multi-processing-element (PE) engines. However, it is a common belief that these tightly-coupled clusters would not scale beyond a few tens of PEs. In this work, we tackle scaling shared L1 clusters to hundreds of PEs while supporting a flexible and productive programming model and maintaining high efficiency. We present MemPool, a manycore system with 256 RV32IMAXpulpimg “Snitch” cores featuring application-tunable functional units. We designed and implemented an efficient low-latency PE to L1-memory interconnect, an optimized instruction path to ensure each PE's independent execution, and a powerful DMA engine and system interconnect to stream data in and out. MemPool is easy to program, with all the cores sharing a global view of a large, multi-banked, L1 scratchpad memory, accessible within at most five cycles in the absence of conflicts. We provide multiple runtimes to program MemPool at different abstraction levels and illustrate its versatility with a wide set of applications. MemPool runs at 600 MHz (60 gate delays) in typical conditions (TT/0.80 V/25  ∘ C) in 22 nm FDX technology and achieves a performance of up to 229 GOPS or 180 GOPS/W with less than 2% of execution stalls.

Riedel, S., Cavalcante, M., Andri, R., Benini, L. (2023). MemPool: A Scalable Manycore Architecture With a Low-Latency Shared L1 Memory. IEEE TRANSACTIONS ON COMPUTERS, 72(12), 3561-3575 [10.1109/TC.2023.3307796].

MemPool: A Scalable Manycore Architecture With a Low-Latency Shared L1 Memory

Benini, Luca
2023

Abstract

Shared L1 memory clusters are a common architectural pattern (e.g., in GPGPUs) for building efficient and flexible multi-processing-element (PE) engines. However, it is a common belief that these tightly-coupled clusters would not scale beyond a few tens of PEs. In this work, we tackle scaling shared L1 clusters to hundreds of PEs while supporting a flexible and productive programming model and maintaining high efficiency. We present MemPool, a manycore system with 256 RV32IMAXpulpimg “Snitch” cores featuring application-tunable functional units. We designed and implemented an efficient low-latency PE to L1-memory interconnect, an optimized instruction path to ensure each PE's independent execution, and a powerful DMA engine and system interconnect to stream data in and out. MemPool is easy to program, with all the cores sharing a global view of a large, multi-banked, L1 scratchpad memory, accessible within at most five cycles in the absence of conflicts. We provide multiple runtimes to program MemPool at different abstraction levels and illustrate its versatility with a wide set of applications. MemPool runs at 600 MHz (60 gate delays) in typical conditions (TT/0.80 V/25  ∘ C) in 22 nm FDX technology and achieves a performance of up to 229 GOPS or 180 GOPS/W with less than 2% of execution stalls.
2023
Riedel, S., Cavalcante, M., Andri, R., Benini, L. (2023). MemPool: A Scalable Manycore Architecture With a Low-Latency Shared L1 Memory. IEEE TRANSACTIONS ON COMPUTERS, 72(12), 3561-3575 [10.1109/TC.2023.3307796].
Riedel, Samuel; Cavalcante, Matheus; Andri, Renzo; Benini, Luca
File in questo prodotto:
Eventuali allegati, non sono esposti

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11585/956445
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact