TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-Up Cluster Design With High Bandwidth Main Memory Link

Vanelli-Coralli, Alessandro; Benini, Luca
2025

Abstract

Shared-L1-memory clusters of streamlined instruction processors (processing elements, PEs) are commonly used as building blocks in modern, massively parallel computing architectures (e.g. GP-GPUs). Scaling out these architectures by increasing the number of clusters incurs computational and power overhead, because large data structures must be split into chunks, moved across memory hierarchies via the high-latency global interconnect, and merged again. Scaling up the cluster reduces buffering, copy, and synchronization overheads. However, the complexity of a fully connected cores-to-L1-memory crossbar grows quadratically with the PE count, posing a major physical implementation challenge. We present TeraPool, a physically implementable scaled-up cluster design with >1000 floating-point-capable RISC-V PEs sharing a multi-megabyte, >4000-banked L1 memory via a low-latency hierarchical interconnect (1-7/9/11 cycles, depending on target frequency). Implemented in 12 nm FinFET technology, TeraPool achieves near-gigahertz frequencies (910 MHz) under typical operating conditions (0.80 V, 25 °C). The energy-efficient hierarchical PE-to-L1-memory interconnect consumes only 9-13.5 pJ per memory bank access, just 0.74-1.1× the cost of an FP32 FMA. A high-bandwidth main memory link manages data transfers into and out of the shared L1, sustaining transfers at the full bandwidth of an HBM2E main memory. At 910 MHz, the cluster delivers up to 1.89 single-precision TFLOP/s peak performance and up to 200 GFLOP/s/W energy efficiency (at a high IPC/PE of 0.8 on average) in benchmark kernels, demonstrating the feasibility of scaling a shared-L1 cluster to a thousand PEs, four times the PE count of the largest clusters reported in the literature.
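Two figures in the abstract lend themselves to a quick back-of-the-envelope check: the quadratic growth of a flat crossbar, and the FP32 FMA energy implied by the quoted access-energy ratios. The Python sketch below is illustrative only. The function names are hypothetical, the 1024-PE count is taken from the paper's title, and the banking factor of 4 banks per PE is an assumption inferred from ">1000 PEs" and ">4000 banks"; it does not model the hierarchical interconnect the authors actually built.

```python
# Hypothetical helpers for sanity-checking numbers quoted in the abstract.
# None of these names come from the paper.


def crossbar_crosspoints(num_pes: int, banks_per_pe: int = 4) -> int:
    """Crosspoints in a flat, fully connected PE-to-bank crossbar.

    Assumes the number of banks scales with the PE count
    (banks = num_pes * banks_per_pe), so the result grows as num_pes**2.
    """
    num_banks = num_pes * banks_per_pe
    return num_pes * num_banks


def implied_fma_energy_pj(access_pj: float, ratio_to_fma: float) -> float:
    """FMA energy implied by an access energy and its ratio to the FMA cost."""
    return access_pj / ratio_to_fma


if __name__ == "__main__":
    # Quadrupling the PE count multiplies the flat-crossbar size by 16.
    small = crossbar_crosspoints(256)
    large = crossbar_crosspoints(1024)
    print(f"256 PEs:  {small} crosspoints")
    print(f"1024 PEs: {large} crosspoints ({large // small}x)")

    # Both ends of the quoted range point to roughly the same FMA energy.
    low = implied_fma_energy_pj(9.0, 0.74)
    high = implied_fma_energy_pj(13.5, 1.1)
    print(f"Implied FP32 FMA energy: {low:.1f} to {high:.1f} pJ")
```

Under these assumptions, both ends of the 9-13.5 pJ range imply an FP32 FMA cost of roughly 12 pJ, which suggests the two quoted ranges are mutually consistent.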
Zhang, Y., Bertuletti, M., Zhang, C., Riedel, S., Shen, D., Wang, B., et al. (2025). TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-Up Cluster Design With High Bandwidth Main Memory Link. IEEE Transactions on Computers, 74(11), 3667-3681 [10.1109/tc.2025.3603692].
Zhang, Yichao; Bertuletti, Marco; Zhang, Chi; Riedel, Samuel; Shen, Diyou; Wang, Bowen; Vanelli-Coralli, Alessandro; Benini, Luca

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1039409