Zhang, Y., Fu, Z., Fischer, T., Li, Y., Bertuletti, M., Benini, L. (2025). TeraNoC: A Multi-Channel 32-Bit Fine-Grained, Hybrid Mesh-Crossbar NoC for Efficient Scale-Up of 1000+ Core Shared-L1-Memory Clusters [10.1109/iccd65941.2025.00093].

TeraNoC: A Multi-Channel 32-Bit Fine-Grained, Hybrid Mesh-Crossbar NoC for Efficient Scale-Up of 1000+ Core Shared-L1-Memory Clusters

Benini, Luca
2025

Abstract

A key challenge in on-chip interconnect design is to scale up bandwidth while maintaining low latency and high area efficiency. 2D-meshes scale with low wiring area and congestion overhead; however, their end-to-end latency increases with the number of hops, making them unsuitable for latency-sensitive core-to-L1-memory access. Crossbars, on the other hand, offer low latency, but their routing complexity grows quadratically with the number of I/Os, requiring large physical routing resources and limiting area-efficient scalability. This two-sided interconnect bottleneck hinders the scale-up of many-core, low-latency, tightly coupled shared-memory clusters, pushing designers toward instantiating many smaller, loosely coupled clusters at the cost of hardware and software overheads. We present TeraNoC, an open-source, hybrid mesh–crossbar on-chip interconnect that offers both scalability and low latency while maintaining very low routing overhead. The topology, built on 32-bit word-width multi-channel 2D-meshes and crossbars, enables the area-efficient scale-up of shared-memory clusters. A router remapper is designed to balance traffic load across interconnect channels. Using TeraNoC, we build a cluster with 1024 single-stage, single-issue cores that share a 4096-banked L1 memory, implemented in 12 nm technology. We maximize the utilization of wiring resources by using a configurable number of read and write channels, achieving a peak bandwidth of 3.74 TiB/s and a bisection bandwidth of 0.47 TiB/s. The low interconnect stalls enable high compute utilization of up to 0.85 IPC in compute-intensive, data-parallel key GenAI kernels. TeraNoC consumes only 7.6% of the total cluster power in kernels dominated by crossbar accesses, and 22.7% in kernels with high 2D-mesh traffic. Compared to a hierarchical crossbar-only cluster, TeraNoC reduces die area by 37.8% and improves area efficiency (GFLOP/s/mm²) by up to 98.7%, while occupying only 10.9% of the logic area.
2025 IEEE 43rd International Conference on Computer Design (ICCD), pp. 610-617
Zhang, Yichao; Fu, Zexin; Fischer, Tim; Li, Yinrong; Bertuletti, Marco; Benini, Luca
Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1040902