Zhang, Y., Fu, Z., Fischer, T., Li, Y., Bertuletti, M., Benini, L. (2025). TeraNoC: A Multi-Channel 32-Bit Fine-Grained, Hybrid Mesh-Crossbar NoC for Efficient Scale-Up of 1000+ Core Shared-L1-Memory Clusters [10.1109/iccd65941.2025.00093].
TeraNoC: A Multi-Channel 32-Bit Fine-Grained, Hybrid Mesh-Crossbar NoC for Efficient Scale-Up of 1000+ Core Shared-L1-Memory Clusters
Benini, Luca
2025
Abstract
A key challenge in on-chip interconnect design is to scale up bandwidth while maintaining low latency and high area efficiency. 2D-meshes scale with low wiring area and congestion overhead; however, their end-to-end latency increases with the number of hops, making them unsuitable for latency-sensitive core-to-L1-memory access. On the other hand, crossbars offer low latency, but their routing complexity grows quadratically with the number of I/Os, requiring large physical routing resources and limiting area-efficient scalability. This two-sided interconnect bottleneck hinders the scale-up of many-core, low-latency, tightly coupled shared-memory clusters, pushing designers toward instantiating many smaller and loosely coupled clusters, at the cost of hardware and software overheads. We present TeraNoC, an open-source, hybrid mesh–crossbar on-chip interconnect that offers both scalability and low latency, while maintaining very low routing overhead. The topology, built on 32-bit word-width multi-channel 2D-meshes and crossbars, enables the area-efficient scale-up of shared-memory clusters. A router remapper is designed to balance traffic load across interconnect channels. Using TeraNoC, we build a cluster with 1024 single-stage, single-issue cores that share a 4096-banked L1 memory, implemented in 12 nm technology. We maximize the utilization of wiring resources by using a configurable number of read and write channels, achieving a peak bandwidth of 3.74 TiB/s and a bisection bandwidth of 0.47 TiB/s. The low interconnect stalls enable high compute utilization of up to 0.85 IPC in compute-intensive, data-parallel key GenAI kernels. TeraNoC consumes only 7.6% of the total cluster power in kernels dominated by crossbar accesses, and 22.7% in kernels with high 2D-mesh traffic. Compared to a hierarchical crossbar-only cluster, TeraNoC reduces die area by 37.8% and improves area efficiency (GFLOP/s/mm²) by up to 98.7%, while occupying only 10.9% of the logic area.
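
The scaling trade-off described in the abstract can be illustrated with a small sketch. It is not taken from the paper: it assumes a single flat crossbar whose cost is counted in crosspoints (inputs times outputs) and a square k-by-k mesh with minimal dimension-ordered routing under uniform random traffic. The function names are hypothetical, and the real TeraNoC topology is hierarchical and multi-channel, so these numbers only show the trend.

```python
from itertools import product


def crossbar_crosspoints(n_inputs, n_outputs):
    """Crosspoints in a flat, fully connected crossbar (illustrative cost model)."""
    return n_inputs * n_outputs


def mesh_hops(k):
    """Average and worst-case hop count on a k x k mesh.

    Assumes minimal (Manhattan-distance) routing and uniform random
    source/destination pairs, with source == destination allowed.
    """
    nodes = list(product(range(k), repeat=2))
    total = sum(
        abs(ax - bx) + abs(ay - by)
        for (ax, ay) in nodes
        for (bx, by) in nodes
    )
    average = total / (len(nodes) ** 2)
    worst = 2 * (k - 1)
    return average, worst


if __name__ == "__main__":
    # A hypothetical flat crossbar between 1024 cores and 4096 banks.
    print("crosspoints:", crossbar_crosspoints(1024, 4096))
    # Hop counts for a few mesh sizes, showing latency growth with scale.
    for k in (2, 4, 8):
        print(k, mesh_hops(k))
```

Under these assumptions, doubling both the core and bank counts quadruples the crosspoint count, while the mesh's average hop count grows roughly linearly with k, which is the tension a hybrid mesh-crossbar design targets.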



