Rahimi A., Loi I., Kakoee M.R., Benini L. (2011). A fully-synthesizable single-cycle interconnection network for Shared-L1 processor clusters. New York: IEEE Press [doi:10.1109/DATE.2011.5763085].
A fully-synthesizable single-cycle interconnection network for Shared-L1 processor clusters
LOI, IGOR; KAKOEE, MOHAMMAD REZA; BENINI, LUCA
2011
Abstract
Shared L1 memory is an interesting architectural option for building tightly-coupled multi-core processor clusters. We designed a parametric, fully combinational Mesh-of-Trees (MoT) interconnection network to support high-performance, single-cycle communication between processors and memories in L1-coupled processor clusters. Our interconnect IP is described in synthesizable RTL and is coupled with a design automation strategy mixing advanced synthesis and physical optimization to achieve optimal delay, power, and area (DPA) under a wide range of design constraints. We explore DPA for a large set of network configurations in 65nm technology. Post place-and-route delay is 38 FO4 for a configuration with 8 processors and 16 32-bit memories (8×16); when the number of both processors and memories is increased by a factor of 4, the delay increases almost logarithmically, to 84 FO4, confirming scalability across a significant range of configurations. DPA tradeoff flexibility is also promising: compared to the maximum-performance 16×32 configuration, power and area can be reduced by 45% and 12% respectively, at the expense of a 30% performance degradation.
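The near-logarithmic delay scaling reported in the abstract follows from the tree structure of the interconnect. As a rough illustration only (not taken from the paper), the sketch below assumes the MoT is built from binary routing trees on the processor side and binary arbitration trees on the memory side, so a request crosses roughly log2(M) + log2(P) combinational switch levels; the function name and the exact topology are assumptions made for illustration.

```python
# Minimal sketch, assuming a P x M Mesh-of-Trees built from binary routing
# trees (processor side) and binary arbitration trees (memory side); the
# critical path then crosses ~log2(M) + log2(P) combinational switch levels.
# This is an illustrative model, not the paper's implementation.
import math

def mot_switch_levels(num_procs: int, num_mems: int) -> int:
    """Approximate combinational switch levels on a processor-to-memory path."""
    routing = math.ceil(math.log2(num_mems))       # fan-out (routing) tree depth
    arbitration = math.ceil(math.log2(num_procs))  # fan-in (arbitration) tree depth
    return routing + arbitration

if __name__ == "__main__":
    for p, m in [(8, 16), (16, 32), (32, 64)]:
        print(f"{p}x{m}: ~{mot_switch_levels(p, m)} switch levels")
```

Under this assumption, quadrupling both processors and memories (8×16 to 32×64) adds only a few switch levels (roughly 7 to 11), which is consistent with the abstract's "almost logarithmic" delay growth, even though wire and physical-design effects make the measured FO4 increase somewhat larger than the pure level count would suggest.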