Several Chip-Multiprocessor designs today leverage tightly-coupled computing clusters as a building block. These clusters consist of a fairly large number N of simple cores, featuring fast communication through a shared multibanked L1 data memory and ≈ 1 Instruction-Per-Cycle (IPC) per core. Thus, aggregated I-fetch bandwidth approaches ƒ * N, where ƒ is the cluster clock frequency. An effective instruction cache architecture is key to support this I-fetch bandwidth. In this paper we compare two main architectures for instruction caching targeting tightly coupled CMP clusters: (i) private instruction caches per core and (ii) shared instruction cache per cluster. We developed a cycle-accurate model of the tightly coupled cluster with several configurable architectural parameters for exploration, plus a programming environment targeted at efficient data-parallel computing. We conduct an in-depth study of the two architectural templates based on the use of both synthetic microbenchmarks and real program workloads. Our results provide useful insights and guidelines for designers.

Exploring Instruction caching strategies for tightly-coupled shared-memory clusters

BORTOLOTTI, DANIELE;PATERNA, FRANCESCO;PINTO, CHRISTIAN;MARONGIU, ANDREA;RUGGIERO, MARTINO;BENINI, LUCA
2011

Abstract

Several Chip-Multiprocessor designs today leverage tightly-coupled computing clusters as a building block. These clusters consist of a fairly large number N of simple cores, featuring fast communication through a shared multibanked L1 data memory and ≈ 1 Instruction-Per-Cycle (IPC) per core. Thus, aggregated I-fetch bandwidth approaches ƒ * N, where ƒ is the cluster clock frequency. An effective instruction cache architecture is key to support this I-fetch bandwidth. In this paper we compare two main architectures for instruction caching targeting tightly coupled CMP clusters: (i) private instruction caches per core and (ii) shared instruction cache per cluster. We developed a cycle-accurate model of the tightly coupled cluster with several configurable architectural parameters for exploration, plus a programming environment targeted at efficient data-parallel computing. We conduct an in-depth study of the two architectural templates based on the use of both synthetic microbenchmarks and real program workloads. Our results provide useful insights and guidelines for designers.
2011
System on Chip (SoC), 2011 International Symposium on
34
41
D. Bortolotti; F. Paterna; C. Pinto; A. Marongiu; M. Ruggiero; L. Benini
File in questo prodotto:
Eventuali allegati, non sono esposti

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11585/109155
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 16
  • ???jsp.display-item.citation.isi??? ND
social impact