Architecture support for tightly-coupled multi-core clusters with shared-memory HW accelerators

Marongiu, Andrea; Kakoee, Mohammad Reza; Benini, Luca
2015

Abstract

Coupling processors with acceleration hardware is an effective way to improve the energy efficiency of embedded systems. Many-core is now a dominant design paradigm for SoCs, which opens new challenges and opportunities for designing HW blocks. It is therefore extremely important to explore acceleration solutions that fit naturally into well-established parallel programming models and that can be added incrementally on top of existing parallel applications. In this paper we focus on tightly-coupled multi-core cluster architectures, representative of the basic building block of the most recent many-cores, and we enhance them with dedicated HW processing units (HWPUs). We propose an architecture in which the HWPUs share the same L1 data memory through which the processors communicate, implementing a zero-copy communication model. High-level synthesis (HLS) tools are used to generate the HW blocks, and a custom wrapper then interfaces them to the tightly-coupled cluster. We validate our proposal on RTL models, running both synthetic workloads and real applications. Experimental results demonstrate that, on average, our solution provides nearly identical performance to traditional private-memory coarse-grained accelerators, while achieving up to 32 percent better performance/area/watt and requiring only minimal modifications to legacy parallel code.
Dehyadegari, Masoud; Marongiu, Andrea; Kakoee, Mohammad Reza; Mohammadi, Siamak; Yazdani, Naser; Benini, Luca

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/544967