In heterogeneous computer architectures, the serial part of an application is coupled with domain-specific accelerators that promise high computing throughput and efficiency across a wide range of applications. In such systems, the serial part of a program is executed on a Central Processing Unit (CPU) core optimized for single-thread performance, while parallel sections are offloaded to Programmable Manycore Accelerators (PMCAs). This heterogeneity requires CPU cores and PMCAs to share data in memory efficiently, although CPUs rely on a coherent memory system where data is transferred in cache lines, while PMCAs are based on non-coherent scratchpad memories where data is transferred in bursts by DMA engines. In this paper, we tackle the challenges and hardware complexity of bridging the gap from a non-coherent, burst-based memory hierarchy to a coherent, cache-line-based one. We design and implement an open-source hardware module that reaches 97% peak throughput over a wide range of realistic linear algebra kernels and is suited for a wide spectrum of memory architectures. Implemented in a state-of-the-art 22 nm FD-SOI technology, our module bridges up to 650 Gbps at 130 fJ/bit and has a complexity of less than 1 kGE/Gbps.
Titolo: | Design of an open-source bridge between non-coherent burst-based and coherent cache-line-based memory systems | |
Autore/i: | Cavalcante M.; Kurth A.; Schuiki F.; Benini L. | |
Autore/i Unibo: | ||
Anno: | 2020 | |
Titolo del libro: | 17th ACM International Conference on Computing Frontiers 2020, CF 2020 - Proceedings | |
Pagina iniziale: | 81 | |
Pagina finale: | 88 | |
Digital Object Identifier (DOI): | http://dx.doi.org/10.1145/3387902.3392631 | |
Abstract: | In heterogeneous computer architectures, the serial part of an application is coupled with domain-specific accelerators that promise high computing throughput and efficiency across a wide range of applications. In such systems, the serial part of a program is executed on a Central Processing Unit (CPU) core optimized for single-thread performance, while parallel sections are offloaded to Programmable Manycore Accelerators (PMCAs). This heterogeneity requires CPU cores and PMCAs to share data in memory efficiently, although CPUs rely on a coherent memory system where data is transferred in cache lines, while PMCAs are based on non-coherent scratchpad memories where data is transferred in bursts by DMA engines. In this paper, we tackle the challenges and hardware complexity of bridging the gap from a non-coherent, burst-based memory hierarchy to a coherent, cache-line-based one. We design and implement an open-source hardware module that reaches 97% peak throughput over a wide range of realistic linear algebra kernels and is suited for a wide spectrum of memory architectures. Implemented in a state-of-the-art 22 nm FD-SOI technology, our module bridges up to 650 Gbps at 130 fJ/bit and has a complexity of less than 1 kGE/Gbps. | |
Data stato definitivo: | 11-feb-2021 | |
Appare nelle tipologie: | 4.01 Contributo in Atti di convegno |