Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra

Scheffler, P.; Zaruba, F.; Schuiki, F.; Hoefler, T.; Benini, L.

doi:10.23919/DATE51398.2021.9474230

Sparse-dense linear algebra is crucial in many domains, but challenging to handle efficiently on CPUs, GPUs, and accelerators alike; multiplications with sparse formats like CSR and CSF require indirect memory lookups. In this work, we enhance a memory-streaming RISC-V ISA extension to accelerate sparse-dense products through streaming indirection. We present efficient dot, matrix-vector, and matrix-matrix product kernels using our hardware, enabling single-core FPU utilizations of up to 80% and speedups of up to 7.2x over an optimized baseline without extensions. A matrix-vector implementation on a multicore cluster is up to 5.8x faster and 2.7x more energy-efficient with our kernels than an optimized baseline. We propose further uses for our indirection hardware, such as scatter-gather operations and codebook decoding, and compare our work to state-of-the-art CPU, GPU, and accelerator approaches, measuring a 2.8x higher peak FP64 utilization in CSR matrix-vector multiplication than a GTX 1080 Ti GPU running a cuSPARSE kernel.
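The indirect memory lookups mentioned in the abstract can be illustrated with a plain software CSR matrix-vector product. The following is a minimal sketch, not the authors' kernel and not their ISA extension: the function name `csr_spmv` and the array names `row_ptr`, `col_idx`, and `values` are assumptions chosen for illustration. It only shows where the indirection `x[col_idx[k]]` appears, which is the access pattern the paper proposes to stream in hardware.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a CSR matrix A and a dense vector x.

    row_ptr, col_idx, and values are hypothetical names for the usual
    CSR arrays: row offsets, column indices, and nonzero values.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect lookup: the column index loaded from col_idx
            # selects which element of the dense vector x is read.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Example 3x4 matrix with an empty middle row:
# [[1, 0, 2, 0],
#  [0, 0, 0, 0],
#  [0, 3, 0, 4]]
row_ptr = [0, 2, 2, 4]
col_idx = [0, 2, 1, 3]
values = [1.0, 2.0, 3.0, 4.0]
x = [1.0, 2.0, 3.0, 4.0]
y = csr_spmv(row_ptr, col_idx, values, x)
print(y)  # [7.0, 0.0, 22.0]
```

In this loop, every multiply-accumulate needs an index load followed by a dependent data load, which is why such kernels tend to leave the FPU underutilized on a core without dedicated support.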

Scheffler P., Zaruba F., Schuiki F., Hoefler T., Benini L. (2021). Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra. Institute of Electrical and Electronics Engineers Inc. [10.23919/DATE51398.2021.9474230].

Year: 2021
Volume title: Proceedings - Design, Automation and Test in Europe, DATE
First page: 1787
Last page: 1792
Series: PROCEEDINGS - DESIGN, AUTOMATION, AND TEST IN EUROPE CONFERENCE AND EXHIBITION
DOI: https://dx.doi.org/10.23919/DATE51398.2021.9474230
Publication type: 4.01 Contribution in conference proceedings

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/870410
