
AME-PIM: Can Memory be Your Next Tensor Accelerator?

Venieri, Emanuele; Manoni, Simone; Florian, Alberto; Park, Jaehyun; Sohn, Kyomin; Bartolini, Andrea

2026

Abstract

High Bandwidth Memory with Processing-in-Memory (HBM-PIM) offers an opportunity to reduce data movement by executing computation directly inside memory, but current commercial platforms expose limited instruction sets and require specialized software stacks. In this work, we investigate whether HBM-PIM can serve as a backend for ISA-level matrix acceleration, using the RISC-V Attached Matrix Extension (AME) as a semantic reference. We propose a PEP-based execution model that maps AME element-wise and matrix instructions to HBM-PIM micro-kernels and data instructions in memory operations. Unlike state-of-the-art (SoA) HBM-PIM approaches, we introduce a reduction-free outer-product dataflow that enables accumulation entirely within memory despite the lack of native reduction support. Our approach supports end-to-end execution of element-wise operations, GEMV, and GEMM in PIM mode, minimizing host involvement and off-chip transfers. An experimental evaluation on Samsung Aquabolt-XL shows that AME matrix tile multiplication achieves up to 14.9 GFLOP/s (59.4 FLOP/cycle) on a single HBM pseudo-channel.
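
To illustrate why an outer-product formulation needs no reduction step, the following is a minimal sketch in plain Python. It is not the authors' implementation and does not model HBM-PIM hardware or AME instructions; the function name gemm_outer_product and the list-of-lists matrix layout are assumptions made only for this example. The point it shows is that C = A·B can be built as a sum of rank-1 updates, where every element of C receives one independent multiply-accumulate per step and no values are ever summed across lanes.

```python
def gemm_outer_product(a, b):
    """Compute C = A @ B as a sequence of rank-1 (outer-product) updates.

    Hypothetical helper for illustration only. a is M x K, b is K x N,
    both given as lists of rows.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    if any(len(row) != k_dim for row in a):
        raise ValueError("inner dimensions of a and b do not match")

    # The accumulator tile stays resident for the whole computation.
    c = [[0.0] * n for _ in range(m)]

    for k in range(k_dim):
        # Step k adds the outer product of column k of A and row k of B.
        for i in range(m):
            a_ik = a[i][k]
            for j in range(n):
                # Each c[i][j] is updated independently of every other
                # element: a multiply-accumulate, never a cross-element sum.
                c[i][j] += a_ik * b[k][j]
    return c


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(gemm_outer_product(a, b))  # [[19.0, 22.0], [43.0, 50.0]]
```

An inner-product formulation would instead compute each c[i][j] as a dot product, which requires summing K partial products together; that summation is the reduction the abstract says the hardware lacks natively.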

Year	2026
Volume title	Proceedings of the 23rd ACM International Conference on Computing Frontiers 2026 (CF 2026)
First page	232
Last page	240
DOI	https://dx.doi.org/10.1145/3801487.3806067
Citation	Venieri, E., Manoni, S., Florian, A., Park, J., Sohn, K., Bartolini, A. (2026). AME-PIM: Can Memory be Your Next Tensor Accelerator? [10.1145/3801487.3806067].
All authors	Venieri, Emanuele; Manoni, Simone; Florian, Alberto; Park, Jaehyun; Sohn, Kyomin; Bartolini, Andrea
Appears in types	4.01 Conference proceedings contribution

Files in this item:

File	Size	Format
3801487.3806067.pdf (open access; type: publisher's PDF version / Version of Record; license: open access, Creative Commons Attribution (CC BY))	5.39 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1069857
