Zhang, C., Colagrande, L., Andri, R., Benz, T., Islamoglu, G., Nadalini, A., et al. (2025). FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Efficient Multi-Head Attention on Tile-Based Many-PE Accelerators. IEEE Computer Society. doi:10.1109/isvlsi65124.2025.11130221
FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Efficient Multi-Head Attention on Tile-Based Many-PE Accelerators
Nadalini, Alessandro; Conti, Francesco; Benini, Luca
2025
Abstract
Multi-Head Attention (MHA) is a critical computational kernel in transformer-based AI models. Emerging scalable tile-based accelerator architectures integrate increasing numbers of tightly packed processing elements (PEs) with tensor units. MHA dataflow mapping is crucial for achieving high utilization of the available units. We propose FlatAttention, a new dataflow for MHA on tile-based many-PE accelerators that minimizes costly main-memory (HBM) accesses by leveraging collective primitives integrated into the on-chip network fabric. FlatAttention achieves up to 89.3% utilization and a 4.1× performance speedup over the FlashAttention-3 dataflow on tile-based accelerators whilst reducing HBM traffic by 16×. Through algorithm-architecture co-exploration, we identify an optimal configuration for a large scaled-out tile-based accelerator featuring a 32×32 tile mesh with 1024 TFLOPS peak performance at FP16, comparable to the state-of-the-art Nvidia H100 GPU. FlatAttention in this configuration achieves up to 1.3× higher utilization than FlashAttention-3 on the H100 GPU. Meanwhile, this tile-based accelerator configuration requires 40% less HBM bandwidth than the H100 GPU, enabling a 1.8× reduction in die size, estimated on the same technology node.
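
For context, the sketch below is a minimal NumPy model of single-head attention computed in tiles with an online softmax, in the style popularized by FlashAttention; it is not the paper's FlatAttention dataflow or its collective-based implementation. It only illustrates the general principle the abstract refers to: by streaming K/V in tiles and rescaling partial results, the full N×N score matrix never has to be written to HBM. The tile size and tensor shapes are illustrative assumptions.

    import numpy as np

    def attention_reference(Q, K, V):
        """Naive attention: materializes the full N x N score matrix."""
        S = Q @ K.T / np.sqrt(Q.shape[-1])
        P = np.exp(S - S.max(axis=-1, keepdims=True))
        P /= P.sum(axis=-1, keepdims=True)
        return P @ V

    def attention_tiled(Q, K, V, tile=64):
        """Tiled attention with online softmax: K/V are consumed tile by tile,
        and the running row max / softmax denominator are rescaled so partial
        outputs can stay in on-chip memory."""
        N, d = Q.shape
        out = np.zeros_like(Q)
        row_max = np.full(N, -np.inf)          # running max per query row
        row_sum = np.zeros(N)                  # running softmax denominator
        for start in range(0, N, tile):
            Kt = K[start:start + tile]
            Vt = V[start:start + tile]
            S = Q @ Kt.T / np.sqrt(d)          # scores for this K tile only
            new_max = np.maximum(row_max, S.max(axis=-1))
            scale = np.exp(row_max - new_max)  # rescale previous partial results
            P = np.exp(S - new_max[:, None])
            row_sum = row_sum * scale + P.sum(axis=-1)
            out = out * scale[:, None] + P @ Vt
            row_max = new_max
        return out / row_sum[:, None]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
        assert np.allclose(attention_reference(Q, K, V),
                           attention_tiled(Q, K, V), atol=1e-6)

On a tile-based many-PE accelerator, the per-tile partial results above would correspond to data that can be exchanged or reduced across the on-chip fabric rather than spilled to HBM, which is the traffic the abstract reports being reduced by 16×.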



