
ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers

Islamoglu, Gamze; Scherer, Moritz; Paulin, Gianna; Fischer, Tim; Jung, Victor J.B.; Garofalo, Angelo; Benini, Luca
2023

Abstract

Transformer networks have emerged as the state-of-the-art approach for natural language processing tasks and are gaining popularity in other domains such as computer vision and audio processing. However, the efficient hardware acceleration of transformer models poses new challenges due to their high arithmetic intensities, large memory requirements, and complex dataflow dependencies. In this work, we propose ITA, a novel accelerator architecture for transformers and related models that targets efficient inference on embedded systems by exploiting 8-bit quantization and an innovative softmax implementation that operates exclusively on integer values. By computing on-the-fly in streaming mode, our softmax implementation minimizes data movement and energy consumption. ITA achieves competitive energy efficiency with respect to state-of-the-art transformer accelerators with 16.9 TOPS/W, while outperforming them in area efficiency with 5.93 TOPS/mm² in 22 nm fully-depleted silicon-on-insulator technology at 0.8 V.
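As a rough illustration of the integer-only, streaming softmax idea mentioned in the abstract, the Python sketch below replaces the exponential with a base-2 bit-shift approximation and normalizes with a single integer division. The function name, the Q16 fixed-point width, and the shift-based exponential are assumptions made for illustration only; this is not the authors' ITA datapath.

```python
import numpy as np

def integer_softmax(x_q, out_bits=8, frac_bits=16):
    """Illustrative integer-only softmax over quantized (e.g. int8) logits.

    Hypothetical sketch: exp() is replaced by a base-2 approximation so each
    term becomes a single bit-shift, and normalization is one integer divide.
    This does not reproduce the ITA implementation; it only shows the flavour
    of a max-subtracted, integer-only softmax.
    """
    x_q = np.asarray(x_q, dtype=np.int32)
    x_max = x_q.max()                                   # running max in a streaming design
    shift = np.minimum(x_max - x_q, frac_bits)          # clip shifts to the fixed-point range
    num = (1 << (frac_bits - shift)).astype(np.int64)   # ~2**(x - x_max) in Q16 fixed point
    denom = num.sum()                                    # running sum in a streaming design
    scale = (1 << out_bits) - 1                          # map to unsigned out_bits-bit values
    return ((num * scale) // denom).astype(np.int32)

# Example: one row of int8 attention scores
print(integer_softmax([12, 50, -30, 7]))   # -> [0, 254, 0, 0] (dominant score wins)
```

In a streaming accelerator, the running maximum and running sum would be maintained on the fly as scores arrive, which is what allows the softmax to avoid buffering and re-reading the full score row; the sketch above computes them in one pass over a small array purely for clarity.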
2023 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers / Islamoglu, Gamze; Scherer, Moritz; Paulin, Gianna; Fischer, Tim; Jung, Victor J.B.; Garofalo, Angelo; Benini, Luca. - ELECTRONIC. - (2023), pp. .-.. (Paper presented at the 2023 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), held in Vienna, Austria, 07-08 August 2023) [10.1109/ISLPED58423.2023.10244348].

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/958810