Distributed Inference with Minimal Off-Chip Traffic for Transformers on Low-Power MCUs

Bochem, Severin; Jung, Victor J. B.; Prasad, Arpan Suravi; Conti, Francesco; Benini, Luca

doi:10.23919/date64628.2025.10992712

Contextual Artificial Intelligence (AI) based on emerging Transformer models is predicted to drive the next technology revolution in interactive wearable devices such as new-generation smart glasses. By coupling numerous sensors with small, low-power Micro-Controller Units (MCUs), these devices will enable on-device intelligence and sensor control. A major bottleneck in this class of systems is the small amount of on-chip memory available in the MCUs. In this paper, we propose a methodology to deploy real-world Transformers on low-power wearable devices with minimal off-chip traffic by exploiting a distributed system of MCUs, partitioning inference across multiple devices and enabling execution with stationary on-chip weights. We validate the scheme by deploying the TinyLlama-42M decoder-only model on a system of 8 parallel ultra-low-power MCUs. Compared to a single-chip system, the distributed system achieves an energy consumption of 0.64 mJ, a latency of 0.54 ms per inference, a super-linear speedup of 26.1×, and an Energy Delay Product (EDP) improvement of 27.2×. On MobileBERT, the distributed system's runtime is 38.8 ms, a super-linear 4.7× speedup on 4 MCUs compared to a single-chip system.
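The relation between the reported speedup and the EDP improvement can be checked with a few lines of arithmetic. The sketch below assumes the usual definition EDP = energy × latency and uses only the TinyLlama-42M figures quoted in the abstract. The variable and function names are illustrative, and the single-chip values are back-computed from the rounded ratios, not reported by the authors.

```python
# Figures quoted in the abstract for the 8-MCU TinyLlama-42M deployment.
energy_mj = 0.64    # energy per inference, millijoules
latency_ms = 0.54   # latency per inference, milliseconds
speedup = 26.1      # runtime ratio: single chip / distributed
edp_gain = 27.2     # EDP ratio: single chip / distributed


def edp_uj_s(energy_mj: float, latency_ms: float) -> float:
    """Energy-delay product in microjoule-seconds (1 mJ * 1 ms = 1 uJ*s)."""
    return energy_mj * latency_ms


distributed_edp = edp_uj_s(energy_mj, latency_ms)

# Assumption: EDP = E * t, so the EDP gain factors into
# (energy ratio) * (speedup). Solve for the implied energy ratio.
implied_energy_ratio = edp_gain / speedup

# Back-computed single-chip figures (derived here from rounded ratios,
# NOT values stated in the abstract).
single_latency_ms = latency_ms * speedup
single_energy_mj = energy_mj * implied_energy_ratio
```

Under these assumptions the distributed EDP is about 0.35 µJ·s and the implied energy ratio is about 1.04. Read this way, nearly all of the EDP improvement comes from the latency reduction, with per-inference energy roughly unchanged.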

Bochem, S., Jung, V.J.B., Prasad, A.S., Conti, F., Benini, L. (2025). Distributed Inference with Minimal Off-Chip Traffic for Transformers on Low-Power MCUs. Institute of Electrical and Electronics Engineers Inc. [10.23919/date64628.2025.10992712].



Year: 2025
Volume title: Proceedings - Design, Automation and Test in Europe, DATE
First page: 1
Last page: 7
Series: Proceedings - Design, Automation, and Test in Europe Conference and Exhibition
DOI: https://dx.doi.org/10.23919/date64628.2025.10992712


Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1040756

Note: the data shown in this record have not been validated by the university.
