Mirsalari, S.A., Fariselli, M., Bijar, L., Paci, F., Benini, L., Tagliavini, G. (2025). Enabling Real-Time Streaming Temporal Convolution Network Inference on Ultra-Low-Power Microcontrollers. New York, USA: IEEE Computer Society. doi: 10.1109/isvlsi65124.2025.11130291
Enabling Real-Time Streaming Temporal Convolution Network Inference on Ultra-Low-Power Microcontrollers
Mirsalari, Seyed Ahmad; Benini, Luca; Tagliavini, Giuseppe
2025
Abstract
Real-time streaming applications play a pivotal role across diverse domains, including autonomous systems, speech processing, and bio-signal monitoring. Temporal Convolutional Networks (TCNs) effectively model sequences by capturing long-term dependencies, but real-time inference on ultra-low-power microcontrollers (MCUs) remains challenging due to high computational and memory requirements. This work presents a framework to optimize TCN inference for real-time streaming applications by introducing a multi-timestep approach combined with advanced quantization techniques. This solution enables dynamic adaptation of the streaming application by trading off latency against computational efficiency. Deploying a speech enhancement model (Conv-TasNet) on the GAP9 ultra-low-power MCU, we achieve a 2 ms inference time (33% of the real-time constraint of 6.25 ms), along with a 108.9× reduction in MAC operations and a 27.7× cycle reduction. Using four timesteps increases the MAC/cycle ratio to 3.3 while maintaining a 4.3 ms inference time, less than 18% of the extended real-time budget (25 ms). Combining INT8-BFP16 mixed-precision quantization with multi-timestep processing delivers a 4× memory saving at the same performance.
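To illustrate the kind of streaming TCN computation the abstract refers to, the sketch below shows a causal dilated 1-D convolution that processes a block of T new timesteps per call (the "multi-timestep" idea) while carrying a small buffer of past inputs between calls. This is a minimal NumPy illustration under assumed shapes and names (`streaming_dilated_conv`, `state`, etc. are hypothetical), not the paper's actual kernel or quantization scheme:

```python
import numpy as np

def streaming_dilated_conv(x_new, state, weights, dilation):
    """One streaming step of a causal dilated 1-D convolution.

    x_new:   (C_in, T) block of T new timesteps (multi-timestep processing)
    state:   (C_in, (K-1)*dilation) buffer of past inputs kept between calls
    weights: (C_out, C_in, K) convolution kernel
    Returns a (C_out, T) output block and the updated state buffer.
    """
    C_out, C_in, K = weights.shape
    T = x_new.shape[1]
    # Prepend the buffered history so the convolution stays causal across calls.
    ctx = np.concatenate([state, x_new], axis=1)
    y = np.zeros((C_out, T))
    for t in range(T):
        # Gather the K dilated taps ending at the current timestep.
        start = state.shape[1] + t - dilation * (K - 1)
        taps = ctx[:, start : state.shape[1] + t + 1 : dilation]  # (C_in, K)
        y[:, t] = np.einsum('oik,ik->o', weights, taps)
    # Keep only the most recent (K-1)*dilation samples for the next call.
    new_state = ctx[:, -(K - 1) * dilation:]
    return y, new_state
```

Processing several timesteps per call amortizes weight loads across a larger block, which is why the block size trades latency (larger blocks wait longer for input) against compute efficiency (higher MAC/cycle), mirroring the trade-off described above.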


