Scherer, M., Cioflan, C., Magno, M., & Benini, L. (2024). Work in Progress: Linear Transformers for TinyML. Institute of Electrical and Electronics Engineers Inc. DOI: 10.23919/DATE58400.2024.10546828.
Work in Progress: Linear Transformers for TinyML
Scherer M.; Benini L.
2024
Abstract
We present WaveFormer, a neural network architecture based on a linear attention transformer that enables long-sequence inference on TinyML devices. WaveFormer achieves a new state-of-the-art accuracy of 98.8% and 99.1% on the Google Speech Commands V2 keyword spotting (KWS) dataset for the 12- and 35-class problems with only 130 kB of weight storage, compatible with MCU-class devices. Top-1 accuracy is improved by 0.1 and 0.9 percentage points while the model size and number of operations are reduced by 2.5× and 4.7× compared to the state of the art. We also propose a hardware-friendly 8-bit integer quantization algorithm for the linear attention operator, enabling efficient deployment on low-cost, ultra-low-power microcontrollers without loss of accuracy.
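Below is a minimal sketch of a kernel-based linear attention operator of the kind the abstract refers to, paired with a simple int8 quantization helper. The feature map (elu + 1), tensor shapes, and the symmetric per-tensor quantization scheme are illustrative assumptions; the paper's actual WaveFormer architecture and quantization algorithm are not reproduced here. The key property shown is that the cost scales as O(L·d²) in sequence length L rather than O(L²·d), which is what makes long-sequence inference feasible on MCU-class devices.

```python
# Hedged sketch: linear attention (in the style of Katharopoulos et al., 2020)
# plus a hypothetical symmetric int8 quantizer. Not the paper's exact method.
import numpy as np

def feature_map(x):
    # elu(x) + 1 keeps features positive; a common choice for linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    # q, k, v: (seq_len, dim). Cost is O(seq_len * dim^2), not O(seq_len^2 * dim).
    qf, kf = feature_map(q), feature_map(k)      # (L, d)
    kv = kf.T @ v                                # (d, d) summary of keys and values
    z = qf @ kf.sum(axis=0)                      # (L,) normalizer
    return (qf @ kv) / (z[:, None] + 1e-6)       # (L, d)

def quantize_int8(x):
    # Illustrative symmetric per-tensor int8 quantization (assumption, not the paper's scheme).
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

# Usage example with random activations
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 64)).astype(np.float32) for _ in range(3))
out = linear_attention(q, k, v)
q_int8, q_scale = quantize_int8(q)
print(out.shape, q_int8.dtype)
```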