The increasing demand for generative AI services based on Large Language Models (LLMs) has driven the need for specialized hardware architectures that optimize computational efficiency and energy consumption. This paper evaluates the performance of the Tenstorrent Grayskull e75 RISC-V accelerator for basic linear algebra kernels at reduced numerical precision, which are fundamental operations in LLM computations. We present a detailed characterization of Grayskull’s execution model and of the impact of grid size, matrix dimensions, data formats, and numerical precision on computational efficiency. Furthermore, we compare Grayskull’s performance against state-of-the-art architectures with tensor acceleration, including Intel Sapphire Rapids processors and two NVIDIA GPUs (V100 and A100). Whilst NVIDIA GPUs dominate raw performance, Grayskull demonstrates a competitive trade-off between power consumption and computational throughput, reaching a peak of 1.55 TFLOPs/Watt with BF16.
Pizzini Cavagna, H., Cesarini, D., Bartolini, A. (2025). Assessing Tenstorrent’s RISC-V MatMul Acceleration Capabilities. Springer Nature [10.1007/978-3-032-07612-0_10].
Assessing Tenstorrent’s RISC-V MatMul Acceleration Capabilities
Pizzini Cavagna, Hiari
First author
Writing – Original Draft Preparation
Bartolini, Andrea
Last author
Writing – Review & Editing
2025


