RedMulE-FT: A Reconfigurable Fault-Tolerant Matrix Multiplication Engine

Bertaccini, Luca; Tortorella, Yvan; Garofalo, Angelo; Benini, Luca
2025

Abstract

As safety-critical applications increasingly rely on data-parallel floating-point computations, there is a growing need for flexible and configurable fault tolerance in parallel floating-point accelerators such as tensor engines. Replication-based methods ensure reliability but incur high area and power costs, while error correction codes lack the flexibility to trade off robustness against performance. This work presents RedMulE-FT, a runtime-configurable fault-tolerant extension of the RedMulE matrix multiplication accelerator that balances fault tolerance, area overhead, and performance impact. The fault tolerance mode is configured in a shadowed context register file before task execution. By combining replication with error-detecting codes to protect the data path, RedMulE-FT achieves an 11× reduction in uncorrected faults with only a small area overhead. Full protection extends to control signals and resulted in no functional errors after 1M injections in our extensive fault injection simulation campaign, at an additional area cost, while maintaining a 500 MHz frequency in a 12 nm technology.
Proceedings of the 22nd ACM International Conference on Computing Frontiers 2025 (CF 2025), pp. 78-81
Wiese, P., Item, M., Bertaccini, L., Tortorella, Y., Garofalo, A., Benini, L. (2025). RedMulE-FT: A Reconfigurable Fault-Tolerant Matrix Multiplication Engine. New York, NY, United States: Association for Computing Machinery. doi:10.1145/3706594.3726981.
Wiese, Philip; Item, Maurus; Bertaccini, Luca; Tortorella, Yvan; Garofalo, Angelo; Benini, Luca

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1025799