Solving Linear Systems on High Performance Hardware with Resilience to Multiple Hard Faults / Loreti D.; Artioli M.; Ciampolini A. - ELECTRONIC. - 2020-:(2020), pp. 9251920.266-9251920.275. (Paper presented at the 39th International Symposium on Reliable Distributed Systems, SRDS 2020, held in Shanghai, China, 21-24 Sept. 2020) [10.1109/SRDS51746.2020.00034].

Solving Linear Systems on High Performance Hardware with Resilience to Multiple Hard Faults

Loreti D.; Artioli M.; Ciampolini A.
2020

Abstract

Because large-scale systems of linear equations are pervasive in many scientific fields, considerable effort has been devoted over the last decade to developing efficient techniques for solving them, often relying on High Performance Computing (HPC) infrastructures to boost performance. In this context, the ever-growing scale of supercomputers inevitably increases the frequency of faults, making fault tolerance a crucial issue in HPC application development. A previous study [1] investigated the possibility of enhancing the Inhibition Method (IMe), a linear systems solver for dense unstructured matrices, with fault tolerance to single hard errors, i.e. failures that cause one computing processor to stop. This article extends [1] by proposing an efficient technique to obtain fault tolerance to multiple hard errors, which may occur concurrently on different processors belonging to the same or different machines. An improved parallel implementation is also proposed, which is particularly suitable for HPC environments and moves towards complete decentralization. The theoretical analysis suggests that the technique (which requires neither checkpointing nor rollback) can provide fault tolerance to multiple faults at the price of a small overhead and a limited number of additional processors to store the checksums. Experimental results on an HPC architecture validate the theoretical study, showing promising performance improvements compared with a popular fault-tolerant solving technique.
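The abstract does not say how the checksums are built or used, so the following is only a minimal sketch of the general idea of checksum-based recovery from multiple hard faults, not the IMe-specific scheme of the paper. It assumes a generic weighted-checksum (Vandermonde) encoding: each of k extra "checksum processors" stores a differently weighted sum of the data blocks, so up to k lost blocks can be rebuilt by solving a small linear system, without checkpointing or rollback. The function names (`make_checksums`, `solve`, `recover`) and the weights `(i + 1) ** j` are hypothetical choices made for illustration.

```python
from fractions import Fraction


def make_checksums(blocks, k):
    """Return k checksum blocks; block i has weight (i + 1) ** j in checksum j."""
    n = len(blocks[0])
    return [
        [sum(Fraction(i + 1) ** j * blocks[i][t] for i in range(len(blocks)))
         for t in range(n)]
        for j in range(k)
    ]


def solve(a, b):
    """Solve the small square system a * x = b by exact Gauss-Jordan elimination."""
    m = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(m):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, m) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(m):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[m] for row in aug]


def recover(blocks, checksums, lost):
    """Rebuild the blocks at indices `lost` from surviving blocks and checksums.

    Assumes len(lost) <= len(checksums); entries of `blocks` at lost indices
    are ignored (they may be None).
    """
    f = len(lost)
    n = len(checksums[0])
    alive = [i for i in range(len(blocks)) if i not in lost]
    # Weights of the lost blocks in the first f checksums (a Vandermonde matrix).
    a = [[Fraction(i + 1) ** j for i in lost] for j in range(f)]
    rebuilt = {i: [] for i in lost}
    for t in range(n):
        # Subtract the contribution of surviving blocks from each checksum.
        rhs = [
            checksums[j][t] - sum(Fraction(i + 1) ** j * blocks[i][t] for i in alive)
            for j in range(f)
        ]
        for i, value in zip(lost, solve(a, rhs)):
            rebuilt[i].append(value)
    return rebuilt


if __name__ == "__main__":
    data = [[1, 2], [3, 4], [5, 6], [7, 8]]      # one block per "processor"
    sums = make_checksums(data, 2)               # two checksum "processors"
    damaged = [data[0], None, data[2], None]     # processors 1 and 3 have failed
    print(recover(damaged, sums, [1, 3]))
```

Exact rational arithmetic is used here only to keep the sketch free of rounding concerns; a floating-point implementation would need to consider the conditioning of the weight matrix as the number of tolerated faults grows.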
2020 International Symposium on Reliable Distributed Systems (SRDS)
pp. 266-275

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/792330
Warning: the data displayed have not been validated by the university.
