Fault tolerant adaptive parallel and distributed simulation through functional replication

D'Angelo, Gabriele;Ferretti, Stefano;Marzolla, Moreno
2019

Abstract

This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has been designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in any kind of scientific or engineering field. PADS takes advantage of multiple execution units running on multicore processors, clusters of workstations, or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures, since interaction messages among the simulated entities are replicated as well, so that the receiving entity can identify and discard corrupted messages. Results from an analytical model and from an experimental evaluation show that FT-GAIA provides a high degree of fault tolerance, at the cost of a moderate increase in the computational load of the execution units.
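The message-filtering idea mentioned in the abstract can be illustrated with a minimal sketch. It is not FT-GAIA's actual code or API: the function name `majority_payload`, the string payloads, and the acceptance rule (a payload is accepted only when at least f + 1 identical copies arrive, which assumes 2f + 1 replicas per sender to tolerate f Byzantine replicas) are assumptions made here for illustration.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def majority_payload(copies: Iterable[Hashable], f: int) -> Optional[Hashable]:
    """Return the payload carried by at least f + 1 identical copies, else None.

    Hypothetical helper: `copies` holds the payloads received from the
    replicas of one sender entity for a single logical message.
    """
    counts = Counter(copies)
    if not counts:
        return None  # nothing received yet
    payload, votes = counts.most_common(1)[0]
    # With at most f faulty replicas, f + 1 matching copies must include
    # at least one copy produced by a correct replica.
    return payload if votes >= f + 1 else None


# Three replicas send the same logical message; one copy is corrupted.
copies = ["move(3,4)", "move(3,4)", "move(9,9)"]
print(majority_payload(copies, f=1))  # prints move(3,4)
```

Under a crash-only assumption, any single received copy would suffice, which corresponds to calling the helper with `f=0`.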
Fault tolerant adaptive parallel and distributed simulation through functional replication / D'Angelo, Gabriele*; Ferretti, Stefano; Marzolla, Moreno. - In: SIMULATION MODELLING PRACTICE AND THEORY. - ISSN 1569-190X. - ELECTRONIC. - 93:(2019), pp. 192-207. [10.1016/j.simpat.2018.09.012]
Files in this item:
File: paper.pdf
Access: open access
Description: post-print edited by the authors
Type: Postprint
License: Open Access License. Creative Commons Attribution - NonCommercial - NoDerivatives (CC BY-NC-ND)
Size: 1.19 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/683519