ProActive SPMD and Fault Tolerance Protocol and Benchmarks

Size: px

Start display at page:

Download "ProActive SPMD and Fault Tolerance Protocol and Benchmarks"

Allan Lamb
5 years ago
Views:

1 1 ProActive SPMD and Fault Tolerance Protocol and Benchmarks Brian Amedro et al. INRIA - CNRS 1st workshop INRIA-Illinois June 10-12, 2009 Paris

2 2 Outline ASP Model Overview ProActive SPMD Fault Tolerance Benchmarks

3 3 Asynchronous Sequential Process Communication with message-passing Request / Reply No memory sharing Asynchronous request services Request queue Rendezvous Future with wait-by-necessity Confluence and determinacy Causal ordering Activity state characterization P1 P2 Java implementation Denis Caromel and Ludovic Henrio and Bernard Serpette Asynchronous and Deterministic Objects, POPL!04

4 4 Asynchronous Sequential Process Communication with message-passing Request / Reply No memory sharing Asynchronous request services Request queue Rendezvous Future with wait-by-necessity Confluence and determinacy Causal ordering Activity state characterization P1 P2 Java implementation Denis Caromel and Ludovic Henrio and Bernard Serpette Asynchronous and Deterministic Objects, POPL!04

5 5 Asynchronous Sequential Process Communication with message-passing Request / Reply No memory sharing Asynchronous request services Request queue Rendezvous Future with wait-by-necessity Confluence and determinacy Causal ordering Activity state characterization Java implementation RDV 3 P1 P2 Denis Caromel and Ludovic Henrio and Bernard Serpette Asynchronous and Deterministic Objects, POPL!04

6 6 Asynchronous Sequential Process Communication with message-passing Request / Reply No memory sharing Asynchronous request services Request queue Rendezvous Future with wait-by-necessity Confluence and determinacy Causal ordering Activity state characterization Java implementation Futures and WBN 3 P1 P2 Denis Caromel and Ludovic Henrio and Bernard Serpette Asynchronous and Deterministic Objects, POPL!04

7 7 Asynchronous Sequential Process Communication with message-passing Request / Reply No memory sharing Asynchronous request services Request queue Rendezvous Future with wait-by-necessity Confluence and determinacy Causal ordering Activity state characterization Java implementation point-to-point FIFO order Denis Caromel and Ludovic Henrio and Bernard Serpette Asynchronous and Deterministic Objects, POPL! causal ordering

8 8 Asynchronous Sequential Process Communication with message-passing Request / Reply No memory sharing Asynchronous request services Request queue Rendezvous Future with wait-by-necessity Confluence and determinacy Causal ordering Activity state characterization Java implementation point-to-point FIFO order Denis Caromel and Ludovic Henrio and Bernard Serpette Asynchronous and Deterministic Objects, POPL! causal ordering

9 9 Asynchronous Sequential Process Communication with message-passing Request / Reply No memory sharing Asynchronous request services Request queue Rendezvous Future with wait-by-necessity Confluence and determinacy Causal ordering Activity state characterization Java implementation Denis Caromel and Ludovic Henrio and Bernard Serpette Asynchronous and Deterministic Objects, POPL!04

10 10 ProActive OO-SPMD Objectives Provide an MPI-like programming model Ease the porting an MPI application to ProActive Give Object Oriented to SPMD model

11 11 ProActive OO-SPMD Features Asynchronous collective operations Asynchronous barrier Asynchronous group communication Scattering, gathering Take into account topology Optimized algorithm

12 12 Fault Tolerance with ASP Objectives Transparency No piece of code dedicated to Fault Tolerance in applications Portability No assumption about underlying hardware Consistency

13 13 Fault Tolerance with ASP Propositions Rollback Recovery TTC + Communication Induced Checkpointing: Transparency No programmer intervention Constrained Checkpointability: Portability No Checkpoint during a service Un-consistency of recovery lines Christian Delbé, PhD Thesis, 2007 Tolérance aux pannes pour objets actifs asynchrones - protocole, modèle et expérimentations

14 14 Fault Tolerance with ASP Principles of the Protocol Q1 Q4 n Ci[Q1,Q2,Q4] Q2 Q3 Q1 Q2 Q4 i

15 15 Fault Tolerance with ASP Principles of the Protocol Orphan and In-transit Messages Promised Requests Request Reception History

16 16 Fault Tolerance with ASP Orphan Messages Ci n+1 Q1 i n Cj[Q0] Q0 n+1 Cj[Q1] Q1 R1 j Q1 is an Orphan Request

17 17 Fault Tolerance with ASP Promised Request n+1 Q1 i n Q0 n+1 Q1 R1 j Ci n+1 Q1 i n+1 Cj [Qi,j] Qi,j Q1 synchronization with the actual request arrival R1 j

18 18 Fault Tolerance with ASP In-transit Messages Ci n Cj n Q0 Ck n Q0 R1 Q2 i j k Q0 is an In-transit Request

19 19 Fault Tolerance with ASP Request Reception History Ci n+1 Cj n+1 Ck n Q0 Q1 Q0 Q1 R1 Ck n+1 i j k

20 20 Fault Tolerance with ASP Synthesis Orphan Request Replace with a promised request with wait-by-necessity In-transit Requests Reception of such a request is journalized into a request reception history In-transit Reply Can t happen: occur after the reception of an orphan request, so a checkpoint have been performed

21 21 Benchmarking Fault Tolerance NAS Parallel Benchmarks 5 kernels : EP, CG, FT, MG, IS About 10,000 LOC

22 22 Benchmarking Fault Tolerance NAS Benchmark : CG.A Checkpoint size (MB) Time (sec) # nodes Checkpoint size (MB) T without checkpoints T standard T with checkpoints

23 23 Benchmarking Fault Tolerance NAS Benchmark : CG.C Checkpoint size (MB) Time (sec) # nodes Checkpoint size (MB) T without checkpoints T standard T with checkpoints

24 24 Thank you for your attention

Rollback-Recovery Protocols for Send-Deterministic Applications. Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir and Franck Cappello

Rollback-Recovery Protocols for Send-Deterministic Applications Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir and Franck Cappello Fault Tolerance in HPC Systems is Mandatory Resiliency is