ProActive SPMD and Fault Tolerance Protocol and Benchmarks

1 ProActive SPMD and Fault Tolerance Protocol and Benchmarks Brian Amedro et al. INRIA - CNRS 1st workshop INRIA-Illinois June 10-12, 2009 Paris

2 Outline ASP Model Overview ProActive SPMD Fault Tolerance Benchmarks

3 Asynchronous Sequential Process Communication with message-passing Request / Reply No memory sharing Asynchronous request services Request queue Rendezvous Future with wait-by-necessity Confluence and determinacy Causal ordering Activity state characterization 1 2 1 2 3 3 P1 P2 Java implementation Denis Caromel and Ludovic Henrio and Bernard Serpette Asynchronous and Deterministic Objects, POPL!04

4 Asynchronous Sequential Process Communication with message-passing Request / Reply No memory sharing Asynchronous request services Request queue Rendezvous Future with wait-by-necessity Confluence and determinacy Causal ordering Activity state characterization 1 2 1 2 3 3 P1 P2 Java implementation Denis Caromel and Ludovic Henrio and Bernard Serpette Asynchronous and Deterministic Objects, POPL!04

5 Asynchronous Sequential Process Communication with message-passing Request / Reply No memory sharing Asynchronous request services Request queue Rendezvous Future with wait-by-necessity Confluence and determinacy Causal ordering Activity state characterization Java implementation 1 2 1 2 RDV 3 P1 P2 Denis Caromel and Ludovic Henrio and Bernard Serpette Asynchronous and Deterministic Objects, POPL!04

6 Asynchronous Sequential Process Communication with message-passing Request / Reply No memory sharing Asynchronous request services Request queue Rendezvous Future with wait-by-necessity Confluence and determinacy Causal ordering Activity state characterization Java implementation 1 2 1 1 2 2 1 Futures and WBN 3 P1 P2 Denis Caromel and Ludovic Henrio and Bernard Serpette Asynchronous and Deterministic Objects, POPL!04

7 Asynchronous Sequential Process Communication with message-passing Request / Reply No memory sharing Asynchronous request services Request queue Rendezvous Future with wait-by-necessity Confluence and determinacy Causal ordering Activity state characterization Java implementation 1 2 1 2 3 point-to-point FIFO order 3 1 3 1 2 Denis Caromel and Ludovic Henrio and Bernard Serpette Asynchronous and Deterministic Objects, POPL!04 3 2 causal ordering

8 Asynchronous Sequential Process Communication with message-passing Request / Reply No memory sharing Asynchronous request services Request queue Rendezvous Future with wait-by-necessity Confluence and determinacy Causal ordering Activity state characterization Java implementation 1 2 1 2 3 point-to-point FIFO order 3 1 3 1 2 Denis Caromel and Ludovic Henrio and Bernard Serpette Asynchronous and Deterministic Objects, POPL!04 3 2 causal ordering

9 Asynchronous Sequential Process Communication with message-passing Request / Reply No memory sharing Asynchronous request services Request queue Rendezvous Future with wait-by-necessity http://proactive.inria.fr Confluence and determinacy Causal ordering Activity state characterization Java implementation Denis Caromel and Ludovic Henrio and Bernard Serpette Asynchronous and Deterministic Objects, POPL!04

10 ProActive OO-SPMD Objectives Provide an MPI-like programming model Ease the porting an MPI application to ProActive Give Object Oriented to SPMD model

11 ProActive OO-SPMD Features Asynchronous collective operations Asynchronous barrier Asynchronous group communication Scattering, gathering Take into account topology Optimized algorithm

12 Fault Tolerance with ASP Objectives Transparency No piece of code dedicated to Fault Tolerance in applications Portability No assumption about underlying hardware Consistency

13 Fault Tolerance with ASP Propositions Rollback Recovery TTC + Communication Induced Checkpointing: Transparency No programmer intervention Constrained Checkpointability: Portability No Checkpoint during a service Un-consistency of recovery lines Christian Delbé, PhD Thesis, 2007 Tolérance aux pannes pour objets actifs asynchrones - protocole, modèle et expérimentations

14 Fault Tolerance with ASP Principles of the Protocol Q1 Q4 n Ci[Q1,Q2,Q4] Q2 Q3 Q1 Q2 Q4 i

15 Fault Tolerance with ASP Principles of the Protocol Orphan and In-transit Messages Promised Requests Request Reception History

16 Fault Tolerance with ASP Orphan Messages Ci n+1 Q1 i n Cj[Q0] Q0 n+1 Cj[Q1] Q1 R1 j Q1 is an Orphan Request

17 Fault Tolerance with ASP Promised Request n+1 Q1 i n Q0 n+1 Q1 R1 j Ci n+1 Q1 i n+1 Cj [Qi,j] Qi,j Q1 synchronization with the actual request arrival R1 j

18 Fault Tolerance with ASP In-transit Messages Ci n Cj n Q0 Ck n Q0 R1 Q2 i j k Q0 is an In-transit Request

19 Fault Tolerance with ASP Request Reception History Ci n+1 Cj n+1 Ck n Q0 Q1 Q0 Q1 R1 Ck n+1 i j k

20 Fault Tolerance with ASP Synthesis Orphan Request Replace with a promised request with wait-by-necessity In-transit Requests Reception of such a request is journalized into a request reception history In-transit Reply Can t happen: occur after the reception of an orphan request, so a checkpoint have been performed

21 Benchmarking Fault Tolerance NAS Parallel Benchmarks 5 kernels : EP, CG, FT, MG, IS About 10,000 LOC

22 Benchmarking Fault Tolerance NAS Benchmark : CG.A Checkpoint size (MB) 40.0 30.0 20.0 10.0 0 2 4 8 16 32 30 23 15 8 0 Time (sec) # nodes Checkpoint size (MB) T without checkpoints T standard T with checkpoints

23 Benchmarking Fault Tolerance NAS Benchmark : CG.C Checkpoint size (MB) 400 300 200 100 0 2 4 8 16 32 400 300 200 100 0 Time (sec) # nodes Checkpoint size (MB) T without checkpoints T standard T with checkpoints

24 Thank you for your attention