Scalable In-memory Checkpoint with Automatic Restart on Failures

1 Scalable In-memory Checkpoint with Automatic Restart on Failures
Xiang Ni, Esteban Meneses, Laxmikant V. Kalé
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
November, th Joint Lab Workshop
1 / 35

2 Problem
[Figure: number of sockets by year (Sequoia: 98,304; exascale: predicted), and machine MTBF in hours for a per-socket MTBF of 10, 5, 2, and 1 years]
Increasing number of sockets
More frequent failures
The MTBF of the exascale machine will be 720 seconds if the MTBF per socket remains at 5 years.
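The 720-second figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes that socket failures are independent with a constant rate, so the machine MTBF is the per-socket MTBF divided by the number of sockets. That assumption and the function names are illustrative and are not taken from the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def system_mtbf_seconds(socket_mtbf_years: float, num_sockets: int) -> float:
    """Machine MTBF under the assumption of independent, constant-rate socket failures."""
    return socket_mtbf_years * SECONDS_PER_YEAR / num_sockets


def sockets_for_system_mtbf(socket_mtbf_years: float, system_mtbf_s: float) -> float:
    """Socket count at which the machine MTBF drops to system_mtbf_s (same assumption)."""
    return socket_mtbf_years * SECONDS_PER_YEAR / system_mtbf_s


if __name__ == "__main__":
    # Sequoia's socket count comes from the slide; the 5-year socket MTBF is the slide's scenario.
    print("Sequoia-sized machine MTBF (s):", system_mtbf_seconds(5, 98304))
    # Socket count implied by a 720 s machine MTBF at 5 years per socket.
    print("Sockets implied by 720 s:", sockets_for_system_mtbf(5, 720))
```

Under this simple model, a 720-second machine MTBF at 5 years per socket corresponds to a socket count in the low hundreds of thousands, roughly twice Sequoia's.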

3 Our Philosophy
Runtime system support for fault tolerance:
  Checkpoint/Restart
  Message Logging
  Proactive Migration

4 Our Philosophy
Runtime system support for fault tolerance
Maintain the progress rate despite failures
  Optimize for the common case
  Minimize performance overhead
[Figure: application progress (up to 100%) over time with and without fault tolerance support, marking checkpoint, slowdown, failure, and recovery]

5 Optimize for the common case
Failures rarely bring down more than one node.
In Jaguar (now Titan, the top-ranked supercomputer), 92.27% of failures are individual node crashes.
[Figure: frequency (%) of failures affecting 1, 2, 3, 4, and more than 4 nodes, for System 12, System 18, System 19, System 20, System 21, MPP2, Tsubame, and Mercury]

7 Minimize performance overhead
Decrease interference with application
Parallel recovery
Automatic restart:
  Failure detection in runtime system
  Immediate rollback-recovery
Faster checkpoint
[Figure: average waiting time (s) vs. number of cores required]

8 Checkpoint and Restart for LeanMD
[Figure: LeanMD checkpoint time on BlueGene/Q, time (ms) vs. number of processes; series: 2.8 million, 1.6 million]
[Figure: LeanMD restart time on BlueGene/Q, time (ms) vs. number of processes; series: 2.8 million, 1.6 million]

9 Limitation of Checkpoint/Restart
Increase in memory size per year: 41%
Increase in network bandwidth per year: 26%
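To see why these two growth rates matter, consider how the gap compounds. The sketch below assumes that checkpoint size tracks memory size and that the time to ship a checkpoint to another node scales as size divided by network bandwidth. Both are simplifying assumptions for illustration, and `growth_gap` is a hypothetical helper name.

```python
def growth_gap(years: int, mem_rate: float = 0.41, net_rate: float = 0.26) -> float:
    """Factor by which (memory size / network bandwidth) grows after `years`.

    Assumes both quantities compound annually at the rates quoted on the slide.
    """
    return ((1 + mem_rate) / (1 + net_rate)) ** years


if __name__ == "__main__":
    for years in (1, 5, 10):
        # A value above 1 means transferring a full-memory checkpoint takes
        # proportionally longer than it does today.
        print(years, "years:", round(growth_gap(years), 2))
```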

10 Outline
1 Checkpoint/Restart
    Synchronous
    Asynchronous
    Model
2 Minimize checkpoint interference to application
    Priority sending queue
    Opportunistic vs. random scheduling
3 Relieve memory pressure with SSD
4 Experiments
5 Conclusion

12 Checkpoint/Restart: Synchronous Checkpoint
[Figure: timeline of NODE 1 and NODE 2 between barriers; each node holds its own checkpoint (α, β) and its buddy's; legend: $\tau_{blocking}$ checkpoint interval, $\delta_{blocking}$ checkpoint overhead]
Each node has a buddy node to store the checkpoint.
Computation resumes after all the nodes have successfully saved their checkpoints in their buddy nodes.
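The buddy scheme can be illustrated with a small single-process simulation. This is a minimal sketch and not the Charm++ implementation: the ring pairing in `buddy_of`, the `Node` class, and the recovery steps are assumptions chosen to keep the example short.

```python
import copy


class Node:
    def __init__(self, rank, state):
        self.rank = rank
        self.state = state
        self.local_ckpt = None   # this node's own checkpoint
        self.buddy_ckpt = None   # checkpoint held on behalf of another node


def buddy_of(rank, n):
    # Hypothetical pairing: the next rank in a ring stores our checkpoint.
    return (rank + 1) % n


def checkpoint_all(nodes):
    """Synchronous checkpoint: every node saves locally, then to its buddy."""
    n = len(nodes)
    for node in nodes:
        node.local_ckpt = copy.deepcopy(node.state)
    for node in nodes:
        nodes[buddy_of(node.rank, n)].buddy_ckpt = copy.deepcopy(node.local_ckpt)
    # Only after every buddy copy is stored would computation resume.


def recover(nodes, failed_rank):
    """Replace the failed node from its buddy's copy and roll everyone back."""
    n = len(nodes)
    buddy = nodes[buddy_of(failed_rank, n)]
    replacement = Node(failed_rank, copy.deepcopy(buddy.buddy_ckpt))
    replacement.local_ckpt = copy.deepcopy(buddy.buddy_ckpt)
    # The failed node also held a copy for its predecessor; rebuild it.
    predecessor = nodes[(failed_rank - 1) % n]
    replacement.buddy_ckpt = copy.deepcopy(predecessor.local_ckpt)
    nodes[failed_rank] = replacement
    for node in nodes:
        node.state = copy.deepcopy(node.local_ckpt)


if __name__ == "__main__":
    nodes = [Node(r, {"rank": r, "step": 0}) for r in range(4)]
    checkpoint_all(nodes)
    for node in nodes:
        node.state["step"] = 7      # work done after the checkpoint
    recover(nodes, failed_rank=2)   # node 2 crashes and is rebuilt
    print([node.state for node in nodes])
```

Because every checkpoint exists in two places, a single node crash never loses the last checkpoint, which is the common case the talk optimizes for.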

13 Checkpoint/Restart: Asynchronous
Solution: Asynchronous Checkpoint
[Figure: timeline of NODE 1 and NODE 2 from the barrier through "local checkpoint done" to "remote checkpoint done"; legend: $\tau$ checkpoint interval, $\delta$ local checkpoint overhead, $\theta$ overlap period, $\phi$ remote checkpoint interference]
Resume computation as soon as each node stores its own checkpoint (local checkpoint).
Interleave the transmission of the checkpoint to the buddy with application execution (remote checkpoint).

14 Checkpoint/Restart: Asynchronous
Drawback
A failure while checkpointing overlaps with the application may force a rollback to the previous checkpoint.
The remote checkpoint interferes with the application.

15 Checkpoint/Restart: Model
$T = T_s + T_{local} + T_{overhead} + T_{rework} + T_{restart}$
[Figure: asynchronous checkpoint timeline of NODE 1 and NODE 2 (barrier, local checkpoint done, remote checkpoint done)]
$T_s$: workload
$T_{local}$: time for local checkpoint
$T_{overhead}$: interference of global checkpoint
$T_{rework}$: lost work for application
$T_{restart}$: time to restart application

17 Checkpoint/Restart: Model, Local Checkpoint Overhead
[Figure: asynchronous checkpoint timeline of NODE 1 and NODE 2 (barrier, local checkpoint done, remote checkpoint done)]
Local checkpoint: $T_{local}$
Checkpoint interval: $\tau$
Work finished in one checkpoint interval: $\tau - \phi$
Number of checkpoints: $\frac{T_s}{\tau - \phi}$
Local checkpoint overhead for one checkpoint: $\delta$
$T_{local} = \frac{T_s}{\tau - \phi}\,\delta$

18 Checkpoint/Restart: Model, Remote Checkpoint Interference
[Figure: asynchronous checkpoint timeline of NODE 1 and NODE 2 (barrier, local checkpoint done, remote checkpoint done)]
Interference of remote checkpoint to application: $T_{overhead}$
Interference for one checkpoint: $\phi$
$T_{overhead} = \frac{T_s}{\tau - \phi}\,\phi$, where $\tau$ is the checkpoint interval

23 Checkpoint/Restart: Model, Rework and Restart Time
[Figure: single-node timeline with two failures relative to "remote checkpoint done"; Failure 1 triggers a rollback]
Rework time: $T_{rework}$
Number of failures: $\frac{T}{MTBF}$
Failure 1: at least $\theta - \phi$
Failure 2: at most $\theta - \phi + \tau + \delta$
$T_{rework} = \frac{T}{MTBF}\left(\frac{\tau + \delta}{2} + \theta - \phi\right)$, where $\tau$ is the checkpoint interval
Restart time: $T_{restart}$
Restart time for one failure: $R$
$T_{restart} = \frac{T}{MTBF}\,R$

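24 Checkpoint/Restart: Model
$T = T_s + \frac{T_s}{\tau - \phi}\,\delta + \frac{T_s}{\tau - \phi}\,\phi + \frac{T}{MTBF}\left(R + \frac{\tau + \delta}{2} + \theta - \phi\right)$
$T_{blocking} = T_s + \frac{T_s}{\tau_{blocking}}\,\delta_{blocking} + \frac{T_{blocking}}{MTBF}\left(R + \frac{\tau_{blocking} + \delta_{blocking}}{2}\right)$
$Benefit = \frac{T_{blocking} - T}{T_{blocking}}$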
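Since the total time T appears on both sides of the model (the number of failures is T/MTBF), it helps to solve for it explicitly. The sketch below does so in closed form, writing tau for the checkpoint interval, delta for the local checkpoint overhead, phi for the remote checkpoint interference, theta for the overlap period, and R for the restart time of one failure. The function names and the sample numbers are illustrative assumptions, not values from the talk.

```python
def semi_blocking_time(t_s, tau, delta, phi, theta, restart, mtbf):
    """Solve T = T_s + T_s/(tau-phi)*(delta+phi) + T/MTBF*(R + (tau+delta)/2 + theta - phi)."""
    if tau <= phi:
        raise ValueError("checkpoint interval must exceed the interference")
    failure_free = t_s * (1 + (delta + phi) / (tau - phi))
    cost_per_failure = restart + (tau + delta) / 2 + theta - phi
    if cost_per_failure >= mtbf:
        raise ValueError("no forward progress: each failure costs more than the MTBF")
    return failure_free / (1 - cost_per_failure / mtbf)


def blocking_time(t_s, tau_b, delta_b, restart, mtbf):
    """Solve T_b = T_s + T_s/tau_b*delta_b + T_b/MTBF*(R + (tau_b+delta_b)/2)."""
    failure_free = t_s * (1 + delta_b / tau_b)
    cost_per_failure = restart + (tau_b + delta_b) / 2
    if cost_per_failure >= mtbf:
        raise ValueError("no forward progress: each failure costs more than the MTBF")
    return failure_free / (1 - cost_per_failure / mtbf)


def benefit(t_blocking, t):
    """Relative reduction in total time versus the blocking protocol."""
    return (t_blocking - t) / t_blocking


if __name__ == "__main__":
    # Hypothetical parameters, all in seconds.
    t = semi_blocking_time(t_s=3600, tau=100, delta=1, phi=2, theta=10, restart=5, mtbf=1000)
    t_b = blocking_time(t_s=3600, tau_b=100, delta_b=8, restart=5, mtbf=1000)
    print("semi-blocking:", t, "blocking:", t_b, "benefit:", benefit(t_b, t))
```

The closed form makes the trade-off visible: the semi-blocking protocol shrinks the per-checkpoint cost but adds the term theta - phi to the work lost per failure.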

26 Minimize checkpoint interference to application: Charm++ Runtime System
Object-based over-decomposition
Asynchronous method invocation
Migratable-object runtime system
Worker thread and communication thread
[Figure: system implementation vs. user view]

27 Minimize checkpoint interference to application: Priority sending queue
Minimize Checkpoint Interference
[Figure: sending queue holding application messages and checkpoint messages, interleaved with computation]
Separate checkpoint message queue
Send a checkpoint message only when there is no application message ready to be sent
Better overlap with computation
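The rule on this slide can be sketched as two FIFO queues in which the checkpoint queue is drained only when the application queue is empty. This is an illustrative model rather than the Charm++ communication thread; the class and method names are hypothetical.

```python
from collections import deque


class PrioritySender:
    """Two sending queues; application messages always go first."""

    def __init__(self):
        self.app_queue = deque()
        self.ckpt_queue = deque()

    def enqueue_app(self, msg):
        self.app_queue.append(msg)

    def enqueue_checkpoint(self, msg):
        self.ckpt_queue.append(msg)

    def next_message(self):
        """Return the next message to put on the wire, or None if idle."""
        if self.app_queue:
            return self.app_queue.popleft()
        if self.ckpt_queue:
            # Checkpoint traffic only fills otherwise idle sending slots.
            return self.ckpt_queue.popleft()
        return None


if __name__ == "__main__":
    sender = PrioritySender()
    sender.enqueue_checkpoint("ckpt-chunk-0")
    sender.enqueue_checkpoint("ckpt-chunk-1")
    sender.enqueue_app("app-msg-0")
    while (msg := sender.next_message()) is not None:
        print(msg)
```

In this model a steady stream of application messages can postpone checkpoint chunks indefinitely, which lengthens the overlap period.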

28 Minimize checkpoint interference to application: Opportunistic vs. random scheduling
[Figure: interference (s) and benefit (%) vs. overlap (s), with the opportunistic scheme marked]
Use lottery scheduling to change the overlap period
Decide probabilistically whether the application queue or the checkpoint queue may send the next message
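One way to picture the lottery variant: when both queues have a message ready, a random draw weighted by ticket counts decides which one sends. The sketch below is an assumption-laden illustration; the ticket parameters and the function name are hypothetical, and the talk does not specify how the probabilities are chosen.

```python
import random
from collections import deque


def lottery_pick(app_queue, ckpt_queue, ckpt_tickets, app_tickets, rng):
    """Pop and return the next message to send, or None if both queues are empty.

    More checkpoint tickets drain the checkpoint queue sooner (shorter overlap,
    more interference); zero checkpoint tickets reproduces the opportunistic rule.
    """
    if not app_queue and not ckpt_queue:
        return None
    if not app_queue:
        return ckpt_queue.popleft()
    if not ckpt_queue:
        return app_queue.popleft()
    total = ckpt_tickets + app_tickets
    if total <= 0:
        raise ValueError("at least one ticket is required")
    if rng.randrange(total) < ckpt_tickets:
        return ckpt_queue.popleft()
    return app_queue.popleft()


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    app = deque(f"app-{i}" for i in range(5))
    ckpt = deque(f"ckpt-{i}" for i in range(5))
    order = []
    while app or ckpt:
        order.append(lottery_pick(app, ckpt, ckpt_tickets=1, app_tickets=3, rng=rng))
    print(order)
```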

30 Relieve memory pressure with SSD: Choose Data to Store in SSD
Solid State Drive: becoming increasingly available on individual nodes
Full SSD strategy
Half SSD strategy
  Only store the remote checkpoint in SSD
  Faster checkpoint and restart

31 Relieve memory pressure with SSD: Asynchronous Checkpointing to SSD with IO thread
[Figure: several worker threads send requests to an IO thread, which writes to the SSD and reports back when the checkpoint finishes]
IO threads
  Write the checkpoint to, or read the checkpoint from, the SSD when a request arrives from a worker thread.
  Notify the worker thread when the SSD has completed the request.
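The worker/IO-thread handoff can be sketched with a request queue and a completion event. This is a minimal illustration using Python threads and an ordinary file as a stand-in for the SSD; it covers only the write path, and the class and method names are hypothetical rather than taken from the runtime.

```python
import os
import queue
import tempfile
import threading


class IOThread(threading.Thread):
    """Dedicated thread that performs checkpoint writes for worker threads."""

    def __init__(self):
        super().__init__(daemon=True)
        self.requests = queue.Queue()

    def run(self):
        while True:
            request = self.requests.get()
            if request is None:          # shutdown sentinel
                break
            path, data, done = request
            with open(path, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())     # make sure the data reached the device
            done.set()                   # notify the waiting worker

    def write_async(self, path, data):
        """Queue a write and return an Event that is set on completion."""
        done = threading.Event()
        self.requests.put((path, data, done))
        return done

    def shutdown(self):
        self.requests.put(None)
        self.join()


def demo():
    payload = b"checkpoint-bytes" * 4
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ckpt_rank0.bin")
        io_thread = IOThread()
        io_thread.start()
        done = io_thread.write_async(path, payload)
        # A worker thread would keep computing here instead of blocking.
        finished = done.wait(timeout=10)
        io_thread.shutdown()
        with open(path, "rb") as f:
            return finished, f.read() == payload


if __name__ == "__main__":
    print(demo())
```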

33 Experiments
Machine
  SDSC core nodes
  120 GB flash memory (SSD) per node
  100 Teraflops
Applications
  Wave2D: stencil computation
  ChaNGa: N-Body simulation

34 Experiments: Single Checkpoint Overhead
[Figure: checkpoint overhead (s) vs. number of cores for blocking and semi-blocking checkpoint; panels: Wave2D weak scaling, ChaNGa strong scaling]
Semi-blocking checkpoint reduces checkpoint overhead significantly.

35 Experiments: Semi-Blocking Benefit
[Figure: benefit (%) vs. number of cores for MTBF values of 300 s, 600 s, 900 s, 1200 s, 1500 s, and 1800 s; panels: Wave2D weak scaling, ChaNGa strong scaling]
Semi-blocking checkpoint reduces the total execution time by up to 22%.

36 Experiments: Checkpoint/Restart on SSD
[Figure: timing penalty (s) vs. checkpoint size per node (GB) for half aio, full aio, half sio, and full sio]
[Figure: restart time (s) vs. checkpoint size per node (GB) for in memory, half aio, and full aio]
aio: asynchronous IO; sio: synchronous IO
The half SSD strategy with asynchronous IO reduces the timing penalty for checkpointing to SSD.
Restart from SSD does not incur extra overhead.

38 Conclusion
Asynchronous checkpointing can help hide the checkpoint overhead.
SSD can be used in checkpointing to relieve memory pressure with little overhead.

39 Conclusion: Future Work
Log analysis is very helpful
  Failure distributions
  Cluster usage
Failure prediction with different fault tolerance actions
  Proactive migration
  Proactive checkpoint
Multilevel checkpointing

40 Conclusion Thank you! 34 / 35

41 Conclusion: Increasing Checkpoint Overhead
[Figure: checkpoint overhead (s) vs. number of cores per node]
Checkpoint size: 16 MB per core
Checkpoint data is sent to another node across the network
