Alleviating Scalability Issues of Checkpointing


1 Alleviating Scalability Issues of Checkpointing Protocols
Rolf Riesen, Kurt Ferreira, Dilma Da Silva, Pierre Lemarinier, Dorian Arnold, Patrick G. Bridges
13 November 2012

2 Overview


4 Motivation: scaling
[Figure: elapsed time in hours, broken down into work, checkpoint, rework, and restart. Panels: no redundancy and double redundant. Horizontal axes: number of sockets and number of node pairs.]

5 Motivation: message log growth
- Coordinated checkpoint/restart
- Redundant computing
- Uncoordinated checkpointing with message logging
  - Messages must be logged to allow deterministic restart.
  - Message log size is a concern.
  - Latency and message rate are a problem for pessimistic logging.

Application   Log growth rate per MPI process
HPCCG          0.6 MB/s
LAMMPS         1.5 MB/s
SAGE          13.0 MB/s
CTH           40.0 MB/s
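To give a feel for why log size is a concern, the sketch below multiplies the per-process growth rates from the table by a checkpoint interval. The one-hour interval and the assumption that the rate stays constant and the log can be discarded at each checkpoint are illustrative choices, not figures from the slides.

```python
# Log growth rates per MPI process in MB/s, as listed on the slide.
LOG_GROWTH_MB_PER_S = {
    "HPCCG": 0.6,
    "LAMMPS": 1.5,
    "SAGE": 13.0,
    "CTH": 40.0,
}


def log_size_mb(rate_mb_per_s: float, interval_s: float) -> float:
    """Log volume one process accumulates over interval_s at a constant rate."""
    return rate_mb_per_s * interval_s


if __name__ == "__main__":
    interval_s = 3600.0  # assumption: one hour between checkpoints
    for app, rate in LOG_GROWTH_MB_PER_S.items():
        size_gb = log_size_mb(rate, interval_s) / 1000.0
        print(f"{app}: {size_gb:.2f} GB per process per interval")
```

Under these assumptions the per-process log scales linearly with both the growth rate and the time between checkpoints, which is why bounding the interval bounds the log.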

6 Protocol

7 Combine coordinated C/R with optimistic message logging
N = Sc + Ss + Sl nodes
Only restart the entire application when:
1. a node or logger fails and no more spares are available,
2. a logger fails and its log data is needed, or
3. an event is lost but a later message has been successfully delivered.
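The three restart conditions can be written as a small decision function. This is a minimal sketch, not the authors' implementation: the `FailureEvent` type and its field names are hypothetical, and reading Sc, Ss, and Sl as compute, spare, and logger node counts is an assumption, since the slide does not define them.

```python
from dataclasses import dataclass


@dataclass
class FailureEvent:
    """Hypothetical description of the system state when a failure is seen."""
    kind: str                              # "node" or "logger"
    spares_left: int                       # spare nodes still available
    log_data_needed: bool = False          # failed logger held data needed for recovery
    event_lost: bool = False               # a receive event never reached a logger
    later_message_delivered: bool = False  # a message after the lost event was delivered


def total_nodes(sc: int, ss: int, sl: int) -> int:
    """N = Sc + Ss + Sl (assumed: compute + spare + logger nodes)."""
    return sc + ss + sl


def must_restart_all(f: FailureEvent) -> bool:
    """Return True when the whole application has to roll back."""
    # 1. A node or logger failed and no spare can take its place.
    if f.kind in ("node", "logger") and f.spares_left == 0:
        return True
    # 2. A logger failed and its log data is needed.
    if f.kind == "logger" and f.log_data_needed:
        return True
    # 3. An event was lost but a later message was already delivered.
    if f.event_lost and f.later_message_delivered:
        return True
    return False
```

In every other case, under this reading, only the failed node is restarted from its local checkpoint and the logged messages are replayed to it.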

8 Outline: Implementation, Elapsed, Socket hours, I/O bandwidth, Break even, Spares

9 Implementation

10 Results: elapsed time
[Figure: elapsed time versus number of compute sockets for coordinated, combined protocol, and 2x replication. Annotations: predicted range of exascale systems; area for improvement.]
Elapsed time for a 168-hour job, 5-year socket MTBF.
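To see why elapsed time climbs with socket count, it helps to convert the 5-year socket MTBF into a whole-system MTBF. The sketch below uses the common approximation that failures are independent and exponentially distributed, so the system MTBF is the socket MTBF divided by the number of sockets. That model, the 8,760-hour year, and the function names are assumptions for illustration; the slides do not state the failure model used in the simulation.

```python
HOURS_PER_YEAR = 8760.0  # assumption: 365-day year


def system_mtbf_hours(socket_mtbf_years: float, n_sockets: int) -> float:
    """System MTBF under independent, exponentially distributed socket failures."""
    return socket_mtbf_years * HOURS_PER_YEAR / n_sockets


def expected_failures(work_hours: float, socket_mtbf_years: float, n_sockets: int) -> float:
    """Expected number of failures during work_hours of failure-free run time."""
    return work_hours / system_mtbf_hours(socket_mtbf_years, n_sockets)


if __name__ == "__main__":
    # 168-hour job and 5-year socket MTBF, as in the slide caption.
    for n in (500, 5_000, 50_000):
        mtbf = system_mtbf_hours(5.0, n)
        fails = expected_failures(168.0, 5.0, n)
        print(f"{n} sockets: system MTBF {mtbf:.3f} h, about {fails:.1f} failures")
```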

11 Results: elapsed, various MTBF
[Figure: elapsed time in hours versus number of compute sockets (50k to 300k) and socket MTBF in years, for coordinated, combined, and 2x replication.]

12 Results: socket hours
[Figure: normalized socket hours (0% to 200%) versus number of compute sockets, for 2x replication, coordinated, and combined protocol.]

13 Results: socket hours, various MTBF
[Figure: normalized socket hours (0% to 200%) versus number of compute sockets (50k to 300k) and socket MTBF in years, for combined, 2x replication, and coordinated (normalized).]

14 Results: impact of I/O bandwidth
[Figure: elapsed time versus number of compute sockets for coordinated, combined protocol, and 2x replication.]
Same as slide 10 but with aggregate I/O of 30 TB/s instead of 0.5 TB/s.

15 Results: break-even point
[Figure: number of sockets for break even (up to 280,000) versus socket MTBF (years), one curve per aggregate I/O bandwidth; legible curve labels: 10 TB/s, 5.0 TB/s, 1.0 TB/s, 0.5 TB/s.]

16 Results: number of spares needed
[Figure: spares versus number of compute sockets (up to 280,000) and aggregate I/O bandwidth (TB/s).]

17 Outline: Library, Benchmarks

18 Library implementation
- A user-configurable number of ranks is set aside as logger nodes.
- Send payload data is tracked and saved on the local node in the host's memory.
- All point-to-point communications contain message ID information.
- An event for each receive is sent to a logger node.
- mlmpi emulates local checkpoints, failures, and node restart.
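The bookkeeping described above can be illustrated without MPI. The following is a toy, single-process model and not mlmpi's code: the `Rank` and `Logger` classes and all method names are hypothetical. It shows the three pieces the slide names: payloads kept in the sender's memory, a message ID on every point-to-point message, and a receive event forwarded to a logger.

```python
class Logger:
    """Stand-in for a logger rank: stores receive events in arrival order."""

    def __init__(self):
        self.events = []  # list of (receiver, sender, msg_id)

    def record(self, receiver, sender, msg_id):
        self.events.append((receiver, sender, msg_id))


class Rank:
    """Stand-in for an application rank with sender-side payload logging."""

    def __init__(self, rank_id, logger):
        self.rank_id = rank_id
        self.logger = logger
        self.next_id = 0
        self.payload_log = {}  # msg_id -> (dest rank, payload), kept in local memory
        self.inbox = []        # delivered but not yet received messages

    def send(self, dest, payload):
        # Tag the message with an ID and keep the payload locally.
        msg_id = self.next_id
        self.next_id += 1
        self.payload_log[msg_id] = (dest.rank_id, payload)
        dest.inbox.append((self.rank_id, msg_id, payload))
        return msg_id

    def recv(self):
        # Deliver the oldest message and report the receive event to the logger.
        sender, msg_id, payload = self.inbox.pop(0)
        self.logger.record(self.rank_id, sender, msg_id)
        return payload

    def replay_to(self, dest_id):
        # Payloads previously sent to dest_id, in send order, for a restarted rank.
        return [p for _, (d, p) in sorted(self.payload_log.items()) if d == dest_id]
```

A short usage example: after `r0.send(r1, b"a")` and `r1.recv()`, the logger holds the event `(1, 0, 0)`, and `r0.replay_to(1)` returns the saved payloads that a restarted rank 1 would need, ordered by message ID.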

19 mlmpi micro-benchmarks
[Figure: two panels, bandwidth (MB/sec) and latency versus message size (bytes), comparing native and mlmpi.]

20 Outline: Pros and cons, Thanks

21 Pros and cons
- The new protocol works well compared to coordinated checkpointing alone.
- It is almost as good as redundant computing, but uses a fraction of the nodes.
It works because:
- Most vulnerable part: the few loggers.
- Optimistic logging with a clear fall-back position, if necessary.
- Can bound message log size.
Drawbacks:
- Tricky to implement fully and efficiently.
- Not suitable for all applications.
- Still requires message logs.

22 Thank you
