Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance

Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance
Jonathan Lifflander*, Esteban Meneses, Harshitha Menon*, Phil Miller*, Sriram Krishnamoorthy, Laxmikant V. Kale*
*University of Illinois Urbana-Champaign (UIUC); University of Pittsburgh; Pacific Northwest National Laboratory (PNNL)
September 23, 2014

Deterministic Replay & Fault Tolerance
- Fault tolerance often crosses over into replay territory.
- Popular uses: online fault tolerance, parallel debugging, reproducing results.
- Types of replay:
  - Data-driven replay: application/system data is recorded (e.g., the content of messages sent and received).
  - Control-driven replay: the ordering of events is recorded.
- Our focus in this talk: replay in support of online fault tolerance.

Online Fault Tolerance: Hard Failures
- Researchers have predicted that hard faults will increase at exascale.
- Machines are getting larger: exascale systems are projected to house more than 200,000 sockets.
- Hard failures may be frequent, yet each failure only affects a small percentage of nodes.

Online Fault Tolerance: Approaches
- Checkpoint/restart (C/R): the well-established method. Save a snapshot of the system state and roll back to the previous snapshot in case of failure.
- Motivation to go beyond C/R: if a single node experiences a hard fault, why must all the nodes roll back? Recovering with C/R is expensive at large machine scales, and the cost is hard to reason about because it depends on many factors (e.g., checkpointing frequency).
- Solutions: application-specific fault tolerance, other system-level approaches, and message logging (our approach).

Hard Failure System Model
- P processes communicate via message passing.
- Communication is across non-FIFO channels: messages are sent asynchronously and may arrive out of order.
- A message is guaranteed to arrive sometime in the future if the recipient process has not failed.
- Fail-stop model for all failures: failed processes do not recover, and they do not behave maliciously (non-Byzantine failures).

Sender-Based Causal Message Logging (SB-ML)
- A combination of data-driven and control-driven replay:
  - Data-driven: messages sent are recorded.
  - Control-driven: determinants are recorded to store the order of events.
- Incurs costs in the form of time and storage overhead during forward execution.
- Periodic checkpoints reduce the storage overhead: recovery effort is limited to work executed after the latest checkpoint, and data stored before the checkpoint can be discarded.
- A scalable implementation exists in Charm++.
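To make the data-driven half concrete, the sketch below shows one way a sender-side message log can be kept: the runtime simply retains a pointer to each outgoing message instead of freeing it, and discards the log at the next checkpoint. This is a minimal illustration under those assumptions, not the Charm++ implementation; the types and names are hypothetical.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical message type: payload plus the identifiers SB-ML tracks.
struct Message {
  std::vector<std::uint8_t> payload;
  int dest_pe;        // receiver processor
  std::uint64_t ssn;  // sender sequence number, assigned at send time
};

// Sender-side log: "logging" a message is just keeping it alive.
class SenderLog {
  std::vector<std::shared_ptr<Message>> log_;  // messages sent since the last checkpoint
public:
  // Called on every send; the pointer is saved and the message is not deallocated.
  void record(std::shared_ptr<Message> msg) { log_.push_back(std::move(msg)); }

  // On a failure of processor `pe`, its messages are resent so it can re-execute.
  std::vector<std::shared_ptr<Message>> messages_for(int pe) const {
    std::vector<std::shared_ptr<Message>> out;
    for (const auto& m : log_)
      if (m->dest_pe == pe) out.push_back(m);
    return out;
  }

  // After a checkpoint completes, everything logged before it can be discarded.
  void truncate_at_checkpoint() { log_.clear(); }
};
```

Keeping only pointers makes forward-execution time overhead small, but the retained messages are exactly the memory pressure the later slides point to.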

Example Execution with SB-ML
[Figure: timeline for tasks A-E exchanging messages m1-m7, with a checkpoint, the forward path, a failure, restart from the checkpoint, and recovery by replaying the logged messages.]

Motivation: Overheads with SB-ML
[Figure: progress over time with and without fault tolerance (No FT vs. FT), showing the performance overhead and progress slowdown during forward execution, plus checkpoint, failure, and recovery phases.]

Forward Execution Overhead with SB-ML
- Logging the messages: only requires saving a pointer (the message is simply not deallocated), but it increases memory pressure.
- Determinants, 4-tuples of the form <SPE, SSN, RPE, RSN>:
  - SPE: sender processor
  - SSN: sender sequence number
  - RPE: receiver processor
  - RSN: receiver sequence number
- Determinants must be stored stably according to the reliability requirements: each is propagated to n processors, and unacknowledged determinants are piggybacked onto new messages (to avoid frequent synchronizations).
- Recovery: messages must be replayed in a total order.
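The sketch below spells out the SB-ML determinant and the piggybacking of unacknowledged determinants onto outgoing messages, as described on this slide. It is a simplified illustration: the class names and the acknowledgement handling are assumptions, not the protocol's actual data structures.

```cpp
#include <cstdint>
#include <vector>

// The SB-ML determinant: a 4-tuple <SPE, SSN, RPE, RSN> fixing where a
// received message falls in the receiver's delivery order.
struct Determinant {
  int spe;            // sender processor
  std::uint64_t ssn;  // sender sequence number
  int rpe;            // receiver processor
  std::uint64_t rsn;  // receiver sequence number (position in delivery order)
};

// A hypothetical outgoing message carrying piggybacked determinants.
struct OutgoingMessage {
  std::vector<std::uint8_t> payload;
  std::vector<Determinant> piggybacked;  // unacknowledged determinants
};

// Determinants created locally but not yet acknowledged by the n processors
// that must hold copies; they ride on every send instead of forcing a
// synchronization.
class DeterminantBuffer {
  std::vector<Determinant> unacked_;
public:
  void created(const Determinant& d) { unacked_.push_back(d); }
  void attach_to(OutgoingMessage& msg) const { msg.piggybacked = unacked_; }
  void acknowledged_all() { unacked_.clear(); }  // once n stable copies exist
};
```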

Forward Execution Microbenchmark (SB-ML)

  Component                        Overhead (%)
  Determinants                     84.75
  Bookkeeping                      11.65
  Message-envelope size increase    3.10
  Message storage                   0.50

- Using the LeanMD (molecular dynamics) benchmark, measured on 256 cores of Ranger.
- The largest source of overhead is determinants: creating, storing, sending, etc.

Benchmarks: Runtime System (Charm++)
- The parallel computation is decomposed into objects that communicate.
- There are more objects than processors, and objects communicate by sending messages, so the computation is oblivious to the processors it runs on.
- Benefits: load balancing, message-driven execution, fault tolerance, etc.
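For reference, the sketch below shows the standard Charm++ overdecomposition pattern the benchmarks rely on: a 1D chare array with many more elements than processors, driven entirely by asynchronous entry-method messages. It is a minimal hello-world-style sketch (the module and class names overdecomp, Main, and Worker are illustrative), not code from the benchmarks; the interface (.ci) file contents are shown as a comment for compactness.

```cpp
// --- overdecomp.ci (Charm++ interface file) ---------------------------------
// mainmodule overdecomp {
//   readonly CProxy_Main mainProxy;
//   mainchare Main {
//     entry Main(CkArgMsg* m);
//     entry void done();
//   };
//   array [1D] Worker {
//     entry Worker();
//     entry void work(int iter);
//   };
// };

// --- overdecomp.C ------------------------------------------------------------
#include "overdecomp.decl.h"

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int remaining;
public:
  Main(CkArgMsg* m) {
    delete m;
    int n = 8 * CkNumPes();                  // overdecompose: many more objects than PEs
    remaining = n;
    mainProxy = thisProxy;
    CProxy_Worker workers = CProxy_Worker::ckNew(n);
    workers.work(0);                         // broadcast: message-driven execution
  }
  void done() { if (--remaining == 0) CkExit(); }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage*) {}               // migration constructor for array elements
  void work(int iter) {
    // ... compute on this object's slice of the problem ...
    mainProxy.done();                        // asynchronous message back to Main
  }
};

#include "overdecomp.def.h"
```

Because the objects, not the processors, are the unit of work, the runtime can migrate them for load balancing and restart only the affected objects after a failure.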

Benchmarks: Configuration & Experimental Setup

  Benchmark                       Configuration
  Stencil3D                       matrix size: —, chunk: 64³
  LeanMD (mini-app for NAMD)      600K atoms, 2-away XY, 75 atoms/cell
  LULESH (shock hydrodynamics)    matrix: 1024×512², chunk: 16×8²

- All experiments on Intrepid, an IBM Blue Gene/P (BG/P) system.
- Each node consists of one quad-core 850 MHz PowerPC 450 with 2 GB of DDR2 memory.
- Compiler: IBM XL C/C++ Advanced Edition for Blue Gene/P, V9.0. Runtime: Charm++.

Forward Execution Overhead with SB-ML
[Figure: percent overhead for Stencil3D, LeanMD, and LULESH across core counts up to 132k cores.]
The finer-grained benchmarks, LeanMD and LULESH, suffer from significant overhead.

Reducing the Overhead of Determinants: Design Criteria
- We must maintain full determinism.
- We must degrade gracefully in all cases (even for very non-deterministic programs).
- We need to consider tasks or lightweight objects.

Reducing the Overhead of Determinants: Intrinsic Determinism
Many researchers have noticed that programs have internal determinism:
- Causality tracking (Fidge, 1988: "Partial orders for parallel debugging")
- Racing messages (Netzer et al., 1992: "Optimal tracing and replay for debugging message-passing parallel programs")
- Theoretical races (Damodaran-Kamal, 1993: "Nondeterminacy: testing and debugging in message passing parallel programs")
- Block races (Clemencon, 1995: "An implementation of race detection and deterministic replay with MPI")
- MPI and non-determinism (Kranzlmuller, 2000: "Event graph analysis for debugging massively parallel programs")
- ...
- Send-determinism (Guermouche et al., 2011: "Uncoordinated checkpointing without domino effect for send-deterministic MPI applications")

Our Approach
- In many cases, only a partial order must be stored to achieve full determinism.
- Program = internal determinism + non-determinism + commutativity.
- Internally deterministic events require no determinants.
- Commutative events require no determinants.
- Approach: use determinants to store a partial order only for the non-deterministic events that are not commutative.

Ordering Algebra: Ordered Sets, O
- O(n, d): a set of n events and d dependencies that can be accurately replayed from a given starting point.
- The d dependencies can be among the events in the set, or on preceding events.
- Intuitively, these are ordered sets of events.
- Define a sequencing operation, ∘:
    O(1, d1) ∘ O(1, d2) = O(2, d1 + d2 + 1)
  Intuitively, if we have two atomic events, we need a single extra dependency to tell us which one comes first.
- Generalization:
    O(n1, d1) ∘ O(n2, d2) = O(n1 + n2, d1 + d2 + 1)
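The algebra only tracks counts, so its bookkeeping can be captured in a few lines. The sketch below is an illustrative aid, not part of the paper; the struct and the `seq` helper are made-up names for the (n, d) pairs and the ∘ operation.

```cpp
#include <cassert>

// An ordered set in the algebra: n events replayable with d recorded dependencies.
struct Ordered {
  long n;  // number of events
  long d;  // number of dependencies needed to replay them
};

// Sequencing (the "∘" operation): one extra dependency says which set runs first.
Ordered seq(Ordered a, Ordered b) {
  return Ordered{a.n + b.n, a.d + b.d + 1};
}

int main() {
  // Two atomic events: O(1, 0) ∘ O(1, 0) = O(2, 1).
  Ordered x = seq(Ordered{1, 0}, Ordered{1, 0});
  assert(x.n == 2 && x.d == 1);

  // Sequencing larger sets still costs only one extra dependency.
  Ordered y = seq(Ordered{3, 2}, Ordered{4, 1});
  assert(y.n == 7 && y.d == 4);
  return 0;
}
```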

Ordering Algebra: Unordered Sets, U
- U(n, d): an unordered set of n events and d dependencies.
- Example: several messages sent to a single endpoint. Depending on the order of arrival, the eventual state will be different.
- We decompose this into atomic events with an additional dependency between each successive pair:
    U(n, d) = O(1, d1) ∘ O(1, d2) ∘ ... ∘ O(1, dn) = O(n, d + n - 1),   where d = Σ di
- Result: an additional n - 1 dependencies are required to fully order n events.
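A worked instance of the decomposition above, for three racing messages (n = 3) delivered to one endpoint; it just instantiates the slide's formula using the sequencing rule.

```latex
\begin{align*}
U(3, d) &= O(1, d_1) \circ O(1, d_2) \circ O(1, d_3) \\
        &= O(2,\; d_1 + d_2 + 1) \circ O(1, d_3) \\
        &= O(3,\; d_1 + d_2 + d_3 + 2) = O(3,\; d + 2), \qquad d = d_1 + d_2 + d_3 .
\end{align*}
```

That is, fully ordering the three racing receives costs n - 1 = 2 additional dependencies.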

Ordering Algebra: Interleaving Multiple Independent Sets (the ∥ operator)
- Lemma: any possible interleaving of two ordered sets of events A = O(m, d) and B = O(n, e), where A ∩ B = ∅, is given by:
    O(m, d) ∥ O(n, e) = O(m + n, d + e + min(m, n))
- Lemma: any possible ordering of n ordered sets of events O(m1, d1), O(m2, d2), ..., O(mn, dn), when the sets are disjoint (⋂i O(mi, di) = ∅), can be represented as:
    ∥_{i=1..n} O(mi, di) = O(m, d + m - max_i mi),   where m = Σ mi and d = Σ di
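Plugging small numbers into the two lemmas above makes the extra-dependency counts concrete; the set sizes here are arbitrary choices for illustration.

```latex
% Two independent ordered sets, m = 2 and n = 3 events:
O(2, d) \parallel O(3, e) = O(5,\; d + e + \min(2, 3)) = O(5,\; d + e + 2).

% Three independent ordered sets of sizes 2, 3, 4 (so m = 9 and \max_i m_i = 4):
O(2, d_1) \parallel O(3, d_2) \parallel O(4, d_3)
  = O(9,\; d_1 + d_2 + d_3 + 9 - 4)
  = O(9,\; d + 5), \qquad d = d_1 + d_2 + d_3 .
```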

Internal Determinism, D
- D(n) = O(n, 0): n deterministically ordered events are structurally equivalent to an ordered set of n events with no associated explicit dependencies.
- What happens if we interleave internal determinism with something else?
- With k interruption points, the deterministic run behaves like O(k, k - 1).

Commutative Events, C
- Some events in programs are commutative: regardless of the execution order, the resulting state will be identical.
- All existing message-logging protocols nevertheless record a total order on them.
- However, we can reduce a commutative set to C(n) = O(2, 1): a begin event and an end event sequenced together.
- Sequencing other sets of events around the region simply places them before or after it.
- Interleaving other events puts them into three buckets: (1) before the begin event, (2) during the commutative region, (3) after the end event.
- This corresponds exactly to an ordered set of two events.
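Combining the pieces of the algebra so far: sequencing a commutative region between two internally deterministic segments costs a constant number of dependencies, no matter how many events the region contains. The sizes a, b, and n below are arbitrary and only illustrate the counting.

```latex
D(a) \circ C(n) \circ D(b)
  = O(a, 0) \circ O(2, 1) \circ O(b, 0)
  = O(a + 2,\; 0 + 1 + 1) \circ O(b, 0)
  = O(a + b + 2,\; 3).
```

Only three dependencies are recorded, independent of n, whereas a total-order protocol would record a determinant for every one of the n commutative receives.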

Applying the Theory: PO-REPLAY, a Partial-Order Message Identification Scheme
Properties:
- It tracks causality with Lamport clocks.
- It uniquely identifies a sent message, whether or not its order is transposed.
- It requires exactly the number of determinants and dependencies produced by the ordering algebra.
Determinant composition (3-tuple): <SRN, SPE, CPI>
- SRN: sender region number, incremented for every send outside a commutative region and incremented once when a commutative region starts.
- SPE: sender processor endpoint.
- CPI: commutative path identifier, a sequence of bits that represents the path to the root of the commutative region.
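A sketch of the 3-tuple and the SRN update rule exactly as stated on the slide. The surrounding bookkeeping (how CPI bits are chosen, how regions nest, when determinants are emitted) is simplified and the names are illustrative rather than taken from the implementation.

```cpp
#include <cstdint>
#include <vector>

// PO-REPLAY determinant: <SRN, SPE, CPI>.
struct PODeterminant {
  std::uint64_t srn;        // sender region number
  int spe;                  // sender processor endpoint
  std::vector<bool> cpi;    // commutative path identifier: bit path to the
                            // root of the enclosing commutative region
};

// Minimal sender-side state illustrating the SRN rule from the slide.
class SenderState {
  std::uint64_t srn_ = 0;
  bool in_commutative_ = false;
  std::vector<bool> path_;  // current commutative path bits (assumed nesting)
public:
  void enter_commutative_region(bool branch_bit) {
    // SRN is incremented once when a commutative region starts.
    if (!in_commutative_) { ++srn_; in_commutative_ = true; }
    path_.push_back(branch_bit);
  }
  void leave_commutative_region() {
    if (!path_.empty()) path_.pop_back();
    if (path_.empty()) in_commutative_ = false;
  }
  PODeterminant on_send(int my_pe) {
    // SRN is incremented for every send outside a commutative region.
    if (!in_commutative_) ++srn_;
    return PODeterminant{srn_, my_pe, path_};
  }
};
```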

Experimental Results, Forward Execution Overhead: Stencil3D
[Figure: percent overhead at up to 132k cores for Stencil3D-PartialDetFT vs. Stencil3D-FullDetFT.]
Stencil3D is coarse-grained and shows a small improvement over SB-ML.

Experimental Results, Forward Execution Overhead: LeanMD
[Figure: percent overhead at up to 132k cores for LeanMD-PartialDetFT vs. LeanMD-FullDetFT.]
LeanMD is fine-grained; overhead is reduced from 11-19% to less than 5%.

Experimental Results, Forward Execution Overhead: LULESH
[Figure: percent overhead at up to 132k cores for LULESH-PartialDetFT vs. LULESH-FullDetFT.]
LULESH is medium-grained with many messages; overhead is reduced from 17% to less than 4%.

Experimental Results: Fault Injection
- We measure the recovery time for the different protocols.
- We inject a simulated fault on a random node, at approximately the middle of the checkpoint period.
- We calculate the optimal checkpoint period duration using Daly's formula, assuming a socket count of 64K-1M and an MTBF of 10 years.
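For intuition, the sketch below computes a checkpoint period with the first-order form of Daly's formula, T_opt ≈ sqrt(2·delta·M) − delta, where delta is the checkpoint cost and M is the system MTBF. It assumes the quoted 10-year MTBF is per socket (so the system MTBF shrinks with socket count) and uses a made-up checkpoint cost; the slide does not specify either, and the full formula has additional higher-order terms.

```cpp
#include <cmath>
#include <cstdio>

// First-order form of Daly's optimal checkpoint interval:
//   T_opt ~= sqrt(2 * delta * M) - delta,
// where delta is the time to take one checkpoint and M is the system MTBF.
double daly_first_order(double delta_s, double mtbf_s) {
  return std::sqrt(2.0 * delta_s * mtbf_s) - delta_s;
}

int main() {
  const double per_socket_mtbf_s = 10.0 * 365.0 * 24.0 * 3600.0;  // 10 years, assumed per socket
  const double delta_s = 60.0;  // assumed checkpoint cost (placeholder value)

  for (double sockets : {64.0 * 1024, 256.0 * 1024, 1024.0 * 1024}) {
    double system_mtbf_s = per_socket_mtbf_s / sockets;  // any socket failing counts
    double t_opt = daly_first_order(delta_s, system_mtbf_s);
    std::printf("%8.0f sockets: system MTBF %.0f s, T_opt %.0f s\n",
                sockets, system_mtbf_s, t_opt);
  }
  return 0;
}
```

At 1M sockets the system MTBF under this assumption drops to a few hundred seconds, which is why the checkpoint period, and hence the amount of work that must be replayed after a fault, becomes so short at scale.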

Experimental Results, Recovery Time: Speedup over C/R
[Figure: recovery-time speedup over checkpoint/restart for LeanMD, Stencil3D, and LULESH across core counts.]
LeanMD shows the largest speedup due to its fine-grained, overdecomposed nature. We achieve a recovery-time speedup in all cases.

Experimental Results, Recovery Time: Speedup over SB-ML
[Figure: recovery-time speedup over SB-ML for LeanMD, Stencil3D, and LULESH across core counts.]
The speedup increases with scale, due to SB-ML's expense of coordinating determinants and ordering.

Experimental Results: Summary
- Our new message-logging protocol has less than 5% overhead for the benchmarks tested.
- Recovery is significantly faster than with C/R or causal message logging.
- Depending on the frequency of faults, it may perform better than C/R overall.

Future Work
- More benchmarks.
- Study a broader range of programming models.
- The memory overhead of message logging makes it infeasible for some applications.
- Automated extraction of ordering and interleaving properties.
- Programming language support?

Conclusion
- A comprehensive approach for reasoning about execution orderings and interleavings.
- We observe that the information stored can be reduced in proportion to our knowledge of ordering flexibility.
- Programming paradigms should make this cost model clearer.

Questions?


More information

Fault Tolerance. Distributed Systems IT332

Fault Tolerance. Distributed Systems IT332 Fault Tolerance Distributed Systems IT332 2 Outline Introduction to fault tolerance Reliable Client Server Communication Distributed commit Failure recovery 3 Failures, Due to What? A system is said to

More information

Deterministic Process Groups in

Deterministic Process Groups in Deterministic Process Groups in Tom Bergan Nicholas Hunt, Luis Ceze, Steven D. Gribble University of Washington A Nondeterministic Program global x=0 Thread 1 Thread 2 t := x x := t + 1 t := x x := t +

More information

Efficient Data Race Detection for Unified Parallel C

Efficient Data Race Detection for Unified Parallel C P A R A L L E L C O M P U T I N G L A B O R A T O R Y Efficient Data Race Detection for Unified Parallel C ParLab Winter Retreat 1/14/2011" Costin Iancu, LBL" Nick Jalbert, UC Berkeley" Chang-Seo Park,

More information

Distributed Deadlock

Distributed Deadlock Distributed Deadlock 9.55 DS Deadlock Topics Prevention Too expensive in time and network traffic in a distributed system Avoidance Determining safe and unsafe states would require a huge number of messages

More information

High Performance MPI on IBM 12x InfiniBand Architecture

High Performance MPI on IBM 12x InfiniBand Architecture High Performance MPI on IBM 12x InfiniBand Architecture Abhinav Vishnu, Brad Benton 1 and Dhabaleswar K. Panda {vishnu, panda} @ cse.ohio-state.edu {brad.benton}@us.ibm.com 1 1 Presentation Road-Map Introduction

More information

HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications

HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications Amina Guermouche, Thomas Ropars, Marc Snir, Franck Cappello To cite this version: Amina Guermouche,

More information

Adaptive Cluster Computing using JavaSpaces

Adaptive Cluster Computing using JavaSpaces Adaptive Cluster Computing using JavaSpaces Jyoti Batheja and Manish Parashar The Applied Software Systems Lab. ECE Department, Rutgers University Outline Background Introduction Related Work Summary of

More information

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Philip A. Bernstein Microsoft Research Redmond, WA, USA phil.bernstein@microsoft.com Sudipto Das Microsoft Research

More information

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

VALLIAMMAI ENGINEERING COLLEGE

VALLIAMMAI ENGINEERING COLLEGE VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK II SEMESTER CP7204 Advanced Operating Systems Regulation 2013 Academic Year

More information

Distributed Systems (5DV147)

Distributed Systems (5DV147) Distributed Systems (5DV147) Replication and consistency Fall 2013 1 Replication 2 What is replication? Introduction Make different copies of data ensuring that all copies are identical Immutable data

More information

Concurrent & Distributed Systems Supervision Exercises

Concurrent & Distributed Systems Supervision Exercises Concurrent & Distributed Systems Supervision Exercises Stephen Kell Stephen.Kell@cl.cam.ac.uk November 9, 2009 These exercises are intended to cover all the main points of understanding in the lecture

More information

Noise Injection Techniques to Expose Subtle and Unintended Message Races

Noise Injection Techniques to Expose Subtle and Unintended Message Races Noise Injection Techniques to Expose Subtle and Unintended Message Races PPoPP2017 February 6th, 2017 Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Christopher M. Chambreau

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

Fault Tolerance Techniques for Sparse Matrix Methods

Fault Tolerance Techniques for Sparse Matrix Methods Fault Tolerance Techniques for Sparse Matrix Methods Simon McIntosh-Smith Rob Hunt An Intel Parallel Computing Center Twitter: @simonmcs 1 ! Acknowledgements Funded by FP7 Exascale project: Mont Blanc

More information

CHRONO::HPC DISTRIBUTED MEMORY FLUID-SOLID INTERACTION SIMULATIONS. Felipe Gutierrez, Arman Pazouki, and Dan Negrut University of Wisconsin Madison

CHRONO::HPC DISTRIBUTED MEMORY FLUID-SOLID INTERACTION SIMULATIONS. Felipe Gutierrez, Arman Pazouki, and Dan Negrut University of Wisconsin Madison CHRONO::HPC DISTRIBUTED MEMORY FLUID-SOLID INTERACTION SIMULATIONS Felipe Gutierrez, Arman Pazouki, and Dan Negrut University of Wisconsin Madison Support: Rapid Innovation Fund, U.S. Army TARDEC ASME

More information

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Hao Wang, Jian Huang, Dhabaleswar K. Panda Department of Computer

More information

Distributed Systems (ICE 601) Fault Tolerance

Distributed Systems (ICE 601) Fault Tolerance Distributed Systems (ICE 601) Fault Tolerance Dongman Lee ICU Introduction Failure Model Fault Tolerance Models state machine primary-backup Class Overview Introduction Dependability availability reliability

More information

Heckaton. SQL Server's Memory Optimized OLTP Engine

Heckaton. SQL Server's Memory Optimized OLTP Engine Heckaton SQL Server's Memory Optimized OLTP Engine Agenda Introduction to Hekaton Design Consideration High Level Architecture Storage and Indexing Query Processing Transaction Management Transaction Durability

More information

Alleviating Scalability Issues of Checkpointing

Alleviating Scalability Issues of Checkpointing Rolf Riesen, Kurt Ferreira, Dilma Da Silva, Pierre Lemarinier, Dorian Arnold, Patrick G. Bridges 13 November 2012 Alleviating Scalability Issues of Checkpointing Protocols Overview 2 3 Motivation: scaling

More information

Last Class: RPCs and RMI. Today: Communication Issues

Last Class: RPCs and RMI. Today: Communication Issues Last Class: RPCs and RMI Case Study: Sun RPC Lightweight RPCs Remote Method Invocation (RMI) Design issues Lecture 9, page 1 Today: Communication Issues Message-oriented communication Persistence and synchronicity

More information

Welcome to the 2017 Charm++ Workshop!

Welcome to the 2017 Charm++ Workshop! Welcome to the 2017 Charm++ Workshop! Laxmikant (Sanjay) Kale http://charm.cs.illinois.edu Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana Champaign 2017

More information

Reflections on Failure in Post-Terascale Parallel Computing

Reflections on Failure in Post-Terascale Parallel Computing Reflections on Failure in Post-Terascale Parallel Computing 2007 Int. Conf. on Parallel Processing, Xi An China Garth Gibson Carnegie Mellon University and Panasas Inc. DOE SciDAC Petascale Data Storage

More information

Preliminary Evaluation of a Parallel Trace Replay Tool for HPC Network Simulations

Preliminary Evaluation of a Parallel Trace Replay Tool for HPC Network Simulations Preliminary Evaluation of a Parallel Trace Replay Tool for HPC Network Simulations Bilge Acun 1, Nikhil Jain 1, Abhinav Bhatele 2, Misbah Mubarak 3, Christopher D. Carothers 4, and Laxmikant V. Kale 1

More information

Transparent TCP Recovery

Transparent TCP Recovery Transparent Recovery with Chain Replication Robert Burgess Ken Birman Robert Broberg Rick Payne Robbert van Renesse October 26, 2009 Motivation Us: Motivation Them: Client Motivation There is a connection...

More information

Distributed Systems

Distributed Systems 15-440 Distributed Systems 11 - Fault Tolerance, Logging and Recovery Tuesday, Oct 2 nd, 2018 Logistics Updates P1 Part A checkpoint Part A due: Saturday 10/6 (6-week drop deadline 10/8) *Please WORK hard

More information

Communication and Topology-aware Load Balancing in Charm++ with TreeMatch

Communication and Topology-aware Load Balancing in Charm++ with TreeMatch Communication and Topology-aware Load Balancing in Charm++ with TreeMatch Joint lab 10th workshop (IEEE Cluster 2013, Indianapolis, IN) Emmanuel Jeannot Esteban Meneses-Rojas Guillaume Mercier François

More information