The Walking Dead Michael Nitschinger

1 The Walking Dead: A Survival Guide to Resilient Reactive Applications, Michael Nitschinger

2 The Right Mindset

3 "The more you sweat in peace, the less you bleed in war." (U.S. Marine Corps)

6 Not so fast, mister fancy tests!

7 Always ask yourself: What can go wrong?

8 Fault Tolerance 101

9 Fault -> Error -> Failure: A fault is a latent defect that can cause an error when activated.

10 Fault -> Error -> Failure: Errors are the manifestations of faults.

11 Fault -> Error -> Failure: A failure occurs when the service no longer complies with its specifications.

12 Fault -> Error -> Failure: Errors are inevitable. We need to detect them, recover from them, and mitigate them before they become failures.

13 Reliability is the probability that a system will perform failure-free for a given amount of time. MTTF: Mean Time To Failure. MTTR: Mean Time To Repair.

14 Availability is the percentage of time the system is able to perform its function: availability = MTTF / (MTTF + MTTR)
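The formula on this slide can be checked with a few lines of Python. This is only a sketch: the function name and the example figures (999 hours between failures, one hour to repair) are made up for illustration and do not come from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures: one failure every 999 hours, one hour to repair.
print(f"{availability(999, 1):.3%}")  # 99.900%
```

Note that halving the repair time improves availability about as much as doubling the time between failures, which is why fast recovery matters as much as failure prevention.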

15 Expression         Availability   Downtime/Year
   Three 9s           99.9%          525.6 min
   Four 9s            99.99%         52.56 min
   Four 9s and a 5    99.995%        26.28 min
   Five 9s            99.999%        5.256 min
   Six 9s             99.9999%       0.5256 min
   100%               100%           0
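The downtime column follows directly from the availability percentage. A small sketch, assuming a 365-day year (the constant and function name are illustrative, not from the slides):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year during which the system may be unavailable."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} min/year")
```

Each additional nine cuts the allowed downtime by a factor of ten, so "five 9s" leaves only a little over five minutes per year.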

16 Pop Quiz! Wanted: 99.99% availability for the Edge Service. User Service: ???, Session Store: ???, Data Warehouse: ???

17 Pop Quiz! Wanted: 99.99% availability for the Edge Service. User Service: 99.99%, Session Store: 99.99%, Data Warehouse: 99.99%

18 Pop Quiz! Wanted: 99.99% availability for the Edge Service. User Service: ~99.999%, Session Store: ~99.999%, Data Warehouse: ~99.999%
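The reasoning behind the quiz can be sketched numerically. The sketch assumes that the Edge Service needs all three dependencies to answer a request, that their failures are independent, and that the Edge Service itself never fails; none of these assumptions is stated on the slides. Under them, the combined availability is the product of the individual ones.

```python
from math import prod


def serial_availability(parts):
    """Availability of a chain in which every part must be up."""
    return prod(parts)


def required_per_dependency(target, count):
    """Availability each of `count` equal dependencies needs to reach `target`."""
    return target ** (1 / count)


print(f"{serial_availability([0.9999] * 3):.6f}")   # three dependencies at 99.99%
print(f"{serial_availability([0.99999] * 3):.6f}")  # three dependencies at 99.999%
print(f"{required_per_dependency(0.9999, 3):.6f}")  # what each one really needs
```

Three dependencies at 99.99% each give roughly 99.97% combined, which misses the target. Each dependency has to reach about 99.997%, which is in line with the "~99.999%" on the slide once rounded up to the next nine.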

19 Fault-Tolerant Architecture

20 Units of Mitigation are the basic units of error containment and recovery.

21 Escalation is used when recovery or mitigation is not possible inside the unit.

22-25 Escalation [diagram, repeated over four build slides: units labeled Cluster, Node, Service, and Endpoint]

26 Redundancy [chart with axes Cost and Time To Recover; options shown: Active/Active, Active/Standby, N+M, Active/Passive]

27 The Fault Observer receives system and error events and can guide and orchestrate detection and recovery. [diagram: Units, Observer, Listeners]
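One possible shape for such an observer is a small publish/subscribe hub: units report events, and listeners registered for an event kind react to them. The class, method names, and event kinds below are hypothetical and only illustrate the idea; they are not taken from the talk or from any library.

```python
from collections import defaultdict


class FaultObserver:
    """Collects events from units and forwards them to subscribed listeners."""

    def __init__(self):
        self._listeners = defaultdict(list)  # event kind -> list of callbacks

    def subscribe(self, kind, listener):
        self._listeners[kind].append(listener)

    def report(self, kind, unit, detail):
        # .get avoids creating entries for kinds nobody listens to.
        for listener in self._listeners.get(kind, ()):
            listener(unit, detail)


needs_recovery = []
observer = FaultObserver()
observer.subscribe("error", lambda unit, detail: needs_recovery.append(unit))

observer.report("error", "endpoint-1", "connection reset")
observer.report("info", "endpoint-2", "connected")
print(needs_recovery)  # ['endpoint-1']
```

Keeping the observer separate from the units means recovery policy can change without touching the code that detects errors.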

30 Detecting Errors

31 A silent system is a dead system.

32 A System Monitor helps to study behaviour and to make sure it is operating as specified.

34 Periodic Checking. Heartbeats: monitor tasks or remote services and initiate recovery. Routine Exercises: prevent idle unit starvation and surface malfunctions.
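A heartbeat check can be as simple as remembering when each unit was last heard from and flagging those that have been silent for too long. The sketch below is an illustration under that assumption; the class name and unit names are invented, and the clock is injectable so the example runs without waiting.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per unit and reports units that went silent."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = {}

    def beat(self, unit):
        self._last_seen[unit] = self._clock()

    def silent_units(self):
        now = self._clock()
        return [unit for unit, seen in self._last_seen.items()
                if now - seen > self._timeout]


# A fake clock keeps the example deterministic.
now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=5.0, clock=lambda: now[0])
monitor.beat("session-store")
monitor.beat("user-service")
now[0] = 4.0
monitor.beat("user-service")
now[0] = 7.0
print(monitor.silent_units())  # ['session-store']
```

In a real system the list of silent units would be handed to whatever component initiates recovery.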

35 Endpoint [diagram: Encoder and Decoder stages between Netty Writes and Netty Reads; labels: No Traffic, Event on Idle]

36 Riding over Transients is used to defer error recovery if the error is temporary. "Patience is a virtue to allow the true signature of an error to show itself." (Robert S. Hanmer)
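One simple way to ride over transients is to start recovery only after several consecutive errors, so that a single glitch is ignored. The threshold-based filter below is just one possible reading of the pattern; the class name and the threshold of three are assumptions.

```python
class TransientFilter:
    """Signals recovery only after `threshold` consecutive errors."""

    def __init__(self, threshold):
        self._threshold = threshold
        self._consecutive = 0

    def record(self, ok):
        """Record one observation; return True when recovery should start."""
        if ok:
            self._consecutive = 0  # a success resets the streak
            return False
        self._consecutive += 1
        return self._consecutive >= self._threshold


checks = TransientFilter(threshold=3)
observations = [False, True, False, False, False]  # False means an error
print([checks.record(ok) for ok in observations])
# [False, False, False, False, True]
```

The isolated error at the start is ridden over; only the third error in a row triggers recovery.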

38 And more! Complete Parameter Checking, Watchdogs, Voting, Checksums, Routine Audits

39 Recovery and Mitigation of Errors

40 Timeout: do not wait forever while holding on to the resource.
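A timeout can be sketched with Python's standard `concurrent.futures` module: the caller waits a bounded time for the result and falls back to a default otherwise. The helper name `call_with_timeout` and the fallback value are illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def call_with_timeout(fn, timeout_seconds, fallback):
    """Run fn in a worker thread; give up waiting after timeout_seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        # The caller stops waiting, but the worker thread is not cancelled.
        return fallback
    finally:
        pool.shutdown(wait=False)


def slow_lookup():
    time.sleep(0.3)  # simulates a hung remote call
    return "late answer"


print(call_with_timeout(slow_lookup, timeout_seconds=0.05, fallback="timed out"))
```

The sketch frees the caller but not the worker, so a production version would also need a way to cancel or bound the underlying operation.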

41 Failover to a redundant unit when the error has been detected and isolated. [Redundancy reminder chart with axes Cost and Time To Recover; options shown: Active/Active, Active/Standby, N+M]
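In its simplest form, failover means trying the redundant units in order until one answers. The sketch below assumes that a `ConnectionError` is the signal that a unit has failed; the function and unit names are hypothetical.

```python
def call_with_failover(units, request):
    """Try each redundant unit in order; return the first successful reply."""
    errors = []
    for unit in units:
        try:
            return unit(request)
        except ConnectionError as exc:
            errors.append(exc)  # remember why this unit was skipped
    raise RuntimeError(f"all {len(units)} units failed: {errors}")


def primary(request):
    raise ConnectionError("primary is down")


def standby(request):
    return f"standby handled {request}"


print(call_with_failover([primary, standby], "GET user:42"))
# standby handled GET user:42
```

This corresponds to an active/standby arrangement: the standby only sees traffic once the primary has been found to be unavailable.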

42 Intelligent Retries [chart: Time between Retries versus Number of Attempts for Fixed, Linear, and Exponential strategies]
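The three strategies differ only in how the delay grows with the attempt number. A minimal sketch, with a hypothetical function name and a base delay of one second chosen for illustration:

```python
def retry_delays(strategy, base_seconds, attempts):
    """Delay before each retry for the fixed, linear, or exponential strategy."""
    if strategy == "fixed":
        return [base_seconds for _ in range(attempts)]
    if strategy == "linear":
        return [base_seconds * (n + 1) for n in range(attempts)]
    if strategy == "exponential":
        return [base_seconds * 2 ** n for n in range(attempts)]
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("fixed", "linear", "exponential"):
    print(name, retry_delays(name, base_seconds=1, attempts=4))
# fixed [1, 1, 1, 1]
# linear [1, 2, 3, 4]
# exponential [1, 2, 4, 8]
```

Exponential growth backs off quickly from a struggling service, which is why it is usually combined with an upper bound on the delay.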

43 Restart can be used as a last resort, at the cost of losing state and time.

44 Fail Fast to shed load: it is better to give a great partial service than a bad complete one. [diagram label: Boundary]
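Load shedding at a boundary can be sketched with a non-blocking semaphore: when all slots are taken, a new request is rejected immediately instead of queuing. The class and exception names are invented for this example, and the limit of one in-flight call is chosen only to keep the demonstration short.

```python
import threading


class Overloaded(Exception):
    """Raised when a request is rejected at the boundary."""


class LoadShedder:
    """Admits at most max_in_flight concurrent calls; rejects the rest at once."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_call(self, fn):
        if not self._slots.acquire(blocking=False):
            raise Overloaded("over capacity, failing fast")
        try:
            return fn()
        finally:
            self._slots.release()


shedder = LoadShedder(max_in_flight=1)


def request_during_busy_period():
    # A second request arrives while the only slot is still taken.
    try:
        return shedder.try_call(lambda: "served")
    except Overloaded:
        return "rejected"


print(shedder.try_call(request_during_busy_period))  # rejected
print(shedder.try_call(lambda: "served"))            # served
```

Rejected callers learn right away that they should retry later or degrade, rather than adding to the load while they wait.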

45 Backpressure & Batching!

46 Case Study: Hystrix
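Hystrix is best known for its circuit breaker, which stops calling a failing dependency for a while and serves a fallback instead. The Python sketch below illustrates that general pattern only; it is not the Hystrix API, and the class name, threshold, and reset window are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures and short-circuits calls until a reset window passes."""

    def __init__(self, failure_threshold, reset_seconds, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset = reset_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset:
                return fallback()  # open: do not touch the dependency
            # Reset window elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        self._opened_at = None
        return result
```

A short usage example with a fake clock, so that no real waiting is needed:

```python
now = [0.0]
attempts = []


def flaky():
    attempts.append(now[0])
    raise IOError("dependency unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_seconds=10.0, clock=lambda: now[0])
for _ in range(3):
    breaker.call(flaky, fallback=lambda: "cached value")
print(len(attempts))  # 2, because the third call was short-circuited

now[0] = 11.0
print(breaker.call(lambda: "fresh value", fallback=lambda: "cached value"))  # fresh value
```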

47 And more! Recovery: Rollback, Roll-Forward, Checkpoints, Data Reset. Mitigation: Bounded Queuing, Expansive Controls, Marking Data, Error Correcting Codes.

49 Recommended Reading

50 Patterns for Fault-Tolerant Software by Robert S. Hanmer

51 Release It! by Michael T. Nygard

52 Any Questions?

53 Thank you!
