Anomaly Detection Fault Tolerance Anticipation

1 Anomaly Detection, Fault Tolerance, Anticipation Patterns. John Allspaw, SVP, Tech Ops. QCon London 2012

2 Four Cornerstones (Erik Hollnagel): Anticipation (knowing what to expect), Monitoring (knowing what to look for), Response (knowing what to do), Learning (knowing what has happened)

5 Anomaly Detection

6 Anomaly Detection: Getting at the state of health. Evaluating the state of health. Components AND systems.

7 Supervisory. Example: Active health check. The monitor runs check_http against the component (webserver); the check exits OK.

8 Supervisory (monitor runs check_http against the component (webserver), exit OK). Pros: Easy to implement; easy to understand; well-known pattern. Cons: Messaging can fail; scalability is limited.

9 Supervisor Sensitivity: 1 sec timeout, 1 retry, 3 sec interval. Up to ~2.9s for the previous interval, then 1s (check fails, X), 3s, 1s (retry fails, X): 7.9 sec exposure.
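The 7.9 second figure can be reproduced with a little arithmetic. The sketch below is one reading of the slide, not the speaker's own calculation: it assumes every failing check runs to its full timeout and that the retry is scheduled one full interval after the first failure. The function and parameter names are illustrative.

```python
def worst_case_exposure(timeout_s, retries, interval_s, undetected_gap_s):
    """Worst-case seconds of errors served before the monitor confirms a fault.

    Assumes each failing check waits out its full timeout and that
    retries are spaced one full interval apart (an interpretation of the slide).
    """
    checks = retries + 1                        # the first check plus its retries
    time_in_checks = checks * timeout_s         # every failing check hits the timeout
    time_between_checks = retries * interval_s  # waiting for each retry to be scheduled
    return undetected_gap_s + time_in_checks + time_between_checks


exposure = worst_case_exposure(
    timeout_s=1.0, retries=1, interval_s=3.0, undetected_gap_s=2.9
)
print(f"{exposure:.1f} sec exposure")
```

Under these assumptions, shortening the interval or dropping the retry reduces exposure, at the price of more false positives.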

10 Supervisor Sensitivity: latencies between the monitor and the component (webserver) during check_http (exit OK). Schedule latency (max = N); request latency (max = 0.9s); response latency (max = 0.9s).

11 Supervisor Sensitivity How many seconds of errors can you tolerate serving?

12 Supervisory. Example: Passive health check. On an interval, the component (webserver) reports to the monitor: exit 0, DISK consumption within bounds.

13 Supervisory. Example: Passive health check (on an interval, the component (webserver) reports exit 0, DISK consumption within bounds, to the monitor). Pros: Efficient; scalability is different; fewer moving parts; less exposure; can submit to multiple places. Cons: Nonideal for network-based services; different tuning (windowed expectation).

14 Supervisory. Example: Passive health check. [Timeline figure: Component, Interval, TIME; "??" marks the time between reports]

15 Supervisory. Example: Passive health check. [Timeline figure: Component, Interval, Schedule Latency, TIME; "??" marks the time between reports] Exposure = (Schedule + Interval )*UnknownConsecutiveIntervals+1
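With a passive check, the monitor only learns something when a report arrives, so it needs a window after which silence itself counts as a problem. The sketch below shows one common way to express that window. It is not the Exposure formula from the slide, and the names (`is_stale`, `missed_allowed`) are hypothetical.

```python
def is_stale(last_report_ts, now_ts, interval_s, schedule_latency_s, missed_allowed=1):
    """True if a passive check has been silent for longer than its expected window.

    The window is (interval + schedule latency) for each report we are
    willing to miss before treating silence as a fault.
    """
    window_s = (interval_s + schedule_latency_s) * missed_allowed
    return (now_ts - last_report_ts) > window_s


# A component reporting every 60s, with up to 5s of scheduling slack:
print(is_stale(last_report_ts=100.0, now_ts=160.0, interval_s=60.0, schedule_latency_s=5.0))
print(is_stale(last_report_ts=100.0, now_ts=170.0, interval_s=60.0, schedule_latency_s=5.0))
```

A larger `missed_allowed` tolerates dropped reports but lengthens the time a dead component goes unnoticed.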

16 Frequency and Transience. Probability of false positives: short intervals, low # of retries, short timeouts. Probability of nondetection: long intervals, high # of retries, long timeouts.

17 In-Line. Example: Passive application event logging (application to monitor).

18 Supervisory. Example: Passive application event logging (application to monitor). Pros: On-demand publish. Cons: Onus is on the app; can't be 100% sure it's working.

19 Supervisory. Example: Passive application event logging (application to monitor). Positive events (sales, registrations, etc.). Negative events (errors, exceptions, etc.). Lack or presence of data mean different things, so history is paramount.

20 Context

21 Evaluation: what is abnormal?

22 [Graph: Response Time vs. Time]

23 Static Thresholds [Graph: Response Time vs. Time, with fixed Warning and Critical levels]

27 Static Thresholds

28 Static Thresholds
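A static threshold check compares each sample against fixed Warning and Critical levels, regardless of time of day or recent history. A minimal sketch, with made-up threshold values and a hypothetical function name:

```python
def classify(response_time_s, warning_s, critical_s):
    """Map a response time onto a fixed Warning/Critical scale."""
    if response_time_s >= critical_s:
        return "CRITICAL"
    if response_time_s >= warning_s:
        return "WARNING"
    return "OK"


# The threshold values here are illustrative assumptions, not from the talk.
for sample in (2.0, 6.5, 9.2):
    print(sample, classify(sample, warning_s=6.0, critical_s=9.0))
```

The simplicity is the appeal. The weakness is that the same fixed numbers are applied whether the current value is normal for this hour and day or not.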

29 Context: Normal?

30 Context: 24 hours

31 Context: 7 days

32 Context: Normal But Noisy

33 Context: Smoothing?

34 Context: Holt-Winters Exponential Smoothing. Recent points influence the forecast, with exponentially decreasing influence backwards in time. en.wikipedia.org/wiki/exponential_smoothing

35 Context: "Aberrant Behavior Detection in Time Series for Network Monitoring" full_papers/brutlag/brutlag_html/

36 Dynamic Thresholds

37 Dynamic Thresholds [Graph: raw data with an upper bound and a lower bound]
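The idea of bounds that follow the data can be sketched with plain exponential smoothing. This is a deliberately simplified stand-in: it has no trend or seasonal terms, so it is not Holt-Winters and not the method used by Graphite or RRDtool. The parameter names and the warm-up rule are assumptions made for the example.

```python
def dynamic_bounds(series, alpha=0.3, gamma=0.3, width=3.0, warmup=5):
    """Return (value, lower, upper, is_aberrant) for every point after the first.

    The forecast for a point is the exponentially smoothed level of the
    points before it. The band is forecast +/- width * smoothed absolute error.
    Points seen before `warmup` are never flagged, since the band is still forming.
    """
    level = series[0]   # smoothed estimate of the series
    deviation = 0.0     # smoothed absolute forecast error
    results = []
    for i, value in enumerate(series[1:], start=1):
        lower = level - width * deviation
        upper = level + width * deviation
        aberrant = i >= warmup and not (lower <= value <= upper)
        results.append((value, lower, upper, aberrant))
        # Update after judging the point, so a spike cannot widen its own band.
        deviation = gamma * abs(value - level) + (1 - gamma) * deviation
        level = alpha * value + (1 - alpha) * level
    return results


samples = [10, 11, 9, 10, 11, 10, 9, 10, 30, 10]
flagged = [value for value, _, _, aberrant in dynamic_bounds(samples) if aberrant]
print(flagged)
```

One limitation shows up even in this toy: a large spike inflates the deviation estimate, so the band stays wide for a while afterwards.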

38 Dynamic Thresholds Hrm...

40 Dynamic Thresholds: Holt-Winters Aberration. Ah!

41 Dynamic Thresholds: Graphite metrics collection w/Holt-Winters aberrations. Nagios check for Graphite data.

42 Four Cornerstones (Erik Hollnagel): Anticipation (knowing what to expect), Monitoring (knowing what to look for), Response (knowing what to do), Learning (knowing what has happened)

43 FAULT TOLERANCE

44 Detection of fault X. Triggers corrective action Y. Clean up, report back. (RECOVERY OR MASKING)

45 Variation Tolerance

46 Adaptive Systems Expected Variation

49 [Figure (Woods, 2011): Expected Variation, Disturbance, Control, compensation, decompensation. Annotations: New Disturbances Arise; Compensation is Exhausted]

51 [Figure: Expected Variation, Variation, Disturbance, Fault, Control, compensation, decompensation. Annotations: New Disturbances Arise; Compensation is Exhausted]

52 Variations != Faults

53 Dead, Corrupt, Late, Wrong

54 Fault Tolerance: Redundancy. Spatial (server, network, process). Temporal (checkpoint, rollback). Informational (data in N locations).

56 Spatial Redundancy 2 2

57 Spatial Redundancy Active/Active

58 Spatial Redundancy Active/Passive

59 Spatial Redundancy Roaming Spare Dedicated Spare

60 In-Line Fault Tolerance: The App (PHP, Thrift client) talks over Thrift to Search (Lucene/Solr), with a connect timeout, a send timeout, and a receive timeout.

61 In-Line Fault Tolerance (App to Search (Lucene/Solr), connection fails, X): 1. App attempts connection, can't. 2. Caches APC user object with 60s TTL, key=server:port. 3. Moves to next server in rotation, skipping any found in APC.
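The three steps above can be sketched as a small connection pool. This is an illustration of the pattern in Python, not the Thrift PHP code the talk refers to: an in-process dictionary stands in for the shared APC cache, and `connect` is a hypothetical callable that raises `ConnectionError` on failure.

```python
import time


class FailoverPool:
    """Round-robin over servers, skipping any marked down within the TTL."""

    def __init__(self, servers, connect, ttl_s=60.0, clock=time.monotonic):
        self.servers = list(servers)  # entries look like "host:port"
        self.connect = connect        # hypothetical: raises ConnectionError on failure
        self.ttl_s = ttl_s
        self.clock = clock
        self.down_until = {}          # "host:port" -> expiry time (stands in for APC)
        self.next_index = 0

    def _is_marked_down(self, server):
        expiry = self.down_until.get(server)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self.down_until[server]  # TTL expired: the server is eligible again
            return False
        return True

    def get_connection(self):
        # Try each server at most once per call, starting where we left off.
        for _ in range(len(self.servers)):
            server = self.servers[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.servers)
            if self._is_marked_down(server):
                continue
            try:
                return self.connect(server)
            except ConnectionError:
                # Remember the failure so later requests skip this server for ttl_s.
                self.down_until[server] = self.clock() + self.ttl_s
        raise ConnectionError("no servers available")
```

Because the failure marker expires on its own, a server that comes back is retried without anyone intervening, which is where the auto-recovery property comes from.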

62 In-Line Fault Tolerance /lib/php/src/tsocketpool.php

63 In-Line Fault Tolerance. Pros: Distributed checking and perspective; handles transient failures; auto-recovery. Cons: Onus is on the app for implementation.

64 Fault Tolerance: Nagios Event Handlers. Attempt to recover from specific conditions. Chain together recovery actions. eventhandlers.html

65 If (fault X) then HUP process; re-check
     If (OK) then notify+exit
     ELSE Hard restart process; re-check
       If (OK) then notify+exit
       ELSE Remove from production; notify+exit
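The escalation chain above generalizes to "try each recovery action in order, re-check after each, and stop at the first success". The sketch below is plain Python, not a Nagios event handler script. `check`, `notify`, and the action callables are hypothetical placeholders for whatever actually sends the signal, restarts the process, or pulls the node.

```python
def recover(check, actions, notify):
    """Run recovery actions in order, re-checking after each one.

    check:   callable returning True when the service is OK again
    actions: ordered list of (name, callable); the last entry is the
             final resort and is not followed by a re-check
    notify:  callable taking a message string
    Returns the name of the last action that was applied.
    """
    *attempts, (last_name, last_action) = actions
    for name, action in attempts:
        action()
        if check():
            notify(f"recovered after: {name}")
            return name
    last_action()
    notify(f"gave up, applied: {last_name}")
    return last_name
```

Ordering the list from least to most disruptive keeps the cheap fix first and reserves removal from production for when nothing else worked.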

66 How many seconds of errors can you tolerate serving?

67 Fail Closed: When a fault is found, and can't be recovered or masked, operations cease to protect the rest of the system from damage.

68 Depth and Dependencies: Monitor, Load Balancers, Health check, App, DB

69 Depth and Dependencies (Monitor, Load Balancers, Health check, App, DB). WARNING: Don't be too crazy

70 Fail Closed: Aggregate Cluster Checking (cluster nodes marked X have failed). If (clusterfail > 25%) then notify+exit ELSE OK
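An aggregate check looks at the fraction of failed nodes rather than at any single node. A minimal sketch of the 25% rule; the function name and the shape of the input are assumptions:

```python
def cluster_status(node_ok, max_failed_fraction=0.25):
    """node_ok maps node name -> bool, as gathered from per-node health checks.

    Returns (status, failed_nodes). The cluster is CRITICAL only when the
    failed fraction is strictly greater than the threshold.
    """
    if not node_ok:
        raise ValueError("no nodes to evaluate")
    failed = [name for name, ok in node_ok.items() if not ok]
    fraction = len(failed) / len(node_ok)
    status = "CRITICAL" if fraction > max_failed_fraction else "OK"
    return status, failed


nodes = {f"web{i}": True for i in range(8)}
nodes["web3"] = False
nodes["web5"] = False
print(cluster_status(nodes))  # 2 of 8 failed is exactly 25%, which is not above the threshold
```

Alerting on the aggregate avoids paging for a single node that redundancy already covers.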

71 Fail Open: When a fault happens, and can't be masked or recovered, operations continue without the feature.

72 Fail Open. Example 1 at Etsy: Geo Targeting. 50ms internal SLA on guessing location via client IP. If >50ms, we just don't show local results.
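Failing open on a time budget can be sketched with a future and a timeout. This is not Etsy's implementation: `lookup` is a hypothetical geolocation callable, the thread pool is an assumption, and treating lookup errors the same as a missed SLA is a choice made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool for lookups; the size is an arbitrary choice for this sketch.
_executor = ThreadPoolExecutor(max_workers=4)


def locate_or_none(lookup, client_ip, sla_s=0.050):
    """Return lookup(client_ip), or None if it misses the SLA or fails.

    A None result means the caller renders the page without local results.
    """
    future = _executor.submit(lookup, client_ip)
    try:
        return future.result(timeout=sla_s)
    except FutureTimeout:
        future.cancel()  # best effort: a call that already started keeps running
        return None
    except Exception:
        return None      # assumption: lookup errors also fail open
```

The caller never waits longer than the SLA. The cost is that a slow lookup still occupies a worker until it finishes, so the pool size bounds how many slow calls can pile up.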

73 Fail Open. Example 2 at Etsy: Rate Limiting (App, Memcache). Internal SLA on incrementing counters + checking totals. If >SLA, we let the action continue, and throw a fire-and-forget counter if we can.

74 SYSTEMIC

75 App, Cache, DB, Search, Logging, Queue

77 Functional Resonance

79 Shop Stats

80 Shop Stats: App, Cache, DB, Search, Logging, Queue

81 Registration: App, Cache, DB, Search, Logging, Queue

82 Registration

83 Shop Stats Logins Registrations Checkout New Listings Photos Search API Rate limiting Data Analysis Search A/B analysis Page performance Search Ads Editorial content systems Feedback Messaging/Convos Activity Feeds Circles Shipping Mobile Internationalization Testing Fraud

84 Systemic: Application/Functionality Health. Componential/Resource Health.

85 Four Cornerstones (Erik Hollnagel): Anticipation (knowing what to expect), Monitoring (knowing what to look for), Response (knowing what to do), Learning (knowing what has happened)

86 Anticipation: During design of architecture. During choice of technologies. During design of monitoring and metrics.

87 TRADE-OFFS

88 What could possibly go wrong?

89 REQUISITE IMAGINATION

90 [Figure (Adamski and Westrum, 2003): Possible Situations; Foreseeable Situations; Situations Considered By Novice Designer; Situations Considered By Average Designer; Situations Considered By Expert Designer]

91 Anticipation: Failure Mode Effects Analysis (FMEA). Failure Mode Effects and Criticality Analysis (FMECA). Failure_mode,_effects,_and_criticality_analysis

92 Architectural reviews. Go or No-Go meetings. Game Day exercises.

93 Anticipation: Servers, Networks, Software, Applications, Monitoring, Metrics, Traffic

94 PEOPLE

95 Four Cornerstones: Anticipation (knowing what to expect), Monitoring (knowing what to look for), Response (knowing what to do), Learning (knowing what has happened)

96 THE END
