USENIX / ACM / IFIP 10th International Middleware Conference
How To Keep Your Head Above Water While Detecting Errors
Ignacio Laguna, Fahad A. Arshad, David M. Grothe, Saurabh Bagchi
Dependable Computing Systems Lab
School of Electrical and Computer Engineering, Purdue University
Slide 1/27
Impact of Failures in Internet Services
- Internet services are expected to be running 24/7
- System downtime can cost $1 million / hour (Source: Meta Group, 2002)
- Service degradation is the most frequent problem
  - Service is slower than usual; almost unavailable
  - Can be difficult to detect and diagnose
- Internet-based applications are very large and dynamic
  - Complexity increases as new components are added
Complexity of Internet-based Applications
[Figure: a distributed e-commerce system with clients, web servers, application servers, and database servers]
- Software components: servlets, JavaBeans, EJBs
- Each tier has multiple components
- Components can be stateful
Previous Work: The Monitor Detection System (TDSC 06)
[Figure: the Monitor observing a distributed e-commerce system with clients, web servers, application servers, and database servers]
- Non-intrusive: observes the messages exchanged between components
- Online: faster detection than offline approaches
- Black-box detection: components are treated as black boxes; no knowledge of component internals is required
Previous Work: Stateful Rule-Based Detection
Detection process:
1. A message is captured by the Packet Capturer
2. The Monitor deduces the current application state using a finite state machine (an event-based state-transition model)
3. Normal-behavior rules are matched based on the deduced state
4. If a rule is not satisfied, an alarm is signaled
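The four steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the transitions, the latency rule, and the message fields are hypothetical.

```python
# Stateful rule-based detection sketch: each captured message drives a
# finite state machine, and normal-behavior rules attached to the
# deduced state are matched against the message.

# FSM: (current_state, message_type) -> next_state  (hypothetical)
TRANSITIONS = {
    ("S0", "m1"): "S1",
    ("S1", "m2"): "S2",
    ("S2", "m3"): "S0",
}

# Normal-behavior rules per state: predicates over the captured message.
RULES = {
    "S1": [lambda msg: msg["latency_ms"] < 100],   # hypothetical rule
}

def process(state, msg):
    """Deduce the next state and match its rules; return (state, alarms)."""
    nxt = TRANSITIONS.get((state, msg["type"]))
    if nxt is None:
        # No valid transition: the message itself violates the model.
        return state, [f"unexpected message {msg['type']} in {state}"]
    alarms = [f"rule violated in {nxt}"
              for rule in RULES.get(nxt, []) if not rule(msg)]
    return nxt, alarms

# A slow m1 call moves the FSM to S1 and trips the latency rule there.
process("S0", {"type": "m1", "latency_ms": 250})
```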
Previous Work: The Breaking Point at High Message Rates
[Figure: detection latency (msec) vs. captured packets/second (incoming message rate at the Monitor); latency rises sharply past a breaking point around 300 packets/second]
- After the breaking point, detection latency increases sharply
- The true-alarm rate decreases because packets are dropped
- A breaking point is expected in any stateful detection system
Previous Work: Avoiding the Breaking Point with Random Sampling (SRDS 07)
- Processing load at the Monitor: δ = R · C, where R is the incoming message rate and C is the processing cost per message
- The load δ is reduced by reducing R: only a portion of incoming messages is processed
- n out of every m messages are randomly sampled
- Sampling is activated when R exceeds the breaking point
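A minimal sketch of this load-shedding policy (the breaking-point value and function names are illustrative, not from the paper):

```python
import random

BREAKING_POINT = 300.0   # msgs/s, hypothetical threshold

def should_process(rate, n=1, m=3):
    """Below the breaking point, process every message; above it,
    randomly keep roughly n out of every m incoming messages."""
    if rate <= BREAKING_POINT:
        return True
    return random.random() < n / m

# At a low rate everything is processed; at a high rate only ~1/3 is.
kept = sum(should_process(400.0) for _ in range(10000))
```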
Previous Work: The Non-Determinism Caused by Sampling
Example, on a portion of a finite state machine, of the events seen by the Monitor:
1. SV = { S0 }
2. A message is dropped
3. SV = { S1, S2 }
4. A message is sampled
5. The message is m5
6. SV = { S5 }
Definitions:
- State Vector (SV): the state(s) the application may be in from the Monitor's point of view (the deduced states)
- Non-determinism: because of dropped messages, the Monitor is no longer aware of the exact state the application is in
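The SV bookkeeping in the six steps above can be sketched as follows; the FSM transitions are hypothetical but mirror the slide's example.

```python
# On a dropped message the Monitor must expand the SV to every state
# reachable by one (unseen) transition; on a sampled message it keeps
# only the states consistent with that message.

TRANSITIONS = {
    ("S0", "m1"): "S1", ("S0", "m2"): "S2",
    ("S1", "m3"): "S3", ("S1", "m4"): "S4",
    ("S2", "m5"): "S5",
}

def on_drop(sv):
    """A message was dropped: any one-step successor is now possible."""
    return {nxt for (s, _), nxt in TRANSITIONS.items() if s in sv}

def on_message(sv, msg):
    """A message was sampled: keep only states it can lead to."""
    return {TRANSITIONS[(s, msg)] for s in sv if (s, msg) in TRANSITIONS}

sv = {"S0"}
sv = on_drop(sv)            # SV grows: non-determinism appears
sv = on_message(sv, "m5")   # sampling m5 collapses the SV again
```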
Current Work: Remaining Agenda
I. Addressing the problem of non-determinism
   A. Intelligent Sampling
   B. Hidden Markov Model (HMM) for state determination
II. Experimental Test-bed
III. Performance Results
IV. Efficient Rule Matching and Triggering
Challenges with Non-Determinism
Non-determinism is addressed in two ways:
- Intelligent Sampling: decreases detection latency (a large SV increases rule-matching time)
- Hidden Markov Model (SV reduction): decreases false alarms (by reducing incorrect states in the SV) and increases true alarms (by reducing the effect of incorrect messages)
Intelligent Sampling
[Figure: with random sampling, sampled messages can expand the State Vector to many states (e.g., { S1, S2, S3, S4, S5, S6 }); with intelligent sampling, only messages with a desirable property are sampled, keeping the State Vector small (e.g., { S1, S3, S4 })]
What Is a Desirable Property of a Message?
Suppose SV = { S1, S2, S3 } and the sampling rate is 1/3. In the FSM fragment, m1 moves S1, S2, S3 to S4, S5, S6; m2 leads to S7 or S8; m3 always leads to S9.

  Sampled Message | Resulting SV   | Discriminative Size
  m1              | { S4, S5, S6 } | 3
  m2              | { S7, S8 }     | 2
  m3              | { S9 }         | 1

- Discriminative size: the number of transitions to different states in which a message appears in the FSM
- The desirable property is a small discriminative size
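A sketch of choosing the message with the smallest discriminative size. The FSM fragment is reconstructed to reproduce the slide's table (in particular, the exact m2 transitions are an assumption chosen so that m2 leads to two distinct states).

```python
# Hypothetical FSM fragment matching the slide's example.
TRANSITIONS = {
    ("S1", "m1"): "S4", ("S2", "m1"): "S5", ("S3", "m1"): "S6",
    ("S1", "m2"): "S7", ("S2", "m2"): "S7", ("S3", "m2"): "S8",
    ("S1", "m3"): "S9", ("S2", "m3"): "S9", ("S3", "m3"): "S9",
}

def disc_size(sv, msg):
    """Number of distinct states msg can transition to from the SV."""
    return len({TRANSITIONS[(s, msg)] for s in sv if (s, msg) in TRANSITIONS})

def pick_message(sv, candidates):
    """Intelligent sampling: prefer the candidate with the smallest
    discriminative size, keeping the resulting SV as small as possible."""
    return min(candidates, key=lambda m: disc_size(sv, m))

pick_message({"S1", "S2", "S3"}, ["m1", "m2", "m3"])   # prefers m3
```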
Benefits of Intelligent Sampling
- Random sampling: the SV can grow large, with multiple incorrect states, increasing false alarms
- Intelligent sampling: the SV is kept small, reducing detection latency; fewer incorrect states in the SV, reducing false alarms
The Problem of Sampling an Incorrect Message
- What if an incorrect message is sampled? A message can be incorrect in the current states, e.g., a message from a buggy component
[Figure: FSM fragment with states S0-S5 and messages m1-m5]
- Example: suppose SV = { S1, S2 } and m3 is corrupted into m5; then SV = { S5 }, an incorrect SV (it should be { S3 })
Probabilistic State Vector Reduction: A Hidden Markov Model Approach
- A Hidden Markov Model (HMM) is used to reduce the SV
- An HMM is an extended Markov model in which states are not directly observable
- States are hidden, as in the monitored application
- Given an HMM, we can ask: what is the probability of the application being in any state, given a sequence of messages?
- The cost is O(N³L), where N is the number of states and L is the sequence length
- The HMM is trained offline with application traces
State Vector Reduction with the HMM
- The Monitor asks the HMM for { p1, p2, ..., pN }, where pi = P(Si | O), Si is application state i, and O is the observation sequence (the messages)
- The probabilities are sorted, and the top-k probabilities form the set α
- The states in α become the new SV
- The new SV is robust to incorrect messages
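A minimal sketch of this reduction, assuming a toy two-state HMM with illustrative parameters: the standard forward recursion yields the posterior over the current state given the observed messages, and the top-k states form the new SV.

```python
# Toy HMM: 2 hidden states, 2 observation symbols (values illustrative).
A  = [[0.7, 0.3], [0.4, 0.6]]    # state-transition probabilities
B  = [[0.9, 0.1], [0.2, 0.8]]    # P(observation | state)
pi = [0.5, 0.5]                  # initial state distribution

def posterior(obs):
    """Forward algorithm: P(state at time T | o_1..o_T)."""
    alpha = [pi[s] * B[s][obs[0]] for s in range(len(pi))]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(A))) * B[j][o]
                 for j in range(len(A))]
    z = sum(alpha)
    return [a / z for a in alpha]

def reduce_sv(obs, k=1):
    """New SV = the k states with the highest posterior probability."""
    p = posterior(obs)
    return set(sorted(range(len(p)), key=lambda s: -p[s])[:k])

reduce_sv([0, 0, 1], k=1)   # the single most likely hidden state
```

Because the posterior weighs the whole message sequence, a single corrupted message is unlikely to dominate the resulting SV, which is what makes the pruned SV robust to incorrect messages.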
Experimental Testbed: Java Duke's Bank Application
- Simulates a multi-tier online banking system
- User transactions: access account information, transfer money, etc.
- The application is stressed with different workloads
- The incoming message rate at the Monitor varies with the user load
Web Interaction: A Sequence of Calls and Returns
[Figure: a web interaction spanning the tiers of the distributed e-commerce system: clients, web servers, application servers, database servers]
- Components: Java servlets, JavaBeans, EJBs
Error Injection Types

  Error Type                  | Description
  Response delays             | a response delay in a method call
  Null calls                  | a call to a component that is never executed
  Unhandled exceptions        | an exception thrown during execution that is never caught by the program
  Incorrect message sequences | the web-interaction structure is changed randomly

- Errors are injected in components touched by web interactions
- A web interaction is faulty if at least one of its components is faulty
Performance Metrics Used in Experiments
- Accuracy (true alarms): % of true detections out of the web interactions where errors were injected
- Precision (false alarms): % of true detections out of the total number of detections
- Detection latency: the time elapsed between the error injection and its detection
Example: over 5 web interactions, errors are injected in 3 and alarms are signaled in 4, of which 2 are true detections:
  Accuracy = 2/3 ≈ 0.67
  Precision = 2/4 = 0.5
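The example's arithmetic can be checked directly. The specific interaction numbers below are illustrative, chosen only so the counts match the slide (3 injections, 4 alarms, 2 true detections):

```python
injected = {1, 2, 3}          # interactions with an injected error
alarms   = {2, 3, 4, 5}       # interactions where an alarm was signaled

true_detections = injected & alarms          # alarms on injected errors

accuracy  = len(true_detections) / len(injected)   # 2/3 ≈ 0.67
precision = len(true_detections) / len(alarms)     # 2/4 = 0.5
```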
Results: State Vector Reduction
[Figure: State Vector size over discrete time for random vs. intelligent sampling, plus a CDF of the pruned State Vector size]
- The peaks seen under random sampling are not observed under intelligent sampling (IS), due to IS's ability to select messages with a small discriminative size
- An SV of size 1 is more frequent under IS
Results: Monitor vs. Pinpoint (Accuracy, Precision)
- Pinpoint (NSDI 04) traces paths through multiple components and uses a PCFG to detect abnormal paths
[Figure: accuracy and precision vs. number of concurrent users (4-24) for Monitor and Pinpoint-PCFG]
- Monitor and Pinpoint show similar levels of accuracy
- Precision in Monitor (1.0) is higher than in Pinpoint (0.9)
Results: Monitor vs. Pinpoint (Detection Latency)
[Figure: detection latency vs. concurrent users (4-24); Monitor's latency is in msec, Pinpoint-PCFG's is in seconds]
- Detection latency in Monitor is on the order of milliseconds, while in Pinpoint it is on the order of seconds
- The PCFG has high space and time complexity
Results: Memory Consumption (MB)

  System        | Virtual Memory | RAM
  Monitor       | 282.27         | 25.53
  Pinpoint-PCFG | 933.56         | 696.06

- Monitor doesn't rely on large data structures
- The PCFG in Pinpoint has high space complexity, O(RL²), and time complexity, O(L³), where R is the number of rules in the grammar and L is the size of a web interaction
- Pinpoint thrashes due to its high memory requirements
Efficient Rule Matching
[Flowchart: for each rule, if it is expensive to match, check whether the system is unstable; match the rule only if so, otherwise move on to the next rule]
- Computationally expensive rules are matched selectively
- Expensive rules don't have to be matched all the time; they are matched only when instability is present
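A sketch of this selective triggering, with an assumed cheap instability indicator and threshold (the indicator, the threshold value, and all names are illustrative, not the paper's definitions):

```python
SIGMA = 0.5   # instability threshold, hypothetical

def instability(samples):
    """Cheap indicator: normalized spread of recent resource samples."""
    return (max(samples) - min(samples)) / max(samples)

def match_rules(samples, expensive_rule):
    """Run the expensive rule only when the system looks unstable;
    otherwise skip it and save its matching cost."""
    if instability(samples) >= SIGMA:
        return expensive_rule(samples)   # e.g., an ARIMA-based check
    return False                         # stable: costly rule skipped

# A sudden jump in memory usage triggers the expensive check.
leaky = [100, 140, 900, 950]             # memory samples, hypothetical
match_rules(leaky, lambda s: s[-1] > 2 * s[0])
```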
Efficient Rule Matching Example: Detecting a Memory Leak
- Goal: efficiently detect a memory leak in the Apache web server
- The memory leak is injected probabilistically with web requests
- An expensive ARIMA-based rule detects abnormal memory usage
- The average matching latency is reduced:

  Rule Matching Criteria | Memory Leak Detected | Average Matching Latency (msec)
  Always matched         | yes                  | 19.283
  σ = 0.5                | yes                  | 7.115
  σ = 1.0                | no                   | 1.25
Contributions: Concluding Remarks
- Sampling is used to scale a stateful detection system under a high rate of messages
- Intelligent Sampling reduces the non-determinism caused by sampling
- The HMM approach handles incorrect messages
- These techniques can be applied to any stateful detection system
- Monitor performs better than other approaches
Future work:
- The Efficient Rule Matching technique will be extended
- Sampling only sequences of messages that lead to errors
- Automatic generation of rules from traces