How To Keep Your Head Above Water While Detecting Errors

Similar documents
How To Keep Your Head Above Water While Detecting Errors

Distributed Diagnosis of Failures in a Three Tier E-Commerce System. Motivation

Stateful Detection in High Throughput Distributed Systems

Non Intrusive Detection & Diagnosis of Failures in High Throughput Distributed Applications. DCSL Research Thrusts

Large Scale Debugging of Parallel Tasks with AutomaDeD!

NON INTRUSIVE DETECTION AND DIAGNOSIS OF FAILURES IN HIGH THROUGHPUT DISTRIBUTED APPLICATIONS. Research Initiatives

Performance Debugging for Distributed Systems of Black Boxes. Yinghua Wu, Haiyong Xie

Distributed Diagnosis of Failures in a Three Tier E- Commerce System

Agenda. Cassandra Internal Read/Write path.

Probabilistic Diagnosis of Performance Faults in Large-Scale Parallel Applications

Using Hidden Semi-Markov Models for Effective Online Failure Prediction

Self Checking Network Protocols: A Monitor Based Approach

Introduction to SLAM Part II. Paul Robertson

PARCEL: Proxy Assisted BRowsing in Cellular networks for Energy and Latency reduction

Path-based Macroanalysis for Large Distributed Systems

Reliability and Dependability in Computer Networks. CS 552 Computer Networks Side Credits: A. Tjang, W. Sanders

Multimedia Streaming. Mike Zink

Monitoring Standards for the Producers of Web Services Alexander Quang Truong

Resource Usage Monitoring for Web Systems Using Real-time Statistical Analysis of Log Data

Collaborative Intrusion Detection System : A Framework for Accurate and Efficient IDS. Outline

Connecting the Dots: Using Runtime Paths for Macro Analysis

SOFT 437. Software Performance Analysis. Ch 7&8:Software Measurement and Instrumentation

Process Scheduling. Copyright : University of Illinois CS 241 Staff

A Predictable RTOS. Mantis Cheng Department of Computer Science University of Victoria

An Online Performance Anomaly Detector in Cluster File Systems

Dynamic Flow Regulation for IP Integration on Network-on-Chip

Mace, J. et al., "2DFQ: Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services," Proc. of ACM SIGCOMM '16, 46(4): , Aug

Exploring Econometric Model Selection Using Sensitivity Analysis

Maintaining Frequent Itemsets over High-Speed Data Streams

Agenda. CS 61C: Great Ideas in Computer Architecture. Virtual Memory II. Goals of Virtual Memory. Memory Hierarchy Requirements

Distributed Energy-Aware Routing Protocol

Stateful Detection in High Throughput Distributed Systems

Application of SDN: Load Balancing & Traffic Engineering

Automated Control for Elastic Storage

IBM Education Assistance for z/os V2R2

Sleep/Wake Aware Local Monitoring (SLAM)

Improving Data Access of J2EE Applications by Exploiting Asynchronous Messaging and Caching Services

Stateful Detection in High Throughput Distributed Systems

Collective Entity Resolution in Relational Data

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Specification-based Intrusion Detection. Michael May CIS-700 Fall 2004

Tuning. Philipp Koehn presented by Gaurav Kumar. 28 September 2017

Topology and fmri Data

Network Coordinates in the Wild

ECE 669 Parallel Computer Architecture

Network Load Balancing Methods: Experimental Comparisons and Improvement

Time. COS 418: Distributed Systems Lecture 3. Wyatt Lloyd

DISCRETE TIME ADAPTIVE LINEAR CONTROL

DDR4 Debug and Protocol Validation Challenges. Presented by: Jennie Grosslight Memory Product Manager Agilent Technologies

A Comparison of File. D. Roselli, J. R. Lorch, T. E. Anderson Proc USENIX Annual Technical Conference

LCCI (Large-scale Complex Critical Infrastructures)

Assignment 4 CSE 517: Natural Language Processing

VM Migration, Containers (Lecture 12, cs262a)

SUMMARY: MODEL DRIVEN SECURITY

Diagnostics in Testing and Performance Engineering

Epidemic Routing for Partially-Connected Ad Hoc Networks

Overview of Data Exploration Techniques. Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri

Enterprise Java Unit 1-Chapter 2 Prof. Sujata Rizal Java EE 6 Architecture, Server and Containers

ENVI: Elastic resource flexing for Network functions VIrtualization

Tools for Social Networking Infrastructures

Stochastic Analysis of Horizontal IP Scanning

INTRODUCTION TO INTEGRATION STYLES

COMP 273 Winter physical vs. virtual mem Mar. 15, 2012

J2EE Instrumentation for software aging root cause application component determination with AspectJ

Highway Dimension and Provably Efficient Shortest Paths Algorithms

Strategy Pattern. What is it?

Lecture 6: Network Analysis: Process Part 1

Deep-Q: Traffic-driven QoS Inference using Deep Generative Network

Orion+: Automated Problem Diagnosis in Computing Systems by Mining Metric Data

High Availability Distributed (Micro-)services. Clemens Vasters Microsoft

Monitoring Interfaces for Faults

New Directions in Traffic Measurement and Accounting. Need for traffic measurement. Relation to stream databases. Internet backbone monitoring

Slides 11: Verification and Validation Models

Agent-Enabling Transformation of E-Commerce Portals with Web Services

Response Time and Throughput

Presented by: Nafiseh Mahmoudi Spring 2017

J2EE DIAGNOSING J2EE PERFORMANCE PROBLEMS THROUGHOUT THE APPLICATION LIFECYCLE

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Computational complexity

Fault Management using Passive Testing formobileipv6networks

CORGIDS: A Correlation-based Generic Intrusion Detection System

Sensors for Changing Enterprise Systems

Transparent Throughput Elas0city for IaaS Cloud Storage Using Guest- Side Block- Level Caching

Adaptive SMT Control for More Responsive Web Applications

Introduction to Modeling. Lecture Overview

Making Session Stores More Intelligent KYLE J. DAVIS TECHNICAL MARKETING MANAGER REDIS LABS

The Onion Routing Performance using Shadowplugin-TOR

STAFF: State Transition Applied Fast Flash Translation Layer

Optimizing Tiered Storage Workloads with Precise for Storage Tiering

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

Providing Automated Detection of Problems in Virtualized Servers using Monitor framework

Machine Learning Techniques for Data Mining

CS 61C: Great Ideas in Computer Architecture. Virtual Memory III. Instructor: Dan Garcia

Computations with Bounded Errors and Response Times on Very Large Data

COCONUT: Seamless Scale-out of Network Elements

CQNCR: Optimal VM Migration Planning in Cloud Data Centers

Routing Protocols in MANETs

Memory Management! How the hardware and OS give application pgms:" The illusion of a large contiguous address space" Protection against each other"

Resource Availability Prediction in Fine-Grained Cycle Sharing Systems

Java Without the Jitter

Transcription:

USENIX / ACM / IFIP 10th International Middleware Conference How To Keep Your Head Above Water While Detecting Errors Ignacio Laguna, Fahad A. Arshad, David M. Grothe, Saurabh Bagchi Dependable Computing System Lab School of Electrical and Computer Engineering Purdue University Slide 1/27

Impact of Failures in Internet Services Internet services are expected to be running 24/7 System downtime can cost $1 million / hour (Source: Meta Group, 2002) Service degradation is the most frequent problem Service is slower than usual; almost unavailable Can be difficult to detect and diagnose Internet-based applications are very large and dynamic Complexity increases as new components are added Slide 2/27

Complexity of Internet-based Applications Software Components Servlets, JavaBeans, EJBs Clients Distributed E-commerce System Web Servers Application Servers Database Servers Each tier has multiple components Components can be stateful Slide 3/27

P R E V I O U S W O R K The Monitor Detection System (TDSC 06) Clients Distributed E-commerce System Web Servers Application Servers Database Servers Monitor Observed System Non-intrusive observe messages between components Online faster detection than offline approaches Black-box detection components treated as black boxes No knowledge of components internals Slide 4/27

P R E V I O U S W O R K Stateful Rule-based Detection Packet Capturer Monitor Finite State Machine (event-based state transition model) Normal- Behavior Rules A message is captured Deduce current application state Dt Detection ti Process Match rules based on the deduced state If a rule does not satisfy, signal alarm 1 2 3 4 Slide 5/27

P R E V I O U S W O R K The Breaking Point in High Rate of Messages De etection La atency (msec ) Breaking point 1200 900 600 300 0 0 100 200 300 400 Captured Packets / second (Incoming msg rate at Monitor) After breaking point, latency increases sharply True alarms rate decreases because packets are dropped Breaking point expected in any stateful detection system Slide 6/27

P R E V I O U S W O R K Avoiding the Breaking Point: Random Sampling (SRDS 07) Processing Load in Monitor δ = R C R: Incoming message rate, C: Processing cost per message Processing Load δ is reduced by reducing R Only a portion of incoming messages is processed n out of m messages are randomly sampled Sampling is activated if R breaking point Slide 7/27

P R E V I O U S W O R K The Non-Determinism Caused by Sampling A portion of a Finite State Machine Events in Monitor m3 (1) SV = { S0 } S3 (2) A message is dropped m1 S1 m4 (3) SV = S1, S2 S0 () {, } S4 m2 (4) A message is sampled S2 m5 (5) The message is m5 S5 (6) SV = { S5 } Definitions: State Vector (SV) The state(s) of the application from Monitor s point of view (deduced state(s)) Non-Determinism Monitor is no longer aware of the exact state the application is in (because of dropped messages) Slide 8/27

C U R R E N T W O R K Remaining Agenda I. Addressing the problem of non-determinism A. Intelligent Sampling B. Hidden Markov Model (HMM) for state determination II. Experimental Test-bed III. Performance Results IV. Efficient Rule Matching and Triggering Slide 9/27

Challenges with Non-Determinism Intelligent Sampling Decrease Detection Latency (Large SV increases rule matching time) Non-Determinism Hidden Markov Model SV reduction Decrease False Alarms (Reduce incorrect states in SV) Increase True Alarms (Reduce effect of incorrect messages) Slide 10/27

Intelligent Sampling State Vector Incoming Messages Random Finite State S1 Sampling Machine S6 S5 S4 Sampled Messages S2 S3 Intelligent Sampling Finite State Machine State Vector S1 S3 S4 Incoming Messages Sampled Messages Message with desirable property Slide 11/27

What is a Desirable Property of a Message? Suppose, SV ={ S1, S2, S3 }, Sampling Rate = 1/3 S1 S2 S3 A portion of a Finite State Machine m1 m1 m1 S4 S5 S6 m2 m2 m2 S7 S8 m3 m3 S9 Sampled Message SV Discriminative Size m1 {S4, S5, S6} 3 m2 { S7, S8 } 2 m3 { S9 } 1 Discriminative Size Number of times a message appears in transitions to different states in the FSM Desirable property is a small discriminative size Slide 12/27

Benefits of Intelligent Sampling Random Sampling SV can grow into large size Multiple l incorrect states t Increase of false alarms Intelligent Sampling SV is kept small Detection latency reduction Less incorrect states t in SV False alarms reduction Slide 13/27

The Problem of Sampling an Incorrect Message What if an incorrect message is sampled? The message is incorrect in current states, e.g., a message from buggy component m1 S1 m3 m4 S0 m2 S2 m5 S3 S4 S5 Suppose SV = { S1, S2 } and m3 is changed to m5 SV = { S5 } (incorrect SV!, it should be { S3 }) Slide 14/27

Probabilistic State Vector Reduction: A Hidden Markov Model Approach Hidden Markov Model (HMM) used to reduce SV An HMMi is an extended dmarkov Model where states t are not observable States are hidden as in the monitored application Given an HMM, we can ask: The probability of the application being in any state, given a sequence of messages? Cost is O(N 3 L), N: number of states L: sequence length The HMM is trained (offline) with application traces Slide 15/27

State Vector Reduction with the HMM Monitor asks the HMM: {p 1, p 2,, p N } p i = P(S i O), S i : application state i, O: observation s sequence Messages (observations) HMM p 1, p 2,, p N Sort by Probabilities p 3, p 5,, p 1 α = Top k probabilities α SV = new SV New SV is robust to incorrect messages Slide 16/27

Experimental Testbed: Java Duke s Bank Application Simulates multi-tier online banking system User transactions: Access account information, transfer money, etc. Application stressed with different workloads Incoming message rate at Monitor varies with user load Slide 17/27

Web Interaction: A Sequence of calls and returns Clients Distributed E-commerce System Web Servers Application Servers Database Servers Components: Java servlets, JavaBeans, EJBs Slide 18/27

Error Injection Types Error Type Response Delays Null calls Description a response delay in a method call a call to a component that is never executed Unhandled Exceptions Incorrect Message Sequences exception thrown by execution that is never caught by the program change randomly the web interaction structure Errors are injected in components touched by web interactions A web interaction is faulty if at least one of its components is faulty Slide 19/27

Performance Metrics Used in Experiments Accuracy (True Alarms) % of true detections out of web interactions where errors were injected Precision i % of true detections out of the total number of fdetections (False Alarms) Detection time elapsed between the error injection and its detection Latency Web Interactions 1 2 3 4 5 Example: Error Injected X X X Detection (An alarm is signaled) X X X X Accuracy = 2/3 = 067 0.67 Precision = 2/4 = 05 0.5 Slide 20/27

State Vector Size Stat te Vector Siz ze 30 20 10 0 30 20 10 0 Results: State Vector Reduction Random Sampling 10 20 30 40 50 60 70 80 90 100 Discrete Time Intelligent Sampling 10 20 30 40 50 60 70 80 90 100 Discrete Time CDF 1 08 0.8 0.6 04 0.4 0.2 Random Sampling Intelligent Sampling 0 0 2 4 6 8 Pruned State Vector Size Peaks are not observed in intelligent sampling (IS) IS capability of selecting messages with small discriminative size SV of size 1 is more frequent in IS Slide 21/27

Results: Monitor vs. Pinpoint (Accuracy, Precision) Pinpoint (NSDI 04), traces paths though multiple components Use of PCFG to detect abnormal paths Accuracy 1 08 0.8 0.6 0.4 0.2 0 Monitor Pinpoint-PCFG 4 8 12 16 20 24 Concurrent Users Precision 1 08 0.8 0.6 0.4 0.2 0 Monitor Pinpoint-PCFG 4 8 12 16 20 24 Concurrent Users Monitor and Pinpoint expose similar levels of accuracy Precision in Monitor (1.0) is higher than in Pinpoint (0.9) Slide 22/27

Results: Monitor vs. Pinpoint (Detection Latency) Latency (m msec) Detection 200 150 100 50 0 Monitor 4 8 12 16 20 24 Concurrent Users Detection Latency (sec) 200 150 100 50 0 Pinpoint-PCFG 4 8 12 16 20 24 Concurrent Users Detection latency in Monitor is in the order of milliseconds, while in Pinpoint is in seconds The PCFG has a high space and time complexity Slide 23/27

Results: Memory Consumption (MB) Virtual Memory RAM Monitor 282.27 25.53 Pinpoint-PCFG 933.56 696.06 Monitor doesn t rely in large data structures PCFG in Pinpoint has high space and time complexity O(RL 2 ) and O(L 3 ) R: number of rules in the grammar L: size of a web interaction Pinpoint thrashes due to high memory requirements Slide 24/27

Efficient Rule Matching. No Expensive to match? Yes System unstable? Yes Match. Next rule No Selectively ect match computationally o a expensive rules Expensive rules don t have to be matched all the time Rules are matched only if instability is present Slide 25/27

Efficient Rule Matching Example: Detecting Memory Leak Efficiently detect memory leak in Apache web server Memory leak injected probabilistically bili ll with web requests Expensive ARIMA-based rule to detect abnormal memory usage Average matching latency reduced Rule Matching Criteria Memory Leak Detected Average Matching Latency (msec.) Always matched yes 19.283 σ 0.5 yes 7.115 σ 1.0 no 1.25 Slide 26/27

Contributions: Concluding Remarks Sampling used to scale stateful detection system under high- rate of messages Intelligent Sampling reduces non-determinism caused by sampling HMM-approach handles incorrect messages Techniques can be applied to any stateful detection system Monitor performs better than other approaches Future Work: Efficient Rule Matching technique will be extended Sampling only sequences of messages that lead to errors Automatic generation of rules from traces Slide 27/27