Distributed Diagnosis of Failures in a Three Tier E-Commerce System. Motivation

Transcription:

Distributed Diagnosis of Failures in a Three Tier E-Commerce System
Gunjan Khanna, Ignacio Laguna, Fahad A. Arshad, and Saurabh Bagchi
Dependable Computing Systems Lab (DCSL), School of Electrical and Computer Engineering, Purdue University
Slide 1

Motivation
The connected society of today has come to rely heavily on distributed computer infrastructure: ATM networks, airline reservation systems, distributed Internet services. Reliance on Internet services supported by multi-tier applications has increased, and the cost of downtime of these systems can run into several millions of dollars; financial brokerage firms have a downtime cost of $6.2M/hr (Source: IDC, 2005). We need a scalable, real-time diagnosis system for determining the root cause of a failure: identify the faulty elements so that fast recovery is possible (low MTTR), giving high availability.
Slide 2

Challenges in Diagnosis for E-Commerce Systems
Diagnosis: a protocol to determine the component that originated the failure. There is complex interaction among components in three-tier applications; components belong to one of three layers: web, application tier, and database. Not all interactions can be enumerated a priori. There is a large number of concurrent users and the rate of interactions is high. Errors can remain undetected for an arbitrary length of time; an error can propagate from one component to another and finally manifest itself as a failure. The error propagation rate is high due to the high rate of interactions. [SRDS 04, TDSC 06, TDSC 07]
Slide 3

Solution Approach: Monitor
[Figure: the observed system (users, web components, application components, database) and the observer system (the Monitor).]
Non-intrusive: observe messages exchanged and estimate application state transitions. Anomaly based: match against a rule base. Online: fast detection and diagnosis. Treat application entities (or components) as black boxes.
Slide 4

Tracking Causal Dependencies
Messages between the components are used to build a Causal Graph (CG). Vertices are components, and edges are interactions between components. Intuition: a message interaction can be used to propagate an error.
[Figure: components A, B, C, D interacting by message passing; a table maps message IDs (L1, L2, L3, ...) to logical times.]
Slide 5

Aggregate Graph
We bound the size of the Causal Graph. An unbounded CG would lead to long delays in traversing the CG during diagnosis, and hence to high latency. However, complete purging of the information in the CG can cause inaccuracies during the diagnosis process. We therefore aggregate the state information in the Causal Graph at specified time points and store it in an Aggregate Graph (AG). The AG contains aggregate information about the protocol behavior averaged over the past. The AG is similar to the CG in structure (a node represents a component and a link represents a communication channel), and it maintains historical information about the failure behavior of components and links.
Slide 6
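The Causal Graph of Slide 5 can be sketched as a small data structure; the class name, the logical-clock scheme, and the edge representation below are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

class CausalGraph:
    """Minimal sketch of the Causal Graph (CG): vertices are components,
    and each directed edge records that a sender passed a message to a
    receiver at some logical time."""

    def __init__(self):
        # receiver -> list of (sender, logical_time) interactions
        self.edges = defaultdict(list)
        self.clock = 0

    def observe_message(self, sender, receiver):
        # The Monitor sees each message non-intrusively and records the
        # interaction with a monotonically increasing logical time.
        self.clock += 1
        self.edges[receiver].append((sender, self.clock))

    def senders_to(self, component):
        # Which components have interacted with `component`? This is the
        # question the diagnosis algorithm asks when a failure is detected.
        return {s for s, _ in self.edges[component]}

cg = CausalGraph()
cg.observe_message("A", "B")
cg.observe_message("B", "C")
cg.observe_message("C", "D")
print(cg.senders_to("D"))  # {'C'}
```

Bounding the CG (the motivation for the Aggregate Graph) would amount to periodically summarizing and pruning `edges` rather than letting the interaction lists grow without limit.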

Diagnosis in the Presence of Failures
[Figure: a failure is detected in component D; the Monitor asks which components have interacted with D previously, and the diagnosis algorithm is triggered.]
When a failure is detected in a node n_f, the Monitor checks which other components have interacted with n_f in the CG. The search space is bounded due to the bounded nature of the CG. The diagnosis algorithm is then triggered.
Slide 7

Diagnosis Tree
A tree rooted at n_f (where the failure is detected) is built from the CG: the Diagnosis Tree. Nodes which have directly sent a message to n_f are at depth 1; nodes which have directly sent a message to a node at depth h are at depth h+1.
[Figure: error detected in D; the Diagnosis Tree rooted at D, with message IDs labeling the edges.]
Slide 8
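Diagnosis Tree construction is a reverse breadth-first traversal from the failed node over the Causal Graph. The sketch below assumes the CG is available as a mapping from each receiver to the set of components that sent it messages; that shape, and the function name, are assumptions for illustration.

```python
from collections import deque

def build_diagnosis_tree(cg_edges, failed):
    """Build the Diagnosis Tree rooted at the failed node n_f:
    components that sent a message directly to n_f land at depth 1,
    senders to depth-h nodes land at depth h+1."""
    depth = {failed: 0}
    parent = {failed: None}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for sender in cg_edges.get(node, set()):
            if sender not in depth:  # visit each component once
                depth[sender] = depth[node] + 1
                parent[sender] = node
                queue.append(sender)
    return depth, parent

# Toy CG: A sent to B, B sent to C, C sent to D; failure detected in D.
edges = {"D": {"C"}, "C": {"B"}, "B": {"A"}}
depth, parent = build_diagnosis_tree(edges, "D")
print(depth)  # {'D': 0, 'C': 1, 'B': 2, 'A': 3}
```

The traversal is bounded because the CG itself is bounded, which is what keeps the diagnosis latency low.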

Diagnostic Tests
Diagnostic tests are specific to the component and its state, where the state is as deduced by the Monitor. Example: after receiving 1K of data, did the component send an acknowledgement? Example: after receiving a database update due to a transaction, did the update complete within 5 seconds? There is no active probing of the components with tests: the state of the component may have changed during the error propagation, and probing introduces additional stress on the components when failures are already occurring. Tests may not be perfect; they are probabilistic in nature.
Slide 9

PPEP: Path Probability of Error Propagation
Definition: PPEP(n_i, n_j) is the probability of node n_i being faulty and causing this error to propagate on the path from n_i to n_j, leading to a failure at n_j. PPEP depends on: the node reliability of the node which is possibly faulty; the link reliability of the links on the path; and the error masking capability of the intermediate nodes on the path.
[Figure: example path through components A, B, C, D with message IDs, illustrating PPEP(B, D).]
Slide 10

Node Reliability
Node reliability (n_r) is a quantitative measure of the reliability of the component. It is obtained by running the diagnostic tests on the states of the components. Coverage of the component: c(n) = #tests that succeeded / #tests. PPEP is inversely proportional to node reliability: a more reliable node is less likely to have originated the chain of errors.
Slide 11

Link Reliability
Link reliability lr(i,j) is a quantitative measure of the reliability of the network link between two components: lr(i,j) = number of messages received at the receiver n_j / number of messages sent by the sender n_i. PPEP is proportional to link reliability: reliable links increase the probability of the path being used for propagating errors.
Slide 12
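The two reliability measures above are simple ratios; a minimal sketch, with hypothetical counter names, is:

```python
def node_reliability(tests_passed, tests_run):
    """Coverage c(n) = #tests that succeeded / #tests run on the
    component's states (Slide 11)."""
    return tests_passed / tests_run

def link_reliability(msgs_received_at_j, msgs_sent_by_i):
    """lr(i,j) = messages received at n_j / messages sent by n_i
    (Slide 12)."""
    return msgs_received_at_j / msgs_sent_by_i

# A component that passed 9 of 10 diagnostic tests:
print(node_reliability(9, 10))    # 0.9
# A link that delivered 95 of 100 observed messages:
print(link_reliability(95, 100))  # 0.95
```

Both quantities feed directly into the PPEP product: node reliability enters as (1 − n_r), link reliability enters as-is.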

Error Masking Capability (EM)
Error masking capability (e_m) is a quantitative measure of the ability of a node to mask an error and not propagate it to the subsequent node on the path. The EM of a component depends on the type of error, e.g., syntactical or semantic errors; a node may be able to mask all off-by-one errors but not an error which leads to a higher-rate stream. PPEP is inversely proportional to the EM of the nodes on the path: intermediate nodes with high EM are less likely to have propagated the error to the root node.
Slide 13

Calculating PPEP for a Path
Example: calculate PPEP from B to D along the path B → C → D:
- n_r(B): calculate the node reliability of B
- lr(B,C): calculate the link reliability of (B, C)
- e_m(C): calculate the EM of C
- lr(C,D): calculate the link reliability of (C, D)
PPEP(B, D) = (1 − n_r(B)) · lr(B,C) · (1 − e_m(C)) · lr(C,D)
For a general path, PPEP(n_1, n_k) = (1 − n_r(n_1)) · lr(1,2) · (1 − e_m(n_2)) · lr(2,3) · ... · lr(i,i+1) · (1 − e_m(n_{i+1})) · lr(i+1,i+2) · ... · (1 − e_m(n_{k−1})) · lr(k−1,k)
Slide 14
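The general PPEP formula on Slide 14 can be sketched directly; the data shapes below (dicts keyed by component name and by sender/receiver pair) are assumptions for illustration, not the authors' implementation.

```python
def ppep(path, n_r, e_m, lr):
    """PPEP along path = [n1, ..., nk]: the probability that n1 is
    faulty, each link delivers the error, and each intermediate node
    fails to mask it."""
    # The path's first node must be faulty and its outgoing link must
    # carry the error.
    p = (1.0 - n_r[path[0]]) * lr[(path[0], path[1])]
    # Each intermediate node must fail to mask the error, and the next
    # link must deliver it.
    for i in range(1, len(path) - 1):
        p *= (1.0 - e_m[path[i]]) * lr[(path[i], path[i + 1])]
    return p

n_r = {"B": 0.8}                                # node reliability of B
e_m = {"C": 0.5}                                # error masking of C
lr = {("B", "C"): 0.9, ("C", "D"): 1.0}         # link reliabilities
print(ppep(["B", "C", "D"], n_r, e_m, lr))      # (1-0.8)*0.9*(1-0.5)*1.0 ≈ 0.09
```

Note how the inverse proportionalities fall out of the product: a high n_r or a high e_m shrinks the corresponding (1 − ·) factor, while a high lr enlarges its factor.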

Overall Flow of the Diagnosis Process
Components interact; the Monitor detects a failure and starts the diagnosis process: (1) create the Diagnosis Tree (DT); (2) traverse the DT down from the root; (3) apply diagnostic tests (compute node reliabilities); (4) calculate PPEP for each node, using the node reliability, link reliability, and EM values kept in the AG; (5) the top-k PPEP nodes are diagnosed as faulty.
Slide 15

Experimental Testbed
Application: Pet Store (version 1.4) from Sun Microsystems. Pet Store runs on top of the JBoss application server with a MySQL database as the back-end, providing an example of a 3-tier environment. A web client emulator (written in Perl) generates client transactions based on sample traces. For the mix of client transactions, we mimic the TPC-W WIPSo distribution with an equal percentage of browse and buy interactions. A web interaction is a complete cycle of communication between the client emulator and the application; a transaction is a sequence of web interactions. Example: Welcome page → Search → View Item details.
Slide 16
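The five steps above can be sketched end to end: score every Diagnosis Tree node by its PPEP to the failed root and report the top-k suspects. All names and data shapes here are illustrative assumptions.

```python
def ppep(path, n_r, e_m, lr):
    # Same PPEP product as on Slide 14 (illustrative implementation).
    p = (1.0 - n_r[path[0]]) * lr[(path[0], path[1])]
    for i in range(1, len(path) - 1):
        p *= (1.0 - e_m[path[i]]) * lr[(path[i], path[i + 1])]
    return p

def diagnose(paths, n_r, e_m, lr, k=3):
    """Score each candidate node in the Diagnosis Tree by the PPEP of
    its path to the failed root, and return the k most likely faulty
    components."""
    scores = {node: ppep(p, n_r, e_m, lr) for node, p in paths.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Candidate nodes and their paths to the failed root D:
paths = {"C": ["C", "D"], "B": ["B", "C", "D"], "A": ["A", "B", "C", "D"]}
n_r = {"A": 0.99, "B": 0.6, "C": 0.95}   # B tested as the least reliable
e_m = {"B": 0.5, "C": 0.5}
lr = {("A", "B"): 1.0, ("B", "C"): 1.0, ("C", "D"): 1.0}
print(diagnose(paths, n_r, e_m, lr, k=1))  # ['B']
```

Here B is blamed: its low node reliability dominates, even though C is closer to the failure, matching the intuition that PPEP weighs how likely each node is to have originated the error chain.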

Logical Topology
[Figure: the web client emulator talks to a J2EE application server consisting of a Web tier (servlets, JSP) and an EJB tier, which talks to the database through a communication layer. Internal and external failure detectors feed failure notification messages, along with the application messages, to Pinpoint and the Monitor.]
Slide 17

Monitor Configuration
The Monitor is provided an input of state transition diagrams for the verified components, and causal tests.
[Figure: example rules as state transition diagrams over states S1–S4, with transitions such as getCardType() / return getCardType(), getData() / return getData(), getExpiryMonth() / return getExpiryMonth(), and getExpiryDate() / return getExpiryDate().]
Slide 18

Comparison with Pinpoint
Pinpoint (Chen et al., DSN 02, NSDI 04) represents a recent state-of-the-art black-box diagnosis system, well explained and demonstrated on an open-source application. We implement the Pinpoint algorithm (online) for comparison with the Monitor's diagnosis approach. Summary of the Pinpoint algorithm: Pinpoint tracks which components have been touched by a particular transaction and determines which transactions have failed. Using a clustering algorithm (UPGMA), Pinpoint correlates the failures of transactions to the components that are most likely to be the cause of the failure.
Slide 19

Performance Metrics
Accuracy: a result is accurate when all components causing a fault are correctly identified. For example, if two components, A and B, are interacting to cause a failure, identifying both would be accurate; identifying only one, or neither, would not be accurate. Example: if the predicted fault set is {A, B, C, D, E} but the faults were in components {A, B}, the accuracy is still 100%. Precision penalizes a system for diagnosing a superset: precision is the ratio of the number of faulty components to the total number of entities in the predicted fault set. In the above example, precision = 40%; components {C, D, E} are false positives.
Slide 20
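The two metrics on Slide 20 can be made concrete with sets; the all-or-nothing encoding of accuracy below is our reading of the slide, sketched for illustration.

```python
def accuracy(predicted, actual):
    """1.0 if every truly faulty component appears in the predicted
    fault set, else 0.0 (superset predictions still count as accurate)."""
    return 1.0 if actual <= predicted else 0.0

def precision(predicted, actual):
    # Fraction of the predicted fault set that is actually faulty;
    # the rest of the prediction is false positives.
    return len(predicted & actual) / len(predicted)

pred, truth = {"A", "B", "C", "D", "E"}, {"A", "B"}
print(accuracy(pred, truth))   # 1.0 (both faulty components identified)
print(precision(pred, truth))  # 0.4 (C, D, E are false positives)
```

This pair of metrics explains the results that follow: a diagnoser can buy accuracy by predicting larger fault sets, which is exactly what precision penalizes.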

Fault Injection on Different Components
We use 1-component, 2-component, and 3-component triggers. 1-component trigger: every time the target component is touched by a transaction, the fault is injected in that component. 2-component trigger: a sequence of 2 components is determined, and during a transaction the last component in the sequence is injected; this mimics an interaction fault between two components, and both components should be flagged as faulty. The 3-component fault is defined similarly to the 2-component one.
Slide 21

Results for a 1-Component Fault Injection
[Figure: two plots of accuracy vs. false positive rate (1 − precision), one for Pinpoint and one for the Monitor.]
Both can achieve high accuracy, but Pinpoint suffers from high false positive rates. In Pinpoint, two different accuracy values can occur at a given precision value, since that precision is achieved for two different cluster sizes. The latency of detection in our system is very low; thus, the faulty component is often at the root of the DT in the Monitor.
Slide 22

Results for a 2-Component Fault Injection
[Figure: two plots of accuracy vs. false positive rate (1 − precision), one for Pinpoint and one for the Monitor.]
The performance of the Monitor declines and Pinpoint's improves relative to the single-component fault, but the Monitor still outperforms Pinpoint. The Monitor gets less opportunity to refine the parameter values, so the PPEP calculation is not as accurate as for the single-component faults.
Slide 23

Results for a 3-Component Fault Injection
[Figure: two plots of accuracy vs. false positive rate (1 − precision), one for Pinpoint and one for the Monitor.]
The performance of the Monitor declines and Pinpoint's improves relative to the single- and two-component faults, but the Monitor still outperforms Pinpoint. Pinpoint's maximum average precision value is ~27%. Pinpoint's improvement is attributed to the fact that a larger number of faulty components causes the selected transactions to fail, giving better performance by the clustering algorithm. The Monitor declines for the same reason as with the 2-component faults.
Slide 24

Latency Comparison
[Figure: Pinpoint's precision and accuracy vs. latency (minutes) for the 1-component, 2-component, and 3-component fault injections.]
The Monitor has an average latency of 58.32 ms with a variance of 4.35 ms. After 3.5 minutes, the accuracy and precision of Pinpoint increase with latency. Pinpoint takes as input the transactions and their corresponding failure status every 30 seconds during a round, and runs the diagnosis for each of these snapshots taken at 30-second intervals.
Slide 25

Related Work
White-box and black-box systems: white box, where the system is observable and, optionally, controllable; black box, where the system is neither. "Alert correlation in a cooperative intrusion detection framework," IEEE Symp. on Security and Privacy 02. Debugging in distributed applications: work on providing tools for debugging problems in distributed applications. Performance problems: "Performance debugging for distributed systems of black boxes," SOSP 03. Misconfigurations: "PeerPressure," ACM SIGMETRICS 04. Network diagnosis: root cause analysis of network fault models, "Shrink," ACM SIGCOMM 05; "Correlating instrumentation data to system states," OSDI 04. Automated diagnosis in COTS systems: automated diagnosis for black-box distributed COTS components, "Automatic Model-Driven Recovery in Distributed Systems," SRDS '05.
Slide 26

Conclusions
We presented an online diagnosis system called the Monitor for arbitrary failures in distributed applications. Diagnosis can be performed online with low latency, and the Monitor's probabilistic diagnosis outperforms Pinpoint by achieving higher accuracy for the same precision values. The complex interaction of components in an e-commerce system makes accurate diagnosis challenging. Clustering algorithms like the one used in Pinpoint depend heavily on the complexity of the application: they need many distinguishable transactions to achieve high accuracy, and the transactions must touch almost all the components. Future work: how do we estimate the diagnosis parameter values more accurately?
Slide 27

Backup Slides
Slide 28

Detectors and Injected Failures
Internal and external failure detectors are created to provide the failure status of transactions to Pinpoint and the Monitor. Faults are injected into Pet Store application components: servlets and EJBs. Four different kinds of fault injection: (1) declared exceptions, (2) undeclared exceptions, (3) endless calls, (4) null calls. The internal detector is more likely to detect the declared and undeclared exceptions and the null calls; the external detector is more likely to detect the endless calls.
Slide 29

Number of Tightly Coupled Components for Pet Store
[Figure: bar chart of the number of tightly coupled components (1 to 5) for the components where faults are injected: item.screen, enter_order_information.screen, order.do, ShoppingClientFacadeLocalEJB, CreditCardEJB, ContactInfoEJB, CatalogEJB, AsyncSenderEJB, AddressEJB.]
Slide 30