Non Intrusive Detection & Diagnosis of Failures in High Throughput Distributed Applications. DCSL Research Thrusts

Similar documents
Self Checking Network Protocols: A Monitor Based Approach

Distributed Diagnosis of Failures in a Three Tier E-Commerce System. Motivation

Stateful Detection in High Throughput Distributed Systems

NON INTRUSIVE DETECTION AND DIAGNOSIS OF FAILURES IN HIGH THROUGHPUT DISTRIBUTED APPLICATIONS. Research Initiatives

Providing Automated Detection of Problems in Virtualized Servers using Monitor framework

Distributed Diagnosis of Failures in a Three Tier E- Commerce System

How To Keep Your Head Above Water While Detecting Errors

Stateful Detection in High Throughput Distributed Systems

Stateful Detection in High Throughput Distributed Systems

Collaborative Intrusion Detection System : A Framework for Accurate and Efficient IDS. Outline

Analysis of Black-Hole Attack in MANET using AODV Routing Protocol

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

How To Keep Your Head Above Water While Detecting Errors

Viewstamped Replication to Practical Byzantine Fault Tolerance. Pradipta De

CMPE 257: Wireless and Mobile Networking

CSE 5306 Distributed Systems. Fault Tolerance

02 - Distributed Systems

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Reliable Distributed System Approaches

Achieving High Survivability in Distributed Systems through Automated Response

Chapter 8 Fault Tolerance

Toward Intrusion Tolerant Clouds

Assignment 12: Commit Protocols and Replication Solution

Robust Data Dissemination in Sensor Networks

CSE 5306 Distributed Systems

Sleep/Wake Aware Local Monitoring (SLAM)

Dep. Systems Requirements

02 - Distributed Systems

AN ADAPTIVE ALGORITHM FOR TOLERATING VALUE FAULTS AND CRASH FAILURES 1

Basic vs. Reliable Multicast

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Today: Fault Tolerance

Lecture 9. Quality of Service in ad hoc wireless networks

Failure Diagnosis and Cyber Intrusion Detection in Transmission Protection System Assets Using Synchrophasor Data

Architectural Support for Mode-Driven Fault Tolerance in Distributed Applications

Self-Adapting Epidemic Broadcast Algorithms

System Models for Distributed Systems

Toward Intrusion Tolerant Cloud Infrastructure

Distribution Transparencies For Integrated Systems*

Communication. Distributed Systems Santa Clara University 2016

Fault Tolerance. Distributed Systems. September 2002

Consensus and related problems

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson

Detection of Vampire Attack in Wireless Adhoc

Subject: Adhoc Networks

Multipath Routing Protocol for Congestion Control in Mobile Ad-hoc Network

Fault Tolerance. Basic Concepts

An Efficient Routing Approach and Improvement Of AODV Protocol In Mobile Ad-Hoc Networks

Somersault Software Fault-Tolerance

udirec: Unified Diagnosis and Reconfiguration for Frugal Bypass of NoC Faults

Bayeux: An Architecture for Scalable and Fault Tolerant Wide area Data Dissemination

Path-based Macroanalysis for Large Distributed Systems

Module 6 STILL IMAGE COMPRESSION STANDARDS

Priya Narasimhan. Assistant Professor of ECE and CS Carnegie Mellon University Pittsburgh, PA

Congestion Control in Mobile Ad-Hoc Networks

ECS-087: Mobile Computing

Failure Tolerance. Distributed Systems Santa Clara University

TCP PERFORMANCE FOR FUTURE IP-BASED WIRELESS NETWORKS

Chapter 7 CONCLUSION

Providing Real-Time and Fault Tolerance for CORBA Applications

Chapter 5 Ad Hoc Wireless Network. Jang Ping Sheu

Introducing Campus Networks

Today: Fault Tolerance. Fault Tolerance

Advantages and disadvantages

Black-Box Problem Diagnosis in Parallel File System

Exercise 12: Commit Protocols and Replication

Impact of transmission errors on TCP performance. Outline. Random Errors

Lecture 22: Fault Tolerance

Fault-Injection testing and code coverage measurement using Virtual Prototypes on the context of the ISO standard

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski

Distributed Systems. Prof. Dr. Schahram Dustdar Distributed Systems Group Vienna University of Technology. dsg.tuwien.ac.

PIE in the Sky : Online Passive Interference Estimation for Enterprise WLANs

Practical Byzantine Fault Tolerance (The Byzantine Generals Problem)

Distributed Systems Fault Tolerance

Secure Networking. Dr. Douglas Maughan DARPA / ITO

Sensors for Changing Enterprise Systems

Chapter 8 Fault Tolerance

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery

presenter Hakim Weatherspoon CS 6464 February 5 th, 2009

Overlay Networks for Multimedia Contents Distribution

What Is Congestion? Effects of Congestion. Interaction of Queues. Chapter 12 Congestion in Data Networks. Effect of Congestion Control

IMS signalling for multiparty services based on network level multicast

Introduction to Distributed Systems

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction

CSE 513: Distributed Systems (Distributed Shared Memory)

Distributed Systems COMP 212. Lecture 19 Othon Michail

CMSC 417. Computer Networks Prof. Ashok K Agrawala Ashok Agrawala. October 30, 2018

Behavioral Analysis for Intrusion Resilience. Ahmed Fawaz Dec 6, 2016

Programming Distributed Systems

Design and Performance Evaluation of a New Spatial Reuse FireWire Protocol. Master s thesis defense by Vijay Chandramohan

Overview ECE 753: FAULT-TOLERANT COMPUTING 1/21/2014. Recap. Fault Modeling. Fault Modeling (contd.) Fault Modeling (contd.)

Routing Protocol Approaches in Delay Tolerant Networks

Fundamental Questions to Answer About Computer Networking, Jan 2009 Prof. Ying-Dar Lin,

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 21: Network Protocols (and 2 Phase Commit)

Networking interview questions

Transport layer issues

Probabilistic Diagnosis of Performance Faults in Large-Scale Parallel Applications

Network-Attack-Resilient Intrusion- Tolerant SCADA for the Power Grid

Transcription:

Non Intrusive Detection & Diagnosis of Failures in High Throughput Distributed Applications Saurabh Bagchi Dependable Computing Systems Lab (DCSL) & The Center for Education and Research in Information Assurance and Security (CERIAS) School of Electrical and Computer Engineering Purdue University Joint work with: Gunjan Khanna, Ignacio Laguna, Fahad Arshad (Students); Gautam Kar (IBM); Peter Shier (Microsoft) Work supported by: IBM, Purdue Research Foundation 1 DCSL Research Thrusts Wireless link Wired link End-to-end Dependability = Node-level issues Applications Middleware Reliable communications Operating system Hardware + Network-level issues 2

Research Projects in DCSL Framework for distributed intrusion tolerant system How to build an adaptive infrastructure for diagnosing and recovering from failures in a distributed platform? Application: Distributed e-commerce application, Voice over IP system Black-box diagnosis How to diagnose source of errors in high throughput distributed apps? Application: Distributed e-learning application, Distributed e- commerce application Dependable ad-hoc and sensor networks How to build dependable network out of inherently unreliable components with resource constraints? Application: Monitoring environmental conditions in urban areas 3 Monitor Project: Motivation Distributed Network Protocols are used in everyday computing Increased reliance on these protocols for critical applications Financial Sector, Telecommunications, Security etc. Cost of downtimes of these systems can run into several millions of dollars Vulnerable to naturally occurring errors and malicious attacks Response failure, Timing failure, Human mis-configurations 4

Design Goals A generic framework which should address both detection and diagnosis Detection: Evidence of a protocol behavior which differs from the defined set of correct actions Diagnosis: Process to pin-point the root cause of the detected failure Fault tolerance system is non-intrusive to the application Application is treated as black-box No explicit probes during runtime Two systems operate asynchronously Fast runtime detection and diagnosis Recovery is made more feasible 5 Solution Approach: Monitor Monitor Monitor Protocol entity Interaction among protocol entities RuleBase Monitoring interactions among protocol entities Monitor provides the fault tolerance services It overhears message exchanges between Protocol Entities (PEs) For detection, Monitor matches message interactions against an anomaly-based rule base For diagnosis, Monitor creates a dependency graph and runs a rulebase over the deduced state variables of the PEs 6

Application Example Distributed e-learning based application at Purdue Uses tree based reliable multicast (TRAM) Reliable delivery of multimedia stream to a large number of receivers Ensures reliability of message transfer in case of node or link failures and message errors. Sender RH RH RH RH R R R R R Data Flow Ack Flow Buffer 7 Application Example Distributed e-commerce application based on J2EE middleware PetStore: Buys and sells pet related supplies with additional system administrator and supplier interactions Three tier e-commerce system J2 EE Application Server Web client emulator External Failure Detector Web Tier component component component communication layer EJB Tier component component component communication layer Database Internal Failure Detector Pinpoint Monitor Application interactions Failure or notification message 8

Monitor Architecture 1. Data Capturer: Snoops over communication between PEs. 2. State Maintainer: Contains event definitions & reduced STDs. Flags rule matching based on State Event 3. Rule Classifier: Decides if rules are to be matched at current monitor. 4. Interaction Component: Responsible for interactions between Monitors for distributed rule matching. 9 Structure of Rule Base Rule matching engine invoked by State Maintainer Rules defined based on protocol specifications and QoS requirements. Rules are anomaly based Currently created manually by sysadmin or from UML specifications Rules can be Combinatorial: Valid for entire duration except for transients Consists of expressions of state variables arranged as an expression tree yielding Boolean result Temporal: Associated time component for precondition and postcondition 10

Detection: Challenges & Solutions How to overhear the message interactions between PEs Passive approach: Broadcast communication medium; Port mirroring on router Active approach: The middleware on which PE runs is modified to forward messages to Monitor Efficient rule matching Customized syntax for rules is developed based on Temporal Logic Actions (TLA) Example: L V t U, t (t i,t i+k ) Highly efficient rule matching for each incoming event for each type of rule Scalability Hierarchy of local and non-local monitors Filtering at each level Most interactions are local and hence configuration is optimized 11 Sample Failure Scenarios E-learning application Receiver is faulty; Withholds Ack causing slow data rate and high buffer occupancy; Example of error propagation across several PEs A repair head runs out of buffer space; Unable to handle the requisite number of receivers E-commerce application Database lock is not released; An update transaction by the supplier application is blocked indefinitely A malicious user accesses private data from the back-end database 12

Results: E-learning Application MPEG-2 video stream with single server, multiple clients Minimum data rate 20 KB/sec, Max data rate 40 KB/sec Errors injected in bursts burst length = 15 ms. Error models: Stuck-at-Fault; Directed; Random Loose clients check data rate after 4 Ack windows, tight clients after every Ack window. Possible outcomes Exception (E), Client crash (C), Data rate error (DE), No failures (NF) Single level Monitor has accuracy 84.37% Hierarchical Monitor has accuracy of 90.97%, 7% more than in the single Monitor case 13 Diagnosis: Challenges Error propagation could cause multiple simultaneous failures Fast error propagation makes diagnosis difficult since chain may span multiple PEs Existing systems consider entity manifesting failure to be faulty Under failure condition, further stress in the system is to be avoided Monitor should not send active probes/tests to the application system for diagnosis There are various factors causing uncertainty Messages may be lost Monitor capacity may temporarily get overwhelmed PEs may be able to mask some errors 14

Diagnosis Approach Monitor maintains a causal graph with events ordered according to the logical time On detection of a violation at n f, diagnosis starts by building a suspicion tree of all the nodes which sent messages to n f say set A Each node in the suspicion set is tested using a test procedure If no node is faulty, then suspicion set is expanded to include the nodes which sent messages to nodes in A D (5) (7) C B (2) (3) (1) B C A The path P from any node n i to the root n f constitutes a possible path for error propagation PPEP(n i, n f ) is defined as the probability of node n i being faulty and causing this error to propagate on the path from n i to n f, leading to a failure at D 15 Diagnosis Flow-Chart 16

Sample Failure Scenarios: E-commerce Application Null call: Instead of returning the appropriate value, a method returns a null object Consequently, the EJB does a dummy operation and the database is not updated A subsequent transaction sees an error because it is reading an incorrect database state Sample Result Achieves high accuracy as does the best of breed static dependency graph scheme (Pinpoint) But incurs much fewer false positives Performance dependent on length of chain, resources at Monitor, quality of rules 17 What s in the works High throughput applications More efficient per packet processing Sampling so that some of the packets are not processed and discarded Challenges: Maintain accuracy, decide which packets to drop Handling failures in the Monitor Replication to tolerate environment-specific errors Leverage work on synthetically introducing diversity to create diverse replicas Challenges: Low overhead replica coordination 18

Conclusion We have a system (the Monitor) that can do low overhead detection and diagnosis in black-box distributed applications Applied to four real applications so far Distributed e-learning Distributed e-commerce Virtualized server environment (IBM) Device drivers in Windows XP (Microsoft) 19 Publications 1. Gunjan Khanna, Saurabh Bagchi, Kirk Beaty, Andrew Kochut, and Gautam Kar, Providing Automated Detection of Problems in Virtualized Servers using Monitor framework, In the Workshop on Applied Software Reliability (WASR), held with the IEEE International Conference on Dependable Systems and Networks (DSN), 6 pages, June 25-28, 2006. 2. Gunjan Khanna, Padma Varadharajan, and Saurabh Bagchi, Automated Online Monitoring of Distributed Applications through External Monitors, IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 2, pp. 115-129, Apr-Jun, 2006. 3. Nipoon Malhotra, Shrish Ranjan, and Saurabh Bagchi, LRRM: A Randomized Reliable Multicast Protocol for Optimizing Recovery Latency and Buffer Utilization, In the 24th IEEE Symposium on Reliable Distributed Systems (SRDS), pp. 215-225, October 26-28, 2005, Orlando, Florida. 4. Gunjan Khanna, Padma Varadharajan, and Saurabh Bagchi, Self Checking Network Protocols: A Monitor Based Approach, In Proceedings of the 23rd IEEE Symposium on Reliable Distributed Systems (SRDS), pp. 18-30, October 18-20, 2004, Florianopolis, Brazil. DCSL URL: www.ece.purdue.edu/~dcsl 20