Anomaly Detection Fault Tolerance Anticipation
- Matilda Rogers
Transcription
1 Anomaly Detection, Fault Tolerance, Anticipation: Patterns. John Allspaw, SVP, Tech Ops. QCon London 2012
2-4 Four Cornerstones (Erik Hollnagel): Anticipation (Knowing What To Expect), Monitoring (Knowing What To Look For), Response (Knowing What To Do), Learning (Knowing What Has Happened)
5 Anomaly Detection
6 Anomaly Detection: Getting at the state of health. Evaluating the state of health. Components AND systems.
7 Supervisory. Example: Active health check. Monitor runs check_http against Component (webserver); exit OK.
8 Supervisory (Monitor, check_http, Component (webserver), exit OK). Pros: Easy to implement; Easy to understand; Well-known pattern. Cons: Messaging can fail; Scalability is limited.
9 Supervisor Sensitivity: 1 sec timeout, 1 retry, 3 sec interval. Timeline: 1s (X), 3s, 1s (X), plus up to ~2.9s for the previous interval (7.9 sec exposure).
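The 7.9 second figure on this slide can be reproduced with a little arithmetic. The sketch below is one reading of the slide, not the speaker's own calculation: it assumes the fault begins just after a successful check, that every attempt runs to its full timeout, and that a retry waits one full interval. The function name and the `schedule_slack_s` parameter (used to model the "~2.9s" of a 3s interval) are hypothetical.

```python
def worst_case_exposure(interval_s, timeout_s, retries, schedule_slack_s=0.1):
    """Worst-case seconds of errors served before an active check confirms a fault.

    Assumptions (not stated on the slide): the fault starts right after a
    passing check, each attempt uses its whole timeout, and each retry is
    scheduled one full interval after the previous attempt.
    """
    # Remainder of the previous interval, during which nobody is looking.
    unobserved = interval_s - schedule_slack_s
    # First failing attempt, then (interval + timeout) for every retry.
    confirming = timeout_s + retries * (interval_s + timeout_s)
    return unobserved + confirming


# Values from the slide: 3 sec interval, 1 sec timeout, 1 retry.
print(round(worst_case_exposure(interval_s=3, timeout_s=1, retries=1), 1))
```

Plugging in other intervals, timeouts, and retry counts gives a quick way to answer the question the talk keeps returning to: how many seconds of errors can you tolerate serving?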
10 Supervisor Sensitivity: Schedule latency (max = N). Request latency, Monitor to Component (webserver) via check_http (max = 0.9s). Response latency, exit OK back to Monitor (max = 0.9s).
11 Supervisor Sensitivity How many seconds of errors can you tolerate serving?
12 Supervisory. Example: Interval passive health check. Component (webserver) reports to Monitor: exit 0, DISK consumption within bounds.
13 Supervisory. Example: Interval passive health check (Monitor, Component (webserver), exit 0, DISK consumption within bounds). Pros: Efficient; Scalability is different; Fewer moving parts; Less exposure; Can submit to multiple places. Cons: Nonideal for network-based services; Different tuning (windowed expectation).
14 Supervisory. Example: Passive health check (timeline diagram: Component, Interval, TIME, ??)
15 Supervisory. Example: Passive health check (timeline diagram: Component, Interval, Schedule Latency, TIME, ??). Exposure = (Schedule + Interval) * UnknownConsecutiveIntervals + 1
16 Frequency and Transience. Probability of false positives: short intervals, low # of retries, short timeouts. Probability of nondetection: long intervals, high # of retries, long timeouts.
17 In-Line. Example: Passive application event logging (application, monitor)
18 Supervisory. Example: Passive application event logging (application, monitor). Pros: On-demand publish. Cons: Onus is on the app; Can't be 100% sure it's working.
19 Supervisory. Example: Passive application event logging (application, monitor). Positive events (sales, registrations, etc.). Negative events (errors, exceptions, etc.). Lack or presence of data mean different things, so history is paramount.
20 Context
21 Evaluation: what is abnormal?
22 (Plot: Response Time over Time)
23-28 Static Thresholds (plot: Response Time over Time, with Warning and Critical threshold lines)
29 Context Normal?
30 Context 24 hours
31 Context 7 days
32 Normal But Noisy Context
33 Context Smoothing?
34 Context: Holt-Winters Exponential Smoothing. Recent points influencing a forecast, exponentially decreasing influence backwards in time. en.wikipedia.org/wiki/exponential_smoothing
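To make "exponentially decreasing influence backwards in time" concrete, here is a minimal sketch of simple exponential smoothing. This is the single-parameter building block only, not full Holt-Winters (which adds trend and seasonal terms); the function name and the `alpha` parameter name are illustrative choices.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing of a sequence of numbers.

    Each smoothed value is alpha * newest point + (1 - alpha) * previous
    smoothed value, so a point k steps back carries weight proportional
    to (1 - alpha) ** k.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        return []
    smoothed = [series[0]]  # seed with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


print(exponential_smoothing([10, 20, 20], alpha=0.5))  # [10, 15.0, 17.5]
```

A larger `alpha` tracks recent points more closely; a smaller one smooths harder and reacts more slowly.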
35 Context Aberrant Behavior Detection in Time Series for Network Monitoring full_papers/brutlag/brutlag_html/
36 Dynamic Thresholds
37 Dynamic Thresholds: Upper bound, Raw data, Lower bound
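A rough sketch of how an upper and lower bound can follow the raw data: keep a smoothed forecast and a smoothed absolute deviation, and flag points that land outside forecast ± width × deviation. This is a simplified, non-seasonal illustration of the idea, not the Holt-Winters aberration algorithm itself; `alpha`, `gamma`, `width` and the function name are assumptions made for the example.

```python
def dynamic_bounds(series, alpha=0.3, gamma=0.3, width=3.0):
    """Return a list of (value, lower, upper, is_aberrant) tuples.

    The bounds for each point are computed from history only, then the
    forecast and the deviation estimate are updated with that point.
    """
    if not series:
        return []
    forecast = series[0]
    deviation = 0.0
    results = []
    for x in series:
        lower = forecast - width * deviation
        upper = forecast + width * deviation
        # No band exists until some deviation has been observed.
        aberrant = deviation > 0 and not (lower <= x <= upper)
        results.append((x, lower, upper, aberrant))
        # Update the deviation first, using the forecast made before seeing x.
        deviation = gamma * abs(x - forecast) + (1 - gamma) * deviation
        forecast = alpha * x + (1 - alpha) * forecast
    return results


response_times = [10, 11, 10, 11, 10, 11, 10, 50]
for value, lower, upper, aberrant in dynamic_bounds(response_times):
    print(f"{value:5.1f}  [{lower:6.2f}, {upper:6.2f}]  {'ABERRANT' if aberrant else ''}")
```

Unlike a static threshold, the band widens when the signal is normally noisy and tightens when it is calm, so the same spike can be unremarkable in one context and aberrant in another.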
38 Dynamic Thresholds Hrm...
39 Dynamic Thresholds Hrm...
40 Dynamic Thresholds Holt-Winters Aberration Ah!
41 Dynamic Thresholds: Graphite metrics collection w/ Holt-Winters aberrations. Nagios check for Graphite data.
42 Four Cornerstones (Erik Hollnagel): Anticipation (Knowing What To Expect), Monitoring (Knowing What To Look For), Response (Knowing What To Do), Learning (Knowing What Has Happened)
43 FAULT TOLERANCE
44 Detection of fault X. Triggers corrective action Y. Clean up, report back. (RECOVERY OR MASKING)
45 Variation Tolerance
46 Adaptive Systems Expected Variation
47 Adaptive Systems Expected Variation
48 Adaptive Systems Expected Variation
49-50 (Figure: Expected Variation; Disturbance; New Disturbances Arise; Control; compensation; Compensation is Exhausted; decompensation. Woods, 2011)
51 (Figure: Variation; Expected Variation; Disturbance; Fault; New Disturbances Arise; Control; compensation; Compensation is Exhausted; decompensation)
52 Variations != Faults
53 Dead, Corrupt, Late, Wrong
54-55 Fault Tolerance: Redundancy. Spatial (server, network, process). Temporal (checkpoint, rollback). Informational (data in N locations).
56 Spatial Redundancy 2 2
57 Spatial Redundancy Active/Active
58 Spatial Redundancy Active/Passive
59 Spatial Redundancy Roaming Spare Dedicated Spare
60 In-Line Fault Tolerance: App, PHP (thrift client), talks Thrift to Search (Lucene/Solr). Connect timeout, Send timeout, Receive timeout.
61 In-Line Fault Tolerance: App, X, Search (Lucene/Solr). 1. App attempts connection, can't. 2. Caches APC user object with 60s TTL, key=server:port. 3. Moves to next server in rotation, skipping any found in APC.
62 In-Line Fault Tolerance /lib/php/src/tsocketpool.php
63 In-Line Fault Tolerance. Pros: Distributed checking and perspective; Handles transient failures; Auto-recovery. Cons: Onus is on the app for implementation.
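The three steps on slide 61 can be sketched in a few lines. The version below is an illustration in Python, not the PHP/Thrift code the talk points to: a plain dictionary stands in for the APC cache, and `FailoverPool`, `connect`, `ttl_s` and `clock` are hypothetical names introduced for the example.

```python
import time


class FailoverPool:
    """Round-robin over servers, skipping any recently marked as down."""

    def __init__(self, servers, connect, ttl_s=60.0, clock=time.monotonic):
        self.servers = list(servers)
        self.connect = connect      # callable taking "host:port"; raises OSError on failure
        self.ttl_s = ttl_s          # how long a failed server is skipped (60s on the slide)
        self.clock = clock
        self.marked_down = {}       # "host:port" -> expiry time (stand-in for APC)
        self._next = 0

    def _is_down(self, server):
        expiry = self.marked_down.get(server)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self.marked_down[server]   # TTL elapsed: give it another chance
            return False
        return True

    def get_connection(self):
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if self._is_down(server):
                continue                    # step 3: skip servers found in the cache
            try:
                return self.connect(server)
            except OSError:
                # Steps 1 and 2: connection failed, remember it for ttl_s seconds.
                self.marked_down[server] = self.clock() + self.ttl_s
        raise ConnectionError("no servers available")
```

Because the marker expires on its own, a server that comes back is retried without any operator action, which is the auto-recovery listed under Pros.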
64 Fault Tolerance: Nagios Event Handlers. Attempt to recover from specific conditions. Chain together recovery actions. eventhandlers.html
65 If (fault X) then HUP process; re-check
   If (OK) then notify+exit
   ELSE Hard restart process; re-check
   If (OK) then notify+exit
   ELSE Remove from production; notify+exit
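The escalation on this slide is a chain: try the gentlest action, re-check, and only escalate if the check still fails. A minimal sketch of that control flow follows. It is not a Nagios event handler script; `run_recovery_chain`, `check`, `actions` and `notify` are hypothetical names, and the caller supplies the actual recovery steps as callables.

```python
def run_recovery_chain(check, actions, notify):
    """Run recovery actions in order, re-checking after each one.

    `actions` is a list of (name, callable) pairs ordered from least to
    most drastic. The last action is the final fallback (for example,
    removing the node from production) and is not followed by a re-check.
    Returns True if the service recovered, False otherwise.
    """
    *recoveries, (final_name, final_action) = actions
    for name, action in recoveries:
        action()
        if check():
            notify(f"recovered after: {name}")
            return True
    final_action()
    notify(f"not recovered, final action: {final_name}")
    return False
```

With three actions (HUP, hard restart, remove from production) this reproduces the branches on the slide, and every path ends in a notification.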
66 How many seconds of errors can you tolerate serving?
67 Fail Closed: When fault is found, and can't be recovered or masked, operations cease to protect the rest of the system from damage.
68 Depth and Dependencies: Monitor, Load Balancers, Health check, App, DB
69 Depth and Dependencies: Monitor, Load Balancers, Health check, App, DB. WARNING: Don't be too crazy.
70 Fail Closed: Aggregate Cluster Checking (diagram: failed nodes marked X). If (clusterfail > 25%) then notify+exit ELSE OK
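As a rough illustration of aggregate cluster checking, the sketch below alerts on the fraction of failed nodes rather than on any single node. The function name, the string return values, and the input format (one boolean per node) are assumptions for the example; only the 25% threshold comes from the slide.

```python
def cluster_check(node_ok, max_failed_fraction=0.25):
    """Return "CRITICAL" when more than max_failed_fraction of nodes fail.

    `node_ok` is a sequence of booleans, one per node, True meaning the
    node passed its individual health check.
    """
    if not node_ok:
        raise ValueError("cluster has no nodes to check")
    failed = sum(1 for ok in node_ok if not ok)
    if failed / len(node_ok) > max_failed_fraction:
        return "CRITICAL"   # notify + exit
    return "OK"


print(cluster_check([True, True, True, False]))    # exactly 25% failed
print(cluster_check([True, True, False, False]))   # 50% failed
```

Note the strict comparison: with the slide's rule, exactly 25% failed is still OK.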
71 Fail Open: When a fault happens, and can't be masked or recovered, operations continue without the feature.
72 Fail Open, Example 1 at Etsy: Geo Targeting. 50ms internal SLA on guessing location via client IP. If >50ms, we just don't show local results.
73 Fail Open, Example 2 at Etsy: Rate Limiting (App, Memcache). Internal SLA on incrementing counters + checking totals. If >SLA, we let the action continue, and throw a fire-and-forget counter if we can.
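Both examples share one shape: give the optional feature a time budget, and carry on without it if the budget is blown. A minimal sketch of that shape is below. It is not Etsy's implementation; `fail_open`, `sla_s`, `fallback` and the thread-pool approach are assumptions chosen to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for optional, SLA-bound calls (size is arbitrary here).
_executor = ThreadPoolExecutor(max_workers=4)


def fail_open(fn, sla_s, fallback=None):
    """Call fn() but wait at most sla_s seconds for its result.

    If fn is too slow or raises, return `fallback` so the request can
    continue without the feature. A slow call is abandoned, not killed:
    it keeps running in the background and its result is discarded.
    """
    future = _executor.submit(fn)
    try:
        return future.result(timeout=sla_s)
    except FutureTimeout:
        return fallback          # over the SLA: continue without the feature
    except Exception:
        return fallback          # the feature itself failed: same outcome
```

For the geo targeting case, a call might look like `fail_open(lambda: guess_location(client_ip), sla_s=0.050)`, where `guess_location` is a hypothetical lookup and a `None` result means "don't show local results".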
74 SYSTEMIC
75 App Cache DB Search Logging Queue
76 App Cache DB Search Logging Queue
77 Functional Resonance
78
79 Shop Stats
80 Shop Stats App Cache DB Search Logging Queue
81 Registration App Cache DB Search Logging Queue
82 Registration
83 Shop Stats Logins Registrations Checkout New Listings Photos Search API Rate limiting Data Analysis Search A/B analysis Page performance Search Ads Editorial content systems Feedback Messaging/Convos Activity Feeds Circles Shipping Mobile Internationalization Testing Fraud
84 Systemic Application/Functionality Health Componential/Resource Health
85 Four Cornerstones (Erik Hollnagel): Anticipation (Knowing What To Expect), Monitoring (Knowing What To Look For), Response (Knowing What To Do), Learning (Knowing What Has Happened)
86 Anticipation: During design of architecture. During choice of technologies. During design of monitoring and metrics.
87 TRADE-OFFS
88 What could possibly go wrong?
89 REQUISITE IMAGINATION
90 (Figure: Possible Situations; Foreseeable Situations; Situations Considered By Novice Designer; Situations Considered By Average Designer; Situations Considered By Expert Designer. Adamski and Westrum, 2003)
91 Anticipation: Failure Mode Effects Analysis (FMEA). Failure Mode Effects and Criticality Analysis (FMECA). Failure_mode,_effects,_and_criticality_analysis
92 Architectural reviews. Go or No-Go meetings. Game Day exercises.
93 Anticipation Servers Networks Software Applications Monitoring Metrics Traffic
94 PEOPLE
95 Anticipation (Knowing What To Expect), Monitoring (Knowing What To Look For), Response (Knowing What To Do), Learning (Knowing What Has Happened)
96 THE END
More informationAnalysis and Optimization. Carl Waldspurger Irfan Ahmad CloudPhysics, Inc.
PRESENTATION Practical Online TITLE GOES Cache HERE Analysis and Optimization Carl Waldspurger Irfan Ahmad CloudPhysics, Inc. SNIA Legal Notice The material contained in this tutorial is copyrighted by
More informationTechnical Guide Network Video Recorder Enterprise Edition Maintenance Guide
Technical Guide Network Video Recorder Enterprise Edition Maintenance Guide Network Video Management System March 23, 2018 NVMSTG008 Revision 1.3.0 CONTENTS 1. Overview... 3 1.1. About This Document...
More informationRELIABILITY & AVAILABILITY IN THE CLOUD
RELIABILITY & AVAILABILITY IN THE CLOUD A TWILIO PERSPECTIVE twilio.com To the leaders and engineers at Twilio, the cloud represents the promise of reliable, scalable infrastructure at a price that directly
More informationFAULT TOLERANT SYSTEMS
FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 18 Chapter 7 Case Studies Part.18.1 Introduction Illustrate practical use of methods described previously Highlight fault-tolerance
More informationCS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:
CS 47 Spring 27 Mike Lam, Professor Fault Tolerance Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapter 8) Various online
More informationExBC software C5.7.X. User manual. Part no Issue no 04 Date 04/2018
Part no 6159921030 Issue no 04 Date 04/2018 ExBC software C5.7.X User manual Model Part number EABC15-900 6151658410 EABC26-560 6151658420 EABC32-410 6151658430 EABC45-330 6151658440 EABC50-450 6151658450
More informationYogesh Simmhan. escience Group Microsoft Research
External Research Yogesh Simmhan Group Microsoft Research Catharine van Ingen, Roger Barga, Microsoft Research Alex Szalay, Johns Hopkins University Jim Heasley, University of Hawaii Science is producing
More information