The Walking Dead Michael Nitschinger
1 The Walking Dead: A Survival Guide to Resilient Reactive Applications. Michael Nitschinger
2 The Right Mindset
3 "The more you sweat in peace, the less you bleed in war." (U.S. Marine Corps)
6 Not so fast, mister fancy tests!
7 Always ask yourself: What can go wrong?
8 Fault Tolerance 101
9 Fault, Error, Failure: A fault is a latent defect that can cause an error when activated.
10 Fault, Error, Failure: Errors are the manifestations of faults.
11 Fault, Error, Failure: Failure occurs when the service no longer complies with its specifications.
12 Fault, Error, Failure: Errors are inevitable. We need to detect them, recover from them, and mitigate them before they become failures.
13 Reliability is the probability that a system will perform failure-free for a given amount of time. MTTF: Mean Time To Failure. MTTR: Mean Time To Repair.
14 Availability is the percentage of time the system is able to perform its function. availability = MTTF / (MTTF + MTTR)
15 Expression: Downtime/Year
   Three 9s (99.9%): 525.6 min
   Four 9s (99.99%): 52.56 min
   Four 9s and a 5 (99.995%): 26.28 min
   Five 9s (99.999%): 5.256 min
   Six 9s (99.9999%): 0.5256 min
   100%: 0
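As a rough check of the availability formula and the downtime-per-year column, the Python sketch below computes both. The function names and the 365-day year are assumptions made for illustration; they do not come from the slides.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def availability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per year for an availability given as a fraction."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct / 100)
        print(f"{pct}% -> {minutes:.3f} min/year")
```

Both MTTF and MTTR must be in the same unit (hours, for example); the ratio itself is unitless.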
16 Pop Quiz! Wanted: 99.99% availability. Edge Service, User Service, Session Store, Data Warehouse: ???
17 Pop Quiz! Wanted: 99.99% availability. Edge Service, User Service, Session Store, Data Warehouse: 99.99%, 99.99%, 99.99%
18 Pop Quiz! Wanted: 99.99% availability. Edge Service, User Service, Session Store, Data Warehouse: ~99.999%, ~99.999%, ~99.999%
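The arithmetic behind the quiz can be sketched as follows, assuming every service in the chain must be up for a request to succeed and that the services fail independently. Both are simplifying assumptions that the slides do not state, and the helper names are hypothetical.

```python
import math


def serial_availability(parts):
    """Availability of a chain in which every part must be up."""
    return math.prod(parts)


def required_per_component(target, n):
    """Availability each of n equal parts needs so the chain reaches target."""
    return target ** (1.0 / n)


if __name__ == "__main__":
    chain = [0.9999] * 4
    print(f"4 x 99.99% in series: {serial_availability(chain):.6%}")
    print(f"needed per part for 99.99%: {required_per_component(0.9999, 4):.6%}")
```

Under these assumptions, four parts at 99.99% each fall short of 99.99% overall, which is why the individual parts need to be closer to five 9s.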
19 Fault Tolerant Architecture
20 Units of Mitigation are the basic units of error containment and recovery.
21 Escalation is used when recovery or mitigation is not possible inside the unit.
22 Escalation (diagram: one Cluster, two Nodes, five Services, five Endpoints)
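A minimal Python sketch of escalation through nested units of mitigation: each unit tries to handle an error itself and otherwise passes it to its parent. The `Unit` class, the `can_recover` callbacks, and the error strings are hypothetical names chosen for illustration.

```python
class Unit:
    """A unit of mitigation that escalates errors it cannot handle."""

    def __init__(self, name, parent=None, can_recover=lambda error: False):
        self.name = name
        self.parent = parent
        self.can_recover = can_recover

    def handle(self, error):
        """Return the name of the unit that finally handled the error."""
        if self.can_recover(error):
            return self.name
        if self.parent is None:
            raise RuntimeError(f"unhandled error at top level: {error}")
        return self.parent.handle(error)  # escalate one level up


if __name__ == "__main__":
    cluster = Unit("cluster", can_recover=lambda e: True)
    node = Unit("node", parent=cluster)
    service = Unit("service", parent=node, can_recover=lambda e: e == "bad-config")
    endpoint = Unit("endpoint", parent=service, can_recover=lambda e: e == "socket-closed")
    print(endpoint.handle("disk-full"))
```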
26 Redundancy (chart, axes: Cost and Time To Recover): Active/Active, Active/Standby, N+M, Active/Passive
27 The Fault Observer receives system and error events and can guide and orchestrate detection and recovery. (Diagram labels: Unit, Listener, Observer)
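One way to picture the Fault Observer is as a small event hub: units report events to it, it keeps a tally, and it forwards each event to registered listeners. The sketch below is an assumption about how such a component could look; the class and method names are hypothetical.

```python
from collections import defaultdict


class FaultObserver:
    """Collects error events from units and notifies listeners."""

    def __init__(self):
        self._listeners = []
        self.error_counts = defaultdict(int)  # events seen per unit

    def subscribe(self, listener):
        """Register a callable taking (unit, event)."""
        self._listeners.append(listener)

    def report(self, unit, event):
        """Record an event from a unit and forward it to all listeners."""
        self.error_counts[unit] += 1
        for listener in self._listeners:
            listener(unit, event)


if __name__ == "__main__":
    observer = FaultObserver()
    observer.subscribe(lambda unit, event: print(f"{unit}: {event}"))
    observer.report("endpoint-1", "timeout")
```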
30 Detecting Errors
31 A silent system is a dead system.
32 A System Monitor helps to study behaviour and to make sure it is operating as specified.
34 Periodic Checking. Heartbeats monitor tasks or remote services and initiate recovery. Routine Exercises prevent idle unit starvation and surface malfunctions.
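A heartbeat check can be sketched as a table of last-seen timestamps plus a timeout. The `HeartbeatMonitor` class below is a hypothetical illustration; timestamps are passed in explicitly so the logic stays deterministic and easy to test.

```python
class HeartbeatMonitor:
    """Flags units whose last heartbeat is older than timeout_s seconds."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._last_seen = {}

    def beat(self, unit, now):
        """Record that a heartbeat from unit arrived at time now."""
        self._last_seen[unit] = now

    def suspects(self, now):
        """Return the units that have been silent for too long, sorted by name."""
        return sorted(
            unit
            for unit, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=5.0)
    monitor.beat("session-store", now=0.0)
    monitor.beat("user-service", now=4.0)
    print(monitor.suspects(now=6.0))
```

A real monitor would call `suspects` on a timer and hand the result to whatever initiates recovery.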
35 Endpoint pipeline (diagram): Encoder, Decoder, Encoder, Decoder. No traffic: event on idle. Netty writes, Netty reads.
36 Riding over Transients is used to defer error recovery if the error is temporary. "Patience is a virtue to allow the true signature of an error to show itself." (Robert S. Hanmer)
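One simple way to ride over transients is to start recovery only after several consecutive errors, so that a single blip is ignored. The sketch below assumes a consecutive-error threshold; the slide does not prescribe a mechanism, and the `TransientFilter` name is hypothetical.

```python
class TransientFilter:
    """Signals recovery only after `threshold` consecutive errors."""

    def __init__(self, threshold):
        self.threshold = threshold
        self._consecutive = 0

    def record(self, ok):
        """Record one observation; return True when recovery should start."""
        if ok:
            self._consecutive = 0  # a success resets the streak
            return False
        self._consecutive += 1
        return self._consecutive >= self.threshold


if __name__ == "__main__":
    transient_filter = TransientFilter(threshold=3)
    for outcome in (False, True, False, False, False):
        print(outcome, "->", transient_filter.record(outcome))
```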
38 And more! Complete Parameter Checking, Watchdogs, Voting, Checksums, Routine Audits
39 Recovery and Mitigation of Errors
40 Timeout: do not wait forever while holding on to the resource.
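A minimal Python sketch of a timeout around a blocking call, using the standard `concurrent.futures` module. The helper `call_with_timeout` and the stand-in `slow_call` are hypothetical names. The sketch only releases the caller; the timed-out task keeps running in its worker thread until it finishes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback if it takes too long."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block the caller while the worker finishes.
        pool.shutdown(wait=False)


def slow_call():
    """Stands in for a remote call that responds too late."""
    time.sleep(0.5)
    return "late"


if __name__ == "__main__":
    print(call_with_timeout(slow_call, timeout_s=0.05, fallback="fallback"))
```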
41 Failover to a redundant unit when the error has been detected and isolated. Redundancy reminder (chart, axes: Cost and Time To Recover): Active/Active, Active/Standby, N+M
42 Intelligent Retries (chart, axes: Time between Retries and Number of Attempts): Fixed, Linear, Exponential
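The three retry strategies named on the slide differ only in how the delay grows with the attempt number. The sketch below generates the delay sequences and applies them in a small retry loop; all function names are hypothetical, and the `sleep` parameter exists only so the loop can be exercised without real waiting.

```python
import time


def fixed(base, attempts):
    """Same delay before every retry."""
    return [base] * attempts


def linear(base, attempts):
    """Delay grows by base with each retry: base, 2*base, 3*base, ..."""
    return [base * n for n in range(1, attempts + 1)]


def exponential(base, attempts):
    """Delay doubles with each retry: base, 2*base, 4*base, ..."""
    return [base * 2 ** n for n in range(attempts)]


def retry(op, delays, sleep=time.sleep):
    """Call op(); on an exception wait for the next delay and try again."""
    for delay in delays:
        try:
            return op()
        except Exception:
            sleep(delay)
    return op()  # final attempt: let any exception propagate


if __name__ == "__main__":
    print(fixed(1, 4), linear(1, 4), exponential(1, 4))
```

Production retry loops often also cap the maximum delay and add random jitter; both are left out here to keep the comparison between the three curves visible.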
43 Restart can be used as a last resort, with the trade-off of losing state and time.
44 Fail Fast to shed load: better a great partial service than a bad complete one. (Diagram label: Boundary)
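Failing fast at a boundary can be sketched as an admission check: once a fixed number of requests is in flight, new ones are rejected immediately instead of being queued. The `LoadShedder` class and `Overloaded` exception are hypothetical names for illustration.

```python
import threading


class Overloaded(Exception):
    """Raised when a request is rejected at the boundary."""


class LoadShedder:
    """Admits at most max_in_flight concurrent calls and rejects the rest."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, fn):
        # Non-blocking acquire: fail fast instead of waiting for a free slot.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("rejecting request instead of queueing it")
        try:
            return fn()
        finally:
            self._slots.release()


if __name__ == "__main__":
    shedder = LoadShedder(max_in_flight=2)
    print(shedder.run(lambda: "served"))
```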
45 Backpressure & Batching!
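The slide gives only the two keywords, so the following is one possible reading: a bounded queue refuses new work when it is full, which tells the producer to slow down, and the consumer drains items in batches. The helper names are hypothetical; the queue itself is Python's standard `queue.Queue`.

```python
import queue


def offer(q, item):
    """Try to enqueue without blocking; False signals the producer to back off."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


def drain_batch(q, max_batch):
    """Take up to max_batch items that are already waiting in the queue."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


if __name__ == "__main__":
    work = queue.Queue(maxsize=3)
    print([offer(work, i) for i in range(5)])
    print(drain_batch(work, 2))
```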
46 Case Study: Hystrix
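Hystrix is Netflix's latency and fault tolerance library for Java, and the circuit breaker is one of the patterns it is known for. The slide gives no details, so what follows is only a minimal Python sketch of the general circuit breaker idea. The class and parameter names are hypothetical and this is not Hystrix's API.

```python
import time


class CircuitOpen(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Opens after repeated failures and allows a trial call after a pause."""

    def __init__(self, failure_threshold, reset_after_s, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpen("failing fast while the circuit is open")
            # Pause elapsed: fall through and let one trial call run.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The `clock` parameter is injectable so that the open and half-open transitions can be exercised without real waiting.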
47 And more!
   Recovery: Rollback, Roll-Forward, Checkpoints, Data Reset
   Mitigation: Bounded Queuing, Expansive Controls, Marking Data, Error Correcting Codes
49 Recommended Reading
50 Patterns for Fault-Tolerant Software by Robert S. Hanmer
51 Release It! by Michael T. Nygard
52 Any Questions?
53 Thank you!