The Walking Dead Michael Nitschinger

1 The Walking Dead: A Survival Guide to Resilient Reactive Applications. Michael Nitschinger

2 The Right Mindset

3 "The more you sweat in peace, the less you bleed in war." - U.S. Marine Corps

6 Not so fast, mister fancy tests!

7 Always ask yourself: "What can go wrong?"

8 Fault Tolerance 101

9 Fault → Error → Failure: A fault is a latent defect that can cause an error when activated.

10 Fault → Error → Failure: Errors are the manifestations of faults.

11 Fault → Error → Failure: Failure occurs when the service no longer complies with its specifications.

12 Fault → Error → Failure: Errors are inevitable. We need to detect, recover from and mitigate them before they become failures.

13 Reliability is the probability that a system will perform failure-free for a given amount of time. MTTF: Mean Time To Failure. MTTR: Mean Time To Repair.

14 Availability is the percentage of time the system is able to perform its function. availability = MTTF / (MTTF + MTTR)
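A minimal sketch of the availability formula, assuming MTTF and MTTR are given in the same unit (hours here). The function name and the sample numbers are illustrative and not taken from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails every 1000 hours on average, takes 1 hour to repair.
print(f"{availability(1000, 1):.4%}")  # 99.9001%
```

Shrinking MTTR raises availability just as effectively as growing MTTF, which is why fast detection and recovery matter as much as avoiding faults.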

15 Expression               Downtime/Year
   Three 9s (99.9%)         … min
   Four 9s (99.99%)         … min
   Four 9s and a … (…%)     … min
   Five 9s (99.999%)        … min
   Six 9s (99.9999%)        … min
   100%                     0
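The downtime column follows directly from the availability percentage. The sketch below assumes a 365-day year (525,600 minutes); the slide's own rounding may differ, and the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year the system may be down at the given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for label, percent in [
    ("Three 9s", 99.9),
    ("Four 9s", 99.99),
    ("Five 9s", 99.999),
    ("Six 9s", 99.9999),
]:
    print(f"{label:<9} {percent:>8}%  {downtime_minutes_per_year(percent):8.2f} min")
```

Under this assumption three nines allows about 525.6 minutes of downtime per year, and every additional nine divides that budget by ten.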

16 Pop Quiz! Wanted: 99.99% Availability. Services: Edge Service, User Service, Session Store, Data Warehouse. Per-service availability: ???

17 Pop Quiz! Wanted: 99.99% Availability. Services: Edge Service, User Service, Session Store, Data Warehouse. Per-service availability: 99.99%, 99.99%, 99.99%

18 Pop Quiz! Wanted: 99.99% Availability. Services: Edge Service, User Service, Session Store, Data Warehouse. Per-service availability: ~99.999%, ~99.999%, ~99.999%
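One way to reason about the quiz, sketched under assumptions the slides do not state: the three dependencies are all required for a request to succeed, and they fail independently. In that model availabilities multiply, so three dependencies at 99.99% each cannot deliver 99.99% overall. The function names are illustrative.

```python
import math
from typing import Iterable


def serial_availability(parts: Iterable[float]) -> float:
    """Availability of a chain in which every part must be up (independent failures)."""
    return math.prod(parts)


def required_per_part(target: float, count: int) -> float:
    """Availability each of `count` equal parts needs so the chain reaches `target`."""
    return target ** (1 / count)


combined = serial_availability([0.9999] * 3)
needed = required_per_part(0.9999, 3)
print(f"three dependencies at 99.99% each: {combined:.4%}")
print(f"needed per dependency for 99.99%:  {needed:.4%}")
```

In this simplified model each dependency has to be noticeably better than the target, which is consistent with the slide's rounded answer of roughly five nines.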

19 Fault Tolerant Architecture

20 Units of Mitigation are the basic units of error containment and recovery.

21 Escalation is used when recovery or mitigation is not possible inside the unit.

22 Escalation (diagram of nested units: Cluster, Node, Service, Endpoint)

26 Redundancy (chart of Cost against Time To Recover: Active/Active, Active/Standby, N+M, Active/Passive)

27 The Fault Observer receives system and error events and can guide and orchestrate detection and recovery. (Diagram: Units, Observer, Listeners)
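A minimal sketch of a fault observer, assuming units report events by name and listeners are plain callbacks. The class, method names and event strings are hypothetical and only illustrate the idea on the slide.

```python
from typing import Callable, List, Tuple

Listener = Callable[[str, str], None]  # receives (unit name, event description)


class FaultObserver:
    """Collects events from units and forwards each one to every listener."""

    def __init__(self) -> None:
        self._listeners: List[Listener] = []
        self.history: List[Tuple[str, str]] = []

    def add_listener(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def report(self, unit: str, event: str) -> None:
        # Keep a record so later events can be correlated, then notify listeners.
        self.history.append((unit, event))
        for listener in self._listeners:
            listener(unit, event)


observer = FaultObserver()
alerts: List[str] = []
observer.add_listener(lambda unit, event: alerts.append(f"{unit}: {event}"))
observer.report("endpoint-1", "connection reset")
print(alerts)
```

Keeping the observer separate from the units means recovery decisions can draw on events from several units at once.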

30 Detecting Errors

31 A silent system is a dead system.

32 A System Monitor helps to study behaviour and to make sure the system is operating as specified.

34 Periodic Checking. Heartbeats monitor tasks or remote services and initiate recovery. Routine Exercises prevent idle unit starvation and surface malfunctions.
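A heartbeat check can be sketched as follows, assuming each unit calls `beat` periodically and a supervisor polls `overdue`. The class name, the timeout value and the injected clock are illustrative choices; the clock is injectable only so the behaviour can be demonstrated without real waiting.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per unit and reports units that went quiet."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def beat(self, unit: str) -> None:
        self._last_seen[unit] = self._clock()

    def overdue(self) -> List[str]:
        """Units whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(u for u, seen in self._last_seen.items() if now - seen > self._timeout_s)


# Demonstration with a fake clock instead of real time.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=5.0, clock=lambda: fake_now[0])
monitor.beat("session-store")
fake_now[0] = 3.0
monitor.beat("user-service")
fake_now[0] = 6.0
print(monitor.overdue())  # session-store was last seen 6 seconds ago
```

A supervisor would call `overdue` on a schedule and start recovery for the units it returns.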

35 Endpoint (diagram: Encoder, Decoder, Netty Writes, Netty Reads; No Traffic, Event on Idle)

36 Riding over Transients is used to defer error recovery if the error is temporary. "Patience is a virtue to allow the true signature of an error to show itself." - Robert S. Hanmer
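One simple way to ride over transients is to start recovery only after several consecutive errors. The sketch below assumes that policy; the class name and the threshold are hypothetical, and other policies (such as counting errors within a time window) would fit the pattern equally well.

```python
class TransientFilter:
    """Signals recovery only after `threshold` consecutive errors."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self._threshold = threshold
        self._consecutive_errors = 0

    def record(self, ok: bool) -> bool:
        """Record one observation; return True when recovery should start."""
        if ok:
            # A success shows the earlier errors were transient: forget them.
            self._consecutive_errors = 0
            return False
        self._consecutive_errors += 1
        return self._consecutive_errors >= self._threshold


checks = TransientFilter(threshold=3)
observations = [False, True, False, False, False]
print([checks.record(ok) for ok in observations])
```

The single error followed by a success is ignored; only the run of three errors at the end asks for recovery.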

38 And more! Complete Parameter Checking, Watchdogs, Voting, Checksums, Routine Audits

39 Recovery and Mitigation of Errors

40 Timeout so as not to wait forever and keep holding up the resource.
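A minimal timeout sketch using Python's standard `concurrent.futures`. The helper name and the fallback parameter are assumptions for illustration. Note that this frees the caller only: the slow task keeps running on its worker thread until it finishes on its own.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(fn, timeout_s: float, fallback):
    """Run fn on a worker thread; return fallback if it takes longer than timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block the caller while the worker thread winds down.
        pool.shutdown(wait=False)


def slow_lookup():
    time.sleep(0.5)  # stands in for a remote call that is too slow
    return "fresh value"


print(call_with_timeout(lambda: "fast value", timeout_s=1.0, fallback="cached value"))
print(call_with_timeout(slow_lookup, timeout_s=0.05, fallback="cached value"))
```

In a reactive application the same idea is usually expressed with the timeout operator of the library in use rather than with a hand-rolled helper.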

41 Failover to a redundant unit when the error has been detected and isolated. Redundancy Reminder (chart of Cost against Time To Recover: Active/Active, Active/Standby, N+M)

42 Intelligent Retries: Fixed, Linear, Exponential (chart of Time between Retries against Number of Attempts)
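The three retry strategies differ only in how the delay grows with the attempt number. The sketch below assumes the common definitions (constant, base times attempt, base doubling each attempt); the function names and the `retry` helper are illustrative, and production code would typically also cap the delay and add jitter.

```python
import time
from typing import Callable


def fixed(base_s: float, attempt: int) -> float:
    return base_s


def linear(base_s: float, attempt: int) -> float:
    return base_s * attempt


def exponential(base_s: float, attempt: int) -> float:
    return base_s * 2 ** (attempt - 1)


def retry(op: Callable, delay_fn: Callable[[float, int], float], base_s: float,
          max_attempts: int, sleep: Callable[[float], None] = time.sleep):
    """Call op until it succeeds or max_attempts is reached, sleeping in between."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the last error
            sleep(delay_fn(base_s, attempt))


for strategy in (fixed, linear, exponential):
    print(strategy.__name__, [strategy(1.0, n) for n in range(1, 5)])
```

Exponential delays back off quickly, which gives a struggling service room to recover instead of being hit by a steady stream of retries.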

43 Restart can be used as a last resort, at the cost of losing state and time.

44 Fail Fast to shed load and give a partial but great service rather than a complete but bad one. (Diagram label: Boundary)
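Failing fast at a boundary can be sketched with a non-blocking semaphore: when all slots are taken, new work is rejected immediately instead of queueing. The class name, exception and limit are hypothetical.

```python
import threading
from typing import Callable


class Rejected(Exception):
    """Raised when the gate is at capacity and the call is shed."""


class FailFastGate:
    """Admits at most `max_in_flight` concurrent calls and rejects the rest at once."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, fn: Callable):
        # Non-blocking acquire: never wait for a slot, fail fast instead.
        if not self._slots.acquire(blocking=False):
            raise Rejected("at capacity, shedding load")
        try:
            return fn()
        finally:
            self._slots.release()


gate = FailFastGate(max_in_flight=1)


def handler_that_calls_again():
    # While this call holds the only slot, a second call is shed.
    try:
        return gate.call(lambda: "admitted")
    except Rejected:
        return "rejected"


print(gate.call(lambda: "admitted"))
print(gate.call(handler_that_calls_again))
```

Callers that are rejected get an answer immediately and can fall back or retry later, while admitted calls keep their normal latency.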

45 Backpressure & Batching!

46 Case Study: Hystrix

47 And more!
   Recovery: Rollback, Roll-Forward, Checkpoints, Data Reset
   Mitigation: Bounded Queuing, Expansive Controls, Marking Data, Error Correcting Codes

49 Recommended Reading

50 Patterns for Fault-Tolerant Software by Robert S. Hanmer

51 Release It! by Michael T. Nygard

52 Any Questions?

53 Thank you!

Dependable Systems. Fault Tolerance Patterns (II) Dr. Peter Tröger. Source: Hanmer, Robert S.: Patterns for Fault Tolerant Software. Wiley, 2007.

Dependable Systems. Fault Tolerance Patterns (II) Dr. Peter Tröger. Source: Hanmer, Robert S.: Patterns for Fault Tolerant Software. Wiley, 2007. Dependable Systems Fault Tolerance Patterns (II) Dr. Peter Tröger Source: Hanmer, Robert S.: Patterns for Fault Tolerant Software. Wiley, 2007. Error Recovery Patterns Quarantine / Concentrated Recovery

More information

EE382C Lecture 14. Reliability and Error Control 5/17/11. EE 382C - S11 - Lecture 14 1

EE382C Lecture 14. Reliability and Error Control 5/17/11. EE 382C - S11 - Lecture 14 1 EE382C Lecture 14 Reliability and Error Control 5/17/11 EE 382C - S11 - Lecture 14 1 Announcements Don t forget to iterate with us for your checkpoint 1 report Send time slot preferences for checkpoint

More information

Service Recovery & Availability. Robert Dickerson June 2010

Service Recovery & Availability. Robert Dickerson June 2010 Service Recovery & Availability Robert Dickerson June 2010 Started in 1971 with $3,000, 40 clients and 1 employee. 2009: over $2B revenue, 500,000+ clients, 13,000 employees. Payroll / Tax Services / 401(k)

More information

A SKY Computers White Paper

A SKY Computers White Paper A SKY Computers White Paper High Application Availability By: Steve Paavola, SKY Computers, Inc. 100000.000 10000.000 1000.000 100.000 10.000 1.000 99.0000% 99.9000% 99.9900% 99.9990% 99.9999% 0.100 0.010

More information

High Availability and Redundant Operation

High Availability and Redundant Operation This chapter describes the high availability and redundancy features of the Cisco ASR 9000 Series Routers. Features Overview, page 1 High Availability Router Operations, page 1 Power Supply Redundancy,

More information

Appendix D: Storage Systems (Cont)

Appendix D: Storage Systems (Cont) Appendix D: Storage Systems (Cont) Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Reliability, Availability, Dependability Dependability: deliver service such that

More information

Anomaly Detection Fault Tolerance Anticipation

Anomaly Detection Fault Tolerance Anticipation Anomaly Detection Fault Tolerance Anticipation Patterns John Allspaw SVP, Tech Ops Qcon London 2012 Four Cornerstones Erik Hollnagel (Anticipation) (Response) Knowing Knowing Knowing Knowing What What

More information

Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues

Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues By Narjess Ayari, Denis Barbaron, Laurent Lefevre and Pascale primet Presented by Mingyu Liu Outlines 1.Introduction

More information

Dependable Systems. Fault Tolerance Patterns. Dr. Peter Tröger. Source: Hanmer, Robert S.: Patterns for Fault Tolerant Software. Wiley, 2007.

Dependable Systems. Fault Tolerance Patterns. Dr. Peter Tröger. Source: Hanmer, Robert S.: Patterns for Fault Tolerant Software. Wiley, 2007. Dependable Systems Fault Tolerance Patterns Dr. Peter Tröger Source: Hanmer, Robert S.: Patterns for Fault Tolerant Software. Wiley, 2007. Phases of Fault Tolerance (Hanmer) Error Detection Error Recovery

More information

Dep. Systems Requirements

Dep. Systems Requirements Dependable Systems Dep. Systems Requirements Availability the system is ready to be used immediately. A(t) = probability system is available for use at time t MTTF/(MTTF+MTTR) If MTTR can be kept small

More information

CIT 668: System Architecture

CIT 668: System Architecture CIT 668: System Architecture Availability Topics 1. What is availability? 2. Measuring Availability 3. Failover 4. Failover Configurations 5. Linux HA Availability Availability is the ratio of the time

More information

Transparent TCP Recovery

Transparent TCP Recovery Transparent Recovery with Chain Replication Robert Burgess Ken Birman Robert Broberg Rick Payne Robbert van Renesse October 26, 2009 Motivation Us: Motivation Them: Client Motivation There is a connection...

More information

Fault Tolerance. Distributed Systems. September 2002

Fault Tolerance. Distributed Systems. September 2002 Fault Tolerance Distributed Systems September 2002 Basics A component provides services to clients. To provide services, the component may require the services from other components a component may depend

More information

Lecture 22: Fault Tolerance

Lecture 22: Fault Tolerance Lecture 22: Fault Tolerance Papers: Token Coherence: Decoupling Performance and Correctness, ISCA 03, Wisconsin A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures, HPCA 07, Spain Error

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 18 Chapter 7 Case Studies Part.18.1 Introduction Illustrate practical use of methods described previously Highlight fault-tolerance

More information

Best Practices for Scaling Websites Lessons from ebay

Best Practices for Scaling Websites Lessons from ebay Best Practices for Scaling Websites Lessons from ebay Randy Shoup ebay Distinguished Architect QCon Asia 2009 Challenges at Internet Scale ebay manages 86.3 million active users worldwide 120 million items

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

How Does Failover Affect Your SLA? How Does Failover Affect Your SLA?

How Does Failover Affect Your SLA? How Does Failover Affect Your SLA? How Does Failover Affect Your SLA? How Does Failover Affect Your SLA? Dr. Bill Highleyman Dr. Managing Bill Highleyman Editor, Availability Digest Managing HP NonStop Editor, Technical Availability Boot

More information

CSE 5306 Distributed Systems. Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance CSE 5306 Distributed Systems Fault Tolerance 1 Failure in Distributed Systems Partial failure happens when one component of a distributed system fails often leaves other components unaffected A failure

More information

Fault-Tolerant Embedded System

Fault-Tolerant Embedded System Fault-Tolerant Embedded System EE8205: Embedded Computer Systems http://www.ee.ryerson.ca/~courses/ee8205/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University

More information

Fault-tolerant techniques

Fault-tolerant techniques What are the effects if the hardware or software is not fault-free in a real-time system? What causes component faults? Specification or design faults: Incomplete or erroneous models Lack of techniques

More information

Symantec ST Symantec Enterprise Vault 10.0 for(r) Exchange Technical Assessment.

Symantec ST Symantec Enterprise Vault 10.0 for(r) Exchange Technical Assessment. Symantec ST0-118 Symantec Enterprise Vault 10.0 for(r) Exchange Technical Assessment http://killexams.com/exam-detail/st0-118 QUESTION: 305 A visiting consultant notices that an organization's tape-based

More information

416 Distributed Systems. Errors and Failures, part 2 Feb 3, 2016

416 Distributed Systems. Errors and Failures, part 2 Feb 3, 2016 416 Distributed Systems Errors and Failures, part 2 Feb 3, 2016 Options in dealing with failure 1. Silently return the wrong answer. 2. Detect failure. 3. Correct / mask the failure 2 Block error detection/correction

More information

Fault-Tolerant Embedded System

Fault-Tolerant Embedded System Fault-Tolerant Embedded System COE718: Embedded Systems Design http://www.ee.ryerson.ca/~courses/coe718/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University

More information

SIP System Features. SIP Timer Values. Rules for Configuring the SIP Timers CHAPTER

SIP System Features. SIP Timer Values. Rules for Configuring the SIP Timers CHAPTER CHAPTER 4 Revised: March 24, 2011, This chapter describes features that apply to all SIP system operations. It includes the following topics: SIP Timer Values, page 4-1 SIP Session Timers, page 4-7 Limitations

More information

SIP System Features. SIP Timer Values. Rules for Configuring the SIP Timers CHAPTER

SIP System Features. SIP Timer Values. Rules for Configuring the SIP Timers CHAPTER CHAPTER 4 Revised: October 30, 2012, This chapter describes features that apply to all SIP system operations. It includes the following topics: SIP Timer Values, page 4-1 Limitations on Number of URLs,

More information

Middleware and Distributed Systems. Fault Tolerance. Peter Tröger

Middleware and Distributed Systems. Fault Tolerance. Peter Tröger Middleware and Distributed Systems Fault Tolerance Peter Tröger Fault Tolerance Another cross-cutting concern in middleware systems Fault Tolerance Middleware and Distributed Systems 2 Fault - Error -

More information

416 Distributed Systems. Errors and Failures Oct 16, 2018

416 Distributed Systems. Errors and Failures Oct 16, 2018 416 Distributed Systems Errors and Failures Oct 16, 2018 Types of Errors Hard errors: The component is dead. Soft errors: A signal or bit is wrong, but it doesn t mean the component must be faulty Note:

More information

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit. Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 5 Processor-Level Techniques & Byzantine Failures Chapter 2 Hardware Fault Tolerance Part.5.1 Processor-Level Techniques

More information

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance Distributed Systems Principles and Paradigms Maarten van Steen VU Amsterdam, Dept. Computer Science Room R4.20, steen@cs.vu.nl Chapter 08: Fault Tolerance Version: December 2, 2010 2 / 65 Contents Chapter

More information

Introduction to the Service Availability Forum

Introduction to the Service Availability Forum . Introduction to the Service Availability Forum Contents Introduction Quick AIS Specification overview AIS Dependability services AIS Communication services Programming model DEMO Design of dependable

More information

High Availability Client Login Profiles

High Availability Client Login Profiles High Availability Login Profiles, page 1 500 (1vCPU 700MHz 2GB) Active/Active Profile, page 3 500 (1vCPU 700MHz 2GB) Active/Standby Profile, page 4 0 (1vCPU 1500MHz 2GB) Active/Active Profile, page 4 0

More information

Fault tolerance and Reliability

Fault tolerance and Reliability Fault tolerance and Reliability Reliability measures Fault tolerance in a switching system Modeling of fault tolerance and reliability Rka -k2002 Telecommunication Switching Technology 14-1 Summary of

More information

Overview. CPS Architecture Overview. Operations, Administration and Management (OAM) CPS Architecture Overview, page 1 Geographic Redundancy, page 5

Overview. CPS Architecture Overview. Operations, Administration and Management (OAM) CPS Architecture Overview, page 1 Geographic Redundancy, page 5 CPS Architecture, page 1 Geographic Redundancy, page 5 CPS Architecture The Cisco Policy Suite (CPS) solution utilizes a three-tier virtual architecture for scalability, system resilience, and robustness

More information

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d) Distributed Systems Fö 9/10-1 Distributed Systems Fö 9/10-2 FAULT TOLERANCE 1. Fault Tolerant Systems 2. Faults and Fault Models. Redundancy 4. Time Redundancy and Backward Recovery. Hardware Redundancy

More information

Causes of Software Failures

Causes of Software Failures Causes of Software Failures Hardware Faults Permanent faults, e.g., wear-and-tear component Transient faults, e.g., bit flips due to radiation Software Faults (Bugs) (40% failures) Nondeterministic bugs,

More information

Safety and Reliability Engineering Part 5: Redundancy / Software Reliability

Safety and Reliability Engineering Part 5: Redundancy / Software Reliability Part 5: Redundancy / Software Reliability Prof. Dr.-Ing. Stefan Kowalewski Chair Informatik XI, Embedded Software Laboratory RWTH Aachen University Summer term 2007 Reminder: Redundancy Architectural principle

More information

Dependability tree 1

Dependability tree 1 Dependability tree 1 Means for achieving dependability A combined use of methods can be applied as means for achieving dependability. These means can be classified into: 1. Fault Prevention techniques

More information

MQ High Availability and Disaster Recovery Implementation scenarios

MQ High Availability and Disaster Recovery Implementation scenarios MQ High Availability and Disaster Recovery Implementation scenarios Sandeep Chellingi Head of Hybrid Cloud Integration Prolifics Agenda MQ Availability Message Availability Service Availability HA vs DR

More information

Ultra Low-Cost Defect Protection for Microprocessor Pipelines

Ultra Low-Cost Defect Protection for Microprocessor Pipelines Ultra Low-Cost Defect Protection for Microprocessor Pipelines Smitha Shyam Kypros Constantinides Sujay Phadke Valeria Bertacco Todd Austin Advanced Computer Architecture Lab University of Michigan Key

More information

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013 Distributed Systems 19. Fault Tolerance Paul Krzyzanowski Rutgers University Fall 2013 November 27, 2013 2013 Paul Krzyzanowski 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware

More information

Fault Tolerance. Distributed Systems IT332

Fault Tolerance. Distributed Systems IT332 Fault Tolerance Distributed Systems IT332 2 Outline Introduction to fault tolerance Reliable Client Server Communication Distributed commit Failure recovery 3 Failures, Due to What? A system is said to

More information

416 Distributed Systems. Errors and Failures Feb 1, 2016

416 Distributed Systems. Errors and Failures Feb 1, 2016 416 Distributed Systems Errors and Failures Feb 1, 2016 Types of Errors Hard errors: The component is dead. Soft errors: A signal or bit is wrong, but it doesn t mean the component must be faulty Note:

More information

Module 8 - Fault Tolerance

Module 8 - Fault Tolerance Module 8 - Fault Tolerance Dependability Reliability A measure of success with which a system conforms to some authoritative specification of its behavior. Probability that the system has not experienced

More information

Today: Fault Tolerance. Replica Management

Today: Fault Tolerance. Replica Management Today: Fault Tolerance Failure models Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Failure recovery

More information

SIP System Features. Differentiated Services Codepoint CHAPTER

SIP System Features. Differentiated Services Codepoint CHAPTER CHAPTER 6 Revised: December 30 2007, This chapter describes features that apply to all SIP system operations. It includes the following topics: Differentiated Services Codepoint section on page 6-1 Limitations

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 17 - Checkpointing II Chapter 6 - Checkpointing Part.17.1 Coordinated Checkpointing Uncoordinated checkpointing may lead

More information

6.033 Lecture Fault Tolerant Computing 3/31/2014

6.033 Lecture Fault Tolerant Computing 3/31/2014 6.033 Lecture 14 -- Fault Tolerant Computing 3/31/2014 So far what have we seen: Modularity RPC Processes Client / server Networking Implements client/server Seen a few examples of dealing with faults

More information

RELIABILITY and RELIABLE DESIGN. Giovanni De Micheli Centre Systèmes Intégrés

RELIABILITY and RELIABLE DESIGN. Giovanni De Micheli Centre Systèmes Intégrés RELIABILITY and RELIABLE DESIGN Giovanni Centre Systèmes Intégrés Outline Introduction to reliable design Design for reliability Component redundancy Communication redundancy Data encoding and error correction

More information

Fault Tolerance. o Basic Concepts o Process Resilience o Reliable Client-Server Communication o Reliable Group Communication. o Distributed Commit

Fault Tolerance. o Basic Concepts o Process Resilience o Reliable Client-Server Communication o Reliable Group Communication. o Distributed Commit Fault Tolerance o Basic Concepts o Process Resilience o Reliable Client-Server Communication o Reliable Group Communication o Distributed Commit -1 Distributed Commit o A more general problem of atomic

More information

Index. Peter A. Carter 2016 P.A. Carter, SQL Server AlwaysOn Revealed, DOI /

Index. Peter A. Carter 2016 P.A. Carter, SQL Server AlwaysOn Revealed, DOI / Index A Active node, 10 Advanced Encryption Standard (AES), 95 AlwaysOn administration Availability Group (see AlwaysOn Availability Groups) cluster maintenance, 149 Cluster Node Configuration page, 153

More information

Application Resilience Engineering and Operations at Netflix. Ben Software Engineer on API Platform at Netflix

Application Resilience Engineering and Operations at Netflix. Ben Software Engineer on API Platform at Netflix Application Resilience Engineering and Operations at Netflix Ben Christensen @benjchristensen Software Engineer on API Platform at Netflix Global deployment spread across data centers in multiple AWS regions.

More information

Dependability. IC Life Cycle

Dependability. IC Life Cycle Dependability Alberto Bosio, Associate Professor UM Microelectronic Departement bosio@lirmm.fr IC Life Cycle User s Requirements Design Re-Cycling In-field Operation Production 2 1 IC Life Cycle User s

More information

Issues in Programming Language Design for Embedded RT Systems

Issues in Programming Language Design for Embedded RT Systems CSE 237B Fall 2009 Issues in Programming Language Design for Embedded RT Systems Reliability and Fault Tolerance Exceptions and Exception Handling Rajesh Gupta University of California, San Diego ES Characteristics

More information

TWO-PHASE COMMIT ATTRIBUTION 5/11/2018. George Porter May 9 and 11, 2018

TWO-PHASE COMMIT ATTRIBUTION 5/11/2018. George Porter May 9 and 11, 2018 TWO-PHASE COMMIT George Porter May 9 and 11, 2018 ATTRIBUTION These slides are released under an Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) Creative Commons license These slides

More information

Fault Tolerance. The Three universe model

Fault Tolerance. The Three universe model Fault Tolerance High performance systems must be fault-tolerant: they must be able to continue operating despite the failure of a limited subset of their hardware or software. They must also allow graceful

More information

Developing Resilient Apps on SAP Cloud Platform

Developing Resilient Apps on SAP Cloud Platform PUBLIC 2018-09-05 2018 SAP SE or an SAP affiliate company. All rights reserved. THE BEST RUN Content 1 About this Guide.... 4 2 What is Resilient Software Design.... 5 2.1 Availability....5 2.2 Failure

More information

Detection Techniques for Fault Tolerance

Detection Techniques for Fault Tolerance Detection Techniques for Fault Tolerance Robert S. Hanmer Lucent Technologies 2000 Lucent Lane 2H-207 Naperville, IL 60566-7033 hanmer@lucent.com +1 630 979 4786 Abstract Errors must be detected before

More information

CS October 2017

CS October 2017 Atomic Transactions Transaction An operation composed of a number of discrete steps. Distributed Systems 11. Distributed Commit Protocols All the steps must be completed for the transaction to be committed.

More information

Dependability and ECC

Dependability and ECC ecture 38 Computer Science 61C Spring 2017 April 24th, 2017 Dependability and ECC 1 Great Idea #6: Dependability via Redundancy Applies to everything from data centers to memory Redundant data centers

More information

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu } Introduction } Architecture } File

More information

Availability, Reliability, and Fault Tolerance

Availability, Reliability, and Fault Tolerance Availability, Reliability, and Fault Tolerance Guest Lecture for Software Systems Security Tim Wood Professor Tim Wood - The George Washington University Distributed Systems have Problems Hardware breaks

More information

Aerospace Software Engineering

Aerospace Software Engineering 16.35 Aerospace Software Engineering Reliability, Availability, and Maintainability Software Fault Tolerance Prof. Kristina Lundqvist Dept. of Aero/Astro, MIT Definitions Software reliability The probability

More information

Copyright 1998, Cisco Systems, Inc. All rights reserved. Printed in USA. Presentation_ID.scr 1. IPS _05_2001_c1

Copyright 1998, Cisco Systems, Inc. All rights reserved. Printed in USA. Presentation_ID.scr 1. IPS _05_2001_c1 2001, Cisco Systems, Inc. All rights reserved. 1 Presentation_ID.scr 1 Introduction to High Availability Networking Session 2001, Cisco Systems, Inc. All rights reserved. 3 Agenda Introduction Building

More information

Protecting remote site data SvSAN clustering - failure scenarios

Protecting remote site data SvSAN clustering - failure scenarios White paper Protecting remote site data SvSN clustering - failure scenarios Service availability and data integrity are key metrics for enterprises that run business critical applications at multiple remote

More information

HA Use Cases. 1 Introduction. 2 Basic Use Cases

HA Use Cases. 1 Introduction. 2 Basic Use Cases HA Use Cases 1 Introduction This use case document outlines the model and failure modes for NFV systems. Its goal is along with the requirements documents and gap analysis help set context for engagement

More information

Distributed Systems Exam 3 Review. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems Exam 3 Review. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 2015 Exam 3 Review Paul Krzyzanowski Rutgers University Fall 2016 2013-2016 Paul Krzyzanowski 1 2015 Question 1 What information does each node need to store for a three-dimensional

More information

RELIABILITY & AVAILABILITY IN THE CLOUD

RELIABILITY & AVAILABILITY IN THE CLOUD RELIABILITY & AVAILABILITY IN THE CLOUD A TWILIO PERSPECTIVE twilio.com To the leaders and engineers at Twilio, the cloud represents the promise of reliable, scalable infrastructure at a price that directly

More information

CSC2231: Making clusters fault-tolerant

CSC2231: Making clusters fault-tolerant CSC2231: Making clusters fault-tolerant http://www.cs.toronto.edu/~stefan/courses/csc2231/05au Stefan Saroiu Department of Computer Science University of Toronto Administrivia Project proposals due in

More information

Upgrading From a Successful Emergency Control System to a Complete WAMPAC System for Georgian State Energy System

Upgrading From a Successful Emergency Control System to a Complete WAMPAC System for Georgian State Energy System Upgrading From a Successful Emergency Control System to a Complete WAMPAC System for Georgian State Energy System Dave Dolezilek International Technical Director Schweitzer Engineering Laboratories SEL

More information

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692)

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) OUTLINE Flat datacenter storage Deterministic data placement in fds Metadata properties of fds Per-blob metadata in fds Dynamic Work Allocation in fds Replication

More information

ebay Marketplace Architecture

ebay Marketplace Architecture ebay Marketplace Architecture Architectural Strategies, Patterns, and Forces Randy Shoup, ebay Distinguished Architect QCon SF 2007 November 9, 2007 What we re up against ebay manages Over 248,000,000

More information

Deployment Guide for SRX Series Services Gateways in Chassis Cluster Configuration

Deployment Guide for SRX Series Services Gateways in Chassis Cluster Configuration Deployment Guide for SRX Series Services Gateways in Chassis Cluster Configuration Version 1.2 June 2013 Juniper Networks, 2013 Contents Introduction... 3 Chassis Cluster Concepts... 4 Scenarios for Chassis

More information

Stability Patterns and Antipatterns

Stability Patterns and Antipatterns Stability Patterns and Antipatterns Michael Nygard mtnygard@thinkrelevance.com @mtnygard Michael Nygard, 2007-2012 1 Stability Antipatterns 2 Integration Points Integrations are the #1 risk to stability.

More information

Today: Fault Tolerance. Fault Tolerance

Today: Fault Tolerance. Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

Chapter 8 Fault Tolerance

Chapter 8 Fault Tolerance DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance 1 Fault Tolerance Basic Concepts Being fault tolerant is strongly related to

More information

ECE Engineering Robust Server Software. Spring 2018

ECE Engineering Robust Server Software. Spring 2018 ECE590-02 Engineering Robust Server Software Spring 2018 Business Continuity: High Availability Tyler Bletsch Duke University Includes material adapted from the course Information Storage and Management

More information

CprE 458/558: Real-Time Systems. Lecture 17 Fault-tolerant design techniques

CprE 458/558: Real-Time Systems. Lecture 17 Fault-tolerant design techniques : Real-Time Systems Lecture 17 Fault-tolerant design techniques Fault Tolerant Strategies Fault tolerance in computer system is achieved through redundancy in hardware, software, information, and/or computations.

More information

Introduction to Software Fault Tolerance Techniques and Implementation. Presented By : Hoda Banki

Introduction to Software Fault Tolerance Techniques and Implementation. Presented By : Hoda Banki Introduction to Software Fault Tolerance Techniques and Implementation Presented By : Hoda Banki 1 Contents : Introduction Types of faults Dependability concept classification Error recovery Types of redundancy

More information

Why Things Break -- With Examples From Autonomous Vehicles. Institute for Complex Engineered Systems

Phil Koopman Department of Electrical & Computer Engineering & Institute for Complex Engineered Systems (based, in part, on material from Dan

Last Class Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications

Basic Timestamp Ordering Optimistic Concurrency Control Multi-Version Concurrency Control C. Faloutsos A. Pavlo Lecture#23:

Distributed Systems Principles and Paradigms

Chapter 07 (version 16th May 2006) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20. Tel:

Software reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment.

SOFTWARE ENGINEERING SOFTWARE RELIABILITY Software reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment. LEARNING OBJECTIVES

Dependability and real-time. TDDD07 Real-time Systems. Where to start? Two lectures. June 16, Lecture 8

TDDD07 Real-time Systems Lecture 7 Dependability & Fault tolerance Simin Nadjm-Tehrani Real-time Systems Laboratory Department of Computer and Information Science Dependability and real-time If a system

Engineering Fault-Tolerant TCP/IP servers using FT-TCP. Dmitrii Zagorodnov University of California San Diego

Motivation Reliable network services are desirable but costly! Extra and/or specialized hardware

Maximum Availability Architecture: Overview. An Oracle White Paper July 2002

Abstract, Introduction, Architecture Overview, Application Tier, Network

F5 in AWS Part 3 Advanced Topologies and More on Highly Available Services

ChrisMutzel, 2015-17-08 Thus far in our article series about running BIG-IP in EC2, we've talked about some VPC/EC2 routing and

Highly Available Networks

Pamela Williams Dickerman Advanced Technology Consultant Michael Hayward Hewlett-Packard Company Copyright 1996 Hewlett-Packard Co., Inc. Table of Contents Abstract Single Points

Today: Fault Tolerance. Failure Masking by Redundancy

Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Failure recovery Checkpointing

ForeScout CounterACT Resiliency Solutions

User Guide CounterACT Version 7.0.0 Table of Contents About CounterACT Resiliency Solutions Comparison of Resiliency

Module 8 Fault Tolerance CS655

Module 8 - Fault Tolerance CS655 Dependability Reliability: A measure of success with which a system conforms to some authoritative specification of its behavior.

INSTRUCTION MANUAL BreakingSecurity.net. Revision Remcos v2.3.0

Revision 14 -- Remcos v2.3.0 2019 BreakingSecurity.net TABLE OF CONTENTS CHAPTER 1: INTRODUCTION TO REMCOS, USAGE CASES, COMPATIBILITY & DEVELOPMENT, STRUCTURE, CHAPTER

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju

Chapter Outline: Fault tolerance; Process resilience; Reliable group communication; Distributed commit; Recovery. Basic

Part 2: Basic concepts and terminology

Course: Dependable Computer Systems 2012, Stefan Poledna, All rights reserved. Def.: Dependability (Verlässlichkeit) is defined as the trustworthiness

eBay's Architectural Principles

Architectural Strategies, Patterns, and Forces for Scaling a Large ecommerce Site Randy Shoup eBay Distinguished Architect QCon London 2008 March 14, 2008 What we're up

White Paper. Dell Reference Configuration

Deploying Oracle Database 10g R2 Standard Edition Real Application Clusters with Red Hat Enterprise Linux 4 Advanced Server x86_64 on Dell PowerEdge Servers and

High Availability Using Fault Tolerance in the SAN. Mark S Fleming, IBM

SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA. Member companies and individuals may use this

Fault Tolerance. Basic Concepts

COP 6611 Advanced Operating System Fault Tolerance Chi Zhang czhang@cs.fiu.edu Dependability Includes Availability Run time / total time Basic Concepts Reliability The length of uninterrupted run time

Patterns of Resilience How to build robust, scalable & responsive systems

Uwe Friedrichsen (codecentric AG) GOTO Night Amsterdam 18. May 2015 @ufried Uwe Friedrichsen uwe.friedrichsen@codecentric.de http://slideshare.net/ufried