Distributed Systems 24. Fault Tolerance


Paul Krzyzanowski
pxk@cs.rutgers.edu

Faults
Deviation from expected behavior, due to a variety of factors:
- Hardware failure
- Software bugs
- Operator errors
- Network errors/outages

Faults
Three categories:
- Transient faults
- Intermittent faults
- Permanent faults
Processor / storage faults:
- Fail-silent (fail-stop): the component stops functioning
- Byzantine: the component produces faulty results
Network faults:
- Data corruption (Byzantine)
- Link failure (fail-silent)
- One-way link failure
- Network partition: the connection between two parts of a network fails

Synchronous vs. asynchronous systems
- Synchronous system: a known upper bound on the time for data transmission (e.g., serial port transmission)
- Asynchronous system: no such bound (e.g., IP packet delivery)
Why is this important? With a known bound we can distinguish a slow network (or processor) from a stopped one.

Fault Tolerance
- Fault avoidance: design a system with minimal faults
- Fault removal: validate/test a system to remove the presence of faults
- Fault tolerance: deal with faults!

Achieving fault tolerance: redundancy
- Information redundancy: Hamming codes, parity memory, ECC memory
- Time redundancy: timeout & retransmit
- Physical redundancy/replication: TMR, RAID disks, backup servers
Replication vs. redundancy:
- Replication: multiple identical units functioning concurrently; vote on the outcome
- Redundancy: one unit functioning at a time; others standing by

Availability: how much fault tolerance?
- 100% fault tolerance cannot be achieved; the closer we wish to get to 100%, the more expensive the system will be
- Availability: the percentage of time that the system is functioning, typically expressed as a number of 9s
- Downtime includes all time when the system is unavailable

Availability
Class                                    Availability  Annual downtime
Continuous                               100%          0
Six nines (carrier-class switches)       99.9999%      30 seconds
Fault tolerant (carrier-class servers)   99.999%       5 minutes
Fault resilient                          99.99%        53 minutes
High availability                        99.9%         8.3 hours
Normal availability                      99-99.5%      44-87 hours
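
The annual-downtime column follows directly from the availability percentage; a quick sketch (assuming a 365-day year, and a hypothetical helper name):

```python
def annual_downtime_seconds(availability_pct: float) -> float:
    """Seconds of downtime per year implied by an availability percentage."""
    seconds_per_year = 365 * 24 * 3600  # ignoring leap years
    return (1 - availability_pct / 100) * seconds_per_year

# "five nines" allows roughly five minutes of downtime per year
print(annual_downtime_seconds(99.999) / 60)  # ~5.26 minutes
```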

Points of failure
Goal: avoid single points of failure.
A system is k-fault tolerant if it can withstand k faults.
- Need k+1 components to tolerate k silent (fail-stop) faults: k can fail and one will still be working
- Need 2k+1 components to tolerate k Byzantine faults: k can generate false replies, but the remaining k+1 will provide a majority vote
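
The two replica counts above can be captured in a one-line helper (the function name is an assumption):

```python
def replicas_needed(k: int, byzantine: bool = False) -> int:
    """Minimum number of components needed to tolerate k faults."""
    # fail-silent: one working survivor is enough        -> k + 1
    # Byzantine: need a majority of correct replies      -> 2k + 1
    return 2 * k + 1 if byzantine else k + 1

print(replicas_needed(1))                  # 2
print(replicas_needed(1, byzantine=True))  # 3
```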

Active replication
A technique for fault tolerance through physical redundancy.
Triple Modular Redundancy (TMR): threefold component replication to detect and correct a single component failure.
[Figure: a pipeline of components A, B, C, shown first without redundancy, then with each component triplicated]
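
The voting step that masks a single faulty replica in TMR can be sketched as a majority vote over the replica outputs (a minimal illustration; the function name is an assumption):

```python
from collections import Counter

def tmr_vote(outputs):
    """Majority vote over replica outputs; masks a single faulty replica in TMR."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: too many faulty replicas")
    return value

print(tmr_vote([42, 42, 41]))  # 42 — the single faulty output is outvoted
```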

Active replication: Replicated State Machines
Use a distributed consensus algorithm to agree on the order of updates across all replicas. Example: GFS replication.
[Figure: three Paxos instances, one per replica, each applying the same agreed sequence of updates A, B, C]

Active-Active vs. Active-Passive
Active-Active:
- Any server can handle requests; every request is a global state update
- Usually requires total ordering for updates: Paxos, a distributed lock manager, or eventual/immediate consistency (Brewer's CAP theorem impacts us)
Active-Passive (primary-backup):
- One server does all the work; when it fails, the backup takes over
- The backup may ping the primary with "are you alive" messages
- Simpler design
- Issues: watch out for Byzantine faults; recovery may be time-consuming and/or complex
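
The "are you alive" pinging in the active-passive design can be sketched as a heartbeat with a timeout (a minimal single-process sketch; the class name and timeout value are assumptions):

```python
import time

class BackupMonitor:
    """Active-passive failover sketch: the backup records each heartbeat
    from the primary and promotes itself when the last one is older than
    `timeout` seconds. Note that in an asynchronous system a timeout
    cannot distinguish a crashed primary from a slow one."""

    def __init__(self, timeout=2.0):
        self.timeout = timeout
        self.last_ping = time.monotonic()
        self.role = "backup"

    def on_ping(self):
        # called whenever the primary's "I am alive" message arrives
        self.last_ping = time.monotonic()

    def check(self):
        if self.role == "backup" and time.monotonic() - self.last_ping > self.timeout:
            self.role = "primary"  # take over from the (presumed) failed primary
        return self.role

monitor = BackupMonitor(timeout=0.05)
print(monitor.check())  # "backup" — the primary is presumed alive
```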

Examples of Fault Tolerance

Example: ECC memory
- Memory chips designed with Hamming-code logic
- Most implementations correct single-bit errors in a memory location and detect multiple-bit errors
- An example of information redundancy
Why is this not physical redundancy? The extra circuitry is not n-way replication of existing components.
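
As an illustration of how Hamming-code logic corrects a single flipped bit, here is a toy Hamming(7,4) encoder/decoder: 4 data bits protected by 3 parity bits. This is a sketch only; real ECC DIMMs use wider SECDED codes over 64-bit words.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    # codeword positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_decode(c):
    """Correct up to one flipped bit, then return the 4 data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1          # correct the single-bit error
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[4] ^= 1                           # a bit flips in "memory"
print(hamming74_decode(code) == word)  # True: the error was corrected
```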

Example: failover via DNS SRV records
Goal: allow multiple machines (with unique IP addresses, possibly in different locations) to be represented by one hostname.
- Instead of using DNS to resolve a hostname to one IP address, use DNS to look up SRV records for that name
- Each record has a priority, a weight, and a server name
- Use the priority to pick one of several servers; use the weight to choose among servers of the same priority (for load balancing)
- Then, once you have picked a server, use DNS to look up its address
- Commonly used in Voice-over-IP systems to pick a SIP server/proxy
- MX records (mail servers) take the same approach: use DNS to find several mail servers and pick one that works
An example of physical redundancy.
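
The priority-then-weight selection can be sketched as follows (the records and hostnames are hypothetical; the weighted choice is in the spirit of RFC 2782's algorithm, not a verbatim implementation):

```python
import random

# Hypothetical SRV records: (priority, weight, target).
# Lower priority wins; weight biases the choice among equal-priority records.
records = [
    (10, 60, "sip1.example.com"),
    (10, 40, "sip2.example.com"),
    (20, 0,  "backup.example.com"),  # only used if the priority-10 pool is exhausted
]

def pick_server(records):
    """Pick a target: best (lowest) priority, weighted-random within it."""
    best = min(prio for prio, _, _ in records)
    pool = [r for r in records if r[0] == best]
    targets = [target for _, _, target in pool]
    weights = [weight for _, weight, _ in pool]
    return random.choices(targets, weights=weights, k=1)[0]

print(pick_server(records))  # sip1 ~60% of the time, sip2 ~40%
```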

Example: DNS with device monitoring
A custom DNS server returns the IP address of an available machine by monitoring the liveness of a set of equivalent machines. This is the Akamai approach (Akamai uses more criteria than this).

Example: TCP retransmission
- The sender requires an ack from the receiver for each packet
- If the ack is not received within a certain amount of time, the sender retransmits the packet
An example of time redundancy.
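
Time redundancy in miniature — timeout and retry (the send function here is a stand-in simulating a lossy channel, not a real transport):

```python
import random

def send_with_retry(send, max_attempts=5):
    """Retry until `send` reports an ack; return the number of attempts used."""
    for attempt in range(max_attempts):
        if send():                # True means an ack arrived in time
            return attempt + 1
    raise TimeoutError("no ack after %d attempts" % max_attempts)

random.seed(1)
lossy_send = lambda: random.random() > 0.5   # ~50% simulated packet loss
print(send_with_retry(lossy_send))           # 2 — first attempt lost, retry acked
```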

Example: RAID 1 (disk mirroring)
RAID = Redundant Array of Independent Disks.
RAID 1: disk mirroring
- All data that is written to one disk is also written to a second disk
- A block of data can be read from either disk
- If one disk goes out of service, the remaining disk will still have the data
An example of physical redundancy.

RAID 0: performance (striping)
Advantages:
- Performance
- All storage capacity can be used
Disadvantage:
- Not fault tolerant

RAID 1: high availability (mirroring)
Advantages:
- Double read speed
- No rebuild necessary if a disk fails: just copy
Disadvantage:
- Only half the space is usable
Physical redundancy.

RAID 3: high availability (separate parity disk)
Advantages:
- Very fast reads
- High efficiency: low ratio of parity to data
Disadvantages:
- Slow random I/O performance
- Only one I/O at a time
Information redundancy.

RAID 5: interleaved parity
Advantages:
- Very fast reads
- High efficiency: low ratio of parity to data
Disadvantages:
- Slower writes
- Complex controller
Information redundancy.
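
The parity used by RAID 3/5 is a bytewise XOR across the data blocks, which is what makes single-disk reconstruction possible. A minimal sketch with toy 4-byte blocks:

```python
def parity(blocks):
    """Bytewise XOR of equal-length blocks: the RAID parity block."""
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

data = [b"AAAA", b"BBBB", b"CCCC"]  # blocks on three data disks
p = parity(data)                    # stored on the parity disk (or rotated, in RAID 5)

# disk 1 fails: rebuild its block by XORing the surviving blocks with parity
rebuilt = parity([data[0], data[2], p])
print(rebuilt == data[1])           # True
```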

RAID 1+0: combine mirroring and striping
- Stripe across a set of disks
- Mirror the entire set onto another set

The End