Distributed Systems. 19. Fault Tolerance. Paul Krzyzanowski. Rutgers University. Fall 2013


November 27, 2013

Faults
Deviation from expected behavior, due to a variety of factors: hardware failure, software bugs, operator errors, network errors/outages.

Faults
Three categories: transient faults, intermittent faults, permanent faults.
Processor / storage faults:
Fail-silent (fail-stop): stops functioning.
Fail-restart: stops functioning but then restarts (state is lost).
Byzantine: produces faulty results.
Network faults:
Data corruption (Byzantine).
Link failure (fail-silent).
One-way link failure.
Network partition: the connection between two parts of a network fails.

Synchronous vs. asynchronous systems
E.g., IP packet transmission versus serial port transmission.
Synchronous: there is a known upper bound on the time for data transmission.
Why is this important? It lets us distinguish a slow network (or processor) from a stopped one.

Fault Tolerance
Fault avoidance: design a system with minimal faults.
Fault removal: validate/test a system to remove the presence of faults.
Fault tolerance: deal with faults!

Achieving fault tolerance: redundancy
Information redundancy: Hamming codes, parity memory, ECC memory.
Time redundancy: timeout & retransmit.
Physical redundancy/replication: TMR, disk mirroring, backup servers.
Replication: copying information so it can be available on redundant resources.
Failover: switching operation from a failed system to a redundant working one.

Availability: how much fault tolerance?
100% fault tolerance cannot be achieved, and the closer we wish to get to 100%, the more expensive the system will be.
Availability: the percentage of time that the system is functioning, typically expressed as a number of 9s.
Downtime includes all time when the system is unavailable.

Availability
Class                                    Level      Annual downtime
Continuous                               100%       0
Six nines (carrier-class switches)       99.9999%   30 seconds
Fault Tolerant (carrier-class servers)   99.999%    5 minutes
Fault Resilient                          99.99%     53 minutes
High Availability                        99.9%      8.3 hours
Normal availability                      99-99.5%   44-87 hours
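The downtime column is just arithmetic on the availability percentage; a quick sketch (the function name is ours, not from the slides):

```python
# Sketch: convert an availability percentage into expected annual downtime,
# matching the figures in the table above.
def annual_downtime_minutes(availability_pct: float) -> float:
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_pct / 100)

print(round(annual_downtime_minutes(99.999), 1))     # five nines: ~5.3 minutes/year
print(round(annual_downtime_minutes(99.9999) * 60))  # six nines: ~32 seconds/year
```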

Availability
At home, component failure is a disruptive event. In a network of 100,000+ machines, it is a daily occurrence.

Points of failure
Goal: avoid single points of failure.
A system is k-fault tolerant if it can withstand k faults.
With silent (fail-stop) faults, k+1 components are needed: k can fail and one will still be working.
With Byzantine faults, 2k+1 components are needed: k can generate false replies, and the remaining k+1 will provide a majority vote.
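A minimal sketch of those replica counts (the function name is ours, not from the slides):

```python
# Sketch: replicas needed to tolerate k faults under each failure model.
def replicas_needed(k: int, byzantine: bool = False) -> int:
    # Byzantine replicas can lie, so k+1 honest votes must outnumber k liars.
    return 2 * k + 1 if byzantine else k + 1

print(replicas_needed(2))                  # 3: two may fail silently, one survives
print(replicas_needed(2, byzantine=True))  # 5: three honest replies outvote two liars
```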

Active replication
A technique for fault tolerance through physical redundancy.
No redundancy: a single chain of components A, B, C.
Triple Modular Redundancy (TMR): threefold replication of each component (three parallel A-B-C chains) to detect and correct a single component failure.
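A toy TMR voter, treating each replica as a plain function (a simplification of the hardware voters between stages in the slide's diagram):

```python
# Sketch: a triple-modular-redundancy voter. Three replicas compute the same
# function; the voter returns the majority output, masking one faulty replica.
from collections import Counter

def tmr(replicas, x):
    outputs = [f(x) for f in replicas]
    value, votes = Counter(outputs).most_common(1)[0]
    return value

ok = lambda x: x * 2
faulty = lambda x: x * 2 + 1  # hypothetical misbehaving replica
print(tmr([ok, ok, faulty], 21))  # 42: the two good replicas outvote the bad one
```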

Active replication: replicated state machines
Use a distributed consensus algorithm (one Paxos instance per update in the slide's diagram) to agree on the order of updates across all replicas, so every replica applies the same sequence A, B, C.
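A toy illustration of why agreeing on update order keeps replicas consistent (the consensus step is assumed away here; in the slide it is supplied by Paxos):

```python
# Sketch: replicated state machines stay consistent because every replica
# applies the same deterministic commands in the same agreed order.
class Replica:
    def __init__(self):
        self.state = 0
    def apply(self, cmd):
        self.state = cmd(self.state)

agreed_log = [lambda s: s + 5, lambda s: s * 2, lambda s: s - 1]
replicas = [Replica() for _ in range(3)]
for cmd in agreed_log:          # the same order is delivered to every replica
    for r in replicas:
        r.apply(cmd)
print([r.state for r in replicas])  # [9, 9, 9]: all replicas agree
```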

Active-Active vs. Active-Passive
Active-Active: any server can handle requests; each request is a global state update. This usually requires total ordering for updates: Paxos, a distributed lock manager, or eventual/immediate consistency (Brewer's CAP theorem impacts us).
Active-Passive (primary-backup): one server does all the work; when it fails, a backup takes over. The backup may ping the primary with "are you alive?" messages. This is a simpler design. Example: Chubby.
Issues: watch out for Byzantine faults; recovery may be time-consuming and/or complex.
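A minimal active-passive sketch with hypothetical classes; the "are you alive?" check is reduced to catching a connection error, whereas real systems use heartbeats and leases:

```python
# Sketch: active-passive (primary-backup) failover. All requests go to the
# primary; if it is down, the caller fails over to the backup.
class Primary:
    alive = True
    def handle(self, req):
        if not self.alive:
            raise ConnectionError("primary down")
        return f"primary handled {req}"

class Backup:
    def handle(self, req):
        return f"backup handled {req}"

def request(primary, backup, req):
    try:
        return primary.handle(req)   # normal path
    except ConnectionError:
        return backup.handle(req)    # failover to the backup

p, b = Primary(), Backup()
print(request(p, b, "r1"))  # primary handled r1
p.alive = False
print(request(p, b, "r2"))  # backup handled r2
```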

Examples of Fault Tolerance

Example: ECC memory
Memory chips are designed with Hamming code logic. Most implementations correct single-bit errors in a memory location and detect multiple-bit errors.
This is an example of information redundancy.
Why is this not physical redundancy? The extra circuitry is not n-way replication of existing components.
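To make the information-redundancy point concrete, here is a Hamming(7,4) encoder/decoder; real ECC memory uses a wider SEC-DED code, but the principle is the same:

```python
# Sketch: Hamming(7,4) single-error correction. Three parity bits protect
# four data bits; the syndrome points at the flipped bit, if any.
def encode(d):               # d: list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]  # codeword positions 1..7

def decode(c):               # c: 7-bit codeword; corrects one flipped bit
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the bad bit; 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = encode([1, 0, 1, 1])
word[4] ^= 1                 # flip one bit, as a memory fault would
print(decode(word))          # [1, 0, 1, 1]: the error is corrected
```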

Example: failover via DNS SRV records
Goal: allow multiple machines (with unique IP addresses, possibly in different locations) to be represented by one hostname.
Instead of using DNS to resolve a hostname to one IP address, use DNS to look up SRV records for that name. Each record has a priority, a weight, and a server name.
Use the priority to pick one of several servers; use the weight to choose among servers of the same priority (for load balancing). Then, once you have picked a server, use DNS to look up its address.
This is commonly used in voice-over-IP systems to pick a SIP server/proxy. MX records (mail servers) take the same approach: use DNS to find several mail servers and pick one that works.
Example of physical redundancy.
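A sketch of the selection rule, with made-up SRV records (the weighted algorithm in RFC 2782 is more elaborate than this):

```python
# Sketch: pick a server from SRV-style records. Lowest priority wins;
# among equal priorities, the weight biases a random choice.
import random

records = [
    {"target": "sip1.example.com",   "priority": 10, "weight": 60},
    {"target": "sip2.example.com",   "priority": 10, "weight": 40},
    {"target": "backup.example.com", "priority": 20, "weight": 100},
]

def pick_server(records):
    best = min(r["priority"] for r in records)
    candidates = [r for r in records if r["priority"] == best]
    weights = [r["weight"] for r in candidates]
    return random.choices(candidates, weights=weights)[0]["target"]

print(pick_server(records))  # sip1 or sip2; backup only if both are removed
```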

Example: DNS with device monitoring
A custom DNS server returns the IP address of an available machine by monitoring the liveness of a set of equivalent machines.
This is the Akamai approach (Akamai uses more criteria than this).

Example: TCP retransmission
The sender requires an acknowledgement from the receiver; the acknowledgement contains the next expected byte number.
If the ack is not received within a certain amount of time, the sender retransmits the packet.
If an ack is received but the next expected byte number is unchanged, the sender assumes that the previous packet was not received.
Example of time redundancy.
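The timeout-and-retransmit pattern, reduced to a sketch (the channel is a stand-in function that reports whether an ack arrived in time):

```python
# Sketch: time redundancy via timeout-and-retransmit.
def send_reliably(send, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        if send():               # True means an ack came back in time
            return attempt       # number of attempts it took
    raise TimeoutError("no ack after retransmissions")

losses = iter([False, False, True])  # first two sends are lost, third is acked
print(send_reliably(lambda: next(losses)))  # 3
```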

Disk failure
Hard disk annual failure rates are roughly 5%. With 80 disks per rack and 100 racks, that is more than one failure per day on average.
SSD annual failure rates are roughly 1.5%.
http://www.tomshardware.com/reviews/ssd-reliability-failure-rate,2923.html
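The back-of-the-envelope arithmetic behind ">1 failure per day":

```python
# Sketch: expected daily disk failures from the numbers above.
disks = 80 * 100                 # 80 disks per rack, 100 racks
annual_failures = disks * 0.05   # ~5% annual failure rate
print(round(annual_failures / 365, 2))  # ~1.1 failures per day
```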

Example: RAID 1 (disk mirroring)
RAID = redundant array of independent disks.
RAID 1 is disk mirroring: all data that is written to one disk is also written to a second disk. A block of data can be read from either disk. If one disk goes out of service, the remaining disk still has the data.
Example of physical redundancy.

Example: RAID 1 (disk mirroring)
High availability by mirroring (replication).
Advantages: double read speed; no rebuild necessary if a disk fails, just copy.
Disadvantage: only half the space is usable.
Physical redundancy (data is replicated across devices).

RAID 0: performance
Striping: this is not fault tolerance.
Advantages: performance; all storage capacity can be used.
Disadvantage: not fault tolerant.

RAID 3: high availability
A separate parity disk.
Advantages: very fast reads; high efficiency (low ratio of parity to data).
Disadvantages: slow random I/O performance; only one I/O at a time.
Information redundancy (not n-way replication of data).

RAID 5
Interleaved parity, distributed across the disks.
Advantages: very fast reads; high efficiency (low ratio of parity to data).
Disadvantages: slower writes; a more complex controller.
Information redundancy (not n-way replication of data).
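The parity that RAID 3 and RAID 5 store is a plain XOR over the data blocks, which is why any single lost block can be rebuilt; a sketch:

```python
# Sketch: XOR parity as information redundancy. Parity = XOR of the data
# blocks; a lost block is the XOR of the survivors and the parity.
from functools import reduce

data = [0b1011, 0b0110, 0b1100]            # blocks on three data disks
parity = reduce(lambda a, b: a ^ b, data)  # stored on the parity disk

lost = data[1]                             # pretend disk 1 fails
rebuilt = parity ^ data[0] ^ data[2]       # XOR of survivors plus parity
print(rebuilt == lost)                     # True
```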

RAID 1+0
Combines mirroring and striping: striping across a set of disks, with the entire set mirrored onto another set.

Fault tolerant techniques we encountered
Networking: Ethernet checksums; IP header checksums; TCP & UDP data checksums; TCP retransmission; IP routing.
Remote procedure calls: retransmission of requests with time-outs.
Group communication & virtual synchrony: retransmission of data; partial and total ordering to ensure replicas are consistent; group management and view changes in virtual synchrony; replicated inputs (replicated state machines).
File systems: replicated servers (Coda, AFS, GFS); queued changes if a server is not available (Coda).

Fault tolerant techniques we encountered
Mutex, election, consensus, and commit algorithms: mechanisms to pick a unique leader; mechanisms to agree on data even if processes die; the concept of a quorum of >50% live processes.
Write-ahead logs: undoing or redoing changes after a failure (write-ahead log, GFS operation log).
Checkpointing: Pregel's periodic checkpoints to save the state of the computation.

The End