CS 470 Spring 2017 Mike Lam, Professor Fault Tolerance Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapter 8) Various online sources, including github.com/donnemartin/system-design-primer

Desirable system properties We want dependable systems Available: ready for use at any given time Reliable: runs continuously without failure Safe: nothing catastrophic happens upon failure Maintainable: easy to repair Similar to definitions for dependable software (CS 345)

Problem Inherent tension between: Consistency: reads see previous writes ("safety") Availability: operations finish ("liveness") Partition tolerance: failures don't affect correctness Can we have it all?

CAP Theorem A system cannot be simultaneously consistent (C), available (A), and partition-tolerant (P) We can only have two of the three In a non-distributed system, P isn't needed Tradeoff: latency vs. consistency ("PACELC Theorem") In a distributed system, P isn't optional Thus, we must choose: CP or AP I.e., consistency or availability Original conjecture by Eric Brewer: http://dl.acm.org/citation.cfm?id=822436 Formal theorem: http://dl.acm.org/citation.cfm?id=5646

Consistency Usual choice: compromise on consistency Strong consistency: reads see all previous writes (sequential consistency) Eventual consistency: reads eventually see all previous writes (continuous w/ long interval) Alternatively, continuous w/ short interval Causal consistency: reads see all causally-related writes E.g., "guaranteed convergence" Weak consistency: reads may not see previous writes E.g., "best effort"

Availability Active-passive / master-slave (asymmetric) Master server handles all requests Backup/failover server takes over if master fails Active-active / master-master (symmetric) Multiple master servers share request load Load re-balances if one fails (Slide diagram: active-passive shows a master server paired with a backup; active-active shows a load balancer in front of multiple servers.)
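To make the active-passive pattern concrete, here is a minimal Python sketch of health-check-based failover. The server addresses and the /health endpoint are assumptions made up for illustration, not part of any particular system.

import urllib.request

SERVERS = [
    "http://primary.example.com:8080",   # master (hypothetical address)
    "http://backup.example.com:8080",    # backup/failover (hypothetical address)
]

def healthy(url, timeout=1.0):
    # Active health check: assumes each server exposes a /health endpoint
    try:
        with urllib.request.urlopen(url + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_server():
    # Route to the master while it is up; fail over to the backup otherwise
    for url in SERVERS:
        if healthy(url):
            return url
    raise RuntimeError("no server available")

In an active-active setup, the same health checks would instead feed a load balancer that spreads requests across all servers that pass.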

Fault tolerance Sometimes, consistency/availability tradeoff decisions depend on the failure model: What kinds of failures happen? How often do they happen? What are the effects of a failure?

Fault tolerance Soft vs hard failures Soft failure: data is corrupted (often corrected by hardware) Hard failure: a component of a system stops working Hard failures in a non-distributed system are usually fatal The entire system must be restarted Hard failures in a distributed system can be non-fatal Partial failure: a failure of a subset of the components of a distributed system If the system is well-designed, it may be able to recover and continue after a partial failure

Measuring failure Failure rate (λ): failures per unit of time Mean Time Between Failures (MTBF) = 1/λ Assumes constant failure rate Failures In Time (FIT) = failures expected in one billion device-hours MTBF = 1e9 x 1/FIT On a million-core machine, 10 FIT means once every 100 hours, or once every ~4.2 days!
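A quick back-of-the-envelope check of that arithmetic in Python (the 10 FIT per-core rate is an assumed value for illustration):

fit_per_core = 10                       # assumed failure rate: 10 FIT per core
cores = 1_000_000                       # a million-core machine

mtbf_core_hours = 1e9 / fit_per_core                # MTBF = 1e9 x 1/FIT (one core)
mtbf_machine_hours = 1e9 / (fit_per_core * cores)   # failure rates add across cores

print(mtbf_core_hours)                  # 100000000.0 hours for a single core
print(mtbf_machine_hours)               # 100.0 hours for the whole machine
print(mtbf_machine_hours / 24)          # ~4.2 days between machine-wide failures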

Failure types Crash: the system halts Omission: the system fails to respond to requests Timing: the system responds too slowly Response: the system responds incorrectly Arbitrary failure: anything else (unpredictable!) Sometimes called "Byzantine" failures if they can manifest in a way that prevents future consensus

Failures Some distinguish between failure levels: A failure occurs when a system cannot meet its specification An error is the part of a system's state that leads to a failure A fault is the low-level cause of an error Most common source of faults: memory or disk storage If a system can provide dependable services even in the presence of faults, that system is fault-tolerant

Faults Permanent faults reproduce deterministically These are usually the easiest to fix Intermittent faults recur but do not always reproduce deterministically Unfortunately common in distributed systems Heisenbug: a software defect that seems to change or disappear during debugging Transient faults occur only once Often the result of physical phenomena

Bit errors Bit error: low-level fault where a bit is read/written incorrectly Single-bit vs. double-bit vs. multi-bit Single-Bit Error (SBE), Double-Bit Error (DBE) Hamming distance: # of bits different Potential DRAM source: "weak bits" in hardware Electrons are stored in a memory cell capacitor Critical charge (Qcrit) is the threshold between 0 and 1 values Refreshed often, but sometimes still read incorrectly Radiation and cosmic rays
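As a small illustration, the Hamming distance between two words can be computed by XORing them and counting the 1 bits (a generic sketch, not tied to any particular memory system):

def hamming_distance(a: int, b: int) -> int:
    # Number of bit positions in which the two words differ
    return bin(a ^ b).count("1")

print(hamming_distance(0b1011, 0b1001))   # 1 -> single-bit error (SBE)
print(hamming_distance(0b1011, 0b0001))   # 2 -> double-bit error (DBE)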

Cosmic rays

Example: GPU fault study The Titan supercomputer has 18,688 GPUs Tiwari, Devesh, et al. "Understanding GPU errors on large-scale HPC systems and the implications for system design and operation." High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on. IEEE, 2015. http://www4.ncsu.edu/~dtiwari2/papers/25_hpca_tiwari_gpu_reliability.pdf

Dealing with failure Detection: discovering failures Active (pinging) vs. passive (wait for messages) Issue: unreliability of timeouts Prevention: eliminate the possibility of failure Not possible in a distributed system Avoidance: divert around failure possibilities Only possible in particular circumstances Recovery: restore valid system state after a failure
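A rough sketch of an active (ping-based) failure detector in Python; the host names, ports, and probe period are made-up parameters, and a timed-out probe only suspects a failure rather than proving one:

import socket
import time

NODES = [("node1.example.com", 22), ("node2.example.com", 22)]   # hypothetical hosts/ports

def probe(host, port, timeout=1.0):
    # Active detection: try to open a TCP connection within a timeout
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # The node may just be slow, or the network may be partitioned:
        # this is the unreliability of timeouts mentioned above
        return False

def monitor(nodes, period=5.0):
    # Periodically probe every node and report the ones that look down
    while True:
        suspected = [host for host, port in nodes if not probe(host, port)]
        print("suspected failed:", suspected)
        time.sleep(period)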

Detection and avoidance Data-centric Redundancy, diversity, and replication E.g., dual modular redundancy (DMR), TMR Parity bits, checksums, and hashes E.g., cyclic redundancy check (CRC), MD5, SHA Computation-centric Acknowledgement (ACK)-based protocols Consensus and voting protocols One-phase vs. two-phase (e.g., Paxos)
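For the data-centric techniques, a short sketch using Python's standard zlib and hashlib modules shows how a replica can recompute a CRC or hash to detect corruption (the payload bytes are invented for illustration):

import hashlib
import zlib

data = b"replicated block of data"           # invented example payload

crc = zlib.crc32(data)                        # cyclic redundancy check (CRC-32)
md5 = hashlib.md5(data).hexdigest()           # MD5 digest
sha = hashlib.sha256(data).hexdigest()        # SHA-256 digest

# A replica detects corruption by recomputing the digest and comparing
replica = b"replicated block of data"
assert hashlib.sha256(replica).hexdigest() == sha, "replica is corrupted"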

Recovery (hardware) Hardware (general space vs. safety tradeoff) Dual modular redundancy (DMR) can detect a single-bit error Triple modular redundancy (TMR) can recover one corrupted bit Or detect a double-bit error Parity bits Even parity bits are 0 if the # of 1s is even; 1 otherwise Special case of CRC (polynomial is x+1) Odd parity bits are 1 if the # of 1s is even; 0 otherwise (Slide diagram: DMR and TMR bit-pattern examples, each labeled ok, single-bit error (SBE), or double-bit error (DBE).)
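A tiny Python sketch of these ideas, using single-bit values purely for illustration:

def even_parity(bits):
    # Even parity bit: 0 if the number of 1s is already even, 1 otherwise
    return sum(bits) % 2

def tmr_vote(a, b, c):
    # TMR majority vote: recovers the value if at most one copy is corrupted
    return (a & b) | (a & c) | (b & c)

def dmr_check(a, b):
    # DMR can only detect a mismatch; it cannot tell which copy is wrong
    return a == b

print(even_parity([1, 0, 1, 1]))   # 1 (three 1s, so the parity bit makes the total even)
print(tmr_vote(0, 1, 0))           # 0 (the single corrupted copy is outvoted)
print(dmr_check(0, 1))             # False (error detected, but not correctable)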

Recovery Hamming codes (often used in ECC memory) use parity bits Bit position 2^i is a parity bit covering all bits with the (i+1)th least significant bit set Each bit is covered by a unique set of parity bits Error locations are identified by summing the positions of the faulty parity bits Can detect & recover single-bit errors (can be extended to detect double-bit errors) Reed-Solomon codes are more complex (but widely used) Function values or coefficients of a polynomial (Slide figure: Hamming code parity bits and corresponding data bits, from https://en.wikipedia.org/wiki/hamming_code)
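To make the scheme concrete, here is a minimal Hamming(7,4) sketch in Python (4 data bits, parity bits at positions 1, 2, and 4); it is a teaching sketch of the positional convention described above, not production ECC code:

def hamming74_encode(d):
    # d = [d1, d2, d3, d4]; returns the 7-bit codeword (positions 1..7)
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # position 1 covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # position 2 covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # position 4 covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(code):
    # Recompute each parity check; the failing positions sum to the error location
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 * 1 + s2 * 2 + s4 * 4
    if syndrome:                       # nonzero syndrome = 1-indexed error position
        c[syndrome - 1] ^= 1           # flip the bad bit to recover the codeword
    return c

word = hamming74_encode([1, 0, 1, 1])
corrupted = list(word)
corrupted[5] ^= 1                      # inject a single-bit error at position 6
assert hamming74_correct(corrupted) == word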

Recovery QR codes provide multiple recovery % options Four levels: L (7%), M (15%), Q (25%), H (30%)

Recovery Software level Log: record of operations (can enable recovery) Checkpoint: snapshot of current state Independent vs. coordinated checkpointing Standalone vs. incremental checkpointing Tradeoff: space vs. time (how much to save?) Restore: revert system state to a checkpoint May require replaying some calculations Can a checkpoint be restored on a different system? If so, how?
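As a software-level illustration, here is a minimal checkpoint/restore sketch using Python's pickle module; the state contents and file name are invented for the example:

import pickle

def checkpoint(state, path="checkpoint.pkl"):
    # Snapshot the current state to stable storage
    with open(path, "wb") as f:
        pickle.dump(state, f)

def restore(path="checkpoint.pkl"):
    # Revert to the most recent snapshot; work done since then must be replayed
    with open(path, "rb") as f:
        return pickle.load(f)

state = {"iteration": 42, "partial_sums": [1.0, 2.5]}   # invented application state
checkpoint(state)
assert restore() == state

Whether such a checkpoint can be restored on a different system depends on how portable the saved state is: file paths, open network connections, and architecture-specific data generally do not transfer.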