Dependability: Vocabulary and Techniques
George Candea, Stanford University

What Is Dependability?
Dependability = trustworthiness of a computer system with regard to the services it provides
- Reliability = continuous functioning w/out failure
- Availability = readiness for usage
- Safety = avoid catastrophic effects on the environment
- Security = prevent unauthorized access and/or handling of information

Historical Evolution
1. Make things work and keep them working -> reliability
   (Babbage's analytical engine; MTBF(vacuum tubes) < computation time)
2. Widespread use (accounting, payroll, inventory, billing) -> availability
3. Critical applications (FAA, early warning, nuclear power plants, aircraft, ABS) -> safety
4. Distributed systems: increased node accessibility and vulnerability -> security

Some Definitions
- Specification = agreed description of a sw system's expected service
- Environment = any external entity that interacts w/ the system (present/past/future); beware of different system boundaries
- User = the part of the environment that provides inputs to the service and/or receives outputs
- Function = what the system should do
- Behavior = what the system does do (hence, service = abstraction of system behavior)
- Structure = what makes the system do what it does

From Fault to Failure
- Failure = deviation of the system's service from its spec
- Error = the part of system state that is liable to lead to failure
- Fault = adjudged/hypothesized cause of the error
- Fault -> Error -> Failure
- The failure of one system is a fault for another system (programmers, tools, operators, etc.):
  Fault -> Error -> Failure -> Fault -> Error -> Failure
- The distinction is hard, so make it at the level at which the fault is meant to be prevented or tolerated

Bugs = Software Faults
- Heisenbugs = intermittent design/implementation faults; the more you look for them, the more elusive they become
- Bohrbugs = permanent design/implementation faults (solid, easily identified)
- Note: intermittent fault = internal cause; transient fault = external cause

Failures Classification
- Byzantine failure = system returns wrong values (named after the Byzantine Empire)
- Stopping failure = system activity no longer perceived by the user; delivers a constant-value service
- Omission failure = stopping failure w/ no service being delivered at all
- Crash failure = persistent omission failure
- Crash < Omission < Stopping < Byzantine

- Benign failure: consequences ~ potential benefit from up-ness (same order of magnitude)
- Catastrophic failure: consequences >> potential benefit from up-ness
- Fail-safe system = only benign failures
- Fail-stop system = only stopping failures
- Fail-silent system = only crash failures (sometimes treated as equivalent to fail-stop)

Statistical Definitions
- Reliability: random variable R(t) = probability that the system does not fail before time t
- Mean time to failure: MTTF = expected time to failure (how long we expect the system to work w/out failure)
- Instantaneous availability: A(t) = probability that the system is up at time t; then Availability = lim_{t -> inf} A(t)
- [Figure: timeline alternating "up" and "down" periods; a failure starts the repair time (MTTR), and MTBF spans from one failure to the next]
- Availability = MTTF / (MTTF + MTTR) = MTTF / MTBF

Techniques
Three basic approaches to faults/errors/failures
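Returning to the statistical definitions above, here is a quick worked example of the steady-state availability formula (the numbers are illustrative, not from the slides):

```python
# Steady-state availability from MTTF and MTTR (illustrative numbers).
MTTF = 1000.0   # mean time to failure, in hours
MTTR = 1.0      # mean time to repair, in hours

MTBF = MTTF + MTTR           # mean time between failures
availability = MTTF / MTBF   # fraction of time the system is up

print(f"Availability = {availability:.5f}")   # ~0.99900, i.e. roughly "three nines"
```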

Formal Methods
- Idea came from traditional engineering disciplines
- Specify and model the behavior of a system, and mathematically verify that its design and implementation satisfy functional and safety reqs
- L1: formally specify the system (using logic or a spec language)
- L2: spec at 2+ levels; pencil-and-paper proof that concrete levels imply the more abstract levels (e.g., implem. -> design)
- L3: spec and then convince a mechanical theorem prover
- Problems:
  - Specs and the mechanical prover must be 100% correct
  - Large, complex systems are impossible to verify

Example: Proof-Carrying Code
- Code producer generates a formal safety proof (e.g., first-order logic proof for DEC Alpha machine code)
- Code consumer verifies the validity of the proof with a fast checker
- Advantages:
  - Shifts the burden of proof to the code producer
  - Only need to trust the proof checker
  - Faster than an interpreter
  - PCC maintains safety, even if tampered with
- Disadvantages:
  - Must trust the proof checker and the safety policy
  - Hard to prove interesting properties automatically
  - Safety is not sufficient for dependability

Static Program Analysis
- Inspect source code, manually or automatically, and find potential bugs (e.g., compilers)
- Example: metacompilation
  - Programmer provides C-like gcc extensions to automatically check or optimize their code; they get compiled together w/ the src
  - Extensions express accepted rules:
    - Syscalls must check user pointers before using them
    - Don't call a blocking function with interrupts disabled
    - Disabled interrupts must eventually be re-enabled
  - Found a couple thousand bugs in Linux, OpenBSD, and Xok
  - Latest tool: infer these rules automatically, thus not requiring the programmer to write them
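To make the flavor of such rules concrete, here is a toy checker sketch. It is not the metacompilation tool itself, and the function names it looks for (disable_irq, enable_irq, kmalloc_blocking, wait_event) are hypothetical, chosen only for illustration:

```python
# Toy checker in the spirit of the metacompilation rules above: flag code that
# calls a blocking function while interrupts are disabled, or that leaves
# interrupts disabled. All checked function names are hypothetical.
import re

BLOCKING_CALLS = {"kmalloc_blocking", "wait_event"}

def check_interrupt_discipline(source: str) -> list[str]:
    warnings = []
    irqs_disabled = False
    for lineno, line in enumerate(source.splitlines(), start=1):
        if re.search(r"\bdisable_irq\s*\(", line):
            irqs_disabled = True
        if irqs_disabled and any(re.search(rf"\b{call}\s*\(", line) for call in BLOCKING_CALLS):
            warnings.append(f"line {lineno}: blocking call with interrupts disabled")
        if re.search(r"\benable_irq\s*\(", line):
            irqs_disabled = False
    if irqs_disabled:
        warnings.append("end of code: interrupts left disabled")
    return warnings

buggy_driver = """
disable_irq();
buf = kmalloc_blocking(42);
"""
print(check_interrupt_discipline(buggy_driver))
# ['line 3: blocking call with interrupts disabled', 'end of code: interrupts left disabled']
```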

Restrictive Languages
- To limit the damage a program can do, limit what can be expressed in the source language ("if you can't say it, nobody will do it")
- Typical restrictions:
  - Control flow (e.g., no backward branching -> finite execution)
  - Type safety, no pointers
- Issues:
  - Language may be too limited and awkward
  - Dependable code must be written in that language
  - Binaries must be tamper-evident
  - Assumes all dev tools are trusted and correct
- Example: SPIN and user-provided kernel extensions written in Modula-3 (type-safe and OO); extensions signed by the compiler

Fault Forecasting
- Monitor the system and infer when it has entered an area of its state space where it's prone to failure; then fix it
- Internal monitoring: ISTORE
  - Incrementally scalable, self-maintaining storage appliance
  - Sensors monitor the system and communicate changes
  - Software triggers (predicates over system state) get evaluated when something changes and signal potential problems (see the sketch at the end of this page)
  - Adaptation code gets invoked to deal with anomalies
- External monitoring: Internet services
  - Statistically model system/network performance behavior
  - A deviation from that model == sign of an impending fault

Sandboxing
- Isolate the user program in a sandbox where it can execute without harming anything outside the sandbox
- Sandbox = fault domain = code + data segment, suitably aligned
- Configure the MMU to fault on accesses/jumps outside the fault domain
- Rewrite the binary to mask off high-order bits of addresses to keep them within the fault domain
- Redirect system calls through a protected jump table to an arbitrator
- Deny potentially unsafe operations

Dynamic Dataflow Analysis
- Quarantine data that may be contaminated (taintperl):

    print STDERR "Enter file name: ";
    $x = <STDIN>;         # tainted
    $y = "/etc/hosts";    # clean
    $z = "$sysdir/$x";    # transitively tainted
    system("cat $z");     # not permitted
    system("cat $y");     # OK
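The "software triggers" idea from the fault-forecasting slide above can be sketched in a few lines. The sensor name, threshold, and adaptation action below are invented for illustration; this is not the ISTORE implementation:

```python
# Minimal sketch of fault forecasting via software triggers: predicates over
# system state that are re-evaluated whenever a sensor reports a change, with
# adaptation code invoked when a trigger fires. All names are invented.

class TriggerMonitor:
    def __init__(self):
        self.state = {}       # latest readings reported by sensors
        self.triggers = []    # (name, predicate, adaptation) tuples

    def add_trigger(self, name, predicate, adaptation):
        self.triggers.append((name, predicate, adaptation))

    def sensor_update(self, key, value):
        """Called by a sensor; re-evaluates every trigger on each change."""
        self.state[key] = value
        for name, predicate, adaptation in self.triggers:
            if predicate(self.state):
                print(f"trigger fired: {name}")
                adaptation(self.state)

monitor = TriggerMonitor()
monitor.add_trigger(
    "disk-nearly-full",
    predicate=lambda s: s.get("disk_used_fraction", 0.0) > 0.9,
    adaptation=lambda s: print("adaptation: migrate data off this node"),
)
monitor.sensor_update("disk_used_fraction", 0.95)   # fires the trigger
```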

Virtualization
- VM = sw abstraction of a machine on top of another machine
- Can virtualize hw or a language execution environment
- Advantages:
  - Guest can virtualize resources differently from the host
  - Excellent way to test and debug (incl. with altered privileges)
  - Run distinct versions of various software, for diversification
  - Create sophisticated sandboxes (e.g., for classified information processing); VM isolation is simple to understand
  - Intercept, control, and monitor access to all resources; could infer when an application is about to fail
- Problems:
  - Must rely on the integrity and correctness of the VM

Virtualization Examples
- 1967: IBM introduces the S/360 model 67 w/ virtual memory; the TSS subsystem provides the illusion of multiple 360s w/out virtual memory
- These days: S/390 can run Linux in 10,000s of concurrent VMs (1,000s on production systems)
- IBM's latest offering: z/VM for the z900 mainframe
- Other: VMware's x86 VM for Windows and Linux; RealPC and VirtualPC to simulate x86 on Macs; and of course the JVM

Information/Data Redundancy
- CRC, EDC/ECC codes, parity checks, etc.
- Replication:
  - Primary/secondary copy
  - Multi-node replication
  - Majority voting
  - Weighted voting
  - Geographical replication: ship the transaction log off-site

Processor Redundancy
- Simple example: Triple Modular Redundancy (TMR)
  - Decreased MTTF; so why is it a good idea?
  - Bonus: reduces the single point of failure to something simple (a voter)
- Hot/cold standby and failover (e.g., Tandem NonStop)
- Clusters (combine data redundancy with processor redundancy)
- Distributed systems problems:
  - Manageability
  - How to reach agreement?
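A minimal sketch of the TMR idea: three replicas of a computation feed a simple majority voter, which masks a single faulty replica. The replica functions below are invented for illustration:

```python
# Minimal sketch of Triple Modular Redundancy: run three replicas of a
# computation and let a simple majority voter pick the result, masking one
# faulty replica.
from collections import Counter

def tmr_vote(replicas, *args):
    """Run each replica and return the majority result."""
    results = [replica(*args) for replica in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError(f"no majority among results: {results}")
    return value

def good_adder(a, b):
    return a + b

def faulty_adder(a, b):
    return a + b + 1   # injected fault for illustration

print(tmr_vote([good_adder, good_adder, faulty_adder], 2, 3))   # -> 5
```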

Distributed Consensus
- Nodes can fail (stop or Byzantine); the good nodes need to agree on a value (e.g., the time of a transaction commit)
- The most famous anthropomorphism in distributed systems: the Byzantine Generals problem (Lamport)
  - A city is surrounded by the Byzantine army; attack or retreat? When? Must do it at the same time!
  - Traitorous generals want to deceive the loyal generals
  - Can use oral or written (signed) messages
  - What is the max. number of traitors that can be tolerated?
    - Unsigned messages: can tolerate strictly fewer than 1/3 traitors
    - Signed messages: can tolerate any number of faults

Impossibility of Consensus
- Fundamental result, proved in 1985
- Asynchronous system (can make no assumptions about relative speeds of processes) w/ reliable communication
- At most one process can fail (stop or Byzantine)
- Impossible to guarantee consensus in finite time
- Basic reason: cannot distinguish failed nodes from slow nodes
- Consequence: cannot tolerate any fault in an async system
- In the real world: place upper bounds on communication and processing time; if a node is slow, consider it faulty

Temporal Redundancy
- Perform the computation several times, use a comparator to generate the result (see the retry sketch below)
- In the face of failure, repeat the computation
- At the basis of reliable message delivery through retransmission

Recovery
- Backward recovery: return the system to a previous, known-good state (e.g., checkpoint-restart, rollback, reboot)
- Forward recovery: take the system to a new, good state from where it can continue operating, potentially in degraded mode (e.g., app-specific exception handling)
- Compare to: compensation, in which the erroneous state is sufficiently redundant to allow continued operation (e.g., compensating transactions, failover, etc.)

ACID
- Atomicity = all-or-nothing
- Consistency = only correct state changes
- Isolation = as if running alone, even if there are concurrent txs
- Durability = committed changes are permanent
- Transaction = unit of work that is ACID
- Write-ahead logging for atomicity and durability (a toy sketch follows below)
- Hairy algs. for logging and recovery
- Distributed transactions & recovery management
- Multiphase commits
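Here is the promised retry-and-compare sketch for temporal redundancy. The flaky computation and the "accept when two consecutive runs agree" policy are invented for illustration:

```python
# Minimal sketch of temporal redundancy: repeat a computation until two
# consecutive runs agree (the "comparator"), retrying after failed runs.
import random

def with_temporal_redundancy(compute, max_attempts=10):
    previous = None
    for _ in range(max_attempts):
        try:
            result = compute()
        except Exception:
            previous = None            # failed run: discard and retry
            continue
        if previous is not None and result == previous:
            return result              # two consecutive runs agree
        previous = result
    raise RuntimeError("no two consecutive runs agreed")

def flaky_square():
    if random.random() < 0.2:          # injected transient fault
        raise IOError("transient glitch")
    return 7 * 7

random.seed(0)                          # for a reproducible demo
print(with_temporal_redundancy(flaky_square))   # -> 49
```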

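And a toy sketch of the write-ahead logging idea from the ACID slide: redo and commit records are appended (and, conceptually, flushed) before the data is updated, so recovery can replay committed transactions and ignore uncommitted ones. This is an invented, in-memory illustration, not a real recovery manager:

```python
# Toy sketch of write-ahead logging (redo-only, one transaction at a time).
import json

class ToyWAL:
    def __init__(self):
        self.log = []    # stands in for the durable log file
        self.data = {}   # stands in for data pages on disk

    def commit(self, txid, updates):
        # 1. Write ahead: append redo records plus a commit record.
        for key, value in updates.items():
            self.log.append(json.dumps({"tx": txid, "key": key, "redo": value}))
        self.log.append(json.dumps({"tx": txid, "commit": True}))
        # 2. Only now apply the updates to the data pages.
        self.data.update(updates)

    def recover(self):
        """After a crash: replay redo records of committed transactions only."""
        records = [json.loads(r) for r in self.log]
        committed = {r["tx"] for r in records if r.get("commit")}
        self.data = {}
        for r in records:
            if "redo" in r and r["tx"] in committed:
                self.data[r["key"]] = r["redo"]

wal = ToyWAL()
wal.commit("t1", {"x": 1, "y": 2})
wal.data = {}        # simulate losing the data state in a crash
wal.recover()
print(wal.data)      # -> {'x': 1, 'y': 2}
```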
Pros and Cons of ACID
- Advantages:
  - Atomicity is extremely appealing
  - Unambiguous notion of fail-stop
  - Easy and simple to understand and reason about
  - Excellent building block for complex interactions
- Disadvantages:
  - Consistency protocols are hard to get right (design & impl.)
  - Locking and scheduling; deadlock-detection code is the worst kind of code: almost never exercised, but absolutely critical when called
  - Performance and correctness are antagonistic

Restart-Based Techniques
- One of the largest sources of unavailability: intermittent bugs and transient faults
- Structure systems such that they can be restarted at various fine-grained levels without adverse effects
- Use prophylactic and reactive restarts to cure failure (see the supervisor sketch at the end of this section)
- Properties:
  - Unequivocally returns the system to its start state
  - High-confidence way to reclaim stale and leaked resources
  - Easy to understand and employ

Diversity
- N-version programming (govt.-funded critical systems)
- More realistic variant: deploy a collection of different releases from various software vendors, to avoid similar fault patterns
- Automated mutation of software
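Returning to the restart-based techniques above, here is a minimal supervisor sketch: it reactively restarts a component when it wedges and retries the request once, and it can also restart prophylactically after a fixed number of requests. The component, its simulated leak, and the thresholds are invented for illustration and are not taken from any specific system:

```python
# Minimal sketch of restart-based recovery with reactive and prophylactic restarts.

class Component:
    def __init__(self):
        self.restart()

    def restart(self):
        self.leaked = 0                 # back to a known-good start state

    def handle(self, request):
        self.leaked += 1                # stand-in for a slow resource leak
        if self.leaked > 4:             # wedges once too much state accumulates
            raise RuntimeError("component wedged")
        return f"ok: {request}"

class Supervisor:
    def __init__(self, component, prophylactic_every=100):
        self.component = component
        self.prophylactic_every = prophylactic_every
        self.since_restart = 0

    def handle(self, request):
        if self.since_restart >= self.prophylactic_every:
            self.component.restart()                    # prophylactic restart
            self.since_restart = 0
        try:
            result = self.component.handle(request)
        except RuntimeError:
            self.component.restart()                    # reactive restart
            self.since_restart = 0
            result = self.component.handle(request)     # retry once
        self.since_restart += 1
        return result

sup = Supervisor(Component(), prophylactic_every=10)
print([sup.handle(str(i)) for i in range(8)])   # the 5th request triggers a reactive restart
```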