Course: Advanced Software Engineering. academic year: Lecture 14: Software Dependability
Transcription
Lecturer: Vittorio Cortellessa
Computer Science Department, University of L'Aquila, Italy
vittorio.cortellessa@di.univaq.it

Copyright Notice
» The material in these slides may be freely reproduced and distributed, in whole or in part, provided that an explicit reference or acknowledgement to the author of the material is preserved.

BASIC CONCEPTS

BASIC CONCEPTS: dependability summary

Reliability definition
(RELIABILITY) Probability of a system working within specs throughout an interval of time without system-level repair.

Reliability definition
(RELIABILITY ON DEMAND) Probability of a system working within specs for a certain number of invocations without system-level repair.
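
The two definitions can be turned into numbers with a minimal sketch. It relies on assumptions the slides do not state: a constant failure rate (exponential model) for time-based reliability, and independent invocations with a fixed per-invocation failure probability for reliability on demand. The function and parameter names are hypothetical.

```python
import math


def reliability_over_time(failure_rate, t):
    """P(no failure in [0, t]), assuming a constant failure rate."""
    return math.exp(-failure_rate * t)


def reliability_on_demand(p_fail, n):
    """P(n consecutive invocations all succeed), assuming independence."""
    return (1.0 - p_fail) ** n


# Illustrative inputs only: 0.002 failures per time unit over 100 units,
# and a 0.001 failure probability per invocation over 1000 invocations.
print(reliability_over_time(0.002, 100))
print(reliability_on_demand(0.001, 1000))
```

Under these assumptions, even a small per-invocation failure probability gives a low chance of surviving a long run of invocations, which is why the slide ties this definition to "a certain number of invocations".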

Availability definition
Fraction of time that the system is up within specs.

Reliability terminology
» Fault: a feature that precludes the software from operating according to its specifications
» Error: the value of the software state differs from the expected one
» Failure: the actual software output (for some input) differs from the expected one

Faults, Errors and Failures
Specification: a function that computes the remainder by 3 of the square of the input value, y = s^2 mod 3

program modthreeofsquare
begin
  read(s);
  s := 2*s;       { Fault! s=2: no Error; s=3: Error! }
  s := s mod 3;
  write(s);       { s=2, s=3: no Failure; s=4: Failure! }
end

Faults, Errors and Failures
» A failure is usually the result of a system error that derives from a fault in the system
» However, faults do not necessarily result in system errors
  - A faulty system might never execute the faulty statement, so no error originates
» Errors do not necessarily lead to system failures
  - The error can be corrected by built-in error detection and recovery, or it can be naturally masked by other system components (error propagation)
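
The modthreeofsquare example can be replayed as a short Python sketch. The helper names (`spec`, `mod_three_of_square`, `classify`) are hypothetical. "Error" is taken to mean that the intermediate state differs from the expected s^2, and "failure" that the output differs from the specification.

```python
def spec(s):
    """Specification: remainder by 3 of the square of the input."""
    return (s * s) % 3


def mod_three_of_square(s):
    """Faulty implementation from the slide; returns (state, output)."""
    state = 2 * s  # fault: the specification requires s * s
    return state, state % 3


def classify(s):
    """Return (error, failure) for input s."""
    state, output = mod_three_of_square(s)
    error = state != s * s       # internal state differs from the expected one
    failure = output != spec(s)  # output differs from the expected one
    return error, failure


for s in (2, 3, 4):
    error, failure = classify(s)
    print(f"s={s}: error={error}, failure={failure}")
```

For s=2 the fault is executed but causes no error, for s=3 the error is masked by the mod operation, and for s=4 the error reaches the output as a failure.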

About the error propagation
[Figure: somehow interacting components C1, C2 and Cn, labelled Ф(C1), Ф(C2) and Ф(Cn); each component is marked as correct or erroneous.]
» Reliability of each component may not suffice

About the error propagation
[Figure labels: system interface; component interface; component i; component j; internal fault; activation; error; input error; error propagation; status of component i (correct service, component i failure, incorrect service); status of component j (correct service, (system) failure, incorrect service).]

Dependability achievement
» Fault avoidance
  - Development techniques are used that either minimise the possibility of mistakes or trap mistakes before they result in the introduction of system faults
» Fault detection and removal
  - Verification and validation techniques that increase the probability of detecting and correcting faults before the system goes into service
» Fault tolerance
  - Run-time techniques are used to ensure that system faults do not result in system errors and/or that system errors do not lead to system failures

FAULT AND FAILURE TYPES

Fault classification
» Heisenbugs
» Byzantine faults

Fault classification
A repeatable bug; one that manifests reliably under a possibly unknown but well-defined set of conditions (commonly called a Bohrbug).

Fault classification
Heisenbugs
A bug that disappears or alters its behavior when one attempts to probe or isolate it. E.g., the use of a debugger sometimes alters a program's operating environment significantly enough that buggy code, such as that which relies on the values of uninitialized memory, behaves quite differently.

Fault classification
A bug whose underlying causes are so complex and obscure as to make its behavior appear chaotic or even non-deterministic (commonly called a Mandelbug).

Failure classification: nature of the failure
» Hardware failure
  - Hardware fails because of design and manufacturing errors or because components have reached the end of their natural life.
» Software failure
  - Software fails due to errors in its specification, design or implementation.
  - Software failures are different from hardware failures in that software does not wear out. It can continue in operation even after an incorrect result has been produced.
» Operational failure
  - Human operators make mistakes. Now perhaps the largest single cause of system failures.

Failure classification: type of failure

Failure classification: severity of failure

Failure class      Description
Transient          Occurs only with certain inputs
Permanent          Occurs with all inputs
Recoverable        System can recover without operator intervention
Unrecoverable      Operator intervention needed to recover from failure
Non-corrupting     Failure does not corrupt system state or data
Corrupting         Failure corrupts system state or data

Reliability improvement
» Removing X% of the faults in a system will not necessarily improve the reliability by X%. A study at IBM showed that removing 60% of product defects resulted in a 3% improvement in reliability
» Program defects may be in rarely executed sections of the code so may never be encountered by users. Removing these does not affect the perceived reliability
» A program with known faults may therefore still be seen as reliable by its users
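
The gap between faults removed and reliability gained can be illustrated with a small sketch. All numbers are hypothetical and are not taken from the IBM study. The sketch also assumes that each fault is triggered by a disjoint set of inputs, so the per-demand failure probability is the sum of the individual activation probabilities.

```python
def failure_probability(activation_probs):
    """Per-demand failure probability, assuming each fault is triggered
    by a disjoint set of inputs (so the probabilities simply add)."""
    return sum(activation_probs)


# Hypothetical per-demand activation probabilities for five faults:
# one sits on a frequently executed path, four in rarely executed code.
faults = [0.02, 0.0005, 0.0003, 0.0002, 0.0001]

# Remove the three rarest faults, i.e. 60% of the faults by count.
remaining = sorted(faults, reverse=True)[:2]

before = failure_probability(faults)
after = failure_probability(remaining)
reduction = (before - after) / before

print(f"faults removed: {1 - len(remaining) / len(faults):.0%}")
print(f"failure probability reduced by: {reduction:.1%}")
```

Because the removed faults are rarely activated, the failure probability seen by users barely changes even though most of the faults are gone.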

Reliability specifications
» The required level of system reliability should be expressed quantitatively.
» Reliability is a dynamic system attribute, so reliability specifications related to the source code are meaningless.
  - For example: "No more than N faults/1000 lines".
  - Such a figure is only useful for a post-delivery process analysis where you are trying to assess how good your development techniques are.
» An appropriate reliability metric should be chosen to specify the overall system reliability.

METRICS

Reliability metrics
» Reliability metrics are units of measurement of system reliability.
» System reliability is measured by counting the number of operational failures and, where appropriate, relating these to the demands made on the system and the time that the system has been operational.

Dependability metrics

Metric: POFOD (Probability of failure on demand)
Explanation: The likelihood that the system will fail when a service request is made. A POFOD of 0.001 means that 1 out of a thousand service requests may result in failure.

Metric: ROCOF (Rate of failure occurrence)
Explanation: The frequency of occurrence with which unexpected behaviour is likely to occur. A ROCOF of 2/100 means that 2 failures are likely to occur in each 100 operational time units. This metric is sometimes called the failure intensity.

Metric: MTTF (Mean time to failure)
Explanation: The average time between observed system failures. An MTTF of 500 means that 1 failure can be expected every 500 time units.

Metric: AVAIL (Availability)
Explanation: The probability that the system is available for use at a given time. Availability of 0.998 means that in every 1000 time units, the system is likely to be available for 998 of these.
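
The four metrics can be estimated from simple counts, as in the sketch below. The function names are hypothetical, and estimating MTTF as operational time divided by the number of observed failures is a simplifying assumption rather than something the slides prescribe. The inputs reproduce the figures quoted in the table.

```python
def pofod(failed_demands, total_demands):
    """Probability of failure on demand: failed requests / all requests."""
    return failed_demands / total_demands


def rocof(failures, operational_time):
    """Rate of failure occurrence: failures per operational time unit."""
    return failures / operational_time


def mttf(operational_time, failures):
    """Mean time to failure, estimated as time observed per failure."""
    return operational_time / failures


def availability(up_time, total_time):
    """Fraction of the observed time in which the system was usable."""
    return up_time / total_time


print(pofod(1, 1000))          # 1 failed request out of 1000
print(rocof(2, 100))           # 2 failures in 100 time units
print(mttf(1000, 2))           # 2 failures in 1000 time units
print(availability(998, 1000)) # up for 998 of 1000 time units
```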

Probability of failure on demand (POFOD)
» This is the probability that the system will fail when a service request is made. Useful when demands for service are intermittent and relatively infrequent.
» Appropriate for protection systems where services are demanded occasionally and where there are serious consequences if the service is not delivered.

Rate of failure occurrence (ROCOF)
» Reflects the rate of occurrence of failure in the system.
» Relevant for operating systems and transaction processing systems, where the system has to process a large number of similar requests that are relatively frequent.

Mean time to failure (MTTF)
» Measure of the time between observed failures of the system. It is the reciprocal of ROCOF for stable systems.
» Relevant for systems with long transactions, i.e. where system processing takes a long time. MTTF should be longer than the transaction length.

Availability
» Measure of the fraction of the time that the system is available for use.
» Takes repair and restart time into account.
» Relevant for non-stop, continuously running systems.
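
The relationships stated on these two slides can be sketched as follows. The helper names are hypothetical. The steady-state formula MTTF / (MTTF + MTTR) is the usual way to fold repair and restart time into availability; MTTR (mean time to repair) is a name introduced here and does not appear in the slides.

```python
def mttf_from_rocof(rocof):
    """For a stable system, MTTF is the reciprocal of ROCOF."""
    return 1.0 / rocof


def mttf_covers_transaction(mttf, transaction_length):
    """Check the slide's rule of thumb: MTTF longer than a transaction."""
    return mttf > transaction_length


def steady_state_availability(mttf, mttr):
    """Availability with repair time included (MTTR is an assumed input)."""
    return mttf / (mttf + mttr)


mttf = mttf_from_rocof(0.02)             # ROCOF of 2 failures per 100 time units
print(mttf)
print(mttf_covers_transaction(mttf, 30)) # hypothetical 30-unit transaction
print(steady_state_availability(499, 1)) # hypothetical MTTF and MTTR values
```

Shortening repair and restart time raises availability even when the failure rate stays the same, which is why availability suits continuously running systems better than MTTF alone.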
More informationHCI in the software process
chapter 6 HCI in the software process HCI in the software process Software engineering and the process for interactive systems Usability engineering Iterative and prototyping Design rationale the software
More informationHCI in the software. chapter 6. HCI in the software process. The waterfall model. the software lifecycle
HCI in the software process chapter 6 HCI in the software process Software engineering and the process for interactive systems Usability engineering Iterative and prototyping Design rationale the software
More informationVerification and Validation
Lecturer: Sebastian Coope Ashton Building, Room G.18 E-mail: coopes@liverpool.ac.uk COMP 201 web-page: http://www.csc.liv.ac.uk/~coopes/comp201 Verification and Validation 1 Verification and Validation
More informationObject Oriented Programming. Week 7 Part 1 Exceptions
Object Oriented Programming Week 7 Part 1 Exceptions Lecture Overview of Exception How exceptions solve unexpected occurrences Catching exceptions Week 7 2 Exceptions Overview Week 7 3 Unexpected Occurances
More informationIBM POWER6 Processor-based Systems: Designing and Implementing Serviceability
IBM POWER6 Processor-based Systems: Designing and Implementing Serviceability IBM System p Platform Reliability, Availability and Serviceability (RAS) Jim Mitchell, George Ahrens, Julie Villarreal and
More informationLecture 22: Fault Tolerance
Lecture 22: Fault Tolerance Papers: Token Coherence: Decoupling Performance and Correctness, ISCA 03, Wisconsin A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures, HPCA 07, Spain Error
More informationCSE 5306 Distributed Systems. Fault Tolerance
CSE 5306 Distributed Systems Fault Tolerance 1 Failure in Distributed Systems Partial failure happens when one component of a distributed system fails often leaves other components unaffected A failure
More informationChapter 9. Software Testing
Chapter 9. Software Testing Table of Contents Objectives... 1 Introduction to software testing... 1 The testers... 2 The developers... 2 An independent testing team... 2 The customer... 2 Principles of
More informationFault-Tolerance I: Atomicity, logging, and recovery. COS 518: Advanced Computer Systems Lecture 3 Kyle Jamieson
Fault-Tolerance I: Atomicity, logging, and recovery COS 518: Advanced Computer Systems Lecture 3 Kyle Jamieson What is fault tolerance? Building reliable systems from unreliable components Three basic
More information