Fault Tolerance in Distributed Systems: An Introduction

Similar documents
Fault Tolerance in Distributed Systems: An Introduction

Consistency & Replication in Distributed Systems

Naming in Distributed Systems

Processes in Distributed Systems

Processes in Distributed Systems

Communication in Distributed Systems

Software Architectures

Software Architectures

Introduction to Distributed Systems

Introduction to Distributed Systems

Module 8 Fault Tolerance CS655! 8-1!

Object-Oriented Middleware for Distributed Systems

Jade: Java Agent DEvelopment Framework Overview

Introduction to Distributed Systems

Jade: Java Agent DEvelopment Framework Getting Started

Chapter 8 Fault Tolerance

Introduction to Distributed Systems

Module 8 - Fault Tolerance

From Objects to Agents: The Java Agent Middleware (JAM)

Introduction to Distributed Systems

Introduction to Distributed Systems

Fault Tolerance. Distributed Systems IT332

Software Architectures

Distributed Systems COMP 212. Lecture 19 Othon Michail

Jade: Java Agent DEvelopment Framework Overview

Fault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University

Chapter 8 Fault Tolerance

Distributed Systems

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:

The Architecture of the World Wide Web

DISTRIBUTED SYSTEMS. Second Edition. Andrew S. Tanenbaum Maarten Van Steen. Vrije Universiteit Amsterdam, 7'he Netherlands PEARSON.

Course: Advanced Software Engineering. academic year: Lecture 14: Software Dependability

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University

Failure Tolerance. Distributed Systems Santa Clara University

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems. Fault Tolerance

Prolog Examples. Distributed Systems / Technologies. Sistemi Distribuiti / Tecnologie

Dep. Systems Requirements

Fault Tolerance. Distributed Systems. September 2002

TSW Reliability and Fault Tolerance

Coordination in Situated Systems

TWO-PHASE COMMIT ATTRIBUTION 5/11/2018. George Porter May 9 and 11, 2018

Fault Tolerance. Basic Concepts

Labelled Variables in Logic Programming: A First Prototype in tuprolog

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Today: Fault Tolerance. Replica Management

Middleware and Interprocess Communication

Distributed Information Processing

Time redundancy. Time redundancy

Boolean network robotics

Today: Fault Tolerance

COURSE PRESENTATION. PROGRAMMAZIONE AVANZATA E PARADIGMI Ingegneria e Scienze Informatiche Università di Bologna - Cesena - a.a.

The Spanning Tree Protocol

Documents and computation. Introduction to JavaScript. JavaScript vs. Java Applet. Myths. JavaScript. Standard

Implementation Issues. Remote-Write Protocols

Distributed Systems COMP 212. Lecture 1 Othon Michail

Distributed Systems Fault Tolerance

The Architecture of the World Wide Web

Multimedia Data Management M

Multimedia Data Management M

Fault Tolerance. it continues to perform its function in the event of a failure example: a system with redundant components

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Today: Fault Tolerance. Fault Tolerance

Last Class:Consistency Semantics. Today: More on Consistency

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

CS655: Advanced Topics in Distributed Systems [Fall 2013] Dept. Of Computer Science, Colorado State University

Issues in Programming Language Design for Embedded RT Systems

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

Distributed Systems (5DV147)

Distributed Information Processing

Beyond FLP. Acknowledgement for presentation material. Chapter 8: Distributed Systems Principles and Paradigms: Tanenbaum and Van Steen

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

PRIMARY-BACKUP REPLICATION

Reliable Distribution of Data Using Replicated Web Servers

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Fault Tolerance. Distributed Software Systems. Definitions

Dependability tree 1

Module 1 - Distributed System Architectures & Models

Distributed Systems COMP 212. Lecture 1 Othon Michail

Fault Tolerance. o Basic Concepts o Process Resilience o Reliable Client-Server Communication o Reliable Group Communication. o Distributed Commit

Distributed File Systems. CS432: Distributed Systems Spring 2017

Defect Tolerance in VLSI Circuits

Information Technology for Documentary Data Representation

Software reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment.

Network Time Protocol

OWL-S for Describing Artifacts

Fault tolerant scheduling in real time systems

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer?

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

Database Technology. Topic 11: Database Recovery

Fault tolerance and Reliability

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

Chapter 1: Distributed Systems: What is a distributed system? Fall 2013

Fault Tolerance. The Three universe model

Data Modelling and Multimedia Databases M

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Designing a Development Environment for Logic and Multi-Paradigm Programming

Transcription:

Fault Tolerance in Distributed Systems: An Introduction Distributed Systems Sistemi Distribuiti Andrea Omicini andrea.omicini@unibo.it Ingegneria Due Alma Mater Studiorum Università di Bologna a Cesena Academic Year 2011/2012 Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 1 / 18

Outline Outline 1 Introduction 2 Basic Concepts Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 2 / 18

Disclaimer These Slides Contain Material from [Tanenbaum and van Steen, 2007] Slides were made kindly available by the authors of the book Such slides shortly introduced the topics developed in the book [Tanenbaum and van Steen, 2007] adopted here as the main book of the course Some of the material from those slides has been re-used in the following, and integrated with new material according to the personal view of the teacher of this course Every problem or mistake contained in these slides, however, should be attributed to the sole responsibility of the teacher of this course Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 3 / 18

Outline Introduction 1 Introduction 2 Basic Concepts Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 4 / 18

Introduction Failure in Distributed Systems Partial failure A typical feature of distributed systems is the notion of partial failure One component may fail, while the rest of the systems keeps running While the functionality guaranteed by the failed component is compromised, this does not necessarily holds for the other components, as well as for the overall system Engineering distributed systems with failure When engineering a distributed systems, a twofold goal is possible reducing the impact of failure of a single component on the others, and on the overall system performance exploiting partial failure to recover from failure Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 5 / 18

Introduction Failure in Distributed Systems Partial failure A typical feature of distributed systems is the notion of partial failure One component may fail, while the rest of the systems keeps running While the functionality guaranteed by the failed component is compromised, this does not necessarily holds for the other components, as well as for the overall system Engineering distributed systems with failure When engineering a distributed systems, a twofold goal is possible reducing the impact of failure of a single component on the others, and on the overall system performance exploiting partial failure to recover from failure Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 5 / 18

Outline Basic Concepts 1 Introduction 2 Basic Concepts Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 6 / 18

Basic Concepts Dependable Systems Main features of dependable systems Availability Reliability Safety Maintainability Dependability is closely related to fault tolerance Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 7 / 18

Basic Concepts Availability Definition Availability refers to the property that a system is ready for immediate use This means...... that availability refers to the probability that a system is operating correctly at any given moment, ready to provide users with its functions So, a highly-available system is a system that is most likely to be ready and working at any given instant of time Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 8 / 18

Basic Concepts Availability Definition Availability refers to the property that a system is ready for immediate use This means...... that availability refers to the probability that a system is operating correctly at any given moment, ready to provide users with its functions So, a highly-available system is a system that is most likely to be ready and working at any given instant of time Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 8 / 18

Basic Concepts Reliability Definition Reliability refers to the property that a system can run continuously without failure This means...... that reliability is defined in terms of a time interval, rather than of a instant as in the case of availability So, a highly-reliable system is a system that is most likely to keep on running for a long period of time Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 9 / 18

Basic Concepts Reliability Definition Reliability refers to the property that a system can run continuously without failure This means...... that reliability is defined in terms of a time interval, rather than of a instant as in the case of availability So, a highly-reliable system is a system that is most likely to keep on running for a long period of time Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 9 / 18

Safety Basic Concepts Definition Safety refers to the situation that when a system temporarily fails to operate correctly, nothing catastrophic happens This is...... a very difficult property to be defined, and to be ensured as well Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 10 / 18

Safety Basic Concepts Definition Safety refers to the situation that when a system temporarily fails to operate correctly, nothing catastrophic happens This is...... a very difficult property to be defined, and to be ensured as well Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 10 / 18

Basic Concepts Maintainability Definition Maintainability refers to how easily a failed systems can be repaired This means...... that maintainability is closely related to availability So, a highly-maintainable system may also show a high degree of availability Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 11 / 18

Basic Concepts Maintainability Definition Maintainability refers to how easily a failed systems can be repaired This means...... that maintainability is closely related to availability So, a highly-maintainable system may also show a high degree of availability Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 11 / 18

Basic Concepts Faults I Failure A system is said to fail when does not behave as promised An error is a part of a system state that might have caused a failure The cause of an error is a fault Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 12 / 18

Basic Concepts Faults II Fault tolerance and dependable systems Building a dependable system closely relates to controlling faults One may distinguish between preventing faults removing faults forecasting faults In distributed system, the most important issue is fault tolerance as the property of a system to provide its function even in the presence of faults Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 13 / 18

Basic Concepts Faults III Sorts of faults Transient faults occur once then disappear Intermittent faults occur, vanishes of its own accord, then reappears, and so on Permanent faults keep on existing until the faulty component is replaced /fixed Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 14 / 18

Failure Models Basic Concepts Different types of failures [Tanenbaum and van Steen, 2007] Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 15 / 18

Basic Concepts Failure Masking by Redundancy Idea Hiding failures from other processes The key technique for masking faults is redundancy Three kinds of redundancy Information redundancy e.g., extra bits Time redundancy e.g., redos after transaction aborts Physical redundancy typical in biological systems Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 16 / 18

Basic Concepts Failure Masking by Redundancy Idea Hiding failures from other processes The key technique for masking faults is redundancy Three kinds of redundancy Information redundancy e.g., extra bits Time redundancy e.g., redos after transaction aborts Physical redundancy typical in biological systems Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 16 / 18

References I References Tanenbaum, A. S. and van Steen, M. (2007). Distributed Systems. Principles and Paradigms. Pearson Prentice Hall, Upper Saddle River, NJ, USA, 2nd edition. Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 17 / 18

Fault Tolerance in Distributed Systems: An Introduction Distributed Systems Sistemi Distribuiti Andrea Omicini andrea.omicini@unibo.it Ingegneria Due Alma Mater Studiorum Università di Bologna a Cesena Academic Year 2011/2012 Andrea Omicini (Università di Bologna) 10 Introduction to Fault Tolerance A.Y. 2011/2012 18 / 18