Fault Tolerance in Distributed Systems: An Introduction


Distributed Systems / Sistemi Distribuiti
Andrea Omicini (andrea.omicini@unibo.it)
Dipartimento di Informatica Scienza e Ingegneria (DISI), Alma Mater Studiorum Università di Bologna a Cesena
Academic Year 2014/2015
Andrea Omicini (DISI, Univ. Bologna), 11 Introduction to Fault Tolerance, A.Y. 2014/2015

Disclaimer
These slides contain material from [TvS07]. The original slides were kindly made available by the authors of the book, and briefly introduced the topics developed in [TvS07], which is adopted here as the main textbook of the course. Some of that material has been reused in the following and integrated with new material according to the personal view of the teacher of this course. Any problem or mistake in these slides, however, is the sole responsibility of the teacher of this course.

Outline
1. Introduction
2. Basic Concepts

Introduction: Failure in Distributed Systems
Partial failure
- A typical feature of distributed systems is the notion of partial failure: one component may fail while the rest of the system keeps running.
- While the functionality provided by the failed component is compromised, this does not necessarily hold for the other components, or for the overall system.
Engineering distributed systems with failure
When engineering a distributed system, a twofold goal is possible:
- reducing the impact of the failure of a single component on the other components, and on overall system performance
- exploiting partial failure to recover from failure


Dependable Systems
Main features of dependable systems:
- Availability
- Reliability
- Safety
- Maintainability
Dependability is closely related to fault tolerance.

Availability
Definition: availability is the property that a system is ready for immediate use.
This means that availability is the probability that a system is operating correctly at any given moment, ready to provide users with its functions. So, a highly available system is one that is most likely to be ready and working at any given instant of time.
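As a rough illustration of availability as a probability, the sketch below estimates it as the fraction of an observation period in which the system was usable. The function name `availability` and the two example systems are assumptions made for illustration; they are not part of the original slides.

```python
def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Fraction of the observed period during which the system was usable."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("observation period must be positive")
    return uptime_seconds / total


HOUR = 3600.0
YEAR = 365 * 24 * HOUR

# Hypothetical system A: unavailable for 1 ms in every hour.
a = availability(HOUR - 0.001, 0.001)

# Hypothetical system B: switched off for two weeks in every year.
two_weeks = 14 * 24 * HOUR
b = availability(YEAR - two_weeks, two_weeks)

print(f"A: {a:.7%}")
print(f"B: {b:.2%}")
```

Under these assumed numbers, A comes out above 99.9999% available while B is roughly 96% available, even though A fails far more often than B.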

Reliability
Definition: reliability is the property that a system can run continuously without failure.
This means that reliability is defined in terms of a time interval, rather than an instant as in the case of availability. So, a highly reliable system is one that is most likely to keep running for a long period of time.
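To make the "time interval" aspect concrete, here is a minimal sketch that assumes a constant failure rate, so that the probability of surviving an interval of length t is exp(-t / MTTF). The exponential model, the name `reliability`, and the figures used are assumptions for illustration only; the slides do not commit to any particular failure distribution.

```python
import math


def reliability(t_hours: float, mttf_hours: float) -> float:
    """Probability of running t_hours without failure.

    Assumes a constant failure rate of 1 / mttf_hours (exponential model).
    """
    if mttf_hours <= 0 or t_hours < 0:
        raise ValueError("mttf_hours must be positive and t_hours non-negative")
    return math.exp(-t_hours / mttf_hours)


# Hypothetical system with a mean time to failure of 1000 hours.
for interval in (1, 100, 1000, 5000):
    print(f"{interval:>5} h: {reliability(interval, 1000):.4f}")
```

The longer the interval, the lower the probability of getting through it without a failure, which is why reliability cannot be stated for a single instant the way availability can.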

Safety
Definition: safety refers to the situation in which, when a system temporarily fails to operate correctly, nothing catastrophic happens.
This is a very difficult property to define, and to ensure as well.

Maintainability
Definition: maintainability refers to how easily a failed system can be repaired.
This means that maintainability is closely related to availability. So, a highly maintainable system may also show a high degree of availability.
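One common way to see the link between maintainability and availability is the steady-state formula availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR the mean time to repair. The formula is standard in dependability texts but is not stated in these slides, and the function name and numbers below are hypothetical.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run availability from mean time to failure and mean time to repair."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Same hypothetical failure behaviour, two different repair times.
slow_repair = steady_state_availability(mttf_hours=1000, mttr_hours=10)
fast_repair = steady_state_availability(mttf_hours=1000, mttr_hours=1)

print(f"repair in 10 h: {slow_repair:.4%}")
print(f"repair in  1 h: {fast_repair:.4%}")
```

With the failure behaviour held fixed, shortening the repair time is the only thing that changes, and availability rises accordingly.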

Faults I
Failure
- A system is said to fail when it does not behave as promised.
- An error is a part of a system's state that may lead to a failure.
- The cause of an error is a fault.

Faults II
Fault tolerance and dependable systems
Building a dependable system closely relates to controlling faults. One may distinguish between:
- preventing faults
- removing faults
- forecasting faults
In distributed systems, the most important issue is fault tolerance: the property of a system to provide its function even in the presence of faults.

Faults III
Sorts of faults
- Transient faults occur once, then disappear.
- Intermittent faults occur, vanish of their own accord, then reappear, and so on.
- Permanent faults persist until the faulty component is replaced or fixed.
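The three sorts of faults differ only in how they behave over time, which a toy timeline can show. In the sketch below each fault is modelled as a function from a discrete time step to "active or not"; the step numbers and function names are arbitrary assumptions chosen for illustration.

```python
from typing import Callable


def transient(step: int) -> bool:
    """Active at exactly one step, then never again."""
    return step == 2


def intermittent(step: int) -> bool:
    """Active at every third step: appears, vanishes, reappears."""
    return step % 3 == 2


def permanent(step: int) -> bool:
    """Active from step 2 onward, until the component is replaced or fixed."""
    return step >= 2


def trace(fault: Callable[[int], bool], steps: int = 9) -> str:
    """Render a timeline: 'X' where the fault is active, '.' where it is not."""
    return "".join("X" if fault(s) else "." for s in range(steps))


for fault in (transient, intermittent, permanent):
    print(f"{fault.__name__:>12}: {trace(fault)}")
```

Retrying an operation is usually enough against the first pattern and often against the second, while the third needs the component to be repaired or replaced.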

Failure Models
Different types of failures [TvS07]

Failure Masking by Redundancy
Idea
- Hiding failures from other processes
- The key technique for masking faults is redundancy
Three kinds of redundancy
- Information redundancy: e.g., extra bits
- Time redundancy: e.g., redoing an action after a transaction aborts
- Physical redundancy: typical in biological systems
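The three kinds of redundancy can each be sketched in a few lines. The examples chosen here (an even-parity bit, a bounded retry loop, and majority voting over replicated answers) are illustrative assumptions rather than the techniques prescribed by the slides, and all function names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


# Information redundancy: one extra (even-parity) bit detects a single bit flip.
def add_parity(bits: List[int]) -> List[int]:
    return bits + [sum(bits) % 2]


def parity_ok(word: List[int]) -> bool:
    return sum(word) % 2 == 0


# Time redundancy: redo the action if it fails, up to a fixed number of attempts.
def retry(action: Callable[[], T], attempts: int = 3) -> T:
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
    raise RuntimeError("all attempts failed") from last_error


# Physical redundancy: ask several replicas and accept the majority answer.
def majority(replies: Sequence[T]) -> T:
    value, count = Counter(replies).most_common(1)[0]
    if count <= len(replies) // 2:
        raise ValueError("no majority among replies")
    return value


word = add_parity([1, 0, 1, 1])
print("parity intact:", parity_ok(word))
word[0] ^= 1  # simulate a single corrupted bit
print("parity after flip:", parity_ok(word))

print("majority of [7, 7, 9]:", majority([7, 7, 9]))
```

Note that the parity bit only detects the corruption, whereas the retry loop and the voter actually mask the fault from the caller, provided it is transient or confined to a minority of replicas.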

References
[TvS07] Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms. Pearson Prentice Hall, Upper Saddle River, NJ, USA, 2nd edition, 2007.
