CSE 120: Principles of Operating Systems. Lecture 13: Distributed Systems

CSE 120: Principles of Operating Systems
Lecture 13: Distributed Systems
December 2, 2003
Prof. Joe Pasquale
Department of Computer Science and Engineering
University of California, San Diego
© 2003 by Joseph Pasquale

Before We Begin
  Read Chapters 15, 17 (on Distributed Systems topics)

What is a Distributed System?
  Computers connected by a network that can cooperate
  Degree of integration may be loose
    messages between inter-machine processes
    email, ftp
  Or it may be medium (network operating system)
    remotely execute commands, cross-mount file systems
  Or it may be tight (distributed operating system)
    process migration, distributed file system

Advantages
  Speed: more resources, parallelism, less contention
  Reliability: resource redundancy, fault tolerance
  Scalability: incremental growth, can start cheap
  Economy of scale: amortizing cost of shared resources
  Geographic distribution: communication, reliability

Disadvantages
  Global state uncertainty
    no shared memory
    no shared clock
  Simultaneous decisions may conflict
    mutually conflicting decision problem
  Distributed algorithms are complex
    all interactions via messages (no shared memory)
    distributed debugging is difficult

Which is Better?
  Single fast server with single queue
  Multiple slower servers with separate queues
  Multiple slower servers with single queue
  [Figure: queueing diagrams labeled with arrival rates λ and λ/2 and service rate µ]
  Little's Law: N = λW
    (N = average number of jobs in the system, λ = arrival rate, W = average time a job spends in the system)

Performance: Load Balancing
  Load: average number of runnable processes
  Keep load balanced across all nodes
  When a process is created, consider where to run it
    node where it was submitted?
    node that is least loaded?
    random selection?
  If load becomes unbalanced, migrate processes
    may be more trouble than it is worth

Reliability: File Replication
  Maintain multiple copies of files on separate nodes
  When a file is requested, obtain it from a node having a copy
    primary and backup
    all nodes are equal
  Improves availability (and maybe even performance)
  Problem: updates
    when a file is written, how are the copies updated?
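
The "Which is Better?" slide can be explored numerically. The sketch below is not from the lecture: it assumes Poisson arrivals and exponential service times (the standard M/M/1 and M/M/2 queueing models), exactly two slower servers, and that each slower server runs at half the rate of the fast one. The function names are hypothetical.

```python
def mm1_time_in_system(lam, mu):
    """Mean time in system W for one server with one queue (M/M/1 assumption)."""
    if lam >= mu:
        raise ValueError("unstable: arrival rate must be below service rate")
    return 1.0 / (mu - lam)


def mm2_time_in_system(lam, mu_each):
    """Mean time in system W for two servers sharing one queue (M/M/2 assumption)."""
    rho = lam / (2.0 * mu_each)  # utilization of each server
    if rho >= 1.0:
        raise ValueError("unstable: arrival rate must be below total service rate")
    return (1.0 / mu_each) / (1.0 - rho * rho)


def compare(lam, mu):
    """Compare the three configurations for arrival rate lam and total capacity mu."""
    return {
        # One server of rate mu, one queue.
        "single_fast": mm1_time_in_system(lam, mu),
        # Two servers of rate mu/2, each with its own queue and half the arrivals.
        "separate_queues": mm1_time_in_system(lam / 2.0, mu / 2.0),
        # Two servers of rate mu/2 fed from one shared queue.
        "shared_queue": mm2_time_in_system(lam, mu / 2.0),
    }


def jobs_in_system(lam, w):
    """Little's Law: N = lambda * W."""
    return lam * w


result = compare(lam=8.0, mu=10.0)
```

Under these assumptions the single fast server gives the lowest W, the shared queue comes second, and separate queues are worst, because a job can wait behind a busy server while the other one sits idle. Multiplying any W by λ gives the average number of jobs in that system.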

The Client/Server Model
  [Figure: client and server]
  Client
    Short-lived process that makes requests
    User side of application
  Server
    Exports well-defined request/response interface
    Long-lived process that waits for requests
    Upon receiving a request, carries it out (may spawn)

Peer-to-Peer
  Peer-to-peer is typically dynamic client/server
    a node may act as a client or server
    other peer-to-peer relationships are possible
  Example: peer-to-peer file sharing
    Client A requests file from server B
    Server B provides file to client A
    Client C can request file from A or B; say A
    A, formerly a client, now acts as server

Distributed File Systems
  File service: file system interface for remote clients
  File server: machine(s) that provides file service
    storage devices connected to file server
  Issues
    Naming: location transparency and independence
    Caching: client caches, server caches, consistency
    State: stateful vs. stateless file service
    Replication: number of replicas, updating

Event Ordering
  What is the order of events, given no shared clock/memory?
  Happened-before relation: ->
    A, B are events of the same process and A is before B: A -> B
    A is a send event, B is the receive event for the same message: A -> B
    If A -> B and B -> C, then A -> C
  Implementation
    Timestamp all events based on local clock
    Upon receiving a message, advance local clock (past the message's timestamp)
    Resolve ties by ordering machines
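
The Event Ordering implementation above can be sketched as a logical clock. This is a minimal illustration, not code from the lecture; the class name `LamportClock` and the `machine_id` parameter are assumptions. Timestamps are `(time, machine_id)` pairs, so ordinary tuple comparison resolves ties by machine order.

```python
class LamportClock:
    """Logical clock for one machine (hypothetical sketch)."""

    def __init__(self, machine_id):
        self.machine_id = machine_id
        self.time = 0

    def tick(self):
        """Timestamp a local event: advance the local clock by one."""
        self.time += 1
        return (self.time, self.machine_id)

    def send(self):
        """Timestamp a send event; the returned stamp travels with the message."""
        return self.tick()

    def receive(self, msg_timestamp):
        """Timestamp a receive event, first advancing past the message's time."""
        msg_time, _sender = msg_timestamp
        self.time = max(self.time, msg_time)
        return self.tick()


a = LamportClock(machine_id=1)
b = LamportClock(machine_id=2)

a.tick()                 # local event on A
sent = a.send()          # A sends a message
received = b.receive(sent)

# The send happened-before the receive, so its timestamp is smaller.
assert sent < received
```

Because the receiver jumps to at least the sender's time before stamping the receive event, A -> B always implies that A's timestamp is smaller than B's. The converse does not hold: a smaller timestamp alone does not prove that one event happened before the other.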

Mutual Exclusion
  Centralized approach
    single process acts as coordinator (server)
    request, reply (to allow entrance), release
  Distributed approach
    process sends a request with a timestamp (TS) to all processes
    waits until it receives all replies (ok to enter)
    enters critical section (may get requests; defers them)
    upon exiting, responds (to release) to all deferred requests
    TS used to order simultaneous requests

Atomic Transactions
  Program unit that must be executed atomically
    executed to completion, or not at all
  Distributed transactions
    transactions broken into multiple sub-transactions
    sub-transactions executed on different machines
  To make a distributed transaction atomic
    have all sub-transactions execute
    run two-phase commit protocol

Two-Phase Commit Protocol
  Phase 1
    Coordinator logs <prepare T>, sends it to all sites
    At each site, upon receiving <prepare T>
      either {log <no T>, reply} or {log <ready T>, reply}
  Phase 2
    Coordinator (C) waits to receive all responses (or times out)
    If all responses are <ready T>, commit; else abort
    Log decision, and send decision to all sites
  Log to stable storage

Stable Storage
  Uses multiple storage devices with independent failure modes
  Each (logical) block has 2 (or more) physical blocks
  To write: write first block; if successful, write second
  Write is successful only if all succeed
  Recovery
    If no detectable errors and blocks are the same, all is ok
    If detectable error in one block, copy good to bad
    If no errors but blocks are different, copy 2nd to 1st
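
The two phases of the commit protocol above can be sketched in a single process. This is an illustrative simulation under stated assumptions, not the lecture's code: the `Site` class, its `can_commit` and `reachable` flags, and the in-memory lists standing in for stable-storage logs are all hypothetical, and a timeout is modeled as a site returning no vote.

```python
class Site:
    """One participant in the transaction (hypothetical sketch)."""

    def __init__(self, name, can_commit=True, reachable=True):
        self.name = name
        self.can_commit = can_commit
        self.reachable = reachable
        self.log = []  # stands in for a log on stable storage

    def prepare(self, txn):
        """Phase 1: log and return a vote, or None to model a timeout."""
        if not self.reachable:
            return None
        vote = "ready" if self.can_commit else "no"
        self.log.append((vote, txn))
        return vote

    def decide(self, txn, decision):
        """Phase 2: record the coordinator's decision."""
        if self.reachable:
            self.log.append((decision, txn))


def two_phase_commit(txn, sites, coordinator_log):
    # Phase 1: log <prepare T>, then collect a vote from every site.
    coordinator_log.append(("prepare", txn))
    votes = [site.prepare(txn) for site in sites]

    # Phase 2: commit only if every site answered <ready T>;
    # a "no" vote or a missing (timed-out) vote forces an abort.
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    coordinator_log.append((decision, txn))
    for site in sites:
        site.decide(txn, decision)
    return decision


log = []
outcome = two_phase_commit("T1", [Site("s1"), Site("s2", can_commit=False)], log)
assert outcome == "abort"
```

The coordinator logs its decision before telling any site, so that after a crash it can read the log and resend the same decision. A real implementation would also need recovery logic for sites that voted ready and never heard the outcome, which this sketch omits.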

Leader Election
  Many distributed algorithms rely on a leader (e.g., coordinator)
  Need to determine whether a leader exists; if not, elect one
  Bully algorithm (elects leader L)
    Every process is numbered (priority): P1, P2, ...
    Pi sends request to L, gets no reply; tries to elect itself
    Pi sends election msg to all Pj, j > i; waits for replies
    No replies: Pi becomes L, sends msgs to Pj, j < i
    If some Pj replies, then Pi waits for the msg announcing the elected leader
    If no such msg arrives, Pi restarts the election algorithm

Byzantine Generals Problem
  Divisions of Byzantine army surround enemy camp
    generals must agree whether to attack at dawn
    all must agree; if only some attack, defeat
  Generals can only communicate via messengers
  Messengers may get captured (unreliable communication)
  Generals may be traitors (faulty processors)
  Can't create reliable communication from unreliable parts
  Can overcome faulty processors if n ≥ 3m + 1 (m of n faulty)
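
The bully algorithm and the n ≥ 3m + 1 bound above can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the lecture's code: processes are plain integers, the set `alive` says which ones would reply, messages are never lost, and no process fails during the election. The function names are hypothetical.

```python
def bully_election(initiator, alive):
    """Return the process elected leader when `initiator` starts an election.

    `alive` is the set of process numbers that are up and would reply.
    """
    if initiator not in alive:
        raise ValueError("initiator must be alive")
    current = initiator
    while True:
        # Send an election msg to every higher-numbered process.
        responders = [p for p in alive if p > current]
        if not responders:
            # Nobody higher replied: current becomes the leader.
            return current
        # Some higher process replied and takes over the election.
        # Picking the lowest responder shows the cascade; any choice
        # ends at the same leader.
        current = min(responders)


def max_faulty(n):
    """Largest m such that n >= 3m + 1."""
    return (n - 1) // 3


def can_tolerate(n, m):
    """True if n processes can overcome m faulty ones, per n >= 3m + 1."""
    return n >= 3 * m + 1


# Old leader 6 has crashed; process 2 notices and starts an election.
leader = bully_election(initiator=2, alive={1, 2, 3, 5})
assert leader == 5
```

With reliable replies the highest-numbered live process always wins, which is where the algorithm gets its name. The bound helpers show that four generals are the minimum needed to tolerate one traitor, and that three generals cannot tolerate any.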