Clock and Time. THOAI NAM Faculty of Information Technology HCMC University of Technology

Similar documents
Last Class: Naming. Today: Classical Problems in Distributed Systems. Naming. Time ordering and clock synchronization (today)

Last Class: Naming. DNS Implementation

Synchronization. Distributed Systems IT332

Today: Fault Tolerance. Reliable One-One Communication

Last Class: Clock Synchronization. Today: More Canonical Problems

Synchronization Part 1. REK s adaptation of Claypool s adaptation of Tanenbaum s Distributed Systems Chapter 5

Synchronization. Chapter 5

Synchronization. Clock Synchronization

殷亚凤. Synchronization. Distributed Systems [6]

Clock Synchronization. Synchronization. Clock Synchronization Algorithms. Physical Clock Synchronization. Tanenbaum Chapter 6 plus additional papers

Last Class: Clock Synchronization. Today: More Canonical Problems

Synchronisation in. Distributed Systems. Co-operation and Co-ordination in. Distributed Systems. Kinds of Synchronisation. Clock Synchronization

Recovering from a Crash. Three-Phase Commit

CSE 5306 Distributed Systems

Distributed Systems COMP 212. Lecture 17 Othon Michail

Lecture 10: Clocks and Time

CSE 5306 Distributed Systems. Synchronization

Time Synchronization and Logical Clocks

Today: Fault Tolerance. Failure Masking by Redundancy

Advanced Topics in Distributed Systems. Dr. Ayman A. Abdel-Hamid. Computer Science Department Virginia Tech

Lecture 12: Time Distributed Systems

Verteilte Systeme (Distributed Systems)

Three Models. 1. Time Order 2. Distributed Algorithms 3. Nature of Distributed Systems1. DEPT. OF Comp Sc. and Engg., IIT Delhi

TIME ATTRIBUTION 11/4/2018. George Porter Nov 6 and 8, 2018

Clock-Synchronisation

Distributed Coordination 1/39

Time Synchronization and Logical Clocks

Distributed Systems. Clock Synchronization: Physical Clocks. Paul Krzyzanowski

CMPSCI 677 Operating Systems Spring Lecture 14: March 9

Fault Tolerance. Distributed Systems IT332

Chapter 6 Synchronization

Time in Distributed Systems

Time. COS 418: Distributed Systems Lecture 3. Wyatt Lloyd

Failure Tolerance. Distributed Systems Santa Clara University

Today: Fault Tolerance. Fault Tolerance

New Topic: Naming. Approaches

OUTLINE. Introduction Clock synchronization Logical clocks Global state Mutual exclusion Election algorithms Deadlocks in distributed systems

Today CSCI Recovery techniques. Recovery. Recovery CAP Theorem. Instructor: Abhishek Chandra

2. Time and Global States Page 1. University of Freiburg, Germany Department of Computer Science. Distributed Systems

Distributed Operating Systems. Distributed Synchronization

Today: Fault Tolerance. Replica Management

Distributed Synchronization. EECS 591 Farnam Jahanian University of Michigan


SYNCHRONIZATION. DISTRIBUTED SYSTEMS Principles and Paradigms. Second Edition. Chapter 6 ANDREW S. TANENBAUM MAARTEN VAN STEEN

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

Lecture 6: Logical Time

Lecture 11: February 29

CSE 124: TIME SYNCHRONIZATION, CRISTIAN S ALGORITHM, BERKELEY ALGORITHM, NTP. George Porter October 27, 2017

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault-Tolerant Computer Systems ECE 60872/CS Recovery

Distributed Systems. 05. Clock Synchronization. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems Principles and Paradigms

DISTRIBUTED REAL-TIME SYSTEMS

Proseminar Distributed Systems Summer Semester Paxos algorithm. Stefan Resmerita

CSE 124: LAMPORT AND VECTOR CLOCKS. George Porter October 30, 2017

Event Ordering. Greg Bilodeau CS 5204 November 3, 2009

ICT 6544 Distributed Systems Lecture 7

Distributed Coordination! Distr. Systems: Fundamental Characteristics!

Synchronization Part 2. REK s adaptation of Claypool s adaptation oftanenbaum s Distributed Systems Chapter 5 and Silberschatz Chapter 17

Chapter 6 Synchronization (1)

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Distributed Systems Exam 1 Review. Paul Krzyzanowski. Rutgers University. Fall 2016

Time in Distributed Systems

Multi-threaded programming in Java

Coordination 2. Today. How can processes agree on an action or a value? l Group communication l Basic, reliable and l ordered multicast

Distributed Systems Synchronization

Recovery. Independent Checkpointing

Distributed Systems. coordination Johan Montelius ID2201. Distributed Systems ID2201

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Distributed Systems. Rik Sarkar James Cheney Global State & Distributed Debugging February 3, 2014

Distributed Systems COMP 212. Revision 2 Othon Michail

2017 Paul Krzyzanowski 1

Distributed Systems COMP 212. Lecture 19 Othon Michail

Lecture 7: Logical Time

PROCESS SYNCHRONIZATION

Distributed Systems (ICE 601) Fault Tolerance

CSE 486/586 Distributed Systems Reliable Multicast --- 1

CSE 486/586 Distributed Systems

wait with priority An enhanced version of the wait operation accepts an optional priority argument:

Distributed Computing. CS439: Principles of Computer Systems November 19, 2018

Distributed Computing. CS439: Principles of Computer Systems November 20, 2017

DISTRIBUTED MUTEX. EE324 Lecture 11

Snapshot Protocols. Angel Alvarez. January 17, 2012

Fault Tolerance. Basic Concepts

Arranging lunch value of preserving the causal order. a: how about lunch? meet at 12? a: <receives b then c>: which is ok?

Homework #2 Nathan Balon CIS 578 October 31, 2004

Distributed Systems Exam 1 Review Paul Krzyzanowski. Rutgers University. Fall 2016

Checkpointing HPC Applications

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance

CHAPTER 3 RECOVERY & CONCURRENCY ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI

CSE 380 Computer Operating Systems

Concurrency Control in Distributed Systems. ECE 677 University of Arizona

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer?

Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Ad hoc and Sensor Networks Time Synchronization

Fault Tolerance. Distributed Systems. September 2002

Network Time Protocol

Distributed Systems. Time, clocks, and Ordering of events. Rik Sarkar. University of Edinburgh Fall 2014

Chapter 8 Fault Tolerance

Transcription:

Clock and Time THOAI NAM Faculty of Information Technology HCMC University of Technology Using some slides of Prashant Shenoy, UMass Computer Science

Chapter 3: Clock and Time Time ordering and clock synchronization Virtual time (logical clock) Distributed snapshot (global state) Consistent/Inconsistent global state Rollback Recovery

Clock Synchronization Time in unambiguous in centralized systems System clock keeps time, all entities use this for time Distributed systems: each node has own system clock Crystal-based clocks are less accurate (1 part in million) Problem: An event that occurred after another may be assigned an earlier time

Physical Clocks: A Primer Accurate clocks are atomic oscillators 1s ~ 9,192,631,770 transitions of the cesium 133 atom Most clocks are less accurate (e.g., mechanical watches) Computers use crystal-based blocks (one part in million) Results in clock drift How do you tell time? Use astronomical metrics (solar day) Universal coordinated time (UTC) international standard based on atomic time Add leap seconds to be consistent with astronomical time UTC broadcast on radio (satellite and earth) Receivers accurate to 0.1 10 ms Need to synchronize machines with a master or with one another

Clock Synchronization Each clock has a maximum drift rate r» 1-r <= dc/dt <= 1+r Two clocks may drift by 2r Dt in time Dt To limit drift to d => resynchronize every d/2r seconds

Cristian s Algorithm Synchronize machines to a time server with a UTC receiver Machine P requests time from server every d/2r seconds Receives time t from server, P sets clock to t+t reply where t reply is the time to send reply to P Use (t req +t reply )/2 as an estimate of t reply Improve accuracy by making a series of measurements

Berkeley Algorithm Used in systems without UTC receiver Keep clocks synchronized with one another One computer is master, other are slaves Master periodically polls slaves for their times» Average times and return differences to slaves» Communication delays compensated as in Cristian s algorithm Failure of master => election of a new master

Berkeley Algorithm a) The time daemon asks all the other machines for their clock values b) The machines answer c) The time daemon tells everyone how to adjust their clock

Distributed Approaches Both approaches studied thus far are centralized Decentralized algorithms: use resynchronization intervals Broadcast time at the start of the interval Collect all other broadcast that arrive in a period S Use average value of all reported times Can throw away few highest and lowest values Approaches in use today rdate: synchronizes a machine with a specified machine Network Time Protocol (NTP)» Uses advanced techniques for accuracies of 1-50 ms

Logical Clocks For many problems, internal consistency of clocks is important Absolute time is less important Use logical clocks Key idea: Clock synchronization need not be absolute If two machines do not interact, no need to synchronize them More importantly, processes need to agree on the order in which events occur rather than the time at which they occurred

Event Ordering Problem: define a total ordering of all events that occur in a system Events in a single processor machine are totally ordered In a distributed system: No global clock, local clocks may be unsynchronized Can not order events on different machines using local times Key idea [Lamport ] Processes exchange messages Message must be sent before received Send/receive used to order events (and synchronize clocks)

Happened-Before Relation If A and B are events in the same process and A executed before B, then A -> B If A represents sending of a message and B is the receipt of this message, then A -> B Relation is transitive: A -> B and B -> C => A -> C Relation is undefined across processes that do not exchange messages Partial ordering on events

Event Ordering Using HB Goal: define the notion of time of an event such that If A-> B then C(A) < C(B) If A and B are concurrent, then C(A) <, = or > C(B) Solution: Each processor maintains a logical clock LC i Whenever an event occurs locally at I, LC i = LC i +1 When i sends message to j, piggyback LC i When j receives message from i» If LC j < LC i then LC j = LC i +1 else do nothing Claim: this algorithm meets the above goals

Lamport s Logical Clocks

More Canonical Problems Causality Vector timestamps Global state and termination detection Election algorithms

Causality Lamport s logical clocks If A -> B then C(A) < C(B) Reverse is not true!!» Nothing can be said about events by comparing time-stamps!» If C(A) < C(B), then?? Need to maintain causality Causal delivery:if send(m) -> send(n) => deliver(m) -> deliver(n) Capture causal relationships between groups of processes Need a time-stamping mechanism such that:» If T(A) < T(B) then A should have causally preceded B

Vector Clocks Each process i maintains a vector V i V i [i] : number of events that have occurred at process i V i [j] : number of events occurred at process j that process i knows Update vector clocks as follows Local event: increment V i [i] Send a message: piggyback entire vector V Receipt of a message:» V j [i] = V j [i]+1» Receiver is told about how many events the sender knows occurred at another process k V j [k] = max( V j [k],v i [k] ) Homework: convince yourself that if V(A)<V(B), then A causally precedes B

Global State Global state of a distributed system Local state of each process Messages sent but not received (state of the queues) Many applications need to know the state of the system Failure recovery, distributed deadlock detection Problem: how can you figure out the state of a distributed system? Each process is independent No global clock or synchronization Distributed snapshot: a consistent global state

Consistent/Inconsistent Cuts a) A consistent cut b) An inconsistent cut

Distributed Snapshot Algorithm Assume each process communicates with another process using unidirectional point-to-point channels (e.g, TCP connections) Any process can initiate the algorithm Checkpoint local state Send marker on every outgoing channel On receiving a marker Checkpoint state if first marker and send marker on outgoing channels, save messages on all other channels until: Subsequent marker on a channel: stop saving state for that channel

Distributed Snapshot A process finishes when It receives a marker on each incoming channel and processes them all State: local state plus state of all channels Send state to initiator Any process can initiate snapshot Multiple snapshots may be in progress» Each is separate, and each is distinguished by tagging the marker with the initiator ID (and sequence number) A M M B C

Snapshot Algorithm Example (1) (a) Organization of a process and channels for a distributed snapshot

Snapshot Algorithm Example (2) (b) Process Q receives a marker for the first time and records its local state (c) Q records all incoming message (d) Q receives a marker for its incoming channel and finishes recording the state of the incoming channel

Recovery Techniques thus far allow failure handling Recovery: operations that must be performed after a failure to recover to a correct state Techniques: Checkpointing:» Periodically checkpoint state» Upon a crash roll back to a previous checkpoint with a consistent state

Independent Checkpointing Each processes periodically checkpoints independently of other processes Upon a failure, work backwards to locate a consistent cut Problem: if most recent checkpoints form inconsistenct cut, will need to keep rolling back until a consistent cut is found Cascading rollbacks can lead to a domino effect.

Coordinated Checkpointing Take a distributed snapshot Upon a failure, roll back to the latest snapshot All process restart from the latest snapshot

Message Logging Checkpointing is expensive All processes restart from previous consistent cut Taking a snapshot is expensive Infrequent snapshots => all computations after previous snapshot will need to be redone [wasteful] Combine checkpointing (expensive) with message logging (cheap) Take infrequent checkpoints Log all messages between checkpoints to local stable storage To recover: simply replay messages from previous checkpoint» Avoids recomputations from previous checkpoint