The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer


The paper:
- proposes a formal definition for the timed asynchronous distributed system model
- presents measurements of process scheduling delays and hardware clock drifts

Distributed systems can be classified as synchronous or asynchronous, depending on whether the underlying communication layer provides "certain communication".

Definition: certain communication =
1) at any time there is a minimum number of correct processes
2) any message m sent between two correct processes is received within a known amount of time (i.e. the probability that the message is not received and processed in time is negligible)

Definition:
- synchronous system = a system that guarantees certain communication
- asynchronous system = a system that is not synchronous

In a synchronous system the frequency of failures must be bounded in order to achieve certain communication (e.g. by using space or time redundancy at lower levels).

Note: the paper was written to give definitions and explain the timed asynchronous distributed system model. This helps the design and proper understanding of fault-tolerant (dependable) systems.

Citation [5]: the basic building blocks of a fault-tolerant system are services, servers and the "depends" relation.
- servers implement services by using other services that are implemented by other servers => server u depends on server r if the correctness of u's behavior depends on the correctness of r's behavior.
- server u is the user, or client, of r, while r is called a resource for u.

Definition: dependable systems = distributed systems composed of a set of processes together with a "depends upon" relation, characterized by strict stochastic specifications (a minimum probability that the standard behavior is observed at run time, and a maximum probability that a potentially catastrophic failure, different from the specified failure behavior, is observed).
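The "depends" relation can be pictured as a directed graph over servers. The sketch below is only an illustration, assuming the relation is treated transitively; the server names and the helper `transitive_dependencies` are hypothetical and do not come from the paper.

```python
from typing import Dict, Set

# Hypothetical example graph: each server maps to the resources it uses directly.
depends_on: Dict[str, Set[str]] = {
    "atomic_broadcast": {"membership"},
    "membership": {"datagram", "clock"},
    "datagram": set(),
    "clock": set(),
}


def transitive_dependencies(graph: Dict[str, Set[str]], server: str) -> Set[str]:
    """Return every server whose correctness `server` relies on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(server, ()))
    while stack:
        resource = stack.pop()
        if resource not in seen:
            seen.add(resource)
            # Follow the resources that this resource itself depends on.
            stack.extend(graph.get(resource, ()))
    return seen
```

Calling `transitive_dependencies(depends_on, "atomic_broadcast")` collects the membership, datagram and clock servers, i.e. every resource whose failure could affect the user.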
Observation: not all systems have such strict stochastic specifications; where the specifications are relaxed, we cannot assume that communication is certain (the system is not synchronous).

Asynchronous systems classification:
- systems based on the time-free model:
  o services are time-free (transitions are specified by the input, with no time restrictions)
  o interprocess communication is reliable (any message sent is eventually delivered at the destination)
  o processes have crash-failure semantics
  o processes have no access to hardware clocks
- timed asynchronous distributed system model (timed model):
  o all services are timed
  o messages are unreliable datagrams with omission/performance-failure (late delivery) semantics
  o processes have crash/performance-failure semantics
  o processes have access to a hardware clock
  o there is no bound on the frequency of communication and process failures

Examples of distributed services implementable in the timed model:
- clock synchronization
- membership consensus
- election
- atomic broadcast

The failure semantics of interprocess communication in time-free systems is much stronger than in the timed model (* page 2).

Goals of the paper:
1) give a formal definition for the timed asynchronous distributed system model
2) provide measurements of message and process scheduling delays and clock drifts to confirm that this model accurately describes current distributed systems built from networked workstations
3) give an intuitive explanation of why consensus and leader election are implementable in timed systems (they cannot be implemented in time-free systems)

THE MODEL

TADS = timed asynchronous distributed system = a set of processes P that run on the computer nodes of a network
- processes on the same node are local (otherwise remote)
- processes have access to a local hardware clock

A. Hardware Clocks
- hardware clock = an oscillator + a counting register
- the clock is incremented by a value G, the granularity
- definition: correct clocks display strictly monotonically increasing values

RT: real-time values (global time)
CT: clock-time values

H_p : RT -> CT (process p's clock)

Drift rate = how many microseconds per second a hardware clock drifts apart from real time.

Assume the existence of a constant maximum drift rate ρ << 1 that bounds the absolute value of the drift rate.

Definition: a hardware clock is correct at time u if any interval of time before u was measured correctly (within a specified error that is a function of ρ).
- clocks can be calibrated to reduce the drift rate: the speed of the clock is changed by a factor c (see page 3)
- externally synchronized clocks: the deviation between a correct clock and real time is bounded by a known constant
- internally synchronized clocks: the deviation between two correct clocks is bounded by a known constant

Clock Failure Assumption: each non-crashed process has access to a correct hardware clock (the hardware clock drifts by at most ρ). => if a process is not crashed at time t, its hardware clock is correct at time t.

B. Datagram Service
- provides primitives for transmitting unicast and broadcast messages
- primitives:
  o send(m, q)
  o broadcast(m)
  o deliver(m, p)
- transmission delay of a message m = receive time - send time

Requirements for a datagram service:
- Validity: every message received was sent by the process that is registered as its sender (if process q receives a message from process p at time t, then process p sent the message at a time u < t), i.e. no spoofing
- No-duplication
- Min-delay: any message m transmitted between two remote processes has a transmission delay of at least d

Further properties:
- the datagram service does not guarantee an upper bound on the transmission delay of messages
- a one-way time-out delay δ (delta) can be defined
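The minimum delay d and the one-way time-out delay δ split delivered messages into classes. The following is a minimal sketch of that classification, assuming send and receive times are given as real-time values in the same unit; the names `Delivery` and `classify` are hypothetical and only serve the illustration.

```python
from enum import Enum
from typing import Optional


class Delivery(Enum):
    DROPPED = "dropped"  # omission failure: the message never arrives
    EARLY = "early"      # delay below the minimum delay d
    TIMELY = "timely"    # delay within [d, delta]
    LATE = "late"        # performance failure: delay above delta


def classify(send_time: float, receive_time: Optional[float],
             d_min: float, delta: float) -> Delivery:
    """Classify one message by its transmission delay (receive time - send time)."""
    if receive_time is None:
        return Delivery.DROPPED
    delay = receive_time - send_time
    if delay < d_min:
        return Delivery.EARLY
    if delay <= delta:
        return Delivery.TIMELY
    return Delivery.LATE
```

For example, with d = 1 ms and δ = 30 ms, a message that takes 5 ms is timely, while one that takes 50 ms counts as a performance failure.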

- a message that arrives in less time than d is early. In timed systems there are no early messages; d is well chosen.
- how δ is chosen depends on the application's needs

Datagram Failure Assumption: the asynchronous datagram service is assumed to have omission/performance-failure semantics. It can drop messages or deliver them late. The following probabilities are negligible:
- source address spoofing
- message corruption
- message duplication (the same message is delivered multiple times to the same process)

C. Process Management Service
1. Process modes: up, crashed, recovering (+ state diagram)
2. Alarm clocks: a process p suffers a performance failure if it is awakened too late.

Process Failure Assumption: the timed model assumes that processes have crash/performance-failure semantics. The execution of a process can stop prematurely (crash), or the process can be awakened too late.
- the probability that a processor executes the program of a process incorrectly is negligible (if this assumption cannot be made, use two processors in lock-step execution)

Conclusion:
- the core of TADS assumes:
  o a datagram service
  o the process management service
  o local hardware clocks
- optional extensions:
  o stable storage
  o progress assumption (not valid for large-scale systems connected by wide area networks)

EXTENSIONS

Stable storage:
- the state of a process is saved on disk, so when the process recovers from a crash it can read the latest state back from the disk

Stability and Progress Assumption:

Definition: stable partition of processes:
- all processes are timely
- only a bounded number of messages are not delivered timely
- from any other partition, either no messages arrive or the messages arrive late

Δ-F-partitions (partition predicate over a time interval [s, t])

Definition: p and q are F-connected iff (1) p and q are timely in [s, t], and (2) at most F messages sent between the two processes in [s, t] are late.

Definition: process p is Δ-disconnected from q in [s, t] iff any message m sent from p to q in the interval [s, t] is late (the transmission delay is > δ).

Definition: the set S is a Δ-F-partition in the interval [s, t] iff all processes in S are F-connected and Δ-disconnected from all other processes (not in S).

Progress assumption: long periods of stability, followed by short periods of instability.

COMMUNICATION BY TIME
- synchronous systems: if process p does not hear "I am alive" from q in time, p knows that q crashed
- asynchronous systems: even if process p does not hear "I am alive" from q in time, it does not know whether q crashed, q is slow, or the network is overloaded (and the message was dropped or is late)
=> asynchronous systems are characterized by communication uncertainty, so the problem of leader re-election is more complicated in asynchronous systems.

Example: processes p and q communicate by time using a locking mechanism, to ensure that at most one of them is the leader at any time (page 12).

POSSIBILITY AND IMPOSSIBILITY ISSUES
- problems like election and consensus are implementable in actual distributed computing systems, while they do not allow a deterministic solution in
  o the time-free model
  o the core timed model

Addition by Viraj

There was a discussion about whether the paper implicitly assumes global knowledge. The argument was never resolved, but most participants felt that it does. There was also confusion about whether the paper assumes that the receiver receives at least or at most one copy of a message from the sender. That discussion ended the same way as the previous one.
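The idea of communication by time can be illustrated with a lease-style lock. This is a sketch under stated assumptions, not the protocol from page 12 of the paper: p treats itself as leader for `lease` units of its own clock, counted from the moment it sends its message, and q waits `lease * (1 + rho) / (1 - rho)` units of its own clock after receiving that message before taking over. Clocks are modelled as linear functions of real time with drift at most ρ; all names (`DriftingClock`, `takeover_wait`, `leadership_windows`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DriftingClock:
    """Hardware clock modelled as H(t) = offset + rate * t, with |rate - 1| <= rho."""
    rate: float
    offset: float = 0.0

    def read(self, real_time: float) -> float:
        return self.offset + self.rate * real_time

    def real_time_when(self, clock_value: float) -> float:
        # Inverse of read(): the real time at which the clock shows clock_value.
        return (clock_value - self.offset) / self.rate


def takeover_wait(lease: float, rho: float) -> float:
    """Clock units q must wait so that even a slow p and a fast q cannot overlap."""
    return lease * (1 + rho) / (1 - rho)


def leadership_windows(p_clock: DriftingClock, q_clock: DriftingClock,
                       send_rt: float, delay: float,
                       lease: float, rho: float) -> Tuple[float, float]:
    """Return (real time p's lease expires, real time q may take over)."""
    # p starts its lease when it sends the message, measured on p's clock.
    p_expiry_rt = p_clock.real_time_when(p_clock.read(send_rt) + lease)
    # q starts waiting when the message arrives, measured on q's clock.
    receive_rt = send_rt + delay
    q_takeover_rt = q_clock.real_time_when(
        q_clock.read(receive_rt) + takeover_wait(lease, rho))
    return p_expiry_rt, q_takeover_rt
```

The design choice is that neither process needs to hear from the other when the lease ends: a slow p detects expiry by reading its own clock before acting as leader, and q relies only on its own clock plus the drift bound ρ. No upper bound on the message delay is required, because a longer delay only pushes q's takeover later.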
