Fault-tolerant design techniques. slides made with the collaboration of: Laprie, Kanoon, Romano

1 Fault-tolerant design techniques slides made with the collaboration of: Laprie, Kanoon, Romano

2 Fault Tolerance Key Ingredients

3 Error Processing Error detection: identification of the erroneous state(s). Error diagnosis: damage assessment. Error recovery: an error-free state is substituted for the erroneous state. Backward recovery: the system is brought back to a state visited before the error occurred, using recovery points (checkpoints). Forward recovery: the erroneous state is discarded and a correct one is determined, without losing any computation.

4 Fault Treatment

5 Fault Tolerant Strategies Fault tolerance in a computer system is achieved through redundancy in hardware, software, information, and/or time. Such redundancy can be implemented in static, dynamic, or hybrid configurations. Fault tolerance can be achieved by the following techniques: Fault masking is any process that prevents faults in a system from introducing errors. Examples: error-correcting memories and majority voting. Reconfiguration is the process of eliminating a faulty component from a system and restoring the system to some operational state.

6 Reconfiguration Approach Fault detection is the process of recognizing that a fault has occurred. Fault detection is often required before any recovery procedure can be initiated. Fault location is the process of determining where a fault has occurred so that an appropriate recovery can be initiated. Fault containment is the process of isolating a fault and preventing the effects of that fault from propagating throughout the system. Fault recovery is the process of regaining operational status via reconfiguration even in the presence of faults. 6

7 The Concept of Redundancy Redundancy is simply the addition of information, resources, or time beyond what is needed for normal system operation. Hardware redundancy is the addition of extra hardware, usually for the purpose of either detecting or tolerating faults. Software redundancy is the addition of extra software, beyond what is needed to perform a given function, to detect and possibly tolerate faults. Information redundancy is the addition of extra information beyond that required to implement a given function; for example, error detection codes.

8 The Concept of Redundancy (Cont'd) Time redundancy uses additional time to perform the functions of a system such that fault detection, and often fault tolerance, can be achieved. Transient faults are tolerated by this approach. The use of redundancy can provide additional capabilities within a system. But redundancy can have a significant impact on a system's performance, size, weight, and power consumption.

9 HARDWARE REDUNDANCY

10 Hardware Redundancy Static techniques use the concept of fault masking. These techniques are designed to achieve fault tolerance without requiring any action on the part of the system; they rely on voting mechanisms (also called passive redundancy or fault masking). Dynamic techniques achieve fault tolerance by detecting the existence of faults and performing some action to remove the faulty hardware from the system. That is, active techniques use fault detection, fault location, and fault recovery in an attempt to achieve fault tolerance (also called active redundancy).

11 Hardware Redundancy (Cont d) Hybrid techniques combine the attractive features of both the passive and active approaches. Fault masking is used in hybrid systems to prevent erroneous results from being generated. Fault detection, location, and recovery are also used to improve fault tolerance by removing faulty hardware and replacing it with spares. 11

12 Hardware Redundancy - A Taxonomy 12

13 Triple Modular Redundancy (TMR) Masks failure of a single component. Voter is a SINGLE POINT OF FAILURE. 13
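In software, the voter's 2-out-of-3 majority function can be sketched as follows (a minimal model of the hardware voter; names are illustrative):

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)

# A single faulty module is masked: two modules agree, one is corrupted.
assert tmr_vote(0b1011, 0b1011, 0b0010) == 0b1011
```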

14 Reliability of TMR Ideal voter (R_V(t) = 1): R_SYS(t) = R_M(t)^3 + 3 R_M(t)^2 [1 - R_M(t)] = 3 R_M(t)^2 - 2 R_M(t)^3. Non-ideal voter: R_SYS(t) = [3 R_M(t)^2 - 2 R_M(t)^3] R_V(t). With R_M(t) = e^(-λt): R_SYS(t) = 3 e^(-2λt) - 2 e^(-3λt).
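These formulas are easy to check numerically. The sketch below assumes, as on the slide, an exponentially distributed module lifetime; it also illustrates that TMR beats a single module only while R_M(t) > 0.5 (i.e., for short missions):

```python
import math

def r_module(lam: float, t: float) -> float:
    """Reliability of a single module with constant failure rate lam."""
    return math.exp(-lam * t)

def r_tmr(lam: float, t: float, r_voter: float = 1.0) -> float:
    """TMR reliability 3R^2 - 2R^3, scaled by the voter's reliability."""
    r = r_module(lam, t)
    return r_voter * (3 * r**2 - 2 * r**3)

# Short mission: TMR wins.  Long mission: a single module is more reliable.
```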

15 TMR with Triplicate Voters 15

16 Multistage TMR System 16

17 N-Modular Redundancy (NMR) Generalization of TMR employing N modules rather than 3. PRO: if N > 2f, up to f faults can be tolerated; e.g., 5MR allows tolerating the failure of two modules. CON: higher cost with respect to TMR.

18 Reliability Plot

19 Hardware vs Software Voters The decision to use hardware voting or software voting depends on: the availability of a processor to perform voting; the speed at which voting must be performed; the criticality of space, power, and weight limitations; the flexibility required of the voter with respect to future changes in the system. Hardware voting is faster, but at the cost of more hardware. Software voting is usually slower, but incurs no additional hardware cost.

20 Dynamic (or active) redundancy (state diagram: normal functioning; fault occurrence; error occurrence; fault containment and recovery; degraded functioning; failure occurrence)

21 Standby Sparing In standby sparing, one module is operational and one or more modules serve as standbys or spares. If a fault is detected and located, the faulty module is removed from operation and replaced with a spare. Hot standby sparing: the standby modules operate in synchrony with the online modules and are prepared to take over at any time. Cold standby sparing: the standby modules are unpowered until needed to replace a faulty module; this involves a momentary disturbance of the service.

22 Standby Sparing (Cont'd) Hot standby is used in applications such as process control, where the reconfiguration time needs to be minimized. Cold standby is used in applications where power consumption is extremely important. The key advantage of standby sparing is that a system containing n identical modules can often provide fault tolerance capabilities with significantly lower power consumption than n redundant/parallel modules.

23 Standby Sparing (Cont'd) Here, one of the N modules is used to provide the system's output and the remaining N-1 modules serve as spares.

24 Pair-and-a-Spare Technique The pair-and-a-spare technique combines the features of both standby sparing and duplication with comparison. Two modules are operated in parallel at all times and their results are compared to provide the error detection capability required in the standby sparing approach. A second duplicate (pair, and possibly more in the case of pair-and-k-spares) is used to take over when the working pair detects an error; a pair is always operational.

25 Pair-and-a-Spare Technique (Cont'd) (figure: two online modules whose outputs are compared, with spares and a switch on the output)

26 Pair-and-a-Spare Technique (Cont'd) Two modules are always online and compared, and a spare can replace either of the online modules.

27 (Placeholder in the original slides: figure of the plant/system to be inserted.)

28 Watchdog Timers The concept of a watchdog timer is that the lack of an action is indicative of a fault. A watchdog timer is a timer that must be reset on a repetitive basis. The fundamental assumption is that the system is fault-free if it possesses the capability to repetitively perform a function such as resetting a timer. The frequency at which the timer must be reset is application dependent. A watchdog timer can be used to detect faults in both the hardware and the software of a system.

29 Hybrid redundancy Key: combine passive and active redundancy schemes. NMR-with-spares example: 5 units, 3 in TMR mode, 2 spares, all 5 connected to a switch that can be reconfigured. Comparison with 5MR: 5MR can tolerate only two faults, whereas the hybrid scheme can tolerate three faults that occur sequentially. Cost of the extra fault tolerance: the switch.

30 Hybrid redundancy (figure: initially active modules, spares, switch, voter, output)

31 NMR with spares The idea here is to provide a basic core of N modules arranged in a form of voting configuration and spares are provided to replace failed units in the NMR core. The benefit of NMR with spares is that a voting configuration can be restored after a fault has occurred. 31

32 NMR with Spares (Cont d) The voted output is used to identify faulty modules, which are then replaced with spares. 32

33 Self-Purging Redundancy Similar to NMR with spares, except that here all the modules are active, whereas in NMR with spares some modules (the spares) are not active.

34 Sift-Out Modular Redundancy It uses N identical modules that are configured into a system using special circuits called comparators, detectors, and collectors. The comparator compares each module's output with the remaining modules' outputs. The detector determines which disagreements are reported by the comparator and disables a unit that disagrees with a majority of the remaining modules.

35 Sift-Out Modular Redundancy (Cont'd) The detector produces one signal value for each module: 1 if the module disagrees with the majority of the remaining modules, 0 otherwise. The collector produces the system's output, given the outputs of the individual modules and the signals from the detector that indicate which modules are faulty.

36 Sift-Out Modular Redundancy (Cont d) All modules are compared to detect faulty modules. 36

37 Hardware Redundancy - Summary Static techniques rely strictly on fault masking. Dynamic techniques do not use fault masking but instead employ detection, location, and recovery techniques (reconfiguration). Hybrid techniques employ both fault masking and reconfiguration. In terms of hardware cost, the dynamic technique is the least expensive, the static technique is in the middle, and the hybrid technique is the most expensive.

38 TIME REDUNDANCY

39 Time Redundancy - Transient Fault Detection In time redundancy, computations are repeated at different points in time and then compared. No extra hardware is required. 39

40 Time Redundancy - Permanent Fault Detection During the first computation, the operands are used as presented. During the second computation, the operands are encoded in some fashion. The encoding function is selected so as to allow faults in the hardware to be detected. Approaches used, e.g., in ALUs: recomputing with shifted operands; recomputing with swapped operands; ...
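Recomputing with shifted operands can be illustrated for addition: since (a << 1) + (b << 1) = (a + b) << 1, a fault tied to a fixed bit slice of the adder corrupts different result bits in the two runs, so the comparison catches it. A sketch, in which the faulty adder is a hypothetical stand-in for a permanent hardware fault:

```python
def add_with_reso(a: int, b: int, adder=None) -> int:
    """Recomputing With Shifted Operands (RESO) for addition.

    The sum is computed twice: once as given, once with both operands
    shifted left by one bit.  A permanent fault in a fixed bit position
    of the adder affects different result bits in the two runs, so the
    comparison detects it.
    """
    adder = adder or (lambda x, y: x + y)
    first = adder(a, b)
    second = adder(a << 1, b << 1)
    if first != (second >> 1):
        raise RuntimeError("permanent fault detected in adder")
    return first

# Hypothetical permanent fault: result bit 0 stuck at 1.
stuck_at_1 = lambda x, y: (x + y) | 1
```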

41 Time Redundancy - Permanent Fault Detection (Cont d) 41

42 SOFTWARE REDUNDANCY

43 Software Redundancy to Detect Hardware Faults Consistency checks use a priori knowledge about the characteristics of the information to verify the correctness of that information. Example: Range checks, overflow and underflow checks. Capability checks are performed to verify that a system possesses the expected capabilities. Examples: Memory test - a processor can simply write specific patterns to certain memory locations and read those locations to verify that the data was stored and retrieved properly. 43
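Both kinds of checks are easy to sketch in software; the bounds and bit patterns below are illustrative, not prescriptive:

```python
def check_temperature(reading_c: float) -> float:
    """Consistency (range) check on a sensor value; bounds are illustrative."""
    if not (-40.0 <= reading_c <= 125.0):
        raise ValueError(f"implausible temperature: {reading_c}")
    return reading_c

def memory_pattern_test(mem: list, patterns=(0x55, 0xAA)) -> bool:
    """Capability check: write known patterns and read them back.

    Note: destructive -- the memory region is overwritten by the test.
    """
    for p in patterns:
        for i in range(len(mem)):
            mem[i] = p
        if any(v != p for v in mem):
            return False
    return True
```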

44 Software Redundancy - to Detect Hardware Faults (Cont d) ALU tests: Periodically, a processor can execute specific instructions on specific data and compare the results to known results stored in ROM. Testing of communication among processors, in a multiprocessor, is achieved by periodically sending specific messages from one processor to another or writing into a specific location of a shared memory. 44

45 Software-Implemented Fault Tolerance Against Hardware Faults: an example. (figure: two processors with a comparator on their outputs.) An output mismatch (disagreement) triggers interrupts to both processors. Both run self-diagnostic programs. The processor that finds itself failure-free within a specified time continues operation; the other is tagged for repair.

46 Software Redundancy - to Detect Hardware Faults. One more example. All modern-day microprocessors use instruction retry: any transient fault that causes an exception, such as a parity violation, is handled by retrying the instruction. Very cost-effective, and now a standard technique.

47 Software Redundancy to Detect Software Faults There are two popular approaches: N-Version Programming (NVP) and Recovery Blocks (RB). NVP masks faults; RB is a backward error recovery scheme. In NVP, multiple versions of the same task are executed concurrently, whereas in the RB scheme, the versions of a task are executed serially. NVP relies on voting; RB relies on an acceptance test.

48 N-Version Programming (NVP) NVP is based on the principle of design diversity, that is, coding a software module by different teams of programmers to obtain multiple versions. Diversity can also be introduced by employing different algorithms for obtaining the same solution or by choosing different programming languages. NVP can tolerate both hardware and software faults. Correlated faults are not tolerated by NVP. In NVP, deciding the number of versions required to ensure an acceptable level of software reliability is an important design consideration.
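The execution-and-voting step of NVP can be sketched as follows; the three "versions" are toy stand-ins for independently developed implementations:

```python
from collections import Counter

def n_version_execute(versions, inputs):
    """Run diverse implementations of the same task and vote on their results.

    A result is accepted only if a strict majority of versions agree on it.
    """
    results = [v(*inputs) for v in versions]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value

# Three hypothetical versions of the same function, one of them buggy:
v1 = lambda x: x * x
v2 = lambda x: x ** 2
v3 = lambda x: x * x + 1   # faulty version, outvoted by the other two
```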

49 N-Version Programming (Cont d) 49

50 Recovery Blocks (RB) RB uses multiple alternates (backups) to perform the same function; one module (task) is primary and the others are secondary. The primary task executes first. When the primary task completes execution, its outcome is checked by an acceptance test. If the output is not acceptable, another task is executed after undoing the effects of the previous one (i.e., rolling back to the state at which primary was invoked) until either an acceptable output is obtained or the alternatives are exhausted. 50

51 Recovery Blocks (Cont d) 51

52 Recovery Blocks (Cont'd) The acceptance tests are usually sanity checks; these consist of making sure that the output is within a certain acceptable range or that the output does not change at more than the allowed maximum rate. Selecting the range for the acceptance test is crucial. If the allowed ranges are too small, the acceptance tests may label correct outputs as faulty (false positives). If they are too large, the probability that incorrect outputs will be accepted (false negatives) increases. RB can tolerate software faults because the alternates are usually implemented with different approaches; RB is also known as the primary-backup approach.
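The RB control flow described above (run an alternate, apply the acceptance test, roll back and retry on failure) can be sketched like this; representing the process state as a dictionary is an illustrative simplification:

```python
import copy

def recovery_block(alternates, acceptance_test, state):
    """Try the primary, then the secondaries, rolling back state each time.

    `alternates` are callables taking a state dict and returning an output;
    `acceptance_test` is a sanity check on that output.
    """
    checkpoint = copy.deepcopy(state)      # recovery point
    for alt in alternates:
        try:
            result = alt(state)
            if acceptance_test(result):
                return result
        except Exception:
            pass
        state.clear()                      # undo effects: roll back to the
        state.update(copy.deepcopy(checkpoint))  # state at which RB was invoked
    raise RuntimeError("all alternates exhausted")
```

For example, if the primary produces an out-of-range value, its side effects on `state` are undone and the backup runs from the original checkpointed state.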

53 Single Version Fault Tolerance: Software Rejuvenation Example: rebooting a PC. As a process executes, it acquires memory and file locks without properly releasing them, memory space tends to become increasingly fragmented, and the process can become faulty and stop executing. To head this off, proactively halt the process, clean up its internal state, and then restart it. Rejuvenation can be time-based or prediction-based. Time-based rejuvenation is performed periodically; the rejuvenation period must balance benefits against cost.

54 INFORMATION REDUNDANCY

55 Information Redundancy Guarantees data consistency by exploiting additional information to achieve a redundant encoding. Redundant codes make it possible to detect or correct bits corrupted because of one or more faults: Error Detection Codes (EDC); Error Correction Codes (ECC).

56 Functional Classes of Codes Single-error-correcting codes: any one-bit error can be detected and corrected. Burst-error-correcting codes: any set of b consecutive erroneous bits can be corrected. Independent-error-correcting codes: up to t errors can be detected and corrected. Multiple-character-correcting codes: n characters, t of which are wrong, can be recovered. Coding complexity goes up with the number of errors; sometimes partial correction is sufficient.

57 Redundant Codes Let: b be the code's alphabet size (the base, in the case of numerical codes); n the (constant) block size; N the number of elements to be coded; m the minimum value of n that allows encoding all the elements of the source code, i.e., the minimum m such that b^m >= N. A code is said to be: not redundant if n = m; redundant if n > m; ambiguous if n < m.

58 Binary Codes: Hamming distance The Hamming distance d(x,y) between two words x, y of a code C is the number of bit positions in which x and y differ: d(10010, ) = 4; d(11010, ) = 2. The minimum distance of a code is d_min = min d(x,y) over all x ≠ y in C.

59 Ambiguity and redundancy Not redundant codes: h = 1 (and n = m). Redundant codes: h >= 1 (and n > m). Ambiguous codes: h = 0.

60 Hamming Distance: Examples (table: five example codes for the words alfa, beta, gamma, delta, mu, with minimum distances h = 1, 1, 0, 2, 3 and classifications not redundant, ambiguous, redundant, redundant (EDC), redundant (ECC))

61 Error Detecting Codes (EDC) (figure: transmitter, TX, link with error, RX, receiver.) To detect transmission errors, the transmitting system introduces redundancy in the transmitted information. In an error-detecting code, the occurrence of an error on a word of the code generates a word not belonging to the code. The error weight is the number (and distribution) of corrupted bits tolerated by the code. In binary systems there are only two error possibilities: transmit 0, receive 1; transmit 1, receive 0.

62 Error Detection Codes The Hamming distance d(x,y) between two words x, y of a code C is the number of bit positions in which x and y differ. The minimum distance of a code is d_min = min d(x,y) over all x ≠ y in C. A code having minimum distance d is able to detect errors of weight up to d-1.

63 Error Detecting Codes (figure: two example codes mapping the symbols A, B, C, D to code words; Code 1 has d_min = 1, Code 2 has d_min = 2; legal and illegal code words are shown)

64 Parity Code (minimum distance 2) A code having d_min = 2 can be obtained by using one of the following expressions: d1 + d2 + d3 + ... + dn + p = 0 (even parity: even number of 1s), or d1 + d2 + d3 + ... + dn + p = 1 (odd number of 1s), where n is the number of bits of the original block code, + is the modulo-2 sum operator, and p is the parity bit added to the original word to obtain an EDC code. (table: information words with their even and odd parity bits.) A code with minimum distance equal to 2 can detect errors having weight 1 (single errors).

65 Parity Code (figure: parity generator at the transmitter, parity checker at the receiver.) At the transmitter: I1 + I2 + I3 + p = 0. The receiver recomputes I1 + I2 + I3 + p: if equal to 0, there has been no single error; if equal to 1, there has been a single error. Ex.: to transmit 101, the parity generator computes the parity bit p = 0, and 1010 is transmitted. If 1110 is received, the parity check detects an error: 1 + 1 + 1 + 0 = 1 ≠ 0. If 1111 is received: 1 + 1 + 1 + 1 = 0, all right?? No: double (even-weight) errors go unnoticed.
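The parity generator and checker amount to modulo-2 sums, as in this sketch:

```python
def even_parity_bit(bits):
    """Parity bit p such that the total number of 1s (data + p) is even."""
    p = 0
    for b in bits:
        p ^= b            # XOR = modulo-2 sum
    return p

def check_even_parity(word):
    """True if the received word (data bits + parity bit) sums to 0 mod 2.

    Single (odd-weight) errors are detected; even-weight errors are not.
    """
    s = 0
    for b in word:
        s ^= b
    return s == 0

# The slide's example: to send 101, p = 0, so 1010 is transmitted.
```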

66 Error Correcting Codes A code having minimum distance d can correct errors of weight up to (d-1)/2. A code with minimum distance 3 can therefore correct errors having weight 1.

67 (figure)

68 Hamming Codes (1) A method for constructing codes with minimum distance 3: for every i it is possible to build a code of 2^i - 1 bits, with i parity (check) bits and 2^i - 1 - i information bits. The bits in positions corresponding to a power of 2 (1, 2, 4, 8, ...) are parity bits; the remaining ones are information bits. Each parity bit checks the correctness of the information bits whose position, expressed in binary, has a 1 in the power of 2 corresponding to that parity bit: (3)10 = (011)2, (5)10 = (101)2, (6)10 = (110)2, (7)10 = (111)2. I7 + I6 + I5 + p4 = 0; I7 + I6 + I3 + p2 = 0; I7 + I5 + I3 + p1 = 0.

69 (figure)

70 Hamming Codes (2) Positions 1..7 hold p1, p2, I3, p4, I5, I6, I7. Parity groups: p4 + I5 + I6 + I7 = 0; p2 + I3 + I6 + I7 = 0; p1 + I3 + I5 + I7 = 0 (p_i: parity bits; I_i: information bits).

71 (figure)

72 EDAC (Error Detection And Correction) circuit Information bits are sent through the transmission system. The check-bit generator (encoder) computes the parity bits: p4 = I5 + I6 + I7; p2 = I3 + I6 + I7; p1 = I3 + I5 + I7. The check-bit verifier (decoder) computes the syndrome (modulo-2 sums): S4 = p4 + I5 + I6 + I7; S2 = p2 + I3 + I6 + I7; S1 = p1 + I3 + I5 + I7. If the three syndrome bits are all 0, there have been no errors; otherwise their value gives the position of the erroneous bit, which is signalled as an error.
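The whole encoder/decoder pair for this Hamming(7,4) code can be sketched directly from the three parity equations; the syndrome value is the position of the bit to flip back:

```python
def hamming7_encode(i3, i5, i6, i7):
    """Hamming(7,4): information bits in positions 3,5,6,7; parity in 1,2,4."""
    p1 = i3 ^ i5 ^ i7
    p2 = i3 ^ i6 ^ i7
    p4 = i5 ^ i6 ^ i7
    return [p1, p2, i3, p4, i5, i6, i7]   # code word, positions 1..7

def hamming7_correct(word):
    """Recompute the parity groups; a nonzero syndrome is the error position."""
    p1, p2, i3, p4, i5, i6, i7 = word
    s1 = p1 ^ i3 ^ i5 ^ i7
    s2 = p2 ^ i3 ^ i6 ^ i7
    s4 = p4 ^ i5 ^ i6 ^ i7
    syndrome = 4 * s4 + 2 * s2 + s1
    if syndrome:
        word = list(word)
        word[syndrome - 1] ^= 1           # flip the erroneous bit back
    return word
```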

73 Redundant Array of Inexpensive Disks RAID

74 RAID Architecture RAID: Redundant Array of Inexpensive Disks Combine multiple small, inexpensive disk drives into a group to yield performance exceeding that of one large, more expensive drive Appear to the computer as a single virtual drive Support fault-tolerance by redundantly storing information in various ways Uses Data Striping to achieve better performance

75 Basic Issues Two operations are performed on a disk: read (small or large) and write (small or large). Access concurrency is the number of simultaneous requests that can be serviced by the disk system. Throughput is the number of bytes that can be read or written per unit time, as seen by one request. Data striping: spreading out blocks of each file across multiple disk drives.

76

77 RAID Levels: RAID-0 No Redundancy No Fault Tolerance, If one drive fails then all data in the array is lost. High I/O performance Parallel I/O Best Storage efficiency

78 RAID-1 Disk Mirroring Poor storage efficiency. Best read performance: double that of RAID-0. Poor write performance: two disks must be written. Good fault tolerance: as long as one disk of a pair is working, we can perform R/W operations.

79 RAID-2 Bit-level striping. Uses Hamming codes, a form of Error Correction Code (ECC). Can tolerate the failure of one disk. Number of redundant disks = O(log(total disks)); better storage efficiency than mirroring. High throughput but no access concurrency: the disks ALWAYS need to be accessed simultaneously (synchronized rotation). Expensive writes. Example: for 4 data disks, 3 redundant disks are needed to tolerate one disk failure.

80 RAID-3 Byte Level Striping with parity. No need for ECC since the controller knows which disk is in error. So parity is enough to tolerate one disk failure. Best Throughput, but no concurrency. Only one Redundant disk is needed.

81 RAID-3 (figure: a logical record striped into physical records across the data disks plus a parity disk P; example in which there is only one byte per disk)

82 RAID-4 Block-level striping. The stripe size introduces a tradeoff between access concurrency and throughput. The parity disk is a bottleneck in the case of small writes, where we have multiple writes at the same time. No problems for small or large reads.

83 Writes in RAID-3 and RAID-4 In general, writes are very expensive. Option 1: read the data on all other disks, compute the new parity P, and write it back. Ex.: 1 logical write = 3 physical reads + 2 physical writes. Option 2: compare the old data D0 with the new data D0', add the difference to P, and write back P. Ex.: 1 logical write = 2 physical reads + 2 physical writes.
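Option 2 is just XOR arithmetic: the new parity is the old parity with the old data removed and the new data folded in, P' = P + D0 + D0' (modulo 2, bitwise). A sketch treating blocks as integers:

```python
def small_write_parity(old_data: int, new_data: int, old_parity: int) -> int:
    """Option 2 small write: P' = P xor D_old xor D_new.

    XOR-ing the old data out of the parity and the new data in gives the
    same result as recomputing the parity over the whole stripe.
    """
    return old_parity ^ old_data ^ new_data
```

The invariant being preserved is that the parity block is the XOR of all data blocks in the stripe, so any one lost block can be rebuilt from the others.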

84 RAID-5 Block-Level Striping with Distributed parity. Parity is uniformly distributed across disks. Reduces the parity Bottleneck. Best small and large read (same as 4). Best Large write. Still costly for small write

85 Writes in RAID-5 Concurrent writes are possible thanks to the interleaved parity. Ex.: writes of D0 and D5 use disks 0, 1, 3, 4.

    disk 0   disk 1   disk 2   disk 3   disk 4
    D0       D1       D2       D3       P
    D4       D5       D6       P        D7
    D8       D9       P        D10      D11
    D12      P        D13      D14      D15
    P        D16      D17      D18      D19
    D20      D21      D22      D23      P

86 Summary of RAID Levels

87 Limits of RAID-5 RAID-5 is probably the most widely employed scheme. The larger the number of disks in a RAID-5, the better the performance we may get... but the larger the probability of a double disk failure becomes. After a disk crash, the RAID system needs to reconstruct the failed disk (detect, replace, and recreate it), and this can take hours if the system is busy. The probability that one of the remaining N-1 disks crashes within this vulnerability window can be high if N is large, especially considering that the disks in an array typically have the same age (=> correlated faults) and that rebuilding a disk requires reading a HUGE amount of data; it may become even higher than the probability of a single disk's failure.

88 RAID-6 Block-level striping with dual distributed parity. Two sets of parity are calculated. Better fault tolerance: data reconstruction is faster than in RAID-5, so the probability of a second fault during data reconstruction is lower. Writes are slightly worse than RAID-5 due to the added overhead of more parity calculations. May get better read performance than RAID-5 because data and parity are spread over more disks.

89 Error Propagation in Distributed Systems and Rollback Error Recovery Techniques

90 System Model The system consists of a fixed number N of processes which communicate only through messages. Processes cooperate to execute a distributed application program and interact with the outside world by receiving and sending input and output messages, respectively. (figure: the outside world exchanging input and output messages with a message-passing system of processes P0, P1, P2 and messages m1, m2)

91 Rollback Recovery in a Distributed System Rollback recovery treats a distributed system as a collection of processes that communicate through a network. Fault tolerance is achieved by periodically using stable storage to save the processes' states during failure-free execution. Upon a failure, a failed process restarts from one of its saved states, thereby reducing the amount of lost computation. Each of the saved states is called a checkpoint.

92 Checkpoint based Recovery: Overview Uncoordinated checkpointing: Each process takes its checkpoints independently Coordinated checkpointing: Processes coordinate their checkpoints in order to save a system-wide consistent state. This consistent set of checkpoints can be used to bound the rollback Communication-induced checkpointing: It forces each process to take checkpoints based on information piggybacked on the application messages it receives from other processes. 92

93 Consistent System State A consistent system state is one in which, if a process's state reflects a message receipt, then the state of the corresponding sender reflects sending that message. A fundamental goal of any rollback-recovery protocol is to bring the system into a consistent state when inconsistencies occur because of a fault.

94 Example (figure: processes P0, P1, P2 exchanging messages m1 and m2; in the consistent state every recorded receipt has a recorded send, while in the inconsistent state m2 becomes an orphan message, i.e., its receipt is recorded but its send is not)

95 Checkpointing protocols Each process periodically (or not periodically) saves its state on stable storage. The saved state contains sufficient information to restart process execution. A consistent global checkpoint is a set of N local checkpoints, one from each process, forming a consistent system state. Any consistent global checkpoint can be used to restart process execution upon a failure. The most recent consistent global checkpoint is termed the recovery line. In the uncoordinated checkpointing paradigm, the search for a consistent state might lead to the domino effect.
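For a single process, checkpointing and rollback can be sketched as follows (a list stands in for stable storage; all names are illustrative):

```python
import copy

class CheckpointedProcess:
    """Minimal local-checkpointing sketch: save the state to 'stable storage'
    from time to time and restart from the last checkpoint after a failure."""

    def __init__(self):
        self.state = {"step": 0, "data": []}
        self._stable_storage = []          # stands in for a disk

    def take_checkpoint(self):
        self._stable_storage.append(copy.deepcopy(self.state))

    def compute_step(self):
        self.state["step"] += 1
        self.state["data"].append(self.state["step"] ** 2)

    def recover(self):
        """Roll back to the most recent checkpoint; work done since is lost."""
        self.state = copy.deepcopy(self._stable_storage[-1])
```

Work performed after the last checkpoint is lost on recovery, which is exactly why the placement of the recovery line matters in the multi-process case.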

96 Domino effect: example (figure: processes P0, P1, P2 exchanging messages m0–m7, with checkpoints and a recovery line that is pushed further and further back.) Domino effect: a cascaded rollback which causes the system to roll back too far in the computation (even to the beginning), in spite of all the checkpoints.

97 Interactions with outside world A message-passing system often interacts with the outside world to receive input data or show the outcome of a computation. If a failure occurs, the outside world cannot be relied on to roll back. For example, a printer cannot roll back the effects of printing a character, and an automatic teller machine cannot recover the money that it dispensed to a customer. It is therefore necessary that the outside world perceive a consistent behavior of the system despite failures.

98 Interactions with outside world (contd.) Thus, before sending output to the outside world, the system must ensure that the state from which the output is sent will be recoverable despite any future failure. Similarly, input messages from the outside world may not be regenerated; thus the recovery protocols must arrange to save these input messages so that they can be retrieved when needed.

99 Garbage Collection Checkpoints and event logs consume storage resources. As the application progresses and more recovery information is collected, a subset of the stored information may become useless for recovery. Garbage collection is the deletion of such useless recovery information. A common approach to garbage collection is to identify the recovery line and discard all information relating to events that occurred before that line.

100 Checkpoint-Based Protocols Uncoordinated checkpointing allows each process maximum autonomy in deciding when to take checkpoints. Advantage: each process may take a checkpoint when it is most convenient. Disadvantages: domino effect; possibly useless checkpoints; the need to maintain multiple checkpoints; garbage collection is needed; not suitable for applications with outside-world interaction (output commit).

101 Coordinated Checkpointing Coordinated checkpointing requires processes to orchestrate their checkpoints in order to form a consistent global state. It simplifies recovery and is not susceptible to the domino effect, since every process always restarts from its most recent checkpoint. Only one checkpoint needs to be maintained, and hence there is less storage overhead. No need for garbage collection. The disadvantage is that a large latency is involved in committing output, since a global checkpoint is needed before output can be committed to the outside world.

102 Blocking Coordinated Checkpointing Phase 1: a coordinator takes a checkpoint and broadcasts a request message to all processes, asking them to take a checkpoint. When a process receives this message, it stops its execution, flushes all its communication channels, takes a tentative checkpoint, and sends an acknowledgement back to the coordinator. Phase 2: after the coordinator receives the acknowledgements from all processes, it broadcasts a commit message that completes the two-phase checkpointing protocol. After receiving the commit message, all the processes remove their old permanent checkpoint and make the tentative checkpoint permanent. Disadvantage: large overhead due to the long blocking time.
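The two phases can be sketched as a small simulation (class and method names are illustrative, the in-memory calls stand in for messages, and channel flushing is elided):

```python
class Participant:
    """A process taking part in the two-phase checkpointing protocol."""

    def __init__(self):
        self.state = 0
        self.tentative = None
        self.permanent = None

    def on_request(self):
        # Phase 1: stop, flush channels (elided), take a tentative checkpoint.
        self.tentative = self.state
        return "ack"

    def on_commit(self):
        # Phase 2: discard the old permanent checkpoint, promote the tentative one.
        self.permanent, self.tentative = self.tentative, None

def coordinated_checkpoint(processes):
    """Coordinator side: broadcast request, collect acks, broadcast commit."""
    acks = [p.on_request() for p in processes]     # phase 1
    if all(a == "ack" for a in acks):              # every ack received
        for p in processes:                        # phase 2
            p.on_commit()
```

Note how no process makes its checkpoint permanent until the coordinator has heard from everyone, which is what guarantees that the saved global state is consistent.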

103 Communication-induced checkpointing Avoids the domino effect while allowing processes to take some of their checkpoints independently. However, process independence is constrained to guarantee the eventual progress of the recovery line, and therefore processes may be forced to take additional checkpoints. The checkpoints that a process takes independently are called local checkpoints, while those that a process is forced to take are called forced checkpoints.

104 Communication-induced checkpoint (contd.) Protocol-related information is piggybacked on the application messages: the receiver uses the piggybacked information to determine if it has to force a checkpoint to advance the global recovery line. The forced checkpoint must be taken before the application may process the contents of the message, possibly incurring high latency and overhead. The simplest communication-induced checkpointing forces a checkpoint whenever a message is received, before processing it; reducing the number of forced checkpoints is therefore important. No special coordination messages are exchanged.

CprE 458/558: Real-Time Systems, Lecture 17: Fault-tolerant design techniques

More information

Distributed Systems COMP 212. Lecture 19 Othon Michail

Distributed Systems COMP 212. Lecture 19 Othon Michail Distributed Systems COMP 212 Lecture 19 Othon Michail Fault Tolerance 2/31 What is a Distributed System? 3/31 Distributed vs Single-machine Systems A key difference: partial failures One component fails

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 3 - Resilient Structures Chapter 2 HW Fault Tolerance Part.3.1 M-of-N Systems An M-of-N system consists of N identical

More information

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013 Distributed Systems 19. Fault Tolerance Paul Krzyzanowski Rutgers University Fall 2013 November 27, 2013 2013 Paul Krzyzanowski 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware

More information

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d) Distributed Systems Fö 9/10-1 Distributed Systems Fö 9/10-2 FAULT TOLERANCE 1. Fault Tolerant Systems 2. Faults and Fault Models. Redundancy 4. Time Redundancy and Backward Recovery. Hardware Redundancy

More information

5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks 485.e1

5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks 485.e1 5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks 485.e1 5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks Amdahl s law in Chapter 1 reminds us that

More information

Physical Storage Media

Physical Storage Media Physical Storage Media These slides are a modified version of the slides of the book Database System Concepts, 5th Ed., McGraw-Hill, by Silberschatz, Korth and Sudarshan. Original slides are available

More information

Distributed Systems 24. Fault Tolerance

Distributed Systems 24. Fault Tolerance Distributed Systems 24. Fault Tolerance Paul Krzyzanowski pxk@cs.rutgers.edu 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware failure Software bugs Operator errors Network

More information

Distributed Systems COMP 212. Revision 2 Othon Michail

Distributed Systems COMP 212. Revision 2 Othon Michail Distributed Systems COMP 212 Revision 2 Othon Michail Synchronisation 2/55 How would Lamport s algorithm synchronise the clocks in the following scenario? 3/55 How would Lamport s algorithm synchronise

More information

Distributed Systems

Distributed Systems 15-440 Distributed Systems 11 - Fault Tolerance, Logging and Recovery Tuesday, Oct 2 nd, 2018 Logistics Updates P1 Part A checkpoint Part A due: Saturday 10/6 (6-week drop deadline 10/8) *Please WORK hard

More information

Fault Tolerance. Distributed Systems IT332

Fault Tolerance. Distributed Systems IT332 Fault Tolerance Distributed Systems IT332 2 Outline Introduction to fault tolerance Reliable Client Server Communication Distributed commit Failure recovery 3 Failures, Due to What? A system is said to

More information

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Distributed Systems. Fault Tolerance. Paul Krzyzanowski Distributed Systems Fault Tolerance Paul Krzyzanowski Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults Deviation from expected

More information

Chapter 8 Fault Tolerance

Chapter 8 Fault Tolerance DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance 1 Fault Tolerance Basic Concepts Being fault tolerant is strongly related to

More information

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University CS 370: SYSTEM ARCHITECTURE & SOFTWARE [MASS STORAGE] Frequently asked questions from the previous class survey Shrideep Pallickara Computer Science Colorado State University L29.1 L29.2 Topics covered

More information

Fault-Tolerant Computer Systems ECE 60872/CS Recovery

Fault-Tolerant Computer Systems ECE 60872/CS Recovery Fault-Tolerant Computer Systems ECE 60872/CS 59000 Recovery Saurabh Bagchi School of Electrical & Computer Engineering Purdue University Slides based on ECE442 at the University of Illinois taught by Profs.

More information

Fault Tolerance Dealing with an imperfect world

Fault Tolerance Dealing with an imperfect world Fault Tolerance Dealing with an imperfect world Paul Krzyzanowski Rutgers University September 14, 2012 1 Introduction If we look at the words fault and tolerance, we can define the fault as a malfunction

More information

CSE 380 Computer Operating Systems

CSE 380 Computer Operating Systems CSE 380 Computer Operating Systems Instructor: Insup Lee University of Pennsylvania Fall 2003 Lecture Note on Disk I/O 1 I/O Devices Storage devices Floppy, Magnetic disk, Magnetic tape, CD-ROM, DVD User

More information

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju Chapter 5: Distributed Systems: Fault Tolerance Fall 2013 Jussi Kangasharju Chapter Outline n Fault tolerance n Process resilience n Reliable group communication n Distributed commit n Recovery 2 Basic

More information

A Survey of Rollback-Recovery Protocols in Message-Passing Systems

A Survey of Rollback-Recovery Protocols in Message-Passing Systems A Survey of Rollback-Recovery Protocols in Message-Passing Systems Mootaz Elnozahy * Lorenzo Alvisi Yi-Min Wang David B. Johnson June 1999 CMU-CS-99-148 (A revision of CMU-CS-96-181) School of Computer

More information

Three Models. 1. Time Order 2. Distributed Algorithms 3. Nature of Distributed Systems1. DEPT. OF Comp Sc. and Engg., IIT Delhi

Three Models. 1. Time Order 2. Distributed Algorithms 3. Nature of Distributed Systems1. DEPT. OF Comp Sc. and Engg., IIT Delhi DEPT. OF Comp Sc. and Engg., IIT Delhi Three Models 1. CSV888 - Distributed Systems 1. Time Order 2. Distributed Algorithms 3. Nature of Distributed Systems1 Index - Models to study [2] 1. LAN based systems

More information

Fault-tolerant techniques

Fault-tolerant techniques What are the effects if the hardware or software is not fault-free in a real-time system? What causes component faults? Specification or design faults: Incomplete or erroneous models Lack of techniques

More information

CS370: Operating Systems [Spring 2017] Dept. Of Computer Science, Colorado State University

CS370: Operating Systems [Spring 2017] Dept. Of Computer Science, Colorado State University Frequently asked questions from the previous class survey CS 370: OPERATING SYSTEMS [MASS STORAGE] How does the OS caching optimize disk performance? How does file compression work? Does the disk change

More information

An Introduction to RAID

An Introduction to RAID Intro An Introduction to RAID Gursimtan Singh Dept. of CS & IT Doaba College RAID stands for Redundant Array of Inexpensive Disks. RAID is the organization of multiple disks into a large, high performance

More information

CS5460: Operating Systems Lecture 20: File System Reliability

CS5460: Operating Systems Lecture 20: File System Reliability CS5460: Operating Systems Lecture 20: File System Reliability File System Optimizations Modern Historic Technique Disk buffer cache Aggregated disk I/O Prefetching Disk head scheduling Disk interleaving

More information

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown Lecture 21: Reliable, High Performance Storage CSC 469H1F Fall 2006 Angela Demke Brown 1 Review We ve looked at fault tolerance via server replication Continue operating with up to f failures Recovery

More information

I/O Hardwares. Some typical device, network, and data base rates

I/O Hardwares. Some typical device, network, and data base rates Input/Output 1 I/O Hardwares Some typical device, network, and data base rates 2 Device Controllers I/O devices have components: mechanical component electronic component The electronic component is the

More information

Today: Fault Tolerance. Replica Management

Today: Fault Tolerance. Replica Management Today: Fault Tolerance Failure models Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Failure recovery

More information

HP AutoRAID (Lecture 5, cs262a)

HP AutoRAID (Lecture 5, cs262a) HP AutoRAID (Lecture 5, cs262a) Ion Stoica, UC Berkeley September 13, 2016 (based on presentation from John Kubiatowicz, UC Berkeley) Array Reliability Reliability of N disks = Reliability of 1 Disk N

More information

Distributed Systems 23. Fault Tolerance

Distributed Systems 23. Fault Tolerance Distributed Systems 23. Fault Tolerance Paul Krzyzanowski pxk@cs.rutgers.edu 4/20/2011 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware failure Software bugs Operator errors

More information

CDA 5140 Software Fault-tolerance. - however, reliability of the overall system is actually a product of the hardware, software, and human reliability

CDA 5140 Software Fault-tolerance. - however, reliability of the overall system is actually a product of the hardware, software, and human reliability CDA 5140 Software Fault-tolerance - so far have looked at reliability as hardware reliability - however, reliability of the overall system is actually a product of the hardware, software, and human reliability

More information

HP AutoRAID (Lecture 5, cs262a)

HP AutoRAID (Lecture 5, cs262a) HP AutoRAID (Lecture 5, cs262a) Ali Ghodsi and Ion Stoica, UC Berkeley January 31, 2018 (based on slide from John Kubiatowicz, UC Berkeley) Array Reliability Reliability of N disks = Reliability of 1 Disk

More information

Chapter 11: File System Implementation. Objectives

Chapter 11: File System Implementation. Objectives Chapter 11: File System Implementation Objectives To describe the details of implementing local file systems and directory structures To describe the implementation of remote file systems To discuss block

More information

Chapter 17: Recovery System

Chapter 17: Recovery System Chapter 17: Recovery System Database System Concepts See www.db-book.com for conditions on re-use Chapter 17: Recovery System Failure Classification Storage Structure Recovery and Atomicity Log-Based Recovery

More information

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf Distributed systems Lecture 6: distributed transactions, elections, consensus and replication Malte Schwarzkopf Last time Saw how we can build ordered multicast Messages between processes in a group Need

More information

CSE325 Principles of Operating Systems. Mass-Storage Systems. David P. Duggan. April 19, 2011

CSE325 Principles of Operating Systems. Mass-Storage Systems. David P. Duggan. April 19, 2011 CSE325 Principles of Operating Systems Mass-Storage Systems David P. Duggan dduggan@sandia.gov April 19, 2011 Outline Storage Devices Disk Scheduling FCFS SSTF SCAN, C-SCAN LOOK, C-LOOK Redundant Arrays

More information

Mladen Stefanov F48235 R.A.I.D

Mladen Stefanov F48235 R.A.I.D R.A.I.D Data is the most valuable asset of any business today. Lost data, in most cases, means lost business. Even if you backup regularly, you need a fail-safe way to ensure that your data is protected

More information

Fault Tolerance. Basic Concepts

Fault Tolerance. Basic Concepts COP 6611 Advanced Operating System Fault Tolerance Chi Zhang czhang@cs.fiu.edu Dependability Includes Availability Run time / total time Basic Concepts Reliability The length of uninterrupted run time

More information

CSE380 - Operating Systems. Communicating with Devices

CSE380 - Operating Systems. Communicating with Devices CSE380 - Operating Systems Notes for Lecture 15-11/4/04 Matt Blaze (some examples by Insup Lee) Communicating with Devices Modern architectures support convenient communication with devices memory mapped

More information

The term "physical drive" refers to a single hard disk module. Figure 1. Physical Drive

The term physical drive refers to a single hard disk module. Figure 1. Physical Drive HP NetRAID Tutorial RAID Overview HP NetRAID Series adapters let you link multiple hard disk drives together and write data across them as if they were one large drive. With the HP NetRAID Series adapter,

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 17 - Checkpointing II Chapter 6 - Checkpointing Part.17.1 Coordinated Checkpointing Uncoordinated checkpointing may lead

More information

Dependability. IC Life Cycle

Dependability. IC Life Cycle Dependability Alberto Bosio, Associate Professor UM Microelectronic Departement bosio@lirmm.fr IC Life Cycle User s Requirements Design Re-Cycling In-field Operation Production 2 1 IC Life Cycle User s

More information

SYSTEM UPGRADE, INC Making Good Computers Better. System Upgrade Teaches RAID

SYSTEM UPGRADE, INC Making Good Computers Better. System Upgrade Teaches RAID System Upgrade Teaches RAID In the growing computer industry we often find it difficult to keep track of the everyday changes in technology. At System Upgrade, Inc it is our goal and mission to provide

More information

ECE Enterprise Storage Architecture. Fall 2018

ECE Enterprise Storage Architecture. Fall 2018 ECE590-03 Enterprise Storage Architecture Fall 2018 RAID Tyler Bletsch Duke University Slides include material from Vince Freeh (NCSU) A case for redundant arrays of inexpensive disks Circa late 80s..

More information

hot plug RAID memory technology for fault tolerance and scalability

hot plug RAID memory technology for fault tolerance and scalability hp industry standard servers april 2003 technology brief TC030412TB hot plug RAID memory technology for fault tolerance and scalability table of contents abstract... 2 introduction... 2 memory reliability...

More information

Fault Tolerance. Distributed Systems. September 2002

Fault Tolerance. Distributed Systems. September 2002 Fault Tolerance Distributed Systems September 2002 Basics A component provides services to clients. To provide services, the component may require the services from other components a component may depend

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

CSE 5306 Distributed Systems. Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance CSE 5306 Distributed Systems Fault Tolerance 1 Failure in Distributed Systems Partial failure happens when one component of a distributed system fails often leaves other components unaffected A failure

More information

Distributed Video Systems Chapter 5 Issues in Video Storage and Retrieval Part 2 - Disk Array and RAID

Distributed Video Systems Chapter 5 Issues in Video Storage and Retrieval Part 2 - Disk Array and RAID Distributed Video ystems Chapter 5 Issues in Video torage and Retrieval art 2 - Disk Array and RAID Jack Yiu-bun Lee Department of Information Engineering The Chinese University of Hong Kong Contents 5.1

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Introduction to Software Fault Tolerance Techniques and Implementation. Presented By : Hoda Banki

Introduction to Software Fault Tolerance Techniques and Implementation. Presented By : Hoda Banki Introduction to Software Fault Tolerance Techniques and Implementation Presented By : Hoda Banki 1 Contents : Introduction Types of faults Dependability concept classification Error recovery Types of redundancy

More information

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Authors: Robert L Akamine, Robert F. Hodson, Brock J. LaMeres, and Robert E. Ray www.nasa.gov Contents Introduction to the

More information

Failure Tolerance. Distributed Systems Santa Clara University

Failure Tolerance. Distributed Systems Santa Clara University Failure Tolerance Distributed Systems Santa Clara University Distributed Checkpointing Distributed Checkpointing Capture the global state of a distributed system Chandy and Lamport: Distributed snapshot

More information

PowerVault MD3 Storage Array Enterprise % Availability

PowerVault MD3 Storage Array Enterprise % Availability PowerVault MD3 Storage Array Enterprise 99.999% Availability Dell Engineering June 2015 A Dell Technical White Paper THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS

More information

I/O, Disks, and RAID Yi Shi Fall Xi an Jiaotong University

I/O, Disks, and RAID Yi Shi Fall Xi an Jiaotong University I/O, Disks, and RAID Yi Shi Fall 2017 Xi an Jiaotong University Goals for Today Disks How does a computer system permanently store data? RAID How to make storage both efficient and reliable? 2 What does

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

Chapter 17: Recovery System

Chapter 17: Recovery System Chapter 17: Recovery System! Failure Classification! Storage Structure! Recovery and Atomicity! Log-Based Recovery! Shadow Paging! Recovery With Concurrent Transactions! Buffer Management! Failure with

More information

Failure Classification. Chapter 17: Recovery System. Recovery Algorithms. Storage Structure

Failure Classification. Chapter 17: Recovery System. Recovery Algorithms. Storage Structure Chapter 17: Recovery System Failure Classification! Failure Classification! Storage Structure! Recovery and Atomicity! Log-Based Recovery! Shadow Paging! Recovery With Concurrent Transactions! Buffer Management!

More information

Associate Professor Dr. Raed Ibraheem Hamed

Associate Professor Dr. Raed Ibraheem Hamed Associate Professor Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Computer Science Department 2015 2016 1 Points to Cover Storing Data in a DBMS Primary Storage

More information

Storage Devices for Database Systems

Storage Devices for Database Systems Storage Devices for Database Systems 5DV120 Database System Principles Umeå University Department of Computing Science Stephen J. Hegner hegner@cs.umu.se http://www.cs.umu.se/~hegner Storage Devices for

More information

Error Detection And Correction

Error Detection And Correction Announcements Please read Error Detection and Correction sent to you by your grader. Lab Assignment #2 deals with Hamming Code. Lab Assignment #2 is available now and will be due by 11:59 PM on March 22.

More information

416 Distributed Systems. Errors and Failures, part 2 Feb 3, 2016

416 Distributed Systems. Errors and Failures, part 2 Feb 3, 2016 416 Distributed Systems Errors and Failures, part 2 Feb 3, 2016 Options in dealing with failure 1. Silently return the wrong answer. 2. Detect failure. 3. Correct / mask the failure 2 Block error detection/correction

More information

Current Topics in OS Research. So, what s hot?

Current Topics in OS Research. So, what s hot? Current Topics in OS Research COMP7840 OSDI Current OS Research 0 So, what s hot? Operating systems have been around for a long time in many forms for different types of devices It is normally general

More information

Distributed Systems Fault Tolerance

Distributed Systems Fault Tolerance Distributed Systems Fault Tolerance [] Fault Tolerance. Basic concepts - terminology. Process resilience groups and failure masking 3. Reliable communication reliable client-server communication reliable

More information

CHAPTER 3 RECOVERY & CONCURRENCY ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI

CHAPTER 3 RECOVERY & CONCURRENCY ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI CHAPTER 3 RECOVERY & CONCURRENCY ADVANCED DATABASE SYSTEMS Assist. Prof. Dr. Volkan TUNALI PART 1 2 RECOVERY Topics 3 Introduction Transactions Transaction Log System Recovery Media Recovery Introduction

More information

Chapter 3. The Data Link Layer. Wesam A. Hatamleh

Chapter 3. The Data Link Layer. Wesam A. Hatamleh Chapter 3 The Data Link Layer The Data Link Layer Data Link Layer Design Issues Error Detection and Correction Elementary Data Link Protocols Sliding Window Protocols Example Data Link Protocols The Data

More information

Today: Fault Tolerance. Fault Tolerance

Today: Fault Tolerance. Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Fault Tolerance Dr. Yong Guan Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Outline for Today s Talk Basic Concepts Process Resilience Reliable

More information

Routing Journal Operations on Disks Using Striping With Parity 1

Routing Journal Operations on Disks Using Striping With Parity 1 Routing Journal Operations on Disks Using Striping With Parity 1 ABSTRACT Ramzi Haraty and Fadi Yamout Lebanese American University P.O. Box 13-5053 Beirut, Lebanon Email: rharaty@beirut.lau.edu.lb, fadiyam@inco.com.lb

More information

Database Systems II. Secondary Storage

Database Systems II. Secondary Storage Database Systems II Secondary Storage CMPT 454, Simon Fraser University, Fall 2009, Martin Ester 29 The Memory Hierarchy Swapping, Main-memory DBMS s Tertiary Storage: Tape, Network Backup 3,200 MB/s (DDR-SDRAM

More information

Frequently asked questions from the previous class survey

Frequently asked questions from the previous class survey CS 370: OPERATING SYSTEMS [MASS STORAGE] Shrideep Pallickara Computer Science Colorado State University L29.1 Frequently asked questions from the previous class survey How does NTFS compare with UFS? L29.2

More information

In the late 1980s, rapid adoption of computers

In the late 1980s, rapid adoption of computers hapter 3 ata Protection: RI In the late 1980s, rapid adoption of computers for business processes stimulated the KY ONPTS Hardware and Software RI growth of new applications and databases, significantly

More information

Today CSCI Recovery techniques. Recovery. Recovery CAP Theorem. Instructor: Abhishek Chandra

Today CSCI Recovery techniques. Recovery. Recovery CAP Theorem. Instructor: Abhishek Chandra Today CSCI 5105 Recovery CAP Theorem Instructor: Abhishek Chandra 2 Recovery Operations to be performed to move from an erroneous state to an error-free state Backward recovery: Go back to a previous correct

More information

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit. Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery

More information

RAID (Redundant Array of Inexpensive Disks)

RAID (Redundant Array of Inexpensive Disks) Magnetic Disk Characteristics I/O Connection Structure Types of Buses Cache & I/O I/O Performance Metrics I/O System Modeling Using Queuing Theory Designing an I/O System RAID (Redundant Array of Inexpensive

More information

In This Lecture. Transactions and Recovery. Transactions. Transactions. Isolation and Durability. Atomicity and Consistency. Transactions Recovery

In This Lecture. Transactions and Recovery. Transactions. Transactions. Isolation and Durability. Atomicity and Consistency. Transactions Recovery In This Lecture Database Systems Lecture 15 Natasha Alechina Transactions Recovery System and Media s Concurrency Concurrency problems For more information Connolly and Begg chapter 20 Ullmanand Widom8.6

More information

Chapter 8 Fault Tolerance

Chapter 8 Fault Tolerance DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance Fault Tolerance Basic Concepts Being fault tolerant is strongly related to what

More information

Physical Representation of Files

Physical Representation of Files Physical Representation of Files A disk drive consists of a disk pack containing one or more platters stacked like phonograph records. Information is stored on both sides of the platter. Each platter is

More information

6. Fault Tolerance. CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sect

6. Fault Tolerance. CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sect 6. Fault Tolerance (a) Introduction. (b) Types of faults. (c) Fault models. (d) Fault coverage. (e) Redundancy. (f) Fault detection techniques. (g) Hardware fault tolerance. (h) Software fault tolerance.

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 16 - Checkpointing I Chapter 6 - Checkpointing Part.16.1 Failure During Program Execution Computers today are much faster,

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson Distributed systems Lecture 6: Elections, distributed transactions, and replication DrRobert N. M. Watson 1 Last time Saw how we can build ordered multicast Messages between processes in a group Need to

More information

System Malfunctions. Implementing Atomicity and Durability. Failures: Crash. Failures: Abort. Log. Failures: Media

System Malfunctions. Implementing Atomicity and Durability. Failures: Crash. Failures: Abort. Log. Failures: Media System Malfunctions Implementing Atomicity and Durability Chapter 22 Transaction processing systems have to maintain correctness in spite of malfunctions Crash Abort Media Failure 1 2 Failures: Crash Processor

More information

4. Error correction and link control. Contents

4. Error correction and link control. Contents //2 4. Error correction and link control Contents a. Types of errors b. Error detection and correction c. Flow control d. Error control //2 a. Types of errors Data can be corrupted during transmission.

More information

CISC 7310X. C11: Mass Storage. Hui Chen Department of Computer & Information Science CUNY Brooklyn College. 4/19/2018 CUNY Brooklyn College

CISC 7310X. C11: Mass Storage. Hui Chen Department of Computer & Information Science CUNY Brooklyn College. 4/19/2018 CUNY Brooklyn College CISC 7310X C11: Mass Storage Hui Chen Department of Computer & Information Science CUNY Brooklyn College 4/19/2018 CUNY Brooklyn College 1 Outline Review of memory hierarchy Mass storage devices Reliability

More information

CSE380 - Operating Systems

CSE380 - Operating Systems CSE380 - Operating Systems Notes for Lecture 17-11/10/05 Matt Blaze, Micah Sherr (some examples by Insup Lee) Implementing File Systems We ve looked at the user view of file systems names, directory structure,

More information

- SLED: single large expensive disk - RAID: redundant array of (independent, inexpensive) disks

- SLED: single large expensive disk - RAID: redundant array of (independent, inexpensive) disks RAID and AutoRAID RAID background Problem: technology trends - computers getting larger, need more disk bandwidth - disk bandwidth not riding moore s law - faster CPU enables more computation to support

More information

Page 1 FAULT TOLERANT SYSTEMS. Coordinated Checkpointing. Time-Based Synchronization. A Coordinated Checkpointing Algorithm

Page 1 FAULT TOLERANT SYSTEMS. Coordinated Checkpointing. Time-Based Synchronization. A Coordinated Checkpointing Algorithm FAULT TOLERANT SYSTEMS Coordinated http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Chapter 6 II Uncoordinated checkpointing may lead to domino effect or to livelock Example: l P wants to take a

More information

Lecture 9. I/O Management and Disk Scheduling Algorithms

Lecture 9. I/O Management and Disk Scheduling Algorithms Lecture 9 I/O Management and Disk Scheduling Algorithms 1 Lecture Contents 1. I/O Devices 2. Operating System Design Issues 3. Disk Scheduling Algorithms 4. RAID (Redundant Array of Independent Disks)

More information

Today s Papers. Array Reliability. RAID Basics (Two optional papers) EECS 262a Advanced Topics in Computer Systems Lecture 3

Today s Papers. Array Reliability. RAID Basics (Two optional papers) EECS 262a Advanced Topics in Computer Systems Lecture 3 EECS 262a Advanced Topics in Computer Systems Lecture 3 Filesystems (Con t) September 10 th, 2012 John Kubiatowicz and Anthony D. Joseph Electrical Engineering and Computer Sciences University of California,

More information

Defect Tolerance in VLSI Circuits

Defect Tolerance in VLSI Circuits Defect Tolerance in VLSI Circuits Prof. Naga Kandasamy We will consider the following redundancy techniques to tolerate defects in VLSI circuits. Duplication with complementary logic (physical redundancy).

More information