HARDWARE AND SOFTWARE ERROR DETECTION


Ravishankar K. Iyer, Zbigniew Kalbarczyk
Center for Reliable and High-Performance Computing, Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
1308 W. Main Street, Urbana, IL

1 INTRODUCTION

A reliable computer system needs to provide its normal level of service in the presence of hardware and software faults [Avi86, Lap92]. There are two philosophies for achieving this reliability: fault avoidance, in which techniques prevent the occurrence of faults in the first place, and fault tolerance, in which techniques allow the system to continue execution despite the occurrence of faults. Fault tolerance can be implemented using one of two basic approaches: error masking, in which the system avoids the effect of an error through some form of majority voting, and error detection and recovery, in which the system notices the presence of an error, locates and isolates the error, reconfigures in a spare, and restarts the system [Ren84].

The objectives of this article are twofold: (1) to introduce and characterize a variety of hardware- and software-based error detection techniques and (2) more importantly, to illustrate and discuss these techniques in real-world systems and applications. Consequently, the paper does not aim at providing an exhaustive set of possible detection techniques. Rather, it emphasizes examples and the experience of applying these techniques to actual systems. The techniques presented can be implemented in hardware and/or in software, and they can be applied to uniprocessor, multiprocessor, distributed, or networked systems. Selecting the right technique or set of techniques is, of course, a design decision. This paper is an important step toward facilitating such a decision. Not only does it discuss the implementation aspects of different techniques, but it also provides detailed evaluation results (in terms of error coverage or performance overhead) for many of the techniques presented.

The article is organized as follows. The first few sections focus on hardware techniques, including hardware redundancy, information redundancy (e.g., coding techniques), and time redundancy.
Significant space is dedicated to watchdog timers, heartbeats, and capability and consistency checking (including data audit techniques and assertions). The article then provides an in-depth discussion of software-based control-flow checking techniques, including an evaluation (using fault injection) of two selected techniques. The article ends with a discussion of the application of data audit techniques in a telecommunication environment.

2 HARDWARE REDUNDANCY

In this section, we discuss three basic forms of hardware redundancy: passive, active, and hybrid.

Passive hardware redundancy relies on majority voting to mask the occurrence of faults. Passive techniques do not need fault detection or system reconfiguration. The most common form of passive redundancy is triple modular redundancy (TMR), which triplicates the hardware necessary to perform the required operations and uses a voter to determine the output of the system. In this approach, the primary difficulty is the voter: if it fails, the entire system fails. A common way to avoid this problem is to use three voters and provide three independent outputs. Figure 1 shows the two forms of TMR. Several stages of TMR can be interconnected using this approach by connecting the outputs of the voters of one TMR stage to the inputs of the modules of the next TMR stage. The voting can be performed by either a hardware voter (which votes quickly but requires extra hardware logic) or a software voter (which runs on the processors that also execute the normal computations but is generally slower).

A generalization of the TMR approach is N-modular redundancy (NMR), which uses N copies of a module instead of three. For example, the NASA Space Shuttle onboard computer system uses four computers on which a majority vote is performed [Skl76].
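The voting step described above can be illustrated with a small software voter. This is only a sketch under stated assumptions: the function names `majority_vote` and `disagreeing_modules` are hypothetical, module outputs are assumed to be directly comparable (hashable) values, and synchronization of the inputs is assumed to have happened already.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, or None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority means more than half of the modules agree.
    return value if count > len(outputs) // 2 else None


def disagreeing_modules(outputs):
    """Return indices of modules whose output differs from the majority."""
    winner = majority_vote(outputs)
    if winner is None:
        return None  # no majority, so no module can be singled out
    return [i for i, value in enumerate(outputs) if value != winner]


if __name__ == "__main__":
    # TMR: module 2 delivers a wrong result and is masked by the vote.
    print(majority_vote([42, 42, 17]), disagreeing_modules([42, 42, 17]))
```

With N = 3 this masks any single faulty module; the same function works for an NMR arrangement with larger N, as long as a strict majority of the modules remains correct.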

Active hardware redundancy attempts to achieve fault tolerance by fault detection, fault location, and fault recovery. The most common form of fault detection is duplication and comparison, which uses two identical copies of hardware that perform the same computations in parallel and compares their results, as shown in Figure 2. A commercial product from Stratus uses a pair-and-spare approach in which two duplexed components are used for self-checking and fault tolerance. Two processor boards are used, where each board contains a pair of microprocessors operating in duplicate-and-compare mode to check themselves.

Figure 1: Triplication and Voting. a) TMR with One Voter; b) TMR with Three Voters

Figure 2: Duplication and Comparison

A second form of active redundancy is standby sparing, in which one module is operational and one or more modules serve as standbys or spares, as illustrated in Figure 3. Various fault detection schemes are used to determine when a module has become faulty, and fault location is used to determine exactly which module is faulty. The reconfiguration operation in standby sparing can be viewed conceptually as a switch whose output is selected from one of the modules providing inputs to the switch. Standby sparing can bring a system back into full operation after the occurrence of a fault, but it requires a momentary disruption in performance while reconfiguration is performed. If the disruption of processing must be minimized, hot standby sparing can be used, in which the spares operate synchronously with the on-line modules and are prepared to take over at any time. Cold standby sparing uses unpowered spares that must be powered up and initialized before being brought into active service. The advantage of standby sparing is that in a system containing n identical modules, such as a multiprocessor, fault tolerance can be provided with k < n spare modules.

Figure 3: Standby Redundancy

Hybrid hardware redundancy combines the attractive features of both the active and passive approaches [Joh89]. Fault masking is used to prevent the system from producing erroneous results, and fault detection, location, and recovery are used to reconfigure the system in the event of a fault. The most common form of hybrid redundancy is N-modular redundancy with spares. In this approach, a basic core of N modules is arranged in a voting configuration. In addition, spares are provided to replace faulty units in the NMR core.

2.1 Comparing the Reliability of Simplex and TMR Systems

The purpose of this example is to show how a rather blind use of redundancy can lead to seemingly paradoxical results. As the following calculation shows, the mean time to failure (MTTF) of a TMR system is in fact lower than that of a comparable simplex (nonredundant) system. A system with triple modular redundancy (TMR) includes three components, two of which are required for the system to function properly. Assume that the reliability of a single component is given by

$$R_{simplex} = e^{-\lambda t}$$

where $\lambda$ is the failure rate of the simplex component. The MTTF for a simplex system is

$$MTTF_{simplex} = \int_0^\infty e^{-\lambda t}\,dt = \frac{1}{\lambda}$$

The reliability of a TMR system can be expressed as

$$R_{TMR} = e^{-3\lambda t} + \binom{3}{2} e^{-2\lambda t}\left(1 - e^{-\lambda t}\right) = 3e^{-2\lambda t} - 2e^{-3\lambda t}$$

and consequently

$$MTTF_{TMR} = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}, \qquad MTTF_{simplex} > MTTF_{TMR}$$

Figure 4 shows the reliability of TMR and simplex systems as functions of $\lambda t$.
Note that

$$R_{TMR}(t) \ge R(t) \quad \text{for } 0 \le t \le t_0, \qquad R_{TMR}(t) < R(t) \quad \text{for } t_0 < t < \infty,$$

where $t_0 = \ln 2 / \lambda \approx 0.7/\lambda$.
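The crossover between the two reliability curves can be checked numerically. The sketch below simply evaluates the expressions for the simplex and TMR reliability given above; the function names are hypothetical and the failure rate used in the demonstration is an arbitrary illustrative value.

```python
import math


def r_simplex(lam, t):
    """Reliability of a single component with constant failure rate lam."""
    return math.exp(-lam * t)


def r_tmr(lam, t):
    """Reliability of a 2-out-of-3 system with a perfect voter: 3R^2 - 2R^3."""
    r = r_simplex(lam, t)
    return 3 * r**2 - 2 * r**3


def crossover_time(lam):
    """Mission time t0 = ln(2)/lam at which the TMR and simplex curves meet."""
    return math.log(2) / lam


if __name__ == "__main__":
    lam = 1e-4  # illustrative failure rate (per hour), not taken from the text
    t0 = crossover_time(lam)
    for t in (0.1 * t0, t0, 3 * t0):
        print(f"t = {t:10.1f} h  simplex = {r_simplex(lam, t):.4f}  TMR = {r_tmr(lam, t):.4f}")
```

At $t_0$ both reliabilities equal 0.5; before $t_0$ the TMR value is higher, and after $t_0$ it is lower.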

Figure 4: Comparative Reliability of TMR and Simplex Systems (reliability plotted against $\lambda t$)

TMR improves reliability for short missions ($t \le t_0$). For long missions ($t > t_0$), TMR actually degrades reliability. The explanation for this observation is rather simple: until the first failure, the system operates in a TMR configuration and $R_{TMR} > R_{simplex}$. After the first failure, the two remaining components compete to fail, resulting in a lower reliability than that of a plain simplex system with the same failure rate. The same type of behavior is valid for any NMR system, i.e., any M-out-of-N system.

2.2 M-out-of-N Systems

In M-out-of-N systems, the critical factor is mission time rather than MTTF. The reliability of M-out-of-N systems is very high in the beginning, when the spare components tolerate failures. These systems can provide reliability of over a 10-hour period. But reliability falls sharply as time goes on, system redundancy is exhausted, and more hardware is subject to failure. M-out-of-N systems are useful in contexts such as aircraft control, where very high reliability is needed for a short period of time. M-out-of-N systems are also used in the fault-tolerant multiprocessor (FTMP) [Lal83] and software-implemented fault tolerance (SIFT) [Wen72].

For general M and N, M modules out of N modules need to function, as shown in Figure 5. The reliability of such a system is given by the following expression, where $R_m$ is the reliability of a single module:

$$R_{MN} = \sum_{i=0}^{N-M} \binom{N}{i} R_m^{N-i} \left(1 - R_m\right)^{i}$$

Figure 5: M-out-of-N System (states from N working down to N-M working, then failed)

Example

Computer systems used on board spacecraft must guarantee long-term uninterrupted operation (often for years) despite permanent failures induced by radiation present in space and by random events. Standby redundancy for both processors and memory is usually the technique of choice. This problem addresses the design of memory units with redundancy. To improve reliability, each k-bit memory word is expanded by s spare cells, so each row contains k + s memory cells. There are n rows in the memory system. Assume that the memory chips contain n rows and each row contains only one cell. The entire memory system, then, contains k + s chips. Each chip has a failure rate of $\lambda$ and obeys the exponential failure law.

Write an expression for the reliability of the complete memory system using column sparing. The expression must account for fault coverage. Suppose that k = 16, $\lambda = 9.15 \times 10^{-6}$ failures per hour, and a fault coverage C = . Determine the number of spares that will maximize the reliability of the memory at the end of a 10-year space mission. Calculate the reliability of the memory system at the end of 10 years (calculate for the case with zero spares and for the case with an optimal number of spares). Describe the complexities in answering this problem if the spare chips have a lower failure rate $\lambda_s$ than the k primary chips.

Solution

We have a memory system with k main and s spare memory chips. The chips fail at the same rate, $\lambda = 9.15 \times 10^{-6}$ per hour. The error coverage factor is C = . We wish to know the reliability of the system at the end of 10 years. There are $\binom{k+s}{i}$ ways for i chips to fail. The probability that i specific chips fail, and the remaining (k + s) - i do not, is $R_1(t)^{k+s-i}\left(1 - R_1(t)\right)^{i}$, and the probability that i failures are detected and covered is $C^i$. Consequently, the reliability is the sum of the probabilities that 0, 1, ..., s chips have failed and can be expressed as follows:

$$R(t) = \sum_{i=0}^{s} \binom{k+s}{i} R_1(t)^{k+s-i} \left(1 - R_1(t)\right)^{i} C^{i}$$

The term $R_1(t)$ is the probability that a particular memory chip has not failed at time t and is given by $R_1(t) = e^{-\lambda t}$. Solving the equation above, we find that for s = 29 (when k = 16 and t = 10 years), the reliability reaches its maximum. For s = 0, R(t) = , and for s = 29, R(t) = . Clearly, the 29 spares provide a significant improvement in reliability.
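The reliability expression of the solution can be evaluated directly. The following sketch assumes Python 3.8 or later (for `math.comb`); the function names and the `max_spares` search bound are hypothetical, and the coverage is left as a parameter that the reader must supply. The value 0.99 in the demonstration is a placeholder chosen only so that the code runs, not the value used in the problem.

```python
import math


def chip_reliability(lam, t):
    """R1(t) = exp(-lam * t) for a single memory chip."""
    return math.exp(-lam * t)


def memory_reliability(k, s, lam, t, coverage):
    """Sum over i = 0..s failed chips out of k + s, each failure covered with probability `coverage`."""
    r1 = chip_reliability(lam, t)
    n = k + s
    return sum(
        math.comb(n, i) * r1 ** (n - i) * (1 - r1) ** i * coverage ** i
        for i in range(s + 1)
    )


def best_spare_count(k, lam, t, coverage, max_spares):
    """Number of spares in 0..max_spares that maximizes the reliability."""
    return max(
        range(max_spares + 1),
        key=lambda s: memory_reliability(k, s, lam, t, coverage),
    )


if __name__ == "__main__":
    lam = 9.15e-6                # failures per hour, from the problem statement
    t = 10 * 365 * 24            # ten years in hours, assuming 365-day years
    placeholder_coverage = 0.99  # hypothetical value for illustration only
    s_best = best_spare_count(16, lam, t, placeholder_coverage, max_spares=60)
    print(s_best, memory_reliability(16, s_best, lam, t, placeholder_coverage))
```

Because each tolerated failure multiplies the term by the coverage, adding spares eventually stops paying off whenever the coverage is below 1, which is why an optimal spare count exists.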
The problem becomes more complicated if the spares fail at a different rate than the main chips. To compute the probability that n chips have failed, we need to compute the probability that a main chips failed and b spare chips failed, for all combinations of a and b such that a + b = n. This can be done as shown below, where i = a + b is the total number of failed chips and j = b is the number of failed spares:

$$R(t) = \sum_{i=0}^{s} \sum_{j=0}^{i} \binom{k}{i-j} \binom{s}{j} R_{\lambda_1}(t)^{k-i+j} \left(1 - R_{\lambda_1}(t)\right)^{i-j} R_{\lambda_2}(t)^{s-j} \left(1 - R_{\lambda_2}(t)\right)^{j} C^{i}$$

where $R_{\lambda_1}(t) = e^{-\lambda_1 t}$ is the reliability of the main chips and $R_{\lambda_2}(t) = e^{-\lambda_2 t}$ is the reliability of the spare chips.

2.3 The Effect of a Voter

The previous expression for the reliability of a TMR system assumes a voter that is 100% reliable. If we assume a voter reliability of $R_V$, we have:

$$R_{TMRV} = R_V \left( R_m^3 + \binom{3}{2} R_m^2 \left(1 - R_m\right) \right)$$

Generally, a voter can be implemented in hardware or software. The design and implementation of a robust voting mechanism is often considered a relatively simple task. In reality, the set of functions that a voter (or voting mechanism) should support depends on the system and the application and might include:

- Guaranteeing a majority vote on the input data to the voter
- Providing the ability to detect its own errors (i.e., the voter should be self-checking)
- Determining the faulty application replica (node)
- Handling voting in loosely synchronized (coupled) systems
- Handling voting in tightly synchronized (coupled) systems

Loosely synchronized systems require synchronization of inputs to the voter. It may also be difficult in these systems to determine the voter timeout because of differences in the relative speeds of the machines and variations in network communication delays. Tightly synchronized systems, on the other hand, generally do not require external synchronization of inputs to the voter.

A very instructive example of an actual voter implementation is the hardware voting mechanism used in a TMR architecture to guarantee that the three processing units synchronously serve external interrupts arriving at the system [Cut93] [Jew91]. The processors in the TMR system are configured to operate as a single logical processor. To function as a single processor, the instruction streams of the three processors must be identical. The synchronization mechanisms are designed so that if the code stream of any processor diverges from the code stream of the other processors, then a failure is indicated.

Interrupt synchronization presents one of the most difficult challenges in maintaining a single logical processor view. All interrupts are required to occur synchronously with virtual time (referred to as the cycle count). The cycle count can be thought of as a representation of virtual time, just as seconds are a representation of real time. The reference point for virtual time is a specific software event or instruction. The code stream after the reference point must be identical for the three processors. External exceptions (interrupts) are not inherently synchronous with virtual time. All interrupts generated by the I/O devices must be synchronized to virtual time (i.e., made synchronous with the individual processor's instruction stream) before they are presented to the processor. If an external interrupt were presented directly to the processor (i.e., without synchronization to virtual time), the three processors would start to process the interrupt at different instructions, which could lead to an unacceptable, inconsistent state of the system. Interrupts are synchronized to virtual time by performing a distributed vote on the interrupts and then presenting them to the processor on a predetermined cycle count (virtual time).

Figure 6 gives the block diagram of the interrupt synchronization logic. An external interrupt is delivered to the distributor present on each processing board (CPU unit). The distributor broadcasts the delivered interrupt to the other two CPUs via the inter-CPU bus. As a result, there are three pending interrupts, one from each CPU. The interrupt voter (also present on each processing board) captures the pending interrupts and performs a vote to verify that all of the CPUs received the external interrupt request. On a predetermined cycle count, the interrupt voter presents the interrupt to the microprocessor. All of the microprocessors receive the interrupt on the same cycle count, and thus the interrupt is synchronized to virtual time.

The interrupt voter uses a dedicated register (a holding register, not shown in Figure 6) to save state information indicating whether all CPUs captured and distributed an external interrupt. In an error-free scenario, as described above, this state information is not necessary, i.e., external interrupts can be synchronized to virtual time without the use of the holding register. The holding register provides a mechanism for the voter to know that the last interrupt vote cycle captured at least one, but not all, of the interrupt pending bits. There are two possible scenarios in which not all of the interrupt pending bits are set: (1) the external interrupt is asserted before the interrupt distribution cycle on some of the CPUs but after the interrupt distribution cycle on other CPUs; (2) at least one of the CPUs fails in a way that prevents the correct operation of the distributor. In the former case, the interrupt voter is guaranteed that all of the interrupt pending bits will be set on the next interrupt vote cycle, and thus all processors will receive the external interrupt at identical instructions. Consequently, if the interrupt voter discovers that the holding register has been set and not all of the interrupt pending bits are set, then an error must exist on one or more of the CPUs (the latter scenario). The interrupt voter presents the pending interrupt to the processor and also raises an interrupt-synchronization-error interrupt on the high-priority interrupt level. The system software serves this high-priority interrupt. Note that the scheme presented here represents a hardware implementation of the interactive consistency algorithm, which is used in distributed systems for providing data integrity of replicated processes.
The hardware implementation, however, is much less performance intensive than its software counterpart.

Figure 6: Interrupt Synchronization in a TMR System (CPU units A, B, and C, each with a distributor, an interrupt voter, and a microprocessor, connected by the inter-CPU bus)

3 INFORMATION REDUNDANCY

Information redundancy is the addition of redundant information to data to allow fault detection, fault masking, and fault tolerance. An example of information redundancy is a single-error-correction and double-error-detection (SEC-DED) code. A code's error detection and correction properties are based on its ability to partition a set of $2^n$ words, each n bits wide, into a code space of $2^m$ words and a noncode space of $2^n - 2^m$ words. Each code is constructed such that a given number of errors transforms a code-space word into a word in the noncode space. Decoding circuits detect errors by identifying any word outside the code space. Error correction is performed by more extensive decoding that uniquely associates a noncode-space word with the original code word transformed by the errors.

3.1 Fault Detection through Encoding

The basic idea behind an error-detecting scheme is to add redundant information to the data being transmitted or stored to determine whether errors have been introduced. The amount of check information is where error-detecting and error-correcting codes diverge. The former include only enough redundant information to allow the receiver to determine that an error has occurred, but not to locate it. Error-correcting codes, on the other hand, add enough redundancy to allow the receiver or reader to deduce what the transmitted or stored block of data must have been. Hamming and Shannon first developed error-correcting codes in the late 1940s for use when storing data on magnetic disks and core memories. Since then, the use of more sophisticated mathematical techniques has led to the invention of a plethora of encoding and decoding methods.

At the logic level, codes provide a means for masking or detecting errors. Formally, a code is a subset S of the universe U of possible vectors. A noncode word is a vector in the set U-S. These relations are shown in Figure 7. In the figure, $X_1$ is a codeword that, because of a multiple-bit error, becomes the noncodeword $X_3$; this error is detectable. The codeword $X_2$ becomes another codeword, $X_4$; this error is not detectable.

Figure 7: Logic of Error-Masking/Detecting Code (U: the $2^8$ possible vectors; S: the even-parity code words)

The ability of a code to detect and correct errors is determined by the minimum separation, or Hamming distance, between the words of a code space. The Hamming distance between two words is the number of bit positions in which they differ. The distance of a code is the minimum of the Hamming distances between all pairs of code words.
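These two definitions translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and words are represented as equal-length strings of '0' and '1' characters.

```python
from itertools import combinations


def hamming_distance(x, y):
    """Number of bit positions in which two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(x, y))


def code_distance(code):
    """Minimum Hamming distance over all pairs of distinct code words."""
    return min(hamming_distance(x, y) for x, y in combinations(code, 2))


if __name__ == "__main__":
    # All 3-bit words with even parity form a code of distance 2.
    even_parity_code = ["000", "011", "101", "110"]
    print(code_distance(even_parity_code))
```

For the even-parity code in the demonstration the distance is 2, which is why a single parity bit can detect, but not correct, any single-bit error.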
For example, for two code words x = (1011) and y = (0110), the Hamming distance is d(x,y) = 3. Using the notion of Hamming distance, it can be shown that, to detect all error patterns of a Hamming distance d, the code distance must be d+1; for example, a code with a distance of 2 can detect patterns with a distance of 1 (i.e., single-bit errors). To correct all error patterns with a Hamming distance of c, the code distance must be 2c+1. To detect all patterns with a Hamming distance of d and to correct all patterns with a Hamming distance of c, the code distance must be 2c+d+1 (note that d corresponds to the number of additional bit errors that can be detected). For example, a code with a distance of 3 can detect and correct all single-bit errors (i.e., c=1 and d=0).

3.2 Parity

One of the simplest error-detecting codes is the parity code, in which an extra bit is attached to an n-bit word to convert it to an even- or odd-parity word. A simple decoding circuit using a set of XOR gates will detect any single-bit error in the parity-coded word. Parity codes are used routinely in computers to check for errors in buses, memory, and registers. Table 1 compares parity codes for memories. Five strategies for calculating parity are considered: (1) bit-per-word parity, (2) bit-per-byte, (3) bit-per-multiple-chips, (4) bit-per-chip, and (5) interlaced parity. It is assumed that memory is constructed from individual chips, where each chip contains several bits of a data word.

Table 1: Comparison of Parity Codes

| Parity Code | Advantages | Disadvantages |
|---|---|---|
| Bit-per-word: one parity bit per data word | Detects all single-bit errors | Certain errors go undetected, e.g., a word, including the parity bit, becomes all 1s because of a failure of a bus or a set of data buffers. |
| Bit-per-byte: each data portion (e.g., a byte) is protected by a separate parity bit; the parity of one group should be even and the parity of the other group should be odd | Detects all-1s and all-0s conditions | Ineffective for multiple errors, e.g., a whole-chip failure |
| Bit-per-multiple-chips: one bit from each chip is associated with a single parity bit | Detects failure of an entire chip | Cannot locate the failure of a complete chip |
| Bit-per-chip: each parity bit is associated with one chip of the memory | Detects single-bit errors and identifies the chip with the erroneous bit | Susceptible to whole-chip failure, i.e., a single chip error can corrupt multiple bits, and this may go undetected. |
| Interlaced: similar to bit-per-multiple-chips; must ensure that no two adjacent bits are from the same parity group | Detects errors in adjacent bits | Parity groups are not based on the physical organization of the memory |

In high-speed memories, single-bit error-correcting and double-bit error-detecting (SEC-DED) codes are most commonly used. Before being written to the memory, the data are passed to a parity generator. The generated parity bit (or bits) is (are) then stored in the memory together with the data.
On a read operation, the data bits are passed to the parity checker, which regenerates the parity bit (or bits) and compares it with the parity bit(s) stored in the memory when the original data were written. The single-bit parity code has a minimum Hamming distance of two.

The following description gives more details on Hamming codes. In a Hamming single-error-correction code, c parity bits are added to a k-bit data word, forming a code word of k+c bits. The following expression can be used to determine the number c of check (parity) bits necessary to protect k bits of information: $2^c \ge c + k + 1$. Consider a data word of four information bits ($d_0$, $d_1$, $d_2$, $d_3$). According to the above expression, three parity bits ($p_1$, $p_2$, $p_3$) are needed to protect the four bits of data. To illustrate how the parity (check) bits are generated and checked, assume that the bits in the code word are numbered from 1 to k+c. Positions whose numbers are powers of two are reserved for the parity bits. The grouping of bits for parity generation and checking is determined from a list of the binary numbers from 0 to $2^c - 1$, as illustrated in Figure 8.
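The check-bit inequality and the position-based grouping can be sketched as follows. The function names are hypothetical; `parity_groups` assumes, as the numbering scheme above suggests, that the group for the parity bit at position $2^g$ contains every position whose binary representation has bit g set.

```python
def check_bits_needed(k):
    """Smallest c such that 2**c >= c + k + 1."""
    c = 0
    while 2**c < c + k + 1:
        c += 1
    return c


def parity_groups(k):
    """Map each parity position (a power of two) to the code-word positions it checks."""
    c = check_bits_needed(k)
    n = k + c
    return {
        2**g: [pos for pos in range(1, n + 1) if pos & (1 << g)]
        for g in range(c)
    }


if __name__ == "__main__":
    print(check_bits_needed(4))   # four data bits
    print(parity_groups(4))
```

For k = 4 the first function gives three check bits, and the groups are positions (1, 3, 5, 7), (2, 3, 6, 7), and (4, 5, 6, 7).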

Figure 8: Determining Parity/Check Bits for Hamming Code

Code word (positions 1 to 7): $p_1$ $p_2$ $d_0$ $p_3$ $d_1$ $d_2$ $d_3$

Parity bit calculation:
$p_1$ = XOR of bits (3, 5, 7)
$p_2$ = XOR of bits (3, 6, 7)
$p_3$ = XOR of bits (5, 6, 7)

Parity checking:
$c_1$ = XOR of bits (1, 3, 5, 7)
$c_2$ = XOR of bits (2, 3, 6, 7)
$c_3$ = XOR of bits (4, 5, 6, 7)

The first group is formed by the bits in the positions corresponding to the 1-bits in the least significant bit of the binary count sequence (i.e., bits 1, 3, 5, 7). The second group is formed by the bits in the positions corresponding to the 1-bits in the second least significant bit of the binary count sequence (i.e., bits 2, 3, 6, 7), and so on. Note that each group of bits starts with a number that is a power of two. These numbers are also the position numbers (in a code word) of the parity bits. The individual parity bits are calculated by performing an XOR operation on the data bits specified by a given group. For parity checking, the XOR operations also include the parity bit itself. As a result, the original data is encoded by generating a set of parity bits ($p_1$ $p_2$ $p_3$). To check correctness, the encoding process is repeated and a set of check bits ($c_1$ $c_2$ $c_3$) is generated. The binary word represented by the check bits $c_1$ $c_2$ $c_3$ forms a syndrome, which points directly to the position of the erroneous bit. Figure 9 shows the relations between the syndrome values and the bit position in error for the example of a four-bit data word. The error can be corrected by complementing the corresponding bit.

Figure 9: Error Detection and Correction Using Syndromes

The Hamming code discussed above can only detect and correct single-bit errors. By adding an extra parity bit, the Hamming code can be used to correct single-bit errors and to detect double errors.
In the example of a data word consisting of four information bits, the additional parity bit, $p_4$, can be calculated as the parity (XOR) over the first seven bits of the code word. For parity checking, the additional check bit, $c_4$, is calculated over all eight bits of the code word. Figure 10 illustrates the four cases distinguishable by the single-error-correction (SEC) and double-error-detection (DED) Hamming code.
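The encoding and checking rules for the four-bit example can be put together in a short sketch. It is a minimal illustration under stated assumptions, not a reference implementation: the function names and status strings are hypothetical, the code word is laid out as $p_1$ $p_2$ $d_0$ $p_3$ $d_1$ $d_2$ $d_3$ $p_4$, and $c_1$ is assumed to be the least significant bit of the syndrome so that the syndrome value equals the position of the erroneous bit.

```python
def encode_secded(data):
    """Encode [d0, d1, d2, d3] into 8 bits laid out as p1 p2 d0 p3 d1 d2 d3 p4."""
    d0, d1, d2, d3 = data
    p1 = d0 ^ d1 ^ d3          # data bits at positions 3, 5, 7
    p2 = d0 ^ d2 ^ d3          # data bits at positions 3, 6, 7
    p3 = d1 ^ d2 ^ d3          # data bits at positions 5, 6, 7
    word = [p1, p2, d0, p3, d1, d2, d3]
    p4 = 0
    for bit in word:           # overall parity over the first seven bits
        p4 ^= bit
    return word + [p4]


def decode_secded(word):
    """Return (status, data) for a received 8-bit word."""
    b = [0] + list(word)       # shift to 1-based positions
    c1 = b[1] ^ b[3] ^ b[5] ^ b[7]
    c2 = b[2] ^ b[3] ^ b[6] ^ b[7]
    c3 = b[4] ^ b[5] ^ b[6] ^ b[7]
    c4 = 0
    for bit in b[1:]:          # check bit over all eight positions
        c4 ^= bit
    syndrome = c1 + 2 * c2 + 4 * c3   # assumed ordering: c1 is the low bit
    if c4 == 0 and syndrome == 0:
        status = "no error"
    elif c4 == 1 and syndrome != 0:
        b[syndrome] ^= 1       # complement the bit the syndrome points to
        status = "single error corrected"
    elif c4 == 1:
        b[8] ^= 1              # only the extra parity bit is wrong
        status = "error in p4"
    else:
        status = "double error detected"   # returned data is not trustworthy
    return status, [b[3], b[5], b[6], b[7]]


if __name__ == "__main__":
    word = encode_secded([1, 0, 1, 1])
    word[4] ^= 1               # corrupt position 5 (data bit d1)
    print(decode_secded(word))
```

The four branches of `decode_secded` correspond to the four distinguishable cases: no error, a correctable single error, an error in the extra parity bit, and an uncorrectable double error.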

| $c_1 c_2 c_3$ | $c_4$ | Interpretation |
|---|---|---|
| 0 0 0 | 0 | No errors |
| $x_1 x_2 x_3$ (nonzero) | 1 | Single error (in position $x_1 x_2 x_3$) is detected and can be corrected |
| $y_1 y_2 y_3$ (nonzero) | 0 | Double error is detected but cannot be corrected |
| 0 0 0 | 1 | Error in parity bit $p_4$ |

Figure 10: Error Detection and Correction Using SEC-DED Code

4 CYCLIC REDUNDANCY CHECKS

Cyclic redundancy checks (CRCs) are used to detect errors in communication channels, tapes, and disks. Cyclic codes are parity check codes with the additional property that the cyclic shift of a codeword is also a codeword. If $(C_{n-1}, C_{n-2}, \ldots, C_1, C_0)$ is a codeword, then $(C_{n-2}, C_{n-3}, \ldots, C_0, C_{n-1})$ is also a codeword. The wide use of CRCs is due mainly to two factors: (1) simplicity of implementation (the needed hardware includes linear feedback shift registers and XOR gates) and (2) the ability to detect single-bit errors, multiple adjacent bit errors affecting fewer than n-k bits (for an (n,k) code), and burst transient errors (typical of communication applications).

The idea is to append a checksum to the end of the data frame in such a way that the polynomial represented by the resulting frame is divisible by the generator polynomial G(x) that the sender and receiver have agreed upon. When the receiver gets the checksummed frame, it divides it by G(x); if the remainder is not zero, there has been a transmission error. It is then clear that the best generator polynomials are those least likely to divide evenly into a frame that contains errors. CRCs are distinguished by the generator polynomials they use.

A cyclic code is often described as an (n, k) code, where n is the total length of each code word and k is the number of information bits, leaving n-k bits of redundancy. The characteristics of a cyclic code depend on the proportion of these two factors, n and k, and on its generator polynomial. IBM SDLC (Synchronous Data Link Control), a transmission protocol, employs CRC-16, which has 16 bits of redundancy and the generator polynomial $G(X) = X^{16} + X^{15} + X^{2} + 1$.
Large transmission systems, such as Ethernet and Token Ring, use a 32-bit CRC for data protection. Other CRCs widely used in link-level protocols include CRC-8, CRC-10, and CRC-12; [Pet96] discusses their corresponding G(x)s.

One of the problems with cyclic codes is that the position of the erroneous bit cannot be determined directly during the decoding process. If the nonzero remainder does not contain enough information about the position of the erroneous bit, the receiver cannot correct the error and has to request that the sender retransmit. This process requires a large buffer at both the sender and the receiver to store all transmitted data that may have to be retransmitted and to reconstruct the correct receiving order [Ben95]. The retransmission procedure is time consuming, but if the error rate is not too high, it is efficient.

One approach to realizing error correction is to use look-up tables [Man95]. A look-up table contains all possible patterns of nonzero remainders and the corresponding position of the corrupt bit in the received data. This method is fast but requires a large amount of storage for the table. In some applications, the use of a look-up table is impractical in terms of cost and overhead.
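The divide-and-check procedure can be sketched with plain bit lists. This is an illustrative software model, not the shift-register hardware mentioned above: the function names are hypothetical, bits are listed most significant first, and the generator used in the demonstration is assumed to be the standard CRC-16 polynomial $x^{16} + x^{15} + x^{2} + 1$.

```python
def poly_mod(bits, generator):
    """Remainder of dividing the bit polynomial by the generator over GF(2)."""
    r = len(generator) - 1          # degree of the generator
    work = list(bits)
    for i in range(len(work) - r):
        if work[i]:                 # subtract (XOR) the generator aligned at i
            for j, g in enumerate(generator):
                work[i + j] ^= g
    return work[-r:]


def append_crc(data, generator):
    """Append r check bits so that the whole frame is divisible by the generator."""
    r = len(generator) - 1
    return list(data) + poly_mod(list(data) + [0] * r, generator)


def frame_is_valid(frame, generator):
    """The receiver accepts the frame only if the remainder is zero."""
    return not any(poly_mod(frame, generator))


# Assumed CRC-16 generator: exponents 16, 15, 2, and 0 are set.
CRC16 = [1 if e in (16, 15, 2, 0) else 0 for e in range(16, -1, -1)]

if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1]
    frame = append_crc(data, CRC16)
    print(frame_is_valid(frame, CRC16))
    frame[5] ^= 1                   # single-bit transmission error
    print(frame_is_valid(frame, CRC16))
```

Note that a failed check tells the receiver only that the frame is corrupt; as discussed above, it does not by itself identify which bit is wrong.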

4.1 Checksums

Checksums are commonly used in communication applications. The idea is to add up all the words to be transmitted and then transmit the sum (called the checksum) along with the data. At the receiving end, the checksum is recalculated and compared with the original. If any of the data, including the checksum, is corrupted during transmission, the result is a mismatch. This method does not protect against errors that cause data words to arrive out of order.

Checksum codes differ in the way in which the checksum is generated. The single-precision checksum is generated by adding the n-bit words to be transmitted modulo $2^n$, i.e., ignoring any overflow. Its weakness is that errors that cause the original and the recalculated checksums to differ only in the ignored bit position are not detected, as illustrated in Figure 11. A most significant data line stuck at 1 is an example of an error that exposes this flaw.

Figure 11: Error Scenario Not Detected by a Single-Precision Checksum (based on [Joh89]). a) The checksum is formed (the carry is ignored). b) With a faulty line stuck at 1 between sender and receiver, the error goes undetected.

In the Honeywell checksum, adjacent data words are concatenated prior to computing the checksum; thus K n-bit words are grouped into K/2 2n-bit words. This structure is capable of detecting a bit error that affects all words in the same bit position, because such an error makes the two checksums differ in two locations. Nevertheless, overflow can still cause loss of carry-bit information.

The key disadvantage of checksums is their limited ability to diagnose the actual cause of a problem. The observed problem can be due to (1) an error in the checksum calculation, (2) a transmission error, or (3) a corruption of the original data (before the checksum computation).
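The stuck-at-1 weakness of the single-precision checksum can be reproduced in a few lines. The sketch assumes that "ignoring any overflow" means addition modulo $2^n$ for n-bit words, and it models the fault by forcing one bit line to 1 on every transmitted word, including the checksum; the function names and the four-bit data values are hypothetical.

```python
def single_precision_checksum(words, n_bits):
    """Sum the words and drop any carry out of the n_bits-wide result."""
    return sum(words) % (1 << n_bits)


def transmit(words, n_bits, stuck_at_1_line=None):
    """Send data plus checksum; optionally force one bit line to 1 on every word."""
    frame = list(words) + [single_precision_checksum(words, n_bits)]
    if stuck_at_1_line is not None:
        frame = [w | (1 << stuck_at_1_line) for w in frame]
    return frame


def receiver_accepts(frame, n_bits):
    """Recompute the checksum over the received data and compare."""
    *data, received_checksum = frame
    return single_precision_checksum(data, n_bits) == received_checksum


if __name__ == "__main__":
    data = [0b0001, 0b0010, 0b0011, 0b0100]        # checksum is 0b1010
    frame = transmit(data, 4, stuck_at_1_line=3)   # most significant line stuck at 1
    # All four data words arrive corrupted, yet the check still passes.
    print(frame[:-1] != data, receiver_accepts(frame, 4))   # prints: True True
```

Each corrupted word is off by 8, so the four errors add up to 32, which vanishes modulo 16; the discarded carry is exactly where the evidence of the fault ends up.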
A checksum mismatch alone does not reveal which of the three causes listed above occurred.

4.2 Arithmetic Codes

Arithmetic codes detect errors in arithmetic units such as adders and multipliers. They are useful for checking arithmetic operations, because parity codes are not preserved under addition and subtraction. Separate arithmetic codes keep the check symbols apart from the data symbols, whereas nonseparate arithmetic codes combine the two. Examples of arithmetic codes are AN codes, residue codes, and bi-residue codes.
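As a minimal sketch of a separate arithmetic code, the residue code below keeps the check symbol X mod 3 next to the data word X and verifies an addition by adding the check symbols independently. The check modulus 3, the function names, and the fault model are assumptions made for this illustration.

```python
CHECK_MODULUS = 3  # illustrative check base (assumption for this sketch)


def encode(x):
    """Return (data, check symbol); the check symbol is stored separately."""
    return x, x % CHECK_MODULUS


def checked_add(a, b, adder=lambda p, q: p + q):
    """Add two encoded operands and verify the sum with the residue check."""
    (x, rx), (y, ry) = a, b
    total = adder(x, y)                     # main, possibly faulty, adder
    expected = (rx + ry) % CHECK_MODULUS    # independent check-symbol adder
    return total, total % CHECK_MODULUS == expected


def faulty_adder(p, q):
    """Hypothetical adder whose bit-2 output is inverted by a fault."""
    return (p + q) ^ 0b100


good = checked_add(encode(25), encode(17))
bad = checked_add(encode(25), encode(17), adder=faulty_adder)
print(good)  # prints (42, True)
print(bad)   # prints (46, False)
```

With a check modulus of 3, any single-bit error in the sum changes it by a power of two, which is never a multiple of 3, so the check always fails for such an error.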

For example, in an AN arithmetic code, data X is multiplied by a check base A to form A·X. Addition of code words is performed modulo M, where A divides M. This yields $A(X +_m Y) = AX +_M AY$, where $+_M$ denotes addition modulo M and m = M/A. The correctness of the operation is checked by dividing the result by A. If the remainder is zero, there is no error; otherwise, there is an error. This is illustrated in Figure 12. For more information about coding in reliable computer systems, the reader is referred to [Rao89, Blah84].

[Figure 12: Example Arithmetic Code. The code words AX and AY are added modulo M, and the residue modulo A of the sum is checked.]

5 TIME REDUNDANCY

The basic concept of time redundancy is to repeat a computation two or more times and compare the results to determine whether a discrepancy exists. If an error is detected, the computation can be performed again to see whether the disagreement remains or disappears. Such approaches are good at detecting errors due to transient faults, but they cannot protect against errors resulting from permanent faults.

Another form of time redundancy, intended to handle permanent faults, modifies the way the computation is performed the second time. One approach uses alternating logic for self-dual combinational circuits [Rey78]: the circuit computes a function on some set of inputs at one time instant and computes the same function on the complemented inputs in a subsequent time step. The second output should be the complement of the first; if it is not, an error is detected.

The second approach uses recomputing with shifted operands [Pat82], which is applicable to bit-sliced hardware organizations. In the first step, the normal computation is performed on the operands and the result is stored in a register.
In the next step, the operands are shifted left by k bits, the operation is repeated, and the output is shifted right by k bits and compared with the result of the previous computation. Any error in k-1 consecutive bit slices of an arithmetic or logical operation will be detected by this method. The additional hardware required consists of three shifters, a storage register to hold the result of the first computation, and a comparator.

A variant of this method is called recomputing with swapped operands. In the first of the two steps, the operation is performed in its normal form. In the following time step, the upper and lower halves of the operands are swapped, so that a faulty bit slice operates on opposite halves of the operands in the two computations. The additional hardware required consists of several multiplexers, a storage register, and a comparator.

6 WATCHDOG TIMERS

Watchdog timers have been used since the early days of digital systems as an inexpensive method of error detection. A timer is implemented separately from the process that it monitors. The process being watched must reset the timer before the timer expires; otherwise, the watched process is assumed to be faulty.
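The basic mechanism can be sketched in software as follows. This is a minimal illustration, not an implementation from any of the cited systems: the class names, the injectable clock, and the tick values are assumptions. In a real deployment the clock would default to `time.monotonic` and `expired()` would be polled by a monitor that runs independently of the watched process.

```python
import time


class WatchdogTimer:
    """Software watchdog: the monitored process must call reset() in time."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.deadline = clock() + timeout

    def reset(self):
        """Called by the monitored process to show that it is still running."""
        self.deadline = self.clock() + self.timeout

    def expired(self):
        """Called by the independent monitor; True means 'assume faulty'."""
        return self.clock() >= self.deadline


class FakeClock:
    """Hypothetical simulated clock so the example runs deterministically."""

    def __init__(self):
        self.now = 0

    def __call__(self):
        return self.now

    def advance(self, ticks):
        self.now += ticks


clock = FakeClock()
watchdog = WatchdogTimer(timeout=10, clock=clock)

clock.advance(6)
watchdog.reset()                 # process checks in at t = 6
clock.advance(6)
healthy = watchdog.expired()     # t = 12, deadline is 16: not expired
clock.advance(6)
faulty = watchdog.expired()      # t = 18, no reset since t = 6: expired
```

On expiry, the monitor would trigger whatever recovery the system defines, such as a reset or a recovery interrupt.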

Traditionally, watchdog timers are used to detect control-flow errors that result in the timer not being reset [Pras89]. When the timer expires, the system is reset. Alternatively, instead of resetting the system, an interrupt can be triggered to initiate recovery from the error. Watchdog timers can also be used in much the same way that timeouts are used to monitor the behavior of a single subsystem [Ore75]. Timeouts differ from watchdog timers in that they provide a finer check of control flow.

Watchdog timers can be implemented in hardware (the timer is generally an external one that can be reset with a signal) or in software (the timer often runs on the same processor as the process being monitored, but it is maintained as a separate process) [Sie98, Pras89].

A novel implementation of the watchdog timer effect without using a timer is the technique of extended-precision checksum-based control-flow checking [Sax90]. An extended-precision checksum is taken over a branch-free block of instructions, as the sum total of the instructions or of some transformation of the instructions. Before each block, the checksum value is sent to a buffer. As the instructions execute, they are subtracted from the buffer. When the block ends, or a branch occurs, a zero-check signal is sent. If the buffer becomes zero or negative before the signal is sent, a control-flow error has occurred. If the buffer is positive when the signal is sent, an error has occurred as well.

6.1 Example Applications of Watchdog Timers

Pluribus Reliable Multiprocessor. An example of a system designed with extensive use of watchdog timers is the Pluribus multiprocessor [Sax90]. Pluribus was built primarily for research purposes; its main goal is high reliability. The behavior of Pluribus as a whole is not monitored, but hardware and software timers monitor almost every subsystem.
This approach increases overall system reliability, since a subsystem that fails due to an intermittent or transient fault is restarted and is not allowed to cause a system failure. While Pluribus uses other error-detection techniques, those techniques are usually combined with a timer. The timers range from five microseconds to two minutes in duration.

The Pluribus subsystems cycle with a characteristic time constant. During each cycle, the subsystem performs a complete self-check for consistency. Passage through the cycle means the subsystem is operating correctly; a lapse of too much time without a timer reset indicates that the subsystem has suffered a failure from which it cannot recover by itself.

An example subsystem is the free message buffer list, where message buffers are stored when not in use. Buffers leave the list for at most two minutes, so a 2-minute timer is maintained for each buffer. If a timer runs out, a failure has occurred and the buffer will not return to the free list on its own, so the monitored buffer is forced back onto the free list. In this case, the failure causes a degradation in system performance lasting less than two minutes, during which the system operates with too few message buffers. However, the error causes no data loss, since the timer facilitates a complete recovery.

Another example is the failure of the mutual exclusion locks that are on each subsystem. A lock failure can leave a resource locked when no subsystem is using it. A subsystem trying to use the resource is put in a waiting loop. Since the lock failed, the resource will never become free, but a 1/15-second timer interrupts the processor, which arbitrarily unlocks the resource. Aside from the temporary (1/15-second) degradation in system performance, the system is unaffected by the error. As in the first example, no data is lost and a complete recovery is possible.
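The free-buffer-list timer described above can be sketched as follows. This is not Pluribus code: the class, its method names, and the simulated clock are hypothetical, and only the 2-minute limit (120 seconds) comes from the text.

```python
class BufferPool:
    """Free list whose buffers are reclaimed if they stay out too long."""

    def __init__(self, count, max_lease, clock):
        self.free = list(range(count))
        self.max_lease = max_lease
        self.clock = clock
        self.deadlines = {}   # buffer id -> time by which it must be back

    def acquire(self):
        """Take a buffer off the free list and start its timer."""
        buf = self.free.pop()
        self.deadlines[buf] = self.clock() + self.max_lease
        return buf

    def release(self, buf):
        """Normal return of a buffer; ignored if it was already reclaimed."""
        if self.deadlines.pop(buf, None) is not None:
            self.free.append(buf)

    def reclaim_expired(self):
        """Force overdue buffers back onto the free list; return their ids."""
        now = self.clock()
        overdue = [b for b, d in self.deadlines.items() if now >= d]
        for buf in overdue:
            del self.deadlines[buf]
            self.free.append(buf)
        return overdue


now = [0]                                    # simulated time in seconds
pool = BufferPool(count=2, max_lease=120, clock=lambda: now[0])

returned = pool.acquire()
lost = pool.acquire()                        # its holder fails and never returns it
pool.release(returned)

now[0] = 121                                 # just over two minutes later
reclaimed = pool.reclaim_expired()           # the lost buffer is forced back
```

The pool runs short of buffers only until the timer fires, which mirrors the bounded performance degradation described for Pluribus.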

A more drastic error, from which watchdog timers aid in partial recovery, is the permanent failure of a processor. When a processor fails, any message buffer it had not returned to the free buffer list is returned by the operation of the timer monitoring the list, as described above. Likewise, any resource the processor had locked is unlocked. While a complete recovery is not immediately possible, since the processor itself must be repaired or replaced, the system can remain up and the error can be limited to the single processor.

VAX-11/780. A multiprocessor system designed for more commercial applications that makes use of a watchdog timer is the VAX-11/780 [Sie98]. On this system, the console processor runs a watchdog process that is reset when an interrupt line is strobed. If the line is not strobed by a processor within 200 microseconds, a failure is indicated and the console processor attempts to determine the reason for the failure.

Bell System Telephone Switches. Yet another system that employs watchdog timers to detect errors is the stored-program telephone switching system developed by the Bell System [Con97]. External watchdog timers monitor proper program operation by triggering recovery when the timers are not periodically reset. This allows problems caused by software errors to be detected early, before the error propagates and causes severe damage to the system, and consequently makes recovery easier. It should be noted that, despite the watchdog-based error detection, software techniques known as audits were the main line of defense against errors.

Mars Sojourner. An example in which a watchdog timer demonstrated its ability to detect errors is NASA's Mars Pathfinder mission with the Sojourner rover [Jon97]. The computer system that controlled the Sojourner rover uses a real-time, preemptive, multithreaded operating system. Tasks are scheduled based on priorities that reflect their relative urgency.
Due to a design flaw, a condition known as priority inversion could occur. To illustrate priority inversion, consider the following execution scenario: (1) a low-priority thread obtains a mutually exclusive lock to access shared data, (2) under these conditions, a long-running task with a higher priority than the low-priority thread is scheduled because of an interrupt, and (3) the higher-priority thread needs access to the data locked by the lower-priority task. As a result, (1) the lower-priority task is prevented from running by the higher-priority thread, and (2) the higher-priority task is also prevented from running, because it blocks waiting for the lower-priority thread to release the lock.

The watchdog timer detected this scenario and the system was restarted. However, the full restart caused loss of data, and the repeated resets seriously limited the correct operation of the rover's computer system. The problem was eventually diagnosed, and the software was patched to reestablish proper behavior. In this system, the recovery method applied when the watchdog times out is a traditional system reset, a drastic but robust measure that represents good engineering practice. The availability of the system is much more important than the data lost due to the system reset.

6.2 Limitations of Watchdog Timers

Watchdog timers are not ideal for detecting errors in digital systems. The reasons for this fall into four areas:

1. While the error detection is not limited to any particular fault model, watchdog timers detect only errors of a very specific type. The assumption is that any error will manifest itself as a control-flow error such that the system does not continue to reset the timer. If a control-flow error occurs but the program resets the timer in time, the error goes undetected.

2. Timer resets must be placed with care to be effective. They cannot be placed inside interrupt routines or loops (to avoid the possibility that an infinite loop keeps resetting the timer), but they must occur often enough that the timer cannot expire during any normal operation.

3. Only processes with relatively deterministic runtimes can be checked, since the error detection is based entirely on the time between timer resets. If the set time is shorter than the longest possible runtime of the checked process, the timer can expire even though there is no error. On the other hand, if the time is set too long, then even if a control-flow error occurs, the process may have enough time to get back to the point at which the timer is reset, and the error will not be detected.

4. A watchdog timer provides only an indication of possible process failure; a partially failed process may still be able to reset the timer. Coverage is limited, as neither the data nor the results are checked.

When used to reset the system, a watchdog timer can improve availability (the mean time to recovery is shortened) but not reliability (failures are just as likely to occur). When the availability of a digital system is more important than the loss of data under some condition, using a watchdog timer to reset the system when an error is detected is an appropriate choice.

7 HEARTBEATS

Heartbeat is a common approach to detecting process and node failures in a distributed (networked) computing environment. Periodically, a monitoring entity sends a message (a heartbeat) to a monitored node or process and waits for a reply. If the monitored node does not respond within a predefined timeout interval, it is declared failed and an appropriate recovery action is initiated.

7.1 Limitations of Traditional Heartbeats

There are two major problems associated with the traditional heartbeat scheme.

First, the timeout period is pre-negotiated by the two parties or sometimes even hard-coded by the programmer.
The predefined timeout value cannot adapt to changes in network traffic or to load variability on individual nodes. In cases of high network traffic, high load on the nodes, or a slow node, the timeout value can be too short and cause the monitoring node to declare a healthy node faulty. Such a false alarm is undesirable in a distributed environment, especially for critical applications such as those used in commercial banking and in database systems.

Second, the monitored node is assumed to be healthy if it is able to respond to a heartbeat message. This is usually acceptable for a single-threaded application. However, in a multithreaded application, an independent thread of execution is usually responsible for replying to the heartbeat message. The healthy operation of this thread does not necessarily imply the healthy operation of the entire multithreaded application. Other threads inside the process may be in a deadlock that keeps the entire process from making progress; alternatively, other threads could be operating in a corrupted state that keeps the process from providing proper service.

Adaptive and smart heartbeat algorithms address these two problems. A heartbeat algorithm is called adaptive if the timeout value used by the monitor is not fixed but is periodically negotiated between the two parties to adapt to changes in network traffic or node load. A heartbeat algorithm is called smart if the entity being monitored executes a set of predefined checks to verify the robustness of the entire process and only then responds to the monitoring process.
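As a rough sketch of the "smart" property, the replier below acknowledges a heartbeat only if every registered worker has made progress since the previous heartbeat. The progress-counter check, the class, and the worker names are assumptions chosen for illustration; they are one possible "predefined check", not the mechanism of any particular system, and thread synchronization is omitted for brevity.

```python
class SmartReplier:
    """Acknowledge a heartbeat only if every registered worker is progressing."""

    def __init__(self):
        self.progress = {}    # worker name -> units of work completed so far
        self.last_seen = {}   # snapshot of progress at the previous heartbeat

    def register(self, worker):
        self.progress[worker] = 0
        self.last_seen[worker] = 0

    def work_done(self, worker):
        """Called by a worker each time it completes a unit of work."""
        self.progress[worker] += 1

    def on_heartbeat(self):
        """Return True (send an ack) only if no worker has stalled."""
        stalled = [w for w, n in self.progress.items() if n == self.last_seen[w]]
        self.last_seen = dict(self.progress)
        return not stalled


replier = SmartReplier()
for name in ("receiver", "sender"):          # hypothetical worker threads
    replier.register(name)

replier.work_done("receiver")
replier.work_done("sender")
first_ack = replier.on_heartbeat()           # both workers advanced

replier.work_done("receiver")                # "sender" is stuck, e.g. deadlocked
second_ack = replier.on_heartbeat()          # the ack is withheld
```

A traditional replier would have acknowledged the second heartbeat, because the thread that answers heartbeats is itself still running. A check of this kind would also flag a healthy but idle worker, so a practical design needs a more careful notion of progress.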

7.2 Designing Adaptive, Smart Heartbeats

To illustrate the concept of adaptive, smart heartbeats, two independent, multithreaded processes are created: a heartbeat replier and a heartbeat monitor [Bas00]. Heartbeat_monitor is the monitoring entity responsible for periodically sending heartbeat request messages to the target node. Heartbeat_replier is the monitored entity that responds to the heartbeat request messages sent by the monitor. The adaptive scheme uses Jacobson's algorithm [Tan96], which adjusts the timeout value according to measured network performance, expressed as the round trip time (RTT) of message transmission. The heartbeat algorithm is made smart (i.e., able to verify the robustness of the entire process) by using a null test message inside the process to test the healthy operation of all the threads within the process. In the following sections, we present the implementations of these two schemes.

The heartbeat protocol is depicted in Figure 13. Periodically, the heartbeat monitor sends a heartbeat message to the heartbeat replier, clears the counter ack_missed, and starts the timer. The duration of the timer is dictated by the current value of the timeout variable associated with the heartbeat replier.

[Figure 13: Protocol for Adaptive Heartbeat. Timelines for the heartbeat monitor and the heartbeat replier, showing the HB message, the HB ack, the RTT, the timeout expiration, and the heartbeat period.]

On the other side, the heartbeat replier responds with a heartbeat acknowledgment message. If the heartbeat acknowledgment message is received by the heartbeat monitor before the timer expires, the monitor assumes that the remote process is alive; otherwise, the counter ack_missed is incremented. If the counter has not reached its maximum value, a further heartbeat message can be sent from the heartbeat monitor to the heartbeat replier; otherwise, the remote process is assumed to be faulty. Crucial for the protocol are the values of the timeout and the heartbeat period.
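The monitor side of this protocol can be sketched as follows. This is an illustration under assumptions, not the implementation of [Bas00]: the gains 1/8 and 1/4 and the factor 4 are the values commonly used with Jacobson's algorithm for TCP retransmission timers, the maximum of three missed acknowledgments is arbitrary, and `send_heartbeat` is a hypothetical callable standing in for the network exchange.

```python
class AdaptiveTimeout:
    """Jacobson-style estimator: timeout = smoothed RTT + 4 * mean deviation."""

    def __init__(self, first_rtt, alpha=0.125, beta=0.25):
        self.srtt = first_rtt          # smoothed round trip time
        self.rttvar = first_rtt / 2    # smoothed mean deviation of the RTT
        self.alpha = alpha             # gain for the RTT average (assumed value)
        self.beta = beta               # gain for the deviation (assumed value)

    def update(self, rtt):
        """Fold one measured RTT sample into the estimate."""
        self.rttvar += self.beta * (abs(rtt - self.srtt) - self.rttvar)
        self.srtt += self.alpha * (rtt - self.srtt)

    @property
    def timeout(self):
        return self.srtt + 4 * self.rttvar


def probe(send_heartbeat, estimator, max_missed=3):
    """Run one heartbeat period; return True if the replier is judged alive.

    send_heartbeat(timeout) is a hypothetical callable that sends one
    heartbeat and returns the measured RTT, or None if no ack arrived in time.
    """
    ack_missed = 0                     # cleared at the start of each period
    while ack_missed < max_missed:
        rtt = send_heartbeat(estimator.timeout)
        if rtt is not None:
            estimator.update(rtt)
            return True
        ack_missed += 1
    return False


estimator = AdaptiveTimeout(first_rtt=0.1)
initial_timeout = estimator.timeout              # 0.1 + 4 * 0.05

replies = iter([None, 0.3])                      # one missed ack, then a slow ack
alive = probe(lambda timeout: next(replies), estimator)
adapted_timeout = estimator.timeout              # grows after the slow sample

dead = not probe(lambda timeout: None, estimator)
```

After the slow sample, the timeout rises from about 0.3 to about 0.475 time units, so a loaded but healthy replier is less likely to be declared faulty in the next period.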
In general, the heartbeat period can be fixed as a multiple of the current value of the timeout. It is, however, desirable to have a timeout value that adapts to the current response time of the remote process. The response time, as seen by the heartbeat monitor, is a function of the current load on the remote machine and of the time required to transfer the heartbeat message and the heartbeat acknowledgment; i.e., the response time is a function of the round trip time (RTT). To calculate the RTT, it is sufficient to include in the heartbeat message a timestamp whose value is the sending time. The replier sends this timestamp back to the monitor, so when the monitor receives a heartbeat acknowledgment, it can calculate the instantaneous RTT as the difference between the current time and that timestamp. However, it turns out that such a solution still does not perform well in the case of a variable workload. The main problem is the variability of


More information

COMPUTER NETWORKS UNIT I. 1. What are the three criteria necessary for an effective and efficient networks?

COMPUTER NETWORKS UNIT I. 1. What are the three criteria necessary for an effective and efficient networks? Question Bank COMPUTER NETWORKS Short answer type questions. UNIT I 1. What are the three criteria necessary for an effective and efficient networks? The most important criteria are performance, reliability

More information

Lecture / The Data Link Layer: Framing and Error Detection

Lecture / The Data Link Layer: Framing and Error Detection Lecture 2 6.263/16.37 The Data Link Layer: Framing and Error Detection MIT, LIDS Slide 1 Data Link Layer (DLC) Responsible for reliable transmission of packets over a link Framing: Determine the start

More information

Defect Tolerance in VLSI Circuits

Defect Tolerance in VLSI Circuits Defect Tolerance in VLSI Circuits Prof. Naga Kandasamy We will consider the following redundancy techniques to tolerate defects in VLSI circuits. Duplication with complementary logic (physical redundancy).

More information

Chapter 10 Error Detection and Correction 10.1

Chapter 10 Error Detection and Correction 10.1 Chapter 10 Error Detection and Correction 10.1 10-1 INTRODUCTION some issues related, directly or indirectly, to error detection and correction. Topics discussed in this section: Types of Errors Redundancy

More information

Data Link Layer: Overview, operations

Data Link Layer: Overview, operations Data Link Layer: Overview, operations Chapter 3 1 Outlines 1. Data Link Layer Functions. Data Link Services 3. Framing 4. Error Detection/Correction. Flow Control 6. Medium Access 1 1. Data Link Layer

More information

Lecture 4: CRC & Reliable Transmission. Lecture 4 Overview. Checksum review. CRC toward a better EDC. Reliable Transmission

Lecture 4: CRC & Reliable Transmission. Lecture 4 Overview. Checksum review. CRC toward a better EDC. Reliable Transmission 1 Lecture 4: CRC & Reliable Transmission CSE 123: Computer Networks Chris Kanich Quiz 1: Tuesday July 5th Lecture 4: CRC & Reliable Transmission Lecture 4 Overview CRC toward a better EDC Reliable Transmission

More information

3. Data Link Layer 3-2

3. Data Link Layer 3-2 3. Data Link Layer 3.1 Transmission Errors 3.2 Error Detecting and Error Correcting Codes 3.3 Bit Stuffing 3.4 Acknowledgments and Sequence Numbers 3.5 Flow Control 3.6 Examples: HDLC, PPP 3. Data Link

More information

EE 6900: FAULT-TOLERANT COMPUTING SYSTEMS

EE 6900: FAULT-TOLERANT COMPUTING SYSTEMS EE 6900: FAULT-TOLERANT COMPUTING SYSTEMS LECTURE 6: CODING THEORY - 2 Fall 2014 Avinash Kodi kodi@ohio.edu Acknowledgement: Daniel Sorin, Behrooz Parhami, Srinivasan Ramasubramanian Agenda Hamming Codes

More information

Telecom Systems Chae Y. Lee. Contents. Flow Control Error Detection/Correction Link Control (Error Control) Link Performance (Utility)

Telecom Systems Chae Y. Lee. Contents. Flow Control Error Detection/Correction Link Control (Error Control) Link Performance (Utility) Data Link Control Contents Flow Control Error Detection/Correction Link Control (Error Control) Link Performance (Utility) 2 Flow Control Flow control is a technique for assuring that a transmitting entity

More information

The data link layer has a number of specific functions it can carry out. These functions include. Figure 2-1. Relationship between packets and frames.

The data link layer has a number of specific functions it can carry out. These functions include. Figure 2-1. Relationship between packets and frames. Module 2 Data Link Layer: - Data link Layer design issues - Error Detection and correction Elementary Data link protocols, Sliding window protocols- Basic Concept, One Bit Sliding window protocol, Concept

More information

TYPES OF ERRORS. Data can be corrupted during transmission. Some applications require that errors be detected and corrected.

TYPES OF ERRORS. Data can be corrupted during transmission. Some applications require that errors be detected and corrected. Data can be corrupted during transmission. Some applications require that errors be detected and corrected. TYPES OF ERRORS There are two types of errors, 1. Single Bit Error The term single-bit error

More information

CSE123A discussion session

CSE123A discussion session CSE23A discussion session 27/2/9 Ryo Sugihara Review Data Link Layer (3): Error detection sublayer CRC Polynomial representation Implementation using LFSR Data Link Layer (4): Error recovery sublayer Protocol

More information

Link Layer: Error detection and correction

Link Layer: Error detection and correction Link Layer: Error detection and correction Topic Some bits will be received in error due to noise. What can we do? Detect errors with codes Correct errors with codes Retransmit lost frames Later Reliability

More information

CS422 Computer Networks

CS422 Computer Networks CS422 Computer Networks Lecture 3 Data Link Layer Dr. Xiaobo Zhou Department of Computer Science CS422 DataLinkLayer.1 Data Link Layer Design Issues Services Provided to the Network Layer Provide service

More information

6. Fault Tolerance. CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sect

6. Fault Tolerance. CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sect 6. Fault Tolerance (a) Introduction. (b) Types of faults. (c) Fault models. (d) Fault coverage. (e) Redundancy. (f) Fault detection techniques. (g) Hardware fault tolerance. (h) Software fault tolerance.

More information

COMPUTER ARCHITECTURE AND ORGANIZATION Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital

COMPUTER ARCHITECTURE AND ORGANIZATION Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital hardware modules that accomplish a specific information-processing task. Digital systems vary in

More information

TCP: Flow and Error Control

TCP: Flow and Error Control 1 TCP: Flow and Error Control Required reading: Kurose 3.5.3, 3.5.4, 3.5.5 CSE 4213, Fall 2006 Instructor: N. Vlajic TCP Stream Delivery 2 TCP Stream Delivery unlike UDP, TCP is a stream-oriented protocol

More information

CS 4453 Computer Networks Winter

CS 4453 Computer Networks Winter CS 4453 Computer Networks Chapter 2 OSI Network Model 2015 Winter OSI model defines 7 layers Figure 1: OSI model Computer Networks R. Wei 2 The seven layers are as follows: Application Presentation Session

More information

Errors. Chapter Extension of System Model

Errors. Chapter Extension of System Model Chapter 4 Errors In Chapter 2 we saw examples of how symbols could be represented by arrays of bits. In Chapter 3 we looked at some techniques of compressing the bit representations of such symbols, or

More information

Networking Link Layer

Networking Link Layer Networking Link Layer ECE 650 Systems Programming & Engineering Duke University, Spring 2018 (Link Layer Protocol material based on CS 356 slides) TCP/IP Model 2 Layer 1 & 2 Layer 1: Physical Layer Encoding

More information

Department of Computer and IT Engineering University of Kurdistan. Data Communication Netwotks (Graduate level) Data Link Layer

Department of Computer and IT Engineering University of Kurdistan. Data Communication Netwotks (Graduate level) Data Link Layer Department of Computer and IT Engineering University of Kurdistan Data Communication Netwotks (Graduate level) Data Link Layer By: Dr. Alireza Abdollahpouri Data Link Layer 2 Data Link Layer Application

More information

Chapter 6 Digital Data Communications Techniques

Chapter 6 Digital Data Communications Techniques Chapter 6 Digital Data Communications Techniques Asynchronous and Synchronous Transmission timing problems require a mechanism to synchronize the transmitter and receiver receiver samples stream at bit

More information

ECE 333: Introduction to Communication Networks Fall Lecture 6: Data Link Layer II

ECE 333: Introduction to Communication Networks Fall Lecture 6: Data Link Layer II ECE 333: Introduction to Communication Networks Fall 00 Lecture 6: Data Link Layer II Error Correction/Detection 1 Notes In Lectures 3 and 4, we studied various impairments that can occur at the physical

More information

Data Link Networks. Hardware Building Blocks. Nodes & Links. CS565 Data Link Networks 1

Data Link Networks. Hardware Building Blocks. Nodes & Links. CS565 Data Link Networks 1 Data Link Networks Hardware Building Blocks Nodes & Links CS565 Data Link Networks 1 PROBLEM: Physically connecting Hosts 5 Issues 4 Technologies Encoding - encoding for physical medium Framing - delineation

More information

SRI RAMAKRISHNA INSTITUTE OF TECHNOLOGY DEPARTMENT OF INFORMATION TECHNOLOGY COMPUTER NETWORKS UNIT - II DATA LINK LAYER

SRI RAMAKRISHNA INSTITUTE OF TECHNOLOGY DEPARTMENT OF INFORMATION TECHNOLOGY COMPUTER NETWORKS UNIT - II DATA LINK LAYER SRI RAMAKRISHNA INSTITUTE OF TECHNOLOGY DEPARTMENT OF INFORMATION TECHNOLOGY COMPUTER NETWORKS UNIT - II DATA LINK LAYER 1. What are the responsibilities of data link layer? Specific responsibilities of

More information

MYcsvtu Notes DATA REPRESENTATION. Data Types. Complements. Fixed Point Representations. Floating Point Representations. Other Binary Codes

MYcsvtu Notes DATA REPRESENTATION. Data Types. Complements. Fixed Point Representations. Floating Point Representations. Other Binary Codes DATA REPRESENTATION Data Types Complements Fixed Point Representations Floating Point Representations Other Binary Codes Error Detection Codes Hamming Codes 1. DATA REPRESENTATION Information that a Computer

More information

Fault-tolerant techniques

Fault-tolerant techniques What are the effects if the hardware or software is not fault-free in a real-time system? What causes component faults? Specification or design faults: Incomplete or erroneous models Lack of techniques

More information

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following: CS 47 Spring 27 Mike Lam, Professor Fault Tolerance Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapter 8) Various online

More information

Point-to-Point Links. Outline Encoding Framing Error Detection Sliding Window Algorithm. Fall 2004 CS 691 1

Point-to-Point Links. Outline Encoding Framing Error Detection Sliding Window Algorithm. Fall 2004 CS 691 1 Point-to-Point Links Outline Encoding Framing Error Detection Sliding Window Algorithm Fall 2004 CS 691 1 Encoding Signals propagate over a physical medium modulate electromagnetic waves e.g., vary voltage

More information

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015 Advanced Parallel Architecture Lesson 3 Annalisa Massini - 2014/2015 Von Neumann Architecture 2 Summary of the traditional computer architecture: Von Neumann architecture http://williamstallings.com/coa/coa7e.html

More information

AN EFFICIENT DESIGN OF VLSI ARCHITECTURE FOR FAULT DETECTION USING ORTHOGONAL LATIN SQUARES (OLS) CODES

AN EFFICIENT DESIGN OF VLSI ARCHITECTURE FOR FAULT DETECTION USING ORTHOGONAL LATIN SQUARES (OLS) CODES AN EFFICIENT DESIGN OF VLSI ARCHITECTURE FOR FAULT DETECTION USING ORTHOGONAL LATIN SQUARES (OLS) CODES S. SRINIVAS KUMAR *, R.BASAVARAJU ** * PG Scholar, Electronics and Communication Engineering, CRIT

More information

Data Link Technology. Suguru Yamaguchi Nara Institute of Science and Technology Department of Information Science

Data Link Technology. Suguru Yamaguchi Nara Institute of Science and Technology Department of Information Science Data Link Technology Suguru Yamaguchi Nara Institute of Science and Technology Department of Information Science Agenda Functions of the data link layer Technologies concept and design error control flow

More information

Chapter 3 - Top Level View of Computer Function

Chapter 3 - Top Level View of Computer Function Chapter 3 - Top Level View of Computer Function Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 3 - Top Level View 1 / 127 Table of Contents I 1 Introduction 2 Computer Components

More information

Chapter 3. Top Level View of Computer Function and Interconnection. Yonsei University

Chapter 3. Top Level View of Computer Function and Interconnection. Yonsei University Chapter 3 Top Level View of Computer Function and Interconnection Contents Computer Components Computer Function Interconnection Structures Bus Interconnection PCI 3-2 Program Concept Computer components

More information

Advantages and disadvantages

Advantages and disadvantages Advantages and disadvantages Advantages Disadvantages Asynchronous transmission Simple, doesn't require synchronization of both communication sides Cheap, timing is not as critical as for synchronous transmission,

More information

Data Link Layer (part 2)

Data Link Layer (part 2) Data Link Layer (part 2)! Question - What is a major disadvantage of asynchronous transmission? Reference: Chapters 6 and 7 Stallings Study Guide 6! Question - What is a major disadvantage of asynchronous

More information

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d) Distributed Systems Fö 9/10-1 Distributed Systems Fö 9/10-2 FAULT TOLERANCE 1. Fault Tolerant Systems 2. Faults and Fault Models. Redundancy 4. Time Redundancy and Backward Recovery. Hardware Redundancy

More information

Introduction to Computer Networks. 03 Data Link Layer Introduction

Introduction to Computer Networks. 03 Data Link Layer Introduction Introduction to Computer Networks 03 Data Link Layer Introduction Link Layer 1 Introduction and services 2 Link Layer Services 2.1 Framing 2.2 Error detection and correction 2.3 Flow Control 2.4 Multiple

More information

CSCI-1680 Link Layer I Rodrigo Fonseca

CSCI-1680 Link Layer I Rodrigo Fonseca CSCI-1680 Link Layer I Rodrigo Fonseca Based partly on lecture notes by David Mazières, Phil Levis, John Jannotti Last time Physical layer: encoding, modulation Today Link layer framing Getting frames

More information

1. Define Peripherals. Explain I/O Bus and Interface Modules. Peripherals: Input-output device attached to the computer are also called peripherals.

1. Define Peripherals. Explain I/O Bus and Interface Modules. Peripherals: Input-output device attached to the computer are also called peripherals. 1. Define Peripherals. Explain I/O Bus and Interface Modules. Peripherals: Input-output device attached to the computer are also called peripherals. A typical communication link between the processor and

More information

Homework 2 COP The total number of paths required to reach the global state is 20 edges.

Homework 2 COP The total number of paths required to reach the global state is 20 edges. Homework 2 COP 5611 Problem 1: 1.a Global state lattice 1. The total number of paths required to reach the global state is 20 edges. 2. In the global lattice each and every edge (downwards) leads to a

More information

Lecture 2 Error Detection & Correction. Types of Errors Detection Correction

Lecture 2 Error Detection & Correction. Types of Errors Detection Correction Lecture 2 Error Detection & Correction Types of Errors Detection Correction Basic concepts Networks must be able to transfer data from one device to another with complete accuracy. Data can be corrupted

More information

CMSC 2833 Lecture 18. Parity Add a bit to make the number of ones (1s) transmitted odd.

CMSC 2833 Lecture 18. Parity Add a bit to make the number of ones (1s) transmitted odd. Parity Even parity: Odd parity: Add a bit to make the number of ones (1s) transmitted even. Add a bit to make the number of ones (1s) transmitted odd. Example and ASCII A is coded 100 0001 Parity ASCII

More information

Data link layer functions. 2 Computer Networks Data Communications. Framing (1) Framing (2) Parity Checking (1) Error Detection

Data link layer functions. 2 Computer Networks Data Communications. Framing (1) Framing (2) Parity Checking (1) Error Detection 2 Computer Networks Data Communications Part 6 Data Link Control Data link layer functions Framing Needed to synchronise TX and RX Account for all bits sent Error control Detect and correct errors Flow

More information

The Data Link Layer Chapter 3

The Data Link Layer Chapter 3 The Data Link Layer Chapter 3 Data Link Layer Design Issues Error Detection and Correction Elementary Data Link Protocols Sliding Window Protocols Example Data Link Protocols Revised: August 2011 The Data

More information

11. SEU Mitigation in Stratix IV Devices

11. SEU Mitigation in Stratix IV Devices 11. SEU Mitigation in Stratix IV Devices February 2011 SIV51011-3.2 SIV51011-3.2 This chapter describes how to use the error detection cyclical redundancy check (CRC) feature when a Stratix IV device is

More information

Definition of RAID Levels

Definition of RAID Levels RAID The basic idea of RAID (Redundant Array of Independent Disks) is to combine multiple inexpensive disk drives into an array of disk drives to obtain performance, capacity and reliability that exceeds

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 5 Processor-Level Techniques & Byzantine Failures Chapter 2 Hardware Fault Tolerance Part.5.1 Processor-Level Techniques

More information

)454 6 TELECOMMUNICATION STANDARDIZATION SECTOR OF ITU

)454 6 TELECOMMUNICATION STANDARDIZATION SECTOR OF ITU INTERNATIONAL TELECOMMUNICATION UNION )454 6 TELECOMMUNICATION STANDARDIZATION SECTOR OF ITU $!4! #/--5.)#!4)/. /6%2 4(% 4%,%0(/.%.%47/2+ #/$%).$%0%.$%.4 %22/2#/.42/, 3934%- )454 Recommendation 6 (Extract

More information

Outline. EEC-484/584 Computer Networks. Data Link Layer Design Issues. Framing. Lecture 6. Wenbing Zhao Review.

Outline. EEC-484/584 Computer Networks. Data Link Layer Design Issues. Framing. Lecture 6. Wenbing Zhao Review. EEC-484/584 Computer Networks Lecture 6 wenbing@ieee.org (Lecture nodes are based on materials supplied by Dr. Louise Moser at UCSB and Prentice-Hall) Outline Review Data Link Layer Design Issues Error

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

CS 640 Introduction to Computer Networks. Role of data link layer. Today s lecture. Lecture16

CS 640 Introduction to Computer Networks. Role of data link layer. Today s lecture. Lecture16 Introduction to Computer Networks Lecture16 Role of data link layer Service offered by layer 1: a stream of bits Service to layer 3: sending & receiving frames To achieve this layer 2 does Framing Error

More information

Fault-Tolerant Computing

Fault-Tolerant Computing Fault-Tolerant Computing Hardware Design Methods Nov 2007 Self-Checking Modules Slide 1 About This Presentation This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing)

More information

Computer Peripherals

Computer Peripherals Computer Peripherals School of Computer Engineering Nanyang Technological University Singapore These notes are part of a 3rd year undergraduate course called "Computer Peripherals", taught at Nanyang Technological

More information