HARDWARE AND SOFTWARE ERROR DETECTION


Ravishankar K. Iyer, Zbigniew Kalbarczyk
Center for Reliable and High-Performance Computing
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
1308 W. Main Street, Urbana, IL

1 INTRODUCTION

A reliable computer system needs to provide its normal level of service in the presence of hardware and software faults [Avi86, Lap92]. There are two philosophies of achieving this reliability: fault avoidance, wherein techniques prevent the occurrence of faults in the first place, and fault tolerance, wherein techniques allow the system to continue execution despite the occurrence of faults. Fault tolerance can be implemented using one of two basic approaches: error masking, in which the system avoids the effect of an error through some form of majority voting, and error detection and recovery, in which the system notices the presence of an error, locates and isolates the error, reconfigures in a spare, and restarts the system [Ren84].

The objectives of this article are two-fold: (1) to introduce and characterize a variety of hardware- and software-based error detection techniques and (2) more importantly, to illustrate and discuss these techniques in real-world systems and applications. Consequently, the article does not aim at providing an exhaustive set of possible detection techniques. Rather, it emphasizes examples and the experience of applying these techniques to actual systems. The techniques presented can be implemented in hardware and/or in software, and they can be applied to uniprocessor, multiprocessor, distributed, or networked systems. Selecting the right technique or set of techniques is, of course, a design decision. This article aims to facilitate such a decision: it discusses the implementation aspects of different techniques and also provides detailed evaluation results (in terms of error coverage or performance overhead) for many of the techniques presented.

The article is organized as follows. The first few sections focus on hardware techniques, including hardware redundancy, information redundancy (e.g., coding techniques), and time redundancy.
Significant space is dedicated to watchdog timers, heartbeats, and capability and consistency checking (including data audit techniques and assertions). The article then provides an in-depth discussion of software-based control-flow checking techniques, including an evaluation (using fault injection) of two selected techniques. The article ends with a discussion of the application of data audit techniques in a telecommunication environment.

2 HARDWARE REDUNDANCY

In this section, we discuss three basic forms of hardware redundancy: passive, active, and hybrid.

Passive hardware redundancy relies on majority voting to mask the occurrence of faults. It does not need fault detection or system reconfiguration. The most common form of passive redundancy is triple modular redundancy (TMR), which triplicates the hardware necessary to perform the required operations and uses a voter to determine the output of the system. In this approach, the primary difficulty is the voter: if it fails, the entire system fails. A common way to avoid this problem is to use three voters and provide three independent outputs. Figure 1 shows the two forms of TMR. Several stages of TMR can be interconnected using this approach by connecting the outputs of the voters of one TMR stage to the inputs of the modules of the next TMR stage. The voting can be performed by either a hardware voter (fast, but requires extra hardware logic) or a software voter (runs on the processors that also execute the normal computations, but is generally slow).

A generalization of the TMR approach is N-modular redundancy (NMR), which uses N copies of a module instead of three. For example, the NASA Space Shuttle onboard computer system uses four computers on which a majority vote is performed [Skl76].
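The voting step described above can be sketched in software. The following Python fragment is a minimal illustration of a majority voter for N module outputs (N = 3 corresponds to TMR). The function name `majority_vote` and the choice to raise an exception when no strict majority exists are assumptions made for this example; they are not taken from any of the systems cited in the text.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (voted_value, indices_of_disagreeing_modules).

    `outputs` holds one hashable result per redundant module.
    A value wins only if more than half of the modules produced it;
    otherwise the fault cannot be masked and ValueError is raised.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among module outputs")
    disagreeing = [i for i, out in enumerate(outputs) if out != value]
    return value, disagreeing


# TMR: module 2 produces a wrong result, which the voter masks.
voted, faulty = majority_vote([42, 42, 17])
print(voted, faulty)
```

Returning the indices of the disagreeing modules is optional for pure masking, but it is the information an active or hybrid scheme would need to locate the faulty unit.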

Active hardware redundancy attempts to achieve fault tolerance by fault detection, fault location, and fault recovery. The most common form of fault detection is duplication and comparison, in which two identical copies of the hardware perform the same computations in parallel and the results are compared, as shown in Figure 2. A commercial product from Stratus Computers uses a pair-and-spare approach in which two duplexed components are used for self-checking and fault tolerance. Two processor boards are used, and each board contains a pair of microprocessors operating in duplicate-and-compare mode to check themselves.

[Figure 1: Triplication and Voting. (a) TMR with One Voter: Input, Modules 1 to 3, Voter, Output. (b) TMR with Three Voters: Inputs 1 to 3, Modules 1 to 3, Voters 1 to 3, Outputs 1 to 3.]

[Figure 2: Duplication and Comparison. Input, Modules 1 and 2, Output, Comparator with Match/Mismatch signal.]

A second form of active redundancy is standby sparing, in which one module is operational and one or more modules serve as standbys or spares, as illustrated in Figure 3. Various fault detection schemes are used to determine when a module has become faulty, and fault location is used to determine exactly which module is faulty. The reconfiguration operation in standby sparing can be viewed conceptually as a switch whose output is selected from one of the modules providing inputs to the switch. Standby sparing can bring a system back into full operation after the occurrence of a fault, but it requires that a momentary disruption in performance occur while reconfiguration is performed. If the disruption of processing must be minimized, hot standby sparing can be used, where the spares operate synchronously with the on-line modules and are prepared to take over at any time. Cold standby sparing uses unpowered spares that must be powered up and initialized prior to bringing the module into active service. The advantage of standby sparing is that in a system containing n identical modules, such as a multiprocessor, fault tolerance can be provided with k < n spare modules.

[Figure 3: Standby Redundancy. Modules feeding a switch.]

Hybrid hardware redundancy combines the attractive features of both the active and passive approaches [Joh89]. Fault masking is used to prevent the system from producing erroneous results, and fault detection, location, and recovery are used to reconfigure the system in the event of a fault. The most common form of hybrid redundancy is N-modular redundancy with spares. In this approach, a basic core of N modules is arranged in a voting configuration. In addition, spares are provided to replace faulty units in the NMR core.

2.1 Comparing the Reliability of Simplex and TMR Systems

The purpose of this example is to show how rather blind use of redundancy can lead to seemingly paradoxical results. As the following calculation shows, the mean time to failure (MTTF) of a TMR system is in fact lower than that of a comparable simplex (nonredundant) system. A system with triple-modular redundancy (TMR) includes three components, two of which are required for the system to function properly. Assume that the reliability of a single component is given by

\[ R_{\mathrm{simplex}} = e^{-\lambda t} \]

where λ is the failure rate of the simplex component. The MTTF of a simplex system is then

\[ \mathrm{MTTF}_{\mathrm{simplex}} = \int_0^{\infty} e^{-\lambda t}\, dt = \frac{1}{\lambda} \]

The reliability of a TMR system can be expressed as

\[ R_{\mathrm{TMR}} = e^{-3\lambda t} + \binom{3}{2} e^{-2\lambda t}\left(1 - e^{-\lambda t}\right) = 3e^{-2\lambda t} - 2e^{-3\lambda t} \]

and consequently

\[ \mathrm{MTTF}_{\mathrm{TMR}} = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}, \qquad \mathrm{MTTF}_{\mathrm{simplex}} > \mathrm{MTTF}_{\mathrm{TMR}} \]

Figure 4 shows the reliability of TMR and simplex systems as functions of λt.
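Note that

\[ R_{\mathrm{TMR}}(t) \ge R_{\mathrm{simplex}}(t) \quad \text{for } 0 \le t \le t_0, \qquad R_{\mathrm{TMR}}(t) < R_{\mathrm{simplex}}(t) \quad \text{for } t_0 < t, \]

where

\[ t_0 = \frac{\ln 2}{\lambda} \approx \frac{0.7}{\lambda} \]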
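The comparison above is easy to check numerically. The sketch below evaluates the simplex and TMR reliability expressions and the crossover time t0 = ln 2 / λ. The function names and the failure rate used in the example are illustrative assumptions, not values from the text.

```python
import math


def r_simplex(lam, t):
    """Reliability of a single component with failure rate lam."""
    return math.exp(-lam * t)


def r_tmr(lam, t):
    """Reliability of a 2-out-of-3 system with a perfect voter."""
    r = r_simplex(lam, t)
    return 3 * r**2 - 2 * r**3


def mttf_simplex(lam):
    return 1.0 / lam


def mttf_tmr(lam):
    # Integral of 3e^{-2 lam t} - 2e^{-3 lam t} over [0, inf)
    return 3.0 / (2.0 * lam) - 2.0 / (3.0 * lam)


def crossover_time(lam):
    """Mission time beyond which TMR is less reliable than simplex."""
    return math.log(2) / lam


lam = 1e-3  # assumed failure rate (per hour), for illustration only
t0 = crossover_time(lam)
for t in (0.5 * t0, t0, 2.0 * t0):
    print(f"t = {t:8.1f} h  simplex = {r_simplex(lam, t):.4f}  TMR = {r_tmr(lam, t):.4f}")
print("MTTF simplex:", mttf_simplex(lam), " MTTF TMR:", mttf_tmr(lam))
```

At t = t0 both expressions evaluate to 0.5, which is where the two reliability curves cross.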

[Figure 4: Comparative Reliability of TMR and Simplex Systems. Reliability versus λt for the TMR and simplex systems.]

TMR improves reliability for short missions ($t \le t_0$). For long missions ($t > t_0$), TMR actually degrades reliability. The explanation for this observation is rather simple: until the first failure, the system operates in a TMR configuration and $R_{\mathrm{TMR}} > R_{\mathrm{simplex}}$. After the first failure, the two remaining components compete to fail, resulting in a lower reliability than a plain simplex system with the same failure rate. The same type of behavior is valid for any NMR system, i.e., any M-out-of-N system.

2.2 M-out-of-N Systems

In M-out-of-N systems, the critical factor is mission time rather than MTTF. The reliability of M-out-of-N systems is very high in the beginning, when the spare components tolerate failures. These systems can provide reliability of over a 10-hour period. But reliability falls sharply as time goes on, system redundancy is exhausted, and more hardware is subject to failure. M-out-of-N systems are useful in contexts such as aircraft control, where very high reliability is needed for a short period of time. M-out-of-N systems are also used in the fault-tolerant multiprocessor (FTMP) [Lal83] and in software-implemented fault tolerance (SIFT) [Wen72].

In the general case, M modules out of N modules need to function, as shown in Figure 5. The reliability of such a system, where $R_m$ is the reliability of a single module, is given by the following expression:

\[ R_{M\text{-of-}N} = \sum_{i=0}^{N-M} \binom{N}{i} R_m^{\,N-i} \left(1 - R_m\right)^{i} \]

[Figure 5: M-out-of-N System. States: N working, N-1 working, N-2 working, ..., N-M working, Failed.]

Example

Computer systems used on board spacecraft must guarantee long-term uninterrupted operation (often for years) despite permanent failures induced by radiation present in space and by random events. Standby redundancy for both processors and memory is usually the technique of choice. This problem addresses the design of memory units with redundancy. To improve reliability, each k-bit memory word is expanded by s spare cells, so each row contains k + s memory cells. There are n rows in the memory system. Assume that the memory chips contain n rows and each row contains only one cell. The entire memory system, then, contains k + s chips. Each chip has a failure rate of λ and obeys the exponential failure law.

Write an expression for the reliability of the complete memory system using column sparing. The expression must account for fault coverage. Suppose that k = 16, λ = 9.15 × 10^-6 failures per hour, and a fault coverage C = . Determine the number of spares that will maximize the reliability of the memory at the end of a 10-year space mission. Calculate the reliability of the memory system at the end of 10 years (calculate for the case with zero spares and for the case with an optimal number of spares). Describe the complexities in answering this problem if the spare chips have a lower failure rate $\lambda_s$ than the k primary chips.

Solution

We have a memory system with k main and s spare memory chips. The chips fail at the same rate, λ = 9.15 × 10^-6 per hour. The error coverage factor is C = . We wish to know the reliability of the system at the end of 10 years. There are $\binom{k+s}{i}$ ways for i chips to fail. The probability that i specific chips fail and the remaining (k + s) - i do not is $R_1(t)^{k+s-i}\left(1 - R_1(t)\right)^{i}$, and the probability that i failures are detected and covered is $C^i$. Consequently, the reliability is the sum of the probabilities that 0, 1, ..., s chips have failed and can be expressed as follows:

\[ R(t) = \sum_{i=0}^{s} \binom{k+s}{i} R_1(t)^{k+s-i} \left(1 - R_1(t)\right)^{i} C^{i} \]

The term $R_1(t)$ is the probability that a particular memory chip has not failed at time t and is given by $R_1(t) = e^{-\lambda t}$. Solving the equation above, we find that for s = 29 (when k = 16 and t = 10 years), the reliability has its optimal value. For s = 0, R(t) = , and for s = 29, R(t) = . Clearly the 29 spares provide a significant improvement in reliability.
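The reliability expression in the solution can be evaluated with a few lines of code. The sketch below is a minimal illustration: k, λ, and the 10-year mission time come from the problem statement, but the coverage value is not legible in this copy, so `COVERAGE` is a placeholder assumption and the optimum printed by the script should not be read as the solution's result. The function names are likewise hypothetical.

```python
import math

K = 16                     # primary chips (from the problem statement)
LAMBDA = 9.15e-6           # failures per hour (from the problem statement)
T_HOURS = 10 * 365 * 24    # 10-year mission
COVERAGE = 0.95            # ASSUMED placeholder; the source value is missing


def memory_reliability(k, s, lam, t, coverage):
    """Sum over i = 0..s covered chip failures among k + s chips."""
    r1 = math.exp(-lam * t)  # reliability of one chip
    return sum(
        math.comb(k + s, i) * r1 ** (k + s - i) * (1 - r1) ** i * coverage ** i
        for i in range(s + 1)
    )


def best_spare_count(k, lam, t, coverage, max_spares=60):
    """Spare count in 0..max_spares that maximizes reliability."""
    return max(
        range(max_spares + 1),
        key=lambda s: memory_reliability(k, s, lam, t, coverage),
    )


s_best = best_spare_count(K, LAMBDA, T_HOURS, COVERAGE)
print("no spares :", memory_reliability(K, 0, LAMBDA, T_HOURS, COVERAGE))
print("best s    :", s_best)
print("with spares:", memory_reliability(K, s_best, LAMBDA, T_HOURS, COVERAGE))
```

With perfect coverage (C = 1) every added spare helps; with imperfect coverage each additional chip is also one more source of uncovered failures, which is why an optimal number of spares exists.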
The problem does become more complicated if the spares fail at a different rate than the main chips. To compute the probability that n chips have failed, we need to compute the probability that a main chips have failed and b spare chips have failed, for all combinations of a and b such that a + b = n. This can be done as shown below, where i counts all failed chips and j counts the failed spares among them:

\[ R(t) = \sum_{i=0}^{s} \sum_{j=0}^{i} \binom{k}{i-j} \binom{s}{j} R_{\lambda_1}(t)^{\,k-(i-j)} \left(1 - R_{\lambda_1}(t)\right)^{i-j} R_{\lambda_2}(t)^{\,s-j} \left(1 - R_{\lambda_2}(t)\right)^{j} C^{i} \]

where $R_{\lambda_1}(t) = e^{-\lambda_1 t}$ is the reliability of the main chips and $R_{\lambda_2}(t) = e^{-\lambda_2 t}$ is the reliability of the spare chips.

2.3 The Effect of a Voter

The previous expression for the reliability of a TMR system assumes a voter that is 100% reliable. If we assume a voter reliability of $R_V$, we have:

\[ R_{\mathrm{TMRV}} = R_V \left( R_m^3 + \binom{3}{2} R_m^2 \left(1 - R_m\right) \right) \]

Generally, a voter can be implemented in hardware or software. The design and implementation of a robust voting mechanism is often considered a relatively simple task. In reality, the set of functions that a voter (or voting mechanism) should support depends on the system and the application and might include:

- Guaranteeing a majority vote on the input data to the voter
- Providing the ability to detect its own errors (i.e., the voter should be self-checking)
- Determining the faulty application replica (node)
- Handling voting in loosely synchronized (coupled) systems
- Handling voting in tightly synchronized (coupled) systems

Loosely synchronized systems require synchronization of inputs to the voter. It may also be difficult in these systems to determine the voter timeout because of differences in the relative speeds of the machines and variations in network communication delays. Tightly synchronized systems, on the other hand, generally do not require external synchronization of inputs to the voter.

A very instructive example of an actual voter implementation is the hardware voting mechanism used in a TMR architecture to guarantee that the three processing units synchronously serve external interrupts arriving at the system [Cut93] [Jew91]. The processors in the TMR system are configured to operate as a single logical processor. To function as a single processor, the instruction streams of the three processors must be identical. The synchronization mechanisms are designed so that if the code stream of any processor diverges from the code stream of the other processors, then a failure is indicated.

Interrupt synchronization presents one of the most difficult challenges in maintaining a single logical processor view. All interrupts are required to occur synchronously with virtual time (referred to as the cycle count). The cycle count can be thought of as a representation of virtual time, just as seconds are a representation of real time. The reference point for virtual time is a specific software event or instruction. The code stream after the reference point must be identical for the three processors. External exceptions (interrupts) are not inherently synchronous to virtual time. All interrupts that are generated by the I/O devices must be synchronized to virtual time (i.e., made synchronous with the individual processor's instruction stream) before they are presented to the processor. If an external interrupt were presented directly to the processor (i.e., without synchronization to virtual time), the three processors would start to process the interrupt at different instructions, and this could lead to an unacceptable, inconsistent state of the system.

Interrupts are synchronized to virtual time by performing a distributed vote on the interrupts and then presenting them to the processor on a predetermined cycle count (virtual time). Figure 6 gives the block diagram of the interrupt synchronization logic. An external interrupt is delivered to the distributor present on each processing board (CPU unit). The distributor broadcasts the delivered interrupt to the other two CPUs via the inter-CPU bus. As a result there are three pending interrupts, one from each CPU. The interrupt voter (also present on each processing board) captures the pending interrupts and performs a vote to verify that all of the CPUs did receive the external interrupt request. On a predetermined cycle count, the interrupt voter presents the interrupt to the microprocessor. All of the microprocessors will receive the interrupt on the same cycle count, and thus the interrupt will be synchronized to virtual time.

The interrupt voter uses a dedicated register (a holding register, not shown in Figure 6) to save state information as to whether all CPUs captured and distributed an external interrupt. In the error-free scenario described above, this state information is not necessary, i.e., external interrupts can be synchronized to virtual time without the use of the holding register. The holding register provides a mechanism for the voter to know that the last interrupt vote cycle captured at least one, but not all, of the interrupt pending bits. There are two possible scenarios in which not all of the interrupt pending bits are set: (1) the external interrupt is asserted before the interrupt distribution cycle on some of the CPUs but after the interrupt distribution cycle on other CPUs; (2) at least one of the CPUs fails in a way that prevents the correct operation of the distributor. In the former case, the interrupt voter is guaranteed that all of the interrupt pending bits will be set on the next interrupt vote cycle, and thus all processors will receive the external interrupt at identical instructions. Consequently, if the interrupt voter discovers that the holding register has been set and not all of the interrupt pending bits are set, then an error must exist on one or more of the CPUs (the latter scenario). The interrupt voter presents the pending interrupt to the processor and also raises an interrupt-synchronization-error interrupt on the high-priority interrupt level. The system software serves this high-priority interrupt.

Note that the scheme presented here represents a hardware implementation of the interactive consistency algorithm, which is used in distributed systems for providing data integrity of replicated processes.
The hardware implementation, however, is much less performance intensive than its software counterpart.

[Figure 6: Interrupt Synchronization in a TMR System. CPU Units A, B, and C each contain a distributor, an interrupt voter, and a microprocessor; external interrupts and CPU-generated interrupts enter the interrupt logic, and interrupt-pending signals are exchanged over the inter-CPU bus.]

3 INFORMATION REDUNDANCY

Information redundancy is the addition of redundant information to data to allow fault detection, fault masking, and fault tolerance. An example of information redundancy is the single error correction and double error detection (SEC-DED) code. A code's error detection and correction properties are based on its ability to partition a set of $2^n$ words, each n bits wide, into a code space of $2^m$ words and a noncode space of $2^n - 2^m$ words. Each code is constructed such that a given number of errors transforms a code-space word into a word in the noncode space. Decoding circuits detect errors by identifying any word outside the code space. Error correction is performed by more extensive decoding that uniquely associates a noncode-space word with the original code word transformed by the errors.

3.1 Fault Detection through Encoding

The basic idea behind an error-detecting scheme is to add redundant information to the data being transmitted or stored to determine if errors have been introduced. The amount of check information is where error-detecting and error-correcting codes diverge. The former include only enough redundant information to allow the receiver to determine that an error has occurred but not to locate it. Error-correcting codes, on the other hand, add enough redundancy to allow the receiver or reader to deduce what the transmitted or stored block of data must have been. Hamming and Shannon first developed error-correcting codes in the late 1940s for use when storing data on magnetic disks and core memories. Since then, the use of more sophisticated mathematical techniques has led to the invention of a plethora of encoding and decoding methods.

At the logic level, codes provide a means for masking or detecting errors. Formally, a code is a subset S of the universe U of possible vectors. A noncode word is a vector in the set U - S. These relations are shown in Figure 7. In the figure, $X_1$ is a codeword < >, which due to a multiple-bit error becomes the noncodeword $X_3$ = < >, which is detectable. The codeword $X_2$ becomes another codeword $X_4$, which is not detectable.

[Figure 7: Logic of Error-Masking/Detecting Code. U is the universe of $2^8$ vectors; S is the subset with even parity; $X_1$, $X_2$, and $X_4$ lie in S, and $X_3$ lies in U - S.]

The ability of a code to detect and correct errors is determined by the minimum separation, or Hamming distance, between the words of the code space. The Hamming distance between two words is the number of bit positions in which they differ. The distance of a code is the minimum of the Hamming distances between all pairs of code words.
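The two definitions above translate directly into code. The following sketch computes the Hamming distance between two words and the distance of a code, using the 3-bit even-parity code as a small example. Representing words as strings of '0' and '1' and the function names are assumptions made for illustration.

```python
from itertools import combinations


def hamming_distance(x, y):
    """Number of bit positions in which two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(x, y))


def code_distance(code):
    """Minimum Hamming distance over all pairs of distinct code words."""
    return min(hamming_distance(x, y) for x, y in combinations(code, 2))


# All 3-bit words with even parity form a code of distance 2,
# so any single-bit error produces a noncode word.
even_parity_code = ["000", "011", "101", "110"]
print(code_distance(even_parity_code))
```

Because the distance of this code is 2, flipping one bit of any code word never yields another code word, which is exactly why single-bit errors are detectable.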
For example, for two code words x = (1011) and y = (0110), the Hamming distance is d(x,y) = 3. Using the notion of Hamming distance, it can be shown that, to detect all error patterns of Hamming distance d, the code distance must be at least d+1; for example, a code with a distance of 2 can detect patterns with a distance of 1 (i.e., single-bit errors). To correct all error patterns with a Hamming distance of c, the code distance must be at least 2c+1. To detect all patterns with a Hamming distance of d and to correct all patterns with a Hamming distance of c, the code distance must be at least 2c+d+1 (note that d corresponds to the number of additional bit errors that can be detected). For example, a code with a distance of 3 can detect and correct all single-bit errors (i.e., c=1 and d=0).

3.2 Parity

One of the simplest error-detecting codes is the parity code: given an n-bit word, one attaches an extra bit to convert it to an even- or odd-parity word. A simple decoding circuit using a set of XOR gates will detect any single-bit error in the parity-coded word. Parity codes are used routinely in computers to check errors in busses, memory, and registers. Table 1 compares parity codes for memories. Five strategies for calculating parity are considered: (1) bit-per-word parity, (2) bit-per-byte, (3) bit-per-multiple-chips, (4) bit-per-chip, and (5) interlaced parity. It is assumed that memory is constructed from individual chips, where each chip contains several bits of the data word.

Table 1: Comparison of Parity Codes

| Parity Code | Advantages | Disadvantages |
|---|---|---|
| Bit-per-word: one parity bit per data word | Detects all single-bit errors | Certain errors go undetected, e.g., a word, including the parity bit, becomes all 1s due to a failure of a bus or a set of data buffers. |
| Bit-per-byte: each data portion (e.g., a byte) is protected by a separate parity bit; the parity of one group should be even and the parity of the other group should be odd | Detects all-1s and all-0s conditions | Ineffective for multiple errors, e.g., a whole-chip failure |
| Bit-per-multiple-chips: one bit from each chip is associated with a single parity bit | Detects failure of an entire chip | Cannot locate the failure of a complete chip |
| Bit-per-chip: each parity bit is associated with one chip of the memory | Detects single-bit errors and identifies the chip with the erroneous bit | Susceptible to whole-chip failure, i.e., a single chip error can corrupt multiple bits, and this may go undetected. |
| Interlaced: similar to bit-per-multiple-chips; must ensure that no two adjacent bits are from the same parity group | Detects errors in adjacent bits | Parity groups are not based on the physical organization of the memory |

In high-speed memories, single-bit error-correcting and double-bit error-detecting (SEC-DED) codes are most commonly used. Before the data are written to the memory, they are passed to a parity generator. The generated parity bit (or bits) is then stored in the memory together with the data.
On a read operation, the data bits are passed to the parity checker, which regenerates the parity bit (or bits) and compares it with the parity bit(s) stored in the memory when the original data were written. The single-bit parity code has a minimum Hamming distance of two.

The following description gives more details on Hamming codes. In the Hamming single-error-correction code, c parity bits are added to a k-bit data word, forming a code word of k+c bits. The following expression can be used to determine the number c of check (parity) bits necessary to protect k bits of information: $2^c \ge c + k + 1$. Consider a data word of four information bits ($d_0$, $d_1$, $d_2$, $d_3$). According to the above expression, three parity bits ($p_1$, $p_2$, $p_3$) are needed to protect the four bits of data. To illustrate how the parity (check) bits are generated and checked, assume that the bits in the code word are numbered from 1 to k+c. Positions numbered as a power of two are reserved for the parity bits. The grouping of bits for parity generation and checking is determined based on a list of the binary numbers from 0 to $2^c - 1$, as illustrated in Figure 8.

[Figure 8: Determining Parity/Check Bits for Hamming Code (bit groups for three parity bits)]

Code word (positions 1 to 7): $p_1$ $p_2$ $d_0$ $p_3$ $d_1$ $d_2$ $d_3$

Parity bit calculation:
$p_1$ = XOR of bits (3, 5, 7)
$p_2$ = XOR of bits (3, 6, 7)
$p_3$ = XOR of bits (5, 6, 7)

Parity checking:
$c_1$ = XOR of bits (1, 3, 5, 7)
$c_2$ = XOR of bits (2, 3, 6, 7)
$c_3$ = XOR of bits (4, 5, 6, 7)

The first group is formed by the bits in the positions corresponding to the 1-bits in the least significant bit of the binary count sequence (i.e., bits 1, 3, 5, 7). The second group is formed by the bits in the positions corresponding to the 1-bits in the second least significant bit of the binary count sequence (i.e., bits 2, 3, 6, 7), and so on. Note that each group of bits starts with a number that is a power of two. These numbers are also the positions (in the code word) of the parity bits. The individual parity bits are calculated by performing an XOR operation on the data bits specified by a given group. For parity checking, the XOR operations also include the parity bit itself. As a result, the original data is encoded by generating a set of parity bits ($p_1$ $p_2$ $p_3$). To check correctness, the encoding process is repeated and a set of check bits ($c_1$ $c_2$ $c_3$) is generated. The binary word represented by the check bits $c_1$ $c_2$ $c_3$ forms a syndrome, which points directly to the position of the erroneous bit. Figure 9 shows the relations between the syndrome values and the bit position in error for the example of a four-bit data word. The error can be corrected by complementing the corresponding bit.

[Figure 9: Error Detection and Correction Using Syndromes. Syndrome value for each erroneous bit ($d_0$ to $d_3$, $p_1$ to $p_3$).]

The Hamming code discussed above can only detect and correct single-bit errors. By adding an extra parity bit, the Hamming code can be used to correct single-bit errors and to detect double errors.
In the example of a data word consisting of four information bits, the additional parity bit, $p_4$, can be calculated as parity (XOR) over the first seven bits of the code word. For parity checking, the additional check bit, $c_4$, is calculated over all eight bits of the code word. Figure 10 illustrates the four cases distinguishable by the single-error correction (SEC) and double-error detection (DED) Hamming code.
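The encoding and checking procedure for the four-bit example can be sketched as follows. The bit groups are those of Figure 8, and $p_4$ and $c_4$ follow the description above. The function names, the status strings, and the choice to read the syndrome with $c_1$ as the least significant bit (so that its value equals the position of the erroneous bit) are assumptions made for this illustration.

```python
def encode(data):
    """Encode [d0, d1, d2, d3] as [p1, p2, d0, p3, d1, d2, d3, p4]."""
    d0, d1, d2, d3 = data
    p1 = d0 ^ d1 ^ d3            # XOR of bits 3, 5, 7
    p2 = d0 ^ d2 ^ d3            # XOR of bits 3, 6, 7
    p3 = d1 ^ d2 ^ d3            # XOR of bits 5, 6, 7
    word = [p1, p2, d0, p3, d1, d2, d3]
    p4 = 0
    for bit in word:             # overall parity of the first seven bits
        p4 ^= bit
    return word + [p4]


def decode(word):
    """Return (status, data) for an 8-bit SEC-DED code word."""
    b = [0] + list(word)         # b[1]..b[8], numbered as in Figure 8
    c1 = b[1] ^ b[3] ^ b[5] ^ b[7]
    c2 = b[2] ^ b[3] ^ b[6] ^ b[7]
    c3 = b[4] ^ b[5] ^ b[6] ^ b[7]
    c4 = 0
    for bit in word:             # parity over all eight bits
        c4 ^= bit
    position = c1 + 2 * c2 + 4 * c3
    if position == 0 and c4 == 0:
        status = "no error"
    elif position == 0:
        status = "p4 error"      # only the extra parity bit is wrong
    elif c4 == 1:
        b[position] ^= 1         # single error: complement the bad bit
        status = "corrected"
    else:
        status = "double error"  # detected, but the data cannot be trusted
    return status, [b[3], b[5], b[6], b[7]]


codeword = encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[4] ^= 1                # flip bit 5 (d1)
print(decode(codeword), decode(corrupted))
```

In the double-error case the returned data bits are left uncorrected; a real memory controller would typically signal an uncorrectable error instead of using them.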

| $c_1$ $c_2$ $c_3$ | $c_4$ | Interpretation |
|---|---|---|
| 0 0 0 | 0 | No errors |
| $x_1$ $x_2$ $x_3$ | 1 | Single error (in position $x_1 x_2 x_3$) is detected and can be corrected |
| $y_1$ $y_2$ $y_3$ | 0 | Double error is detected but cannot be corrected |
| 0 0 0 | 1 | Error in parity bit $p_4$ |

Figure 10: Error Detection and Correction Using SEC-DED Code

4 CYCLIC REDUNDANCY CHECKS

Cyclic redundancy checks (CRCs) are used to detect errors in communication channels, tapes, and disks. Cyclic codes are parity check codes with the additional property that the cyclic shift of a codeword is also a codeword. If $(C_{n-1}, C_{n-2}, \ldots, C_1, C_0)$ is a codeword, then $(C_{n-2}, C_{n-3}, \ldots, C_0, C_{n-1})$ is also a codeword. The wide use of CRCs is due mainly to two factors: (1) simplicity of implementation (the needed hardware includes linear feedback shift registers and XOR gates) and (2) the ability to detect single-bit errors, multiple adjacent bit errors affecting fewer than n-k bits (for an (n,k) code), and burst transient errors (typical of communication applications).

The idea is to append a checksum to the end of the data frame in such a way that the polynomial represented by the resulting frame is divisible by the generator polynomial G(x) that the sender and receiver have agreed upon. When the receiver gets the checksummed frame, it divides it by G(x); if the remainder is not zero, there has been a transmission error. It follows that the best generator polynomials are those least likely to divide evenly into a frame that contains errors. CRCs are distinguished by the generator polynomials they use.

A cyclic code is often described as an (n, k) code, where n is the total length of each code word and k is the length of the data portion, leaving n-k bits of redundancy. The characteristics of a cyclic code depend on the proportion of these two parameters, n and k, and on its generator polynomial. IBM SDLC (Synchronous Data Link Control), a transmission protocol, employs CRC-16, which has 16 bits of redundancy with the generator polynomial $G(x) = x^{16} + x^{15} + x^{2} + 1$.
Large transmission systems, such as Ethernet and Token Ring, use a 32-bit CRC for data protection. Other CRCs widely used in link-level protocols include CRC-8, CRC-10, and CRC-12; [Pet96] discusses their corresponding generator polynomials.

One of the problems with cyclic codes is that the error bit position cannot be determined directly during the decoding process. If the nonzero remainder does not contain enough information about the error bit position, the receiver cannot correct the error and has to request that the sender retransmit. This process requires a large buffer at both the sender and the receiver to store all transmitted data that may need to be retransmitted and to reconstruct the correct receiving order [Ben95]. The retransmission procedure is time consuming, but if the error rate is not too high, it is efficient. One approach to realizing error correction is to use look-up tables [Man95]. Look-up tables contain all possible patterns of nonzero remainders and the position of the corrupt bit in the received data. This method is fast but requires a large amount of storage for the table. In some applications, the use of a look-up table is impractical in terms of its cost and overhead.
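The divide-and-check procedure described in this section can be sketched with bitwise polynomial division over GF(2). The fragment below uses the CRC-16 generator $x^{16} + x^{15} + x^{2} + 1$. It is a plain textbook division with no initial value, bit reflection, or final XOR, so its output is not claimed to match the frame check sequence of SDLC or any other specific protocol; the function names are hypothetical.

```python
GENERATOR = 0x18005   # x^16 + x^15 + x^2 + 1
DEGREE = 16           # number of redundancy bits


def poly_remainder(bits, nbits):
    """Remainder of an nbits-long bit string divided by GENERATOR (mod 2)."""
    for shift in range(nbits - 1, DEGREE - 1, -1):
        if bits & (1 << shift):
            # Cancel the leading 1 by XORing in the aligned generator.
            bits ^= GENERATOR << (shift - DEGREE)
    return bits


def append_crc(data):
    """Return data followed by 16 check bits so the frame divides evenly."""
    shifted = int.from_bytes(data, "big") << DEGREE
    remainder = poly_remainder(shifted, len(data) * 8 + DEGREE)
    return data + remainder.to_bytes(2, "big")


def frame_is_valid(frame):
    """Receiver side: a nonzero remainder signals a transmission error."""
    return poly_remainder(int.from_bytes(frame, "big"), len(frame) * 8) == 0


frame = append_crc(b"error detection")
damaged = bytearray(frame)
damaged[3] ^= 0x10            # single-bit transmission error
print(frame_is_valid(frame), frame_is_valid(bytes(damaged)))
```

Hardware implementations perform the same division with a linear feedback shift register, one bit per clock; the loop above is the software equivalent of that shift-and-XOR process.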

4.1 Checksums

Checksums are commonly used in communication applications. The idea is to add up all the words to be transmitted and then transmit the sum (called the checksum) along with the data. At the receiving end, the checksum is recalculated and compared with the original. If any of the data, including the checksum, is corrupted during transmission, the result is a mismatch. This method will not protect against errors that cause data words to arrive out of order.

The checksum codes differ in the way in which the checksum is generated. The single-precision checksum is generated by adding the n-bit words to be transmitted modulo $2^n$, i.e., ignoring any overflow. Its weakness is that errors that cause the original and the recalculated checksums to differ only in the ignored bit position are not detected, as illustrated in Figure 11. A most significant data line stuck at 1 is an example of an error that exposes this flaw.

[Figure 11: Error Scenario Not Detected by a Single-Precision Checksum (based on [Joh89]). (a) The checksum is formed over the sent data; the carry is ignored. (b) With data line $d_3$ between sender and receiver stuck at 1, the checksum on the received data equals the received checksum and the error goes undetected.]

In the Honeywell checksum, adjacent data words are concatenated prior to computing the checksum; thus K n-bit words are grouped into K/2 2n-bit words. This structure is capable of detecting a bit error that affects all words in the same bit position because such an error makes the two checksums differ in two locations. Nevertheless, overflow can still cause loss of carry-bit information.

The key disadvantage of checksums is their limited ability to diagnose the actual cause of the problem. The observed problem can be due to (1) an error in the checksum calculation, (2) a transmission error, or (3) a corruption of the original data (before the checksum computation).
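The blind spot of the single-precision checksum can be reproduced with a small simulation. The sketch below uses 4-bit words and models a most significant data line that is stuck at 1 for both the data words and the transmitted checksum. The data values and helper names are assumptions chosen so that the fault changes the sum only by multiples of $2^4$, i.e., only in the ignored carry.

```python
WORD_BITS = 4
MASK = (1 << WORD_BITS) - 1
MSB = 1 << (WORD_BITS - 1)


def single_precision_checksum(words):
    """Sum of the words with the overflow (carry out) discarded."""
    return sum(words) & MASK


def stuck_at_1_msb(word):
    """Model a faulty most significant data line that always reads 1."""
    return word | MSB


sent = [0b0111, 0b0001, 0b1000, 0b1000]          # illustrative data words
sent_checksum = single_precision_checksum(sent)

# Everything, including the checksum, travels over the faulty lines.
received = [stuck_at_1_msb(w) for w in sent]
received_checksum = stuck_at_1_msb(sent_checksum)

data_corrupted = received != sent
checks_out = single_precision_checksum(received) == received_checksum
print("data corrupted:", data_corrupted, " checksum matches:", checks_out)
```

Two words had their most significant bit forced from 0 to 1, which adds 2 × 8 = 16 to the sum; that amount falls entirely into the discarded carry, so the receiver sees a matching checksum even though the data are wrong.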
A checksum mismatch alone does not reveal which of these three scenarios occurred.

4.2 Arithmetic Codes

Arithmetic codes detect errors in arithmetic units such as adders and multipliers. They are useful for checking arithmetic operations, where parity codes would not be preserved under addition and subtraction. Separate arithmetic codes keep the check symbols apart from the data symbols, whereas nonseparate arithmetic codes combine check and data symbols. Examples of arithmetic codes are AN codes, residue codes, and bi-residue codes.
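As an illustration of a separate arithmetic code, the sketch below checks an adder with a residue code: each operand carries its residue modulo a check base, the residues are added independently, and the two results are compared. The check base 3 and all names are assumptions made for this example.

```python
CHECK_BASE = 3  # assumed check base for the residue code


def encode(x):
    """Separate code: the data word and its check symbol are kept apart."""
    return (x, x % CHECK_BASE)


def checked_add(a, b, adder):
    """Add two encoded operands with `adder` and check the result.

    Returns ((sum, predicted_residue), ok). The predicted residue is computed
    from the check symbols only, independently of the adder under test.
    """
    (x, rx), (y, ry) = a, b
    total = adder(x, y)
    predicted = (rx + ry) % CHECK_BASE
    ok = (total % CHECK_BASE) == predicted
    return (total, predicted), ok


def good_adder(x, y):
    return x + y


def faulty_adder(x, y):
    """Hypothetical adder whose output bit 2 is inverted."""
    return (x + y) ^ 0b100
```

An error is missed only if it changes the sum by a multiple of the check base; a single inverted output bit changes the sum by a power of two, which is never a multiple of 3.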

For example, in an AN arithmetic code, data X is multiplied by a check base A to form A·X. Addition of code words is performed modulo M, where A divides M. This yields $A(X +_m Y) = AX +_M AY$, where $+_M$ denotes addition modulo M and $m = M/A$. The correctness of the operation is checked by dividing the result by A: if the remainder is zero, there is no error; otherwise, there is an error. This is illustrated in Figure 12. For more information about coding in reliable computer systems, the reader is referred to [Rao89, Blah84].

[Figure 12: Example Arithmetic Code. The code words AX and AY are added modulo M, and the residue modulo A of the result is checked.]

5 TIME REDUNDANCY

The basic concept of time redundancy is to repeat a computation two or more times and compare the results to determine whether a discrepancy exists. If an error is detected, the computation can be performed again to see whether the disagreement remains or disappears. Such approaches are good for detecting errors due to transient faults but cannot protect against errors resulting from permanent faults.

Another form of time redundancy, intended to handle permanent faults, modifies the way the computation is performed the second time. One approach uses alternating logic for self-dual combinational circuits [Rey78]. The circuit performs a function on some set of inputs in one time step and performs the same function on the complemented inputs in a subsequent time step; the second output should be the complement of the function value obtained for the original inputs. If the second value of the function is not the complement, an error is detected.

The second approach uses recomputing with shifted operands [Pat82], which is applicable to bit-sliced organizations of hardware. In the first step, the normal computation is performed on the operands and the results are stored in a register.
In the next step, the operands are shifted left by k bits, and the output is shifted right by k bits and compared with the result of the previous computation. Any error in k-1 consecutive bit slices of an arithmetic or logical operation will be detected by this method. The additional hardware required consists of three shifters, a storage register to hold the results of the first computation, and a comparator.

A variant of this method is called recomputing with swapped operands, where in the first two steps, the operation is performed in the normal form. In the following time step, the upper and lower halves of the operands are swapped such that a faulty bit slice operates on opposite halves of the operands in the two computations. The additional hardware required consists of several multiplexers, a storage register, and a comparator.

6 WATCHDOG TIMERS

Watchdog timers have been used since the early days of digital systems as an inexpensive method of error detection. A timer is implemented separately from the process that it monitors. The process being watched must reset the timer before the timer expires; otherwise, the watched process is assumed to be faulty.
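The reset-before-expiry discipline can be sketched as a software watchdog. The example below is only an illustration built on Python's standard `threading.Timer`; the class name `Watchdog`, the helper `run_task`, and the timeout values are hypothetical, and a real system would typically use a hardware timer or a separate monitoring process.

```python
import threading
import time


class Watchdog:
    """Software watchdog: calls on_expire unless reset() arrives in time."""

    def __init__(self, timeout_s, on_expire):
        self._timeout_s = timeout_s
        self._on_expire = on_expire
        self._lock = threading.Lock()
        self._timer = None

    def reset(self):
        """(Re)start the countdown; the watched process calls this periodically."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._timeout_s, self._on_expire)
            self._timer.daemon = True
            self._timer.start()

    def stop(self):
        """Disarm the watchdog."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None


def run_task(watchdog, steps, step_s):
    """Hypothetical watched task: resets the timer after each unit of work."""
    for _ in range(steps):
        time.sleep(step_s)
        watchdog.reset()
```

A typical use is to arm the watchdog with `reset()`, run the task, and treat a call to `on_expire` as the signal to reset the system or start recovery. Once the task stops resetting the timer, the callback fires after `timeout_s` seconds.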

Traditionally, watchdog timers are used to detect control-flow errors that result in the timer not being reset [Pras89]. When the timer expires, the system is reset. Alternatively, instead of resetting the system, an interrupt can be triggered to initiate a recovery from the error. Watchdog timers can also be used in much the same way that timeouts are used to monitor the behavior of a single subsystem [Ore75]. Timeouts differ from watchdog timers in that they provide a finer check of control flow. Watchdog timers can be implemented in either hardware (the timer is generally an external one that can be reset with a signal) or software (the timer often runs on the same processor as the process being monitored but is maintained as a separate process) [Sie98, Pras89].

A novel implementation of the watchdog timer effect without using a timer is the technique of extended-precision checksum-based control-flow checking [Sax90]. An extended-precision checksum is taken of a branch-free block of instructions as the sum total of the instructions or of some transformation of the instructions. Before each block, the checksum value is sent to a buffer. As the instructions execute, they are subtracted from the buffer. When the block ends, or a branch occurs, a zero-check signal is sent. If the buffer becomes zero or negative before the signal is sent, a control-flow error has occurred. If the buffer is positive when the signal is sent, an error has occurred as well.

6.1 Example Applications of Watchdog Timers

Pluribus Reliable Multiprocessor. An example of a system designed with extensive use of watchdog timers is the Pluribus multiprocessor [Sax90]. Pluribus was built primarily for research purposes; its main goal is high reliability. The behavior of Pluribus as a whole is not monitored, but hardware and software timers monitor almost every subsystem.
This approach increases overall system reliability, since a subsystem that fails due to an intermittent or transient fault is restarted and not allowed to cause a system failure. While Pluribus uses other error-detection techniques, those techniques are usually combined with a timer. The timers range from five microseconds to two minutes in duration.

The Pluribus subsystems cycle with a characteristic time constant. During each cycle, the subsystem performs a complete self-check for consistency. Passage through the cycle means the subsystem is operating correctly; a lapse of too much time without a timer reset indicates that the subsystem has suffered a failure from which it cannot recover by itself.

An example subsystem is the free message buffer list, where message buffers are stored when not in use. Buffers leave the list for at most two minutes, so a 2-minute timer is maintained for each buffer. If a timer runs out, a failure has occurred and the buffer will not return to the free list on its own, so the monitored buffer is forced back onto the free list. In this case, the failure causes a system performance degradation lasting less than two minutes, during which the system operates with too few message buffers. However, the error causes no data loss, since the timer facilitates a complete recovery.

Another example is the failure of the mutual exclusion locks that are on each subsystem. A lock failure can leave a resource locked when no subsystem is using it. A subsystem trying to use the resource is put in a waiting loop. Since the lock failed, the resource will never become free, but a 1/15-second timer will interrupt the processor, which will arbitrarily unlock the resource. Aside from the temporary (1/15-second) degradation in system performance, the system is unaffected by the error. As in the first example, no data is lost and a complete recovery is possible.
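The per-buffer timer on the free message buffer list can be sketched as follows. This is not the Pluribus implementation; it is a simplified model that assumes an explicit clock value `now` (in seconds) passed in by the caller, and the class and method names are hypothetical.

```python
BUFFER_TIMEOUT_S = 120.0  # two minutes, as in the description above


class FreeBufferList:
    """Free list that forces overdue buffers back, modeled on the text."""

    def __init__(self, buffer_ids):
        self.free = set(buffer_ids)
        self._deadline = {}  # buffer id -> time at which its timer expires

    def take(self, now):
        """Remove a buffer from the free list and start its timer."""
        buf = self.free.pop()
        self._deadline[buf] = now + BUFFER_TIMEOUT_S
        return buf

    def give_back(self, buf):
        """Normal return of a buffer; its timer is cancelled."""
        del self._deadline[buf]
        self.free.add(buf)

    def reclaim_expired(self, now):
        """Force every buffer whose timer has run out back onto the free list."""
        expired = [b for b, t in self._deadline.items() if now >= t]
        for buf in expired:
            self.give_back(buf)
        return expired
```

Calling `reclaim_expired` periodically bounds the time a lost buffer stays out of service to the timer duration, which matches the temporary degradation described above.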

A more drastic error, from which watchdog timers aid in partial recovery, is the permanent failure of a processor. When a processor fails, any message buffer it had not returned to the free buffer list will be returned by the operation of the timer monitoring the list, as described above. Likewise, any resource the processor had locked will be unlocked. While a complete recovery is not immediately possible, since the processor itself must of course be repaired or replaced, the system can remain up and the error can be limited to the single processor.

VAX-11/780. A multiprocessor system designed for more commercial applications that makes use of a watchdog timer is the VAX-11/780 [Sie98]. On this system, the console processor runs a watchdog process that is reset when an interrupt line is strobed. If the line is not strobed by a processor within 200 microseconds, this indicates a failure, and the console processor attempts to determine the reason for the failure.

Bell System Telephone Switches. Yet another system that employs watchdog processors to detect errors is the stored-program telephone switching system developed by Bell Systems [Con97]. External watchdog timers monitor proper program operation by triggering recovery when the timers are not periodically reset. This allows early detection of problems caused by software errors (before the error propagates and causes severe damage to the system) and, consequently, easier recovery. It should be noted that, despite the watchdog-based error detection, software techniques known as audits were the main line of defense against errors.

Mars Sojourner. An example in which a watchdog timer demonstrated its ability to detect errors is the Sojourner rover of NASA's Mars Pathfinder mission [Jon97]. The computer system that controlled the Sojourner rover uses a real-time, preemptive, multithreaded operating system. Tasks are scheduled based on priorities that reflect their relative urgency.
Due to a design flaw, a condition known as priority inversion could occur. To illustrate priority inversion, consider the following execution scenario: (1) a low-priority thread obtains a mutually exclusive lock to access shared data, (2) under these conditions, a long-running task with a higher priority than the low-priority thread is scheduled due to an interrupt, and (3) the higher-priority thread needs access to the data locked by the lower-priority task. As a result, (1) the lower-priority task is prevented from running by the higher-priority thread, and (2) the higher-priority task is also prevented from running because it blocks waiting for the low-priority thread to release the lock.

Using a watchdog timer, the above scenario was detected and the system was restarted. However, the full restart caused loss of data, and the repeated resets seriously limited the correct operation of the Mars rover system. The problem was eventually diagnosed, and the software was patched to reestablish proper behavior. In this system, the recovery method applied when the watchdog times out is a traditional system reset, a drastic but robust measure that represents good engineering practice. The availability of the system is much more important than the data lost due to the system reset.

6.2 Limitations of Watchdog Timers

Watchdog timers are not ideal for detecting errors in digital systems. The reasons for this fall into four areas:

1. While the error detection is not limited to any particular fault model, watchdog timers detect only errors of a very specific type. The assumption is that any error will manifest itself as a control-flow error such that the system does not continue to reset the timer. If a control-flow error occurs but the program still resets the timer in time, the error will go undetected.

2. Timer resets must be placed with care to be effective. They cannot be placed inside interrupt routines or loops (otherwise an infinite loop could keep resetting the timer), but they must occur often enough that the timer cannot expire during any normal operation.

3. Only processes with relatively deterministic runtimes can be checked, since the error detection is based entirely on the time between timer resets. If the set time is shorter than the longest possible runtime of the checked process, the timer can expire even though there is no error. On the other hand, if the time is set too long, then even if a control-flow error occurs, the process may have enough time to get back to the point at which the timer is reset, and the error will not be detected.

4. A watchdog timer provides only an indication of possible process failure; a partially failed process may still be able to reset the timer. Coverage is limited, as neither the data nor the results are checked. When used to reset the system, a watchdog timer can improve availability (the mean time to recovery is shortened) but not reliability (failures are just as likely to occur).

When the availability of a digital system is more important than the loss of data under some condition, the use of a watchdog timer to reset the system on the detection of an error is an appropriate choice.

7 HEARTBEATS

Heartbeat is a common approach to detecting process and node failures in a distributed (networked) computing environment. Periodically, a monitoring entity sends a message (a heartbeat) to a monitored node or process and waits for a reply. If the monitored node does not respond within a predefined timeout interval, it is declared failed and an appropriate recovery action is initiated.

7.1 Limitations of Traditional Heartbeats

There are two major problems associated with the traditional heartbeat scheme.

First, the timeout period is pre-negotiated by the two parties or sometimes even hard-coded by the programmer.
The predefined timeout value cannot adapt to changes in network traffic or to load variability on individual nodes. In cases of high network traffic, high load on the nodes, or a slow node, the timeout value can be too short and cause the monitoring node to declare a healthy node faulty. Such a false alarm is undesirable in a distributed environment, especially for critical applications such as those used in commercial banking and in database systems.

Second, the monitored node is assumed to be healthy if it is able to respond to a heartbeat message. This is usually acceptable for a single-threaded application. However, in a multithreaded application, an independent thread of execution is usually responsible for replying to the heartbeat message. The healthy operation of this thread does not necessarily imply the healthy operation of the entire multithreaded application. Other threads inside the process may be in a deadlock that keeps the entire process from making progress; alternatively, other threads could be operating in a corrupted state that keeps the process from providing proper service.

Adaptive and smart heartbeat algorithms address these two problems. A heartbeat algorithm is called adaptive if the timeout value used by the monitor is not fixed but rather is periodically negotiated between the two parties to adapt to changes in the network traffic or node load. A heartbeat algorithm is called smart if the entity being monitored executes a set of predefined checks to verify the robustness of the entire process and only then responds to the monitoring process.

7.2 Designing Adaptive, Smart Heartbeats

To illustrate the concept of adaptive, smart heartbeats, two independent, multithreaded processes, heartbeat_replier and heartbeat_monitor, are created [Bas00]. Heartbeat_monitor is the monitoring entity that is responsible for periodically sending heartbeat request messages to the target node. Heartbeat_replier is the monitored entity that responds to the heartbeat request messages sent by the monitor. The adaptive scheme uses Jacobson's algorithm [Tan96], which allows the timeout value to be adjusted according to measured network performance in terms of the round trip time (RTT) of message transmission. The heartbeat algorithm is made smart (i.e., able to verify the robustness of the entire process) by using a null test message inside the process to test the healthy operation of all the threads within the process. In the following sections, we present the implementations of these two schemes.

The heartbeat protocol is depicted in Figure 13. Periodically, the heartbeat monitor sends a heartbeat message to the heartbeat replier, clears the counter ack_missed, and starts the timer. The duration of the timer is dictated by the current value of the timeout variable associated with the heartbeat replier.

[Figure 13: Protocol for Adaptive Heartbeat. Labels: Heartbeat Monitor, Heartbeat Replier, Heartbeat Period, Timeout Expiration, HB message, HB ack, RTT.]

On the other side, the heartbeat replier responds with a heartbeat acknowledgment message. If the heartbeat acknowledgment message is received by the heartbeat monitor before the timer expires, the monitor assumes that the remote process is alive; otherwise, the counter ack_missed is incremented. If the counter has not reached its maximum value, a further heartbeat message can be sent from the heartbeat monitor to the heartbeat replier; otherwise, the remote process is assumed to be faulty. Crucial for the protocol are the values of the timeout and the heartbeat period.
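One monitoring round of this protocol, together with an RTT-driven timeout, can be sketched as follows. This is not the implementation from [Bas00]. The smoothing rule is the form of Jacobson's algorithm commonly used for TCP retransmission timers, with the usual gains of 1/8 and 1/4 assumed here. The names `AdaptiveTimeout`, `heartbeat_round`, and `send_and_wait` are hypothetical; `send_and_wait(timeout_s)` stands for sending one heartbeat and returning the measured RTT, or `None` if no acknowledgment arrives in time.

```python
from typing import Callable, Optional


class AdaptiveTimeout:
    """Timeout derived from a smoothed RTT and its variation (Jacobson-style)."""

    def __init__(self, initial_timeout_s: float,
                 alpha: float = 0.125, beta: float = 0.25):
        self.timeout_s = initial_timeout_s
        self._srtt: Optional[float] = None   # smoothed round trip time
        self._rttvar = 0.0                   # smoothed RTT variation
        self._alpha = alpha                  # assumed gain for the RTT average
        self._beta = beta                    # assumed gain for the variation

    def update(self, rtt_s: float) -> float:
        """Fold one RTT sample into the estimate and return the new timeout."""
        if self._srtt is None:
            self._srtt = rtt_s
            self._rttvar = rtt_s / 2
        else:
            self._rttvar = ((1 - self._beta) * self._rttvar
                            + self._beta * abs(self._srtt - rtt_s))
            self._srtt = (1 - self._alpha) * self._srtt + self._alpha * rtt_s
        self.timeout_s = self._srtt + 4 * self._rttvar
        return self.timeout_s


def heartbeat_round(send_and_wait: Callable[[float], Optional[float]],
                    timeout: AdaptiveTimeout, max_missed: int) -> bool:
    """Return True if the replier is assumed alive, False if assumed faulty."""
    ack_missed = 0                           # cleared at the start of the round
    while ack_missed < max_missed:
        rtt = send_and_wait(timeout.timeout_s)
        if rtt is not None:                  # acknowledgment arrived in time
            timeout.update(rtt)
            return True
        ack_missed += 1                      # timer expired without an ack
    return False
```

Because the timeout follows both the average RTT and its variation, a slow but healthy node lengthens the timeout instead of triggering a false alarm, while a run of stable samples lets it shrink again.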
In general, the heartbeat period can be fixed as a multiple of the current value of the timeout. It is, however, desirable to have a timeout value that adapts to the current response time of the remote process. The response time, as seen by the heartbeat monitor, is a function of the current load on the remote machine and the time required to transfer the heartbeat message and the heartbeat acknowledgment; i.e., the response time is a function of the round trip time (RTT). To calculate the RTT, it is sufficient to include in the heartbeat message a timestamp whose value is the sending time. The replier sends this timestamp back to the monitor, so when the monitor receives a heartbeat acknowledgment, it can calculate the instantaneous RTT as the difference between the current time and that timestamp. However, it turns out that such a solution still does not perform well in the case of a variable workload. The main problem is the variability of
