Defect Tolerance in VLSI Circuits
Prof. Naga Kandasamy

We will consider the following redundancy techniques to tolerate defects in VLSI circuits.

- Duplication with complementary logic (physical redundancy).
- Permanent fault detection using time redundancy.
- Self-checking circuits.
- Reconfigurable memory arrays.

1 Duplication with Complementary Logic

This technique duplicates a given module and compares the outputs of the resulting two modules. As long as the comparator works correctly, a failure of any one of the two modules is detected. The problems with duplication with comparison are twofold: (1) the comparator may fail, and (2) the approach assumes that only one of the two duplicated modules will fail at any given time; that is, it ignores common-mode failures that cause the two modules to fail in the same fashion at the same time. So, we need to modify the design of duplication with comparison schemes to minimize the effect of common-mode failures.

One technique useful in tackling common-mode failures in VLSI circuits is the use of complementary logic, where one circuit uses positive logic (that is, logic 1) while the other circuit uses negative logic (that is, logic 0). If we know the Boolean function realized by a circuit using positive logic, we can easily determine the function realized by the same circuit using negative logic using the concept of duality. Recall from Boolean algebra that the dual of a Boolean function is formed by replacing AND operations with OR operations, OR operations with AND operations, 1s with 0s, and 0s with 1s. The variables and complement operations are not changed. For example, consider the function

$$f(x_1, x_2, x_3) = x_1 x_2 + x_3$$

The dual of the function $f$ is given by

$$f_d(x_1, x_2, x_3) = (x_1 + x_2) x_3$$

We can use the dual function $f_d$ to obtain the complement of $f$ by replacing each variable in $f_d$ with its complement:
$$\bar{f}(x_1, x_2, x_3) = f_d(\bar{x}_1, \bar{x}_2, \bar{x}_3) = (\bar{x}_1 + \bar{x}_2)\bar{x}_3$$

Let $X$ be a vector consisting of $n$ input bits given by $X = (x_1, x_2, \ldots, x_n)$. If we apply $X$ to an arbitrary Boolean function $f$ and then apply $\bar{X} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)$ to the function $f_d$, where $f$ and $f_d$ are duals, the resulting outputs will be complementary. That is, $f_d(\bar{X}) = \bar{f}(X)$.

These notes are adapted from: B. W. Johnson, Design and Analysis of Fault Tolerant Digital Systems, Addison Wesley, 1989.
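The duality relation can be checked exhaustively for the example function $f = x_1 x_2 + x_3$. The following is a minimal Python sketch, not part of the original notes: the helper names (`complement`, `outputs_complementary`, `duplicate_and_compare`, `stuck_at_1`) are hypothetical, and the "fault" is modeled simply as a module whose output is stuck at 1.

```python
from itertools import product

def f(x1, x2, x3):
    """f = x1 x2 + x3, the example function from the text."""
    return (x1 & x2) | x3

def f_dual(x1, x2, x3):
    """f_d = (x1 + x2) x3: AND and OR swapped, variables unchanged."""
    return (x1 | x2) & x3

def complement(bits):
    """Bitwise complement of an input vector."""
    return tuple(1 - b for b in bits)

def outputs_complementary(func, dual, n):
    """True if dual(X-bar) is the complement of func(X) for every X."""
    return all(dual(*complement(x)) == 1 - func(*x)
               for x in product((0, 1), repeat=n))

def duplicate_and_compare(x, module=f, dual_module=f_dual):
    """Run the module on X and its dual on X-bar.

    Returns (output, error_flag). In the fault-free case the two
    outputs are complementary, so equal outputs signal an error.
    """
    out = module(*x)
    out_dual = dual_module(*complement(x))
    return out, out == out_dual

def stuck_at_1(*_):
    """Hypothetical faulty module: output stuck at logic 1."""
    return 1

print(outputs_complementary(f, f_dual, 3))
print(duplicate_and_compare((0, 0, 0), module=stuck_at_1))
```

Note that the stuck-at-1 module is flagged only for inputs where the correct value of $f$ is 0; for the other inputs the fault does not show up at the output, so the comparator has nothing to detect.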
Fig. 1: Implementation of the function f(x1, x2, x3) = x1 x2 x3 + x1 x2 x3 and its dual.

Complementary logic can be used to implement a duplication with comparison approach to fault detection. Rather than use exact replicas of each module, the modules are designed as duals of each other. One module operates using positive logic and the other module operates using negative logic. If both modules are operating properly, the outputs will be complementary. There are three advantages of using complementary logic:

(1) The use of dual implementations forces the use of separate masks to create the two modules. The possibility of common-mode failures resulting from design mistakes or mask problems is reduced.
(2) The voltage transitions on the corresponding lines in the two modules are in opposite directions, and so the possibility of faults that are sensitive to voltage transitions producing identical effects is reduced.
(3) Corresponding lines in the two modules are always at different voltage levels, and so a short between two such lines always results in one of the two lines having an erroneous value and the other line having the correct value. Consequently, the fault can be detected.

Let us consider the design of a duplicate and compare scheme that uses complementary logic to realize the function

f(x1, x2, x3) = x1 x2 x3 + x1 x2 x3

The dual of f is given by

f_d(x1, x2, x3) = (x1 + x2 + x3)(x1 + x2 + x3)

(The complement bars over individual literals in these two expressions did not survive in this copy.) Fig. 1 shows the logic diagrams of the circuits that realize f and f_d, respectively. The original function and its dual are now operated in parallel using complementary input combinations, as shown in Fig. 2. Logic values on corresponding lines in the two modules are complementary. The outputs, in the fault-free case, will also be complements, and can be compared to detect faults.

2 Fault Detection using Time Redundancy

One of the problems with the duplicate and compare approach is the penalty paid in extra hardware.
Time redundancy is a way to decrease the hardware overhead needed to achieve fault detection (or fault tolerance), at the expense of using additional time. The basic concept of time redundancy is to repeat computations in a way that allows faults (both transient and permanent) to be detected. The approach used to detect transient faults is shown in Fig. 3. To detect permanent faults using time redundancy, the computation (or the data) must be modified when it is performed the second time, as shown in Fig. 4. We will consider two approaches that use time redundancy to detect permanent faults in VLSI circuits: (1) alternating logic and (2) recomputing with shifted operands (RESO).

Fig. 2: Duplication with comparison using complementary logic. An example input and the values of the internal lines are also shown. Note the complementary values on the corresponding lines.

Fig. 3: In time redundancy, the computations are repeated at different points in time, and the results are then compared.

Fig. 4: If time redundancy is used to tolerate permanent faults, the computations must be modified when they are performed the second time.

Alternating Logic

The concept of alternating logic can be applied to general combinational logic circuits if the circuit possesses the property of self-duality. A combinational circuit is said to be self-dual if and only if

$$f(X) = \overline{f(\bar{X})},$$

where $f$ is the Boolean expression for the circuit and $X$ is the input vector for the circuit. In other words, a combinational circuit is self-dual if the output of the circuit for the input vector $X$ is the complement of the output when the input vector $\bar{X}$ is applied. So, for a self-dual circuit, the application of an input $X$ followed by the input $\bar{X}$ produces outputs that alternate between 1 and 0. The key to detecting faults is determining that at least one input combination exists for which the fault does not result in alternating outputs. The full-adder circuit shown in Fig. 5 is a self-dual circuit.

Any combinational circuit with $n$ inputs can be transformed into a self-dual circuit with no more than $n + 1$ input variables. The dual $f_d$ of an $n$-variable function $f$ is given by

$$f_d = \overline{f(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)}$$

The function given by

$$f_{sd} = x_{n+1} f + \bar{x}_{n+1} f_d$$

is a self-dual function because when $x_{n+1} = 1$, that is, when $X = (x_1, x_2, \ldots, x_n, x_{n+1}) = (x_1, x_2, \ldots, x_n, 1)$, the value of $f_{sd}$ is $f$. When $x_{n+1} = 0$, that is, when we provide $\bar{X} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n, 0)$, the value of $f_{sd}$ is $f_d$ evaluated at the complemented inputs, which is the complement of $f(x_1, x_2, \ldots, x_n)$. Thus, $x_{n+1}$ is a control line that determines which of the two functions, $f$ or $f_d$, appears on the output line.

Alternating logic detects a set of faults if, for every fault within the set, there is at least one input combination that produces non-alternating outputs. Fig. 6 shows the resulting truth table for the various stuck-at-1 or stuck-at-0 faults present in the full-adder circuit in Fig. 5. As we can see, each stuck-type fault results in at least one set of non-alternating outputs being produced for complementary inputs at either the carry or the sum output.

When using alternating logic, it is important to note that faults may not be immediately detected. For example, suppose that the full adder contains a stuck-at-0 fault on line D.
As we can see from the truth table, the sum output is not affected by this fault. So, we must depend on the carry output to detect this fault. The carry output, however, will have alternating outputs for the complementary input combinations (000) and (111) as well as (001) and (110). So, the fault D/0 is not detected until the input combination (010) and (101), or the combination (011) and (100), is applied to the circuit. Depending on when these combinations are actually applied to the circuit, the time elapsed before the detection of the fault can be significant.
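The ideas in this subsection can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the circuit of Fig. 5: the full adder is modeled at the equation level, the helper names (`is_self_dual`, `make_self_dual`, `non_alternating_inputs`) are hypothetical, and the injected fault (the output of the gate computing `a AND b` in the carry logic stuck at 0) is an assumed example rather than the line D fault discussed above.

```python
from itertools import product

def fa_sum(a, b, cin):
    """Sum output of a full adder."""
    return a ^ b ^ cin

def fa_carry(a, b, cin):
    """Carry output of a full adder (majority function)."""
    return (a & b) | (a & cin) | (b & cin)

def is_self_dual(f, n):
    """True if f(X) is the complement of f(X-bar) for every X."""
    return all(f(*x) == 1 - f(*(1 - v for v in x))
               for x in product((0, 1), repeat=n))

def make_self_dual(f):
    """Build f_sd = x_{n+1} f + (not x_{n+1}) f_d from an n-input f."""
    def f_d(*x):
        return 1 - f(*(1 - v for v in x))
    def f_sd(*x):
        *xs, ctrl = x          # last argument is the control line x_{n+1}
        return f(*xs) if ctrl else f_d(*xs)
    return f_sd

def non_alternating_inputs(f, n):
    """Inputs X (first bit 0, to list each pair once) with f(X) == f(X-bar)."""
    return [x for x in product((0, 1), repeat=n)
            if x[0] == 0 and f(*x) == f(*(1 - v for v in x))]

def faulty_carry(a, b, cin):
    """Assumed fault: the (a AND b) term of the carry logic is stuck at 0."""
    return (a & cin) | (b & cin)

def and2(a, b):
    return a & b

print(is_self_dual(fa_sum, 3), is_self_dual(fa_carry, 3))
print(is_self_dual(and2, 2), is_self_dual(make_self_dual(and2), 3))
print(non_alternating_inputs(faulty_carry, 3))
```

In this model only the complementary pair (001), (110) exposes the assumed fault, which mirrors the point made above: detection has to wait until a suitable input pair happens to be applied.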
Fig. 5: A full adder is a self-dual circuit. Complementary inputs produce complementary outputs.

Recomputing with Shifted Operands

Another form of time redundancy is called recomputing with shifted operands (RESO). RESO was developed as a method to detect errors in arithmetic logic units (ALUs). (RESO is discussed on page 160 of the textbook.) We will illustrate how RESO is used with the example of an n-bit ripple-carry adder that performs addition. Suppose that the $i$th full-adder cell (or slice) is faulty and produces an erroneous value for the function's output at that bit slice. During the first computation, when the operands are not shifted, the $i$th output of the circuit is erroneous. When the input operands are shifted left by one bit, the faulty bit slice then operates on, and corrupts, the $(i-1)$th bit. When the result is shifted back to the right, the two results (the first with unshifted operands and the second with shifted operands) are either both correct, or they disagree in either (or both) the $i$th or the $(i-1)$th bits.

Suppose we compute $R = A + B$, and the $i$th full adder is faulty. When the operands are unshifted,

$$R_{\text{fault free}} = r_n r_{n-1} \ldots r_i r_{i-1} \ldots r_1 r_0$$
$$R_{\text{faulty}} = r_n r_{n-1} \ldots \bar{r}_i r_{i-1} \ldots r_1 r_0 \tag{1}$$

where $\bar{r}_i$ is the error in the result bit due to the faulty bit slice. A faulty bit slice can have one of three effects: the sum bit can be stuck at 0 or 1, the carry bit can be stuck at 0 or 1, or both the sum bit and the carry bit may be in error. The following table shows the effect of each possible error on the sum $R$.
Fig. 6: The truth table for single stuck-line faults in the full-adder circuit of Fig. 5.

Error                     Effect on Sum
Sum is 0                  $-2^i$
Sum is 1                  $+2^i$
Carry is 0                $-2^{i+1}$
Carry is 1                $+2^{i+1}$
Sum is 0, carry is 0      $-(2^{i+1} + 2^i) = -3 \cdot 2^i$
Sum is 0, carry is 1      $2^{i+1} - 2^i = +2^i$
Sum is 1, carry is 0      $2^i - 2^{i+1} = -2^i$
Sum is 1, carry is 1      $2^{i+1} + 2^i = +3 \cdot 2^i$

In summary, the result generated for the unshifted operands, if bit slice $i$ is faulty, is incorrect by one of $[0, \pm 2^i, \pm 2^{i+1}, \pm 3 \cdot 2^i]$.

When the operands $A$ and $B$ are shifted to the left by two bits, the sum $R$ computed, and then shifted right by two bits, we obtain

$$R_{\text{fault free}} = r_n r_{n-1} \ldots r_{i-1} r_{i-2} \ldots r_1 r_0$$
$$R_{\text{faulty}} = r_n r_{n-1} \ldots r_{i-1} \bar{r}_{i-2} \ldots r_1 r_0 \tag{2}$$

and a similar analysis of possible bit errors and their effect on the sum gives us

Error                     Effect on Sum
Sum is 0                  $-2^{i-2}$
Sum is 1                  $+2^{i-2}$
Carry is 0                $-2^{i-1}$
Carry is 1                $+2^{i-1}$
Sum is 0, carry is 0      $-(2^{i-1} + 2^{i-2}) = -3 \cdot 2^{i-2}$
Sum is 0, carry is 1      $2^{i-1} - 2^{i-2} = +2^{i-2}$
Sum is 1, carry is 0      $2^{i-2} - 2^{i-1} = -2^{i-2}$
Sum is 1, carry is 1      $2^{i-1} + 2^{i-2} = +3 \cdot 2^{i-2}$
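The two error tables can be exercised with a small bit-level simulation. The sketch below is an assumption-laden model rather than the ALU of the notes: operands are unsigned, the operand pairs are restricted so that $A + B$ fits in $n$ bits, the adder is widened by the shift amount, and a fault forces the sum and/or carry output of one slice to a fixed value. All function names are hypothetical.

```python
from itertools import product

def ripple_add(a, b, width, slice_idx=None, sum_stuck=None, carry_stuck=None):
    """Ripple-carry adder built from full-adder slices.

    If slice_idx is given, that slice's sum and/or carry output is
    forced to the stuck value (None means the output is unaffected).
    The carry out of the top slice is dropped.
    """
    carry, result = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s = ai ^ bi ^ carry
        c = (ai & bi) | (carry & (ai ^ bi))
        if i == slice_idx:
            if sum_stuck is not None:
                s = sum_stuck
            if carry_stuck is not None:
                c = carry_stuck
        result |= s << i
        carry = c
    return result

def reso_add(a, b, n, shift=2, **fault):
    """Add twice (unshifted, then shifted) and compare the two results."""
    width = n + shift  # extra bits so the shifted operands do not overflow
    first = ripple_add(a, b, width, **fault)
    second = ripple_add(a << shift, b << shift, width, **fault) >> shift
    return first, first != second  # (result, error detected?)

# Every combination of sum/carry stuck values except "no fault at all".
FAULTS = [(s, c) for s in (None, 0, 1) for c in (None, 0, 1)
          if (s, c) != (None, None)]

def undetected_errors(n, shift):
    """List (a, b, slice, sum_stuck, carry_stuck) cases where the first
    result is wrong but the comparison does not flag it."""
    missed = []
    for a, b in product(range(2 ** n), repeat=2):
        if a + b >= 2 ** n:
            continue  # assumption: the true sum fits in n bits
        for i in range(n + shift):
            for s, c in FAULTS:
                first, flagged = reso_add(a, b, n, shift, slice_idx=i,
                                          sum_stuck=s, carry_stuck=c)
                if first != a + b and not flagged:
                    missed.append((a, b, i, s, c))
    return missed

print(len(undetected_errors(4, shift=2)))
print((2, 2, 2, 0, 0) in undetected_errors(4, shift=1))
```

With a two-bit shift this model finds no wrong result that escapes the comparison for 4-bit operands. With a one-bit shift it does: for $A = B = 2$ and slice 2 having both outputs stuck at 0, both computations return 0, which suggests why the analysis above uses a shift of two bits.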
Fig. 7: The ALU structure using RESO.

Summarizing, the result will be incorrect by one of $[0, \pm 2^{i-2}, \pm 2^{i-1}, \pm 3 \cdot 2^{i-2}]$. Comparing the two tables, we see that the two sets of nonzero error values have no element in common, so the results of the two computations (that is, the unshifted one and the one where the operands are shifted by two bits) cannot agree unless both are correct.

The structure of an ALU that uses the RESO technique is shown in Fig. 7. The additional hardware required for the technique consists of the three shifters, the storage register to hold the results of the first computation, and the comparator. Also, the ALU must be extended by 2 bits to allow the two-bit arithmetic shift to be performed without an overflow. The primary issues with the RESO approach are the additional hardware required and the lack of coverage provided for faults in the shifters and the comparator.

3 Self-Checking Logic

Self-checking logic is needed to tackle the "checking the checker" problem. In duplicate and compare approaches, it is necessary to compare the outputs of two modules. So, the basic problem is either to ensure that the comparator is fault free or to design a comparator that can detect its own faults, that is, a self-checking comparator.

First, we define several terms that are important to understand self-checking technology. A circuit is said to be self-checking if it has the ability to detect the existence of a fault without the need for any externally applied stimulus (as is done in circuit testing). In other words, a self-checking circuit determines if it contains a fault during the normal course of its operation. Self-checking logic is typically designed using coding techniques, where the basic idea is to design a circuit that, when fault free and presented with a valid input code word, will produce the correct output code word. If a fault exists, however, the circuit should produce an invalid output code word so that the fault can be detected.
A circuit is fault secure if any single fault within the circuit results in that circuit either producing the correct code word or producing a non-code word, for any valid input code word. In other words, if the circuit is fault secure, then the fault either has no effect on the output or the output is affected in such a way that it becomes an invalid code word.

A circuit is self-testing if there exists at least one valid input code word that will produce an invalid output code word when a single fault is present in the circuit.

A circuit is said to be totally self-checking (TSC) if it is both fault secure and self-testing. So, in a TSC circuit, all single faults are detectable by at least one valid code word input, and if a given input combination does not detect the fault, the output is a correct code word output.

Fig. 8: Basic structure of a TSC circuit.

The general structure of a TSC circuit is shown in Fig. 8. During normal operation, coded inputs are applied to the circuit and coded outputs are produced at the circuit's output. Note that, rather than have a single-bit output that provides a faulty or not faulty indication, the output consists of two bits that are: (1) complementary if the input to the checker is a valid code word and the checker is fault free, or (2) non-complementary if the input to the checker is not a valid code word or the checker contains a fault. An obvious reason for using two checker outputs is to overcome the problem of the checker output becoming stuck at either the logic 0 or the logic 1 value.

The most common TSC checker is the two-rail checker shown in Fig. 9. The two-rail checker is used to compare two words that would normally be complementary. If the words are complementary and the checker itself is fault free, the outputs of the checker should also be complementary. If the two words are not complementary or the checker contains a fault, the outputs of the checker should not be complementary. A simple design of a 2-bit TSC two-rail checker is shown in Fig. 9, where each of the two input words is two bits.
The first input word is $(x_0, x_1)$, and the second input word is $(y_0, y_1)$. Valid code words on the inputs will have $x_0 = \bar{y}_0$ and $x_1 = \bar{y}_1$. From the logic of the circuit, we see that

$$e_1 = x_0 y_1 + y_0 x_1$$
$$e_2 = x_0 x_1 + y_0 y_1 \tag{3}$$
Fig. 9: The basic block diagram of the two-rail checker, and a simple 2-bit TSC two-rail checker.

Provided the checker is fault free and the inputs are valid code words, the outputs of the TSC two-rail checker will reduce to

$$e_1 = x_0 \bar{x}_1 + \bar{x}_0 x_1 = x_0 \oplus x_1$$
$$e_2 = x_0 x_1 + \bar{x}_0 \bar{x}_1 = \overline{x_0 \oplus x_1} \tag{4}$$

and $e_1$ and $e_2$ are always complementary. Now, consider the cases where the checker is fault free, but the inputs are not complementary. In the first case, where $x_0 = y_0$ and $x_1 = \bar{y}_1$, the checker outputs become

$$e_1 = x_0 \bar{x}_1 + x_0 x_1$$
$$e_2 = x_0 x_1 + x_0 \bar{x}_1 \tag{5}$$

which are identical for all possible values of $x_0$ and $x_1$ (both reduce to $x_0$). In the second case, where $x_0 = \bar{y}_0$ and $x_1 = y_1$, the outputs of the checker become

$$e_1 = x_0 x_1 + \bar{x}_0 x_1$$
$$e_2 = x_0 x_1 + \bar{x}_0 x_1 \tag{6}$$

which are also identical for all possible values of $x_0$ and $x_1$. In the final case, where $x_0 = y_0$ and $x_1 = y_1$, the outputs of the checker become

$$e_1 = x_0 x_1 + x_0 x_1 = x_0 x_1$$
$$e_2 = x_0 x_1 + x_0 x_1 = x_0 x_1 \tag{7}$$

which are identical. We can also show that the TSC circuit is fault secure with respect to single stuck-line faults, and also satisfies the self-testing property. The proof is left to the reader.

Fig. 10: An 8-bit TSC checker using 2-bit TSC checkers as building blocks.

Finally, it is possible to create TSC two-rail checkers with a larger number of input bits using the circuit in Fig. 9 as the basic building block. Fig. 10 shows a hierarchical construction of an 8-bit TSC checker using 2-bit TSC checkers as building blocks. The notation $e_i^j$ is used to denote the $i$th error signal from the $j$th checker, and $e_1$ and $e_2$ denote the primary error-signal outputs. The four checkers in the first level of the hierarchy each compare 2 bits from the 8-bit operands and each produce two error signals. Checkers in the second and third levels of the hierarchy verify that the error signals from the checkers at the first level are indeed complementary.

A natural feature of the two-rail checker is the requirement that the two input operands be complements in the fault-free case. If we simply consider duplication with comparison, then the input from one of the modules must be inverted before the checking process.
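The case analysis for the fault-free two-rail checker can be confirmed by enumeration. The Python sketch below implements the equations for $e_1$ and $e_2$ given above and then composes 2-bit checkers into a tree. The tree wiring (each checker treats the two error-signal pairs from the level below as its two-rail inputs) is an assumption about how a hierarchy like the one in Fig. 10 is connected, and the function names are hypothetical.

```python
from itertools import product

def two_rail_checker(x0, x1, y0, y1):
    """2-bit two-rail checker: e1 = x0 y1 + y0 x1, e2 = x0 x1 + y0 y1."""
    e1 = (x0 & y1) | (y0 & x1)
    e2 = (x0 & x1) | (y0 & y1)
    return e1, e2

def check_words(x, y):
    """Compare two equal-length words (length a power of two) with a
    tree of 2-bit checkers and return the final (e1, e2) pair.

    Each (x_i, y_i) is treated as one two-rail signal; every checker
    merges two such signals into one, halving the count per level.
    """
    pairs = list(zip(x, y))
    while len(pairs) > 1:
        pairs = [two_rail_checker(a[0], b[0], a[1], b[1])
                 for a, b in zip(pairs[0::2], pairs[1::2])]
    return pairs[0]

def words_complementary(x, y):
    """True if y is the bitwise complement of x."""
    return all(a != b for a, b in zip(x, y))

# Valid code word: outputs are complementary.
print(check_words((1, 0, 1, 1, 0, 0, 1, 0), (0, 1, 0, 0, 1, 1, 0, 1)))
# One bit of the second word is wrong: outputs are equal.
print(check_words((1, 0, 1, 1, 0, 0, 1, 0), (0, 1, 0, 0, 1, 1, 0, 0)))
```

For an 8-bit word this arrangement uses four checkers at the first level, two at the second, and one at the third. The checks here cover only the fault-free checker; they say nothing about the fault-secure or self-testing properties, which require injecting faults into the checker itself.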