A Hybrid Fault-Tolerant Architecture for Highly Reliable Processing Cores


J Electron Test (2016) 32: DOI /s

I. Wali ¹ · Arnaud Virazel ¹ · A. Bosio ¹ · P. Girard ¹ · S. Pravossoudovitch ¹ · M. Sonza Reorda ²

Received: 18 July 2015 / Accepted: 24 February 2016 / Published online: 1 March 2016
© Springer Science+Business Media New York 2016

Responsible Editor: M. Abadir

* Arnaud Virazel, virazel@lirmm.fr
I. Wali, wali@lirmm.fr
A. Bosio, bosio@lirmm.fr
P. Girard, girard@lirmm.fr
S. Pravossoudovitch, pravo@lirmm.fr
M. Sonza Reorda, matteo.sonzareorda@polito.it

¹ Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, University of Montpellier / CNRS, 161 rue Ada, 34392 Montpellier Cedex 5, France
² Politecnico di Torino, Torino, Italy

Abstract

Increasing vulnerability of transistors and interconnects due to scaling is continuously challenging the reliability of future microprocessors. Lifetime reliability is gaining attention over performance as a design factor even for lower-end commodity applications. In this work we present a low-power hybrid fault-tolerant architecture for reliability improvement of pipelined microprocessors by protecting their combinational logic parts. The architecture can handle a broad spectrum of faults with little impact on performance by combining different types of redundancy. Moreover, it addresses the problem of error propagation in non-linear pipelines and error detection in pipeline stages with memory interfaces. Our case-study implementation of a fault-tolerant MIPS microprocessor highlights four main advantages of the proposed solution. It offers (i) 11.6 % power saving, (ii) improved transient error detection capability, (iii) lifetime reliability improvement, and (iv) more effective handling of fault accumulation effects, in comparison with TMR architectures. We also present a gate-level fault-injection framework that offers high fidelity in modeling physical defects and transient faults.

Keywords: Fault tolerance · Microprocessor · Single event transient · Permanent fault · Delay fault · Power consumption · High dependability · Fault injection

1 Introduction

Semiconductor devices are among the most reliable inventions when properly engineered and used with longevity in mind. However, the increasing demand for fast and feature-rich products has drastically changed the reliability landscape in recent years. Ever-shrinking power and cost budgets have pushed Complementary Metal-Oxide-Semiconductor (CMOS) scaling into the nanometer (sub-100 nm) regime [14]. The resulting increase in current and power densities due to the high level of integration, combined with the increasing fragility of new technology nodes [7], causes early device and interconnect wear-out. Moreover, a high integration density leads to a higher rate of manufacturing defects, and the growing chip complexity makes chips difficult to test comprehensively. Some of the test-escaped manufacturing defects are then encountered as errors during in-field operation, impairing system reliability. In addition, there are failures caused neither by wear-out nor by manufacturing defects, but by the increased susceptibility of Very Large Scale Integration (VLSI) CMOS nodes to high-energy particles from the atmosphere or from within the packaging [6]. Devices operating at reduced supply voltages are more prone to charge-related phenomena caused by high-energy particle strikes. They experience particle-induced voltage transients called Single Event Transients (SETs) or particle-induced bit-flips in memory elements known as Single Event Upsets (SEUs).

High-performance microprocessors, being at the forefront of technology, are becoming increasingly vulnerable to hard and soft errors. They use pipelining as a key technique to increase throughput, but the complexity of the interactions between pipeline stages, together with the memory interfaces present in some stages, makes error detection and confinement a major hurdle in designing high-performance reliable processing cores. In addition, it is estimated that the susceptibility of Combinational Logic (CL) networks to SETs nearly doubles as technology scales from 45 nm to 16 nm [10, 15, 19]. Hence the industry must turn its attention to techniques that limit soft error rates in the CL parts of modern microprocessors.

The means of improving the reliability of high-performance microprocessors built in nanometric technology nodes encompass techniques that tackle reliability issues at the level of technology, design and manufacturing. These techniques are absolutely necessary but almost inevitably imperfect. It is therefore essential to reduce the consequences of the "remaining" faults in modern microprocessors using fault-tolerance techniques.

Various solutions that use fault-tolerant techniques to improve microprocessor robustness can be found in the literature, but very few address tolerance to both transient and permanent faults.
These techniques generally rely on slow recovery mechanisms and are thus not suitable for highly interactive applications. For example, the method presented in [11] has little area overhead but runs Built-In Self-Test (BIST) at periodic time intervals to detect the presence of permanent faults and uses deep rollbacks that have a severe impact on performance. Fault-tolerant architectures like Razor [5], CPipe [16] and STEM [1] incorporate power-saving and performance-enhancement mechanisms like Dynamic Frequency Scaling (DFS) and Dynamic Voltage Scaling (DVS) to operate circuits beyond their worst-case limits. These architectures generally target timing errors and are not effective against permanent faults. For instance, Razor only deals with timing faults. CPipe duplicates CL blocks in the pipeline to detect and correct transient and timing errors; it can also detect permanent faults, but it offers no provision for their correction.

Pair-and-a-Spare (PaS) redundancy was first introduced in [8] as an approach that combines Duplication and Comparison with Standby Sparing. In this scheme each module copy is coupled with a Fault-Detection (FD) unit to detect hardware anomalies within the scope of the individual module. A comparator detects inequalities between the results of the two active modules. In case of inequality, a switch decides which of the two active modules is faulty by analyzing the reports from the FD units and replaces the faulty module with a spare one [4]. This scheme was intended to prevent hardware faults from causing system failures. It fundamentally lacks protection against transient faults and incurs a large hardware overhead to identify faulty modules. Little research attention was directed towards improving and using this technique for reliability improvement of microprocessors at finer granularity levels until 2010, when J. Yao et al. proposed a variant of PaS without fault-localization capability [22], named DARA-TMR. Their scheme uses a triplicated pipeline and comparators placed after each set of pipeline registers for detecting errors. The third, standby pipeline is turned on only when a large number of consecutive errors are detected, indicating the presence of a permanent fault. The scheme relies on lengthy reconfiguration procedures to recover from errors due to permanent faults.

Bit-wise Triple Modular Redundancy (TMR-b) is a classical approach that has long been used by designers to ensure high reliability, but the low-power demands of today's applications have made its use impractical in modern microprocessors. Besides, TMR-b systems suffer from corrupt outputs due to common-mode failures and failures that affect multiple modules. This problem was addressed in [12] by proposing a new Word-wise Triple Modular Redundancy (TMR-w) scheme. Although effective in dealing with common-mode and multiple failures, TMR-w lacks the capability to detect erroneous glitches because its voter has a control path that is longer than its data path. Additionally, neither TMR scheme offers lifetime reliability improvement, as all three logic copies wear out at the same rate.

To address the need for seamless reconfiguration in high-dependability systems, we previously proposed an architecture called Pipelined Hybrid Fault-Tolerant Architecture [20]. This architecture offers a high sensitivity to errors due to transient, permanent and timing faults and a swift recovery scheme that has little impact on circuit performance. However, that architecture is unable to deal with error propagation in non-linear pipelines and with error detection in stages with asynchronous memory interfaces. By resolving these issues, the Pipelined Hybrid Fault-Tolerant Architecture evolves into a new architecture called Hybrid Pair-and-a-Spare (HPaS), presented in this paper.
Specifically, this work provides the following contributions:

- A novel fault-tolerant scheme called HPaS Architecture that offers full protection of the CL parts of a non-linear pipeline microprocessor against transient, permanent and timing faults, and marginal protection of memory elements, with minimal performance degradation and power overhead.
- A comprehensive case study of its application to a MIPS microprocessor. The case study is augmented with experimental results on the area, power and performance overhead costs associated with the HPaS architecture and an extensive comparison with TMR schemes.
- A generic fault-injection framework and several sets of fault-injection results to compare the fault-tolerance capability of different architectures.

The remainder of this paper is organized as follows. In Section 2 we give a brief overview of the Pipelined Hybrid Fault-Tolerant Architecture with a discussion of its error detection and correction principles. In Section 3 we present the HPaS Architecture. Section 4 presents a case study of the application of the proposed architecture to a MIPS processor, some experimental results and a comparison with two TMR architectures. A fault-injection framework is presented in Section 5, with the results of fault-injection experiments carried out on the HPaS version of the considered microprocessor. Comparisons with related works are presented in Section 6. Finally, Section 7 concludes the paper and provides some perspectives.

2 Pipelined Hybrid Fault-Tolerant Architecture

Coarse-grained recovery schemes are feasible for lower-end commodity microprocessors, as they rely on techniques like data check-pointing [11] and software recovery routines [13], which can have a severe impact on performance when considered in the scope of high-dependability applications. Hence, as a starting point towards a stage-level fault-tolerant architecture for pipelined circuits, we proposed a Pipelined Hybrid Fault-Tolerant Architecture in [20].
This architecture employs information redundancy (as duplication and comparison) for error detection, timing redundancy (in the form of re-computation/rollback) for transient error correction and hardware redundancy (to support reconfiguration) for permanent error correction. As shown in Fig. 1, the Pipelined Hybrid Fault-Tolerant Architecture employs triplication of CL modules. A set of multiplexers and demultiplexers in each stage is used to select two running CL copies and to put the third CL copy in standby mode during normal operation. The Pipelined Hybrid Fault-Tolerant Architecture is driven by a control logic module, which is divided into distributed control logic blocks (one per stage) and a central control logic unit. The distributed part consists of a state machine that controls the configurations of the architecture through the reconfiguration multiplexers and demultiplexers. The central control unit manages the pipeline rollback and generates a global error signal.

2.1 Error Detection

A special comparator called the pseudo-dynamic comparator compares the outputs computed by the two running CL copies. It was proposed in [18] as a circuit-level solution to achieve higher glitch detection capability and to reduce the power consumption of traditional static comparators in duplication and comparison architectures. It can be seen in Fig. 1 that the comparator is placed across the pipeline register such that it compares the output of the pipeline register po_x, which is a synchronous input, with the output of the secondary running copy vout_x, which is an asynchronous input. This orientation gives the pseudo-dynamic comparator its unique capability to detect transient errors that would otherwise escape due to common-mode effects, and it also keeps the comparator out of the critical path. Thus, it does not impact the temporal performance of the circuit [17].

The comparison takes place only during brief intervals of time referred to as the comparison window, represented in Fig. 2 as regions with a dotted outline. The timing of the comparison window is defined by the high phase of a delayed clock signal dc. These brief comparisons keep the switching activity in the OR-tree of the comparator to a minimum, offering a 30 % power reduction compared with a static comparator [18].

The functioning of the pseudo-dynamic comparator requires specific timing constraints to be applied during synthesis of the CL blocks. In typical pipelined circuits the contamination delay of the CL should respect the hold time of the pipeline register latches. However, in the Pipelined Hybrid Fault-Tolerant Architecture, as the CL blocks also feed signals to the pseudo-dynamic comparator, the CL outputs need to remain stable during the comparison. Since the comparison takes place just after a clock edge, any short path in the CL can cause the input signals of the comparator to start changing before the comparison window has elapsed. Thus, the CL blocks have to be synthesized with a minimum delay constraint, which is formally defined by:

$$t_{cd} \geq t_{high} + \delta t_{delay} - t_{ccq} - t_{cs} \qquad (1)$$

Where:

- $t_{cd}$ is the contamination (minimum propagation) delay of the CL,
- $t_{high}$ is the time the clk stays high,
- $\delta t_{delay}$ is the amount of offset between clk and dc,
- $t_{ccq}$ is the clk-to-q delay of the flip-flop, and
- $t_{cs}$ is the propagation delay of the reconfiguration multiplexers and demultiplexers.
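The minimum delay constraint can be checked numerically. The sketch below reads Eq. (1) as t_cd ≥ t_high + δt_delay − t_ccq − t_cs and evaluates it for one set of delays. The function names and all numeric values are illustrative assumptions, not figures from the case study; integer picoseconds are used only to avoid floating-point rounding.

```python
def min_contamination_delay_ps(t_high, dt_delay, t_ccq, t_cs):
    """Smallest CL contamination delay allowed by Eq. (1), in picoseconds.

    The comparison window closes t_high + dt_delay after the clock edge,
    while a new value can reach the CL output no earlier than
    t_ccq + t_cs + t_cd after that edge.
    """
    return t_high + dt_delay - t_ccq - t_cs


def meets_min_delay(t_cd, t_high, dt_delay, t_ccq, t_cs):
    """True if a CL block with contamination delay t_cd satisfies Eq. (1)."""
    return t_cd >= min_contamination_delay_ps(t_high, dt_delay, t_ccq, t_cs)


# Hypothetical delays (not taken from the paper), in picoseconds.
T_HIGH, DT_DELAY, T_CCQ, T_CS = 5000, 500, 200, 300

bound = min_contamination_delay_ps(T_HIGH, DT_DELAY, T_CCQ, T_CS)
print(bound)                                                  # 5000
print(meets_min_delay(5200, T_HIGH, DT_DELAY, T_CCQ, T_CS))   # True
print(meets_min_delay(4800, T_HIGH, DT_DELAY, T_CCQ, T_CS))   # False
```

With these example numbers, any CL path faster than 5000 ps would have to be padded during synthesis so that the comparator inputs stay stable until the window closes.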

Fig. 1 Simplified view of the proposed Pipelined Hybrid Fault-Tolerant Architecture

2.2 Error Recovery

The error recovery scheme uses reconfigurations at stage-level granularity and single-cycle deep rollbacks. The shadow latches incorporated in the pipeline registers keep the state that the pipeline register flip-flops held one clock cycle earlier. The comparison takes place after every clock cycle. Thus, an error detection can invoke a reconfiguration and a rollback cycle, confining the error and preventing it from affecting the computation in the following cycles.

In Fig. 2 we explain the error detection and correction principle of the Pipelined Hybrid Fault-Tolerant Architecture shown in Fig. 1 with the help of three fault scenarios among many other possible cases. Figure 2(a) is a timing diagram of the system response to the occurrence of a permanent fault, (b) shows the case of a delay fault and (c) shows the response to an SET. It can be noticed that either vout_b or po_b can be affected by an error, depending on the fault location. In both cases the comparator detects the inequality and flags the error signal. In Fig. 2, this is represented by a comparison window marked with an inequality (≠) sign, which compares a correct result with a faulty one represented by an asterisk (*) sign. The error signal remains active for two cycles, until the system returns to normal operation after a reconfiguration and a re-computation cycle. It can also be noticed in Fig. 2 that the comparison window in the re-computation cycle (cycle # 3) is disabled. This avoids false error detection in the other stages due to the rollback, which would otherwise trigger an indefinite chain of error detection events.

Besides these three, many other fault scenarios are possible. For instance, in the case of a permanent or timing fault it is possible that the system is not able to isolate the fault by undergoing just one reconfiguration and one re-computation cycle.
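A minimal sketch of this situation is given below, under several assumptions that are not stated in the text: the three CL copies are named A, B and C, the active pair is rotated in the fixed order (A,B), (B,C), (C,A), and a permanently faulty copy causes a mismatch in every cycle in which it is active. Each detection is charged two cycles, one for reconfiguration and one for re-computation.

```python
# Assumed rotation order of the active pair; the real control logic may differ.
PAIRS = [("A", "B"), ("B", "C"), ("C", "A")]


def recovery_penalty(faulty_copy, start=0):
    """Return (penalty_cycles, final_pair) for one permanently faulty copy.

    The comparator cannot tell which active copy is faulty, so the pair is
    simply rotated after each detection. Every detection costs one
    reconfiguration cycle plus one re-computation cycle.
    """
    idx = start
    cycles = 0
    while faulty_copy in PAIRS[idx]:      # mismatch seen by the comparator
        idx = (idx + 1) % len(PAIRS)      # blind reconfiguration
        cycles += 2                       # reconfiguration + re-computation
    return cycles, PAIRS[idx]


for copy in ("A", "B", "C"):
    print(copy, recovery_penalty(copy))
# A (2, ('B', 'C'))   first reconfiguration removes the faulty copy
# B (4, ('C', 'A'))   faulty copy stays active once, second attempt succeeds
# C (0, ('A', 'B'))   fault sits in the standby copy, nothing to recover yet
```

Under these assumptions a fault in one of the two running copies is isolated after one or two reconfigurations, which matches the two-cycle and four-cycle penalties discussed in this section.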
Since it is not possible to determine which of the two working copies of the CL in a stage exhibits a permanent fault, the reconfiguration choice is made irrespective of that. There is a 50 % chance that the first reconfiguration eliminates the fault. If it does not, another error detection triggers a second pair of reconfiguration and re-computation cycles, which will definitely select two CL copies that provide good results. Thus the error recovery penalty in this case is four cycles instead of two.

With appropriate timing constraints applied, the Pipelined Hybrid Fault-Tolerant Architecture is capable of detecting and recovering from errors due to faults within the scope of individual stages. The work in [20] also proposes a methodology for improving the architecture so that it deals not only with errors due to faults occurring in the corresponding stage but also with errors generated by faults in CL blocks located in other stages and propagated through pipeline feedback and feed-forward signals. However, [20] does not show a practical implementation of this methodology for dealing with error propagation in non-linear pipelines. It also lacks the implementation details of managing error detection in pipeline stages with an asynchronous memory interface.

Fig. 2 Error detection and correction in case of (a) a permanent fault in B1, (b) a timing fault in B1, (c) a transient fault in B2

3 Hybrid Pair-And-A-Spare Architecture

In this section we present the HPaS Architecture, which gets its name from the fact that it works on the principle of triplicating CL modules, out of which two copies (referred to as the pair) perform computation in parallel, while the third one (referred to as the spare) stays in standby mode until an error is detected. The principle is the same as for the Pipelined Hybrid Fault-Tolerant Architecture, but HPaS offers better error detection and correction in non-linear pipelines and in pipelines with stages having asynchronous memory interfaces. The fault-tolerance capability improvements are mainly due to the following characteristics.

The formulation of a stage-level error detection and recovery scheme for a non-linear pipeline processor needs to consider the presence of feedback and feed-forward connections, because these paths have an impact on the error propagation behavior of the processor. Furthermore, asynchronous memory interfaces in pipeline stages make it difficult to detect errors. In this section we discuss the implications of these considerations for the design of the HPaS architecture. But first, in order to simplify this discussion, we give the following definitions, which classify the pipeline stages on the basis of the aforementioned characteristics:

- Bounded stages: The simplest of all classes, in which the stage consists of a CL block that has all its inputs and outputs bounded by a single pair of consecutive pipeline registers (see Fig. 3a).
- Loosely-Bounded stages: Pipeline stages containing CL blocks that get inputs not only from the preceding register but also from CL blocks in other stages (referred to as influenced) and/or that feed not only the following pipeline register but also CL blocks in other stages (referred to as influencing) (see Fig. 3b).
- Unbounded stages: Stages that provide an interface to an asynchronous memory (see Fig. 3c).

3.1 Dealing with Error Propagation in Nonlinear-Pipeline

Errors caused by faults in bounded and influenced stages remain confined within the stage itself during one clock period, because no feedback or feed-forward connections originate from these stages.
Thus a detection mechanism at the output of these stages is sufficient on its own to detect these errors. In the case of an influencing stage, however, the fault effects may or may not remain confined to that stage during one clock cycle, because of the asynchronous feedback and feed-forward paths to other influenced stages. To detect such errors, which are due to faults in an influencing stage and have manifested themselves at the output of an influenced stage, the HPaS Architecture duplicates the feedback and feed-forward connections. Otherwise, if there is only a single set of feedback and feed-forward signals, an error propagating through them will affect both running copies of the CL block in the influenced stage in the same way and may result in a common-mode failure. In TMR the error propagation dynamics of a non-linear pipeline are handled by triplicating the feedback and feed-forward connections. TMR therefore needs three sets of these interconnects instead of the two used by HPaS, so HPaS uses one third less feedback and feed-forward interconnect area than TMR.

The stage-level reconfiguration framework for recovering from permanent faults presented here is also based on the classification of CL discussed in Section 3. Let us assume that an error is detected at the output of an influenced stage. This error can be the result of a fault occurring in the stage itself, or it may have propagated to the stage through the pipeline feedback or feed-forward paths from any of the stages that influence it. In such a case, the HPaS architecture cannot identify which stage is actually faulty, thus all the suspected stages (which include the influenced stage and all stages that influence it) undergo a reconfiguration.

Fig. 3 CL classification: (a) Bounded CL, (b) Loosely Bounded CL, (c) Unbounded CL

3.2 Error Detection in Pipeline Stages with Asynchronous Memory Interface

Memories generally take a large amount of silicon area, which makes it less practical to duplicate or triplicate them in order to improve system reliability. On the other hand, their regular structure makes error detection and correction codes like Parity and Hamming highly feasible. These well-known methods have long been used to protect memories efficiently and effectively against SEUs and permanent faults. But that is not enough to ensure the reliability of the overall system if an unprotected CL drives the memory inputs and/or if an unprotected CL processes the memory outputs before they are fed to the next pipeline register. In such cases, data anomalies during write operations are especially difficult to detect, because during write operations there is no data at the output to compare.

Dual-port memories have previously been used as a means of communication between copies of a redundant CPU [2]. In our architecture we propose a similar use of dual-port memory. In addition, the memories are described in VHDL to have write transparency. This property makes the memory latch transparent during write operations, which means that if a memory location is being updated as a result of a write operation, the data being written into the memory location appears on the data output bus as well [2]. The use of a dual-port write-transparent memory allows any error occurring in one of the two working copies of the CL driving the memory to propagate to the following pipeline register, where it can be detected by the comparator after the next clock edge.
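The effect of write transparency on error detection can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class and method names are hypothetical, only the data path is modeled, each port is driven by one running CL copy, and the value actually stored is taken from port A, since the text does not say how the two writes are arbitrated.

```python
class WriteTransparentDualPortMemory:
    """Behavioral model of a dual-port memory with write transparency."""

    def __init__(self, depth):
        self.cells = [0] * depth

    def cycle(self, port_a, port_b):
        """Apply one access per port; each port is (address, write_enable, data).

        Returns the data outputs of both ports. During a write, the data
        being written also appears on that port's output bus.
        """
        outputs = []
        for addr, write_enable, data in (port_a, port_b):
            if write_enable:
                outputs.append(data)              # write transparency
            else:
                outputs.append(self.cells[addr])  # ordinary read
        addr_a, write_a, data_a = port_a
        if write_a:
            self.cells[addr_a] = data_a           # assumption: port A commits
        return tuple(outputs)


mem = WriteTransparentDualPortMemory(depth=16)

# Fault-free write: both CL copies drive the same data, outputs agree.
out_a, out_b = mem.cycle((3, True, 0xAB), (3, True, 0xAB))
print(out_a == out_b)   # True, the comparator sees no error

# One CL copy is faulty and drives corrupted write data on port B.
out_a, out_b = mem.cycle((4, True, 0x10), (4, True, 0x11))
print(out_a == out_b)   # False, the mismatch reaches the comparator
```

Without write transparency both outputs would carry no meaningful data during the second access, and the corrupted write would go unnoticed until the location is read back.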
Read operations are by nature transparent, so there is no need to give them special consideration. An example of the use of a dual-port write-transparent memory can be seen in Fig. 4, in the Memory stage of the microprocessor.

4 Case Study Application

The proposed HPaS architecture was implemented on a MIPS microprocessor. The objective was to assess it with respect to area, power and performance overheads and its fault-tolerance capability, and to compare it with a classical TMR scheme with a bit-wise voter (TMR-b) and a TMR scheme that uses a word-wise voter (TMR-w). Area estimations were obtained for four different versions of the microprocessor, namely Baseline (BL), HPaS, TMR-b and TMR-w, by synthesizing them with the Nangate 45 nm open cell library. A simple workload program that uses the add-shift method to multiply two operands was used to obtain the switching activity of each microprocessor version, and estimates of average power consumption were obtained from this activity. We devised a fault-injection framework to comprehensively evaluate the fault-tolerance capability of HPaS against permanent faults, timing faults and SETs and compared it with TMR. The four versions of the microprocessor mentioned above are:

- Baseline Microprocessor: The microprocessor used as our case-study platform is a 5-stage MIPS processor with 32-bit parallelism, capable of operating at 100 MHz. It incorporates hazard-detection and data-forwarding mechanisms. The microprocessor was developed as an academic learning exercise without considering any fault-tolerance capability.
- HPaS Microprocessor: Starting with the BL microprocessor, the HPaS microprocessor was realized by triplicating CL blocks, duplicating feedback and feed-forward signals, inserting reconfiguration switches, modifying pipeline registers to incorporate rollback capability and adding pseudo-dynamic comparators and HPaS control logic blocks. An architectural overview of the resulting structure is shown in Fig. 4. Clouds represent the CL between pipeline registers, namely IF (Instruction Fetch), ID (Instruction Decoder), EXE (EXEcution unit), MEM (data MEMory management) and WB (Write-Back signals). Three different memories are embedded in the microprocessor structure: the instruction memory, the register file and the data memory. It can be seen that the data memory has two ports and, thanks to its write transparency, any fault in the MEM stage CL block can be detected by the following comparator. It can also be noticed in Fig. 4 that an additional set of pipeline register and comparator is placed after the WB stage. These additional components detect errors on the write-back signals feeding the synchronous register-file memory.
- TMR Microprocessor: Two versions of the TMR microprocessor were implemented with a similar architecture. The only difference is the type of voter they use. One (TMR-b) has classical bit-wise voters and the second one (TMR-w) uses word-wise voters [18]. The feedback and feed-forward connections were also triplicated to deal with the error propagation dynamics of the non-linear pipeline. Voters were placed after each set of CL copies, such that the CL outputs are voted to select one that has at least one common equivalent, to be fed to the following pipeline register.

Fig. 4 Hybrid Pair-and-A-Spare Microprocessor

4.1 Area Overhead Results

A summary of the cost in terms of area for each microprocessor version is shown in Fig. 5.
In each bar, differently colored regions represent the area occupied by the individual components of the microprocessor. The total area in μm² is reported on top of each bar. The area overhead related to the implementation of the HPaS microprocessor is due to:

- Shadow Pipeline-Registers (for rollback capability)
- Triplication of CL blocks
- Secondary port and Write-transparency of the Data Memory
- HPaS Control Logic
- Reconfiguration Switches
- Pseudo-dynamic Comparators

Fig. 5 Area overhead results

The overhead is 103 % with respect to the baseline microprocessor. The resulting area is less than three times that of the baseline because only the CL parts are triplicated, in contrast to the basic approach [4], which triplicates the entire microprocessor structure (including the memory). The TMR microprocessor versions incur less area overhead (87 % and 90 % for TMR-b and TMR-w, respectively). In comparison with HPaS this reduction is mainly due to the absence of reconfiguration multiplexers and demultiplexers and of rollback shadow latches.

4.2 Power Overhead Results

Figure 6 shows the estimates of average power consumption for the different versions of the MIPS microprocessor. These estimates were obtained by simulating the microprocessors running a simple workload program that uses a shift-add method to multiply two operands. In each bar, differently shaded regions represent the power consumption share of the different components of the microprocessor. The total average power in mW dissipated by each microprocessor version is reported on top of each bar. It can be seen that the power dissipated by the CL in each stage (labeled as factors of b, c, d, e and f) in the HPaS microprocessor is slightly higher than twice that in the Baseline. This excess is the static power dissipated in the standby CL block and can be reduced by power-gating techniques. In comparison, the CL in both TMR microprocessors consumes three times the power of the baseline microprocessor CL blocks. The pseudo-dynamic comparators (labeled as a factor of j) consume around 31 % less power than the voters (labeled as a factor of i) in the TMR schemes. The power consumption overhead of the HPaS microprocessor is 75 % with respect to BL. This power is 11.6 % and 11.8 % less than that consumed by the TMR-b and TMR-w versions, respectively.
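As a rough consistency check, the two relative figures quoted above (75 % overhead over BL, and 11.6 % and 11.8 % savings with respect to the TMR versions) can be combined to estimate the TMR power relative to the baseline. The numbers printed below are derived from those percentages only; they are not values reported in the text, and the function name is an illustrative choice.

```python
def implied_reference_power(relative_power, saving):
    """Power of the reference design, given a design that uses
    `relative_power` and saves the fraction `saving` with respect to it."""
    return relative_power / (1.0 - saving)


# HPaS power normalized to the baseline: 75 % overhead.
hpas = 1.0 + 0.75

# Implied TMR power, still normalized to the baseline.
tmr_b = implied_reference_power(hpas, 0.116)
tmr_w = implied_reference_power(hpas, 0.118)

print(f"TMR-b: {tmr_b:.2f} x baseline")   # about 1.98
print(f"TMR-w: {tmr_w:.2f} x baseline")   # about 1.98
```

Under this reading both TMR versions dissipate roughly twice the baseline power, that is an overhead close to 98 %, compared with 75 % for HPaS.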
4.3 Performance Degradation

There are two aspects in which the performance overhead of the HPaS architecture can be assessed: (i) temporal performance degradation and (ii) error recovery overhead (discussed in Section 5(d)). The additional components inserted in the baseline data path to implement the HPaS architecture, which include the reconfiguration multiplexers and demultiplexers and a level of multiplexers in the pipeline registers for rollback capability, account for the temporal performance degradation. Conversely, in the TMR schemes the voter circuit in the data path is responsible for reducing circuit speed. Static timing analysis showed that the temporal performance degradation of the HPaS microprocessor is 3.8 %. This figure is obtained by comparing the critical path delays of the CL blocks with the maximum delay of the aforementioned additional HPaS circuit components. The temporal performance degradation of TMR-b and TMR-w was obtained in the same way and found to be 0.9 % and 8.5 %, respectively. These figures show that the HPaS architecture incurs less temporal performance degradation than TMR-w but is more costly than TMR-b.

Fig. 6 Power overhead results

4.4 Lifetime Reliability Improvement

When a circuit enters the wear-out phase of its lifetime, most wear-out mechanisms show early symptoms in the form of increasing signal propagation latency before inducing permanent device failures [3]. The ability of the HPaS architecture to detect these early symptoms and act upon them by triggering reconfigurations reduces the aging effects on the system by distributing the stress over two of the three CL copies. The capability of selective sparing helps reduce the failure rate and increase the life span of circuit parts that embed such a fault-tolerant architecture.

5 Fault-Tolerance Capability Assessment

To assess and compare the fault-tolerance capability of the HPaS architecture, we performed simulation-based gate-level fault-injection experiments on each of the four versions of the microprocessor (mentioned in Section 4) using our ad-hoc, fully automated fault-injection framework, whose flow is shown in Fig. 7. Gate-level simulation provides a suitable paradigm for fault-injection experiments because, unlike micro-architectural-level simulation, it offers high fidelity in modeling most physical defects and transient faults, and it is much faster than transistor-level simulation. As shown in Fig. 7, the fault-injection flow is partitioned into two parts: (i) Fault List Extraction and (ii) Fault-injection. In the former part a parsing script generates a fault site list using the SDF (Standard Delay Format) and Activity files. Then another script randomly selects (with some possible restrictions) fault sites from the entire set and assigns randomly generated (with some possible restrictions) fault-injection times and/or durations to each fault site. In the Fault-injection part of the flow, the Fault Injection Campaign script repeatedly runs gate-level simulations and injects faults according to the Fault list. It uses another script, labeled SDF mutant, that is responsible for injecting timing faults. Finally, the Fault Injection Campaign generates a Fault Injection report that is analyzed to assess the fault-tolerance capability of the design. More details of the fault-injection framework are provided in the following sub-section.

5.1 Fault-Injection Framework

We devised a fully automated fault-injection framework that is capable of injecting faults that closely represent real faults and have a realistic distribution. It uses simulator commands to alter signal values during simulation to mimic the occurrence of permanent and transient faults, as can be seen in Fig. 8a, b.
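The random selection step of the Fault List Extraction part can be sketched as follows. This is an illustration only, not the authors' scripts: the function name, the record layout, the site names and the injection window are hypothetical, and the list of sites is assumed to have been produced already by the parsing script. The default pulse-width range follows the transient fault model given later in this section.

```python
import random


def build_fault_list(sites, n_faults, window_ns, width_range_ns=(0.25, 1.25), seed=0):
    """Draw random fault records from a list of candidate fault sites.

    Every record gets a random site and a random injection time inside
    `window_ns`; transient faults additionally get a random pulse width.
    """
    rng = random.Random(seed)             # fixed seed for reproducible campaigns
    t_start, t_end = window_ns
    w_min, w_max = width_range_ns
    faults = []
    for _ in range(n_faults):
        kind = rng.choice(("permanent", "transient"))
        fault = {
            "kind": kind,
            "site": rng.choice(sites),
            "time_ns": rng.uniform(t_start, t_end),
        }
        if kind == "transient":
            fault["width_ns"] = rng.uniform(w_min, w_max)
        faults.append(fault)
    return faults


# Hypothetical net names and injection window, for illustration only.
SITES = ["exe/u12/Z", "id/u3/ZN", "mem/u40/Z"]
fault_list = build_fault_list(SITES, n_faults=5, window_ns=(100.0, 900.0))
for fault in fault_list:
    print(fault)
```

Restricting the window to exclude initialization and the very end of the workload, as the fault models below require, only changes the `window_ns` argument.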
For delay fault injection it uses the SDF file mutation technique [9]. As shown in Fig. 8c, before each fault-injection simulation the delay of a randomly selected net is increased by a random amount of time in the SDF file, and this mutated SDF file is used for simulation. The transient and permanent fault sites are obtained by parsing the activity file, and the delay fault site list is extracted from the SDF file, as shown in Fig. 8. Four identical sets of 30,000 faults, one per microprocessor version, distributed randomly in space and time and comprising 10,000 faults of each type (transient, permanent and delay), were generated and used. Based on the possibilities offered by gate-level simulation to inject different types of faults, we model them as follows.

Permanent Fault Model: We used the standard stuck-at fault model to represent permanent defects. However, instead of arbitrarily assigning stuck-at-0 or stuck-at-1 to circuit nodes, we based this decision on the current logic state of the node during simulation, i.e. if the node was at logic level 0 at the time of fault injection, a stuck-at-1 was forced, and vice versa. This modification causes a large number of faults to manifest themselves as errors. The fault-injection time was randomly selected within a window that ensures that no fault is injected during circuit initialization or too close to the end of the workload program. The fault locations were randomly selected within the CL area of the circuit. Hence, a permanent fault is defined by two parameters, fault location l and fault-injection time t, as shown in Fig. 8a.

Timing Fault Model: To model timing faults at gate level we injected an additional, randomly generated delay Δt between 2 ns and 6 ns (1/5 to 3/5 of the clock period) on a randomly chosen net l within the CL area, by modifying the delay of the corresponding wire in the SDF file. This additional delay causes the driven logic cone to violate timing and closely represents a resistive open defect.
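The SDF mutation used for timing faults can be illustrated with a short sketch. It assumes that wire delays appear as INTERCONNECT entries with complete min:typ:max triples for rise and fall, and that the SDF timescale is 1 ns; `mutate_sdf` and the instance names are hypothetical, and the authors' "SDF mutant" script may work differently.

```python
import re

# One INTERCONNECT entry with rise and fall delay triples, for example:
# (INTERCONNECT u1/Y u2/A (0.100:0.100:0.100) (0.120:0.120:0.120))
_INTERCONNECT = re.compile(
    r"\(INTERCONNECT\s+(?P<src>\S+)\s+(?P<dst>\S+)\s+"
    r"\((?P<rise>[^()]*)\)\s+\((?P<fall>[^()]*)\)\s*\)"
)


def _add_delay(triple: str, extra: float) -> str:
    """Add `extra` to each value of a min:typ:max triple (all three present)."""
    return ":".join(f"{float(value) + extra:.3f}" for value in triple.split(":"))


def mutate_sdf(sdf_text: str, src: str, dst: str, extra_delay: float) -> str:
    """Return a copy of the SDF text with one wire slowed down by extra_delay."""
    hits = 0

    def replace(match: re.Match) -> str:
        nonlocal hits
        if match.group("src") != src or match.group("dst") != dst:
            return match.group(0)  # leave every other wire untouched
        hits += 1
        rise = _add_delay(match.group("rise"), extra_delay)
        fall = _add_delay(match.group("fall"), extra_delay)
        return f"(INTERCONNECT {src} {dst} ({rise}) ({fall}))"

    mutated = _INTERCONNECT.sub(replace, sdf_text)
    if hits == 0:
        raise ValueError(f"no INTERCONNECT entry found for {src} -> {dst}")
    return mutated


if __name__ == "__main__":
    original = "(INTERCONNECT u1/Y u2/A (0.100:0.100:0.100) (0.120:0.120:0.120))"
    print(mutate_sdf(original, "u1/Y", "u2/A", 2.0))  # 2 ns additional delay
```

Because only the annotated delay changes, the netlist and testbench stay identical across runs, and each delay fault costs just one re-annotation before simulation.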
As shown in Fig. 8c, a timing fault is defined by the random additional delay Δt and the target net l.

Transient Fault Model: We also evaluated the proposed architecture against SETs. They are modeled as digital pulses using three parameters: fault location l, fault-injection time t and duration d, which represents the SET pulse width, as shown in Fig. 8b. Particle-induced SET pulse widths vary depending on factors such as the type of radiation, the struck node capacitance and the process technology [21]. We randomly selected pulse widths from the range between 0.25 ns and 1.25 ns, which reflects the typically anticipated SET pulse widths in 45 nm technology. The location l and time t were selected in the same way as for permanent faults. The polarity of the injected pulse was chosen to be opposite to the signal state of l at the time of fault injection.

5.2 Fault Effect Classification

An analysis of the fault-injection report allows us to classify the injected faults into six categories based on their outcomes.

Silent Faults: Faults that had no effect on the execution of the workload program are classified as Silent. The program terminates normally with no error detection, the result is correct, and the contents of the pipeline registers, register file and data memory are the same as those of the golden run.

Latent Faults: Faults are classified as Latent if the program terminates normally and the result is correct, but the contents of the pipeline registers, register file or data memory are not the same as those of the golden run. These stored errors can affect the computation at any later moment

when the program makes use of data at the corrupt memory locations for computation. These fault effects are considered critical because they may result in erroneous computation without detection.

Fail-silent Faults: The program terminates normally with no error detection and the computed result is wrong. These faults are the most critical because the computed results are wrong without any error indication.

Corrected Faults: The program terminates normally with at least one error detection and correction, and the execution result and the contents of the pipeline registers and register file are the same as those of the golden run.

Detected Faults: Faults that result in an error that is detected by the fault-tolerant architecture but cannot be corrected are classified as Detected. These cases are fail-safe in nature because the system at least indicates the presence of an error. This category of outcomes is encountered when multiple modules fail at the same time and, due to the lack of additional redundant resources, the system fails to provide a correct result.

Unclassifiable: Some injected faults result in setup or hold violations and cause the unknown logic value X to propagate. These X-propagations are due to a limitation of gate-level simulation; in real circuits, setup and hold violations may cause a faulty value to be stored in memory elements, and this anomaly can be detected by the detection mechanism. Thus, in an actual silicon test case these faults would fall into either the Corrected or the Silent category, both of which are non-critical from the robustness point of view. Since gate-level simulation does not allow us to distinguish between them, we put them into the Unclassifiable category.

Fig. 7 Fault-injection flow

Among these six categories of faults we consider Latent and Fail-silent faults to be critical in our analysis of fault-tolerance capability. These critical faults are the ones that escape detection and lead to a failure.
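The six categories above can be expressed as a small decision procedure over the observations collected from one fault-injection run. The sketch below is an illustration only: the `RunResult` fields, the order of the checks and the handling of corner cases (for example, a wrong result after a reported correction is counted as fail-silent) are assumptions, not the logic of the authors' report analysis.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Outcome(Enum):
    SILENT = "silent"
    LATENT = "latent"
    FAIL_SILENT = "fail-silent"
    CORRECTED = "corrected"
    DETECTED = "detected"
    UNCLASSIFIABLE = "unclassifiable"


# Outcomes treated as critical in the analysis: they escape detection.
CRITICAL = {Outcome.LATENT, Outcome.FAIL_SILENT}


@dataclass
class RunResult:
    """Observations from one fault-injection run (hypothetical field names)."""
    x_propagated: bool          # unknown value X reached a stored or observed value
    error_detected: bool        # the architecture raised at least one error
    error_corrected: bool       # every detected error was recovered
    result_correct: bool        # program result matches the golden run
    state_matches_golden: bool  # registers and memory match the golden run


def classify(run: RunResult) -> Outcome:
    if run.x_propagated:
        return Outcome.UNCLASSIFIABLE
    if run.error_detected and not run.error_corrected:
        return Outcome.DETECTED      # fail-safe: error flagged but not recovered
    if not run.result_correct:
        return Outcome.FAIL_SILENT   # wrong result with no standing error flag
    if not run.state_matches_golden:
        return Outcome.LATENT        # correct result, corrupted stored state
    if run.error_detected:
        return Outcome.CORRECTED     # detected and fully recovered
    return Outcome.SILENT


def critical_ratio(outcomes: Iterable[Outcome]) -> float:
    """Share of injected faults whose outcome is critical (0.0 for an empty list)."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(outcome in CRITICAL for outcome in outcomes) / len(outcomes)
```

Checking the Unclassifiable condition first mirrors the text: once X values propagate, the remaining observations of that run cannot be trusted.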
The ratio of the number of these critical faults to the total number of injected faults gives a figure of merit for comparing the fault-tolerance capability of different architectures.

5.3 Fault-Injection Results

Figures 9a, b and c show the permanent, delay and transient fault-injection results, respectively, for the four versions of the considered microprocessor. The bars represent, on a logarithmic scale, the number of faults that fell into each of the fault categories discussed in Section 5.2. On top of each bar, the percentage share of the corresponding fault class with respect to the total of 10,000 injected faults is shown.

The permanent fault-injection results in Fig. 9a show that almost 80 % of the faults injected in the BL version resulted in critical errors (Fail-silent and Latent). In comparison, TMR-w offers a reliability improvement by a factor of 15.2 (the ratio of the number of critical faults in the baseline to that in TMR-w), with 5.2 % of faults still falling into the critical error categories. Conversely, in the case of HPaS and TMR-b none of the faults resulted in a critical error. The HPaS architecture corrected almost 53 % of permanent

faults. However, because of the absence of an error flag in the TMR voter, it was impossible to distinguish between the Silent and Corrected fault categories; both types of faults are therefore grouped together in the Silent Faults category for TMR-b and TMR-w.

A similar fault distribution trend was obtained for the delay and transient fault-injection experiments, as shown in Figs. 9b, c. The reliability improvement factor of TMR-w with respect to the baseline was found to be 6.3 and 9.0 for transient and delay faults, respectively. None of the transient and delay faults resulted in critical errors for HPaS and TMR-b.

In all three types of fault injection we observe some common trends. First, TMR-w does not offer full protection. A static timing analysis revealed that faults that resulted in transitions at the output of a CL block very close to the capture edge escaped the word-wise voter. The word-wise voter reduces the probability of corrupted outputs in the case of common-mode and multiple-module failures [18]. However, it was found to be less effective at mitigating data inconsistencies that appear very close to the capture edge. The reason for this insensitivity is that, unlike in the bit-wise voter, the control path inside the word-wise voter is longer than the data path.

Fig. 8 Fault-injection schemes for a Permanent Fault, b Transient Fault, c Delay Fault

Fig. 9 Fault-injection outcome a Permanent fault injection, b Timing fault injection, c Transient fault injection
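The reliability improvement factor quoted in this sub-section is a plain ratio of critical fault counts. The sketch below shows the arithmetic; the function name is hypothetical and the example counts are illustrative values chosen to be consistent with the quoted percentages (almost 80 % and 5.2 % of 10,000 faults), not numbers taken from the paper.

```python
def improvement_factor(critical_baseline: int, critical_protected: int) -> float:
    """Ratio of critical faults in the baseline to those in a protected version."""
    if critical_protected == 0:
        # No critical fault observed in the protected version: the ratio is unbounded.
        return float("inf")
    return critical_baseline / critical_protected


if __name__ == "__main__":
    # Illustrative counts only: 7,900 critical faults in the baseline against
    # 520 in a protected version, each out of 10,000 injected faults.
    print(round(improvement_factor(7900, 520), 1))
```

For HPaS and TMR-b, where no critical fault was observed, the ratio is unbounded, which is presumably why the text reports a factor only for TMR-w.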

Another observation is that there are no Unclassifiable faults in the case of TMR-b. As mentioned in Section 5.2, these unclassifiable faults are linked to X-propagation in gate-level simulation. A voter structure filters these X-propagations, which is not true for an XOR gate, the basic building block of the comparator used in the HPaS architecture.

5.4 Error Recovery Overhead

Our single fault-injection experiments show that all the errors due to injected SETs, 51.8 % of the errors due to permanent faults and 77.1 % of the errors due to delay faults were recovered in a time equivalent to 2 clock cycles. The remaining errors due to permanent and delay faults were mitigated with a penalty of 4 cycles.

To measure comprehensively the impact of different soft error rates on performance, we performed multiple transient fault-injection experiments. The experiments involved subjecting the HPaS microprocessor to 300 sets, each composed of 2, 5 or 10 contemporaneous transient faults, and measuring the encountered SEU rate and the corresponding time taken to recover from the errors. The normalized results showed that the HPaS microprocessor with 1000 SEUs per second has a performance overhead of only % for one second of operation at 100 MHz, and that the overhead increases proportionally with the soft error rate. TMR-b and TMR-w, on the other hand, have zero error recovery penalty, as TMR is an error-masking scheme rather than an error detection and correction architecture. This shows that, although not zero as in the cases of TMR-b and TMR-w, the performance impact of error recovery on the HPaS microprocessor is quite low.

5.5 Fault Accumulation Effect

Fault accumulation handling is an aspect of fault-tolerance capability that has strong implications for the lifetime reliability of circuits and cannot be inferred from the single fault-injection experiments.
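A back-of-the-envelope estimate shows how recovery penalties translate into a performance overhead. The sketch assumes that every error costs a fixed number of cycles and that recoveries never overlap; `recovery_overhead_percent` is a hypothetical helper, and its output is an estimate under these assumptions rather than the measured figure reported by the authors.

```python
def recovery_overhead_percent(errors_per_second: float,
                              penalty_cycles: int,
                              clock_hz: float) -> float:
    """Share of clock cycles spent on error recovery, in percent."""
    recovery_cycles_per_second = errors_per_second * penalty_cycles
    return 100.0 * recovery_cycles_per_second / clock_hz


if __name__ == "__main__":
    # 1000 errors per second, 2-cycle recovery penalty, 100 MHz clock.
    print(recovery_overhead_percent(1000, 2, 100e6))
```

Under this model the overhead grows linearly with the error rate, which agrees with the proportional trend described in the text.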
Therefore, in this sub-section we present the results of multiple-fault injection experiments performed by injecting 300 sets, each composed of 2, 5 or 10 contemporaneous permanent faults, in each of the four microprocessor versions. Figure 10 summarizes the results. The height of each colored region in the bars represents the percentage of injected faults that fell into the corresponding category (defined in Section 5.2), with the number of faults in each category labeled on the corresponding region. The line plot in each bar chart shows the trend of the percentage of faults that produced a critical outcome (i.e. Fail-silent and Latent), represented by the red and orange regions.

TMR-b is an error-masking architecture that does not indicate the presence of errors; it simply corrects them until more than one faulty CL copy manifests its fault in the same way on at least one output. When faults accumulate due to wear-out and multiple copies start being affected, TMR-b fails to correct them, and the lack of any provision for indicating errors leads to fail-silent outcomes. Therefore, as the number of contemporaneous permanent faults increases, the share of faults that result in critical errors also increases. TMR-w, however, can detect errors even if multiple CL copies manifest fault effects in the same way on more than one output. Thus TMR-w handles fault accumulation better than TMR-b, as shown by the decreasing number of faults that result in critical outcomes as the number of contemporaneous permanent faults increases. Despite that, TMR-w still ends up with a few critical errors, which can be attributed to the fact that the TMR-w voter has a control path that is longer than its data path, and this causes some erroneous transitions to escape detection.
The HPaS microprocessor, in contrast, benefits from its high sensitivity to erroneous transients and from its ability to detect errors even if more than one copy manifests the effect of a fault in the same way at the CL outputs. In all other possible scenarios, if HPaS cannot correct an error, it can at least indicate its presence and continue fail-safe operation. It can be seen in Fig. 10 that the HPaS microprocessor did not produce any Fail-silent or Latent outcome, which shows its effectiveness in dealing with the fault accumulation effect.

Fig. 10 Multiple fault injection results (panels: 2, 5 and 10 permanent faults; horizontal axis: microprocessor version, i.e. Baseline, HPaS, TMR-b, TMR-w; vertical axis: percentage; categories: Fail_Silent, Latent, Silent, Corrected, Detected)
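The difference between bit-wise and word-wise voting under accumulated faults can be illustrated with a behavioral sketch. It models only the voting functions on integer words, following the general word-voter idea in which an error is signaled when no two copies agree; it does not model timing, and how the voters of the evaluated TMR-b and TMR-w designs are implemented is not specified here. All names are hypothetical.

```python
from typing import Optional, Tuple


def bitwise_vote(a: int, b: int, c: int) -> int:
    """Majority of three words, decided independently for every bit position."""
    return (a & b) | (a & c) | (b & c)


def wordwise_vote(a: int, b: int, c: int) -> Tuple[Optional[int], bool]:
    """Return (word, error): a word on which at least two copies fully agree,
    or (None, True) when all three copies differ."""
    if a == b or a == c:
        return a, False
    if b == c:
        return b, False
    return None, True


if __name__ == "__main__":
    golden = 0b1010
    copy_a = golden ^ 0b0001  # faulty copy: bit 0 flipped
    copy_b = golden ^ 0b0011  # faulty copy: bits 0 and 1 flipped
    copy_c = golden           # fault-free copy

    masked = bitwise_vote(copy_a, copy_b, copy_c)
    word, error = wordwise_vote(copy_a, copy_b, copy_c)
    print(f"bit-wise output {masked:04b}, golden {golden:04b}")
    print(f"word-wise error flag: {error}")
```

In this example two faulty copies agree on a wrong value for bit 0, so the bit-wise majority silently delivers a wrong word, while the word-wise comparison finds no two matching copies and raises an error.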

6 Comparison with Related Work

Several fault-tolerant architectures have been proposed in the past to address circuit reliability concerns. A few of the relevant solutions, including TMR-b, TMR-w, DARA-TMR [22, 23], PaS [8], CPipe [16], STEM [1] and Razor [5], were briefly discussed in Section 1. In this section we compare the merits of these architectures with those of the proposed HPaS architecture. Table 1 summarizes the comparison. Columns 1, 2 and 3 specify the types of fault that these architectures are able to detect and correct. Columns 4 and 5 identify which of these architectures have, or can possibly incorporate, the power conservation and performance improvement features of DVS and DFS. The type of redundancy used by these architectures is given in column 6. Columns 7, 8 and 9 give figures for the area, power and error recovery overheads, respectively, associated with some of these architectures. Finally, column 10 indicates which of these schemes also improve the lifetime reliability of the circuit.

These architectures can be broadly classified into two categories: (i) Full Protection, i.e. those that provide protection against transient, permanent and timing faults, and (ii) Partial Protection, i.e. those that can handle a subset of these fault types. It can be seen that TMR-b, TMR-w, DARA-TMR and HPaS fall into the first category. Razor, on the other hand, can only detect and correct timing errors, and the STEM and CPipe architectures can deal with timing and transient faults; they are therefore considered partial protection solutions. However, some of these partial protection techniques have features such as DVS and DFS, as can be seen in columns 4 and 5, which is not the case for the full protection schemes considered here.
The HPaS architecture, on the other hand, becomes an ideal candidate for the application of these performance and power optimization techniques thanks to its unique error detection and micro-rollback capability.

If we now focus on the overheads associated with the architectures that offer protection against all three types of faults, we can observe that HPaS incurs slightly more area overhead than TMR-b and TMR-w but certainly less than DARA-TMR, because DARA-TMR triplicates both the CL and the Sequential Logic (SL), whereas HPaS triplicates the CL but only duplicates the SL. In addition, DARA-TMR uses three times more comparators than HPaS. It can be observed in Table 1 that HPaS saves a significant amount of power in comparison with TMR-b and TMR-w. The error recovery overhead of TMR-b and TMR-w is zero because TMR is an error-masking technique rather than an error detection and correction scheme. As its name suggests, DARA-TMR is based on TMR, but it treats the occurrence of permanent faults as a very rare phenomenon and does not offer a fast reconfiguration mechanism. HPaS, on the other hand, incurs an error recovery penalty of either 2 or 4 cycles to mitigate any type of fault.
Table 1 Summary of comparison of different related fault-tolerant architectures

Architecture   Hardware redundancy                  Area overhead         Power overhead        Error recovery overhead (cycles)

Partial Protection
PaS            CL triplication                      No data in [11]       No data in [11]       No data in [11]
Razor          SL duplication                       1 %-3 %               3.1 %                 1
STEM           SL triplication                      14 %-15 %             No data               1 or 3
CPipe          CL and SL duplication                No data in [9]        No data in [9]        1

Full Protection
TMR-b          CL triplication                      184 %                 229 %                 0
TMR-w          CL triplication                      191 %                 230 %                 0
DARA-TMR       CL and SL triplication               No data in [13, 23]   No data in [13, 23]   No data in [13, 23]
HPaS           CL triplication and SL duplication   217 %                 198 %                 2 or 4

Dynamic Frequency Scaling and Dynamic Voltage Scaling: Feasible for HPaS.
Remaining columns (per-architecture entries not recoverable here): Permanent fault tolerance, Transient fault tolerance, Timing fault tolerance, Fault accumulation effect handling, Lifetime reliability improvement.

Although the partial protection schemes considered here have slightly lower error recovery penalties, these penalties cannot be compared with that of HPaS because of the difference in fault-tolerance capability. Among the eight fault-tolerant architectures considered in this comparison, only PaS, CPipe and HPaS offer fault accumulation effect handling, and HPaS is the only one that offers a lifetime reliability improvement by selectively sparing the weakest CL block, as discussed in Section 4.4. From this comparison it can be inferred that the proposed HPaS architecture offers a fault-tolerance capability equivalent to that of TMR structures, with the additional benefits of power saving and lifetime reliability improvement. It also provides the opportunity to apply power conservation and performance enhancement techniques such as DVS and DFS.

7 Conclusion

In this work we presented Hybrid Pair-and-A-Spare (HPaS), an improved architecture that ensures the robustness of the combinational logic parts of non-linear pipelined processor cores, with little impact on performance and a quite modest area overhead. It offers better transient fault-tolerance capability and fault accumulation effect handling than TMR, with the added advantages of an 11.6 % power saving and improved circuit lifetime reliability. The proposed architecture, implemented on the MIPS microprocessor, was subjected to a fault-injection campaign of 30,000 single faults and 300 multiple faults in its combinational logic parts using our generic, fully automated fault-injection framework. The HPaS version of the targeted microprocessor sustained correct operation without any failure. Although very effective in protecting combinational logic parts, our architecture offers only marginal protection for sequential elements. We intend to incorporate a state-element fault-tolerance scheme to further improve its robustness.
In addition, the timing error detection and correction capability of HPaS opens an interesting opportunity to use it with Dynamic Voltage and Frequency Scaling techniques to save more power.

References

1. Avirneni NDP, Somani AK (2012) Low overhead soft error mitigation techniques for high-performance and aggressive designs. IEEE Trans Comput 61(4)
2. E. Balaji and P. Krishnamurthy (1996) Modeling ASIC memories in VHDL. In: Proc. EURO-VHDL Design Automation Conference
3. J. A. Blome, S. Feng, S. Gupta, S. Mahlke (2006) Online timing analysis for wearout detection. In: Proc. of the 2nd Workshop on Architectural Reliability
4. Dubrova E (2013) "Hardware redundancy," in Fault-Tolerant Design. Springer, New York
5. D. Ernst, Nam Sung Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, T. Mudge (2003) Razor: a low-power pipeline based on circuit-level timing speculation. In: Proc. of the 36th Annual IEEE/ACM International Symposium on Microarchitecture
6. Introduction to Single-Event Upsets, white paper, Altera Corp (2013)
7. K. John, H.K. Chris (2011) Transistor Aging, IEEE Spectrum
8. Johnson BW (1989) Design techniques to achieve fault tolerance. In: Design and analysis of Fault-Tolerant Digital Systems. Addison-Wesley Pub Comp. Inc, USA
9. Li M-L., P. Ramachandran, U.R. Karpuzcu, S.K.S. Hari, S.V. Adve (2009) Accurate microarchitecture-level fault modeling for studying hardware faults. In: Proc. of the 15th IEEE International Symposium on High Performance Computer Architecture
10. P. Liden et al. (1994) On latching probability of particle induced transients in combinational networks. In: Proc. of the Symp on Fault-Tolerant Computing
11. M. Mehrara, M. Attariyan, S. Shyam, K. Constantinides, V. Bertacco and T. Austin (2007) Low-Cost Protection for SER Upsets and Silicon Defects. In: Proc. of the Design, Automation Test in Europe Conference
12. S. Mitra, E.J. McCluskey (2000) Word-voter: a new voter design for triple modular redundant systems. In: Proc. of the 18th IEEE VLSI Test Symposium
13. M. Prvulovic et al. (2002) ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors. In: Proc. of the Int Symp on Computer Architecture
14. Semiconductor Industry Association (2010) International Technology Roadmap for Semiconductors (ITRS)
15. P. Shivakumar, M. Kistler, S.W. Keckler, D. Burger and L. Alvisi (2002) Modeling the effect of technology trends on the soft error rate of combinational logic. In: Proc. of the Int Conf on Dependable Systems and Networks
16. V. Subramanian, A.K. Somani (2008) Conjoined Pipeline: Enhancing Hardware Reliability and Performance through Organized Pipeline Redundancy. In: Proc. 14th IEEE Pacific Rim International Symposium on Dependable Computing
17. D.A. Tran, A. Virazel, A. Bosio, L. Dilillo, P. Girard, S. Pravossoudovitch and H.-J. Wunderlich (2011) A hybrid fault tolerant architecture for robustness improvement of digital circuits. In: Proc. of the Asian Test Symposium
18. D. A. Tran, A. Virazel, A. Bosio, L. Dilillo, P. Girard, A. Todri, M.E. Imhof and H.-J. Wunderlich (2012) A pseudo-dynamic comparator for error detection in fault tolerant architectures. In: Proc. of the VLSI Test Symposium
19. J. Velamala, R. LiVolsi, M. Torres and Yu Cao (2011) Design sensitivity of Single Event Transients in scaled logic circuits. In: Proc. of the Design Automation Conference
20. I. Wali, A. Virazel, A. Bosio, L. Dilillo, P. Girard, A. Todri (2014) Protecting combinational logic in pipelined microprocessor cores against transient and permanent faults. In: Proc. of the Int. Symp. on Design and Diagnostics of Electronic Circuits Systems, pp. 223, 225
