A Fault-Tolerant Alternative to Lockstep Triple Modular Redundancy

Similar documents
CprE 458/558: Real-Time Systems. Lecture 17 Fault-tolerant design techniques

FAULT TOLERANT SYSTEMS

Experimental Study Of Fault Cones And Fault Aliasing

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing

DESIGN AND ANALYSIS OF SOFTWARE FAULTTOLERANT TECHNIQUES FOR SOFTCORE PROCESSORS IN RELIABLE SRAM-BASED FPGA

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability. Copyright 2010 Daniel J. Sorin Duke University

Self-checking combination and sequential networks design

Improving the Fault Tolerance of a Computer System with Space-Time Triple Modular Redundancy

Single Event Upset Mitigation Techniques for SRAM-based FPGAs

Defect Tolerance in VLSI Circuits

FAULT TOLERANT SYSTEMS

Fault Tolerance. The Three universe model

Area Efficient Scan Chain Based Multiple Error Recovery For TMR Systems

CHAPTER 1 INTRODUCTION

Multiple Event Upsets Aware FPGAs Using Protected Schemes

Fine-Grain Redundancy Techniques for High- Reliable SRAM FPGA`S in Space Environment: A Brief Survey

EECS150 - Digital Design Lecture 24 - High-Level Design (Part 3) + ECC

Outline of Presentation Field Programmable Gate Arrays (FPGAs(

Dependability tree 1

Reliable Architectures

Error Resilience in Digital Integrated Circuits

Leso Martin, Musil Tomáš

MicroCore Labs. MCL51 Application Note. Lockstep. Quad Modular Redundant System

Fault-tolerant techniques

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:

AN EFFICIENT DESIGN OF VLSI ARCHITECTURE FOR FAULT DETECTION USING ORTHOGONAL LATIN SQUARES (OLS) CODES

DATAPATH ARCHITECTURE FOR RELIABLE COMPUTING IN NANO-SCALE TECHNOLOGY

Ultra Low-Cost Defect Protection for Microprocessor Pipelines

Fast SEU Detection and Correction in LUT Configuration Bits of SRAM-based FPGAs

Configurable Fault Tolerant Processor (CFTP) for Space Based Applications

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Reliability Improvement in Reconfigurable FPGAs

CHAPTER 3 METHODOLOGY. 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier

Very Large Scale Integration (VLSI)

Enabling Testability of Fault-Tolerant Circuits by Means of IDDQ-Checkable Voters

Outline. Parity-based ECC and Mechanism for Detecting and Correcting Soft Errors in On-Chip Communication. Outline

Fault Tolerance. Distributed Systems IT332

CHAPTER 12 ARRAY SUBSYSTEMS [ ] MANJARI S. KULKARNI

Transient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof.

FAULT TOLERANT SYSTEMS

Improving FPGA Design Robustness with Partial TMR

An Integrated ECC and BISR Scheme for Error Correction in Memory

Digital Systems Testing

Exploiting Unused Spare Columns to Improve Memory ECC

Fault Simulation. Problem and Motivation

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Fault Tolerance Dealing with an imperfect world

SEE Tolerant Self-Calibrating Simple Fractional-N PLL

Self-Checking Fault Detection using Discrepancy Mirrors

ECE 574 Cluster Computing Lecture 19

Quadded GasP: a Fault Tolerant Asynchronous Design

On Supporting Adaptive Fault Tolerant at Run-Time with Virtual FPGAs

Implementation Issues. Remote-Write Protocols

VLSI Test Technology and Reliability (ET4076)

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

A Hybrid Fault-Tolerant Architecture for Highly Reliable Processing Cores

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

Fault-tolerant system design using novel majority voters of 5-modular redundancy configuration

Fault Tolerance. Distributed Software Systems. Definitions

Selective Validations for Efficient Protections on Coarse-Grained Reconfigurable Architectures

Radiation Hardened System Design with Mitigation and Detection in FPGA

Overview ECE 753: FAULT-TOLERANT COMPUTING 1/21/2014. Recap. Fault Modeling. Fault Modeling (contd.) Fault Modeling (contd.)

Validation of the Proposed Hardness Analysis Technique for FPGA Designs to Improve Reliability and Fault-Tolerance

WITH the continuous decrease of CMOS feature size and

Formal Verification throughout the Development of Robust Systems

Software Techniques for Dependable Computer-based Systems. Matteo SONZA REORDA

ARCHITECTURE DESIGN FOR SOFT ERRORS

Dynamic Partial Reconfiguration of FPGA for SEU Mitigation and Area Efficiency

Last Class:Consistency Semantics. Today: More on Consistency

Initial Single-Event Effects Testing and Mitigation in the Xilinx Virtex II-Pro FPGA

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator

CS252 Graduate Computer Architecture Midterm 1 Solutions

Fault-Tolerant Embedded System

Eventual Consistency. Eventual Consistency

Utilizing Parallelism of TMR to Enhance Power Efficiency of Reliable ASIC Designs

CS152 Computer Architecture and Engineering

Fault-Tolerant Embedded System

SPECIAL ISSUE ENERGY, ENVIRONMENT, AND ENGINEERING SECTION: RECENT ADVANCES IN BIG DATA ANALYSIS (ABDA) ISSN:

Today: Fault Tolerance. Replica Management

11. SEU Mitigation in Stratix IV Devices

Distributed Systems COMP 212. Lecture 19 Othon Michail

High Speed Fault Injection Tool (FITO) Implemented With VHDL on FPGA For Testing Fault Tolerant Designs

A New Fault Injection Approach to Study the Impact of Bitflips in the Configuration of SRAM-Based FPGAs

Chapter 8 Fault Tolerance

SAN FRANCISCO, CA, USA. Ediz Cetin & Oliver Diessel University of New South Wales

Chapter 8. Coping with Physical Failures, Soft Errors, and Reliability Issues. System-on-Chip EE141 Test Architectures Ch. 8 Physical Failures - P.

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy

A Performance Degradation Tolerable Cache Design by Exploiting Memory Hierarchies

Dependability. IC Life Cycle

Distributed Operating Systems

A Fault Tolerant Superscalar Processor

Safety and Reliability of Software-Controlled Systems Part 14: Fault mitigation

Chip, Heal Thyself. The BulletProof Project

Using Error Detection Codes to detect fault attacks on Symmetric Key Ciphers

Hardware Implementation of a Fault-Tolerant Hopfield Neural Network on FPGAs

Digital Integrated Circuits

Fault-Tolerant Computing

Distributed Systems 24. Fault Tolerance

Transcription:

A Fault-Tolerant Alternative to Lockstep Triple Modular Redundancy Andrew L. Baldwin, BS 09, MS 12 W. Robert Daasch, Professor Integrated Circuits Design and Test Laboratory

Problem Statement In a fault tolerant system containing three redundant processing mitigate the effects of a single faulty PE. When more than one PE is faulty Triple Modular Redundancy can select a faulty output. Time distributed voting (TDV) proposes an alternative to TMR to extend fault coverage when multiple PE s are faulty. 16 April 2012 2

Outline Contributions Background Methods: Time Distributed Voting (TDV) Results Conclusion and Recommendations 16 April 2012 3

Contributions Time Distributed Voting (TDV) extends coverage in active fault tolerant systems CAM based Verilog HDL TDV prototype: Finds voting opportunities by detecting data stream commonalities Aligns PE result for voting execution Characterization of aliasing in the ISCAS 85 C6288 benchmark 16 April 2012 4

Background - Faults A fault is any upset that modifies a circuit to the point of failure. Faults may be caused by: Random Defects Introduced during fabrication. May be detectable at test or latent Soft errors - occur online, may be recovered or reset Hard errors occur online, may cause permanent damage to the circuit 16 April 2012 5

Random Defects Defects failures may be detectable at test May fail during the product s useful lifetime. As minimum feature size gets smaller, the circuit becomes sensitive to smaller defects. Smaller defects occur at a greater frequency. A.25um defect will occur 8 times more frequently than a.5um defect. Relative frequency approximation. 16 April 2012 6

Online Soft and Hard Errors Soft Errors A change to the state of a device or transient Caused by a heavy ionizing particle, cosmic ray, proton, etc. No permanent damage, recovered by reset Hard Errors Burnout Latch up Electro Migration Permanently damages the device 16 April 2012 7

Fault Mitigation Manufacturers employ techniques to improve yield and reliability in the presence of faults. Fault tolerant designs may preserve the functionality of the system when a component fails. Passive fault tolerance: masking erroneous results while leaving the faulty circuit in the system Active fault tolerance: identifies faulty circuits and removes or replaces them in the system Triple modular redundancy (TMR) is a passive fault tolerant technique 16 April 2012 8

Triple Modular Redundancy (TMR) Three redundant processing elements operate same input PE results are evaluated in a voting algorithm Majority result is the system output Any single erroneous result masked by the majority result 16-bit Mult [0] Result[0] 88777777 Input 8888ffff 16-bit Mult [1] Result[1] 88877778 16-bit Mult [2] Result[2] 88877778 Majority Voting Algorithm Output 88877778 16 April 2012 9

Shortcomings of TMR Additional area and power is required for redundant PE s and voting logic TMR provides coverage for cases as they arise TMR only provides reliable coverage when, at most, only a single PE is faulty What happens when two PE s are faulty? Can we identify the fault free PE using random inputs? 16 April 2012 10

Fault Cones and Aliasing A fault s cone is all output bits affected by the fault Fault cones f2 and f3 can overlap Aliasing possible when input activates both faults Aliasing is faulty PEs agreeing to wrong result i0 i1 i2 i3 i4 i5 i6 i7 i0 i1 i2 i3 i4 i5 i6 i7 i0 i1 i2 i3 i4 i5 i6 i7 f1 PE1 f2 PE2 f3 PE3 o0 o1 o2 o3 o4 o5 o6 o7 o0 o1 o2 o3 o4 o5 o6 o7 o0 o1 o2 o3 o4 o5 o6 o7 16 April 2012 11

Majority Voting Majority voting systems with multiple faulty PE s generate: Indeterminate outcomes no PE results match Cases with no majority, or the majority is wrong. Faulty PE s correct results vote with healthy PEs When PEs contain different faults, majority voting may still favor the healthy (Golden) PEs TMR systems assume single fault Eliminates potential for indeterminate (null) voting Eliminates potential aliased voting 16 April 2012 12

Example Fault Cones and Aliasing i0 i1 i2 i3 i4 i5 i6 i7 i0 i1 i2 i3 i4 i5 i6 i7 i0 i1 i2 i3 i4 i5 i6 i7 f1 PE1 f2 PE2 f3 PE3 o0 o1 o2 o3 o4 o5 o6 o7 o0 o1 o2 o3 o4 o5 o6 o7 o0 o1 o2 o3 o4 o5 o6 o7 f1 activated f2 activated f3 activated Bit-level voter Word-level Voter Comment 0 0 0 no fault observed no fault observed 0 0 1 f3 observed f3 observed 0 1 0 f2 observed f2 observed 0 1 1 possible aliasing on o5 possible word aliasing f2 and f3 overlap 1 0 0 f1 observed f1 observed 1 0 1 no bit level aliasing indeterminate f1 and f3 no overlap 1 1 0 no bit level aliasing indeterminate f1 and f2 no overlap 1 1 1 possible aliasing on o5 indeterminate 16 April 2012 13

Time Distributed Voting TDV A statistical opportunity exists for faulty PEs to help identify healthy PE s by accumulating voting results over time TDV identifies healthy and faulty PE s over time Alternative to TMR masking erroneous results If fault is not activated PE output is correct When fault is activated not all PE output incorrect 16 April 2012 14

TDV Prototype Verilog HDL prototype was used to simulate TDV Features: Three PE s operating on independent data streams CAM-based FIFO s to detect voting opportunities PE result alignment and vote execution 16 April 2012 15

TDV Block Diagram stream [0] stream [1] stream [2] FIFO[0] FIFO[1] FIFO[2] Commonality Detection input[0] input[1] input[2] PE[0] PE[1] PE[2] result[0] result[1] result[2] output stream[0] output stream[1] output stream[2] common symbol array result array[0] result array[1] result array[2] time-distributed majority voting algorithm weight[0] weight[1] weight[2] 16 April 2012 16

TDV Prototype FIFO[0] FIFO[1] FIFO[2] Search Field1 B C A Search Field2 C A B DATA[31] C D E DATA[30] F G B DATA[ ] DATA[1] A B C DATA[0] H C I stream [0] FIFO[0] input[0] PE[0] stream [1] FIFO[1] input[1] PE[1] stream [2] FIFO[2] input[2] PE[2] Commonality Detection Active Input A B C HIT 0 0 1 PE Result X Y Z The CAM-based FIFO buffers provide data to the PE s When a FIFO HIT is detected, the input pattern and its PE result are cached Results from other buffers are cached when available When the results from all PE are cached, voting is executed PE weights (tallies) are adjusted common symbol array C result[0] result array[0] Z result[1] result array[1] Z result[2] result array[2] time-distributed majority voting algorithm weight[0] weight[1] weight[2] Z output stream[0] output stream[1] output stream[2] 16 April 2012 17

TDV Processing Element (C6288) The ISCAS 85 C6288 Benchmark is used as the prototype processing element 15 by 16 array structure. 16-bit Multiplier (32 input bits, 32 output bits) C6288 contains 2448 nodes that may be modeled as SA0 or SA1 faults. Total 4896 fault nodes. (4879 observable) Minimum set size of 12 patterns to achieve 100% fault coverage (observable faults) 150 Pseudorandom patterns to achieve 100% fault coverage (observable faults) 16 April 2012 18

C6288 PE MSB LSB 16 April 2012 19

C6288 PE (continued) C6288 symmetry high rate of aliasing C6288 AND gates compute partial products C6288 half and full adders sum partial products MSB LSB 16 April 2012 20

Adders used in C6288 Top-row half adders lack the Ci input Single half adder in the bottom row lacks the B input 16 April 2012 21

Results: Coverage of Single Stuck-at Faults 100% 90% Fault Coverage Fault Coverage 80% 70% 60% 50% 40% 1 10 100 1,000 Number of Pseudorandom Input Patterns Simulated 1,200 pseudorandom test patterns for each of the 4,896 fault nodes Achieved 100% single stuck-at fault coverage with 150 pseudorandom input patterns. 16 April 2012 22

Results: Aliasing Characterization Random Patterns Activated Faults Unactivated Faults Non-Aliasing Faults Aliasing Faults Aliasing Faults (%) 2 3,302 1,577 14 3,288 99.58% 3 3,974 905 28 3,946 99.30% 6 4,481 398 34 4,447 99.24% 12 4,768 111 34 4,734 99.29% 25 4,821 58 28 4,793 99.42% 50 4,874 5 28 4,846 99.43% 100 4,877 2 28 4,849 99.43% 200 4,879 0 28 4,851 99.43% 400 4,879 0 28 4,851 99.43% 800 4,879 0 28 4,851 99.43% 1,200 4,879 0 28 4,851 99.43% All fault pairs simulated for the N= 4,878 faults N(N-1)/2 = 11,982,960 fault pairs 99+% of faults in the array have aliasing fault pairs 28 faults displayed no aliasing 16 April 2012 23

Results: Aliasing Characterization Aliasing was observed in about 3.2% of all activated fault combinations Random Patterns Activated Fault Combinations Aliasing Fault Combinations Aliasing Fault Combinations(%) 2 5449951 96361 1.77% 3 7894351 145437 1.84% 6 10037440 205827 2.05% 12 11364528 276495 2.43% 25 11618610 317298 2.73% 50 11875501 353489 2.98% 100 11890126 369431 3.11% 200 11899881 377027 3.17% 400 11899881 378817 3.18% 800 11899881 379375 3.19% 1200 11899881 379437 3.19% 16 April 2012 24

Results: Aliasing Characterization Quantiles 100.00% maximum 9.5857 99.50% 9.4001 97.50% 6.947 90.00% 5.6071 75.00% quartile 3.9373 50.00% median 3.1128 25.00% quartile 2.1439 10.00% 1.2781 2.50% 0.5154 0.50% 0.0466 0.00% minimum 0.0206 Percentage of 4,878 fault combinations that alias (4851 singular faults) On average a singular fault aliases with ~3.2% of it s possible fault combinations At most, a singular fault aliases with ~9.6% of it s possible fault combinations Moments Mean 3.224832 Std Dev 1.637596 Std Err Mean 0.023512 Upper 95% Mean 3.270927 Lower 95% Mean 3.178738 N 4851 16 April 2012 25

Results: Aliasing Characterization Equivalent faults modify the circuit in the same way Display identical symptoms Red faults are equivalent Equivalent faults ~15,000 of aliasing fault pairs Aliasing extends equivalent faults 16 April 2012 26

Results: Aliasing Characterization In the C6288, aliasing frequency is modulated by propagation paths and fault pair proximity. Propagation paths Faults that propagate through both adder outputs alias more than faults that only propagate through a single adder output. Proximity - Aliasing is only observed for fault pairs in the same or adjacent column. Equivalence is not required. A singular fault (blue) may alias with faults in close columnar proximity (red) All aliasing fault pairs are in the same or adjacent column (pawn movement) 16 April 2012 27

Results: TDV Fault Coverage TDV simulation assumes a single healthy (Golden) PE and two faulty (Faulty1 & Faulty2) PE s For each pseudorandom input pattern, the PE weights are updated as shown in the table below. (Minority PE weight gets decremented, Majority PE weights get incremented) TDV voting is non-biased, the outcomes are not skewed to favor the Golden PE The voter does not know what the correct answer is The object of is to identify the Golden PE using the accumulated TDV outcomes Input Pattern Result[0] Result[1] Result[2] Weight[0] Weight[1] Weight[2] A X X X +0 +0 +0 A Y X X -1 +1 +1 A X Y X +1-1 +1 A X X Y +1 +1-1 A X Y Z +0 +0 +0 16 April 2012 28

Results: TDV Fault Coverage Golden Golden Golden Faulty1 Faulty2 Faulty1 Faulty2 Faulty1 Faulty2 In 96.8% of fault pairs, the Golden PE is correctly identified and no aliasing is observed for any of the input patterns. The green regions in the figure indicate voting results that favor the Golden PE No voting results conspire against the Golden PE In 1.8% of fault pairs, the Golden PE was correctly identified even though aliasing was observed. The red region indicates voting results that conspire against the Golden PE. As long as there is more green than red, TDV correctly identifies the Golden PE. In 1.37% of fault pairs, the Golden PE was evicted because heavy aliasing was observed. The red region is large enough to overcome the green regions. These cases are heavy aliasing fault pairs and equivalent fault pairs. 16 April 2012 29

Results: TDV Fault Coverage Adding more test patterns changes the snapshot of which aliasing faults get coverage. The table shows the TDV outcome incrementally as input pattern set size is increased to 1,200 (Green=Correct; Red=Incorrect) 200 400 800 1200 PE1 PE2 PE3 Patterns Patterns Patterns Patterns Golden N997 N2263 0 0 0 1 Golden N4297 N4796 0 0 1 0 Golden N752 N2514 0 0 1 1 Golden N411 N3247 0 1 0 0 Golden N1000 N2759 0 1 0 1 Golden N997 N2008 0 1 1 0 Golden N872 N1121 0 1 1 1 Golden N868 N1121 1 0 0 0 Golden N3162 N4676 1 0 0 1 Golden N1823 N2577 1 0 1 0 Golden N3508 N4513 1 0 1 1 Golden N843 N3362 1 1 0 0 Golden N3745 N4760 1 1 0 1 Golden N435 N2694 1 1 1 0 16 April 2012 30

Results: TDV Fault Coverage Random Patterns Total Fault Pairs % Correct No-Aliasing Pairs % Aliasing Pairs % Correct Aliasing Pairs % Total Correct Pairs % Incorrect Aliasing Pairs 200 11,899,881 96.83% 3.17% 1.79% 98.62% 1.38% 400 11,899,881 96.82% 3.18% 1.80% 98.62% 1.38% 800 11,899,881 96.81% 3.19% 1.82% 98.63% 1.37% 1,200 11,899,881 96.81% 3.19% 1.83% 98.64% 1.36% For the C6288 array circuit TDV covers all observable single faulty PE cases covered by lockstep TMR. TDV extends fault coverage to 98.6% of multiple faulty PE s for which TMR provides no coverage. 16 April 2012 31

Conclusions and Recommendations Lockstep TMR can fail in the presence of multiple faulty PE s. Time distributed voting (TDV) is an alternative to lockstep TMR TDV extended coverage to 98.6% of multiple faulty PE s for. C6288 benchmark alias simulation TMR provides no coverage for 1.4% fault pairs 16 April 2012 32

Conclusions and Recommendations Aliasing extends beyond equivalent faults Conventional fault collapsing does eliminate fault pair aliasing TDV does not ensure detection of faulty elements in all cases. TDV evicted the healthy PE 3.2% of the fault pairs TDV requires analysis of frequency of aliasing fault pairs TDV may require engineered test patterns to maximize coverage. 16 April 2012 33

Conclusions and Recommendations TDV provides effective alternative to lockstep TMR TDV provides fault tolerant design of systems in the presence of multiple faults. Adding PE s reduces aliasing faulty PE s Aliasing reduced from PE s with different implementation for same function Engineering a minimal test pattern to identify alias fault pairs 16 April 2012 34

Acknowledgement Andrew Baldwin was a part-time graduate student The work described in this talk was his MS thesis 16 April 2012 35