Self-Repair for Robust System Design. Yanjing Li Intel Labs Stanford University

Size: px

Start display at page:

Download "Self-Repair for Robust System Design. Yanjing Li Intel Labs Stanford University"

Lynne Griffin
5 years ago
Views:

1 Self-Repair for Robust System Design Yanjing Li Intel Labs Stanford University 1

2 Hardware Failures: Major Concern Permanent: our focus Temporary 2

3 Tolerating Permanent Hardware Failures Detection Diagnosis Recovery Self-repair 3

4 Tolerating Permanent Hardware Failures Detection Diagnosis Recovery Self-repair Concurrent error detection: expensive 4

5 Tolerating Permanent Hardware Failures Detection Diagnosis Online Self-Test and Diagnostics Non-continuous Recovery Low power Self-repair Concurrent No visible downtime 5

6 Tolerating Permanent Hardware Failures Detection Diagnosis Online Self-Test and Diagnostics Localize failures Recovery Diagnosis Self-repair 6

7 Tolerating Permanent Hardware Failures Detection Diagnosis Replace / bypass faulty component(s) Self-repair Recovery Self-repair 7

8 Tolerating Permanent Hardware Failures Detection Diagnosis Correction of corrupted data and states Checkpoint Failure occurs Failure detected Rollback Recovery Self-repair 8

9 This Lecture Detection Diagnosis Recovery Self-repair 9

10 Outline Introduction Self-repair techniques Memories Processor cores Uncore components Conclusion 10

11 Memory Organization en rd/wr logic addr Row decoder Column decoder Column multiplexers data_in Data registers data_out Sense amplifiers 11

12 Memory Functional Fault Models Memory cell array faults stuck-at, transition, coupling, etc. Address decode faults (AFs) no cell accessed, multiple cells accessed, etc. Read/write logic faults Equivalent to memory cell array faults Single bit / word / column / row faults dominate [Aitken 04, Kurdahi 06, Sridharan 12] 12

13 Spare Rows / Columns en rd/wr logic Spare column addr Row decoder Column decoder Column multiplexers Spare rows data_in Data registers data_out Sense amplifiers 13

14 Memory Self-Repair Using Spare Rows en rd/wr logic addr Row decoder Column decoder Column multiplexers data_in Data registers data_out Sense amplifiers Row decoder modified 14

15 Memory Self-Repair Using Spare Columns en rd/wr logic addr Row decoder Column decoder Multiplexers Column multiplexers Selfrepair control data_in Data registers data_out Sense amplifiers Additional multiplexers and select logic 15

16 Spare Words en rd/wr logic Same inputs as RAM addr Row decoder Column decoder Column multiplexers Spare words pool data_in Data registers Sense amplifiers data_out Self-repair control [Kim 98] 16

17 Redundancy in Memories Essential to improve yield and reliability Widely used in commercial RAMs Related topics How much redundancy? Built-in repair analysis: redundancy allocation 17

18 How Much Redundancy? Yield analysis and yield learning Example: negative binomial memory yield model YYYYY = (1 + YY Y) Y: defect desntiy, Y: memory area Y: defect clustering coefficient (measured to be 2 or 3) For a memory array with N rows and 1 spare row YYYYY = YYYYY+(Y + 1)(YYYYY)(1 YYYYY) 18

19 How Much Redundancy? Yield analysis and yield learning Example: negative binomial memory yield model 19

20 Built-In Repair Analysis Spare rows and columns NP complete Various algorithms and EDA tools exist 20

21 Special Self-Repair Techniques for Caches Caches: affects performance, NOT functionality Block address Tag Index Block offset V Tag Data / ECC Decoder =? Hit/miss Data out 21

22 Cache Line Disable/Delete Exist in commercial systems Intel [Chang 07], IBM [Sanda 08], etc. Block address Tag Decoder Index Block offset Fault-tolerance bit V FT Tag Data / ECC =? Hit/miss Data out 22

23 Setting the FT Bit Cache line read ECC calculate and check ECC error? No Yes Correct and flag error; log faulty location Done ECC error at same location already occurred within a time threshold? No Yes Set FT bit 23

24 PADded Cache Reconfigurable cache with programmable decoder Conventional decoder 24

25 PADded Cache Reconfigurable cache with programmable decoder Additional tag bits needed Programmable decoder ~ 5% area cost [Shirvani 99] 16KB, direct-mapped, 1-level programmability Reduce cost at the price of granularity 25

26 Cache Line Delete vs. PADded Cache 26

27 Outline Introduction Self-repair techniques Memories Processor cores Uncore components Conclusion 27

28 Core Sparing Utilize multi-/many-core architectures Core disabling also possible Already in commercial products IBM BlueGene/Q Nvidia Geforce Cisco Metro Fine-grained approaches? 28

29 Core Cannibalization Many-core designs with small in-order cores 3.5% area cost (OpenRISC 1200) [Romanescu 08] 29

30 Core Cannibalization: Discussion What about diagnosis? Routing performance impact Additional wire delay pipeline stages Modified branch resolution, bypass logic, etc. Decreased clock frequency Small vs. large number of faulty cores 30

31 Microarchitectural Block Disabling Disable half pipeline way in superscalar designs 12% area cost (includes diagnosis) Issue queue F D R Rd E M W C F D R Rd E M W C [Schuchman 05] 31

32 Microarchitectural Block Disabling Disable half pipeline way in superscalar designs 12% area cost (includes diagnosis) Issue queue F S D R S Rd E M W C * * * * * * * *modified stages F S D R S Rd E M W C * * * * * * * [Schuchman 05] 2 shift stages (S) added and lots design modifications 32

33 Microarchitectural Block Disabling Disable half pipeline way in superscalar designs 12% area cost (includes diagnosis) Issue queue, old half F S D R S Rd E M W C * * * * * * * *modified stages F S D R S Rd E M W C * * * * * * * Issue queue, new half Split issue queue (same for store buffer, not shown) [Schuchman 05] 33

34 Microarchitectural Block Disabling: Discussion Expensive, complex, intrusive Diagnosis logic Reconfiguration logic Coverage issues E.g., fetch stage not covered 34

35 Architectural Core Salvaging Many-core CISC processor designs Application 1 Application n Advanced ISA Basic ISA... Advanced ISA Basic ISA Core 1 Core n [Powell 09] 35

36 Architectural Core Salvaging Many-core CISC processor designs Application 1 Application n Basic uops Basic uops Advanced ISA Basic ISA... Advanced ISA Basic ISA Core 1 Core n [Powell 09] 36

37 Architectural Core Salvaging Many-core CISC processor designs Application 1 Application n Advanced uops Basic uops Advanced ISA Basic ISA... Advanced ISA Basic ISA Core 1 Core n [Powell 09] 37

38 Architectural Core Salvaging Many-core CISC processor designs Application 1 Thread swap Application n Advanced uops Basic uops Advanced ISA Advanced ISA Basic ISA... Basic ISA Core 1 Core n [Powell 09] 38

39 Architectural Core Salvaging Many-core CISC processor designs Application n Thread swap Application 1 Basic uops Advanced uops Advanced ISA Advanced ISA Basic ISA... Basic ISA Core 1 Core n [Powell 09] 39

40 Architectural Core Salvaging: Discussion What about diagnosis Applicability CISC-like architectures Performance Depends Coverage issues ~50% coverage of execution 40

41 Self-Repair for Processor Cores Which technique to choose? Software-assisted techniques? 41

42 Outline Introduction Self-repair techniques Memories Processor cores Uncore components Conclusion 42

43 Uncore Prevalent in SoCs Uncore examples IBM Power 7 Cache / DRAM controller Accelerators I/O interfaces NVIDIA Tegra Cisco Network Processor 43

44 Uncore Self-Repair Essential OpenSPARC T2 Uncore 12% Processor cores 12% Memories 76% Cores, memories, networks-on-chip Many existing techniques 44

45 Key Message [Li ITC13] 20 Existing sparing techniques Chip area impact (%) 0 Self-repair coverage (%) 98 45

46 Key Message [Li ITC13] 20 Existing sparing techniques Chip area impact (%) 7.5 Our techniques Self-repair coverage (%) 98 46

47 Key Message [Li ITC13] 20 Existing sparing techniques Chip area impact (%) 7.5 Performance: 0% (fault-free) 0.3% - 5% (faulty) Power: 3% Our techniques Self-repair coverage (%) 98 47

48 Existing Techniques Inadequate OpenSPARC T2 Uncore 48

49 Existing Techniques Inadequate OpenSPARC T2 Uncore Original Spare 49

50 Existing Techniques Inadequate OpenSPARC T2 Original Spare Steering logic 50

51 Existing Techniques Inadequate OpenSPARC T2 Original Power gating Spare Steering logic 20% chip area cost 51

52 New Uncore Self-Repair Techniques ERRS: Enhanced Resource Reallocation and Sharing SHE: Sparing through Hierarchical Exploration OpenSPARC T2 8 cores, 64 threads 500M transistors 7.5% chip area cost 52

53 Outline Introduction Self-repair for uncore ERRS SHE Conclusion 53

54 Basic Resource Reallocation & Sharing (RRS) Already-existing similar components No spares 3. Reroute Crossbar blocks 4 cores 4 cores Self-repair control L2 mem 0 L2 bank control 0 L2 bank control 1 L2 mem 1 1. Fault detected 8 cycle overhead (faulty case only) 2. Resource sharing 54

55 Basic RRS Performance Impact Fault-free scenario Crossbar blocks 4 cores 4 cores Self-repair control L2 mem 0 L2 bank MISS! control 0 L2 bank MISS! control 1 L2 mem 1 Fill buffer 55

56 Basic RRS Performance Impact Single faulty component: 70% impact Crossbar blocks 4 cores 4 cores Self-repair control L2 mem 0 L2 bank control 0 L2 bank control 1 L2 mem 1 Fill buffer DATADATA 56

57 Enhanced RRS (ERRS) Idea Identify Mitigate No Desirable tradeoff? Analyze Done Yes 57

cores Self-repair control L2 mem 0 L2 bank control

58 ERRS Mitigates RRS Performance Impact Single faulty component: 3% impact Crossbar blocks 4 cores 4 cores Self-repair control L2 mem 0 L2 bank control 0 L2 bank control 1 L2 mem 1 Fill buffer Fill buffer 58

59 ERRS on OpenSPARC T2 DRAM data return: duplicated DRAM bank control FSM: entries doubled L2 hit processing: duplicated L2 miss fill buffer: entries doubled 3.2% area, 2.7% power 59

60 Performance Evaluation Setup Simulators GEM5 64 single-issue in-order processor cores Simulated CMP 1KB private L1 data and instruction caches 4MByte shared L2 cache (8 banks) 4 DRAM controllers 60

61 ERRS vs. Basic RRS: Performance Impact ERRS: 3% performance impact 3% 1 faulty L2 controller 1 faulty DRAM controller 0.7% Average CPI overhead 0% Basic RRS RRS ERRS 0.0% Basic RRS RRS ERRS 64-core CMP PARSEC benchmark programs 61

62 ERRS vs. Basic RRS: Performance Impact ERRS: 5% performance impact 1 faulty L2 controller 1 faulty DRAM controller 18% 25% Average CPI overhead 0% Basic RRS ERRS 0% 64-core CMP Stressed programs Basic RRS RRS ERRS 62

63 ERRS for Multiple Faulty Components 25 Graceful performance degradation Average CPI impact (%) 0 2 DRAM controllers 4 L2 bank controllers 4 L2 bank controllers + 2 DRAM controllers 2 L2 bank controllers 64-core CMP PARSEC benchmark programs Faulty components 63

64 Which Faults Repairable? delay bridging stuckat All faults inside a component 64

65 Which Faults Not Repairable? Single points of failure Primary inputs Primary outputs Steering logic Metric: self-repair coverage % non-single points of failure 65

66 ERRS Component Self-Repair Coverage 97% 97% 98% 97% 97% 97% 97% 96% 97% 97% 97% 97% 98% 97% 97% 66

67 ERRS Chip Area/Power Impact & Coverage 20 Existing sparing techniques Post-layout chip area impact (%) 3.2 Power: 3% ERRS 0 75 Self-repair coverage (%) 98 67

68 Outline Introduction Self-repair for uncore ERRS SHE Conclusion 68

69 SHE: Sparing through Hierarchical Exploration Component RTL SHE Sparing for component Minimize area cost Identical blocks spare shared Balanced coverage vs. area 69

70 The Design Hierarchy Network controller pio Level 1 rdmc Level 2 chnl_16 chnl_15 chnl_1 Level 3 Lowest level 70

71 Sparing in Different Levels: Coarse-Grained pio Network controller pio pio rdmc rdmc rdmc Original Spare / steering logic Level 2 71

72 Sparing in Different Levels: Mixed Levels Network controller pio rdmc pio pio chnl_16 chnl_15 chnl_1 chnl Level 2 / Level 3 Original Spare / steering logic 72

73 Sparing in Different Levels: Fine-Grained Network controller pio rdmc Original Spare / steering logic Lowest level 73

74 How to Obtain Sweet-Spot? Balanced tradeoff High Area overhead Self-repair coverage Low Deeper in hierarchy Algorithm details in paper 74

75 SHE Results 20 Existing sparing techniques Post-layout chip area impact (%) 3.2 ERRS 0 75 Self-repair coverage (%) 98 75

76 SHE Results 20 Existing sparing techniques Post-layout chip area impact (%) 7.5 SHE power impact: 0.2% SHE performance impact: 0% ERRS + SHE 0 Self-repair coverage (%) 98 76

77 What About Diagnosis? Traditional fault diagnosis difficult Effect-cause: infeasible RTL unavailable Cause-effect: impractical 7 petabyte fault dictionary [Beckler ITC12] 77

78 Fault Diagnosis Simplified CASP [Li DATE08, VTS10] Concurrent, Autonomous, Stored test Patterns 78

79 Fault Diagnosis Simplified CASP [Li DATE08, VTS10] Concurrent, Autonomous, Stored test Patterns Blocks = self-repair granularity Block 1 Block n 79

80 Fault Diagnosis Simplified CASP [Li DATE08, VTS10] Concurrent, Autonomous, Stored test Patterns Pass/fail 1 Local test logic Pass/fail n Fail self-repair needed Diagnosis = detection Block 1 Block n = self-repair granularity 80

81 Summary: Self-Repair of Uncore Components 20 Existing sparing techniques Post-layout chip area impact (%) 7.5 Performance: 0% (fault-free) 0.3% - 5% (faulty) Power: 3% ERRS + SHE 3.2 ERRS 0 75 Self-repair coverage (%) 98 81

82 References [Aitken 04] Aitken, R., A Modular Wrapper Enabling High Speed BIST and Repair for Small Wide Memories, Proc. Intl. Test Conf., pp , [Chang 07] Chang, J., et al., The 65-nm 16-MB Shared On-Die L3 Cache for the Dual-Core Intel Xeon Processor 7100 Series, IEEE Journal of Solid-State Circuits, vol. 42, no. 4, pp , [Kurdahi 06] Kurdahi, F.J., et al., System-Level SRAM Yield Enhancement, Proc. Intl. Symp. Quality Electronic Design,

83 References [Li VTS10] Li, Y., et al., Concurrent Autonomous Self-Test for Uncore Components in System-on-Chips, Proc. VLSI Test Symposium, [Li DATE08] Li, Y., S. Makar, and S. Mitra, CASP: Concurrent Autonomous Chip Self-Test using Stored Test Patterns, Proc. Design, Automation, and Test in Europe, pp , [Li ITC 13] Li, Y., et al., Self-Repair of Uncore Components in Robust System-on-Chips: An OpenSPARC T2 Case Study, Proc. IEEE Intl. Test Conf., pp. 1-10,

84 References [Powell 09] Powell, M.D., et al., Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance, Proc. Intl. Symp. on Computer Architecture, pp , [Romanescu 08] Romanescu, B.F., and D.J. Sorin, Core Cannibalization Architecture: Improving Lifetime Chip Performance for Multicore Processors in the Presence of Hard Faults, Proc. Intl. Conf. on Parallel Architectures and Compilation Techniques, pp ,

85 References [Sanda 08] Sanda, P.N, et al., Fault-Tolerant Design of the IBM Power6 Microprocessor, IEEE Micro, vol. 28, no. 2, pp , [Schuchman 05] Schuchman, E., and T.N. Vijaykumar, Rescue: A Microarchitecture for Testability and Defect Tolerance, Proc. Intl. Symp. on Computer Architecture, pp , [Shirvani 99] Shirvani, P.P., and E.J. McCluskey, PADded Cache: A New Fault-Tolerance Technique for Cache Memories, Proc. VLSI Test Symp., pp ,

86 References [Shivakumar 03] Shivakumar, P., et al., Exploiting Microarchitectural Redundancy for Defect Tolerance, Proc. Intl. Conf. on Computer Design, pp , [Sridharan 12] Sridharan, V., et al., A Study of DRAM Failures in the Field, Proc. Intl. Conf. High Performance Computing, Networking, Storage and Analysis, pp. 1-11, [Schuchman 05] Schuchman, E., and T.N. Vijaykumar, Rescue: A Microarchitecture for Testability and Defect Tolerance, Proc. Intl. Symp. on Computer Architecture, pp ,

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance