AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
Eric Rotenberg
Computer Sciences Department, University of Wisconsin-Madison
http://www.cs.wisc.edu/~ericro/ericro.html
ericro@cs.wisc.edu

High-Performance Microprocessors
- General-purpose computing success: proliferation of high-performance single-chip microprocessors
- Rapid performance growth fueled by two trends:
  - Technology advances: circuit speed and density
  - Microarchitecture innovations: exploiting instruction-level parallelism (ILP) in sequential programs
- Both technology and microarchitecture performance trends have implications for fault tolerance

Technology and Fault Tolerance
- Technology-driven performance improvements: GHz clock rates, billion-transistor chips
- High clock rate, dense designs:
  - Low voltages for fast switching and power management
  - High-performance and undisciplined circuit techniques
  - Managing clock skew with GHz clocks
  - Pushing the technology envelope potentially reduces design tolerances in general
- => Entire chip prone to frequent, arbitrary transient faults

Microarchitecture and Fault Tolerance
- Conventional fault-tolerant techniques:
  - Specialized techniques (e.g. ECC for memory, RESO for ALUs) do not cover arbitrary logic faults
  - Pervasive self-checking logic is intrusive to design
  - System-level fault tolerance (e.g. redundant processors) too costly for commodity computers
- A microarchitecture-based fault-tolerant approach:
  - Microarchitecture performance trends can be easily leveraged for fault-tolerance goals
  - Broad coverage of transient faults
  - Low overhead: performance, area, and design changes

Talk Outline
- Technology and microarchitecture trends: implications for fault tolerance
- Time redundancy spectrum
- AR-SMT
- Leveraging microarchitecture trends:
  - Simultaneous Multithreading (SMT)
  - Speculation concepts
  - Hierarchical processors (e.g. Trace Processors)
- Performance results
- Summary and future work

Time Redundancy Spectrum
- Program-level time redundancy: the processor runs program P, then runs program P again
- Instruction re-execution: each dynamic instruction i is scheduled twice by the dynamic parallel scheduler onto the execution units (FUs)
[Figure: the two schemes, from whole-program re-execution down to per-instruction re-execution on the functional units.]

Time Redundancy Spectrum
- AR-SMT time redundancy: the processor runs two copies of program P simultaneously
[Figure: the processor executing both copies of program P at once.]

AR-SMT High Level
[Figure: the A-stream is fetched into the processor; committed A-stream state enters the Delay Buffer; the R-stream trails behind, reading from the Delay Buffer, and commits last.]
- A => Active stream
- R => Redundant stream
- SMT => Simultaneous MultiThreading

AR-SMT Fault Tolerance
- Delay Buffer:
  - Simple, fast, hardware-only state passing for comparing thread state
  - Ensures time redundancy: the A- and R-stream copies of an instruction execute at different times
  - Buffer length adjusted to cover transient fault lifetimes
- Transient fault detection and recovery:
  - Fault detected when thread state does not match
  - Error latency related to length of Delay Buffer
  - Committed R-stream state is checkpoint for recovery
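The detect-and-compare flow above can be sketched in software. A minimal model, with all names and the buffer length chosen purely for illustration: the A-stream pushes committed results into a FIFO, and the R-stream later pops and compares them, flagging a fault on mismatch.

```python
from collections import deque

class DelayBuffer:
    """Illustrative software model of AR-SMT's Delay Buffer (not the
    hardware design): the A-stream records committed results, the
    R-stream re-executes the same instructions later and compares."""

    def __init__(self, length):
        self.length = length      # sized to cover transient fault lifetimes
        self.entries = deque()

    def a_commit(self, pc, result):
        # A-stream side: pass committed state to the buffer
        self.entries.append((pc, result))

    def r_check(self, pc, result):
        # R-stream side: compare against the A-stream copy;
        # False means the two thread states disagree => fault detected
        expected = self.entries.popleft()
        return (pc, result) == expected

buf = DelayBuffer(length=64)
buf.a_commit(0x400, 42)
buf.a_commit(0x404, 7)
assert buf.r_check(0x400, 42)        # streams agree: no fault
assert not buf.r_check(0x404, 9)     # corrupted result: fault detected
```

In the real design the comparison is done by hardware at R-stream commit, and the committed R-stream state doubles as the recovery checkpoint.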

Leveraging Microarchitecture Trends
- Simultaneous Multithreading
- Control Flow and Data Flow Prediction
- Hierarchy and Replication

Simultaneous Multithreading
- SMT [Tullsen, Eggers, Emer, Levy, Lo, Stamm: ISCA-23]: multiple contexts space-share a wide-issue superscalar processor
- Key ideas:
  - Performance: better overall utilization of a highly parallel uniprocessor
  - Complexity: leverage well-understood superscalar mechanisms to make multiple contexts transparent
- SMT is (most likely) in next-generation product plans

Control and Data Prediction
(a) program segment:
  i1: r1 <= r6 - 11
  i2: r2 <= r1 << 2
  i3: r3 <= r1 + r2
  i4: branch i5, r3 == 0
  i5: r5 <= ...
[Figure: (b) the base schedule serializes i1 through i5, each paying the pipeline latency; (c) with prediction, all five instructions issue in cycle 1.]
- Delay Buffer contains perfect predictions for R-stream!
- Existing prediction-validation hardware provides fault detection!
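The last two points can be sketched as follows; the function name and values are hypothetical. The R-stream treats the A-stream outcome read from the Delay Buffer as a branch/value prediction, so the processor's normal prediction-validation check doubles as the fault check.

```python
def r_stream_validate(delay_buffer_value, r_stream_value):
    """Illustrative sketch: the A-stream outcome from the Delay Buffer
    serves as a 'perfect prediction' for the R-stream. A validation
    failure, which ordinarily signals a misprediction, here signals
    a transient fault in one of the two streams."""
    prediction = delay_buffer_value          # from the A-stream
    fault_detected = (prediction != r_stream_value)
    return fault_detected

assert r_stream_validate(True, True) is False    # outcomes agree
assert r_stream_validate(True, False) is True    # disagreement => fault
```

Because the R-stream's "predictions" are normally always correct, it suffers no misprediction penalties and runs faster than an independent second copy would.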

Trace Processors
- Trace: a long dynamic sequence of instructions (16 to 32), and the fundamental unit of operation in Trace Processors
- Replicated PEs => inherent, coarse level of hardware redundancy
  - Permanent fault coverage: A-/R-stream traces go to different PEs
  - Simple re-configuration: remove a PE from the resource pool
[Figure: four processing elements (PE 0 through PE 3), each with predicted local registers, issue buffers, and functional units, fed by next-trace prediction, the trace cache, global rename maps, and live-in value prediction; global registers, the data cache, and speculative state are shared.]
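The two redundancy bullets can be sketched as a trace-to-PE assignment policy; the function and its modulo scheme are illustrative assumptions, not the deck's actual dispatch logic. The A- and R-stream copies of a trace land on different PEs, and a PE with a permanent fault is simply dropped from the pool.

```python
def assign_pe(trace_id, stream, num_pes, faulty=frozenset()):
    """Hypothetical assignment policy: route the A- and R-stream copies
    of the same trace to different processing elements, skipping any PE
    removed from the resource pool after a permanent fault."""
    pool = [pe for pe in range(num_pes) if pe not in faulty]
    base = trace_id % len(pool)
    offset = 0 if stream == "A" else 1   # R-copy lands one PE over
    return pool[(base + offset) % len(pool)]

a_pe = assign_pe(5, "A", num_pes=4)
r_pe = assign_pe(5, "R", num_pes=4)
assert a_pe != r_pe                          # copies never share a PE
assert assign_pe(5, "A", 4, faulty={1}) != 1 # faulty PE out of the pool
```

Since each copy of a trace executes on different hardware, even a permanent fault in one PE corrupts at most one of the two copies, and the Delay Buffer comparison still catches the disagreement.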

AR-SMT on a Trace Processor
[Figure: pipeline stages FETCH, DISPATCH, ISSUE, EXECUTE, RETIRE. The trace predictor, trace cache, and decode/rename/allocate logic are time-shared between the streams; the PEs are space-shared, holding A-stream traces plus one R-stream trace whose "prediction" comes from the Delay Buffer. The register file, data cache, and disambiguation unit serve both streams; committing R-stream state reclaims resources.]

Performance Results
[Figure: normalized execution time (single thread = 1) for 4-PE and 8-PE configurations on comp, gcc, go, jpeg, and li; bars fall between 1 and 1.3.]

Performance Results
[Figure: percentage of all cycles (0 to 30%) during which the A-stream and the R-stream each occupy a given number of PEs, from 0 through 8.]

Summary
- Technology-driven performance improvements create a new fault environment: frequent, arbitrary transient faults
- Leverage microarchitecture performance trends for broad-coverage, low-overhead fault tolerance:
  - SMT-based time redundancy
  - Control and data prediction
  - Hierarchical processors
- Introducing a second, redundant thread increases execution time by only 10% to 30%

Other Interesting Perspectives
- D. Siewiorek, "Niche Successes to Ubiquitous Invisibility: Fault-Tolerant Computing Past, Present, and Future," FTCS-25
  - (Quote) "Fault-tolerant architectures have not kept pace with the rate of change in commercial systems."
  - Fault tolerance must make unconventional in-roads into commodity processors: leverage the commodity microarchitecture
- P. Rubinfeld, "Managing Problems at High Speeds," Virtual Roundtable on the Challenges and Trends in Processor Design, Computer, Jan. 1998
  - Implications of very high clock rate, dense designs

Future Work
- Performance and design studies:
  - Isolate the individual performance impact of the SMT and prediction aspects
  - Delay Buffer size vs. performance: SMT scheduling flexibility
  - Support more threads
  - Design and validate a full AR-SMT fault-tolerant system
- Derive fault models (collaboration?)
- Simulate fault coverage:
  - Fault injection
  - Test different parts of the processor: does coverage vary?
  - Vary the duration of transient faults
  - Permanent faults and re-configuration