Eric Rotenberg, Karthik Sundaramoorthy, Zach Purser


Karthik Sundaramoorthy, Zach Purser
Dept. of Electrical and Computer Engineering
North Carolina State University
http://www.tinker.ncsu.edu/ericro  ericro@ece.ncsu.edu

Many Means to an End
A program is merely a specification:
- A processor executes the full dynamic instruction stream
- A shorter instruction stream can be constructed with the same overall effect
(Figure: the original dynamic instruction stream and an equivalent, shorter instruction stream produce IDENTICAL final output.)
Result: as little as 20% of the original dynamic program can produce the same result
- Tech report: Exploiting Large Ineffectual Instruction Sequences, Rotenberg, Nov. 1999

Many Means to an End
Key idea:
- Only a small part of the program is needed to make full, correct, forward progress
- The catch: this is speculative. The original program must be monitored to determine the essential component

Cooperative Redundant Threads
1. Speculatively create a shorter version of the program
- The operating system creates two redundant processes
- One of the programs is monitored for:
  - Ineffectual writes
  - Highly-predictable branches
- With high confidence, but no certainty, future instances of ineffectual and branch-predictable computation are bypassed in the other program copy

Cooperative Redundant Threads
2. Run the two versions on a single-chip multiprocessor (CMP) or simultaneous multithreaded (SMT) processor
- Names:
  - Short program: Advanced Stream, or A-stream
  - Full program: Redundant Stream, or R-stream
- The A-stream speculatively runs ahead and communicates control/data outcomes to the R-stream
- The R-stream consumes the outcomes as predictions but still redundantly produces the same information:
  - The R-stream executes more efficiently
  - The R-stream verifies the speculative A-stream; if the A-stream deviates, its context is recovered from the R-stream

Cooperative Redundant Threads
Two potential benefits:
1. Improved single-program performance: faster than running only the original program
2. Improved fault tolerance: partial redundancy allows detection of transient hardware faults, and faults can also be tolerated via the existing recovery mechanism

Talk Outline
- Introduction to CRT
- Microarchitecture description
- Understanding performance benefits
- Performance results
- Understanding fault tolerance benefits
- The bigger picture: harnessing CMP/SMT processors
- Conclusions
- Future work

Example Microarchitecture
(Figure: block diagram. The A-stream pipeline (IR-predictor, branch predictor, I-cache, execute core, reorder buffer, D-cache) and the R-stream pipeline (branch predictor, I-cache, execute core, reorder buffer, D-cache) are coupled by the delay buffer. The IR-detector monitors the R-stream and feeds the IR-predictor; a recovery controller restores A-stream state from the R-stream.)

A-stream Creation
Creating the A-stream:
1. IR-predictor (instruction-removal predictor)
- Built on top of a conventional branch predictor
- Generates the next PC in a new way:
  - The next PC reflects skipping past any number of instructions that would otherwise be fetched/executed
  - It also indicates which instructions within the fetch block to discard
2. IR-detector: monitors the R-stream and detects candidate instructions for future removal
- The IR-detector indicates removal information to the IR-predictor
- Repeated indications cause the IR-predictor to remove future instances

IR-predictor (Base)
Indexed like gshare. Each table entry contains information for one dynamic basic block:
- Tag
- 2-bit counter to predict the branch
- Per-instruction resetting confidence counters, updated by the IR-detector:
  - Counter incremented if the instruction is detected as removable
  - Counter reset to zero otherwise
  - Saturated counter => instruction removed from the A-stream when next encountered
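The indexing and the resetting-counter policy above can be sketched as follows. This is a minimal illustration, not the hardware encoding; the class layout and names are invented, and the constants simply echo the evaluated configuration (16 counters per entry, threshold of 32, 16 bits of global history).

```python
THRESHOLD = 32  # confidence threshold from the evaluated configuration

def gshare_index(pc, global_history, bits=16):
    """gshare-style table index: PC XOR global branch history, truncated."""
    return (pc ^ global_history) & ((1 << bits) - 1)

class IRPredictorEntry:
    """Predictor state for one dynamic basic block (illustrative layout)."""
    def __init__(self, tag, num_instrs):
        self.tag = tag
        self.branch_counter = 0              # 2-bit counter to predict the branch
        self.confidence = [0] * num_instrs   # per-instruction resetting counters

    def update(self, removable_flags):
        """IR-detector feedback: one flag per instruction in the block."""
        for i, removable in enumerate(removable_flags):
            if removable:
                # saturating increment
                self.confidence[i] = min(self.confidence[i] + 1, THRESHOLD)
            else:
                # resetting counter: any non-removable instance resets to zero
                self.confidence[i] = 0

    def removal_mask(self):
        """Instructions predicted removable on the next encounter."""
        return [c >= THRESHOLD for c in self.confidence]
```

The resetting (rather than merely decrementing) counters make the predictor conservative: a single non-removable instance wipes out accumulated confidence for that instruction.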

IR-predictor (Improved)
Reducing fetch cycles in the A-stream
(Figure: with the base IR-predictor, blocks A, B, C, and D are all fetched even when B and C are fully removed; with the improved IR-predictor, fetch skips from A directly to D.)

IR-predictor (Improved)
Bypassing fetch has the same effect as a taken branch!
Previous example:
- Convert the branch ending block A to a taken branch whose target is D
At least two possible methods:
1. Include the converted target (D) and the implied intervening branch outcomes (B, C) in block A's entry
2. Include the intervening branch outcomes (B, C) in block A's entry, but use a separate BTB to store numerous targets per static branch
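Method 1 above can be sketched as a lookup on block A's predictor entry. The field names and example values here are invented for illustration; they are not the paper's encoding.

```python
def next_fetch(entry, fall_through_pc):
    """Return the next fetch PC plus the implied intervening branch outcomes.

    If the region after block A is predicted entirely removable, the branch
    ending A behaves like a taken branch to the converted target (D). The
    outcomes of the skipped branches (ending B and C) are still recorded so
    the complete control-flow history can be sent down the delay buffer.
    """
    if entry.get('converted_target') is not None:
        return entry['converted_target'], entry['implied_outcomes']
    # no conversion stored: fetch continues normally
    return fall_through_pc, []

# Hypothetical entry for block A: skip B and C, jump straight to D.
entry_A = {'converted_target': 0xD000,
           'implied_outcomes': ['B: not-taken', 'C: taken']}
pc, outcomes = next_fetch(entry_A, fall_through_pc=0xB000)
```

An entry without a converted target falls through unchanged, which is why this sits naturally on top of a conventional branch predictor.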

IR-detector
Monitor retired R-stream instructions for three triggering conditions:
1. Unreferenced writes
2. Non-modifying writes
3. Correctly-predicted branches
Select triggering instructions as candidates for removal, along with their computation chains:
- An instruction can be removed if all of its consumers are known (its value has been killed) and all are selected for removal
- Facilitated by reverse data flow graph (R-DFG) circuits
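The first two triggering conditions can be illustrated by a pass over a retired instruction trace. This is a software sketch, not the R-DFG hardware; the trace format is invented, and registers stand in for storage locations generally (the same logic applies to memory addresses).

```python
def find_ineffectual_writes(trace):
    """Scan a retired trace for the first two triggering conditions.

    trace: list of (index, op, loc, value) tuples, op in {'read', 'write'}
    (value is None for reads). Returns two sets of trace indices:
    unreferenced writes (overwritten before any read) and non-modifying
    writes (writes of the value already present).
    """
    last_write = {}     # loc -> (index of last write, referenced since?)
    current_val = {}    # loc -> current value
    unreferenced, non_modifying = set(), set()
    for idx, op, loc, val in trace:
        if op == 'read':
            if loc in last_write:
                i, _ = last_write[loc]
                last_write[loc] = (i, True)     # the last write was consumed
        else:  # write
            if loc in last_write:
                i, referenced = last_write[loc]
                if not referenced:
                    unreferenced.add(i)         # killed without any reader
            if current_val.get(loc) == val:
                non_modifying.add(idx)          # silent write: same value
            current_val[loc] = val
            last_write[loc] = (idx, False)
    return unreferenced, non_modifying
```

In the real detector this analysis is performed incrementally as instructions merge into the R-DFG, so removal candidates (and their computation chains) are identified as values are killed rather than after the fact.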

IR-detector
(Figure: datapath. Each new instruction is merged into the R-DFG via an operand rename table whose entries hold a valid bit, reference bit, value, and producer. Non-modifying writes and unreferenced writes are selected for removal, killed instructions' chains are selected, and the results update the IR-predictor.)

Delay Buffer
A simple FIFO queue for communicating outcomes:
- The A-stream pushes
- The R-stream pops
Actually two buffers:
- Control flow buffer
  - Complete history of control flow as determined by the A-stream
  - Instruction-removal information (for matching partial data outcomes with instructions in the R-stream)
- Data flow buffer
  - Partial history of data flow, for instructions executed in the A-stream
  - Source/destination register values and memory addresses
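The two-FIFO structure can be sketched directly; the record fields are illustrative, not the hardware format.

```python
from collections import deque

class DelayBuffer:
    """Two FIFOs coupling the streams: the A-stream pushes, the R-stream pops."""
    def __init__(self):
        self.control_flow = deque()  # complete branch history + removal info
        self.data_flow = deque()     # partial outcomes, A-stream-executed only

    # --- A-stream side ---
    def push_branch(self, pc, taken, removed_mask):
        self.control_flow.append((pc, taken, removed_mask))

    def push_data(self, pc, src_vals, dest_val, mem_addr=None):
        self.data_flow.append((pc, src_vals, dest_val, mem_addr))

    # --- R-stream side: outcomes are consumed as predictions, then verified ---
    def pop_branch(self):
        return self.control_flow.popleft() if self.control_flow else None

    def pop_data(self):
        return self.data_flow.popleft() if self.data_flow else None
```

The control-flow side is complete while the data-flow side is partial, which is why each control-flow record carries removal information: it lets the R-stream line up the sparse data outcomes with the right instructions.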

IR-mispredictions
Instruction-removal misprediction (IR-misprediction):
- Instructions were removed from the A-stream that should not have been removed
- Undetectable by the A-stream
- IR-mispredictions corrupt A-stream context and must be resolved by the R-stream

Handling IR-mispredictions
Three things are needed:
1. Detect IR-mispredictions
   - Both the R-stream and the IR-detector perform checks
2. Get ready for state recovery
   - Back up the IR-predictor (branch predictor)
   - Flush the delay buffer, flush ROB_A, flush ROB_R
   - Set PC_A = PC_R
3. ...

Handling IR-mispredictions (cont.)
3. Pinpoint corrupted architectural state in the A-stream and recover state from the R-stream
- The entire register file is copied from the R-stream to the A-stream
- The recovery controller maintains a list of potentially tainted memory addresses
- Restore values are communicated via the delay buffer, operating in the reverse direction

IR-misprediction Detection
- IR-mispredictions usually surface as branch/value mispredictions in the R-stream
- Some IR-mispredictions take a long time to show symptoms
- The IR-detector can detect a problem sooner:
  - Compare predicted and computed removal information
  - These checks are redundant with the R-stream checks, but the recovery model requires a last line of defense
  - The last line of defense bounds the state held in the recovery controller

IR-misprediction Recovery
(Figure: store 1 is executed in the A-stream and may need a store-undo; store 2 is skipped in the A-stream and may need a store-do. As stores pass from the A-stream's ROB through the delay buffer to the R-stream's ROB and the IR-detector, the recovery controller adds and later removes the corresponding entries.)
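The recovery controller's memory bookkeeping in the figure can be sketched as a table of outstanding store addresses. The class and record format are illustrative: stores the A-stream executed may need undoing, stores it skipped may need doing, and entries are dropped once the IR-detector confirms the removal decision was correct.

```python
class RecoveryController:
    """Tracks memory addresses whose A-stream state may be wrong."""
    def __init__(self):
        self.pending = {}  # addr -> 'store-undo' | 'store-do'

    def add_store(self, addr, executed_in_astream):
        # Executed in A-stream: may need undoing if removal was wrong.
        # Skipped in A-stream: may need doing for the same reason.
        self.pending[addr] = ('store-undo' if executed_in_astream
                              else 'store-do')

    def clear_store(self, addr):
        # IR-detector verified this store's treatment; no action needed.
        self.pending.pop(addr, None)

    def tainted_addresses(self):
        """Addresses restored from the R-stream on an IR-misprediction."""
        return dict(self.pending)
```

Bounding this table is exactly why the IR-detector's checks serve as the last line of defense: once a store clears the detector, its entry can be retired from the controller.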

Review
(Figure: the example microarchitecture again: A-stream and R-stream pipelines coupled by the delay buffer, with the IR-detector feeding the IR-predictor and the recovery controller restoring A-stream state from the R-stream.)

Talk Outline
- Introduction to CRT
- Microarchitecture description
- Understanding performance benefits
- Performance results
- Understanding fault tolerance benefits
- The bigger picture: harnessing CMP/SMT processors
- Conclusions
- Future work

Understanding Performance
The A-stream's perspective:
- Performance is better simply because the program is shorter
- The R-stream plays a secondary role of validation
The R-stream's perspective:
- Better branch prediction [pre-execution: Roth & Sohi, Zilles & Sohi, Farcy et al.]; the A-stream is a helper thread
- Better value prediction (program-based, not history-based)
(Figure: a conventional predictor supplies confident but unverified predictions; the A-stream's predictions are verified by the R-stream.)

Understanding Performance (cont.)
What if fetch and execution bandwidth were unlimited?
- The critical path through the program = the serialized dependence chains of mispredicted branches
- The A-stream cannot reduce this critical path!

Understanding Performance (cont.)
Reasoning about instruction fetch and execution:
- The more execution bandwidth devoted to the R-stream (more units, bigger ROB), the less effective the A-stream
- UNLESS the A-stream can also bypass instruction fetching
- Raw instruction fetch bandwidth is not as easily increased:
  - Branch predictor throughput
  - Taken branches (trace predictors and trace caches...)
- Having a second program counter is a great alternative, if it can run ahead

Talk Outline
- Introduction to CRT
- Microarchitecture description
- Understanding performance benefits
- Performance results
- Understanding fault tolerance benefits
- The bigger picture: harnessing CMP/SMT processors
- Conclusions
- Future work

Experimental Method
- Detailed execution-driven simulator
  - Faithfully models the entire microarchitecture: the A-stream produces possibly bad control/data outcomes, the R-stream checks the A-stream and recovers, etc.
  - Simulator validation: an independent functional simulator checks the timing simulator (R-stream retired instructions)
- Simplescalar ISA and compiler
  - Inherits the inefficiency of the MIPS ISA and the gcc compiler
- SPEC95 integer benchmarks, run to completion (100M - 200M instructions)

Single Processor Configuration
- Instruction cache: size/assoc/repl = 64 kB/4-way/LRU; line size = 16 instructions; 2-way interleaved; miss penalty = 12 cycles
- Data cache: size/assoc/repl = 64 kB/4-way/LRU; line size = 64 bytes; miss penalty = 14 cycles
- Superscalar core: reorder buffer of 64, 128, or 256 entries; dispatch/issue/retire bandwidth of 4-/8-/16-way; n fully-symmetric functional units and n loads/stores per cycle (n = issue bandwidth)
- Execution latencies: address generation = 1 cycle; memory access = 2 cycles (hit); integer ALU ops = 1 cycle; complex ops = MIPS R10000 latencies

New Component Configuration (new components for cooperating threads)
- IR-predictor: 2^20 entries, gshare-indexed (16 bits of global branch history); 16 confidence counters per entry; confidence threshold = 32
- IR-detector: R-DFG = 256 instructions, unpartitioned
- Delay buffer: data flow buffer of 256 instruction entries; control flow buffer of 4K branch predictions
- Recovery controller: number of outstanding store addresses = unconstrained; recovery latency (after IR-misprediction detection) = 5 cycles to start up the recovery pipeline, then 4 register restores per cycle (64 registers performed first) and 4 memory restores per cycle (memory performed second); minimum latency (no memory) = 21 cycles

Models
- SS(64x4): single 4-way superscalar processor with 64 ROB entries
- SS(128x8): single 8-way superscalar processor with 128 ROB entries
- SS(256x16): single 16-way superscalar processor with 256 ROB entries
- CMP(2x64x4): CRT on a CMP composed of two SS(64x4) cores
- CMP(2x64x4)/byp: same as previous, but the A-stream can bypass instruction fetching
- CMP(2x128x8): CRT on a CMP composed of two SS(128x8) cores
- CMP(2x128x8)/byp: same as previous, but the A-stream can bypass instruction fetching
- SMT(128x8)/byp: CRT on SMT, where the SMT processor is built on top of SS(128x8)

IPC Results
(Figure: bar chart of IPC, 0 to 6, for SS(64x4), SS(128x8), SS(256x16), CMP(2x64x4), CMP(2x64x4)/byp, CMP(2x128x8), CMP(2x128x8)/byp, and SMT(128x8)/byp on comp, gcc, go, jpeg, li, m88k, perl, and vortex.)

Using a Second Processor for CRT
(Figure: % IPC improvement, roughly -10% to 35%, for CMP(2x64x4) vs. SS(64x4), CMP(2x64x4)/byp vs. SS(64x4), CMP(2x128x8) vs. SS(128x8), and CMP(2x128x8)/byp vs. SS(128x8) on comp, gcc, go, jpeg, li, m88k, perl, vortex, and the average.)

CRT on 2 Small Cores vs. 1 Large Core
(Figure: % IPC difference, roughly -35% to 25%, for CMP(2x64x4)/byp vs. SS(128x8) and CMP(2x128x8)/byp vs. SS(256x16) on comp, gcc, go, jpeg, li, m88k, perl, vortex, and the average.)

SMT Results
(Figure: % IPC improvement, roughly -30% to 20%, for SMT(128x8)/byp vs. SS(128x8) on comp, gcc, go, jpeg, li, m88k, perl, and vortex.)

Instruction Removal
(Figure: fraction of dynamic instructions removed, 0% to 100%, broken down into branches, propagated branches, writes, propagated writes, and propagated writes/branches, for comp, gcc, go, jpeg, li, m88k, perl, and vortex.)

Prediction Benefits
(Figure: % IPC improvement over SS(64x4), roughly -5% to 35%, for SS(64x4) with context-based value prediction, CMP(2x64x4)/byp with no value prediction, and CMP(2x64x4)/byp, on comp, gcc, go, jpeg, li, m88k, perl, and vortex.)

Talk Outline
- Introduction to CRT
- Microarchitecture description
- Understanding performance benefits
- Performance results
- Understanding fault tolerance benefits
- The bigger picture: harnessing CMP/SMT processors
- Conclusions
- Future work

Fault Tolerance [FTCS-29: AR-SMT, Rotenberg, June 1999]
- Formal analysis is left for future work
- Assumptions:
  - Single transient fault model
  - A fault eventually manifests as a bad value, appearing as a misprediction in the R-stream
- Time redundancy provides certain guarantees:
  - A single fault may cause simultaneous but different errors in both the A-stream and the R-stream
  - The streams are shifted in time: this guarantees the two redundant copies of an instruction will not both be affected

Fault Tolerance
(Figure: three scenarios showing where a transient fault (X) strikes relative to the time-shifted A-stream and R-stream.)

Fault Tolerance
Scenario #1:
- The fault is detectable, but indistinguishable from an IR-misprediction!
- Must assume an IR-misprediction
1. Don't make any special considerations
   - If the fault does not flip R-stream architectural state, the source of the problem doesn't matter: recovery works! (pipeline coverage)
   - Otherwise, the system is bad and we are unaware of it
2. Try to distinguish faults
   - If there are no prior unresolved IR-predictions, it's a fault: invoke software (e.g., restart)
   - Otherwise, default to option 1 above
3. ECC on the R-stream register file and D-cache: always fault tolerant
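Option 2 above (trying to distinguish faults from IR-mispredictions) reduces to a simple policy decision when an R-stream check fails. The sketch below is an illustration of that policy only; the function and its return strings are invented.

```python
def classify_mismatch(unresolved_ir_predictions):
    """Decide how to respond when an R-stream check fails.

    unresolved_ir_predictions: the set of speculative instruction removals
    not yet confirmed by the IR-detector at the time of the mismatch.
    """
    if not unresolved_ir_predictions:
        # No speculative removal is outstanding, so the divergence cannot
        # be an IR-misprediction: report a transient fault to software.
        return 'fault: invoke software (e.g., restart)'
    # Otherwise, default to the existing IR-misprediction mechanism:
    # recover the A-stream's context from the R-stream.
    return 'assume IR-misprediction: recover A-stream from R-stream'
```

The attraction of this policy is that it reuses the IR-misprediction recovery path for the common case, invoking heavier software-level handling only when a fault is the sole possible explanation.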

Fault Tolerance
Scenario #2:
- The affected R-stream instruction has no redundant A-stream equivalent, so there is nothing to compare against
- The error may propagate and be detected later, but possibly too late
- Currently: no coverage for scenario #2 (future work)
Scenario #3:
- The IR-misprediction is detected before the fault can cause problems
Summary:
- Can (potentially) tolerate all faults that affect redundantly executed instructions

Talk Outline
- Introduction to CRT
- Microarchitecture description
- Understanding performance benefits
- Performance results
- Understanding fault tolerance benefits
- The bigger picture: harnessing CMP/SMT processors
- Conclusions
- Future work

Bigger Picture
1. Multithreaded processors will be prevalent in the future.
2. There is vast, untapped potential for harnessing multithreaded processors in new ways.
3. A single multithreaded processor can and should flexibly provide many capabilities.
4. A multithreaded processor can and should be leveraged without making fundamental changes to existing components/mechanisms.
CRT is a concrete application of these principles.

Conclusions
CRT: flexible, comprehensive functionality within a single strategic architecture
- Multiprogrammed/parallel workload performance (CMP/SMT)
- Single-program performance with improved reliability (CRT)
- High reliability with less performance impact (AR-SMT / SRT)
Performance results:
- 12% average improvement by harnessing an otherwise-unused processing element
- CRT on 2 small cores has IPC comparable to 1 large core, but with a faster clock and a more flexible architecture
- A majority of benchmarks show significant A-stream reduction (50%); CRT on an 8-way SMT improves their performance 10%-20%
- Benefits: resolving mispredictions in advance + quality value prediction
- Demonstrated the importance of bypassing instruction fetching

Future Work
1. CRT
- Understanding performance
- Microarchitecture design space
- Pipeline organization
- Fault tolerance
- System-level issues
- Adaptivity
2. Fundamental variations of CRT
- Streamlining the R-stream
- Other A-stream shortening approaches
- Scaling to N threads
- Approximate A-streams
3. Other novel CMP/SMT applications

Related Work
Fault tolerance in high-performance commodity microprocessors:
- AR-SMT [Rotenberg, FTCS-29]
- DIVA [Austin, MICRO-32]
- SRT [Reinhardt & Mukherjee, ISCA-27]
- FTCS-29: panel on using COTS in reliable systems
- [Rubinfeld, Computer]
Much prior work exploits repetition, redundancy, and predictability in programs:
- Instruction reuse, block reuse, trace-level reuse, computation reuse
- Value speculation, silent writes
- Motivation for creating a shorter A-stream

Related Work
Understanding backward slices, pre-execution:
- [Farcy et al., MICRO-31], [Zilles & Sohi, ISCA-27], [Roth et al., ASPLOS-8 / tech reports]
  - Explicitly identify difficult computation chains, possibly for pre-execution
  - Instruction-removal is inverted with the same effect: the A-stream is the less-predictable subset, but it runs ahead
  - The A-stream is an entire, redundant program instead of many specialized kernels
Speculative multithreading [e.g., Multiscalar, DMT]:
- Replicated programs: no forking/merging of speculative thread state needed
DataScalar: redundant programs to eliminate memory reads