
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window
Huiyang Zhou
School of Computer Science, University of Central Florida

New Challenges in the Billion-Transistor Processor Era
- Performance: how do we convert the growing number of transistors into performance improvement?
- Chip multiprocessors (CMPs) improve system throughput
- Issues:
  - Contemporary CMPs rely on explicit thread-level parallelism
  - A lack of parallel tasks leaves processors idle and the system underutilized
  - Single-thread performance is not addressed
- Objective: use multiple cores collaboratively to improve the performance of single-thread applications

Performance Bottlenecks of Single-Thread Applications
- The memory-wall problem
- Promising solution: out-of-order processing with very large instruction windows [Akkary et al.], [Cristal et al.], [Lebeck et al.], [Srinivasan et al.]
  - Often requires scalable designs of critical resources, including the LSQ, issue queue, register file, etc.
- Our approach builds a very large instruction window without the need for any large centralized structures

Outline
- Motivation
- Dual-core execution (DCE)
  - Overview
  - Comparison to run-ahead execution
  - Architectural design
  - Experimental results
  - Related work
- DCE beyond performance
- Conclusion

Dual-Core Execution
[Diagram: in-order fetch -> front processor (superscalar core, out-of-order processing) -> result queue -> back processor (superscalar core) -> in-order retire]
- DCE consists of two cores coupled with a queue
- The front processor pre-processes instructions in a fast yet highly accurate way (no correctness requirement)
- Its retired instructions are kept in the result queue
- The back processor re-executes the instruction stream
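To make the coupling between the two cores concrete, here is a minimal behavioral sketch in Python. All names (`DualCoreExecution`, `retire_this_cycle`, `can_accept`, `dispatch`) are illustrative assumptions, not the paper's simulator:

```python
from collections import deque

QUEUE_CAPACITY = 1024  # result-queue size used later in the evaluation

class DualCoreExecution:
    """Behavioral sketch of DCE: the front core pre-executes the program
    and the back core re-executes it from the result queue."""

    def __init__(self, front_core, back_core):
        self.front = front_core        # fast and speculative; no correctness requirement
        self.back = back_core          # retires the precise architectural state
        self.result_queue = deque()    # simple FIFO coupling the two cores

    def cycle(self):
        # The front core never waits on long-latency cache misses (those
        # loads are invalidated); it stalls only when the queue fills up.
        for insn in self.front.retire_this_cycle():
            if len(self.result_queue) == QUEUE_CAPACITY:
                break  # queue full: the front core sees backpressure and stalls
            self.result_queue.append(insn)
        # The back core fetches from the queue rather than its own I-cache,
        # seeing a highly accurate instruction stream and warmed caches.
        while self.result_queue and self.back.can_accept():
            self.back.dispatch(self.result_queue.popleft())
```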

How Does DCE Work?
[Diagram: front processor -> result queue -> back processor, as above]
- The front processor runs faster by invalidating long-latency cache-missing loads, similar to run-ahead execution [Dundas and Mudge], [Mutlu et al.]
  - Execution results are speculative, as load misses and their dependents are invalidated
  - Branch mispredictions that depend on cache misses are not resolved
- Execution is nonetheless highly accurate, as independent operations are not affected
  - Accurate prefetches warm up the caches
  - Independent branch mispredictions are resolved correctly
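A sketch of the front processor's invalidation rule, under an assumed register-level instruction representation (the `INV` sentinel and the field names are illustrative):

```python
INV = object()  # sentinel marking a value the front core gave up on

def front_execute(insn, regs, is_long_latency_miss, issue_prefetch):
    """Front-core execution rule: never wait on a long-latency load miss."""
    # INV propagates: anything consuming an invalidated source is invalidated.
    if any(regs.get(src) is INV for src in insn.sources):
        regs[insn.dest] = INV
        return
    if insn.op == "load" and is_long_latency_miss(insn.address):
        issue_prefetch(insn.address)  # the miss still warms up the caches
        regs[insn.dest] = INV         # but the front core does not wait for it
        return
    # Independent operations execute normally, which is why branches that
    # do not depend on a miss are resolved correctly in the front core.
    regs[insn.dest] = insn.compute(regs)
```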

How Does DCE Work? (cont.)
[Diagram: front processor -> result queue -> back processor, as above]
- Re-execution in the back processor ensures correctness and provides the precise program state
  - It resolves branch mispredictions that depend on long-latency cache misses
- The back processor makes faster progress with help from the front processor:
  - A highly accurate instruction stream
  - Warmed-up data caches

How Does DCE Work? (cont.)
[Diagram: front processor -> result queue -> back processor, as above]
- The result queue keeps a large number of in-flight instructions that do not occupy any centralized structures
- The result: a highly scalable instruction window
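To put a number on this claim, using the configuration from the methodology slide later in the talk: with a 128-entry ROB in each core and a 1024-entry result queue, up to 128 + 1024 + 128 = 1280 instructions can be in flight at once, roughly ten times the baseline window, and the only structure that grows is a simple FIFO.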

Example (timeline from [Mutlu et al., ISCA '05]):
[Timeline diagram. With a small window, the processor computes and then stalls on each of Miss 1, Miss 2, and Miss 3 in turn, so the three load misses are serviced serially. With runahead, the processor keeps executing under a miss, turning the later load misses into overlapping prefetches; Loads 1-3 then hit on re-execution, saving cycles.]

Run-Ahead Execution vs. DCE
[Timeline diagram. Runahead: the runahead periods under Miss 1 and Miss 2 overlap the later misses, but each period ends with a mode transition and re-execution. DCE front processor: computes past Miss 1, Miss 2, and Miss 3 with no stalls or mode transitions. DCE back processor: Load 1 still misses, but Loads 2 and 3 hit in the warmed cache, saving more cycles than runahead.]

Outline
- Motivation
- Dual-core execution (DCE)
  - Overview
  - Comparison to run-ahead execution
  - Architectural design
  - Experimental results
  - Related work
- DCE beyond performance
- Conclusion

Architectural Design of DCE: The Front Processor
[Pipeline diagram: Fetch, Dispatch, Issue, Register Read, Execute, Write Back, Retire; instructions come from the front processor's instruction cache; local branch mispredictions are resolved within the pipeline; the LSQ and physical register file carry INV bits; loads read the L1 D-cache and the run-ahead cache; retired instructions enter the result queue]
- Changes relative to a conventional core:
  - Invalidation of long-latency cache-missing loads and propagation of INV flags
  - Store instructions do not write to the D-cache when committed
  - A run-ahead cache holds speculative store values
- Similar to run-ahead execution, but with no checkpointing and no mode transitions
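A minimal sketch of how the run-ahead cache might service the front core's speculative stores (a plain dictionary stands in for the 4 KB 4-way structure; all names are illustrative):

```python
class RunaheadCache:
    """Captures the front core's speculative stores so that its later
    loads can observe them; the real D-cache is never modified."""

    def __init__(self):
        self.lines = {}  # address -> value; an idealization of 4 KB, 4-way

    def store(self, addr, value):
        # Committed front-core stores land here instead of the L1 D-cache,
        # so there is no memory state to undo when the core is squashed.
        self.lines[addr] = value

    def load(self, addr, l1_dcache):
        # Loads check the run-ahead cache first (forwarding from
        # speculative stores), then fall through to the real L1.
        if addr in self.lines:
            return self.lines[addr]
        return l1_dcache.read(addr)

    def squash(self):
        # Misprediction recovery simply discards all speculative state.
        self.lines.clear()
```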

Architectural Design of DCE: The Back Processor
[Diagram: two pipelines (IF, ID, IS, RR, EX, WB, RET) coupled by the result queue; branch misprediction resolution flows from the back processor to the front processor]
- The back processor fetches instructions from the result queue
- Branch targets computed by the front processor serve as the corresponding branch predictions
- Branch misprediction recovery:
  - Squash the instructions in the front processor, the back processor, and the result queue
  - Copy the architectural state (register values and PC) from the back processor to the front processor
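The recovery sequence, sketched with illustrative names; the 16- or 64-cycle latency quoted in the methodology is the cost of the register-copy step below:

```python
def recover_from_back_misprediction(front, back, result_queue, correct_pc):
    """Sketch of DCE branch misprediction recovery (illustrative names)."""
    # 1. Squash all speculative work in both pipelines and in the queue.
    front.squash_all()
    back.squash_after_mispredicted_branch()
    result_queue.clear()
    # 2. Re-synchronize: only the back core holds the precise state, so
    #    copy its architectural registers and PC into the front core.
    front.registers = dict(back.architectural_registers)
    front.pc = correct_pc
    # 3. The front core resumes fetching down the correct path; the back
    #    core refills from the result queue as new instructions arrive.
    front.resume_fetch()
```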

Architectural Design of DCE: The Memory Hierarchy
[Diagram: the front core reads its L1 D-cache and reads/writes the run-ahead cache; the back core reads and writes its own L1 D-cache and also writes the front core's L1; both L1s back onto a shared L2 cache]
- Separate L1 D-caches and a shared L2 cache
- The front processor does not commit stores to its L1 D-cache
- The back processor updates both L1 D-caches
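A sketch of the resulting store-commit policy (illustrative names; the point is that only the back core's architecturally correct stores touch the real caches):

```python
def commit_store(core, addr, value, front_l1, back_l1, runahead_cache):
    """Where a committed store's data goes in the DCE memory hierarchy."""
    if core == "front":
        # Speculative: visible only to later front-core loads via the
        # run-ahead cache; both L1 D-caches are left untouched.
        runahead_cache.store(addr, value)
    else:
        # Architecturally correct: update both L1 D-caches so the front
        # core's cache stays consistent with committed memory state.
        front_l1.write(addr, value)
        back_l1.write(addr, value)
```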

Outline
- Motivation
- Dual-core execution (DCE)
  - Overview
  - Comparison to run-ahead execution
  - Architectural design
  - Experimental results
  - Related work
- DCE beyond performance
- Conclusion

Methodology
- MIPS R10000-style superscalar processors: 4-way issue, 128-entry ROB, 64-entry issue queue, 64-entry LSQ
- 32 KB 2-way L1 caches; 1024 KB 8-way L2 cache; L2 miss latency: 220 cycles
- Branch predictor: 64K-entry gshare; 32K-entry BTB
- Stride-based stream-buffer hardware prefetcher
- 1024-entry result queue; 4 KB 4-way run-ahead cache
- Latency for copying architectural register values from the back to the front processor: 16 or 64 cycles
- Memory-intensive SPEC2000 benchmarks (>40% speedup with a perfect L2) plus two computation-intensive benchmarks, bzip2 and gap

Latency Hiding Using DCE
[Stacked bar chart: normalized execution time, broken into unstalled execution, stall with full ROB (other), stall with full ROB (due to misses), and empty ROB, for several configurations per benchmark (labeled "w/o rc", "rc", and "_64") across bzip2, gap, gcc, mcf, parser, twolf, vpr, ammp, art, equake, swim, and the average]
- On average, DCE achieves a 41% speedup

Non-Uniform Branch Handling
[Bar chart: branch mispredictions per 1K retired instructions (0-28) for the front and back processors, with and without the run-ahead cache, across bzip2, gap, gcc, mcf, parser, twolf, vpr, ammp, art, equake, swim, and the average]
- On average, the back processor sees only 0.65 mispredictions per 1K instructions

Related Work
- Run-ahead execution [Dundas and Mudge], [Mutlu et al.]
- Large-instruction-window processors [Akkary et al.], [Cristal et al.], [Lebeck et al.], [Srinivasan et al.]
  - DCE eliminates the need for scalable designs of critical centralized resources
  - DCE has somewhat lower performance potential, since instructions sitting in the result queue cannot be issued
- Slipstream processors [Sundaramoorthy et al.], [Purser et al.]
  - A similar high-level architecture, but fundamentally different ways of achieving the performance improvement
  - DCE delivers higher performance with much less hardware overhead (see the paper)
- Flea-flicker two-pass pipelining [Barnes et al.]
  - In DCE, re-execution in the back processor is the key that relieves the front processor of correctness constraints, letting it run further ahead with much lower complexity overhead

Outline
- Motivation
- Dual-core execution (DCE)
  - Overview
  - Comparison to run-ahead execution
  - Architectural design
  - Experimental results
  - Related work
- DCE beyond performance
- Conclusion

DCE beyond Performance
[Diagram: front processor -> result queue -> back processor, as above]
- The redundant execution in DCE can be exploited to achieve transient-fault tolerance
  - The result queue carries the front processor's results (when not invalidated)
  - These are compared against the back processor's results to detect transient faults
  - A discrepancy triggers fault recovery via the existing branch misprediction recovery mechanism
- DCE thus achieves performance enhancement and fault tolerance simultaneously [Zhou, CAL]
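A sketch of the redundancy check performed as the back core retires an instruction (illustrative names; the actual mechanism is described in [Zhou, CAL]):

```python
def check_redundancy(front_entry, back_result, trigger_recovery):
    """Compare the two cores' results to detect transient faults."""
    # Results the front core invalidated (cache-missing loads and their
    # dependents) carry no value to compare, which is presumably why the
    # baseline coverage reported below is 81% rather than 100%.
    if front_entry.invalidated:
        return
    if front_entry.value != back_result:
        # A discrepancy means a transient fault struck one of the cores.
        # Recovery reuses the branch-misprediction machinery: squash both
        # cores and the queue, then restore the precise state and re-run.
        trigger_recovery()
```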

DCE beyond Performance (cont.)
[Bar chart: redundancy checking coverage (40%-100%) across bzip2, gap, gcc, mcf, parser, twolf, vpr, ammp, art, equake, swim, and the average]
- High redundancy coverage (81% on average) with no observable performance loss compared to the original DCE
- A slight modification to the back processor enables full redundancy coverage while still achieving a significant performance improvement (22.5%)

Conclusion & Future Work
- Dual-core execution:
  - Builds upon two-way CMPs, with relatively simple cores and small hardware changes
  - Delivers significant speedups by tolerating memory access latency
  - Offers an efficient way to improve transient-fault tolerance
- Future work: improving energy/power efficiency, since there is no need to re-execute every instruction when reliability is not a concern

Thank you! Questions?