Execution-based Prediction Using Speculative Slices

Similar documents
Speculative Multithreaded Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

Speculative Multithreaded Processors

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Execution-based Prediction Using Speculative Slices

ECE404 Term Project Sentinel Thread

The Use of Multithreading for Exception Handling

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas

Understanding The Effects of Wrong-path Memory References on Processor Performance

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota

Many Cores, One Thread: Dean Tullsen University of California, San Diego

Wrong Path Events and Their Application to Early Misprediction Detection and Recovery

Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I. Prof. Onur Mutlu Carnegie Mellon University 10/10/2012

Use-Based Register Caching with Decoupled Indexing

15-740/ Computer Architecture Lecture 16: Prefetching Wrap-up. Prof. Onur Mutlu Carnegie Mellon University

POSH: A TLS Compiler that Exploits Program Structure

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012

Computer Architecture: Multithreading (IV) Prof. Onur Mutlu Carnegie Mellon University

Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads

Multithreaded Value Prediction

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching

Bumyong Choi Leo Porter Dean Tullsen University of California, San Diego ASPLOS

Detecting Global Stride Locality in Value Streams

Master/Slave Speculative Parallelization with Distilled Programs

Data Prefetching by Dependence Graph Precomputation

Portland State University ECE 587/687. Memory Ordering

Portland State University ECE 587/687. Memory Ordering

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)

The Predictability of Computations that Produce Unpredictable Outcomes

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction

Eric Rotenberg Karthik Sundaramoorthy, Zach Purser

Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors

The Predictability of Computations that Produce Unpredictable Outcomes

Instruction Level Parallelism (Branch Prediction)

Techniques for Efficient Processing in Runahead Execution Engines

Control Hazards. Prediction

Low-Complexity Reorder Buffer Architecture*

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors

Implicitly-Multithreaded Processors

Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University

Implicitly-Multithreaded Processors

Speculative Parallelization in Decoupled Look-ahead

15-740/ Computer Architecture Lecture 28: Prefetching III and Control Flow. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N.

EECS 470 Final Exam Fall 2013

Dynamic Memory Dependence Predication

Dynamic Speculative Precomputation

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

Looking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties

Getting CPI under 1: Outline

Dynamic Control Hazard Avoidance

Dynamic Code Value Specialization Using the Trace Cache Fill Unit

SEVERAL studies have proposed methods to exploit more

SPECULATIVE MULTITHREADED ARCHITECTURES

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors

Control Hazards. Branch Prediction

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

Dynamic Scheduling with Narrow Operand Values

Improving Virtual-Function-Call Target Prediction via Dependence-Based Pre-Computation

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

Accelerating and Adapting Precomputation Threads for Efficient Prefetching

November 7, 2014 Prediction

Design of Experiments - Terminology

A Study of Control Independence in Superscalar Processors

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Threshold-Based Markov Prefetchers

A Scheme of Predictor Based Stream Buffers. Bill Hodges, Guoqiang Pan, Lixin Su

Fall 2011 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Precise Instruction Scheduling

2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set

Exploitation of instruction level parallelism

Understanding Scheduling Replay Schemes

Reducing Reorder Buffer Complexity Through Selective Operand Caching

Pre-Computational Thread Paradigm: A Survey

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Detecting Global Stride Locality in Value Streams

Wide Instruction Fetch

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

CAVA: Using Checkpoint-Assisted Value Prediction to Hide L2 Misses

Register Integration: A Simple and Efficient Implementation of Squash Reuse

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Dynamically Controlled Resource Allocation in SMT Processors

Looking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont. History Table. Correlating Prediction Table

Understanding the Backward Slices of Performance Degrading Instructions

Principles in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008

Instruction Level Parallelism

Improving Value Prediction by Exploiting Both Operand and Output Value Locality. Abstract

Transcription:

Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001

The Problem Two major barriers to achieving high ILP: MISPREDICTED BRANCHES and CACHE MISSES TRADITIONAL PREDICTION: SOMEWHAT MATURE TECHNOLOGY correctly anticipate > 90% instructions exploit patterns in outcome/address stream remaining mispredictions still expensive EXECUTION-BASED PREDICTION exploit regularity in computations speculatively compute results early for use as predictions speedups from 1 to 43% on SPECINT 2000 International Symposium on Computer Architecture (ISCA-28), July 2001 2

The Solution RETIREMENT STREAM 1 Identify frequently mispredicting instructions TIME BRANCH mispredict PROGRAM 2 Extract and pack dependant computation into code fragments called s BRANCH LOAD miss LOAD load International Symposium on Computer Architecture (ISCA-28), July 2001 3

The Solution RETIREMENT STREAM 3 Execute s in helper threads to generate predictions TIME BRANCH mispredict load miss BRANCH prediction hit LOAD idle thread LOAD miss idle thread }speedup International Symposium on Computer Architecture (ISCA-28), July 2001 4

The Outline PROBLEM INSTRUCTIONS load miss BRANCH prediction hit LOAD }speedup International Symposium on Computer Architecture (ISCA-28), July 2001 5

The Outline PROBLEM INSTRUCTIONS EXECUTION-BASED PREDICTION load BRANCH prediction miss hit LOAD }speedup International Symposium on Computer Architecture (ISCA-28), July 2001 6

The Outline PROBLEM INSTRUCTIONS EXECUTION-BASED PREDICTION PREDICTION CORRELATION load BRANCH prediction miss hit LOAD }speedup International Symposium on Computer Architecture (ISCA-28), July 2001 7

The Outline PROBLEM INSTRUCTIONS EXECUTION-BASED PREDICTION PREDICTION CORRELATION PERFORMANCE RESULTS load BRANCH prediction miss hit LOAD }speedup International Symposium on Computer Architecture (ISCA-28), July 2001 8

Problem Instructions Misses and mispredictions are not evenly distributed. EXAMPLE: PERLBMK 82 static es: 68% of misp., 9% of dynamic es 140 static loads: 67% misses, 2% of dynamic memory insts Fixing just problem inst s gives > 1/2 perf. of perfect /pred OUTCOMES OF THESE INSTRUCTIONS DO NOT EXHIBIT A PREDICTABLE PATTERN... consistently mispredicted... BUT SOMETIMES THE COMPUTATION IS REGULAR. while (...) {... ptr = ptr->next; } while (i < n) { if (object[i]!= NULL) {... } International Symposium on Computer Architecture (ISCA-28), July 2001 9

Outline PROBLEM INSTRUCTIONS EXECUTION-BASED PREDICTION O O O O An different pre-execution approach Speculative s and imprecise transformations Slice structure Slice characterization PREDICTION CORRELATION PERFORMANCE RESULTS load miss BRANCH prediction hit LOAD }speedup International Symposium on Computer Architecture (ISCA-28), July 2001 10

Previous pre-execution proposal Speculative Data-driven Multithreading: Roth and Sohi, HPCA 01 Speculatively pre-executes data-driven threads (DDTs) Register integration matches DDTs to main thread + avoids re-execution of DDT instructions + early resolution (at decode stage) - DDTs must be sub-set of original program International Symposium on Computer Architecture (ISCA-28), July 2001 11

Two Observations Two Observations: benefit comes from prefetches and predictions strict program subsets not most efficient s Our approach: generate predictions/prefetches in as efficient manner as possible. OPTIMIZE SLICES: + reduce fetch/execution overhead + reduce critical path to making prediction - need a new mechanism to correlate predictions International Symposium on Computer Architecture (ISCA-28), July 2001 12

Speculative Slices DON T ALLOW SLICES TO AFFECT ARCHITECTED STATE only generate pre-fetches and predictions need not be 100% accurate 3 CLASSES OF TRANSFORMATIONS: (NOT ORIGINALLY APPLIED BY COMPILER) Imprecise O static assertion (remove es/cold code) Not-provably safe O register allocation in the presence of aliases Previously unprofitable O if-conversion (of a subset of a block) International Symposium on Computer Architecture (ISCA-28), July 2001 13

Slice Structure problem instructions frequently in loops encapsulate loop in Program Fork Slice problem load BENEFITS: lower overhead earlier predictions amortize overhead single helper thread ISSUES & SOLUTIONS: in paper International Symposium on Computer Architecture (ISCA-28), July 2001 14

Slice Characterization CONSTRUCTED AND OPTIMIZED SLICES BY HAND encouraging results STATISTICS: 85% of s cover multiple static problem instructions 70% of s contained loops small static size O smaller than 4 * # problem instructions covered prefetch or prediction generated every ~3 dynamic inst s. small number of live-in values O 80% of s had 2 or less s can be very small International Symposium on Computer Architecture (ISCA-28), July 2001 15

Outline PROBLEM INSTRUCTIONS EXECUTION-BASED PREDICTION PREDICTION CORRELATION O O difficult problem valid regions RESULTS AND ANALYSIS load miss BRANCH prediction hit LOAD }speedup International Symposium on Computer Architecture (ISCA-28), July 2001 16

Prediction Correlation BRANCH prediction TO BENEFIT FROM A SLICE-GENERATED PREDICTION must bind it to fetched instruction overrides hardware predictor HOW ARE PREDICTIONS CORRELATED TO DYNAMIC BRANCHES? International Symposium on Computer Architecture (ISCA-28), July 2001 17

Prediction Correlation BRANCH prediction Tagged prediction queues BRANCH PC BRANCH PC BRANCH PC F T F T T T T F T F T T F CHALLENGES: Related Work: Farcy, et al, Micro 98 re-ordering predictions produced out-of-order recovering from misspeculation by main thread dealing with conditionally-executed problem es International Symposium on Computer Architecture (ISCA-28), July 2001 18

Conditionally-executed problem es program s CFG G A B C F D point E not executed on all iterations problem MINIMIZE OVERHEAD BY BUILDING SIMPLEST SLICE compute prediction for each iteration NAIVE IMPLEMENTATION predictions dequeued when used mis-alignment occurs on path CF CONDITIONALLY GENERATE PREDICTIONS? INSIGHT include existence in too much overhead existence encoded in fetch path International Symposium on Computer Architecture (ISCA-28), July 2001 19

Valid Regions A B 1st pred DEFINE REGION WHERE PREDICTION IS VALID using assumptions from building first iteration G C D E F second iteration G F B 2nd pred C D F E International Symposium on Computer Architecture (ISCA-28), July 2001 20

Valid Regions A DEFINE REGION WHERE PREDICTION IS VALID first iteration G X B 1st pred C X F D X X E using assumptions from building markers to indicate region boundary implementation discussed in paper DEQUEUE PREDICTION WHEN MARKER ENCOUNTERED using a prediction doesn t dequeue it second iteration G X F B X F C 2nd pred D X X E greater than 99% correlation accuracy International Symposium on Computer Architecture (ISCA-28), July 2001 21

Outline PROBLEM INSTRUCTIONS EXECUTION-BASED PREDICTION PREDICTION CORRELATION RESULTS AND ANALYSIS O Methodology O Results O Discussion load BRANCH prediction miss hit LOAD }speedup International Symposium on Computer Architecture (ISCA-28), July 2001 22

Methodology Used SPEC2000 integer benchmarks spectrum of program behaviors Identified dominant program phase selected 100M inst. region for simulation Built s (by hand) to cover problem instructions Warmed up simulator for 100M instructions International Symposium on Computer Architecture (ISCA-28), July 2001 23

Methodology, cont. AGGRESSIVE BASELINE: 4-wide superscalar, 128 entry window, 14 cycle mispredict penalty 2 load/store units, 4 fully pipelined integer/floating point units 64Kb YAGS, 32Kb cascaded indirect, RAS predictors Fetches across basic blocks, perfect BTB for direct es 2-way associative 64KB L1 s (64B blocks) 4-way associative 2MB unified L2 (128B blocks) 64-entry unified pre-fetch/victim buffer with hardware stream pre-fetcher Deeply-pipelined, 4-wide, out-of-order superscalar with big predictors, associative s, hardware stride pre-fetcher, and victim buffers. International Symposium on Computer Architecture (ISCA-28), July 2001 24

Results IPC 4.0 3.5 3.0 2.5 2.0 1.5 1.0 16% 3% 7% 11% 1% 16% 43% 1% 7% 12% 35% 1% base 0.5 0.0 bzip2 crafty eon gap gcc gzip mcf parser perl twolf vpr vortex speedups ranging from 1% to 43% must be regularity in /address computation speedups proportional to memory, stall time low base IPC lower opportunity cost of execution International Symposium on Computer Architecture (ISCA-28), July 2001 25

Related Work Pre-execution: Roth and Sohi: HPCA-2001 and TR-2000 SPECULATIVE SLICES: Zilles and Sohi: ISCA-2000 Limited forms of pre-execution: Roth, et al: ASPLOS-1998 and ICS-1999 Farcy, et al: Micro-1998 Slipstream processors: Sundaramoorthy, et al: ASPLOS-2000 Helper threads: Chappell, et al: ISCA-1999 Song and Dubois: TR-1998 International Symposium on Computer Architecture (ISCA-28), July 2001 26

Summary PROBLEM INSTRUCTIONS behavior not predictable with existing predictors sometimes computation is regular EXECUTION-BASED PREDICTION execute code fragments to generate prediction/prefetch imprecise transformations enable small s PREDICTION CORRELATION: VALID REGIONS monitor main thread s fetch path greater than 99% correlation accuracy Speedups of 1 to 43% over an aggressive baseline International Symposium on Computer Architecture (ISCA-28), July 2001 27