Architectures for Instruction-Level Parallelism

Similar documents
SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

Reducing Mis-speculated Instructions to Save Power. Pipeline Gating

Superscalar Organization

The Processor: Instruction-Level Parallelism

Limitations of Scalar Pipelines

For this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

ECE 571 Advanced Microprocessor-Based Design Lecture 4

Four Steps of Speculative Tomasulo cycle 0

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Case Study IBM PowerPC 620

Superscalar Processor

Advanced Instruction-Level Parallelism

Advanced Computer Architecture

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Processor (IV) - advanced ILP. Hwansoo Han

Complex Pipelines and Branch Prediction

Multiple Instruction Issue and Hardware Based Speculation

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Dynamic Memory Dependence Predication

Hardware-Based Speculation

Full Datapath. Chapter 4 The Processor 2

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

COMPUTER ORGANIZATION AND DESI

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

Copyright 2012, Elsevier Inc. All rights reserved.

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

Handout 2 ILP: Part B

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

E0-243: Computer Architecture

CS433 Homework 2 (Chapter 3)

Multiple Instruction Issue. Superscalars

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

CS433 Homework 2 (Chapter 3)

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

Data-flow prescheduling for large instruction windows in out-of-order processors. Pierre Michaud, André Seznec IRISA / INRIA January 2001

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Chapter 4 The Processor (Part 4)

CS / ECE 6810 Midterm Exam - Oct 21st 2008

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

ECE/CS 552: Introduction to Superscalar Processors

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

Pipelining. CSC Friday, November 6, 2015

Advanced Computer Architecture

CS425 Computer Systems Architecture

Hardware-based Speculation

CS252 Graduate Computer Architecture Midterm 1 Solutions

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Lecture-13 (ROB and Multi-threading) CS422-Spring

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

CS425 Computer Systems Architecture

Multiple Issue ILP Processors. Summary of discussions

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

ELE 655 Microprocessor System Design

Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches

5008: Computer Architecture

EECS 470 Lecture 7. Branches: Address prediction and recovery (And interrupt recovery too.)

Alexandria University

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

Lec 25: Parallel Processors. Announcements

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Chapter 4. The Processor

Adapted from David Patterson s slides on graduate computer architecture

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

LECTURE 3: THE PROCESSOR

Wide Instruction Fetch

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

EECC551 Exam Review 4 questions out of 6 questions

1 Hazards COMP2611 Fall 2015 Pipelined Processor

CS 152, Spring 2011 Section 8

Portland State University ECE 587/687. Superscalar Issue Logic

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Chapter 4 The Processor 1. Chapter 4D. The Processor

PowerPC 620 Case Study

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design


Super Scalar. Kalyan Basu March 21,

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Superscalar Processor Design

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 32: Pipeline Parallelism 3

Full Datapath. Chapter 4 The Processor 2

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Transcription:

Low Power VLSI System Design Lecture : Low Power Microprocessor Design Prof. R. Iris Bahar October 0, 07 The HW/SW Interface Seminar Series Jointly sponsored by Engineering and Computer Science Hardware-Software Co-design for Approximate Computing Natalie Enright Jerger Department of Electrical and Computer Engineering University of Toronto October, 07 Noon pm Barus & Holley 90 Lunch will be provided Architectures for Instruction-Level Parallelism Scalar Pipeline (baseline) Pipeline Depth = D Operation Latency = Peak IPC = SUCCESSIVE INSTRUCTIONS 0 D IF DE EX WB 7 8 9 TIME IN CYCLES (OF BASELINE MACHINE) Limitations of Inorder Pipelines Clocks per Instruction (CPI) of inorder pipelines degrades very sharply if the machine parallelism is increased beyond a certain point, i.e. when NxM approaches average distance between dependent instructions Forwarding is no longer effective must stall more often» Pipeline may never be full due to frequent dependency stalls!! 7 8 9 IF DE EX WB 7

Out of Order Execution - General Scheme Many high performance processors use out of order execution. Most of them are doing the fetching and the retirement IN ORDER, but it executes in OUT OF ORDER A Modern(ish) Superscalar Processor In Order Fetch Instruction/Decode Buffer Decode Dispatch Buffer Dispatch Fetch & Decode Instruction pool Retire (commit) Issue Reservation Stations In-order Execute Out-of-order In-order 9 Out of Order In Order Execute Finish Complete Retire Reorder/ Completion Buffer Store Buffer 0 Out of Order Execution The Cost of Speculation Execute instr. based on data flow rather than program order. Basic idea: The fetch is done fast enough to fill-out a window of instructions. Of all the instructions in the window, look for ones ready to execute: All the data on which the instructions are dependent are ready Resources are available. As soon as the instruction is execution it needs to signal to all the instructions which are depend on it that the input is ready. Triggers wake-up and instruction select for next cycle Advantages: Help exploit Instruction Level Parallelism (ILP) Help cover latencies (e.g., cache miss, divide) How does the processor find enough ready instructions? Look beyond branch boundaries Modern processor have fairly sophisticated branch prediction mechanisms (in HW and SW) Front end fills instruction window with instructions down path of predicted branch What happens if the branch prediction is wrong? What is the cost in terms of performance and power of keeping these wrong path instructions in the instruction window?

Reducing Mis-speculated Instructions to Save Power Executing wrong path instruction is a waste of power Does nothing to improve effective IPC either IDEA: if branch prediction becomes too speculative, don t bother continuing to fetch instruction past branches Fetch unit stops reading new instructions from the cache Instruction window does not take new instructions Instruction execution rate may slow, but only until predicted branches have been resolved How do we know if an instruction flow has become too speculative? What do we need to monitor? Pipeline Gating Low-confidence branch counter records # of unresolved branches that reported as low-confidence. The processor ceases instruction fetch if there are more than N unresolved low-confident branches in the pipeline. What is a Low Confidence Branch? Confidence Prediction A confidence predictor monitors branch instructions and compares the predicted action from the actual action Each branch is eventually determined to have been predicted correctly or incorrectly once it is resolved If a branch prediction is incorrect several times in a row, it is a low confidence branch. If a branch has unpredictable behavior it is likely to be labeled low confidence NOTE: a branch prediction direction and a confidence rating are two different things. HC Confidence LC Prediction Outcome C I C HC C LC I HC I LC 7 8

Do We Need Wide Issue Machines? Varying Program Needs Many programs never achieve IPC values close to the maximum issue of the machine Branch Misprediction Dependency Chains Cache Misses Overall IPC is not indicative of superscalar needs of a program This program shows a nearly X difference in IPC across successive snapshots of the program execution 9 0 Partitioned Cluster Architecture Clusters Instructions share the fetch, decode and renaming units There is a steering logic that sends instructions to different clusters based on scheduling algorithm There is a bypass logic between the two clusters. Each cluster has reservation stations (RS) feeding 8 specialpurpose functional units. The RS hold 8 instr. and permit out-of-order instruction selection. Small clusters reduce complexity of wake-up and instr. select logic Intra-cluster communication is done in the same cycle as instruction dispatch. Forward data to other clusters takes at least cycles.

The Pipeline Balancing Approach Our 8-Wide Issue Processor Model Monitor the varying issue requirements of each program Overall issue rate Floating point issue rate History of past behavior IDEA: Tune processor issue and execution resources according to the needs of the program Goal: Reduce power, retain performance Right Cluster Rename arbiter0 arbiter arbiter arbiter 8-entry Unified Issue Queue section0 section section section PLB Monitors Right File Right Cluster Functional s DCache INT, FP, MEM Commit Left Cluster arbiter0 arbiter arbiter arbiter Left File Left Cluster Functional s INT, FP, MEM Commit 8 Triggering PLB by Tracking Issue IPC Issue Width Issue IPC Switch to -, -, or 8- wide issue depending on issue IPC of previous sampling window 0 7 9 7 Sample Window