Lecture 19: Instruction Level Parallelism

Administrative
- Homework #5 due
- Homework #6 handed out today

Last Time
- DRAM organization and implementation

Today
- Static and dynamic ILP
- Instruction windows
- Register renaming

Where Are We?
- DEC Alpha 21064
  - Pipelined in-order processor
  - Simple branch prediction
  - Instruction/data caches (on chip)
- DEC Alpha 21264
  - Out-of-order instruction execution
  - Superscalar
  - Sophisticated branch prediction

How Do We Speed up the Pipeline?
- Instruction Level Parallelism (ILP)
  - Multi-issue (in-order execution to multiple pipes)
  - Dynamic scheduling ("superscalar")
  - Compiler schedules ILP (VLIW/EPIC)
- Undetermined dependencies at compile time -> dynamic scheduling
  - Object code compatibility
  - Simplifies the compiler
- WAR/WAW hazards -> register renaming
      ADD R1,R2,R3
      SUB R1,R4,R5
- Too many branches -> better branch prediction
  - Or use predication to eliminate branches
- Unknown dependencies (control/data) -> speculate

Multiple Issue
- No instruction reordering
[Figure: Instruction Memory -> Instruction Buffer -> Hazard Detect -> Register File]

Multiple Issue (Details)
- Dependencies and structural hazards checked at run time
  - Can run existing binaries
  - Recompile for performance, not correctness
  - Example: Pentium
- More complex issue logic
  - Swizzle next N instructions into position
  - Check dependencies and resource needs
  - Issue M <= N instructions that can execute in parallel

Example Multiple Issue
Issue rules: at most 1 load/store, at most 1 floating-point op
Latency: load = 1, int = 1, float-mult = 1, float-add = 1

cycle       first instruction                      second instruction
  1   LOOP: LD    F0, 0(R1)     // a[i]
  2         LD    F2, 0(R2)     // b[i]
  4 (stall) MULTD F8, F0, F2    // a[i] * b[i]
  5         ADDD  F12, F8, F16  // + c
  6         SD    F12, 0(R3)    // d[i]            ADDI R1, R1, 4
  7         ADDI  R2, R2, 4                        ADDI R3, R3, 4
  8         ADDI  R4, R4, 1     // increment i
  9         SLT   R5, R4, R6    // i<n-1
 10         BNEQ  R5, R0, LOOP

Old CPI = 12/11 = 1.09
New CPI = 10/11 = 0.91
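
The CPI figures on this slide are cycles divided by instructions for one loop iteration. A minimal sketch of that arithmetic follows; the function name `cpi` is illustrative, and reading the 12 single-issue cycles as 11 instructions plus the one stall is an assumption.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction for one loop iteration."""
    return cycles / instructions

INSTRUCTIONS = 11          # instructions in the loop body on the slide

old = cpi(12, INSTRUCTIONS)   # single issue (assumed: 11 issue cycles + 1 stall)
new = cpi(10, INSTRUCTIONS)   # multiple issue, last instruction issues in cycle 10

print(round(old, 2))          # 1.09
print(round(new, 2))          # 0.91
print(round(old / new, 2))    # 1.2  (speedup from multiple issue)
```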

Rescheduled for Multiple Issue
Issue rules: at most 1 LD/ST, at most 1 floating-point op
Latency: LD = 1, int = 1, F* = 1, F+ = 1

cycle       first instruction                      second instruction
  1   LOOP: LD    F0, 0(R1)     // a[i]            ADDI R1, R1, 4
  2         LD    F2, 0(R2)     // b[i]            ADDI R2, R2, 4
  4         MULTD F8, F0, F2    // a[i] * b[i]     ADDI R4, R4, 1   // increment i
  5         ADDD  F12, F8, F16  // + c             SLT  R5, R4, R6  // i<n-1
  6         SD    F12, 0(R3)    // d[i]            ADDI R3, R3, 4
  7         BNEQ  R5, R0, LOOP

Old CPI = 0.91
New CPI = 7/11 = 0.64
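
A rough way to see why reordering helps is to model greedy in-order issue: each cycle, take instructions in program order until a dependence or an issue rule stops the packet. The sketch below is an illustration under stated assumptions, not the lecture's hardware: issue width 2, a consumer cannot issue in the same cycle as its producer, and a load's consumer waits one extra cycle (`load_delay=1`, chosen because it matches the stall shown in the cycle column). All names are hypothetical.

```python
LIMIT = {"mem": 1, "fp": 1}   # issue rules: at most 1 load/store, 1 FP op per cycle

def schedule(program, width=2, load_delay=1):
    """Issue cycle of each instruction under greedy in-order issue."""
    avail = {}                 # register -> first cycle its new value can be read
    cycles, i, cycle = [], 0, 1
    while i < len(program):
        used, issued = {}, 0
        while i < len(program) and issued < width:
            text, unit, dst, srcs = program[i]
            if any(avail.get(s, 0) > cycle for s in srcs):
                break          # RAW hazard: stall, younger instructions wait too
            if used.get(unit, 0) >= LIMIT.get(unit, width):
                break          # structural limit for this unit
            used[unit] = used.get(unit, 0) + 1
            if dst is not None:
                is_load = unit == "mem"       # stores have dst None
                avail[dst] = cycle + 1 + (load_delay if is_load else 0)
            cycles.append(cycle)
            issued += 1
            i += 1
        cycle += 1
    return cycles

LD0 = ("LD F0,0(R1)", "mem", "F0", ("R1",))
LD2 = ("LD F2,0(R2)", "mem", "F2", ("R2",))
MUL = ("MULTD F8,F0,F2", "fp", "F8", ("F0", "F2"))
ADD = ("ADDD F12,F8,F16", "fp", "F12", ("F8", "F16"))
SD = ("SD F12,0(R3)", "mem", None, ("F12", "R3"))
I1 = ("ADDI R1,R1,4", "int", "R1", ("R1",))
I2 = ("ADDI R2,R2,4", "int", "R2", ("R2",))
I3 = ("ADDI R3,R3,4", "int", "R3", ("R3",))
I4 = ("ADDI R4,R4,1", "int", "R4", ("R4",))
SLT = ("SLT R5,R4,R6", "int", "R5", ("R4", "R6"))
BR = ("BNEQ R5,R0,LOOP", "int", None, ("R5", "R0"))

original = [LD0, LD2, MUL, ADD, SD, I1, I2, I3, I4, SLT, BR]
rescheduled = [LD0, I1, LD2, I2, MUL, I4, ADD, SLT, SD, I3, BR]

print(schedule(original))     # [1, 2, 4, 5, 6, 6, 7, 7, 8, 9, 10]
print(schedule(rescheduled))  # [1, 1, 2, 2, 4, 4, 5, 5, 6, 6, 7]
```

Under these assumptions the model finishes the original order in 10 cycles and the rescheduled order in 7, the same totals as the CPI figures 10/11 and 7/11.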

The Problem with Static Scheduling
- In-order execution: an unexpected long latency blocks ready instructions from executing
- Binaries need to be rescheduled for each new implementation
- Small number of named registers becomes a bottleneck

      LW  R1, C        // miss, 50 cycles
      LW  R2, D
      MUL R3, R1, R2
      SW  R3, C
      LW  R4, B        // ready
      ADD R5, R4, R9
      SW  R5, A
      LW  R6, F
      LW  R7, G
      ADD R8, R6, R7
      SW  R8, E

Dynamic Scheduling
- Determine execution order of instructions at run time
- Schedule with knowledge of run-time variable latency (cache misses)
- Compatibility advantages
  - Avoid need to recompile old binaries
  - Avoid bottleneck of small named register sets, but still need to deal with spills
- Significant hardware complexity

Example
Top = without dynamic scheduling, bottom = with dynamic scheduling

[Figure: pipeline timing diagram, cycles 1 to 30. Both halves list the same instructions with these stage letters, in order:]

      LD  R1,A        I M X X X X X X X X X C
      LD  R2,B        I C
      MUL R3,R1,R2    I X X C
      LD  R4,C        I M X X X X X X X X X C
      LD  R5,D        I C
      MUL R6,R5,R4    I X X C
      ADD R7,R3,R6    I X C
      SD  R7,E        I C

Legend: 10-cycle data memory (cache) miss, 3-cycle MUL latency, 2-cycle add latency
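
The benefit in this example comes from overlapping the two cache misses. The sketch below is an idealized model, not a reproduction of the diagram: it assumes one instruction enters per cycle, unlimited execution resources, a 1-cycle load hit and store (both assumptions), and that LD R1,A and LD R4,C are the loads that miss. The 10-, 3-, and 2-cycle latencies are the ones in the legend. The function name `finish_time` is hypothetical.

```python
# (text, destination, sources, latency in cycles)
program = [
    ("LD R1,A", "R1", [], 10),           # assumed cache miss
    ("LD R2,B", "R2", [], 1),            # assumed hit
    ("MUL R3,R1,R2", "R3", ["R1", "R2"], 3),
    ("LD R4,C", "R4", [], 10),           # assumed cache miss
    ("LD R5,D", "R5", [], 1),            # assumed hit
    ("MUL R6,R5,R4", "R6", ["R5", "R4"], 3),
    ("ADD R7,R3,R6", "R7", ["R3", "R6"], 2),
    ("SD R7,E", None, ["R7"], 1),
]

def finish_time(program, in_order):
    """Cycle in which the last instruction completes."""
    ready = {}           # register -> cycle its value becomes available
    prev_start = 0
    finish = 0
    for slot, (text, dst, srcs, lat) in enumerate(program, start=1):
        operands = max((ready[s] for s in srcs), default=0)
        # In order: cannot start before the previous instruction has started.
        # Dynamic: only has to wait for its own fetch slot and its operands.
        earliest = prev_start + 1 if in_order else slot
        start = max(earliest, operands)
        prev_start = start
        if dst:
            ready[dst] = start + lat
        finish = max(finish, start + lat)
    return finish

print(finish_time(program, in_order=True))    # 28
print(finish_time(program, in_order=False))   # 20
```

In this model the in-order machine serializes the two misses because LD R4,C sits behind the stalled MUL, while the dynamically scheduled machine starts the second miss while the first is still outstanding.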

Dynamic Scheduling Basic Concept
[Figure: PC -> sequential instruction stream -> window of waiting instructions -> issue logic -> execution resources and register file -> instructions waiting to commit]

Sequential instruction stream:
      LW  R1,A
      LW  R2,B
      ADD R3,R1,R2
      SW  R3,C
      LW  R4,8(A)
      LW  R5,8(B)
      ADD R6,R4,R5
      SW  R6,8(C)
      LW  R7,16(A)
      LW  R8,16(B)
      ADD R9,R7,R8
      SW  R9,16(C)
      LW  R10,24(A)
      LW  R11,24(B)

Window of instructions waiting on operands & resources:
      ADD R3,R1,R2
      SW  R3,C
      ADD R6,R4,R5
      SW  R6,8(C)
      LW  R7,16(A)
      LW  R8,16(B)
      ADD R9,R7,R8
      SW  R9,16(C)

Instructions waiting to commit:
      LW  R4,8(A)
      LW  R5,8(B)

Implementation Issues
- Instruction fetch
  - Fixed number of instruction window slots (e.g., 32)
  - Generic, or partitioned over execution units
  - Fetch next sequential instruction whenever a slot is free
- Instruction decode
  - Mark input and output registers busy
  - Slots monitor register status and execution unit reservation tables
- Instruction issue
  - All input operands available
  - Output operand (register) not busy (WAW, WAR) due to earlier instruction
  - Execution unit is available
- Instruction commit
  - All previous instructions have committed (why?)

Implementation I - Register Scoreboard
- Tracks register writes: busy = pending write
- Detects hazards for the scheduler

Register file valid bits (0 = write pending):
      R0 1   R1 0   R2 1   R3 0   R4 0   R5 0   R6 0   R7 0

      ADD R3,R1,R2     wait until R1 is valid; mark R3 valid when complete
      SUB R4,R0,R3     wait for R3 valid bit (= 0 if write pending)

What about:
      LD  R3,0(R7)
      ADD R4,R3,R5
      LD  R3,4(R7)
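
The valid-bit check can be sketched in a few lines. This is a minimal illustration assuming one valid bit per register and the issue rule "all sources valid and destination not pending"; the class and method names are hypothetical. It is applied to the "What about" sequence, starting from an all-valid register file (an assumption).

```python
class Scoreboard:
    """One valid bit per register: False means a write is pending."""

    def __init__(self, registers):
        self.valid = {r: True for r in registers}

    def can_issue(self, dst, srcs):
        # RAW: every source must be valid.
        # WAW: the destination must not already have a write pending.
        return all(self.valid[s] for s in srcs) and self.valid[dst]

    def issue(self, dst):
        self.valid[dst] = False      # mark pending write

    def complete(self, dst):
        self.valid[dst] = True       # result written back

sb = Scoreboard([f"R{i}" for i in range(8)])

first = sb.can_issue("R3", ["R7"])           # LD R3,0(R7)
sb.issue("R3")
second = sb.can_issue("R4", ["R3", "R5"])    # ADD R4,R3,R5: RAW on R3
third = sb.can_issue("R3", ["R7"])           # LD R3,4(R7): WAW on R3
print(first, second, third)                  # True False False

sb.complete("R3")
print(sb.can_issue("R4", ["R3", "R5"]))      # True
```

The second load has no true dependence on the first, yet it has to wait because both write R3.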

Implementing A Simple Instruction Window

issue order   instruction       dst   src1 (reg, rdy)   src2 (reg, rdy)
     3        ADD R3,R1,R2      R3    R1  0             R2  1
     5        SW  R3,0(R8)            R3  0             R8  1
     2        ADD R6,R4,R5      R6    R4  0             R5  0
     4        SW  R6,8(R8)            R6  0             R8  1
     1        LW  R7,16(R9)     R7    R9  1                 1

Register file valid bits:
      R0 1   R1 0   R2 1   R3 0   R4 0   R5 0   R6 0   R7 0

- Window slots are often called reservation stations
- reg = name, value
- Result sequence: R4, R7, R5, R1, R6, R3
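
The issue-order column follows from the ready bits and the result sequence. A minimal sketch, assuming an instruction issues as soon as all of its source registers are ready and that results arrive in the order given on the slide; the function name and data layout are hypothetical.

```python
# Window slots from the slide: (instruction, source registers)
window = [
    ("ADD R3,R1,R2", ["R1", "R2"]),
    ("SW R3,0(R8)", ["R3", "R8"]),
    ("ADD R6,R4,R5", ["R4", "R5"]),
    ("SW R6,8(R8)", ["R6", "R8"]),
    ("LW R7,16(R9)", ["R9"]),
]

# Initial ready bits of the source registers, as in the window table.
ready = {"R1": False, "R2": True, "R3": False, "R4": False,
         "R5": False, "R6": False, "R8": True, "R9": True}

results = ["R4", "R7", "R5", "R1", "R6", "R3"]   # result sequence from the slide

def issue_order(window, ready, results):
    ready = dict(ready)          # work on a copy
    order = {}

    def issue_ready():
        for text, srcs in window:
            if text not in order and all(ready[s] for s in srcs):
                order[text] = len(order) + 1

    issue_ready()                # anything ready before the first result
    for reg in results:
        ready[reg] = True        # broadcast: matching source fields become ready
        issue_ready()
    return [order[text] for text, _ in window]

print(issue_order(window, ready, results))   # [3, 5, 2, 4, 1]
```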

Instruction Window Policies
- Add an instruction to the window
  - Only when dest register is not busy
  - Mark destination register busy
  - Check status of source registers and set ready bits
- When each result is generated
  - Compare dest register field to all waiting instruction source register fields
  - Update ready bits
  - Mark dest register not busy
- Issue an instruction when
  - Execution resource is available
  - All source operands are ready
- Result
  - Issues instructions out of order as soon as source registers are available
  - Allows only one operation in the window per destination register

Register Renaming (1)
What about this sequence?

      1: LW  R1, 0(R4)
      2: ADD R2, R1, R3
      3: LW  R1, 4(R4)
      4: ADD R5, R1, R3

- Can't add instruction 3 to the window since R1 is already busy
- Need 2 R1s!

Register Renaming (2)
- Add a tag field to each register: translates from virtual to physical register name

Rename table (virtual registers: name, valid, tag):
      R1 0 P5
      R2 0 P2
      R3 1 P1
      R4 1 P7
      R5 1 P6

Physical registers (name, bit, value):
      P1 0 A
      P2 1 5
      P3 1 C
      P4 1 0
      P5 0 E
      P6 1 F
      P7 1 3
      P8 0 2

In window:
      LW  R1, 0(R4)
      ADD R2, R1, R3

Next instruction:
      LW  R1, 4(R4)

Register Renaming (3)

slot   instruction       dst   src1 (tag, rdy)   src2 (tag, rdy)
 S1    LW  R1,0(R4)      P5    P7  1                 1
 S2    ADD R2,R1,R3      P2    P5  0             P1  1
 S3    LW  R1,4(R4)      P4    P7  1                 1
 S4    ADD R5,R1,R3      P6    P4  0             P1  1

Rename table before (name, valid, tag):
      R1 0 P5   R2 0 P2   R3 1 P1   R4 1 P7   R5 1 P6
Rename table after:
      R1 0 P4   R2 0 P2   R3 1 P1   R4 1 P7   R5 0 P6

- When result is generated
  - Compare tag of result to not-ready source fields
  - Grab data if match
- Add instruction to window even if dest register is busy
- When adding instruction to window
  - Read data of non-busy source registers and retain
  - Read tags of busy source registers and retain
  - Write tag of destination register with slot number
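
The effect of the rename table can be sketched as a table lookup for sources followed by a new tag for the destination. This sketch assumes a simple policy that always takes a fresh physical register from a free list; the free-list contents and the function name are hypothetical, and the policy differs from the slide's tables, which keep P6 for R5. It starts from the "before" table and renames instructions 3 and 4.

```python
def rename(instrs, table, free):
    """Rewrite (op, dst, srcs) from virtual to physical register names."""
    renamed = []
    for op, dst, srcs in instrs:
        phys_srcs = [table[s] for s in srcs]   # read current mappings first
        table[dst] = free.pop(0)               # then give dst a fresh register
        renamed.append((op, table[dst], phys_srcs))
    return renamed

# "Before" rename table from the slide (S1 and S2 already in the window).
table = {"R1": "P5", "R2": "P2", "R3": "P1", "R4": "P7", "R5": "P6"}
free = ["P4", "P3", "P8"]                      # hypothetical free list

renamed = rename(
    [("LW", "R1", ["R4"]),                     # 3: LW  R1, 4(R4)
     ("ADD", "R5", ["R1", "R3"])],             # 4: ADD R5, R1, R3
    table, free)

print(renamed)       # [('LW', 'P4', ['P7']), ('ADD', 'P3', ['P4', 'P1'])]
print(table["R1"])   # P4
```

Instruction 4 now reads P4, while instruction 2 in the window still waits on P5, so the two uses of R1 no longer conflict.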

Summary
Today
- Basics of out-of-order instruction execution
- Hardware to keep track of dependencies
- Mechanisms to reduce false dependences (WAR, WAW)

Next Time
- Examples and more details