CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1


CISC 662 Graduate Computer Architecture
Lecture 13 - CPI < 1

Michela Taufer
http://www.cis.udel.edu/~taufer/teaching/cis662f07

PowerPoint lecture notes from John Hennessy and David Patterson's Computer Architecture, 4th edition. Additional teaching material from Jelena Mirkovic (U Del) and John Kubiatowicz (UC Berkeley).

Loads and Stores

- Loads and stores are treated as separate functional units (FUs) with their own reservation stations (RSs).
- Load buffers and store buffers behave almost exactly like reservation stations: load buffers hold data coming from memory; store buffers hold data going to memory.
- Loads and stores require a two-step execution process:
  - First step: they go through a functional unit that computes the effective address.
  - Second step: the effective address is placed in the corresponding load or store buffer.
- Loads in the load buffer execute as soon as the memory unit is available.
- Stores in the store buffer wait for the value to be stored before being sent to the memory unit.

Prevent Hazards through Memory

- A load and a store can safely be done in a different order as long as they access different addresses.
- If a load and a store access the same memory address, there are potential RAW, WAR, and WAW hazards.
- Solution: the processor performs the effective address calculation in program order.
  - For loads: check for conflicts with all active store buffers. There is no need to check the active loads, since there are no RAR hazards.
  - For stores: check both the load buffers and the store buffers.

Dynamic Memory Disambiguation

- The relative order of loads and stores to the same location must be preserved.
- Since loads and stores access memory locations, we can examine their ordering only after the effective address is calculated.
- Effective address calculation is performed in program order:
  - The address of a load is examined against the A fields of all store buffers.
  - The address of a store is examined against the A fields of all load and store buffers.
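As an illustration, here is a minimal Python sketch of these checks (the buffer representation and function names are assumptions for illustration, not the lecture's hardware):

# Minimal sketch of dynamic memory disambiguation (hypothetical model):
# each active load/store buffer entry records its effective address
# (the "A" field) once the address unit has computed it.

load_buffers = [{"addr": 0x1000}]    # active loads awaiting memory
store_buffers = [{"addr": 0x1008}]   # active stores awaiting their data

def load_may_proceed(load_addr):
    """A load conflicts only with pending stores (RAW through memory);
    other loads need not be checked, since RAR is not a hazard."""
    return all(s["addr"] != load_addr for s in store_buffers)

def store_may_proceed(store_addr):
    """A store must check both buffers: loads (WAR) and stores (WAW)."""
    return (all(l["addr"] != store_addr for l in load_buffers) and
            all(s["addr"] != store_addr for s in store_buffers))

print(load_may_proceed(0x1008))    # False: RAW with the pending store
print(store_may_proceed(0x1000))   # False: WAR with the pending load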

CPI < 1

CPI < 1?

- CPI < 1 is not possible if only one instruction is issued per clock cycle.
- We need to allow multiple instructions to be issued in a clock cycle.

Getting CPI < 1: Issuing Multiple Instructions/Cycle

- Vector processing: explicit coding of independent loops as operations on large vectors of numbers; multimedia instructions are being added to many processors.
- Superscalar: varying number of instructions/cycle (1 to 8), scheduled by the compiler or by hardware (Tomasulo).
  - IBM PowerPC, Sun UltraSPARC, DEC Alpha, Pentium III/4
- (Very) Long Instruction Word ((V)LIW): fixed number of instructions (4-16) scheduled by the compiler; operations are packed into wide instruction templates.
  - Intel Architecture-64 (IA-64), a 64-bit address architecture, renamed Explicitly Parallel Instruction Computing (EPIC)
- The anticipated success of multiple issue led to the use of Instructions Per Clock cycle (IPC) instead of CPI.

Superscalar Processors

- Instructions are either statically or dynamically scheduled:
  - Statically scheduled by the compiler
  - Dynamically scheduled by techniques based on scoreboarding or Tomasulo's algorithm
- A varying number of instructions is issued per clock.

Very Long Instruction Word

- Issues a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet
- Instructions are statically scheduled by the compiler

Implementing Superscalar Processors

To handle multiple instructions per clock, either:
- Run each issue step (i.e., assigning a reservation station and updating the pipeline control) in half a clock cycle, so that two instructions can be processed in one clock cycle, or
- Build the logic necessary to handle two instructions at once, including any dependences between the instructions.

Getting CPI < 1: Issuing Multiple Instructions/Cycle

- Superscalar: assume 2 instructions per cycle, 1 FP and 1 of anything else
  - Fetch 64 bits/clock cycle; the integer instruction on the left, the FP instruction on the right
  - Can only issue the 2nd instruction if the 1st instruction issues
  - More ports on the FP registers are needed to do an FP load and an FP op as a pair

  Type               Pipe stages
  Int. instruction   IF  ID  EX  MEM WB
  FP instruction     IF  ID  EX  MEM WB
  Int. instruction       IF  ID  EX  MEM WB
  FP instruction         IF  ID  EX  MEM WB
  Int. instruction           IF  ID  EX  MEM WB
  FP instruction             IF  ID  EX  MEM WB

- The 1-cycle load delay expands to 3 instructions in a superscalar: the instruction in the right half of the pair cannot use the loaded value, nor can the instructions in the next issue slot.

Multiple Issue Issues

- Issue packet: group of instructions from the fetch unit that could potentially issue in 1 clock.
- If an instruction would cause a structural hazard or a data hazard, either due to an earlier instruction already in execution or to an earlier instruction in the same issue packet, then the instruction does not issue.
- 0 to N instructions issue per clock cycle for an N-issue processor.
- Performing all the issue checks in 1 cycle could limit the clock cycle time: O(n^2 - n) comparisons => the issue stage is usually split and pipelined:
  - The 1st stage decides how many instructions from within this packet can issue; the 2nd stage examines hazards among the selected instructions and those already issued.
  - => higher branch penalties => branch prediction accuracy becomes even more important.
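As a rough illustration of why the checks grow as O(n^2 - n), here is a small Python sketch (the instruction encoding and the stop-at-first-dependence policy are assumptions for illustration) that compares each instruction in an issue packet against every earlier one in the same packet; checks against instructions already in execution are omitted:

# Hypothetical intra-packet hazard check: each instruction is compared
# against every earlier instruction in the packet, giving n*(n-1)/2
# comparison pairs -- O(n^2 - n) comparator work overall.

def packet_issue_count(packet):
    """packet: list of (dest_reg, src_regs) tuples in program order.
    Returns how many leading instructions can issue together, stopping
    at the first intra-packet RAW or WAW dependence (in-order issue)."""
    for i, (dest, srcs) in enumerate(packet):
        for j in range(i):                  # compare against earlier slots
            earlier_dest, _ = packet[j]
            if earlier_dest in srcs or earlier_dest == dest:
                return i                    # RAW or WAW within the packet
    return len(packet)

packet = [("r1", ("r2", "r3")),   # add r1, r2, r3
          ("r4", ("r1", "r2")),   # sub r4, r1, r2 -- RAW on r1
          ("r5", ("r6", "r7"))]
print(packet_issue_count(packet))  # -> 1: only the first instruction issues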

Dynamic Scheduling in Superscalar: The Easy Way

- How do we issue two instructions per clock and keep in-order instruction issue for Tomasulo?
- Assume 1 integer + 1 floating-point instruction per cycle:
  - 1 Tomasulo control for integer, 1 for floating point
  - Issue at 2x the clock rate, so that issue remains in order
- Only loads/stores might cause a dependence between the integer and FP issue streams:
  - Replace the load reservation stations with a load queue; operands must be read in the order they are fetched
  - A load checks addresses in the store queue to avoid RAW violations
  - A store checks addresses in the load queue to avoid WAR violations, and in the store queue to avoid WAW violations

How Much to Speculate?

- Speculation pro: it can uncover events that would otherwise stall the pipeline (e.g., cache misses).
- Speculation con: speculation is costly if an exceptional event occurs when the speculation was incorrect.
- Typical solution: allow speculation only across low-cost exceptional events (e.g., a 1st-level cache miss).
  - When an expensive exceptional event occurs (e.g., a 2nd-level cache miss or a TLB miss), the processor waits until the instruction causing the event is no longer speculative before handling the event.
- We have been assuming a single branch per cycle: future processors may speculate across multiple branches!

Review: Unrolled Loop that Minimizes Stalls for Scalar

Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles

 1 Loop: LD    F0,0(R1)
 2       LD    F6,-8(R1)
 3       LD    F10,-16(R1)
 4       LD    F14,-24(R1)
 5       ADDD  F4,F0,F2
 6       ADDD  F8,F6,F2
 7       ADDD  F12,F10,F2
 8       ADDD  F16,F14,F2
 9       SD    0(R1),F4
10       SD    -8(R1),F8
11       SD    -16(R1),F12
12       SUBI  R1,R1,#32
13       BNEZ  R1,LOOP
14       SD    8(R1),F16    ; 8-32 = -24

14 clock cycles, or 3.5 per iteration
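The unrolled assembly above corresponds to a source loop like the following Python sketch (the source form is an assumption based on the standard H&P example, where the scalar s lives in F2); the compiler unrolls 4 times and then reorders the body, grouping all loads, then all adds, then all stores, to hide the LD-to-ADDD and ADDD-to-SD latencies:

# Assumed source loop and its 4-way unrolled equivalent (sketch only;
# the real transformation is done by the compiler on the MIPS code).

def loop(x, s):
    for i in range(len(x)):
        x[i] += s

def loop_unrolled4(x, s):
    # assumes len(x) is a multiple of 4, as the slide's unrolling does
    for i in range(0, len(x), 4):
        x[i]     += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s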

Loop Unrolling in Superscalar

      Integer instruction   FP instruction     Clock cycle
Loop: LD    F0,0(R1)                           1
      LD    F6,-8(R1)                          2
      LD    F10,-16(R1)     ADDD F4,F0,F2      3
      LD    F14,-24(R1)     ADDD F8,F6,F2      4
      LD    F18,-32(R1)     ADDD F12,F10,F2    5
      SD    0(R1),F4        ADDD F16,F14,F2    6
      SD    -8(R1),F8       ADDD F20,F18,F2    7
      SD    -16(R1),F12                        8
      SD    -24(R1),F16                        9
      SUBI  R1,R1,#40                          10
      BNEZ  R1,LOOP                            11
      SD    -32(R1),F20                        12

Unrolled 5 times to avoid delays (+1 due to SS)
12 clocks, or 2.4 clocks per iteration (1.5x faster)

Statically Scheduled Superscalar MIPS

- The compiler is responsible for finding independent instructions to issue
  - E.g., unroll a loop to make n copies of the body
- Problems that might arise:
  - We need additional hardware in the pipeline
  - Maintaining precise exceptions is hard because instructions may complete out of order
  - Hazard penalties are longer

Dynamically Scheduled Superscalar MIPS

- Extend Tomasulo's algorithm to support issue of 2 instructions per cycle
- We must issue instructions to reservation stations in order
- The issue stage can be either:
  - Pipelined: issue one instruction in each half cycle
  - Extended: add more hardware and issue both instructions simultaneously

Dynamically Scheduled Superscalar MIPS

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDIU R1,R1,#-8
      BNE    R1,R2,LOOP

Assumptions:
- Any two instructions can be issued (not only integer + FP)
- One integer unit is used both for ALU operations and for effective address calculation
- Integer ALU takes 1 cycle, load 2 cycles, FP add 3 cycles
- Pipelined FP units, 2 CDBs, perfect branch prediction
- One cycle is needed for issue and one for write results (this stage adds one cycle of delay)
- No hardware lets us know whether the as-yet-undecoded instruction is a branch
- Instructions following a branch cannot begin execution until the branch outcome is known
- One single memory port

Task: show when each instruction issues, begins execution, and writes to the CDB for the first 3 iterations of the loop; show resource usage for the integer unit, FP unit, data cache, and CDB.

Dynamically Scheduled Superscalar MIPS

Dual-issue version without speculation:

Iter  Instruction           Issue  Execute  Memory  Write CDB  Comment
1     L.D    F0,0(R1)       1      2        3       4
1     ADD.D  F4,F0,F2       1      5                8          Wait for L.D
1     S.D    F4,0(R1)       2      3        9                  Wait for ADD.D
1     DADDIU R1,R1,#-8      2      4                5          Wait for ALU
1     BNE    R1,R2,Loop     3      6                           Wait for DADDIU
2     L.D    F0,0(R1)       4      7        8       9          Wait for BNE
2     ADD.D  F4,F0,F2       4      10               13         Wait for L.D
2     S.D    F4,0(R1)       5      8        14                 Wait for ADD.D
2     DADDIU R1,R1,#-8      5      9                10         Wait for ALU
2     BNE    R1,R2,Loop     6      11                          Wait for DADDIU
3     L.D    F0,0(R1)       7      12       13      14         Wait for BNE
3     ADD.D  F4,F0,F2       7      15               18         Wait for L.D
3     S.D    F4,0(R1)       8      13       19                 Wait for ADD.D
3     DADDIU R1,R1,#-8      8      14               15         Wait for ALU
3     BNE    R1,R2,Loop     9      16                          Wait for DADDIU

CPI = 16/15 = 1.07

Dynamically Scheduled Superscalar MIPS

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDIU R1,R1,#-8
      BNE    R1,R2,LOOP

Assumptions (as before, but now with two integer units):
- Any two instructions can be issued (not only integer + FP)
- One integer unit is used for ALU operations
- A second integer unit is used for effective address calculation
- Integer ALU takes 1 cycle, load 2 cycles, FP add 3 cycles
- Pipelined FP units, 2 CDBs, perfect branch prediction
- One cycle is needed for issue and one for write results (this stage adds one cycle of delay)
- No hardware lets us know whether the as-yet-undecoded instruction is a branch
- Instructions following a branch cannot begin execution until the branch outcome is known
- One single memory port

Task: show when each instruction issues, begins execution, and writes to the CDB for the first 3 iterations of the loop; show resource usage for the integer units, FP unit, data cache, and CDB.

Dynamically Scheduled Superscalar MIPS

Dual-issue version with a separate integer unit for address calculation:

Iter  Instruction           Issue  Execute  Memory  Write CDB  Comment
1     L.D    F0,0(R1)       1      2        3       4
1     ADD.D  F4,F0,F2       1      5                8          Wait for L.D
1     S.D    F4,0(R1)       2      3        9                  Wait for ADD.D
1     DADDIU R1,R1,#-8      2      3                4
1     BNE    R1,R2,Loop     3      5                           Wait for DADDIU
2     L.D    F0,0(R1)       4      6        7       8          Wait for BNE
2     ADD.D  F4,F0,F2       4      9                12         Wait for L.D
2     S.D    F4,0(R1)       5      7        13                 Wait for ADD.D
2     DADDIU R1,R1,#-8      5      6                7
2     BNE    R1,R2,Loop     6      8                           Wait for DADDIU
3     L.D    F0,0(R1)       7      9        10      11         Wait for BNE
3     ADD.D  F4,F0,F2       7      12               15         Wait for L.D
3     S.D    F4,0(R1)       8      10       16                 Wait for ADD.D
3     DADDIU R1,R1,#-8      8      9                10
3     BNE    R1,R2,Loop     9      11                          Wait for DADDIU

CPI = 11/15 = 0.73

Increasing Instruction Fetch Bandwidth

Branch Target Buffer (BTB):
- Predicts the next instruction address and sends it out before decoding the instruction
- The PC of the fetched branch is sent to the BTB
- When a match is found, the predicted PC is returned
- If the branch is predicted taken, instruction fetch continues at the predicted PC
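A minimal Python sketch of the lookup/update protocol (the direct-mapped organization, entry format, and table size are assumptions for illustration; real BTBs vary):

# Hypothetical direct-mapped Branch Target Buffer: each entry holds the
# full PC of the branch (the tag) and its predicted target PC. The
# lookup happens in fetch, before the instruction is even decoded.

BTB_ENTRIES = 1024
btb = {}   # index -> (branch_pc, predicted_target)

def btb_lookup(pc):
    """Return the predicted next PC: the stored target on a hit
    (predicted taken), or the fall-through PC (pc + 4) on a miss."""
    index = (pc >> 2) % BTB_ENTRIES
    entry = btb.get(index)
    if entry is not None and entry[0] == pc:   # tag match
        return entry[1]
    return pc + 4                              # miss: fetch sequentially

def btb_update(branch_pc, target_pc):
    """On a taken branch, record (or refresh) its target."""
    index = (branch_pc >> 2) % BTB_ENTRIES
    btb[index] = (branch_pc, target_pc)

btb_update(0x0040, 0x0100)
print(hex(btb_lookup(0x0040)))   # 0x100: hit, predicted taken
print(hex(btb_lookup(0x0044)))   # 0x48: miss, fall through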

Branch Folding (I)

Branch folding allows:
- 0-cycle unconditional branches (always)
- 0-cycle conditional branches (sometimes)

Branch folding:
- Eliminates an instruction (the branch) from the code stream: the BTB entry is extended to hold the predicted instruction itself
- Eliminates the single-cycle pipeline bubble that usually occurs immediately after a branch

[Figure: BTB entry with a "predicted instruction" field]

Branch Folding (II)

If the processor is issuing two instructions per cycle, the BTB supplies the predicted instructions for both issue slots.

[Figure: BTB entry with "predicted instructions" for both slots]

Multiple Issue Challenges

- While the integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
  - Exactly 50% FP operations, AND
  - No hazards
- If more instructions issue at the same time, decode and issue become harder:
  - Even a 2-scalar machine must examine 2 opcodes and 6 register specifiers and decide whether 1 or 2 instructions can issue (N-issue requires ~O(N^2 - N) comparisons)
  - Register file: need 2x reads and 1x writes per cycle
  - Rename logic: must be able to rename the same register multiple times in one cycle! For instance, consider 4-way issue:

      add r1, r2, r3        add p11, p4,  p7
      sub r4, r1, r2        sub p22, p11, p4
      lw  r1, 4(r4)         lw  p23, 4(p22)
      add r5, r1, r2        add p12, p23, p4

    Imagine doing this transformation in a single cycle!
  - Result buses: need to complete multiple instructions per cycle
    - So we need multiple buses with associated matching logic at every reservation station,
    - or multiple forwarding paths
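To see why doing the rename transformation above in a single cycle is hard, here is a deliberately sequential Python sketch (the map-table contents and physical register names follow the slide's example; the free-list structure is an assumption). Each instruction's sources must observe the mappings created by earlier instructions in the same group, so a single-cycle hardware implementation needs intra-group dependence checks and map-table bypassing, not four independent lookups:

# Sequential sketch of the rename step for one 4-wide issue group.
# Each instruction reads the map AFTER all earlier instructions in the
# group have updated it; hardware must do the same combinationally.

map_table = {"r2": "p4", "r3": "p7"}        # current arch -> phys map
free_list = ["p11", "p22", "p23", "p12"]    # free physical registers

def rename(group):
    """group: list of (dest, [sources]); returns renamed instructions."""
    renamed = []
    for dest, srcs in group:
        phys_srcs = [map_table[s] for s in srcs]  # must see earlier renames
        new_phys = free_list.pop(0)               # allocate new dest reg
        map_table[dest] = new_phys
        renamed.append((new_phys, phys_srcs))
    return renamed

group = [("r1", ["r2", "r3"]),    # add r1, r2, r3
         ("r4", ["r1", "r2"]),    # sub r4, r1, r2
         ("r1", ["r4"]),          # lw  r1, 4(r4)
         ("r5", ["r1", "r2"])]    # add r5, r1, r2
for dest, srcs in rename(group):
    print(dest, srcs)
# -> p11 ['p4', 'p7'], p22 ['p11', 'p4'], p23 ['p22'], p12 ['p23', 'p4']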

More about VLIW

- VLIW packages multiple operations into one very long instruction
- The compiler chooses the instructions to be issued
- Enough parallelism is needed in a straight-line code sequence to fill the available operation slots:
  - Unroll loops
  - Schedule code across basic blocks using global scheduling techniques

Loop Unrolling in VLIW

Memory ref 1     Memory ref 2     FP operation 1    FP operation 2    Int. op/branch   Clock
LD F0,0(R1)      LD F6,-8(R1)                                                          1
LD F10,-16(R1)   LD F14,-24(R1)                                                        2
LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2     ADDD F8,F6,F2                      3
LD F26,-48(R1)                    ADDD F12,F10,F2   ADDD F16,F14,F2                    4
                                  ADDD F20,F18,F2   ADDD F24,F22,F2                    5
SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2                                      6
SD -16(R1),F12   SD -24(R1),F16                                                        7
SD -32(R1),F20   SD -40(R1),F24                                      SUBI R1,R1,#48   8
SD -0(R1),F28                                                        BNEZ R1,LOOP     9

Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration (1.8x faster)
Average: 2.5 ops per clock, 50% efficiency
Note: VLIW needs more registers (15 vs. 6 in the superscalar version)
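Where the efficiency numbers come from (a quick check against the schedule above): the unrolled body contains 23 operations (7 LD + 7 ADDD + 7 SD + 1 SUBI + 1 BNEZ) issued in 9 clocks of 5 slots each, so 23/9 ≈ 2.5 operations per clock, and 23/45 ≈ 51% of the available slots are filled, i.e., roughly 50% efficiency.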

Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation

- HW determines address conflicts dynamically
- HW provides better branch prediction
- HW maintains a precise exception model
- HW does not execute bookkeeping instructions
- HW speculation works across multiple implementations
- On the other hand, SW speculation makes the HW much easier to design

Superscalar vs. VLIW

Superscalar advantages:
- Smaller code size
- Binary compatibility across generations of hardware

VLIW advantages:
- Simplified hardware for decoding and issuing instructions
- No interlock hardware (the compiler checks?)
- More registers, but simplified hardware for register ports (multiple independent register files?)

Limits in Multi-Issue Processors

- Inherent limitations of ILP in programs
- Difficulties in building the underlying hardware
- Limitations specific to either a superscalar or a VLIW implementation