CISC 662 Graduate Computer Architecture
Lecture 13 - CPI < 1
Michela Taufer
http://www.cis.udel.edu/~taufer/teaching/cis662f07
PowerPoint lecture notes from John Hennessy and David Patterson's Computer Architecture, 4th edition
Additional teaching material from Jelena Mirkovic (U Del) and John Kubiatowicz (UC Berkeley)

Loads and Stores
- Loads and stores are treated as separate functional units (FU) with their own reservation stations (RS)
- Load buffers and store buffers behave almost exactly like reservation stations
  - Load buffers hold data coming from memory
  - Store buffers hold data going to memory
- Loads and stores require a two-step execution process:
  - First step: they go through a functional unit that computes the effective address
  - Second step: the effective address is placed in the corresponding load or store buffer
- Loads in the load buffer execute as soon as the memory unit is available
- Stores in the store buffer wait for the value to be stored before being sent to the memory unit

Prevent Hazards through Memory
- A load and a store can safely be done out of program order as long as they access different addresses
- If a load and a store access the same memory address, there are potential WAR, RAW and WAW hazards
- Solution: the processor performs the effective address calculation in program order
  - For the loads: check for conflicts with all active store buffers. There is no need to check the active loads, since there are no RAR hazards.
  - For the stores: check in both the load and the store buffers.

Dynamic Memory Disambiguation
- Order of loads and stores must be preserved
- Since they access memory locations, we can examine order only after we calculate the effective address
- Effective address calculation is performed in order:
  - Address of a load is examined against the A fields of all store buffers
  - Address of a store is examined against the A fields of all load and store buffers

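The address checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `BufferEntry`, `load_may_proceed` and `store_may_proceed` are hypothetical names, the `addr` attribute stands in for the A field, and the buffers passed in are assumed to hold only entries that come earlier in program order.

```python
from dataclasses import dataclass


@dataclass
class BufferEntry:
    """One load- or store-buffer entry (hypothetical model); addr is the A field."""
    addr: int
    busy: bool = True  # True while the memory access is still pending


def load_may_proceed(load_addr, store_buffers):
    """A load is checked only against active stores: there are no RAR hazards."""
    return all(not (e.busy and e.addr == load_addr) for e in store_buffers)


def store_may_proceed(store_addr, load_buffers, store_buffers):
    """A store is checked against active loads (WAR) and active stores (WAW)."""
    earlier = list(load_buffers) + list(store_buffers)
    return all(not (e.busy and e.addr == store_addr) for e in earlier)


stores = [BufferEntry(0x100), BufferEntry(0x108, busy=False)]
loads = [BufferEntry(0x200)]
print(load_may_proceed(0x100, stores))          # conflicts with the pending store
print(store_may_proceed(0x300, loads, stores))  # no matching address in either buffer
```

Because effective addresses are computed in program order, every earlier entry already has a valid A field when the comparison is made.
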
CPI < 1

CPI < 1?
- CPI < 1 is not possible if only one instruction is issued per clock cycle
- Need to allow multiple instructions to be issued in a clock cycle

Getting CPI < 1: Issuing Multiple Instructions/Cycle
- Vector Processing: explicit coding of independent loops as operations on large vectors of numbers
  - Multimedia instructions being added to many processors
- Superscalar: varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo)
  - IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
- (Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates (TBD)
  - Intel Architecture-64 (IA-64), 64-bit address
  - Renamed: Explicitly Parallel Instruction Computer (EPIC)
- Anticipated success of multiple instructions led to Instructions Per Clock cycle (IPC) vs. CPI

Superscalar Processors
- Instructions either statically or dynamically scheduled:
  - Statically scheduled by compilers
  - Dynamically scheduled by techniques based on scoreboarding or Tomasulo's algorithm
- Issue a varying number of instructions per clock

Very Long Instruction Word
- Issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet
- Instructions statically scheduled by the compiler

Implementing Superscalar Processors
To have multiple instructions per clock, either:
- Run each step (i.e., assigning a reservation station and updating the pipeline control tables) in half a clock cycle, so that two instructions can be processed in one clock cycle
- Build the logic necessary to handle two instructions at once, including any dependency between the instructions

Getting CPI < 1: Issuing Multiple Instructions/Cycle
- Superscalar: assume 2 instructions, 1 FP & 1 anything else
  - Fetch 64 bits/clock cycle; Int on left, FP on right
  - Can only issue 2nd instruction if 1st instruction issues
  - More ports for FP registers to do FP load & FP op in a pair

Type               Pipe stages
Int. instruction   IF   ID   EX   MEM  WB
FP instruction     IF   ID   EX   MEM  WB
Int. instruction        IF   ID   EX   MEM  WB
FP instruction          IF   ID   EX   MEM  WB
Int. instruction             IF   ID   EX   MEM  WB
FP instruction               IF   ID   EX   MEM  WB

- 1-cycle load delay expands to 3 instructions in SS
  - The instruction in the right half can't use it, nor can the instructions in the next slot

Multiple Issue Issues
- Issue packet: group of instructions from fetch unit that could potentially issue in 1 clock
  - If an instruction causes a structural hazard or a data hazard, either due to an earlier instruction in execution or to an earlier instruction in the issue packet, then the instruction does not issue
  - 0 to N instructions issue per clock cycle, for N-issue
- Performing issue checks in 1 cycle could limit clock cycle time: O(n^2 - n) comparisons
  - => issue stage usually split and pipelined
  - 1st stage decides how many instructions from within this packet can issue, 2nd stage examines hazards among selected instructions and those already issued
  - => higher branch penalties => prediction accuracy important

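The first-stage decision described above (how many instructions of a packet can issue) can be sketched as follows. This is a simplified illustration, not a description of any real issue logic: `Instr`, `issuable_prefix` and `pair_count` are hypothetical names, only RAW and WAW dependences inside the packet are checked, and register renaming, structural hazards and instructions already in execution are deliberately left out.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    dest: Optional[str]      # destination register, or None (e.g. a branch)
    srcs: Tuple[str, ...]    # source registers


def issuable_prefix(packet):
    """Count the leading instructions of the packet that can issue together, in order."""
    written = set()          # registers written by earlier instructions in the packet
    for i, ins in enumerate(packet):
        raw = any(s in written for s in ins.srcs)
        waw = ins.dest is not None and ins.dest in written
        if raw or waw:
            return i         # in-order issue: stop at the first dependent instruction
        if ins.dest is not None:
            written.add(ins.dest)
    return len(packet)


def pair_count(n):
    """Instruction pairs to cross-check in an n-wide packet: (n^2 - n) / 2."""
    return n * (n - 1) // 2


packet = [Instr("F0", ("R1",)), Instr("F4", ("F0", "F2"))]  # LD F0,0(R1); ADDD F4,F0,F2
print(issuable_prefix(packet), pair_count(2), pair_count(4), pair_count(8))
```

The pair count grows quadratically with issue width, which is why the check is usually split across pipelined issue stages.
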
Dynamic Scheduling in Superscalar: The Easy Way
- How to issue two instructions and keep in-order instruction issue for Tomasulo?
  - Assume 1 integer + 1 floating point
  - 1 Tomasulo control for integer, 1 for floating point
- Issue at 2X clock rate, so that issue remains in order
- Only loads/stores might cause a dependency between integer and FP issue:
  - Replace load reservation station with a load queue; operands must be read in the order they are fetched
  - Load checks addresses in Store Queue to avoid RAW violation
  - Store checks addresses in Load Queue to avoid WAR, WAW

How much to Speculate?
- Speculation pro: uncover events that would otherwise stall the pipeline (cache misses)
- Speculation con: speculation is costly if an exceptional event occurs when the speculation was incorrect
- Typical solution: speculation allows only low-cost exceptional events (1st-level cache miss)
- When an expensive exceptional event occurs (2nd-level cache miss or TLB miss), the processor waits until the instruction causing the event is no longer speculative before handling the event
- Assuming single branch per cycle: future may speculate across multiple branches!

Review: Unrolled Loop that Minimizes Stalls for Scalar

 1 Loop: LD    F0,0(R1)         LD to ADDD: 1 cycle
 2       LD    F6,-8(R1)        ADDD to SD: 2 cycles
 3       LD    F10,-16(R1)
 4       LD    F14,-24(R1)
 5       ADDD  F4,F0,F2
 6       ADDD  F8,F6,F2
 7       ADDD  F12,F10,F2
 8       ADDD  F16,F14,F2
 9       SD    0(R1),F4
10       SD    -8(R1),F8
11       SD    -16(R1),F12
12       SUBI  R1,R1,#32
13       BNEZ  R1,LOOP
14       SD    8(R1),F16        ; 8-32 = -24

14 clock cycles, or 3.5 per iteration

Loop Unrolling in Superscalar

      Integer instruction    FP instruction      Clock cycle
Loop: LD   F0,0(R1)                              1
      LD   F6,-8(R1)                             2
      LD   F10,-16(R1)       ADDD F4,F0,F2       3
      LD   F14,-24(R1)       ADDD F8,F6,F2       4
      LD   F18,-32(R1)       ADDD F12,F10,F2     5
      SD   0(R1),F4          ADDD F16,F14,F2     6
      SD   -8(R1),F8         ADDD F20,F18,F2     7
      SD   -16(R1),F12                           8
      SD   -24(R1),F16                           9
      SUBI R1,R1,#40                             10
      BNEZ R1,LOOP                               11
      SD   8(R1),F20                             12   ; 8-40 = -32

- Unrolled 5 times to avoid delays (+1 due to SS)
- 12 clocks, or 2.4 clocks per iteration (1.5X)

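The per-iteration figures quoted on the last two slides follow from simple division. A quick check, using only the numbers on the slides (14 clocks for 4 unrolled iterations on the scalar pipeline, 12 clocks for 5 on the superscalar); the function name is just for illustration:

```python
def clocks_per_iteration(total_clocks, unrolled_iterations):
    """Average clocks spent per original loop iteration."""
    return total_clocks / unrolled_iterations


scalar = clocks_per_iteration(14, 4)       # scalar schedule: 14 clocks, 4 iterations
superscalar = clocks_per_iteration(12, 5)  # superscalar schedule: 12 clocks, 5 iterations
speedup = scalar / superscalar             # the slide rounds this ratio to 1.5X

print(f"scalar: {scalar:.2f}  superscalar: {superscalar:.2f}  speedup: {speedup:.2f}X")
```
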
Statically Scheduled Superscalar MIPS
- The compiler is responsible for finding independent instructions to issue
  - E.g., unroll loop to make n copies
- Problems might arise:
  - We will need additional hardware in the pipeline
  - Maintaining precise exceptions is hard because instructions may complete out of order
  - Hazard penalties are longer

Dynamically Scheduled Superscalar MIPS
- Extend Tomasulo's algorithm to support issue of 2 instructions per cycle
- We must issue instructions to reservation stations in order
- Issue stage can either be
  - Pipelined: issue one instruction in half a cycle, another one in the other half
  - Extended: add more hardware and issue instructions simultaneously

Dynamically Scheduled Superscalar MIPS

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDIU R1,R1,#-8
      BNE    R1,R2,Loop

- Any two instructions can be issued (not only integer + FP)
- One INT unit used both for ALU and effective address calculation
- Integer ALU takes 1 cycle, load 2, FP add 3
- Pipelined FP units, 2 CDBs, perfect branch prediction
- One cycle is needed for issue and one for write results (this stage adds one cycle delay)
- Show when each instruction issues, begins execution and writes to CDB for the first 3 iterations of the loop
- Show resource usage for integer unit, FP unit, data cache and CDB
- Assume that we do not have any hardware that allows us to know whether the as-yet-undecoded instruction is a branch
- Assume instructions following a branch cannot proceed with execution until we know the branch outcome
- Assume one single memory port

Dynamically Scheduled Superscalar MIPS
Dual-issue version without speculation

Iteration  Instruction          Issue  Execute  Memory  Write CDB  Comment
1          L.D    F0,0(R1)      1      2        3       4
1          ADD.D  F4,F0,F2      1      5        -       8          Wait for L.D
1          S.D    F4,0(R1)      2      3        9       -          Wait for ADD.D
1          DADDIU R1,R1,#-8     2      4        -       5          Wait for ALU
1          BNE    R1,R2,Loop    3      6        -       -          Wait for DADDIU
2          L.D    F0,0(R1)      4      7        8       9          Wait for BNE
2          ADD.D  F4,F0,F2      4      10       -       13         Wait for L.D
2          S.D    F4,0(R1)      5      8        14      -          Wait for ADD.D
2          DADDIU R1,R1,#-8     5      9        -       10         Wait for ALU
2          BNE    R1,R2,Loop    6      11       -       -          Wait for DADDIU
3          L.D    F0,0(R1)      7      12       13      14         Wait for BNE
3          ADD.D  F4,F0,F2      7      15       -       18         Wait for L.D
3          S.D    F4,0(R1)      8      13       19      -          Wait for ADD.D
3          DADDIU R1,R1,#-8     8      14       -       15         Wait for ALU
3          BNE    R1,R2,Loop    9      16       -       -          Wait for DADDIU

CPI = 16/15 = 1.07

Dynamically Scheduled Superscalar MIPS

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDIU R1,R1,#-8
      BNE    R1,R2,Loop

- Any two instructions can be issued (not only integer + FP)
- One INT unit used for ALU
- One INT unit used for effective address calculation
- Integer ALU takes 1 cycle, load 2, FP add 3
- Pipelined FP units, 2 CDBs, perfect branch prediction
- One cycle is needed for issue and one for write results (this stage adds one cycle delay)
- Show when each instruction issues, begins execution and writes to CDB for the first 3 iterations of the loop
- Show resource usage for integer unit, FP unit, data cache and CDB
- Assume that we do not have any hardware that allows us to know whether the as-yet-undecoded instruction is a branch
- Assume instructions following a branch cannot proceed with execution until we know the branch outcome
- Assume one single memory port

Dynamically Scheduled Superscalar MIPS

Iteration  Instruction          Issue  Execute  Memory  Write CDB  Comment
1          L.D    F0,0(R1)      1      2        3       4
1          ADD.D  F4,F0,F2      1      5        -       8          Wait for L.D
1          S.D    F4,0(R1)      2      3        9       -          Wait for ADD.D
1          DADDIU R1,R1,#-8     2      3        -       4
1          BNE    R1,R2,Loop    3      5        -       -          Wait for DADDIU
2          L.D    F0,0(R1)      4      6        7       8          Wait for BNE
2          ADD.D  F4,F0,F2      4      9        -       12         Wait for L.D
2          S.D    F4,0(R1)      5      7        13      -          Wait for ADD.D
2          DADDIU R1,R1,#-8     5      6        -       7
2          BNE    R1,R2,Loop    6      8        -       -          Wait for DADDIU
3          L.D    F0,0(R1)      7      9        10      11         Wait for BNE
3          ADD.D  F4,F0,F2      7      12       -       15         Wait for L.D
3          S.D    F4,0(R1)      8      10       16      -          Wait for ADD.D
3          DADDIU R1,R1,#-8     8      9        -       10
3          BNE    R1,R2,Loop    9      11       -       -          Wait for DADDIU

CPI = 11/15 = 0.73

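The two CPI figures, and the corresponding IPC values, can be reproduced from the cycle counts the slides use: 16 cycles for 15 instructions with a shared integer unit, and 11 cycles for 15 instructions with a separate address-calculation unit. The helper names below are illustrative only.

```python
def cpi(cycles, instructions):
    """Clock cycles per instruction."""
    return cycles / instructions


def ipc(cycles, instructions):
    """Instructions per clock cycle, the reciprocal of CPI."""
    return instructions / cycles


INSTRUCTIONS = 15  # 3 iterations x 5 instructions

configs = {
    "shared INT unit": 16,         # cycle count used on the first schedule slide
    "separate address unit": 11,   # cycle count used on the second schedule slide
}
for label, cycles in configs.items():
    print(f"{label}: CPI = {cpi(cycles, INSTRUCTIONS):.2f}, "
          f"IPC = {ipc(cycles, INSTRUCTIONS):.2f}")
```

Only the second configuration gets below CPI = 1, that is, above one instruction per clock.
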
Increasing Instruction Fetch Bandwidth
- Predicts next instruction address, sends it out before decoding instruction
- PC of branch sent to BTB
- When match is found, Predicted PC is returned
- If branch predicted taken, instruction fetch continues at Predicted PC
[Figure: Branch Target Buffer (BTB)]

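A branch target buffer lookup can be sketched as a small table keyed by the fetch PC. The sketch below makes several assumptions that are not on the slide: the class and method names are hypothetical, the table is an unbounded dictionary rather than a fixed-size cache, instructions are 4 bytes wide, and only branches last seen taken are kept in the table.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the PC of a branch to its predicted target PC."""

    def __init__(self, instr_bytes=4):
        self._targets = {}            # branch PC -> predicted PC
        self._instr_bytes = instr_bytes

    def next_fetch_pc(self, pc):
        """Predict the next fetch address before the instruction at pc is decoded."""
        # Hit: fetch continues at the predicted PC. Miss: fall through sequentially.
        return self._targets.get(pc, pc + self._instr_bytes)

    def update(self, pc, taken, target):
        """Record the resolved branch outcome once it is known."""
        if taken:
            self._targets[pc] = target
        else:
            self._targets.pop(pc, None)


btb = BranchTargetBuffer()
print(hex(btb.next_fetch_pc(0x40)))   # miss: sequential fetch
btb.update(0x40, taken=True, target=0x10)
print(hex(btb.next_fetch_pc(0x40)))   # hit: predicted target
```
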
Branch Folding (I)
- Branch folding allows:
  - 0-cycle unconditional branches (always)
  - 0-cycle conditional branches (sometimes)
- BF eliminates an instruction (the branch) from the code stream
- BF eliminates the single-cycle pipeline bubble that usually occurs immediately after a branch
[Figure: predicted instruction]

Branch Folding (II)
- If the processor is issuing two instructions per cycle
[Figure: predicted instructions]

Multiple Issue Challenges
- While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
  - Exactly 50% FP operations AND no hazards
- If more instructions issue at the same time, greater difficulty of decode and issue:
  - Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue (N-issue: ~O(N^2 - N) comparisons)
  - Register file: need 2x reads and 1x writes/cycle
  - Rename logic: must be able to rename same register multiple times in one cycle! For instance, consider 4-way issue:

        add r1, r2, r3        add p11, p4, p7
        sub r4, r1, r2   =>   sub p22, p11, p4
        lw  r1, 4(r4)         lw  p23, 4(p22)
        add r5, r1, r2        add p12, p23, p4

    Imagine doing this transformation in a single cycle!
  - Result buses: need to complete multiple instructions/cycle
    - So, need multiple buses with associated matching logic at every reservation station
    - Or, need multiple forwarding paths

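The 4-way rename example above can be reproduced with a sequential sketch. The starting map (r2 in p4, r3 in p7) and the free-register order (p11, p22, p23, p12) are read off the renamed column of the slide; the function name and the tuple encoding of instructions are assumptions made for illustration.

```python
def rename(instrs, mapping, free_regs):
    """Rename (op, dest, srcs) tuples in program order using a map table and free list."""
    mapping = dict(mapping)   # architectural register -> current physical register
    free = list(free_regs)    # physical registers available for new destinations
    renamed = []
    for op, dest, srcs in instrs:
        new_srcs = tuple(mapping[s] for s in srcs)  # read current mappings first
        new_dest = free.pop(0)                      # then allocate a fresh destination
        mapping[dest] = new_dest                    # later readers see the new name
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("add", "r1", ("r2", "r3")),
    ("sub", "r4", ("r1", "r2")),
    ("lw",  "r1", ("r4",)),       # r1 is renamed a second time within the same group
    ("add", "r5", ("r1", "r2")),
]
for op, dest, srcs in rename(program, {"r2": "p4", "r3": "p7"},
                             ["p11", "p22", "p23", "p12"]):
    print(op, dest, ", ".join(srcs))
```

The loop handles one instruction at a time; the hardware challenge named on the slide is that each instruction's sources depend on the mappings created by earlier instructions in the same group, yet all four must be renamed within one cycle.
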
More about VLIW
- VLIW packages multiple operations into one very long instruction
- The compiler chooses the instructions to be issued
- Enough parallelism is needed in a straight-line code sequence to fill the available operation slots
  - Unroll loops
  - Schedule code across basic blocks using global scheduling techniques

Loop Unrolling in VLIW

Memory reference 1   Memory reference 2   FP operation 1     FP op. 2           Int. op/branch    Clock
LD F0,0(R1)          LD F6,-8(R1)                                                                 1
LD F10,-16(R1)       LD F14,-24(R1)                                                               2
LD F18,-32(R1)       LD F22,-40(R1)       ADDD F4,F0,F2      ADDD F8,F6,F2                        3
LD F26,-48(R1)                            ADDD F12,F10,F2    ADDD F16,F14,F2                      4
                                          ADDD F20,F18,F2    ADDD F24,F22,F2                      5
SD 0(R1),F4          SD -8(R1),F8         ADDD F28,F26,F2                                         6
SD -16(R1),F12       SD -24(R1),F16                                                               7
SD -32(R1),F20       SD -40(R1),F24                                             SUBI R1,R1,#56    8
SD 8(R1),F28                                                                    BNEZ R1,LOOP      9   ; 8-56 = -48

- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
- Average: 2.5 ops per clock, 50% efficiency
- Note: need more registers in VLIW (15 vs. 6 in SS)

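The summary figures for the VLIW schedule can be checked by counting the operations in the table: 7 loads, 7 FP adds, 7 stores, one SUBI and one BNEZ, spread over 9 long instructions of 5 slots each. The variable names below are illustrative; the counts come from the schedule itself.

```python
ops = {"LD": 7, "ADDD": 7, "SD": 7, "SUBI": 1, "BNEZ": 1}  # operations in the schedule
clocks = 9            # long instructions issued, one per clock
slots_per_word = 5    # 2 memory + 2 FP + 1 integer/branch slot
unrolled = 7          # loop iterations covered by the schedule

total_ops = sum(ops.values())
ops_per_clock = total_ops / clocks
efficiency = total_ops / (clocks * slots_per_word)  # fraction of slots doing useful work
clocks_per_iter = clocks / unrolled

print(f"{total_ops} ops, {ops_per_clock:.2f} ops/clock, "
      f"{efficiency:.0%} of slots used, {clocks_per_iter:.2f} clocks/iteration")
```

The exact values are slightly above the slide's rounded 2.5 ops per clock and 50% efficiency; about half of the operation slots remain empty even after unrolling seven times.
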
Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation
- HW determines address conflicts
- HW better branch prediction
- HW maintains precise exception model
- HW does not execute bookkeeping instructions
- Works across multiple implementations
- SW speculation is much easier for HW design

Superscalar v. VLIW
Superscalar:
- Smaller code size
- Binary compatibility across generations of hardware
VLIW:
- Simplified hardware for decoding, issuing instructions
- No interlock hardware (compiler checks?)
- More registers, but simplified hardware for register ports (multiple independent register files?)

Limits in Multi-issue Processors
- Inherent limitations of ILP in programs
- Difficulties in building the underlying hardware
- Limitations specific to either a superscalar or a VLIW implementation