Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

6823, L14--1 Complex Pipelining: Out-of-order Execution & Register Renaming Laboratory for Computer Science MIT http://wwwcsglcsmitedu/6823 Multiple Function Units 6823, L14--2 ALU Mem IF ID Issue WB Fadd GPR s FPR s Fmul Scoreboard: In-order issue, out-of-order completion Register Renaming: Fdiv Out-of-order issue Speculative Execution Page 1

Scoreboard for In-order Issues 6823, L14--3 An instruction is not dispatched by the Issue stage if a RAW or WAW hazard exists or the required FU is busy Busy[unit#] : a bit-vector to indicate unit s availability (unit = Int, Add, Mult, Div) These bits are hardwired to FU s WP[reg#] : a bit-vector to record the registers for which writes are pending Issue checks the instruction (opcode dest src1 src2) against the scoreboard (Busy & WP) to dispatch FU available? RAW? WAR? WAW? not Busy[FU#] WP[src1] or WP[src2] cannot arise WP[dest] I 2 I 1 Scoreboard Dynamics worksheet Functional Unit Status Int(1) Add(1) Mult(3) Div(4) WB for Writes t0 f6 f6 t1 f2 f6 f6, f2 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 I 1 DIVD f6, f6, f4 I 2 LD f2, 45(r3) I 3 MULTD f0, f2, f4 I 4 DIVD f8, f6, f2 I 5 SUBD f10, f0, f6 I 6 ADDD f6, f8, f2 Registers Reserved 6823, L14--4 FU available? RAW? WAW? Page 2

I 1 I 2 I 3 I 4 I 5 I 6 Scoreboard Dynamics Functional Unit Status Int(1) Add(1) Mult(3) Div(4) WB for Writes t0 f6 f6 t1 f2 f6 f6, f2 t2 f6 f2 f6, f2 t3 f0 f6 f6, f0 t4 f0 f6 f6, f0 t5 f0 f8 f0, f8 t6 f8 f0 f0, f8 t7 f10 f8 f8, f10 t8 f8 f10 f8, f10 t9 f8 f8 t10 f6 f6 t11 f6 f6 I 1 DIVD f6, f6, f4 I 2 LD f2, 45(r3) I 3 MULTD f0, f2, f4 I 4 DIVD f8, f6, f2 I 5 SUBD f10, f0, f6 I 6 ADDD f6, f8, f2 Registers Reserved 6823, L14--5 I 2 I 1 I 3 I 5 I 4 I 6 Precise Interrupts 6823, L14--6 It must appear as if an interrupt is taken between two instructions (say I i and I i+1 ) the effect of all instructions up to and including I i is totally complete no effect of any instruction after I i has taken place The interrupt handler either aborts the program or restarts it at I i+1 Page 3

Effect on Interrupts Out-of-order Completion 6823, L14--7 I 1 DIVD f6, f6, f4 I 2 LD f2, 45(r3) I 3 MULTD f0, f2, f4 I 4 DIVD f8, f6, f2 I 5 SUBD f10, f0, f6 I 6 ADDD f6, f8, f2 out-of-order comp 1 2 2 3 1 4 3 5 5 4 6 6 restore f2 restore f10 Consider interrupts Precise interrupts are difficult to implement The choice - imprecise interrupts or speculative execution Multiple Instructions in the Issue Stage 6823, L14--8 ALU Mem IF ID Issue WB Fadd Fmul Decode adds an instruction to the buffer only if it is not full and the instruction does not cause a WAR or WAW hazard Any instruction in the Issue buffer whose RAW hazard has been satisfied can be dispatched On a write back (WB) new instructions may get enabled Out-of order dispatch (and completion) Page 4

Out-of-Order Issue: an example 6823, L14--9 latency 1 LD F2, 34(R2) 1 1 2 2 LD F4, 45(R3) long 3 MULTD F6, F4, F2 3 4 SUBD F8, F2, F2 1 5 DIVD F4, F2, F8 4 4 5 3 6 ADDD F10, F6, F4 1 6 In-order: 1 (2,1) 2 3 4 4 3 5 5 6 6 Out-of-order: 1 (2,1) 4 4 2 3 3 5 5 6 6 Out-of-order did not allow any significant improvement! How many Instructions can be in the pipeline 6823, L14--10 Which features of an ISA limit the number of instructions in the pipeline? Which features of a program limit the number of instructions in the pipeline? Page 5

Overcoming the Lack of Register Names 6823, L14--11 Number of registers in an ISA limits the number of partially executed instructions in complex pipelines Floating Point pipelines often cannot be kept filled with small number of registers IBM 360 had only 8 Floating Point Registers Can a microarchitecture use more registers than specified by the ISA without loss of ISA compatibility? Robert Tomosulo of IBM suggested an ingenious solution in 1967 based on on-the-fly register renaming Instruction-Level Parallelism with Renaming latency 1 LD F2, 34(R2) 1 6823, L14--12 1 2 2 LD F4, 45(R3) long 3 MULTD F6, F4, F2 3 4 SUBD F8, F2, F2 1 4 X 3 5 DIVD F4, F2, F8 4 5 6 ADDD F10, F6, F4 1 6 In-order: 1 (2,1) 2 3 4 4 (5,3) 5 6 6 Out-of-order: 1 (2,1) 4 4 5 2 (3,5) 3 6 6 Any antidependence can be eliminated by renaming renaming storage Can it be done in hardware? yes! Page 6

Register Renaming 6823, L14--13 ALU Mem IF ID ROB WB Fadd Fmul Decode does register renaming and adds instructions to the reorder buffer (ROB) renaming makes WAR or WAW hazards impossible Any instruction in ROB whose RAW hazard has been satisfied can be dispatched Out-of order or dataflow execution Renaming & Out-of-order Issue An example 6823, L14--14 Renaming table p data F1 F2 F3 F4 F5 F6 F7 F8 Reorder buffer Ins# use exec op p1 src1 p2 src2 t 1 t2 data / t i 1 LD F2, 34(R2) 2 LD F4, 45(R3) 3 MULTD F6, F4, F2 4 SUBD F8, F2, F2 5 DIVD F4, F2, F8 6 ADDD F10, F6, F4 When are names in sources replaced by data? When can a name be reused? Page 7

Data-Driven Execution 6823, L14--15 Renaming table & reg file Reorder buffer Replacing the tag by its value is an expensive operation Ins# use exec op p1 src1 p2 src2 t 1 t2 t n Load Unit FU FU Store Unit < t, result > Instruction template (ie, tag t) is allocated by the Decode stage, which also stores the tag the reg file When an instruction completes, its tag is deallocated 6823, L14--16 Simplifying Allocation/Deallocation Reorder buffer Ins# use exec op p1 src1 p2 src2 ptr 2 next to deallocate t 1 t 2 prt 1 next t n available Instruction buffer is managed circularly When an instruction completes its use bit is marked free ptr 2 is incremented only if the use bit is marked free Page 8

IBM 360/91 Floating Point Unit R M Tomasulo, 1967 6823, L14--17 1 2 3 4 5 6 data load buffers (from storage) instructions 1 2 3 4 p data Floating Point Reg distribute instruction templates by functional units 1 2 3 p data p data Adder 1 2 < t, result > p data p data Mult store buffers (to storage) p data Common bus ensures that data is made available immediately to all the instructions waiting for it Effectiveness? 6823, L14--18 Renaming and Out-of-order execution was first implemented in 1969 in IBM 360/91 but did not show up in the subsequent models until mid Nineties Why? reasons 1 No precise interrupts! 2 Effective on a very small class of programs One more problem needed to be solved Page 9

Effect of Control Transfer on Pipelined Execution 6823, L14--19 Control transfer instructions require insertion of bubbles in the pipeline The number of bubbles depends upon the number of cycles it takes to determine the next instruction address, and to fetch the next instruction Average Run-Length between Branches Average dynamic instruction mix from SPEC92: SPECint92 SPECfp92 ALU 39 % 13 % FPU Add 20 % FPU Mult 13 % load 26 % 23 % store 9 % 9 % branch 16 % 8 % other 10 % 12 % 6823, L14--20 SPECint92: SPECfp92: compress, eqntott, espresso, gcc, li doduc, ear, hydro2d, mdijdp2, su2cor What is the average run length between branches? Page 10

Loss Due to Control Transfers Suppose a branch needed the result of a long latency operation, eg load=1~50, FADD=2, FMUL=4, FDIV=10 How often does each case arise? How far apart are the branch instruction and the instruction that determines the branch condition? 6823, L14--21 Assuming average run length = l (10, 100, 1000) average delay in branch resolution = d (1, 2, 10, 100) maximum pipeline efficiency =? l \ d 1 2 10 100 10 909 833 500 91 100 990 980 909 500 1000 999 998 990 909 6823, L14--22 Reducing Control Transfer Penalties Software solution loop unrolling Increases the run length instruction scheduling Compute the branch condition as early as possible (limited) Hardware solution delay slots replaces pipeline bubbles with useful work (requires software cooperation) branch prediction & speculative execution of instructions beyond the branch Page 11

Branch Prediction and Speculative Execution motivation: without longer run-lengths, pipelines with long latencies cannot be utilized properly 6823, L14--23 speculative execution can increase run length for most programs required hardware support: for prediction: branch target buffers etc for speculative execution: shadow registers and memory buffers for each speculated branch improved prediction by itself does not increase run lengths Speculative Execution 6823, L14--24 next PC Branch Prediction (BTB) update prediction Branch Resolution branch condition Kill Kill Kill PC IF ID IS Ex WB Reg File Mem Buf s Speculated instructions should not update the reg file or memory hold the data in the instruction reorder buffer + no speculative stores Page 12

Shadows and Renaming 6823, L14--25 Two ideas have a lot in common: both require additional resources registers, instruction buffers both require allocation of a resource on instruction issue, and maintenance of a <reg no, tag> map both require deallocation of resources when an instruction completes Differences: the ability to kill an instruction in case of wrong prediction kill free resources occupied by the shadow Extensions for Speculation 6823, L14--26 Instruction reorder buffer Ins# use exec op p1 src1 p2 src2 pd dest data ptr 2 next to commit ptr 1 next available add <pd, dest, data> fields in the instruction template commit instructions to reg file and memory in program order buffers can be maintained circularly wrong speculation roll back the next available pointer no speculative stores Page 13

Rollback and Renaming 6823, L14--27 Register File Reorder buffer Ins# use exec op p1 src1 p2 src2 pd dest data t 1 t 2 t n Load Unit FU FU FU Store Unit < t, result > Register file does not contain renaming tags any more How does the decode stage find the tag of a source register? Renaming Table 6823, L14--28 Renaming Table r 1 r 2 t i t j Reg File Reorder buffer Ins# use exec op p1 src1 p2 src2 pd dest data t 1 t 2 t n Load Unit FU FU FU Store Unit < t, result > Renaming table is like a cache to speed up register name look up It needs to be loaded after each branch prediction failure! Several renaming tables? Page 14

Temporary Register File 6823, L14--29 The width of the reorder buffer (ROB) can be reduced if data values are stored in a temporary register file instead of ROB This will reduce the width of several datapaths at the expense of extra register fetches while dispatching to functional units Temporary register files does not reduce the need for associative searching of tags when an instruction completes There is tremendous variety in the implementation of ROBs in commercial microprocessors Speculative Loads / Stores 6823, L14--30 Just like register updates, stores should not modify the memory until after the instruction is committed store buffer entry must carry a speculation bit and the tag of the corresponding store instruction If the instruction is committed, the speculation bit of the corresponding store buffer entry is cleared If the instruction is killed, the corresponding store buffer entry is freed Loads work normally -- older store buffer entries needs to be searched before accessing the memory or the cache Page 15