Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

TDT4255 Lecture 9: ILP and speculation
Donn Morrison, Department of Computer Science

Outline
- Textbook: Computer Architecture: A Quantitative Approach, 4th ed.
  - Section 2.6: Speculation
  - Section 2.7: Multiple issue and static scheduling
  - Section 2.8: Dynamic scheduling, multiple issue, speculation
  - Section 2.9: Advanced techniques for instruction delivery and speculation
  - Section 2.10: Putting it all together: The Intel Pentium 4
- Slides adapted from Arvind & Emer,

Review
- What is the goal of instruction level parallelism (ILP)?
- What are the two main approaches to ILP?
- What is an example of a WAR hazard?
- What is the general idea behind dynamic scheduling?

Multiple issue and static scheduling
- To achieve CPI < 1, need to complete multiple instructions per clock
- Solutions:
  - Statically scheduled superscalar processors
  - Very long instruction word (VLIW) processors (done in software)
  - Dynamically scheduled superscalar processors (done in hardware)

Speculation
- Executing instructions when you are unsure the results will actually be needed
- Execute out-of-order, but commit in-order
- Two types of speculation:
  - Control speculation: move instructions across a branch boundary
  - Data speculation: execute loads and stores out-of-order

Speculation
- How to design an out-of-order processor that:
  - Uses register renaming to remove WAW and WAR dependencies
  - Handles instruction exceptions
  - Executes across branch boundaries
  - Reorders load and store instructions

Dataflow execution
- Instruction slot is candidate for execution when:
  - It holds a valid instruction ("use" bit is set)
  - It has not already started execution ("exec" bit is clear)
  - Both operands are available (p1 and p2 are set)
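The three conditions above can be written as a single predicate. This is a minimal sketch; the `Slot` class and `is_candidate` function are hypothetical names chosen for illustration, not part of the slides.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    # Field names are illustrative; they mirror the bits named on the slide.
    use: bool = False    # slot holds a valid instruction
    exec_: bool = False  # instruction has already started execution
    p1: bool = False     # operand 1 is available
    p2: bool = False     # operand 2 is available

def is_candidate(slot: Slot) -> bool:
    """A slot may be selected for execution only if all three conditions hold."""
    return slot.use and not slot.exec_ and slot.p1 and slot.p2

slots = [
    Slot(use=True, exec_=False, p1=True, p2=True),   # candidate
    Slot(use=True, exec_=True, p1=True, p2=True),    # already executing
    Slot(use=True, exec_=False, p1=True, p2=False),  # waiting on operand 2
    Slot(),                                          # empty slot
]
ready = [i for i, s in enumerate(slots) if is_candidate(s)]
print(ready)
```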

Reorder buffer: example
- Sources replaced by data when an FU finishes
- F2 ≠ F4 to avoid divide by zero
- More compact format than pp

  1  LD     F2, 34(R2)
  2  LD     F4, 45(R3)
  3  MULTD  F6, F4, F2
  4  SUBD   F8, F2, F4
  5  DIVD   F4, F2, F8
  6  ADDD   F10, F6, F4

- Reorder buffer columns: Ins#, Use, Exec, Operation, P1, Source 1, P2, Source 2, PD, Destination, Data

Reorder buffer: CC1
- Assume 34(R2) holds 2.0
- Assume 45(R3) holds 4.0

  Ins#  Use  Exec  Operation  P1  Source 1  P2  Source 2  PD  Destination  Data
  1     X    -     LD         X   (R2)+34   -   -         -   F2           -
  2     X    -     LD         X   (R3)+45   -   -         -   F4           -
  3     X    -     MULT.D     -   #2        -   #1        -   F6           -
  4     X    -     SUB.D      -   #1        -   #2        -   F8           -

Reorder buffer: CC2
- ROB populated
- LD instructions complete, sources replaced with data values

  Ins#  Use  Exec  Operation  P1  Source 1  P2  Source 2  PD  Destination  Data
  1     X    X     LD         X   (R2)+34   -   -         X   F2           2.0
  2     X    X     LD         X   (R3)+45   -   -         X   F4           4.0
  3     X    -     MULT.D     X   4.0       X   2.0       -   F6           -
  4     X    -     SUB.D      X   2.0       X   4.0       -   F8           -
  5     X    -     DIV.D      X   2.0       -   #4        -   F4           -
  6     X    -     ADD.D      -   #3        X   2.0       -   F10          -

Reorder buffer: CC3
- Completed instructions committed and deallocated
- Operands for instructions 5, 6 ready

  Ins#  Use  Exec  Operation  P1  Source 1  P2  Source 2  PD  Destination  Data
  1     Commit
  2     Commit
  3     X    X     MULT.D     X   4.0       X   2.0       X   F6           8.0
  4     X    X     SUB.D      X   2.0       X   4.0       X   F8           -2.0
  5     X    -     DIV.D      X   2.0       X   -2.0      -   F4           -
  6     X    -     ADD.D      X   8.0       X   2.0       -   F10          -

Reorder buffer: CC4

  Ins#  Use  Exec  Operation  P1  Source 1  P2  Source 2  PD  Destination  Data
  3     Commit
  4     Commit
  5     X    X     DIV.D      X   2.0       X   -2.0      X   F4           -1.0
  6     X    X     ADD.D      X   8.0       X   2.0       X   F10          10.0

Data-driven execution
- Instruction template (i.e., tag t) is allocated by the decode stage, which also stores the tag in the reg file
- When an instruction completes, its tag is deallocated

Simplifying allocation / deallocation
- Instruction buffer is managed circularly:
  - "exec" bit is set when instruction begins execution
  - When an instruction completes, its "use" bit is marked free
  - ptr2 is incremented only if the "use" bit is marked free
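A minimal sketch of this circular management follows. It assumes ptr1 is the next slot to allocate and ptr2 is the oldest slot not yet reclaimed; the class name, method names, and the `occupied` counter (used to tell a full buffer from an empty one) are illustrative assumptions.

```python
class InstructionBuffer:
    """Circular instruction buffer; names and the occupied counter are illustrative."""

    def __init__(self, size):
        self.use = [False] * size    # slot holds an instruction that has not completed
        self.exec_ = [False] * size  # instruction has begun execution
        self.ptr1 = 0                # assumed: next slot to allocate
        self.ptr2 = 0                # assumed: oldest slot not yet reclaimed
        self.occupied = 0            # slots between ptr2 and ptr1

    def allocate(self):
        if self.occupied == len(self.use):
            return None              # buffer full: decode would have to stall
        slot = self.ptr1
        self.use[slot], self.exec_[slot] = True, False
        self.ptr1 = (self.ptr1 + 1) % len(self.use)
        self.occupied += 1
        return slot

    def begin_execution(self, slot):
        self.exec_[slot] = True

    def complete(self, slot):
        self.use[slot] = False       # mark the slot free
        # ptr2 advances only past slots whose use bit is free
        while self.occupied > 0 and not self.use[self.ptr2]:
            self.ptr2 = (self.ptr2 + 1) % len(self.use)
            self.occupied -= 1

buf = InstructionBuffer(4)
a, b, c = buf.allocate(), buf.allocate(), buf.allocate()
buf.complete(b)                      # younger instruction completes first
ptr2_after_b = buf.ptr2              # still 0: the oldest slot is in use
buf.complete(a)                      # oldest completes: ptr2 skips both free slots
ptr2_after_a = buf.ptr2
```

Out-of-order completion leaves a hole in the middle of the buffer; the hole is reclaimed only once everything older has completed as well.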

Effectiveness?
- Renaming and out-of-order execution were first implemented in 1969 in the IBM 360/91 but did not show up in subsequent models until the mid-1990s. Why?
  - Effective on a very small class of programs
  - Did not address the memory latency problem, which turned out to be a much bigger issue than FU latency
  - Made exceptions imprecise
- One more problem needed to be solved: control transfers

Precise exceptions
- Exceptions are relatively unlikely events that need special processing, but where adding explicit control flow instructions is not desired, e.g., divide by zero, page fault
- Exceptions can be viewed as an implicit conditional subroutine call that is inserted between two instructions
- Therefore, it must appear as if the exception is taken between two instructions (say I_i and I_(i+1))
  - The effect of all instructions up to and including I_i is complete
  - No effect of any instruction after I_i has taken place
- The handler either aborts the program or restarts it at I_(i+1)

Effect on exceptions
- Out-of-order completion

  1  DIVD   F6, F6, F4
  2  LD     F2, 45(R3)
  3  MULTD  F0, F2, F4
  4  DIVD   F8, F6, F2
  5  SUBD   F10, F0, F6
  6  ADDD   F6, F8, F2

- [Figure: out-of-order completion order; annotations: "Consider exceptions", "restore F2", "restore F10"]
- Precise exceptions are difficult to implement at high speed: we want to start execution of later instructions before exception checks have finished on earlier instructions

Exceptions
- Exceptions create dependence on the value of the next PC
- Options for handling this dependence:
  - Stall: No
  - Bypass: No
  - Find something else to do: No
  - Change the architecture: Sometimes (Alpha, Multiflow)
  - Speculate: Most common approach!
- How can we handle rollback on mis-speculation?
  - Delay state update until commit on speculated instructions
- Note: earlier exceptions must override later ones

Phases of instruction execution

Exception handling (in-order)
- Hold exception flags in pipeline until commit point (M stage)
- If exception at commit:
  - Update Cause/EPC registers
  - Kill all stages
  - Fetch at handler PC
- Inject external interrupts at commit point

In-order commit for precise exceptions
- Instructions fetched and decoded into instruction reorder buffer in-order
- Execution is out-of-order (implying out-of-order completion)
- Commit (write-back to architectural state, i.e., register file and memory) is in-order
- Temporary storage needed to hold results before commit (shadow registers and store buffers)

Extensions for precise exceptions
- Add <pd, dest, data, cause> fields in the instruction template
- Commit instructions to register file and memory in program order (buffers can be maintained circularly)
- On exception, clear reorder buffer by resetting ptr1=ptr2 (stores must wait for commit before updating memory)

Renaming table exception extension
- Renaming table is a cache to speed up register name lookup
- It needs to be cleared after each exception taken
- Where else are valid bits cleared? Control transfers

Physical register files
- Reorder buffers are space inefficient: a data value may be stored in multiple places in the reorder buffer
- Idea: keep all data values in a physical register file
  - Tag represents the name of the data value and name of the physical register that holds it
  - Reorder buffer then contains only tags
- Thus, 64-bit data values may be replaced by 8-bit tags for a 256 element physical register file

Recovering ROB / renaming table
- Take snapshot of register renaming table at each predicted branch, recover earlier snapshot if branch is mispredicted
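A minimal sketch of snapshot-based recovery, assuming the rename table is a plain mapping from architectural to physical register names. The `RenameTable` class, its method names, and the branch identifiers are hypothetical; real hardware holds a small fixed number of snapshots.

```python
class RenameTable:
    """Illustrative rename table with per-branch snapshots (names are assumptions)."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)  # architectural register -> physical register
        self.snapshots = []           # (branch_id, saved mapping), oldest first

    def rename(self, arch_reg, phys_reg):
        self.mapping[arch_reg] = phys_reg

    def snapshot(self, branch_id):
        # Taken when a predicted branch is decoded
        self.snapshots.append((branch_id, dict(self.mapping)))

    def resolve(self, branch_id, mispredicted):
        idx = [b for b, _ in self.snapshots].index(branch_id)
        if mispredicted:
            # Restore the table as it was at the branch, and drop the snapshots
            # of this branch and of all younger (wrong-path) branches
            self.mapping = self.snapshots[idx][1]
            del self.snapshots[idx:]
        else:
            del self.snapshots[idx]   # prediction was right: snapshot no longer needed

table = RenameTable({"R1": "P1", "R3": "P2"})
table.snapshot("b0")
table.rename("R1", "P3")              # speculative rename after branch b0
table.snapshot("b1")
table.rename("R3", "P4")              # speculative rename after branch b1
table.resolve("b0", mispredicted=True)
```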

Map table recovery - snapshots
- Speculative value management of microarchitectural state
- What kind of value management is this? Greedy!

Out-of-order execution with ROB
- Basic operation:
  - Enter op and tag or data (if known) for each source
  - Replace tag with data as it becomes available
  - Issue instruction when all sources are available
  - Save destination data when operation finishes
  - Commit saved destination data when instruction commits
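The basic operation can be sketched as a small simulation. This is a simplification under stated assumptions: every ready instruction finishes in one step, a source is either a float (data) or a string tag such as "#4", and the `Entry` class and `run` function are hypothetical names.

```python
import operator

OPS = {"MULT.D": operator.mul, "SUB.D": operator.sub, "DIV.D": operator.truediv}

class Entry:
    def __init__(self, op, src1, src2, dest):
        self.op = op
        self.srcs = [src1, src2]  # each source: float data, or str tag of the producer
        self.dest = dest
        self.data = None          # destination data, saved when the operation finishes

    def ready(self):
        # Issue only when not yet executed and no source is still a tag
        return self.data is None and not any(isinstance(s, str) for s in self.srcs)

def run(rob, regs):
    """rob maps tag -> Entry in program order; returns the commit order."""
    pending = list(rob)
    committed = []
    while pending:
        finished = [t for t in pending if rob[t].ready()]
        if not finished:
            raise RuntimeError("no entry can issue: unresolved tag")
        for t in finished:                     # execute and save destination data
            rob[t].data = OPS[rob[t].op](*rob[t].srcs)
        for t in finished:                     # replace matching tags with data
            for e in rob.values():
                e.srcs = [rob[t].data if s == t else s for s in e.srcs]
        while pending and rob[pending[0]].data is not None:
            t = pending.pop(0)                 # commit strictly from the oldest entry
            regs[rob[t].dest] = rob[t].data
            committed.append(t)
    return committed

rob = {
    "#3": Entry("MULT.D", 4.0, 2.0, "F6"),
    "#4": Entry("SUB.D", 2.0, 4.0, "F8"),
    "#5": Entry("DIV.D", 2.0, "#4", "F4"),   # waits for the result tagged #4
}
regs = {}
order = run(rob, regs)
```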

Lifetime of physical registers
- Physical register file holds committed and speculative values
- Physical registers decoupled from ROB entries (no data in ROB)

  Original             Renamed
  LD  R1, (R3)         LD  P1, (Px)
  ADD R3, R1, #4       ADD P2, P1, #4
  SUB R1, R3, R9       SUB P3, P2, Py
  ADD R3, R1, R7       ADD P4, P3, Pz
  LD  R6, (R1)         LD  P5, (P3)
  ADD R8, R6, R3       ADD P6, P5, P4
  ST  R8, (R1)         ST  P6, (P3)
  LD  R3, (R11)        LD  P7, (Pw)

- When can we reuse a physical register? When the next write of the same architectural register commits
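A minimal sketch of renaming with a free list, applied to the sequence above. The tuple encoding of instructions and the `rename` function are assumptions made for illustration; Px, Py, Pz and Pw stand for whatever physical registers R3, R9, R7 and R11 were already mapped to.

```python
def rename(program, rename_table, free_list):
    """Rename (op, dest, srcs) tuples; every destination gets a fresh physical register."""
    table = dict(rename_table)
    free = list(free_list)
    renamed = []
    for op, dest, srcs in program:
        # Sources are translated first, so an instruction that reads and writes
        # the same architectural register sees the old mapping.
        new_srcs = [table.get(s, s) for s in srcs]  # immediates pass through unchanged
        new_dest = None
        if dest is not None:
            new_dest = free.pop(0)
            table[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

program = [
    ("LD",  "R1", ["R3"]),
    ("ADD", "R3", ["R1", "#4"]),
    ("SUB", "R1", ["R3", "R9"]),
    ("ADD", "R3", ["R1", "R7"]),
    ("LD",  "R6", ["R1"]),
    ("ADD", "R8", ["R6", "R3"]),
    ("ST",  None, ["R8", "R1"]),   # a store writes no register
    ("LD",  "R3", ["R11"]),
]
initial = {"R3": "Px", "R9": "Py", "R7": "Pz", "R11": "Pw"}
free = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]
result = rename(program, initial, free)
for op, dest, srcs in result:
    print(op, dest, srcs)
```

P1 (the first mapping of R1) can return to the free list once the SUB, which is the next writer of R1, has committed.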

Physical register management

Unified physical register file
- One register file for both committed and speculative values (no data in ROB)
- During decode, instruction result allocated new physical register, source regs translated to physical regs through rename table
- Instruction reads data from register file at start of execute (not in decode)
- Write-back updates reg. busy bits on instructions in ROB (assoc. search)
- Snapshots of rename table taken at every branch to recover mispredicts
- On exception, renaming undone in reverse order of issue (MIPS R10000)

Speculative and out-of-order execution

Reorder buffer holds active instruction window

Branch penalty
- How many instructions need to be killed on a misprediction?
- Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution!
- [Figure: pipeline diagram; labels: "Next fetch started", "Branch executed"]

Getting CPI below 1
- CPI ≥ 1 if only 1 instruction is issued every clock cycle
- Multiple-issue processors come in 3 flavours:
  1. Statically-scheduled superscalar processors
     - In-order execution
     - Varying number of instructions issued (compiler)
  2. Dynamically-scheduled superscalar processors
     - Out-of-order execution
     - Varying number of instructions issued (CPU)
  3. VLIW (very long instruction word) processors
     - In-order execution
     - Fixed number of instructions issued

VLIW: Very long instruction word (1/2)
- Each VLIW has explicit coding for multiple operations
  - Several instructions combined into packets
  - Possibly with parallelism indicated
- Trade instruction space for simple decoding
  - Room for many operations
  - Independent operations execute in parallel
  - E.g., 2 integer operations, 2 FP operations, 2 memory references, 1 branch

VLIW: Very long instruction word (2/2)
- Assume 2 load/store, 2 FP, 1 int/branch
  - VLIW with 0-5 operations
  - Why 0?
- Important to avoid empty instruction slots
  - Loop unrolling
  - Local scheduling
  - Global scheduling (across branches)
- Difficult to find all dependencies in advance
  - Solution 1: Block on memory access
  - Solution 2: CPU detects some dependencies

Loop unrolling
- Recall: unrolled loop minimises stalls for scalar

  for (i=999; i>=0; i=i-1)
      x[i] = x[i] + s;

- Register mapping: s in F2, i in R1

  Loop: L.D    F0, 0(R1)
        L.D    F6, -8(R1)
        L.D    F10, -16(R1)
        L.D    F14, -24(R1)
        ADD.D  F4, F0, F2
        ADD.D  F8, F6, F2
        ADD.D  F12, F10, F2
        ADD.D  F16, F14, F2
        S.D    F4, 0(R1)
        S.D    F8, -8(R1)
        DADDUI R1, R1, #-32
        S.D    F12, 16(R1)
        S.D    F16, 8(R1)
        BNE    R1, R2, Loop
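The same transformation can be sketched at source level. This is an illustrative Python version, not the compiler's output; the function names are hypothetical, and a clean-up loop is added as an assumption so that lengths that are not a multiple of 4 still work.

```python
def add_scalar(x, s):
    # Original loop: one element per iteration, from the last element down
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s

def add_scalar_unrolled(x, s):
    # Unrolled by 4: one loop-overhead step (decrement and branch) per 4 elements
    i = len(x) - 1
    while i >= 3:
        x[i] += s
        x[i - 1] += s
        x[i - 2] += s
        x[i - 3] += s
        i -= 4
    while i >= 0:   # clean-up for leftover elements
        x[i] += s
        i -= 1

a = [float(v) for v in range(1000)]
b = list(a)
add_scalar(a, 2.5)
add_scalar_unrolled(b, 2.5)
```

The four independent updates in the unrolled body are what the scheduler can interleave to hide load and FP latency.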

Loop unrolling in VLIW

  Clock  Mem ref 1         Mem ref 2         FP op 1            FP op 2            Int. op/branch
  1      L.D F0,0(R1)      L.D F6,-8(R1)     -                  -                  -
  2      L.D F10,-16(R1)   L.D F14,-24(R1)   -                  -                  -
  3      L.D F18,-32(R1)   L.D F22,-40(R1)   ADD.D F4,F0,F2     ADD.D F8,F6,F2     -
  4      L.D F26,-48(R1)   -                 ADD.D F12,F10,F2   ADD.D F16,F14,F2   -
  5      -                 -                 ADD.D F20,F18,F2   ADD.D F24,F22,F2   -
  6      S.D 0(R1),F4      S.D -8(R1),F8     ADD.D F28,F26,F2   -                  -
  7      S.D -16(R1),F12   S.D -24(R1),F16   -                  -                  -
  8      S.D -32(R1),F20   S.D -40(R1),F24   -                  -                  DSUBUI R1,R1,#48
  9      S.D -0(R1),F28    -                 -                  -                  BNEZ R1,LOOP

- Unrolled 7 iterations to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration (1.8x)
- Average: 2.5 operations per clock, 50% efficiency
- Note: need more registers in VLIW (15 vs 6 in MIPS)
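The figures quoted for this schedule can be checked with a few lines of arithmetic. The operation count is taken from the schedule itself (7 loads, 7 adds, 7 stores, 1 DSUBUI, 1 BNEZ); the variable names are illustrative.

```python
iterations = 7        # loop bodies packed into the schedule
clocks = 9            # VLIW instructions issued
slots_per_word = 5    # 2 memory, 2 FP, 1 integer/branch

operations = 7 + 7 + 7 + 1 + 1   # L.D + ADD.D + S.D + DSUBUI + BNEZ

clocks_per_iteration = clocks / iterations            # about 1.29
ops_per_clock = operations / clocks                   # about 2.56
efficiency = operations / (clocks * slots_per_word)   # about 0.51

print(f"{clocks_per_iteration:.2f} clocks per iteration")
print(f"{ops_per_clock:.2f} operations per clock")
print(f"{efficiency:.0%} of issue slots used")
```

So 23 operations fill 45 available slots, which is where the roughly 2.5 operations per clock and 50% efficiency come from.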

Problems with 1st generation VLIW
- Increase in code size
  - Loop unrolling
  - Partially empty VLIW
- Operated in lock-step; no hazard detection HW
  - A stall in any functional unit pipeline causes the entire processor to stall, since all functional units must be kept synchronised
  - Compiler might predict functional units, but caches are hard to predict
  - Modern VLIWs are interlocked (identify dependencies between bundles and stall)
- Binary code compatibility
  - Strict VLIW: different numbers of functional units and unit latencies require different versions of the code

VLIW tradeoffs
- Advantages
  - Simpler hardware because the HW does not have to identify independent instructions
- Disadvantages
  - Relies on smart compiler
  - Code incompatibility between generations
  - There are limits to what the compiler can do (cannot move loads above branches, cannot move loads above stores)
- Common uses
  - Embedded market where hardware simplicity is important, applications exhibit plenty of ILP, and binary compatibility is a non-issue

IA-64 and EPIC
- 64-bit instruction set architecture
  - Not a CPU, but an architecture
  - Itanium and Itanium 2 are CPUs based on IA-64
- Made by Intel and Hewlett-Packard (Itanium 2 and 3 designed in Colorado)
- Uses EPIC: Explicitly Parallel Instruction Computing
- Departure from the x86 architecture
- Meant to achieve out-of-order performance with in-order HW + compiler-smarts
  - Stop bits to help with code density
  - Support for control speculation (moving loads above branches)
  - Support for data speculation (moving loads above stores)

Control speculation
- Can the compiler schedule an independent load above a branch?

  BNE R1, R2, TARGET
  LD  R3, R4(0)

- What are the problems?
- EPIC provides speculative loads

  LD.S  R3, R4(0)
  BNE   R1, R2, TARGET
  CHECK R4(0)

Data speculation
- Can the compiler schedule an independent load above a store?

  ST R5, R6(0)
  LD R3, R4(0)

- What are the problems?
- EPIC provides advanced loads and an ALAT (Advanced Load Address Table)

  LD.A R3, R4(0)    creates entry in ALAT
  ST   R5, R6(0)    looks up ALAT; if match, jump to fixup code
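A minimal sketch of the ALAT idea as described above: an advanced load records its address, and a later store to the same address signals that fixup is needed. The `ALAT` class and its methods are hypothetical, memory is modelled as a dict, and the fixup here is simply a reload; this is a simplification, not the IA-64 mechanism in detail.

```python
class ALAT:
    """Toy Advanced Load Address Table (illustrative names and behaviour)."""

    def __init__(self):
        self.entries = {}  # destination register -> address loaded early

    def advanced_load(self, reg, addr, memory):
        self.entries[reg] = addr      # LD.A creates an entry
        return memory[addr]

    def store(self, addr, value, memory):
        memory[addr] = value
        # The store looks up the table: any match means an early load read stale data
        conflicts = [r for r, a in self.entries.items() if a == addr]
        for r in conflicts:
            del self.entries[r]
        return conflicts              # registers that need fixup code

memory = {100: 1, 200: 2}
alat = ALAT()
r3 = alat.advanced_load("R3", 100, memory)   # load hoisted above the stores
no_conflict = alat.store(200, 9, memory)     # different address: nothing to fix
conflict = alat.store(100, 7, memory)        # same address: R3 is stale
if "R3" in conflict:
    r3 = memory[100]                         # fixup: redo the load
```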

EPIC conclusions
- Goal of EPIC was to maintain advantages of VLIW, but achieve performance of out-of-order
- Results
  - Complicated bundling rules save some space, but make the hardware more complicated
  - Add special hardware and instructions for scheduling loads above stores and branches (new complicated hardware)
  - Add special hardware to remove branch penalties (prediction)
  - End result is a machine as complicated as an out-of-order, but now also requiring a super-sophisticated compiler

Multiple issue

Dynamic scheduling, multiple issue, speculation
- Modern microarchitectures: dynamic scheduling + multiple issue + speculation
- Two approaches:
  - Assign reservation stations and update pipeline control table in half clock cycles
    - Only supports 2 instructions/clock
  - Design logic to handle any possible dependencies between the instructions
- Hybrid approach
- Issue logic can become bottleneck

Overview of design

Multiple issue
- Limit the number of instructions of a given class that can be issued in a bundle (i.e., one FP, one integer, one load, one store)
- Examine all the dependencies among the instructions in the bundle
- If dependencies exist in bundle, encode them in reservation stations
- Also need multiple completion/commit
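The two issue checks can be sketched as follows. The function names, the per-class limit table and the tuple encoding of instructions are assumptions for illustration; real issue logic performs these comparisons in parallel hardware.

```python
from collections import Counter

def within_class_limits(classes, limits):
    """True if no instruction class appears more often than the bundle allows."""
    counts = Counter(classes)
    return all(counts[c] <= limits.get(c, 0) for c in counts)

def bundle_dependencies(bundle):
    """bundle: list of (dest, srcs) in program order.
    Returns (consumer index, producer index, register) for each dependence
    of an instruction on an earlier instruction in the same bundle."""
    deps = []
    last_writer = {}
    for i, (dest, srcs) in enumerate(bundle):
        for s in srcs:
            if s in last_writer:
                deps.append((i, last_writer[s], s))
        if dest is not None:
            last_writer[dest] = i
    return deps

limits = {"fp": 1, "int": 1, "load": 1, "store": 1}
ok = within_class_limits(["load", "int"], limits)
too_many = within_class_limits(["int", "int"], limits)

# A load followed by an add that consumes the loaded register
deps = bundle_dependencies([("R2", ["R1"]), ("R2", ["R2"])])
```

Each reported dependence is what would be encoded in the consumer's reservation station as a tag instead of a register value.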

Example

  Loop: LD     R2, 0(R1)      ; R2 = array element
        DADDIU R2, R2, #1     ; increment R2
        SD     R2, 0(R1)      ; store result
        DADDIU R1, R1, #8     ; increment pointer
        BNE    R2, R3, Loop   ; branch if not last element

Example (no speculation)

Example (with speculation)

Branch target buffer
- Need high instruction bandwidth
- Branch-target buffers
  - Next PC prediction buffer, indexed by current PC
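A minimal sketch of a direct-mapped branch-target buffer. The class name, table size, 4-byte instruction width and the policy of removing an entry on a not-taken branch are all assumptions for illustration.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB, indexed by the current PC (parameters are illustrative)."""

    def __init__(self, entries=16, insn_bytes=4):
        self.entries = entries
        self.insn_bytes = insn_bytes
        self.table = [None] * entries   # each entry: (branch PC, predicted target)

    def _index(self, pc):
        return (pc // self.insn_bytes) % self.entries

    def predict(self, pc):
        entry = self.table[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]             # hit: fetch from the predicted target
        return pc + self.insn_bytes     # miss: fall through to the next instruction

    def update(self, pc, taken, target):
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None        # assumed policy: forget not-taken branches

btb = BranchTargetBuffer()
first = btb.predict(0x40)               # unknown branch: predicts fall-through
btb.update(0x40, taken=True, target=0x10)
second = btb.predict(0x40)              # now predicts the recorded target
```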

Branch folding
- Optimization:
  - Larger branch-target buffer
  - Add target instruction into buffer to deal with longer decoding time required by larger buffer
  - Branch folding

Return address predictor
- Most unconditional branches come from function returns
- The same procedure can be called from multiple sites
  - Causes the buffer to potentially forget about the return address from previous calls
- Create return address buffer organized as a stack
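A minimal sketch of a return address buffer organized as a stack. The class name, the depth, and the policy of dropping the oldest entry on overflow are assumptions for illustration.

```python
class ReturnAddressStack:
    """Toy return address predictor (illustrative depth and overflow policy)."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)           # assumed policy: overwrite the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        # The most recent call returns first; None means no prediction available
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
# The same procedure called from two different sites
ras.on_call(0x104)
from_site_1 = ras.predict_return()
ras.on_call(0x204)
from_site_2 = ras.predict_return()
# Nested calls unwind in reverse order
ras.on_call(0x300)
ras.on_call(0x400)
nested = [ras.predict_return(), ras.predict_return()]
```

Unlike a branch-target buffer entry keyed by the return instruction's PC, the stack gives the right target for each call site.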

Integrated instruction fetch unit
- Design monolithic unit that performs:
  - Branch prediction
  - Instruction prefetch
    - Fetch ahead
  - Instruction memory access and buffering
    - Deal with crossing cache lines

How much to speculate?
- Mis-speculation degrades performance and power relative to no speculation
  - May cause additional misses (cache, TLB)
- Prevent speculative code from causing higher costing misses (e.g. L2)
- Speculating through multiple branches
  - Complicates speculation recovery
  - No processor can resolve multiple branches per cycle

Review
- What is control speculation?
- What is data speculation?
- What are the advantages of a superscalar vs a VLIW?
- What are the disadvantages of a superscalar vs a VLIW?
- When is a VLIW appropriate?
- When is a superscalar appropriate?

Summary
- Section 2.6: Speculation
- Section 2.7: Multiple issue and static scheduling
- Section 2.8: Dynamic scheduling, multiple issue, speculation
- Section 2.9: Advanced techniques for instruction delivery and speculation
- Section 2.10: Putting it all together: The Intel Pentium 4

Next week
- Caches and virtual memory

Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register