Lectures 14 & 15: Instruction Scheduling
Simple Machine Model

6.035, Fall 2005. Lectures 14 & 15: Instruction Scheduling.

Instructions are executed in sequence: fetch, decode, execute, store results, one instruction at a time. For branch instructions, start fetching from a different location if needed: check the branch condition; the next instruction may come from a new location given by the branch instruction.

Simple Execution Model

A 5-stage pipeline: fetch, decode, execute, memory, writeback.
- Fetch: get the next instruction
- Decode: figure out what that instruction is
- Execute: perform the ALU operation (the address calculation, in a memory op)
- Memory: do the memory access, in a memory op
- Write Back: write the results back

(Pipeline diagram: instructions 1 through 5 overlap in the IF, DE, EX, MEM, WB stages over time, one instruction entering the pipeline per cycle.)

From a Simple Machine Model to a Real Machine Model

Many pipeline stages: Pentium 5, Pentium Pro 10, Pentium IV (130nm) 20, Pentium IV (90nm) 31. Different instructions take different amounts of time to execute.

Real Machine Model (cont.)

Most modern processors have multiple execution units (superscalar). If the instruction sequence is right, multiple operations happen in the same cycles, so having the right instruction sequence is even more important. Hardware stalls the pipeline if an instruction uses a result that is not ready.

Saman Amarasinghe, 6.035, MIT Fall 1998
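The payoff of the pipelined model above can be stated as a formula: an ideal stall-free k-stage pipeline finishes N instructions in k + (N - 1) cycles, since only the first instruction pays the full pipeline depth. A minimal sketch (the function name is ours, not from the slides):

```python
def pipelined_cycles(n_instructions, stages=5):
    """Cycles for n instructions on an ideal, stall-free pipeline:
    the first instruction takes `stages` cycles to drain through,
    and every later one retires one cycle after its predecessor."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)
```

Five instructions on the 5-stage pipeline finish in 9 cycles, versus 5 * 5 = 25 cycles if each instruction ran to completion before the next started.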
Constraints on Scheduling

Data dependencies, control dependencies, and resource constraints.

Data Dependency Between Instructions

If two instructions access the same variable, they can be dependent. Kinds of dependencies:
- True: write then read
- Anti: read then write
- Output: write then write

What to do if two instructions are dependent: the order of execution cannot be reversed, which reduces the possibilities for scheduling.

Computing Dependencies

For basic blocks, compute dependencies by walking through the instructions. Identifying register dependencies is simple: is it the same register? For memory accesses it is harder:
- simple: base + offset1 ?= base + offset2
- data dependence analysis: a[i] ?= a[i+1]
- interprocedural analysis: global ?= parameter
- pointer alias analysis: p1 ?= p2

Representing Dependencies

Use a dependence DAG, one per basic block: nodes are instructions, edges represent dependencies.

    1: r2 = *(r1 + 4)
    2: r3 = *(r1 + 8)
    3: r4 = r2 + r3
    4: r5 = r2 - 1

Each edge is labeled with a latency: v(i -> j) is the delay required between the initiation times of i and j, minus the execution time required by i.
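The three dependence kinds above can be found with a pairwise scan of the block. A sketch, assuming a hypothetical tuple encoding (dest, src1, src2, ...) for each instruction; the encoding and function name are ours:

```python
def dependences(insts):
    """Return edges (i, j, kind) for register dependences between
    instructions, where i comes before j in the block.
    insts[k] = (dest, *sources); dest may be None for a pure read."""
    edges = []
    for j, (dj, *uses_j) in enumerate(insts):
        for i in range(j):
            di, *uses_i = insts[i]
            if di is not None and di in uses_j:
                edges.append((i, j, "true"))    # i writes, j reads
            if dj is not None and dj in uses_i:
                edges.append((i, j, "anti"))    # i reads, j writes
            if di is not None and di == dj:
                edges.append((i, j, "output"))  # both write
    return edges
```

On the four-instruction example above (r2 = *(r1+4); r3 = *(r1+8); r4 = r2 + r3; r5 = r2 - 1), this reports the three true dependences that form the DAG.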
Another Example: Control Dependencies and Resource Constraints

    1: r2 = *(r1 + 4)
    2: *(r1 + 4) = r3
    3: r4 = r2 + r3
    4: r5 = r2 - 1

For now, let's only worry about basic blocks, and let's look only at simple pipelines.

List Scheduling Algorithm

Idea: do a topological sort of the dependence DAG. Consider when an instruction can be scheduled without causing a stall: schedule it if it causes no stall and all its predecessors are already scheduled. Optimal list scheduling is NP-complete, so use heuristics when necessary.

Results in:

    1: lea var_a, %rax        1 cycle
    2: add $4, %rax           1 cycle
    3: inc %r11               1 cycle
    4: mov (%rsp), %r10       multiple cycles
    5: add %r10, 8(%rsp)
    6: and 16(%rsp), %rbx     multiple cycles

(Schedule diagram: issued in program order, the sequence spends several cycles stalled waiting for the memory operations.)

List Scheduling Algorithm

- Create a dependence DAG of the basic block
- Topological sort: READY = nodes with no predecessors
- Loop until READY is empty:
  - schedule each node in READY when it causes no stall
  - update READY

Heuristics for Selection

Heuristics for selecting from the READY list:
- pick the node with the longest path to a leaf in the dependence graph
- pick a node with the most immediate successors
- pick a node that can go to a less busy pipeline (in a superscalar)
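The loop above can be sketched in a few lines of Python. This is a single-issue sketch under assumed integer latencies; all names are ours:

```python
def list_schedule(n, edges, latency):
    """Greedy list scheduling of one basic block.
    edges: (i, j) pairs meaning j depends on i; latency[i]: cycles
    until i's result is ready. Returns the issue cycle of each
    instruction, inserting stall cycles when nothing is ready."""
    preds = {j: [] for j in range(n)}
    for i, j in edges:
        preds[j].append(i)
    start = [None] * n
    cycle = 0
    done = 0
    while done < n:
        # READY = unscheduled nodes whose predecessors' results are available.
        ready = [i for i in range(n) if start[i] is None
                 and all(start[p] is not None and start[p] + latency[p] <= cycle
                         for p in preds[i])]
        if ready:
            i = ready[0]        # a selection heuristic plugs in here
            start[i] = cycle
            done += 1
        cycle += 1              # if nothing was ready, this cycle is a stall
    return start
```

On the earlier DAG (two 2-cycle loads feeding an add, plus a subtract that needs only the first load), the scheduler fills the would-be stall slot with the independent subtract.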
Heuristics for Selection (cont.)

Pick the node with the longest path to a leaf in the dependence graph. Algorithm (for node x):
- if x has no successors: d_x = 0
- otherwise: d_x = MAX(d_y + c_xy) over all successors y of x
- computed in reverse breadth-first visitation order

Pick a node with the most immediate successors. Algorithm (for node x): f_x = number of successors of x.

Results in:

    1: lea var_a, %rax
    2: add $4, %rax
    3: inc %r11
    4: mov (%rsp), %r10
    5: add %r10, 8(%rsp)
    6: and 16(%rsp), %rbx
    8: mov %rbx, 16(%rsp)
    9: lea var_b, %rax

(Dependence DAG annotated with each node's d and f values; READY starts with the root nodes. The reordered schedule eliminates most of the stall slots: 9 cycles instead of the longer in-order schedule.)

Resource Constraints

Modern machines have many resource constraints. Superscalar architectures can run a few operations in parallel, but they have constraints.
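The two priorities above, d_x (longest path to a leaf) and f_x (number of immediate successors), can be computed in one reverse sweep. A sketch assuming nodes are numbered in topological order; names are ours:

```python
def priorities(n, edges, delay):
    """d[x] = MAX(d[y] + c_xy) over successors y (0 at a leaf),
    f[x] = number of immediate successors.
    delay[(x, y)] is the edge weight c_xy. Assumes nodes are numbered
    in topological order, so a reverse sweep sees successors first."""
    succs = {i: [] for i in range(n)}
    for i, j in edges:
        succs[i].append(j)
    f = [len(succs[i]) for i in range(n)]
    d = [0] * n
    for x in reversed(range(n)):
        if succs[x]:
            d[x] = max(d[y] + delay[(x, y)] for y in succs[x])
    return d, f
```

A READY-list scheduler would then prefer the node with the largest d (and break ties with f), keeping the critical path moving.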
Resource Constraints of a Superscalar Processor

Example machine:
- one fully pipelined reg-to-reg unit; all integer operations take one cycle
- in parallel with one fully pipelined memory-to/from-reg unit; data loads take two cycles, data stores take one cycle

List Scheduling Algorithm with Resource Constraints

Represent the superscalar architecture as multiple pipelines; each pipeline represents some resource. Here: one single-cycle reg-to-reg ALU unit, and one two-cycle pipelined reg-to/from-memory unit (issue slots ALUop, MEM1, MEM2).

- Create a dependence DAG of the basic block
- Topological sort: READY = nodes with no predecessors
- Loop until READY is empty:
  - let n in READY be the node with the highest priority
  - schedule n in the earliest slot that satisfies precedence + resource constraints
  - update READY

Example:

    1: lea var_a, %rax
    2: add (%rsp), %rax
    3: inc %r11
    4: mov (%rsp), %r10
    5: mov %r10, 8(%rsp)
    ...
    9: mov %rbx, 16(%rsp)

(Worked example: the dependence DAG is annotated with d and f priorities; starting from a READY set of four nodes, each step places the highest-priority ready node into the earliest free ALUop, MEM1, or MEM2 slot.)
(The trace continues step by step: node 6 is placed, then as results become available nodes 5, 8, and 9 enter READY and are scheduled in turn, each into the earliest slot of the ALUop, MEM1, or MEM2 pipeline that satisfies its precedence and resource constraints.)
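The trace above follows the resource-constrained algorithm. Its core loop can be sketched in Python; the unit names, priority scheme, and one-issue-per-unit-per-cycle model are our simplifying assumptions:

```python
def schedule_with_resources(unit, edges, latency, priority):
    """List scheduling with resource constraints.
    unit[i] names the pipeline instruction i needs (e.g. 'ALU', 'MEM'),
    and each pipeline can issue at most one instruction per cycle.
    Returns the issue cycle chosen for each instruction."""
    n = len(unit)
    preds = {j: [] for j in range(n)}
    for i, j in edges:
        preds[j].append(i)
    start = [None] * n
    taken = set()                                  # (unit, cycle) slots in use
    for _ in range(n):
        # READY = unscheduled nodes whose predecessors are all scheduled.
        ready = [i for i in range(n) if start[i] is None
                 and all(start[p] is not None for p in preds[i])]
        i = max(ready, key=lambda x: priority[x])  # highest-priority node
        # Earliest cycle satisfying precedence, then the first free slot.
        c = max((start[p] + latency[p] for p in preds[i]), default=0)
        while (unit[i], c) in taken:
            c += 1
        start[i] = c
        taken.add((unit[i], c))
    return start
```

With two loads feeding an ALU op, the second load is pushed to the next MEM slot, and the ALU op issues once both results are ready.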
(Final step: READY empties and the whole block has been placed across the ALUop, MEM1, and MEM2 pipelines.)

Scheduling Across Basic Blocks

The number of instructions in a basic block is small. We cannot keep multiple units with long pipelines busy by scheduling only within a basic block, so we need to handle control dependence: scheduling constraints across basic blocks, and a scheduling policy.

Moving Across Basic Blocks

- Downward, to an adjacent basic block: is there a path to the destination block that does not execute the instruction's original block?
- Upward, to an adjacent basic block: is there a path from the instruction's original block that does not reach the destination?

Control Dependencies

Constraints in moving instructions across basic blocks:

    if (...)
        a = b op c

    if (valid address?)
        d = *(a)

In the first case, a = b op c must not be moved above the condition if that changes observable behavior; in the second, the load must not be hoisted above the address check.
Trace Scheduling

Find the most common trace of basic blocks, using profile information. Combine the basic blocks in the trace and schedule them as one block. Create clean-up code if the execution goes off-trace.

(CFG sketches: the hot trace of blocks is merged and scheduled as a single large block, with compensation code on the off-trace entries and exits.)

Large Basic Blocks via Code Duplication

Create large extended basic blocks by duplication, then schedule the larger blocks.

Scheduling Loops

Loop bodies are small, but a lot of time is spent in loops because of their large number of iterations, so we need better ways to schedule loops.
Loop Example

Machine model:
- one load/store unit; loads and stores take multiple cycles
- two arithmetic units; add, branch, and multiply each have their own latency
- both units are pipelined (initiate one operation each cycle)

Source Code

    for i = 1 to N
        a[i] = a[i] * b

Assembly Code

    loop:   mov (%rdi,%rax), %r10
            mul %r11, %r10
            mov %r10, (%rdi,%rax)
            sub $4, %rax
            bge loop

(Schedule: 9 cycles per iteration, dominated by the load, multiply, store chain.)

Loop Unrolling

Unroll the loop body a few times.
- Pros: creates a much larger basic block for the body; eliminates a few loop-bounds checks.
- Cons: a much larger program; setup code is needed when the number of iterations is smaller than the unroll factor; the beginning and end of the schedule can still have unused slots.

Unrolled twice:

    loop:   mov (%rdi,%rax), %r10
            mul %r11, %r10
            mov %r10, (%rdi,%rax)
            sub $4, %rax
            mov (%rdi,%rax), %r10
            mul %r11, %r10
            mov %r10, (%rdi,%rax)
            sub $4, %rax
            bge loop

(Schedule: 8 cycles per iteration.)

Loop Unrolling: Rename Registers

Use different registers in different iterations.
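The combined effect of unrolling and breaking dependent chains can be seen even at the source level: splitting one accumulator into several gives the scheduler independent operations to interleave. A Python illustration (our example, not from the slides):

```python
def dot_unrolled(a, b):
    """Dot product unrolled by 4 with four separate accumulators,
    turning one long dependent chain of adds into four short,
    independent chains that a scheduler can overlap."""
    n = len(a)
    s0 = s1 = s2 = s3 = 0
    i = 0
    while i + 4 <= n:
        s0 += a[i]     * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
        i += 4
    total = s0 + s1 + s2 + s3
    for j in range(i, n):          # cleanup when n % 4 != 0
        total += a[j] * b[j]
    return total
```

The cleanup loop is exactly the "setup code" cost listed among the cons above: it handles the iterations left over when the trip count is not a multiple of the unroll factor.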
Loop Unrolled, with Renamed Registers

    loop:   mov (%rdi,%rax), %r10
            mul %r11, %r10
            mov %r10, (%rdi,%rax)
            sub $4, %rax
            mov (%rdi,%rax), %rcx
            mul %r11, %rcx
            mov %rcx, (%rdi,%rax)
            sub $4, %rax
            bge loop

Loop Unrolling: Eliminate Unnecessary Dependencies

Rename registers: use different registers in different iterations. Again, use more registers to eliminate true, anti, and output dependencies, and eliminate dependent chains of calculations when possible. With two independent induction registers (%rax and %rbx), each stepped by 8, the two halves of the body no longer depend on each other:

    loop:   mov (%rdi,%rax), %r10
            mul %r11, %r10
            mov %r10, (%rdi,%rax)
            sub $8, %rax
            mov (%rdi,%rbx), %rcx
            mul %r11, %rcx
            mov %rcx, (%rdi,%rbx)
            sub $8, %rbx
            bge loop

(Schedule: 4.5 cycles per iteration.)

Software Pipelining

Try to overlap multiple iterations so that the slots will be filled. Find the steady-state window so that all the instructions of the loop body are executed, but from different iterations.

(Schedule diagram: iterations are overlapped; %r11 never changes; separate registers hold the (%rdi,%rax) addresses and the %r10 values of the in-flight iterations. The same registers can be reused after a fixed number of these kernel blocks, so the code generator emits that many copies; otherwise register copies would be needed.)
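The prologue/kernel/epilogue shape of a software-pipelined loop can be mimicked in Python for the a[i] = a[i] * b loop above: in the steady state, each kernel pass stores iteration i-2, multiplies iteration i-1, and loads iteration i. This is an illustration of the overlap, not generated code:

```python
def pipelined_scale(a, b):
    """a[i] = a[i] * b with load, multiply, and store drawn from
    three different iterations in the steady-state kernel."""
    n = len(a)
    if n < 2:                      # too short to pipeline
        for i in range(n):
            a[i] *= b
        return
    # Prologue: fill the pipeline.
    loaded = a[0]                  # load  iteration 0
    prod = loaded * b              # mul   iteration 0
    loaded = a[1]                  # load  iteration 1
    # Kernel (steady state): each pass touches three iterations.
    for i in range(2, n):
        a[i - 2] = prod            # store iteration i-2
        prod = loaded * b          # mul   iteration i-1
        loaded = a[i]              # load  iteration i
    # Epilogue: drain the pipeline.
    a[n - 2] = prod
    a[n - 1] = loaded * b
```

The prologue and epilogue are the pre-amble and post-amble code mentioned on the next slide; only the kernel runs at the steady-state rate.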
Software Pipelining: Issues

- Optimal use of resources, but it needs a lot of registers: values from multiple iterations must be kept live.
- Dependence issues: a store from one iteration may execute before the branch of a previous iteration has executed (writing when it should not have); loads and stores are issued out of order (dependences must be resolved before doing this).
- Code generation issues: generate pre-amble and post-amble code, and multiple copies of the kernel block so that no register copy is needed.

Register Allocation and Instruction Scheduling

If register allocation is done before instruction scheduling, it restricts the choices for scheduling:

    1: mov (%rbp), %rax         1: mov (%rbp), %rax
    2: add %rax, %rbx           2: add %rax, %rbx
    3: mov 8(%rbp), %rax        3: mov 8(%rbp), %r10
    4: add %rax, %rcx           4: add %r10, %rcx

On the left, reusing %rax creates an anti dependence from instruction 2 to instruction 3; on the right, renaming the second load to %r10 removes it, so the two load/add pairs can overlap in the ALUop and MEM pipelines.

If instruction scheduling is done before register allocation, the allocator may spill registers, and the spill code will change the carefully constructed schedule!
Superscalar: Where Have All the Transistors Gone?

Out-of-order execution: if an instruction stalls, go beyond it and start executing non-dependent instructions.
- Pros: hardware scheduling; tolerates unpredictable latencies.
- Cons: the instruction window is small.

Register renaming: if an anti or output dependency on a register stalls the pipeline, use a different hardware register.
- Pros: avoids anti and output dependencies.
- Cons: cannot do the more complex transformations that eliminate dependencies.

Hardware vs. Compiler

In a superscalar, hardware and compiler scheduling can work hand in hand. Hardware can reduce the burden when latencies are not predictable by the compiler, but the compiler can still greatly enhance performance: it has a large instruction window for scheduling and many program transformations that increase parallelism. The compiler is even more critical when there is no hardware support, as in VLIW machines (Itanium, DSPs).
More information15-740/ Computer Architecture Lecture 12: Issues in OoO Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011
15-740/18-740 Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011 Reviews Due next Monday Mutlu et al., Runahead Execution: An Alternative
More informationlast time out-of-order execution and instruction queues the data flow model idea
1 last time 2 out-of-order execution and instruction queues the data flow model idea graph of operations linked by depedencies latency bound need to finish longest dependency chain multiple accumulators
More informationCS 33. Architecture and Optimization (2) CS33 Intro to Computer Systems XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.
CS 33 Architecture and Optimization (2) CS33 Intro to Computer Systems XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Modern CPU Design Instruction Control Retirement Unit Register File
More information15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011
5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer
More information(Basic) Processor Pipeline
(Basic) Processor Pipeline Nima Honarmand Generic Instruction Life Cycle Logical steps in processing an instruction: Instruction Fetch (IF_STEP) Instruction Decode (ID_STEP) Operand Fetch (OF_STEP) Might
More informationCS 406/534 Compiler Construction Instruction Scheduling
CS 406/534 Compiler Construction Instruction Scheduling Prof. Li Xu Dept. of Computer Science UMass Lowell Fall 2004 Part of the course lecture notes are based on Prof. Keith Cooper, Prof. Ken Kennedy
More informationCS252 Graduate Computer Architecture Midterm 1 Solutions
CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate
More informationCode Generation. CS 540 George Mason University
Code Generation CS 540 George Mason University Compiler Architecture Intermediate Language Intermediate Language Source language Scanner (lexical analysis) tokens Parser (syntax analysis) Syntactic structure
More informationLecture 19: Instruction Level Parallelism
Lecture 19: Instruction Level Parallelism Administrative: Homework #5 due Homework #6 handed out today Last Time: DRAM organization and implementation Today Static and Dynamic ILP Instruction windows Register
More informationCourse on Advanced Computer Architectures
Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1
More informationComputer Architecture Spring 2016
Computer rchitecture Spring 2016 Lecture 10: Out-of-Order Execution & Register Renaming Shuai Wang Department of Computer Science and Technology Nanjing University In Search of Parallelism Trivial Parallelism
More informationRecall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationWilliam Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors
William Stallings Computer Organization and Architecture 8 th Edition Chapter 14 Instruction Level Parallelism and Superscalar Processors What is Superscalar? Common instructions (arithmetic, load/store,
More informationComputer Science 146. Computer Architecture
Computer rchitecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 11: Software Pipelining and Global Scheduling Lecture Outline Review of Loop Unrolling Software Pipelining
More informationLecture 18 List Scheduling & Global Scheduling Carnegie Mellon
Lecture 18 List Scheduling & Global Scheduling Reading: Chapter 10.3-10.4 1 Review: The Ideal Scheduling Outcome What prevents us from achieving this ideal? Before After Time 1 cycle N cycles 2 Review:
More informationPipeline Architecture RISC
Pipeline Architecture RISC Independent tasks with independent hardware serial No repetitions during the process pipelined Pipelined vs Serial Processing Instruction Machine Cycle Every instruction must
More informationCOMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 The Processor: C Multiple Issue Based on P&H Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationvoid twiddle1(int *xp, int *yp) { void twiddle2(int *xp, int *yp) {
Optimization void twiddle1(int *xp, int *yp) { *xp += *yp; *xp += *yp; void twiddle2(int *xp, int *yp) { *xp += 2* *yp; void main() { int x = 3; int y = 3; twiddle1(&x, &y); x = 3; y = 3; twiddle2(&x,
More informationCSC 631: High-Performance Computer Architecture
CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 4: Pipelining Last Time in Lecture 3 icrocoding, an effective technique to manage control unit complexity, invented in era when logic
More informationCSE 401/M501 Compilers
CSE 401/M501 Compilers Code Shape I Basic Constructs Hal Perkins Autumn 2018 UW CSE 401/M501 Autumn 2018 K-1 Administrivia Semantics/type check due next Thur. 11/15 How s it going? Be sure to (re-)read
More informationLecture: Pipeline Wrap-Up and Static ILP
Lecture: Pipeline Wrap-Up and Static ILP Topics: multi-cycle instructions, precise exceptions, deep pipelines, compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2) 1 Multicycle
More informationLecture: Static ILP. Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)
Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2) 1 Static vs Dynamic Scheduling Arguments against dynamic scheduling: requires complex structures
More informationOrganisasi Sistem Komputer
LOGO Organisasi Sistem Komputer OSK 11 Superscalar Pendidikan Teknik Elektronika FT UNY What is Superscalar? Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed
More informationTECH. 9. Code Scheduling for ILP-Processors. Levels of static scheduling. -Eligible Instructions are
9. Code Scheduling for ILP-Processors Typical layout of compiler: traditional, optimizing, pre-pass parallel, post-pass parallel {Software! compilers optimizing code for ILP-processors, including VLIW}
More informationHow to efficiently use the address register? Address register = contains the address of the operand to fetch from memory.
Lesson 13 Storage Assignment Optimizations Sequence of accesses is very important Simple Offset Assignment This lesson will focus on: Code size and data segment size How to efficiently use the address
More informationSOLUTION. Midterm #1 February 26th, 2018 Professor Krste Asanovic Name:
SOLUTION Notes: CS 152 Computer Architecture and Engineering CS 252 Graduate Computer Architecture Midterm #1 February 26th, 2018 Professor Krste Asanovic Name: I am taking CS152 / CS252 This is a closed
More informationComputer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)
18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationCS 152, Spring 2011 Section 10
CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel
More informationAccessing Variables. How can we generate code for x?
S-322 Register llocation ccessing Variables How can we generate code for x? a := x + y The variable may be in a register:,r x, The variable may be in a static memory location: ST x,r w work register L
More informationCPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction
More informationPipelining. Pipeline performance
Pipelining Basic concept of assembly line Split a job A into n sequential subjobs (A 1,A 2,,A n ) with each A i taking approximately the same time Each subjob is processed by a different substation (or
More information