CS 406/534 Compiler Construction: Instruction Scheduling. Prof. Li Xu, Dept. of Computer Science, UMass Lowell, Fall 2004. Part of the course lecture notes are based on Prof. Keith Cooper, Prof. Ken Kennedy and Dr. Linda Torczon's teaching materials at Rice University. All rights reserved.
Administrivia
Lab 2 due; class presentation by a student today
Lab 3 has been posted on the web
Due on 12/5, in 3 weeks
Instruction scheduling for a single basic block
Implement 2 algorithms and compare results
Today's lecture covers the necessary background
Simulator and test files are in ~cs406/lab3
What We Did Last Time
Procedure activation records and run-time structures for OO languages
Change of schedule today to jump-start Lab 3
Back to code generation in the next class
Today's Goals
Instruction scheduling
Motivation
List scheduling
Scheduling for larger regions
The Big Picture
Source Code -> Front End -> IR -> Middle End -> IR -> Back End -> Machine code (errors reported at each stage)
So far, we have covered the front end: parsing and intermediate representation
Middle end: data flow analysis and optimization, covered later
Back end: generate machine code and deal with the processor architecture
Instruction scheduling today
The Big Picture
IR -> Instruction Selection -> IR -> Register Allocation -> IR -> Instruction Scheduling -> Machine code (errors reported at each stage)
The back end has become increasingly important due to hardware architectures
Multiple issue and deep pipelining in modern micro-architectures
Memory hierarchy to deal with the CPU-memory gap
Critical for application performance on modern machines
Recap from Computer Architecture
Instruction-level parallelism: execute multiple instructions in parallel
Employ multiple functional units and pipelining
Typical paradigms:
Out-of-order superscalar
Hardware dynamically reorders instructions
Compiler scheduling can boost performance
Examples: Alpha, SPARC, MIPS and other RISCs, even the Pentium
VLIW
Compiler statically schedules parallel instructions into a long instruction word
Depends on the compiler to generate correct code, not just for performance
Examples: Intel Itanium IA-64, TI C6xxx
Motivation: Instruction Scheduling
Operations have non-uniform latencies due to pipelining
Multiple-issue architectures can issue several operations per cycle
As a result, execution time is order-dependent
Instruction scheduling: decide on an instruction execution order that minimizes execution cycles

Assumed latencies (conservative):
Operation  Cycles
load       3
store      3
loadi      1
add        1
mult       2
fadd       1
fmult      2
shift      1
branch     0 to 8

Loads and stores may or may not block (non-blocking: fill those issue slots)
Branch costs vary with the path taken
Branches typically have delay slots (fill the slots with unrelated operations; percolate the branch upward)
The scheduler should hide the latencies
Lab 3 will build a local scheduler
Example: w := w * 2 * x * y * z

Simple schedule (2 registers, 20 cycles):
1  loadai  r0,@w => r1
4  add     r1,r1  => r1
5  loadai  r0,@x => r2
8  mult    r1,r2  => r1
9  loadai  r0,@y => r2
12 mult    r1,r2  => r1
13 loadai  r0,@z => r2
16 mult    r1,r2  => r1
18 storeai r1 => r0,@w
21 r1 is free

Schedule loads early (3 registers, 13 cycles):
1  loadai  r0,@w => r1
2  loadai  r0,@x => r2
3  loadai  r0,@y => r3
4  add     r1,r1  => r1
5  mult    r1,r2  => r1
6  loadai  r0,@z => r2
7  mult    r1,r3  => r1
9  mult    r1,r2  => r1
11 storeai r1 => r0,@w
14 r1 is free

Reordering operations for speed is called instruction scheduling
The Problem: Instruction Scheduling
Given a code fragment for some target machine and the latencies for each individual operation, reorder the operations to minimize execution time
The concept: the scheduler takes slow code plus a machine description and produces fast code
The task:
Produce correct code
Minimize execution cycles
Avoid spilling registers
Operate efficiently
Instruction Scheduling
To capture properties of the code, build a precedence graph G
Nodes n ∈ G are operations, with type(n) and delay(n)
An edge e = (n1,n2) ∈ G if and only if n2 uses the result of n1

The Code:
a: loadai  r0,@w => r1
b: add     r1,r1  => r1
c: loadai  r0,@x => r2
d: mult    r1,r2  => r1
e: loadai  r0,@y => r2
f: mult    r1,r2  => r1
g: loadai  r0,@z => r2
h: mult    r1,r2  => r1
i: storeai r1 => r0,@w

The Precedence Graph: a→b→d→f→h→i, with c→d, e→f, g→h
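The graph construction above can be sketched in a few lines of Python. This is a minimal sketch: the tuple format for instructions is illustrative (not the lab's actual ILOC parser), and it tracks only the true (flow) dependences used in the slide's definition; a real scheduler must also respect anti, output, and memory dependences.

```python
def build_precedence_graph(block):
    """block: list of (name, defs, uses) tuples in program order.
    Returns edges (a, b) meaning operation b uses a result of operation a."""
    edges = []
    last_def = {}               # register -> index of its most recent definition
    for i, (name, defs, uses) in enumerate(block):
        for r in uses:
            if r in last_def:   # true (flow) dependence on the defining op
                edges.append((last_def[r], i))
        for r in defs:
            last_def[r] = i
    return edges

# First four operations of the example block (illustrative encoding)
block = [
    ("a: loadai r0,@w => r1", ["r1"], ["r0"]),
    ("b: add r1,r1 => r1",    ["r1"], ["r1"]),
    ("c: loadai r0,@x => r2", ["r2"], ["r0"]),
    ("d: mult r1,r2 => r1",   ["r1"], ["r1", "r2"]),
]
# build_precedence_graph(block) yields edges a->b, b->d, c->d
```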
Instruction Scheduling
A correct schedule S maps each n ∈ N to a non-negative integer representing its cycle number, such that:
1. S(n) ≥ 0, for all n ∈ N
2. If (n1,n2) ∈ E, then S(n1) + delay(n1) ≤ S(n2)
3. For each type t, no cycle contains more operations of type t than the target machine can issue

The length of a schedule S, denoted L(S), is L(S) = max over n ∈ N of (S(n) + delay(n))
The goal is to find the shortest possible correct schedule: S is time-optimal if L(S) ≤ L(S1) for all other schedules S1
A schedule might also be optimal in terms of registers, power, or space
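The three conditions and the length formula translate directly into a small checker. This is a sketch under one simplification: condition 3 is checked against a single issue class rather than per operation type.

```python
def is_correct(schedule, edges, delay, issue_width=1):
    """schedule: op -> start cycle; edges: (n1, n2) dependence pairs."""
    if any(c < 0 for c in schedule.values()):
        return False                       # condition 1: S(n) >= 0
    for n1, n2 in edges:
        if schedule[n1] + delay[n1] > schedule[n2]:
            return False                   # condition 2: S(n1) + delay(n1) <= S(n2)
    issued = {}
    for op, c in schedule.items():
        issued[c] = issued.get(c, 0) + 1
    return all(k <= issue_width for k in issued.values())  # condition 3 (one class)

def schedule_length(schedule, delay):
    # L(S) = max over n of S(n) + delay(n)
    return max(schedule[n] + delay[n] for n in schedule)
```

For a load (delay 3) feeding a dependent op, scheduling the consumer 3 cycles later is accepted; 2 cycles later is rejected.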
Critical Points in Instruction Scheduling
All operands must be available before an operation can issue
Multiple operations can be ready at the same time
Moving operations can lengthen register lifetimes
Placing uses near definitions can shorten register lifetimes
Operations can have multiple predecessors
Together, these issues make scheduling hard (NP-complete)
Local scheduling is the simple case
Restricted to straight-line code
Consistent and predictable latencies
Instruction Scheduling: The Big Picture
1. Build a precedence graph, P
2. Compute a priority function over the nodes in P
3. Use list scheduling to construct a schedule, one cycle at a time
   a. Use a queue of operations that are ready
   b. At each cycle
      I. Choose a ready operation and schedule it
      II. Update the ready queue
Local list scheduling has been the dominant algorithm for twenty years: a greedy, heuristic, local technique
Local List Scheduling

Cycle := 1
Ready := leaves of P
Active := Ø
while (Ready ∪ Active ≠ Ø)
    if (Ready ≠ Ø) then
        remove an op from Ready        // removal in priority order
        S(op) := Cycle
        Active := Active ∪ {op}
    Cycle := Cycle + 1
    for each op ∈ Active
        if (S(op) + delay(op) ≤ Cycle) then    // op has completed execution
            remove op from Active
            for each successor s of op in P
                if (s is ready) then           // s's operands are ready
                    Ready := Ready ∪ {s}
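The loop above can be sketched in Python. This is a minimal single-issue sketch, not the lab's implementation: node names, latencies, and the priority function are illustrative, and "ready" is tracked by counting unscheduled predecessors.

```python
def list_schedule(nodes, edges, delay, priority):
    """Greedy list scheduling: one op issued per cycle, highest priority first."""
    succs = {n: [] for n in nodes}
    preds_left = {n: 0 for n in nodes}
    for a, b in edges:
        succs[a].append(b)
        preds_left[b] += 1
    ready = [n for n in nodes if preds_left[n] == 0]   # leaves of P
    active, schedule, cycle = [], {}, 1
    while ready or active:
        if ready:
            op = max(ready, key=priority)   # removal in priority order
            ready.remove(op)
            schedule[op] = cycle
            active.append(op)
        cycle += 1
        for op in [o for o in active if schedule[o] + delay[o] <= cycle]:
            active.remove(op)               # op has completed execution
            for s in succs[op]:
                preds_left[s] -= 1
                if preds_left[s] == 0:
                    ready.append(s)         # operands now available
    return schedule
```

With a single load (delay 3) feeding one dependent op, the consumer is issued in cycle 4, exactly when the load's result becomes available.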
Scheduling Example
1. Build the precedence graph

The Code:
a: loadai  r0,@w => r1
b: add     r1,r1  => r1
c: loadai  r0,@x => r2
d: mult    r1,r2  => r1
e: loadai  r0,@y => r2
f: mult    r1,r2  => r1
g: loadai  r0,@z => r2
h: mult    r1,r2  => r1
i: storeai r1 => r0,@w

The Precedence Graph: a→b→d→f→h→i, with c→d, e→f, g→h
Scheduling Example
1. Build the precedence graph
2. Determine priorities: longest latency-weighted path to the end of the block

Priorities on the precedence graph (a→b→d→f→h→i, with c→d, e→f, g→h):
a:13  b:10  c:12  d:9  e:10  f:7  g:8  h:5  i:3
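The priority used here, the latency-weighted length of the longest path from each node to the end of the block, can be computed with a memoized reverse walk. A sketch, using the example's graph and the delays from the assumed latency table (loadai 3, add 1, mult 2, storeai 3):

```python
def latency_weighted_priority(nodes, edges, delay):
    """priority(n) = delay(n) + max priority over n's successors (0 at sinks)."""
    succs = {n: [] for n in nodes}
    for a, b in edges:
        succs[a].append(b)
    prio = {}
    def walk(n):
        if n not in prio:                  # memoized longest-path recursion
            rest = max((walk(s) for s in succs[n]), default=0)
            prio[n] = delay[n] + rest
        return prio[n]
    for n in nodes:
        walk(n)
    return prio

# The example block's precedence graph
nodes = list("abcdefghi")
edges = [("a", "b"), ("b", "d"), ("c", "d"), ("d", "f"),
         ("e", "f"), ("f", "h"), ("g", "h"), ("h", "i")]
delay = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3,
         "f": 2, "g": 3, "h": 2, "i": 3}
prio = latency_weighted_priority(nodes, edges, delay)
# prio["a"] == 13, prio["c"] == 12, prio["i"] == 3
```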
Scheduling Example
1. Build the precedence graph
2. Determine priorities: longest latency-weighted path
3. Perform list scheduling (new register names used)

1)  a: loadai  r0,@w => r1
2)  c: loadai  r0,@x => r2
3)  e: loadai  r0,@y => r3
4)  b: add     r1,r1  => r1
5)  d: mult    r1,r2  => r1
6)  g: loadai  r0,@z => r2
7)  f: mult    r1,r3  => r1
9)  h: mult    r1,r2  => r1
11) i: storeai r1 => r0,@w

Priorities: a:13, b:10, c:12, d:9, e:10, f:7, g:8, h:5, i:3
Detailed Scheduling Algorithm
Idea: keep a collection of worklists W[c], one per cycle; we need MaxC = max delay + 1 such worklists

Code:
for each n ∈ N do begin
    count[n] := 0;
    earliest[n] := 0;
end
for each (n1,n2) ∈ E do begin
    count[n2] := count[n2] + 1;
    successors[n1] := successors[n1] ∪ {n2};
end
for i := 0 to MaxC - 1 do
    W[i] := Ø;
Wcount := 0;
for each n ∈ N do
    if count[n] = 0 then begin
        W[0] := W[0] ∪ {n};
        Wcount := Wcount + 1;
    end
c := 0;      // c is the cycle number
cW := 0;     // cW is the index of the worklist for cycle c
instr[c] := Ø;
Detailed Scheduling Algorithm

while Wcount > 0 do begin
    while W[cW] = Ø do begin
        c := c + 1;
        instr[c] := Ø;
        cW := mod(cW+1, MaxC);
    end
    nextc := mod(c+1, MaxC);
    while W[cW] ≠ Ø do begin
        select and remove an instruction x from W[cW];    // by priority
        if free issue units of type(x) on cycle c then begin
            instr[c] := instr[c] ∪ {x};
            Wcount := Wcount - 1;
            for each y ∈ successors[x] do begin
                count[y] := count[y] - 1;
                earliest[y] := max(earliest[y], c + delay(x));
                if count[y] = 0 then begin
                    loc := mod(earliest[y], MaxC);
                    W[loc] := W[loc] ∪ {y};
                    Wcount := Wcount + 1;
                end
            end
        end
        else
            W[nextc] := W[nextc] ∪ {x};
    end
end
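A Python sketch of this worklist algorithm follows. It is a simplified rendering, not the lab's code: one issue unit is assumed (so the "free issue units of type(x)" test becomes a counter), cycles are 0-based, and operations land in worklist W[earliest mod MaxC], the slot for the cycle in which they become ready.

```python
def worklist_schedule(nodes, edges, delay):
    """Cycle-indexed worklist list scheduling, single issue unit per cycle."""
    maxc = max(delay.values()) + 1          # MaxC = max delay + 1 worklists
    count = {n: 0 for n in nodes}           # unsatisfied predecessor count
    succs = {n: [] for n in nodes}
    earliest = {n: 0 for n in nodes}
    for n1, n2 in edges:
        count[n2] += 1
        succs[n1].append(n2)
    W = [set() for _ in range(maxc)]
    for n in nodes:
        if count[n] == 0:
            W[0].add(n)                     # leaves become ready in cycle 0
    wcount = len(W[0])
    c, cw = 0, 0
    instr = {}                              # cycle -> list of ops issued
    while wcount > 0:
        while not W[cw]:                    # advance to the next non-empty cycle
            c += 1
            cw = (cw + 1) % maxc
        nextc = (c + 1) % maxc
        issued = 0
        while W[cw]:
            x = W[cw].pop()
            if issued < 1:                  # a free issue unit this cycle
                instr.setdefault(c, []).append(x)
                issued += 1
                wcount -= 1
                for y in succs[x]:
                    count[y] -= 1
                    earliest[y] = max(earliest[y], c + delay[x])
                    if count[y] == 0:       # ready in cycle earliest[y]
                        W[earliest[y] % maxc].add(y)
                        wcount += 1
            else:
                W[nextc].add(x)             # no unit free: retry next cycle
    return instr
```

With a delay-3 op feeding two delay-1 ops, the successors both become ready in cycle 3, and the single issue unit forces one of them into cycle 4.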
More List Scheduling
List scheduling breaks down into two distinct classes:
Forward list scheduling
Start with the available operations
Work forward in time
Ready = all operands available
Backward list scheduling
Start with the operations that have no successors
Work backward in time
Ready = latency covers all uses
Variations on list scheduling:
Prioritize the critical path(s)
Schedule the last use of a value as soon as possible
Depth-first in the precedence graph (minimizes registers)
Breadth-first in the precedence graph (minimizes interlocks)
Prefer the operation with the most successors
Local Scheduling
As long as we stay within a single block, list scheduling does well
The problem is hard, so tie-breaking matters:
More descendants in the dependence graph
Prefer an operation with a last use over one with none
Breadth-first makes progress on all paths
Tends toward more ILP and fewer interlocks
Depth-first tries to complete the uses of a value
Tends to use fewer registers
The classic work on this is Gibbons & Muchnick
Local Scheduling
Forward and backward scheduling can produce different results

[Figure: dependence graph for a block from the SPEC benchmark go. An lshift and four loadi operations (latency 8 to the cbr) feed four add operations and an addi (latency 7), which feed a cmp (latency 2) and five store operations (latency 5); everything reaches the final cbr (latency 1). Subscripts identify the individual operations.]

Operation latencies:
load  4
loadi 1
add   2
addi  1
store 4
cmp   1
[Figure: forward and backward schedules for the block, using latency to the root as the priority. The forward schedule issues the loadi and lshift operations first, then the adds and stores, reaching the cmp in cycle 12 and the cbr in cycle 13. The backward schedule issues lshift, loadi 3, loadi 2, loadi 1 and the stores in reverse order (store 5 down to store 1).]
Local Scheduling: Schielke's RBF Algorithm (Randomized Backward & Forward)
Run 5 passes of forward list scheduling and 5 passes of backward list scheduling
Break each tie randomly
Keep the best schedule
Shortest time to completion
Other metrics are possible (e.g., shortest time + fewest registers)
In practice, this does very well
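The RBF driver itself is a short loop over randomized passes. In this sketch, forward_pass and backward_pass stand in for list schedulers that break priority ties using the random generator they are given; they are hypothetical callables, not functions from the lab or Schielke's code.

```python
import random

def rbf(forward_pass, backward_pass, delay, passes=5, seed=0):
    """Keep the shortest of `passes` randomized forward and backward schedules."""
    rng = random.Random(seed)
    def schedule_length(s):
        return max(s[n] + delay[n] for n in s)   # L(S)
    best = None
    for scheduler in [forward_pass] * passes + [backward_pass] * passes:
        s = scheduler(rng)                # each pass breaks ties with rng
        if best is None or schedule_length(s) < schedule_length(best):
            best = s
    return best
```

The "keep the best" test is where other metrics would plug in, e.g. comparing (length, register count) tuples instead of length alone.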
Scheduling Larger Regions: Superlocal Scheduling
Work an EBB at a time
The example has four EBBs

[CFG: B1 {a,b,c,d} branches to B2 {e,f} and B3 {g}; B2 branches to B4 {h,i} and B5 {j,k}; B3 also reaches B5; B4 and B5 both reach B6 {l}]
Scheduling Larger Regions: Superlocal Scheduling
Work an EBB at a time; the example has four EBBs
Only two have nontrivial paths: {B1,B2,B4} and {B1,B3}
Having B1 in both causes conflicts
Moving an op out of B1 causes problems
Scheduling Larger Regions: Superlocal Scheduling
Only two EBB paths are nontrivial: {B1,B2,B4} and {B1,B3}
Having B1 in both causes conflicts: moving an op out of B1 causes problems
[Figure: c has been moved from B1 down into B2, so the path through B3 no longer computes c: "no c here!"]
Scheduling Larger Regions: Superlocal Scheduling
Moving an op out of B1 causes problems
Must insert compensation code in B3
Increases code space
[Figure: a copy of c is inserted into B3 as compensation code; "this one wasn't done for speed!"]
Scheduling Larger Regions: Superlocal Scheduling
Only two EBB paths are nontrivial: {B1,B2,B4} and {B1,B3}
Having B1 in both causes conflicts
Moving an op into B1 also causes problems
Scheduling Larger Regions: Superlocal Scheduling
Moving an op into B1 causes problems
Lengthens {B1,B3}
Adds computation to {B1,B3}: this makes that path even longer!
May need compensation code, too (an "undo f" on the B3 path)
Renaming may avoid the undo
[Figure: f has been moved from B2 up into B1, forcing undo code in B3]
Scheduling Larger Regions: Superlocal Scheduling
How much can we get? Schielke saw 11 to 12% speedups, with compensation code constrained away
(Value numbering is the best case for superlocal scope)
Scheduling Larger Regions: More Aggressive Superlocal Scheduling
Clone blocks to create more context
Join points create blocks that must work in multiple contexts
In the example, 2 paths reach B5 and 3 paths reach B6
Scheduling Larger Regions: More Aggressive Superlocal Scheduling
Clone blocks to create more context
Some blocks can combine: a single successor with a single predecessor
[Figure: B5 is cloned into B5a and B5b; B6 is cloned into B6a, B6b, B6c]
Scheduling Larger Regions: More Aggressive Superlocal Scheduling
After cloning and combining, schedule the EBBs {B1,B2,B4}, {B1,B2,B5a}, {B1,B3,B5b}
Pay heed to compensation code
Works well for forward motion
Backward motion still has off-path problems
Speeding up one path can slow down others (undo code)
Scheduling Larger Regions: Trace Scheduling
Start with execution counts for the edges
Obtained by profiling
Scheduling Larger Regions: Trace Scheduling
Start with execution counts for the edges (obtained by profiling)
Pick the hot path
[Edge counts: 10 into B1; B1→B2 7, B1→B3 3; B2→B4 5, B2→B5 2; B3→B5 3; B4→B6 5; B5→B6 5]
Block counts could mislead us: see B5
Scheduling Larger Regions: Trace Scheduling
Pick the hot path: B1, B2, B4, B6
Schedule it
Insert compensation code in B3 and B5 if needed
Get the hot path right! If we picked the right path, the other blocks do not matter as much
Places a premium on quality profiles
Scheduling Larger Regions: Trace Scheduling the Entire CFG
Pick and schedule the hot path
Insert compensation code
Remove the hot path from the CFG
Repeat the process until the CFG is empty
Idea: hot paths matter; the farther off the hot path, the less it matters
Interaction with Register Allocation
Dependence constraints arise from register names:
True dependence (a: ... => r1; b: r1, ... => ...)
Anti dependence (a: r1, ... => ...; b: ... => r1)
Output dependence (a: ... => r1; b: ... => r1)
Renaming can remove anti and output dependences
But it usually increases register pressure
Solution: pre- and post-allocation scheduling
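The three dependence kinds can be classified mechanically from the defined and used registers of two operations. A sketch (the register-list encoding is illustrative):

```python
def dependences(defs1, uses1, defs2, uses2):
    """Classify dependences from op 1 to a later op 2.
    defs/uses are lists of register names for each operation."""
    kinds = []
    if set(defs1) & set(uses2):
        kinds.append("true")      # op 1 defines a register op 2 reads
    if set(uses1) & set(defs2):
        kinds.append("anti")      # op 1 reads a register op 2 redefines
    if set(defs1) & set(defs2):
        kinds.append("output")    # both operations define the same register
    return kinds
```

Renaming the second definition to a fresh register makes the "anti" and "output" sets empty, which is exactly why renaming removes those two kinds but leaves true dependences intact.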
Lab 3
Statically schedule for the lab's machine model
Run iloc_sim s 3 < file.i for non-scheduled code
Implement two schedulers for basic blocks in ILOC
One must be list scheduling with the specified priority
The other can use a second priority, or a different algorithm
Same ILOC subset as in Lab 1 (plus NOP), but a different execution model:
Two asymmetric functional units
Latencies different from Lab 1
Simulator different from Lab 1
Lab 3: Specific Questions
1. Are we in over our heads? No. The hard part of this lab should be trying to get good code. The programming is not bad; you can reuse some material from Lab 1. You have all the specifications and tools you need. Jump in and start programming.
2. How many registers can we use? Assume that you have as many registers as you need. If the simulator runs out, we'll get more. Rename registers to avoid false dependences and conflicts.
3. What about a store followed by a load? If you can show that the two operations must refer to different memory locations, the scheduler can overlap their execution. Otherwise, the store must complete before the load issues.
4. What about test files? We will put some test files online. We will test your lab on files you do not see. We have had problems (in the past) with people optimizing their labs for the test data. (Not that you would do that!)
Lab 3 Hints
Begin by renaming registers to eliminate false dependences
Pay attention to loads and stores: when can they be reordered?
Understand the simulator
Inserting NOPs may be necessary to get correct code
Tie-breakers can make a difference
Summary
Instruction scheduling
List scheduling
Scheduling larger regions
Next Class
Code shape
Code generation