Lecture 19-23 Compiler Backend Jianwen Zhu Electrical and Computer Engineering University of Toronto Jianwen Zhu 2009 - P. 1
Backend Tasks
  Instruction selection
    Map virtual instructions to machine instructions
  Scheduling
    Reorder instructions to exploit instruction-level parallelism (ILP)
  Register allocation
    Map values (computed by instructions) to machine registers
Instruction Selection
  Naïve selection
    One-to-many: each virtual instruction maps to one or more machine instructions
    Fast; employed by fast JIT compilers
    Poor performance of the generated code
  Better selection
    Many-to-one: map a subgraph of virtual instructions to one machine instruction
    The subgraph is called a tile
    Better performance
    More challenging algorithm design
Example tiles (IR tile = MIPS code)
  IR                                 Tile
  CONST(c)                           li   'd0, c
  +(e0,e1)                           add  'd0, 's0, 's1
  +(e0,CONST(c))                     add  'd0, 's0, c
  *(e0,e1)                           mult 'd0, 's0, 's1
  *(e0,CONST(2^k))                   sll  'd0, 's0, k
  MEM(e0)                            lw   'd0, ('s0)
  MEM(+(e0,CONST(c)))                lw   'd0, c('s0)
  MOVE(MEM(e0),e1)                   sw   's1, ('s0)
  MOVE(MEM(+(e0,CONST(c))),e1)       sw   's1, c('s0)
  JUMP(NAME(X))                      b    X
  JUMP(e0)                           jr   's0
  LABEL(X)                           X: nop
  Figure: an IR tile with operand subtrees e0, e1, ..., en, source registers s0, s1, ..., sn and destination register d0
Problem Formulation
  Input: subject graph
    Nodes: virtual instructions
    Edges: operand usages
    Simplification: the subject graph is reduced to a forest of trees
      Break the graph wherever a value is used more than once
  Input: patterns
    Each machine instruction is modeled as a tree of virtual instructions
Problem Formulation (continued)
  Find a covering of the subject graph (tree)
    A non-overlapping partitioning of the graph
    Each partition = a tile
    Each tile matches a pattern (machine instruction)
  Such that the cost of the cover is minimized
    Cost = number of tiles (machine instructions)
Instruction Selection Challenge
  There are many possible alternatives: exponentially many!
  a[i] := x is:
    MOVE(MEM(+(MEM(+(TEMP(fp),CONST(20))), *(TEMP(i),CONST(4)))), MEM(+(TEMP(fp),CONST(10))))
  The following are two possible tilings of the IR:
    Tiling 1              Tiling 2
    lw  r1, 20($fp)       add r1, $fp, 20
    lw  r2, i             lw  r1, (r1)
    sll r2, r2, 2         lw  r2, i
    add r1, r1, r2        sll r2, r2, 2
    lw  r2, 10($fp)       add r1, r1, r2
    sw  r2, (r1)          add r2, $fp, x
                          lw  r2, (r2)
                          sw  r2, (r1)
  Which one is better?
Top-Down Greedy Algorithm
  Maximum Munch
    Start from the IR root; from all matching tiles, select the one that covers the maximum number of IR nodes
    Go to the children of this tile and apply the algorithm recursively until the tree leaves are reached
  Pros
    Fast
  Cons
    Greedy: decisions are made without knowing their impact
Maximum Munch Example
  A  lw  r1, fp
  B  lw  r2, 8(r1)
  C  lw  r3, i
  D  sll r4, r3, 2
  E  add r5, r2, r4
  F  lw  r6, fp
  G  lw  r7, 16(r6)
  H  add r8, r7, 1
  I  sw  r8, (r5)
Bottom-Up Optimal Algorithm
  Works from the leaves to the root
  Evaluate each matching tile
    Calculate the cost of the children first
    The cost is accumulated: cost of a tile = (cost of the tile itself, 1 when counting machine instructions) + (total cost of all the tile's children)
    Pick the best tile
  Pros
    Optimal (if the subject graph is a tree)
  Cons
    A node may be visited many times as a child
    Cost calculation is repeated
    What have we learned before to solve that?
Dynamic Programming
  Remember the solutions of the children!
    Cost is calculated only once
    Use a table to remember
    That's what "programming" means
  Example
Step 1
Step 2
Step 3
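The bottom-up algorithm with a table can be sketched as follows. This is a rough illustration that assumes a hypothetical tuple encoding of the IR, a small subset of the tiles, and a cost of one per emitted instruction (TEMP is assumed free); `functools.lru_cache` plays the role of the table.

```python
from functools import lru_cache

WILD = "_"  # in a pattern, matches any subtree (a child of the tile)

# (pattern, label) pairs; hypothetical subset of the tile table, None = no instruction.
TILES = [
    (("MOVE", ("MEM", ("+", WILD, ("CONST",))), WILD), "sw c(s0)"),
    (("MOVE", ("MEM", WILD), WILD), "sw (s0)"),
    (("MEM", ("+", WILD, ("CONST",))), "lw c(s0)"),
    (("MEM", WILD), "lw (s0)"),
    (("+", WILD, ("CONST",)), "add imm"),
    (("+", WILD, WILD), "add"),
    (("*", WILD, WILD), "mult"),
    (("CONST",), "li"),
    (("TEMP",), None),
]

def match(pattern, node):
    """Return the subtrees bound to wildcards, or None if the pattern does not fit."""
    if pattern == WILD:
        return [node]
    if not isinstance(node, tuple) or node[0] != pattern[0]:
        return None
    if len(pattern) == 1:
        return []
    if len(node) != len(pattern):
        return None
    children = []
    for p, n in zip(pattern[1:], node[1:]):
        sub = match(p, n)
        if sub is None:
            return None
        children += sub
    return children

@lru_cache(maxsize=None)          # the table: each subtree is solved only once
def best(node):
    """Return (cost, instructions) of a cheapest tiling of the tree rooted at node."""
    options = []
    for pattern, name in TILES:
        kids = match(pattern, node)
        if kids is None:
            continue
        cost, code = (0 if name is None else 1), ()
        for kid in kids:          # accumulate the already-optimal child solutions
            kid_cost, kid_code = best(kid)
            cost += kid_cost
            code += kid_code
        options.append((cost, code + (() if name is None else (name,))))
    return min(options, key=lambda o: o[0])

# The a[i] := x tree from the "Instruction Selection Challenge" slide.
TREE = ("MOVE",
        ("MEM", ("+", ("MEM", ("+", ("TEMP", "fp"), ("CONST", 20))),
                      ("*", ("TEMP", "i"), ("CONST", 4)))),
        ("MEM", ("+", ("TEMP", "fp"), ("CONST", 10))))
total_cost, instructions = best(TREE)
```

Because IR nodes are immutable tuples, identical subtrees share one table entry, which is what removes the repeated cost calculation.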
Why Scheduling
  Pipelined processors
    Hazards
    Reordering can relieve hazards
  VLIW
    Used by DSPs (e.g. TI)
    Issue multiple instructions per cycle
    Determined by the compiler
  Superscalar
    Used by desktop processors (e.g. Intel)
    Issue multiple instructions per cycle
    Determined by the hardware scheduler
    The compiler can increase the available ILP
Scheduling Problem
  For each basic block
    Map each instruction v to a clock step S(v): integer
  Subject to data dependency constraints
    S(u) < S(v) if v depends on u
  Subject to resource constraints
    # of instructions issued in a step <= # of available functional units (FU)
Scheduling: Dependency Test
  Why a dependency test?
    Schedule the maximum # of instructions in a step
    Preserve functional correctness
  Sources of data dependency
    Must
      A = ...   followed by   ... = A
    May (if A and B are aliased to the same location)
      (1) A = ...   followed by   B = ...
      (2) A = ...   followed by   ... = B
      (3) ... = A   followed by   B = ...
Dependency Test for TinyIR
  No pointers
    The dependency test only consists of comparing the symbol names
  With pointers
    Pointer/alias analysis has to be performed
  Test result: a precedence graph for each basic block
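For the pointer-free case, the name-comparison test can be sketched in a few lines of Python. The `(dest, sources)` instruction encoding and the sample block are assumptions made for illustration; they are not the TinyIR format.

```python
def precedence_edges(block):
    """block: list of (dest, sources) pairs in program order, using plain symbol names.
    Returns the set of (u, v) index pairs such that instruction v must follow u."""
    edges = set()
    for v, (dest_v, srcs_v) in enumerate(block):
        for u in range(v):
            dest_u, srcs_u = block[u]
            raw = dest_u is not None and dest_u in srcs_v    # v reads what u wrote
            waw = dest_u is not None and dest_u == dest_v    # both write the same symbol
            war = dest_v is not None and dest_v in srcs_u    # v overwrites what u read
            if raw or waw or war:
                edges.add((u, v))
    return edges

# Hypothetical block:  0: a = b + c   1: d = a   2: b = e   3: a = d
BLOCK = [("a", ("b", "c")), ("d", ("a",)), ("b", ("e",)), ("a", ("d",))]
EDGES = precedence_edges(BLOCK)
```

This pairwise comparison is quadratic in the block size and records transitive edges as well, which is harmless for scheduling but could be pruned.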
Precedence Graph Example
  Figure: (a) TinyC code translated to (b) TinyIR
Precedence Graph Example
  For a basic block, draw the dependencies of each instruction as edges in the precedence graph
    E.g. (28) (26) (24),(25)
  Instructions outside the basic block are named s and t
  Figure: (a) chain of instructions; (b) precedence graph of B4
Unconstrained Scheduling
  Assume an unlimited # of functional units
  Problem
    Total # of steps = S(t) - S(s)
    The source s must be scheduled earlier, and the sink t later, than any node in the basic block
As Soon As Possible (ASAP) Scheduling
ASAP in English
  Iterative approach
    In each iteration, a set of ready nodes (instructions) is scheduled to a control step
    A node is ready when all its predecessors are scheduled
  Key: how to efficiently decide if a node is ready?
    Naïve approach: for each node being scheduled, visit each successor and check all predecessors of the successor
    Better approach: keep a counter for each node representing its # of unscheduled predecessors
      For each node being scheduled, visit each successor
      Decrement the counter of the successor
      When the counter = 0, the node is ready
      O(|V| + |E|)
  The ALAP (As Late As Possible) algorithm schedules in reverse order (from the sink)
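The counter-based approach can be sketched in Python as below. The function name and the edge list are illustrative; the edges are an assumption inferred from the lecture's ASAP/ALAP example, since the precedence graph figure itself is not reproduced in this text.

```python
def asap(nodes, edges):
    """Return {node: control step}, scheduling every node as soon as it is ready."""
    succs = {n: [] for n in nodes}
    pending = {n: 0 for n in nodes}      # counter: # of unscheduled predecessors
    for u, v in edges:
        succs[u].append(v)
        pending[v] += 1
    step = {}
    ready = [n for n in nodes if pending[n] == 0]
    s = 0
    while ready:
        next_ready = []
        for u in ready:
            step[u] = s
            for v in succs[u]:           # each edge is visited exactly once
                pending[v] -= 1
                if pending[v] == 0:
                    next_ready.append(v)
        ready = next_ready
        s += 1
    return step

# Assumed precedence graph of the running example (u, v means v depends on u).
NODES = [15, 16, 17, 18, 20, 24, 25, 26]
EDGES = [(15, 17), (16, 17), (17, 18), (18, 20), (24, 26), (25, 26)]
ASAP = asap(NODES, EDGES)
```

Every node is scheduled once and every edge decremented once, which gives the O(|V| + |E|) bound.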
ASAP and ALAP Example
            (a) ASAP          (b) ALAP
            s                 s
  step 0    15 16 24 25       15 16
  step 1    17 26             17
  step 2    18                18 24 25
  step 3    20                20 26
            t                 t
  (c) Mobility chart over nodes 15, 16, 17, 18, 20, 24, 25, 26 (figure)
  Mobility shows the flexibility of scheduling for each node (the difference between its ASAP step and its ALAP step)
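As a rough check of the example, mobility can be computed by running the same longest-path calculation forward (ASAP) and on the reversed graph (ALAP). The helper names and the edge list are assumptions of this sketch; the edges were inferred from the tables above.

```python
def longest_path_levels(nodes, edges):
    """Level of a node = length of the longest chain of predecessors leading to it."""
    preds = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
    level = {}
    def visit(n):
        if n not in level:
            level[n] = 1 + max((visit(p) for p in preds[n]), default=-1)
        return level[n]
    for n in nodes:
        visit(n)
    return level

def asap_alap_mobility(nodes, edges):
    """Return (asap, alap, mobility) dictionaries for an acyclic precedence graph."""
    asap = longest_path_levels(nodes, edges)
    last = max(asap.values())                       # index of the final control step
    to_sink = longest_path_levels(nodes, [(v, u) for u, v in edges])
    alap = {n: last - to_sink[n] for n in nodes}    # schedule backwards from the sink
    mobility = {n: alap[n] - asap[n] for n in nodes}
    return asap, alap, mobility

# Assumed precedence graph of the running example (u, v means v depends on u).
NODES = [15, 16, 17, 18, 20, 24, 25, 26]
EDGES = [(15, 17), (16, 17), (17, 18), (18, 20), (24, 26), (25, 26)]
ASAP, ALAP, MOBILITY = asap_alap_mobility(NODES, EDGES)
```

Nodes on the critical path 15/16, 17, 18, 20 get mobility 0, while 24, 25 and 26 can each slide by two steps.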
Resource-constrained Scheduling
  Now consider a limited # of functional units
  New constraint: the # of instructions scheduled at any control step must be <= the # of functional units
            (a)               (b)
            s                 s
  step 0    15 16 24 25       15 16
  step 1    17 26             17
  step 2    18                18 24 25
  step 3    20                20 26
            t                 t
  Annotation in the figure: "Violated!"
  (c) Chart over nodes 15, 16, 17, 18, 20, 24, 25, 26 (figure)
List Scheduling
List Scheduling in English
  Modified version of ASAP
  Like ASAP, a list of ready nodes is maintained
  Unlike ASAP
    Unit occupancy for the current control step needs to be maintained: reservation table (restab)
    Only a subset of the ready nodes can be scheduled
  Key question: how to choose the subset for better performance?
    Classic NP-complete problem
  Solution: assign priorities to ready nodes
    Priority determined by heuristics
List Scheduling Heuristics
  Less-flexible-first
    Assigns higher priority to nodes that have smaller mobility
    Mobility = ALAP step - ASAP step
  Distance to sink
  # of successors
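A possible Python sketch of list scheduling with the less-flexible-first priority follows. The unit mix (two ALUs, one multiplier), single-cycle latencies, and the edge list are assumptions chosen to be consistent with the lecture's example; the operator of each node is taken from the example figure.

```python
def list_schedule(nodes, edges, unit_of, capacity, priority):
    """Return a list of control steps, each a list of the nodes issued in that step.
    capacity maps a unit type to its count (assumed >= 1 for every type in use)."""
    succs = {n: [] for n in nodes}
    pending = {n: 0 for n in nodes}
    for u, v in edges:
        succs[u].append(v)
        pending[v] += 1
    ready = [n for n in nodes if pending[n] == 0]
    schedule = []
    while ready:
        free = dict(capacity)                 # reservation table for this step
        chosen = []
        for n in sorted(ready, key=priority): # best priority first
            if free[unit_of[n]] > 0:
                free[unit_of[n]] -= 1
                chosen.append(n)
        for n in chosen:
            ready.remove(n)
        for n in chosen:                      # successors become ready for later steps
            for v in succs[n]:
                pending[v] -= 1
                if pending[v] == 0:
                    ready.append(v)
        schedule.append(chosen)
    return schedule

NODES = [15, 16, 17, 18, 20, 24, 25, 26]
EDGES = [(15, 17), (16, 17), (17, 18), (18, 20), (24, 26), (25, 26)]  # assumed
UNIT = {15: "alu", 16: "alu", 20: "alu", 24: "alu",
        17: "mul", 18: "mul", 25: "mul", 26: "mul"}
MOBILITY = {15: 0, 16: 0, 17: 0, 18: 0, 20: 0, 24: 2, 25: 2, 26: 2}

# Less-flexible-first: smaller mobility first, node number as tie-breaker.
SCHEDULE = list_schedule(NODES, EDGES, UNIT, {"alu": 2, "mul": 1},
                         lambda n: (MOBILITY[n], n))
```

Swapping the `priority` function is enough to try the other heuristics, such as distance to sink or number of successors.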
Example
  (a) Uses less-flexible-first priority
  (b) Random1 did not choose 16 at step 0 (extra step)
  (c) Random2 did not choose 17 at step 2 (extra step)
            (a)                 (b)                 (c)
            s                   s                   s
  step 0    15+  16-  25*       15+  24+  25*       15+  16-  25*
  step 1    17*  24+            16-  26*            17*  24+
  step 2    18*                 17*                 26*
  step 3    20+  26*            18*                 18*
  step 4    t                   20+                 20+
                                t                   t
Register Allocation
  Virtual instructions compute values
  Values must be stored for later use
    Store values in registers
  Questions
    How many registers are needed?
    How can values be bound to registers?
Register Binding
  Objective:
    Minimize the number of registers
    Maximize the sharing of registers between values
  Observation:
    Two values can share the same register only if they are not in use at the same time
Liveness Analysis
  Objective: determine when variables are in use
  What does it mean to be live?
    Definition 5.5: A value (instruction) v is live at a control step s1 if there exists another control step s2 reachable from s1 such that v is used as an operand by one of the instructions scheduled at s2.
    The live set at control step s1 is the set of all values live at s1.
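Example
  (a) Schedule and live values (4 and 6 appear as inputs at the top of the figure)
            Scheduled     Live values
  step 0    15 16 25      {4, 6}
  step 1    17 24         {4, 15, 16, 25}
  step 2    18            {17, 24, 25}
  step 3    20 26         {18, 24, 25}
  (end)                   {20, 26}
  (b) Chart over values 4, 6, 15, 16, 17, 18, 20, 24, 25, 26 (figure)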
Live(s), Def(s) and Use(s)
  Liveness analysis is used to compute when variables are live
  Uses three sets
    Live(s): set of live values at the beginning of step s
    Def(s): set of values defined at step s
    Use(s): set of values used at step s
Basic Block Liveness Analysis
  Relationship between variables
    Live(s) = Use(s) ∪ [Live(s+1) - Def(s)]
  Backward scan: each step requires information from the next step
  Idea:
    Used variables become live
    Defined variables become dead
Basic Block Liveness Analysis
  Live(3) = {18, 24, 25} ∪ [{20, 26} - {20, 26}]
  STEP   Def             Use             Live
  0      {15, 16, 25}    {4, 6}
  1      {17, 24}        {4, 15, 16}
  2      {18}            {17}
  3      {20, 26}        {18, 24, 25}    {18, 24, 25}
  4                                      {20, 26}
Basic Block Liveness Analysis
  Live(2) = {17} ∪ [{18, 24, 25} - {18}]
  STEP   Def             Use             Live
  0      {15, 16, 25}    {4, 6}          {4, 6}
  1      {17, 24}        {4, 15, 16}     {4, 15, 16, 25}
  2      {18}            {17}            {17, 24, 25}
  3      {20, 26}        {18, 24, 25}    {18, 24, 25}
  4                                      {20, 26}
Basic Block Liveness Analysis
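The backward scan can be written as a short Python sketch. The function name and list-of-sets encoding are illustrative choices; the Def and Use sets are the ones from the example table.

```python
def block_liveness(defs, uses, live_out):
    """defs[s], uses[s]: sets for control step s; live_out: values live after the block.
    Returns live[0..n], where live[s] is the live set at the beginning of step s."""
    n = len(defs)
    live = [set() for _ in range(n + 1)]
    live[n] = set(live_out)
    for s in range(n - 1, -1, -1):            # backward scan
        # Live(s) = Use(s) U [Live(s+1) - Def(s)]
        live[s] = uses[s] | (live[s + 1] - defs[s])
    return live

DEFS = [{15, 16, 25}, {17, 24}, {18}, {20, 26}]
USES = [{4, 6}, {4, 15, 16}, {17}, {18, 24, 25}]
LIVE = block_liveness(DEFS, USES, live_out={20, 26})
```

Each step is visited once, so the scan is linear in the number of steps (times the cost of the set operations).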
Liveness Analysis Algorithm
  This algorithm can be extended to a whole control flow graph (CFG)
  Caveats:
    The CFG has backward edges
    A basic block (BB) can have multiple successors
  Modifications
    Repeatedly traverse the graph until the solution converges
    Union the live sets from all successors before applying the equation
Liveness Analysis Algorithm
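One way to sketch the CFG extension in Python is a simple fixed-point iteration. The block names, the per-block (use, def) summaries, and the small loop CFG below are hypothetical and serve only as an illustration; the live set flowing into the equation is the union over the successors of each block.

```python
def cfg_liveness(blocks, succs):
    """blocks: name -> (use, defs); use holds values read before any write in the block.
    succs: name -> list of successor block names. Returns (live_in, live_out)."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:                            # repeat until the solution converges
        changed = False
        for b, (use, defs) in blocks.items():
            out = set()
            for s in succs[b]:                # union over all successors
                out |= live_in[s]
            new_in = use | (out - defs)       # same equation as for a single step
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out

# Hypothetical loop:  B1: i = 0   B2: if i < n   B3: s = s + i; i = i + 1   B4: return s
BLOCKS = {
    "B1": (set(), {"i"}),
    "B2": ({"i", "n"}, set()),
    "B3": ({"s", "i"}, {"s", "i"}),
    "B4": ({"s"}, set()),
}
SUCCS = {"B1": ["B2"], "B2": ["B3", "B4"], "B3": ["B2"], "B4": []}
LIVE_IN, LIVE_OUT = cfg_liveness(BLOCKS, SUCCS)
```

Visiting blocks in reverse order of control flow usually converges in fewer passes, but any order reaches the same fixed point.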
Interference Graph
  Liveness of variables has been computed
  Construct a graph describing the interference of two variables
    i.e. both variables are live at the same time
  Nodes: represent variables
  Edges: connect two variables that are live at the same time
Interference Graph
  Figure: interference graph over nodes 4, 6, 15, 16, 17, 18, 20, 24, 25, 26
Interference Graph Algorithm
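A small Python sketch of the construction, assuming the live sets are already available as a list of sets (here the ones from the basic block example). The adjacency-set representation is a choice made for this sketch.

```python
from itertools import combinations

def interference_graph(live_sets):
    """Return {value: set of values that are live at the same time as it}."""
    graph = {}
    for live in live_sets:
        for v in live:
            graph.setdefault(v, set())        # isolated values still become nodes
        for a, b in combinations(live, 2):    # every pair in a live set interferes
            graph[a].add(b)
            graph[b].add(a)
    return graph

LIVE_SETS = [{4, 6}, {4, 15, 16, 25}, {17, 24, 25}, {18, 24, 25}, {20, 26}]
GRAPH = interference_graph(LIVE_SETS)
```

For this example the result has ten nodes, and value 25 has the most neighbors because it stays live across three steps.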
Register Binding by Coloring
  Recall: we want to assign variables to registers so as to minimize the number of registers used
  Corresponds to the graph coloring problem
    Given a graph, color each node such that no edge joins nodes of the same color, using the fewest colors
Graph Coloring
  Graph coloring is NP-complete
    Heuristics are needed to solve the problem within reasonable time
  Overview
    Given an order in which to visit the nodes (i.e. a vertex elimination order)
    Visit each node in that order
    Pick a color such that no neighbor has the same one
Example
  Figures: the interference graph is colored step by step; nodes shown in each successive figure:
    25, 16
    25, 15, 16
    4, 25, 15, 16
    4, 25, 15, 24, 16
    4, 25, 15, 24, 16, 17, 18
    4, 26, 6, 25, 15, 24, 16, 20, 17, 18
Coloring Algorithm
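A greedy coloring pass over a given vertex order might look like the Python sketch below. Choosing the lowest free color number is an assumption of this sketch, and the edge list is derived from the live sets of the basic block example rather than read off the figure.

```python
def color_graph(graph, order):
    """Visit nodes in the given order; give each the lowest color no neighbor has."""
    color = {}
    for v in order:
        taken = {color[n] for n in graph[v] if n in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color

# Interference edges derived from the example live sets (an assumption of this sketch).
EDGES = [(4, 6), (4, 15), (4, 16), (4, 25), (15, 16), (15, 25), (16, 25),
         (17, 24), (17, 25), (24, 25), (18, 24), (18, 25), (20, 26)]
GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)

ORDER = [25, 16, 15, 4, 24, 18, 17, 26, 20, 6]
COLOR = color_graph(GRAPH, ORDER)
REGISTERS = max(COLOR.values()) + 1
```

Each color number stands for one register, so the number of distinct colors is the number of registers needed for this order.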
Vertex Elimination Order
  The previous example assumed that the order was given
  For a basic block
    Use the left-edge algorithm to determine the order
    Optimal for interval graphs
  Generally
    Use a heuristic: less-flexible-first
      i.e. pick the nodes with the most neighbors first
Vertex Elimination
  To generate the vertex order
    Remove the node with the fewest neighbors from the graph and push it onto a stack
    Repeat until all nodes have been visited
    The stack now contains the vertex order, starting with the top node on the stack
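The stack-based procedure can be sketched in Python as follows. Breaking ties by the smaller node number is an assumption of this sketch, and the edge list is derived from the live sets of the basic block example.

```python
def elimination_order(graph):
    """Repeatedly remove the node with the fewest remaining neighbors, pushing it on a
    stack; return the stack contents from top to bottom (the order used for coloring)."""
    remaining = {v: set(ns) for v, ns in graph.items()}   # work on a copy
    stack = []
    while remaining:
        v = min(remaining, key=lambda n: (len(remaining[n]), n))
        for n in remaining.pop(v):            # delete v and its edges from the graph
            remaining[n].discard(v)
        stack.append(v)
    return stack[::-1]                        # top of the stack first

# Interference edges derived from the example live sets (an assumption of this sketch).
EDGES = [(4, 6), (4, 15), (4, 16), (4, 25), (15, 16), (15, 25), (16, 25),
         (17, 24), (17, 25), (24, 25), (18, 24), (18, 25), (20, 26)]
GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)

ORDER = elimination_order(GRAPH)
```

Nodes pushed last sit on top of the stack and are colored first, so the most constrained nodes get their colors before the easy ones.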
Example
  Figures: nodes are eliminated from the interference graph step by step; stack contents after each figure (top first):
    26 20 6
    18 17 26 20 6
    24 18 17 26 20 6
    4 24 18 17 26 20 6
    15 4 24 18 17 26 20 6
    25 16 15 4 24 18 17 26 20 6