Introduction to Optimization
CS434 Compiler Construction, Joel Jones
Department of Computer Science, University of Alabama

Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use.

Traditional Three-pass Compiler
Source Code → Front End → IR → Middle End → IR → Back End → Machine Code (with errors reported at each stage)
Code Improvement (or Optimization):
- Analyzes IR and rewrites (or transforms) IR
- Primary goal is to reduce the running time of the compiled code
- May also improve space, power consumption, ...
- Must preserve the meaning of the code
  - Measured by the values of named variables
- A course (or two) unto itself

The Optimizer (or Middle End)
IR → Opt 1 → IR → Opt 2 → IR → Opt 3 → IR → ... → Opt n → IR (errors reported along the way)
Modern optimizers are structured as a series of passes. Typical transformations:
- Discover & propagate some constant value
- Move a computation to a less frequently executed place
- Specialize some computation based on context
- Discover a redundant computation & remove it
- Remove useless or unreachable code
- Encode an idiom in some particularly efficient form

The Role of the Optimizer
- The compiler can implement a procedure in many ways
- The optimizer tries to find an implementation that is better
  - Speed, code size, data space, ...
To accomplish this, it:
- Analyzes the code to derive knowledge about run-time behavior
  - Data-flow analysis, pointer disambiguation, ...
  - The general term is static analysis
- Uses that knowledge in an attempt to improve the code
Literally hundreds of transformations have been proposed, with a large amount of overlap between them.
There is nothing optimal about optimization: proofs of optimality assume restrictive & unrealistic conditions.

Redundancy Elimination as an Example
An expression x+y is redundant if and only if, along every path from the procedure's entry, it has been evaluated and its constituent subexpressions (x & y) have not been re-defined.
If the compiler can prove that an expression is redundant:
- It can preserve the results of earlier evaluations
- It can replace the current evaluation with a reference
Two pieces to the problem:
- Proving that x+y is redundant
- Rewriting the code to eliminate the redundant evaluation
One technique for accomplishing both is called value numbering.

Value Numbering (An Old Idea)
The key notion (Balke 1968 or Ershov 1954):
- Assign an identifying number, V(exp), to each expression
  - V(x+y) = V(j) iff x+y and j have the same value along every path
  - Use hashing over the value numbers to make it efficient
- Use these numbers to improve the code
Improving the code:
- Replace redundant expressions
- Simplify algebraic identities
- Discover constant-valued expressions, fold & propagate them
This technique was invented for low-level, linear IRs. Equivalent methods exist for trees (build a DAG).

Local Value Numbering: The Algorithm
For each operation o = <operator, o1, o2> in the block:
1. Get value numbers for the operands from a hash lookup
2. Hash <operator, VN(o1), VN(o2)> to get a value number for o
3. If o already had a value number, replace o with a reference
4. If o1 & o2 are constant, evaluate o & replace it with a loadI
If hashing behaves, the algorithm runs in linear time; if not, use multi-set discrimination (see the digression in Ch. 5).
Handling algebraic identities:
- Case statement on operator type
- Handle special cases within each operator
(A runnable sketch follows.)
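To make the algorithm concrete, here is a minimal, runnable sketch in Python. It assumes a toy three-address IR of (dest, op, arg1, arg2) tuples; the names lvn, home, and valnum are illustrative, not part of the lecture's notation. It also guards against the "oops" case from the next slide by recomputing when the variable holding a value has been overwritten.

    def lvn(block):
        """Local value numbering over one basic block.
        block: list of (dest, op, arg1, arg2); arg2 may be None (e.g., loadI)."""
        vn = {}        # variable name or expression key -> value number
        home = {}      # value number -> variable that most recently held it
        counter = 0
        out = []

        def valnum(name):
            nonlocal counter
            if name not in vn:          # operands get a number on first sight
                vn[name] = counter
                counter += 1
            return vn[name]

        for dest, op, a1, a2 in block:
            key = (op, valnum(a1), valnum(a2) if a2 is not None else None)
            n = vn.get(key)
            rep = home.get(n)
            if n is not None and rep is not None and vn.get(rep) == n:
                out.append((dest, "copy", rep, None))   # redundant: reference it
            else:
                if n is None:                           # first evaluation
                    n = counter
                    counter += 1
                    vn[key] = n
                out.append((dest, op, a1, a2))          # (re)compute the value
            vn[dest] = n          # dest now names value n ...
            home[n] = dest        # ... and is its most recent home
        return out

    # The example from the next slide: b and c become copies.
    print(lvn([("a", "+", "x", "y"), ("b", "+", "x", "y"),
               ("a", "loadI", "17", None), ("c", "+", "x", "y")]))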

Local Value Numbering: An Example
Original code:
  a ← x + y
  b ← x + y
  a ← 17
  c ← x + y
Two redundancies: eliminate the statements recomputing x + y; coalesce the results?
With value numbers (shown as superscripts):
  a^3 ← x^1 + y^2
  b^3 ← x^1 + y^2
  a^4 ← 17
  c^3 ← x^1 + y^2
Rewritten:
  a^3 ← x^1 + y^2
  b^3 ← a^3
  a^4 ← 17
  c^3 ← a^3   (oops! a no longer holds value 3)
Options: use c^3 ← b^3; save a^3 in t^3; or rename around it.

Local Value Numbering: Example (continued)
Renaming: give each value a unique name; this makes the redundancy clear.
Original code:
  a0 ← x0 + y0
  b0 ← x0 + y0
  a1 ← 17
  c0 ← x0 + y0
With value numbers:
  a0^3 ← x0^1 + y0^2
  b0^3 ← x0^1 + y0^2
  a1^4 ← 17
  c0^3 ← x0^1 + y0^2
Notation: while complex, the meaning is clear.
Rewritten:
  a0^3 ← x0^1 + y0^2
  b0^3 ← a0^3
  a1^4 ← 17
  c0^3 ← a0^3
Result: a0^3 is still available; the rewriting just works.

Simple Extensions to Value Numbering
Constant folding:
- Add a bit that records when a value is constant
- Evaluate constant values at compile time
- Replace with a load immediate or an immediate operand
- No stronger local algorithm exists
Algebraic identities:
- Must check (many) special cases
- Replace the result with the input VN
- Build a decision tree on the operation
- Identities (Cliff Click): x ← y, x + 0, x - 0, x * 1, x ÷ 1, x - x, x * 0, x ÷ x, x ∨ 0, x ∧ 0xFF...FF, max(x, MAXINT), min(x, MININT), max(x, x), min(y, y), and so on
- Compare with values, not names
(A sketch of both extensions follows.)
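The sketch below bolts constant folding and a few of the identities onto the loop above; it is illustrative only. const_val plays the role of the "bit that records when a value is constant", mapping a value number to its known constant, and valnum is the numbering helper from the previous sketch. The handful of cases shown stand in for the full decision tree, and they compare value numbers, not names.

    import operator

    FOLD = {"+": operator.add, "-": operator.sub, "*": operator.mul}

    def simplify(op, a1, a2, const_val, valnum):
        """Return ('const', v) or ('copy', name) if the operation simplifies,
        else None. const_val: value number -> known constant value."""
        c1 = const_val.get(valnum(a1))
        c2 = const_val.get(valnum(a2)) if a2 is not None else None
        if c1 is not None and c2 is not None and op in FOLD:
            return ("const", FOLD[op](c1, c2))      # fold at compile time
        if op in ("+", "-") and c2 == 0:
            return ("copy", a1)                     # x + 0, x - 0  =>  x
        if op == "*" and c2 == 1:
            return ("copy", a1)                     # x * 1  =>  x
        if op == "*" and c2 == 0:
            return ("const", 0)                     # x * 0  =>  0
        if op == "-" and valnum(a1) == valnum(a2):
            return ("const", 0)                     # x - x  =>  0
        return None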

Handling Larger Scopes: Extended Basic Blocks
- Initialize the table for b_i with the table from b_{i-1}
- With single-assignment naming, we can use a scoped hash table; otherwise, it is complex
(Figure: a CFG with blocks b1 through b6, where b1, b2, b4 and b1, b3 form extended basic blocks and b5 and b6 are join points.)
The plan:
- Process b1, b2, b4
- Pop two levels
- Process b3 relative to b1
- Start clean with b5
- Start clean with b6
Using a scoped table makes processing the full tree of EBBs that share a common header efficient. (A sketch of such a table follows.)
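A minimal scoped hash table, as a sketch: each block of the EBB pushes a scope on entry and pops it on the way back up, so its entries vanish without copying or rebuilding the outer table. The class and method names are illustrative.

    class ScopedTable:
        """A stack of hash tables; pop discards one block's entries."""
        def __init__(self):
            self.scopes = [{}]

        def push(self):                  # entering a block of the EBB
            self.scopes.append({})

        def pop(self):                   # leaving it: undo its insertions
            self.scopes.pop()

        def insert(self, key, value):
            self.scopes[-1][key] = value

        def lookup(self, key):
            for scope in reversed(self.scopes):    # innermost binding wins
                if key in scope:
                    return scope[key]
            return None

With this, the plan above becomes push/pop traffic: push for b1, b2, b4, pop twice, push for b3, and so on.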

Handling Larger Scopes
To go further, we must deal with merge points:
- Our simple naming scheme falls apart in b4 (a block with multiple predecessors)
- We need more powerful analysis tools
- The naming scheme becomes SSA
This requires global data-flow analysis: compile-time reasoning about the run-time flow of values.
1. Build a model of control flow
2. Pose questions as sets of simultaneous equations
3. Solve the equations
4. Use the solution to transform the code
Examples: LIVE, REACHES, AVAIL

Value Numbering (EaC, Figure 8.5)
A: m ← a + b; n ← a + b
B: p ← c + d; r ← c + d
C: q ← a + b; r ← c + d
D: e ← b + 18; s ← a + b; u ← e + f
E: e ← a + 17; t ← c + d; u ← e + f
F: v ← a + b; w ← c + d; x ← e + f
G: y ← a + b; z ← c + d
Local value numbering works one block at a time:
- Strong local results
- No cross-block effects
- Missed opportunities (we need stronger methods)

An Aside on Terminology
Control-flow graph (CFG):
- Nodes for basic blocks
- Edges for branches
- The basis for much of program analysis & transformation
This CFG: G = (N, E)
- N = {A, B, C, D, E, F, G}
- E = {(A,B), (A,C), (B,G), (C,D), (C,E), (D,F), (E,F), (F,G)}
- |N| = 7, |E| = 8

SSA Name Space (in general)
Two principles:
- Each name is defined by exactly one operation
- Each operand refers to exactly one definition
To reconcile these principles with real code:
- Insert φ-functions at merge points to reconcile the name space
- Add subscripts to variable names for uniqueness
Before (two definitions of x reach the use along different paths):
  x ← ...
  x ← ...
  ... ← x + ...
After:
  x0 ← ...
  x1 ← ...
  x2 ← φ(x0, x1)
  ... ← x2 + ...

Superlocal Value Numbering (EaC, Figure 8.6)
The same example, now in SSA form:
A: m0 ← a + b; n0 ← a + b
B: p0 ← c + d; r0 ← c + d
C: q0 ← a + b; r1 ← c + d
D: e0 ← b + 18; s0 ← a + b; u0 ← e + f
E: e1 ← a + 17; t0 ← c + d; u1 ← e + f
F: e3 ← φ(e0, e1); u2 ← φ(u0, u1); v0 ← a + b; w0 ← c + d; x0 ← e + f
G: r2 ← φ(r0, r1); y0 ← a + b; z0 ← c + d
With all the bells & whistles:
- Finds more redundancy
- Still does nothing for F & G
Superlocal techniques:
- Some local methods extend cleanly to superlocal scopes
- VN does not back up: if processing C adds entries to A's table, it is a problem (the scoped tables prevent this)

What About Larger Scopes?
We have not helped with F or G:
- Multiple predecessors
- Must decide what facts hold in F and in G
- For G, combine B & F?
- Merging state is expensive
- Fall back on what's known
(The figure repeats the SSA-form CFG from the previous slide.)

Dominators
Definitions:
- x dominates y if and only if every path from the entry of the control-flow graph to the node for y includes x
- By definition, x dominates x
- We associate a Dom set with each node: Dom(x), with |Dom(x)| ≥ 1
Immediate dominators:
- For any node x, there must be a y in Dom(x) closest to x
- We call this y the immediate dominator of x
- As a matter of notation, we write this as IDom(x)

Dominators
Dominators have many uses in analysis & transformation:
- Finding loops
- Building SSA form
- Making code motion decisions
Dominator sets for the example CFG:
  Dom(A) = {A}        Dom(B) = {A, B}     Dom(C) = {A, C}
  Dom(D) = {A, C, D}  Dom(E) = {A, C, E}  Dom(F) = {A, C, F}
  Dom(G) = {A, G}
Dominator tree: A is the root, with children B, C, and G; C has children D, E, and F.
We'll look at how to compute dominators later (a preview sketch appears below). Back to the discussion of value numbering over larger scopes.
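As that preview, here is a sketch of the classic iterative computation, Dom(n) = {n} ∪ ∩_{p ∈ preds(n)} Dom(p), run to a fixed point on the example CFG. The function name and representation are illustrative.

    def dominators(nodes, preds, entry):
        dom = {n: set(nodes) for n in nodes}     # start full, shrink to a fixed point
        dom[entry] = {entry}
        changed = True
        while changed:
            changed = False
            for n in nodes:
                if n == entry:
                    continue
                new = {n} | set.intersection(*(dom[p] for p in preds[n]))
                if new != dom[n]:
                    dom[n] = new
                    changed = True
        return dom

    preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["C"],
             "E": ["C"], "F": ["D", "E"], "G": ["B", "F"]}
    dom = dominators("ABCDEFG", preds, "A")
    # dom["F"] == {"A", "C", "F"}, so IDom(F) = C
    # dom["G"] == {"A", "G"},      so IDom(G) = A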

What About Larger Scopes?
We have not helped with F or G:
- Multiple predecessors
- Must decide what facts hold in F and in G
- For G, combine B & F? Merging state is expensive
- Fall back on what's known
Instead, we can use the table from IDom(x) to start x:
- Use C's table for F and A's table for G
- This imposes a Dom-based application order
This leads to the Dominator VN Technique (DVNT).

Dominator Value Numbering: The DVNT Algorithm
- Use the superlocal algorithm on extended basic blocks
  - Retain the scoped hash tables & the SSA name space
- Start each node with the table from its IDom
  - DVNT generalizes the superlocal algorithm
- No values flow along back edges (i.e., around loops)
- Constant folding and algebraic identities work as before
The larger scope leads to (potentially) better results: local + superlocal + a good start for new EBBs. (A sketch of the driver follows.)
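A sketch of the DVNT driver, assuming the ScopedTable above and a vn_block routine that value-numbers one block (φ-functions included) against the table; dom_children gives each block's children in the dominator tree. All names are illustrative.

    def dvnt(block, table, dom_children, vn_block):
        table.push()                         # seed: everything IDom's table knows
        vn_block(block, table)               # local VN, φs treated as ordinary ops
        for child in dom_children[block]:    # visit in dominator-tree order
            dvnt(child, table, dom_children, vn_block)
        table.pop()                          # discard this block's facts

    # For the example CFG:
    # dvnt("A", ScopedTable(),
    #      {"A": ["B", "C", "G"], "B": [], "C": ["D", "E", "F"],
    #       "D": [], "E": [], "F": [], "G": []},
    #      vn_block)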

Dominator Value Numbering
(The figure repeats the SSA-form CFG, with e3 ← φ(e0, e1) and u2 ← φ(u0, u1) in F and r2 ← φ(r0, r1) in G.)
DVNT advantages:
- Finds more redundancy
- Little additional cost
- Retains its online character
DVNT shortcomings:
- Misses some opportunities
- No loop-carried CSEs or constants

The Story So Far
- Local algorithm (Balke, 1968)
- Superlocal extension of Balke (many)
- Dominator VN technique (Simpson, 1996)
All of these propagate along forward edges; none are global methods.
Global methods:
- Classic CSE (Cocke, 1970)
- Partitioning algorithms (Alpern et al., 1988; Click, 1995)
- Partial Redundancy Elimination (Morel & Renvoise, 1979)
- SCC/VDCM (Simpson, 1996)
We will look at several global methods.

Roadmap
To recap: we have seen value numbering at local, superlocal, and dominator scopes; we may look at global value numbering later. We will look at global redundancy elimination today.
We have used dominators:
- Chapter 9 in EaC shows how to compute dominators
- It gives the world's fastest algorithm (also the simplest)
We have used SSA:
- Chapter 9 in EaC shows how to build SSA form & tear it down

Review
So far, we have seen:
- Local value numbering: finds redundancy, constants, & identities in a block
- Superlocal value numbering: extends local value numbering to EBBs; uses an SSA-like name space to simplify the bookkeeping
- Dominator value numbering: extends the scope to almost global (no back edges); uses dominance information to handle join points in the CFG
Still to come:
- Global common subexpression elimination (GCSE)
- Applying data-flow analysis to the problem

Using Available Expressions for GCSE
The goal: find common subexpressions whose range spans basic blocks, and eliminate unnecessary re-evaluations.
Safety:
- Available expressions proves that the replacement value is current
- The transformation must ensure the right name-to-value mapping
Profitability:
- Don't add any evaluations
- Add some copy operations
  - Copies are inexpensive
  - Many copies coalesce away
  - Copies can shrink or stretch live ranges

Computing Available Expressions
For each block b:
- Let AVAIL(b) be the set of expressions available on entry to b
- Let EXPRKILL(b) be the set of expressions killed in b
- Let DEEXPR(b) be the set of expressions defined in b and not subsequently killed in b
Now, AVAIL(b) can be defined as:
  AVAIL(b) = ∩_{x ∈ preds(b)} ( DEEXPR(x) ∪ (AVAIL(x) − EXPRKILL(x)) )
where preds(b) is the set of b's predecessors in the control-flow graph.
This system of simultaneous equations forms a data-flow problem; solve it with a data-flow algorithm.

Using Available Expressions for GCSE: The Method
1. ∀ block b, compute DEEXPR(b) (expressions defined in b and exposed downward) and EXPRKILL(b) (expressions killed in b)
2. ∀ block b, compute AVAIL(b)
3. ∀ block b, value number the block starting from AVAIL(b)
4. Replace expressions in AVAIL(b) with references
Two key issues:
- Computing AVAIL(b)
- Managing the replacement process
We'll look at the replacement issue first. Assume, w.l.o.g., that we can compute available expressions for a procedure; this annotates each basic block b with a set AVAIL(b) that contains all expressions available on entry to b.

Global CSE (replacement step)
Managing the name space: we need a unique name for each e ∈ AVAIL(b).
1. Can generate names as replacements are done (Fortran H)
2. Can compute a static mapping
3. Can encode value numbers into names (Briggs, 1994)
Trade-offs:
1. This works; it is the classic method
2. Fast, but limits replacement to textually identical expressions
3. Requires more analysis (VN), but yields more CSEs
Assume, w.l.o.g., solution 2.

Global CSE (replacement step, strategy two)
Compute a static mapping from expression to name:
- After analysis & before transformation: ∀ b, ∀ e ∈ AVAIL(b), assign e a global name by hashing on e
- During the transformation step:
  - At an evaluation of e, insert the copy name(e) ← e
  - At a reference to e, replace e with name(e)
The major problem with this approach is that it inserts extraneous copies, at all definitions and uses of any e ∈ AVAIL(b), ∀ b. But those extra copies are dead and easy to remove, and the useful ones often coalesce away.
A common strategy: insert copies that might be useful and let DCE sort them out. This simplifies both design & implementation. (A sketch follows.)
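A sketch of strategy two's mechanics, assuming the tuple IR from earlier; global_name is the illustrative static hash from expression to temporary. Note how it freely inserts the extraneous copies the slide describes, leaving cleanup to DCE and coalescing.

    def global_name(expr):
        op, a1, a2 = expr
        return f"t_{op}_{a1}_{a2}"           # one static temp per expression

    def replace_block(block, avail):
        """Rewrite one block given AVAIL(b). Purely textual for simplicity;
        the lecture's method runs local VN seeded with AVAIL(b) instead."""
        out = []
        for dest, op, a1, a2 in block:
            expr = (op, a1, a2)
            if expr in avail:                # redundant: reference the temp
                out.append((dest, "copy", global_name(expr), None))
            else:                            # evaluation: also copy into the temp
                out.append((dest, op, a1, a2))
                out.append((global_name(expr), "copy", dest, None))
                avail = avail | {expr}       # textually available from here on
        return out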

An Aside on Dead Code Elimination
What does "dead" mean?
- Useless code: the result is never used
- Unreachable code: code that cannot execute
- Both are lumped together as "dead"
To perform DCE:
- Must have a global mechanism to recognize usefulness
- Must have a global mechanism to eliminate unneeded stores
- Must have a global mechanism to simplify control-flow predicates
All of these will come later in the course.

Global CSE
Now a three-step process:
1. Compute AVAIL(b), ∀ block b
2. Assign unique global names to the expressions in AVAIL(b)
3. Perform replacement with local value numbering
Earlier in the lecture, we assumed, without loss of generality, that we can compute available expressions for a procedure, annotating each basic block b with a set AVAIL(b) of all expressions available on entry to b. Now, we need to make good on that assumption.

Computing Available Expressions: The Big Picture
1. Build a control-flow graph
2. Gather the initial (local) data: DEEXPR(b) & EXPRKILL(b)
3. Propagate information around the graph, evaluating the equation
4. Post-process the information to make it useful (if needed)
All data-flow problems are solved, essentially, this way.

Computing Available Expressions
For each block b:
- Let AVAIL(b) be the set of expressions available on entry to b
- Let EXPRKILL(b) be the set of expressions killed in b
- Let DEEXPR(b) be the set of expressions defined in b and not subsequently killed in b
Now, AVAIL(b) can be defined as:
  AVAIL(b) = ∩_{x ∈ preds(b)} ( DEEXPR(x) ∪ (AVAIL(x) − EXPRKILL(x)) )
where preds(b) is the set of b's predecessors in the control-flow graph.
This system of simultaneous equations forms a data-flow problem; solve it with a data-flow algorithm.

Using Available Expressions for GCSE: The Big Picture
1. ∀ block b, compute DEEXPR(b) and EXPRKILL(b)
2. ∀ block b, compute AVAIL(b)
3. ∀ block b, value number the block starting from AVAIL(b)
4. Replace expressions in AVAIL(b) with references

Computing Available Expressions
The first step is to compute DEEXPR & EXPRKILL. Assume a block b with operations o1, o2, ..., ok.

  VARKILL ← Ø
  DEEXPR(b) ← Ø
  for i = k to 1                  // backward through the block, O(k) steps
      assume o_i is "x ← y + z"
      add x to VARKILL
      if (y ∉ VARKILL) and (z ∉ VARKILL)
          then add "y + z" to DEEXPR(b)

  EXPRKILL(b) ← Ø
  for each expression e           // O(N) steps, N is the number of operations
      for each variable v ∈ e
          if v ∈ VARKILL(b)
              then EXPRKILL(b) ← EXPRKILL(b) ∪ {e}

Many data-flow problems have initial information that costs less to compute.
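The same gathering step as runnable Python, over the tuple IR used earlier; local_sets and all_exprs are illustrative names.

    def local_sets(block, all_exprs):
        """Compute DEEXPR(b) and EXPRKILL(b) for one block.
        all_exprs: every (op, arg1, arg2) expression in the procedure."""
        varkill, de_expr = set(), set()
        for dest, op, a1, a2 in reversed(block):     # backward through the block
            varkill.add(dest)
            if a1 not in varkill and a2 not in varkill:
                de_expr.add((op, a1, a2))            # downward-exposed expression
        # an expression is killed if either operand is redefined in b
        expr_kill = {e for e in all_exprs
                     if e[1] in varkill or e[2] in varkill}
        return de_expr, expr_kill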

Computing Available Expressions
The worklist iterative algorithm:

  Worklist ← { all blocks b_i }
  while (Worklist ≠ Ø)
      remove a block b from Worklist
      recompute AVAIL(b) as
          AVAIL(b) = ∩_{x ∈ preds(b)} ( DEEXPR(x) ∪ (AVAIL(x) − EXPRKILL(x)) )
      if AVAIL(b) changed
          then Worklist ← Worklist ∪ successors(b)

This finds the fixed-point solution to the equation for AVAIL. That solution is unique, and it is identical to the meet-over-all-paths solution. How do we know these things?
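And the worklist itself, again as a hedged sketch over the same representation; initializing every non-entry block to the full expression universe lets the intersections shrink monotonically to the fixed point.

    def avail_sets(blocks, preds, succs, entry, de_expr, expr_kill, all_exprs):
        avail = {b: set(all_exprs) for b in blocks}  # optimistic start
        avail[entry] = set()                         # nothing computed before n0
        worklist = set(blocks)
        while worklist:
            b = worklist.pop()
            if not preds[b]:                         # the entry node
                continue
            new = set.intersection(
                *(de_expr[p] | (avail[p] - expr_kill[p]) for p in preds[b]))
            if new != avail[b]:
                avail[b] = new
                worklist |= set(succs[b])            # successors must be revisited
        return avail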

Data-flow Analysis
Data-flow analysis is a collection of techniques for compile-time reasoning about the run-time flow of values.
It almost always involves building a graph:
- Problems are trivial on a basic block
- Global problems use the control-flow graph (or a derivative)
- Whole-program problems use the call graph (or a derivative)
It is usually formulated as a set of simultaneous equations (a data-flow problem):
- Sets attached to the nodes and edges
- A lattice (or semilattice) to describe the values
The desired result is usually the meet-over-all-paths solution:
- What is true on every path from the entry?
- Can this happen on any path from the entry?
- Related to the safety of optimization

Data-flow Analysis
Limitations:
1. Precision: "up to symbolic execution"; we assume all paths are taken
2. Solution: we cannot afford to compute the MOP solution; there is a large class of problems where MOP = MFP = LFP, but not all problems of interest are in this class
3. Arrays: treated naively in classical analysis; the whole array is represented with a single fact
4. Pointers: difficult (and expensive) to analyze; imprecision rapidly adds up; we need to ask the right questions
Summary: the good news is that simple problems can carry us pretty far. For scalar values, we can quickly solve simple problems.

Computing Available Expressions
  AVAIL(b) = ∩_{x ∈ preds(b)} ( DEEXPR(x) ∪ (AVAIL(x) − EXPRKILL(x)) )
where AVAIL(x) − EXPRKILL(x) keeps only the expressions not killed in x, and DEEXPR(x) is the set of downward-exposed expressions in x (defined and not subsequently killed in x).
Initial condition: AVAIL(n0) = Ø, because nothing is computed before n0; n0 has no predecessor. The other nodes' AVAIL sets will be computed over their predecessors.

Making Theory Concrete
Computing AVAIL for the example CFG:

  Block   DEEXPR               EXPRKILL
  A       {a+b}                { }
  B       {c+d}                { }
  C       {a+b, c+d}           { }
  D       {b+18, a+b, e+f}     {e+f}
  E       {a+17, c+d, e+f}     {e+f}
  F       {a+b, c+d, e+f}      { }
  G       {a+b, c+d}           { }

  AVAIL(A) = Ø
  AVAIL(B) = {a+b} ∪ (Ø − { }) = {a+b}
  AVAIL(C) = {a+b}
  AVAIL(D) = {a+b, c+d} ∪ ({a+b} − { }) = {a+b, c+d}
  AVAIL(E) = {a+b, c+d}
  AVAIL(F) = [{b+18, a+b, e+f} ∪ ({a+b, c+d} − {e+f})]
           ∩ [{a+17, c+d, e+f} ∪ ({a+b, c+d} − {e+f})]
           = {a+b, c+d, e+f}
  AVAIL(G) = [{c+d} ∪ ({a+b} − { })]
           ∩ [{a+b, c+d, e+f} ∪ ({a+b, c+d, e+f} − { })]
           = {a+b, c+d}
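Feeding the table above to the solver sketched earlier reproduces the hand computation (expressions as plain strings here, for brevity):

    de = {"A": {"a+b"}, "B": {"c+d"}, "C": {"a+b", "c+d"},
          "D": {"b+18", "a+b", "e+f"}, "E": {"a+17", "c+d", "e+f"},
          "F": {"a+b", "c+d", "e+f"}, "G": {"a+b", "c+d"}}
    kill = {"A": set(), "B": set(), "C": set(), "D": {"e+f"},
            "E": {"e+f"}, "F": set(), "G": set()}
    preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["C"],
             "E": ["C"], "F": ["D", "E"], "G": ["B", "F"]}
    succs = {"A": ["B", "C"], "B": ["G"], "C": ["D", "E"],
             "D": ["F"], "E": ["F"], "F": ["G"], "G": []}
    exprs = {"a+b", "c+d", "e+f", "b+18", "a+17"}

    avail = avail_sets("ABCDEFG", preds, succs, "A", de, kill, exprs)
    # avail["F"] == {"a+b", "c+d", "e+f"} and avail["G"] == {"a+b", "c+d"}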

Example: Eliminating Unneeded Stores
Transformation: if a value e is in a register, we have seen its last definition, and e is never used again, then the store of e is dead (except for debugging) and the compiler can eliminate it.
Data-flow problem: live variables. The form of the flow function f is the same as in AVAIL:
  LIVE(b) = ∪_{s ∈ succ(b)} ( USED(s) ∪ (LIVE(s) − DEF(s)) )
where:
- LIVE(b) is the set of variables live on exit from b
- DEF(b) is the set of variables that are redefined in b
- USED(b) is the set of variables used before redefinition in b
Live analysis is a backward flow problem. LIVE plays an important role in both register allocation and the pruned-SSA construction.
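As a final sketch, the same worklist scheme solves LIVE by running backward: union replaces intersection, and successors and predecessors swap roles. USED and DEF are assumed gathered per block.

    def live_sets(blocks, succs, preds, used, defs):
        live = {b: set() for b in blocks}            # pessimistic start: Ø
        worklist = set(blocks)
        while worklist:
            b = worklist.pop()
            new = set()
            for s in succs[b]:
                new |= used[s] | (live[s] - defs[s]) # facts flow from successors
            if new != live[b]:
                live[b] = new
                worklist |= set(preds[b])            # backward: revisit the preds
        return live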