Redundant Computation Elimination Optimizations. Redundancy Elimination. Value Numbering CS2210

Similar documents
Register Allocation (wrapup) & Code Scheduling. Constructing and Representing the Interference Graph. Adjacency List CS2210

register allocation saves energy register allocation reduces memory accesses.

Code generation for modern processors

Code generation for modern processors

CSC D70: Compiler Optimization Register Allocation

Register Allocation. Global Register Allocation Webs and Graph Coloring Node Splitting and Other Transformations

Outline. Register Allocation. Issues. Storing values between defs and uses. Issues. Issues P3 / 2006

Compiler Design. Register Allocation. Hwansoo Han

Rematerialization. Graph Coloring Register Allocation. Some expressions are especially simple to recompute: Last Time

Compiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7

Low-Level Issues. Register Allocation. Last lecture! Liveness analysis! Register allocation. ! More register allocation. ! Instruction scheduling

Register allocation. Register allocation: ffl have value in a register when used. ffl limited resources. ffl changes instruction choices

Compiler Design. Fall Control-Flow Analysis. Prof. Pedro C. Diniz

Calvin Lin The University of Texas at Austin

Induction Variable Identification (cont)

Compiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7

Global Register Allocation - Part 2

Today More register allocation Clarifications from last time Finish improvements on basic graph coloring concept Procedure calls Interprocedural

Compiler Passes. Optimization. The Role of the Optimizer. Optimizations. The Optimizer (or Middle End) Traditional Three-pass Compiler

CS 406/534 Compiler Construction Putting It All Together

Register allocation. instruction selection. machine code. register allocation. errors

A Bad Name. CS 2210: Optimization. Register Allocation. Optimization. Reaching Definitions. Dataflow Analyses 4/10/2013

Global Register Allocation

Tour of common optimizations

An Overview of GCC Architecture (source: wikipedia) Control-Flow Analysis and Loop Detection

Lecture 6. Register Allocation. I. Introduction. II. Abstraction and the Problem III. Algorithm

Register allocation. Overview

Loop Optimizations. Outline. Loop Invariant Code Motion. Induction Variables. Loop Invariant Code Motion. Loop Invariant Code Motion

Global Register Allocation - Part 3

What Do Compilers Do? How Can the Compiler Improve Performance? What Do We Mean By Optimization?

Lecture 3 Local Optimizations, Intro to SSA

CS 701. Class Meets. Instructor. Teaching Assistant. Key Dates. Charles N. Fischer. Fall Tuesdays & Thursdays, 11:00 12: Engineering Hall

Intermediate Representations. Reading & Topics. Intermediate Representations CS2210

Loop Invariant Code Motion. Background: ud- and du-chains. Upward Exposed Uses. Identifying Loop Invariant Code. Last Time Control flow analysis

More Code Generation and Optimization. Pat Morin COMP 3002

Using Static Single Assignment Form

Calvin Lin The University of Texas at Austin

Lecture 21 CIS 341: COMPILERS

Register Allocation 3/16/11. What a Smart Allocator Needs to Do. Global Register Allocation. Global Register Allocation. Outline.

G Programming Languages - Fall 2012

A main goal is to achieve a better performance. Code Optimization. Chapter 9

Introduction to Optimization Local Value Numbering

Register Allocation. CS 502 Lecture 14 11/25/08

Middle End. Code Improvement (or Optimization) Analyzes IR and rewrites (or transforms) IR Primary goal is to reduce running time of the compiled code

Why Global Dataflow Analysis?

CS 406/534 Compiler Construction Instruction Selection and Global Register Allocation

Register Allocation & Liveness Analysis

Goals of Program Optimization (1 of 2)

Global Register Allocation via Graph Coloring

Topic 12: Register Allocation

CSE P 501 Compilers. Register Allocation Hal Perkins Autumn /22/ Hal Perkins & UW CSE P-1

CS577 Modern Language Processors. Spring 2018 Lecture Optimization

Compiler Construction 2010/2011 Loop Optimizations

Register Allocation via Hierarchical Graph Coloring

Control flow graphs and loop optimizations. Thursday, October 24, 13

CS553 Lecture Dynamic Optimizations 2

Lecture Notes on Loop Optimizations

Compiler Optimization

CSC D70: Compiler Optimization LICM: Loop Invariant Code Motion

Appendix Set Notation and Concepts

Interprocedural Analysis. Motivation. Interprocedural Analysis. Function Calls and Pointers

Fall Compiler Principles Lecture 12: Register Allocation. Roman Manevich Ben-Gurion University

Languages and Compiler Design II IR Code Optimization

Topic I (d): Static Single Assignment Form (SSA)

Control Flow Analysis. Reading & Topics. Optimization Overview CS2210. Muchnick: chapter 7

CSC D70: Compiler Optimization

Compiler Optimization Techniques

Register Allocation. Register Allocation. Local Register Allocation. Live range. Register Allocation for Loops

The C2 Register Allocator. Niclas Adlertz

Calvin Lin The University of Texas at Austin

LECTURE 19. Subroutines and Parameter Passing

Compilers and Code Optimization EDOARDO FUSELLA

7. Optimization! Prof. O. Nierstrasz! Lecture notes by Marcus Denker!

G Programming Languages Spring 2010 Lecture 4. Robert Grimm, New York University

SSA Construction. Daniel Grund & Sebastian Hack. CC Winter Term 09/10. Saarland University

Machine-Independent Optimizations

Loops. Lather, Rinse, Repeat. CS4410: Spring 2013

Introduction to optimizations. CS Compiler Design. Phases inside the compiler. Optimization. Introduction to Optimizations. V.

Optimization Prof. James L. Frankel Harvard University

CS 4110 Programming Languages & Logics. Lecture 35 Typed Assembly Language

Code generation and local optimization

Lecture 8: Induction Variable Optimizations

EECS 583 Class 8 Classic Optimization

April 15, 2015 More Register Allocation 1. Problem Register values may change across procedure calls The allocator must be sensitive to this

Lecture Outline. Intermediate code Intermediate Code & Local Optimizations. Local optimizations. Lecture 14. Next time: global optimizations

Code generation and local optimization

CSE P 501 Compilers. SSA Hal Perkins Spring UW CSE P 501 Spring 2018 V-1

Compiler Construction 2016/2017 Loop Optimizations

Compiler Design. Fall Data-Flow Analysis. Sample Exercises and Solutions. Prof. Pedro C. Diniz

CS558 Programming Languages Winter 2018 Lecture 4a. Andrew Tolmach Portland State University

Building a Runnable Program and Code Improvement. Dario Marasco, Greg Klepic, Tess DiStefano

CSE 501: Compiler Construction. Course outline. Goals for language implementation. Why study compilers? Models of compilation

Lecture 25: Register Allocation

CS558 Programming Languages

CSE 501 Midterm Exam: Sketch of Some Plausible Solutions Winter 1997

Reuse Optimization. LLVM Compiler Infrastructure. Local Value Numbering. Local Value Numbering (cont)

Static Single Assignment Form in the COINS Compiler Infrastructure

Register Allocation (via graph coloring) Lecture 25. CS 536 Spring

Intermediate Code & Local Optimizations. Lecture 20

Quality and Speed in Linear-Scan Register Allocation

Transcription:

Redundant Computation Elimination Optimizations CS2210 Lecture 20 Redundancy Elimination Several categories: Value Numbering local & global Common subexpression elimination (CSE) local & global Loop-invariant code motion Partial redundancy elimination Most complex Subsumes CSE & loop-invariant code motion Value Numbering Goal: identify expressions that have same value Approach: hash expression to a hash code Then have to compare only expressions that hash to same value 1

Example a := x y b := x y t1:=!z if t1 goto L1 x :=!z c := x & y t2 := x & y if t2 trap 30 Global Value Numbering Generalization of value numbering for procedures Requires code to be in SSA form Crucial notion: congruence Two variables are congruent if their defining computations have identical operators and congruent operands Only precise in SSA form, otherwise the definition may not dominate the use Value Graphs Labeled, directed graph Nodes are labeled with Operators Function symbols Constants Edges point from operator (or function) to its operands Number labels indicate operand position 2

Example entry receive n 1 (val) i 1 := 1 j 1 := 1 i 3 := φ 2 (i 1,i 2 ) j 3 := φ 2 (j 1,j 2 ) i 3 mod 2 = 0 B1 B2 B3 i 4 := i 3 +1 j 4 := j 3 +1 i 5 := i 3 +3 j 5 := j 3 +3 B4 i 2 := φ 5 (i 4,i 5 ) j 3 := φ 5 (j 4,j 5 ) j 2 > n 1 B5 exit Congruence Definition Maximal relation on the value graph Two nodes are congruent iff They are the same node, or Their labels are constants and the constants are the same, or They have the same operators and their operands are congruent Variable equivalence := x and y are equivalent at program point p iff they are congruent and their defining assignments dominate p Global Value Numbering Algo. Fixed point algorithm Partition nodes into congruent sets Initial partition: nodes with same label are in same partition Iterate: split partitions where operands are not congruent 3

Global CSE Data flow problem: available expressions Lattice Direction Gen & Kill sets Distributive? Example Gen(a := a + b) =? Gen (x := y op z ) =? Gen(x relop y) =? entry c := a+b d := a*c e := d*d i := 1 f[i] := a+b c := c*2 c > d g[i] := a*c g[i] := d*d i := i+1 i >10 exit Loop-invariant Code Motion Goal: recognize computations that compute same value every iteration Move out of loop and compute just once Commonly useful for array subscript computations, e.g., a[j] in loop over I Two steps Analysis: recognize invariant computations Transformation: Code hoisting if used within loop Code sinking if used after loop 4

Algorithm Base cases: expression is a constant is a variable all of whose definitions are outside of loop Inductive cases Idempotent computation all of whose arguments are loop-invariant It s a variable use with only one reaching definition and the rhs of that def is loop-invariant Example See board Code Motion When is code motion of invariant computation S: z := x op y to loop preheader legal? Sufficient conditions S dominates all loop exits Otherwise may execute S when never executed otherwise Can relax if S idempotent (possibly slowing down program) S is only assignment to z in loop & no use of z in loop is reached by any def other than S Otherwise may reorder defs/uses Unnecessary in SSA form 5

Loop & Procedure Call Optimization Induction Variable Optimizations Classes basic = variables explicitly modified by same constant during each iteration dependent induction variables = computed from basic induction variables Identification: Find basic first: variables modified i := i+d or i := d+i Find dependent variables using patterns, i basic induction variable j := i *e, j := e*i j := i+e, j := e+i j := i-e, j := e-i j := -i Strength Reduction for IVs Given j = b*i +c want to avoid repeated multiplication Algorithm: create new temporary tj and replace j := b*i+c by j := tj after each assignment i := i+d insert tj := tj+db (db = result of d *b) if just loop-invariant create db := b*d in preheader initialize tj at end of preheader: tj := b *i; tj := tj+c replace uses of j by tj in loop 6

Linear Function Test Replacement Induction variables may (become) useless eliminate Important case: linear function test replacement loop induction variable used only in test and can be replaced by other test LFT Replacement j = b*i+c Replace i relop v with tj := b*v tj := tj+c j relop tj can then eliminate induction variable i Correct in all cases? Procedure Call Optimization Procedure calls can be costly direct call costs: call, return, argument & result passing, stack frame maintenance indirect call costs: (opportunity cost) damage to intraprocedural analysis of caller and callee Optimization techniques hardware support inlining tail call optimization interprocedural analysis procedure specialization 7

Inlining (aka Procedure Integration) Replace call with body of callee insert assignments for actual/formal mapping use copy propagation to eliminate copies manage variable scoping correctly! eg. rename local variables or tag names with scopes Pros & cons eliminates call overhead, parameter passing and result returning overheads can optimize callee in context of caller and vice versa can slow down compilation & increase code space Implementation Issues Within compilation unit or across? caller and callee in same language? parameter passing conventions may be different Should copies of inlined functions be kept? should we compile a copy? Should recursive procedures be inlined? What & where should be inlined? Considerations callee size call frequency benefit from inlining Criteria estimated or actual call frequencies may do inline trial inline and optimize to check for benefit if not big enough do not acutally inline In practice: profilebased inlining is much better than static estimates can get very good speedups (Ayers et al.) found 1.3 average up to 2 no conclusive impact on i-cache behavio 8

Tail Call Elimination Tail call = last thing executed before return is a call return f(n) is tail call return n * f(n-1) is not can jump to callee rather than call splice out on stack frame creation and tear down (callee reuses caller s frame & return address) effect on debugging Tail Recursion Elimination Tail call is self-recursive can turn recursion into iteration Extremely important optimization for functional languages (e.g., Scheme) since all iteration is expressed recursively Implementation replace call by parameter assignment branch to beginning of procedure body eliminate return following the recursive call Example void insert_node(int n, struct node* l) { if (n > l->value) if (l->next == nil) make_node(l,n); else insert_node(n,l->next); } void insert_node(int n, stuct node*l) { } loop: if (n>l->value) if (l->next == nil) make_node(l,n); else {l := l->next; goto loop;} 9

Leaf Routine Optimization Goal: simplify prologue and epilogue code for procedures that do not call others e.g. don t have to save / restore callersaved registers works only if there are no calls at all (otherwise procedure not a leaf) Shrink-Wrapping Generalization of leaf procedure optimization try to move prologue and epilogue code close to call to execute it only when necessary Register Allocation 10

Reading & Topics Chapter 16 Topics Register allocation methods Problem Assign machine resources (registers & stack locations) to hold run-time data Constraint simultaneously live data must be allocated to different locations Goal minimize overhead of stack loads & stores and register moves Solution Central insight: can be formulated as graph coloring problem Chaitin-style register allocation (1981) for IBM 390 PL/I compiler represent interference (= simultaneous liveness) as graph color with minimal number of colors Alternative bin-packing (used in DEC GEM compiler for Alpha) equally good in practice 11

Interference Graph Data structure to represent simultaneously live objects nodes are units of allocation variables better: webs edges represent simultaneously live property symmetric, not transitive Global Register Allocation Algorithm Allocate objects that can be register allocated (variables, temporaries that fit, large constants) to symbolic registers s 1... s n Determine which should be register allocation candidates (simplest case all) Construct interference graph allocatable objects and available registers are nodes use arcs to indicate interference and other conflicts (e.g., floating point values and integer registers) Construct k-coloring k = number available registers Allocate object to register of same color Example x := 2 y := 4 w := x+y z := x=1 u := x*y x := z*2 assume y & w dead on exit 12

Webs webs = maximal intersecting du-chains for a variable separates different uses of same variable, e.g., i used as loop index in different loops useful when no SSA-form is used (SSA form achieves same effect automatically) Web Example def y def x def y def x use y use x use y use x def x use x Constructing and Representing the Interference Graph Construction alternatives: as side effect of live variables analysis (when variables are units of allocation) compute du-chains & webs (or ssa form), do live variables analysis, compute interference graph Representation adjacency matrix: A[min(i,j), max(i,j)] = true iff (symbolic) register i adjacent to j (interferes) 13

Adjacency List Used in actual allocation A[i] lists nodes adjacent to node i Components color disp: displacement (in stack for spilling) spcost: spill cost (initialized 0 for symbolic, infinity for real registers) nints: number of interferences adjnds: list of real and symbolic registers currently interfere with i rmvadj: list of real and symbolic registers that interfered withi bu have been removed Register Coalescing Goal: avoid unnecessary register to register copies by coalescing register ensure that values are in proper argument registers before procedure calls remove unnecessary copies introduced by code generation from SSA form enforce source / target register constraints of certain instructions (important for CISC) Approach: search for copies s i := s j where s i and s j do not interfere (may be real or symbolic register copies) Computing Spill Costs Have to spill values to memory when not enough registers can be found (can t find k- coloring) Why webs to spill? least frequently accessed variables most conflicting Sometimes can rematerialize instead: = recompute value from other register values instead of store / load into memory (Briggs: in practice mixed results) 14

Spill Cost Computation defwt * Σ 10 depth(def) + usewt * Σ 10 depth(use) - copywt * Σ 10 depth(copy) ν defwt / usewt / copywt costs relative weights assigned to instructions ν def, use, copy are individual definitions /uses/ copies ν frequency estimated by loop nesting depth Coloring the Graph e a1 a2 b weight order: c d a2 b a1 d c e Assume 3 registers available Graph Pruning Improvement #1 (Chaitin, 1982) Nodes with <k edges can be colored after all other nodes and still be guaranteed registers So remove <k-degree nodes first this may reduce the degree of remaining graph and make it colorable 15

Algorithm while interference graph not empty while there is a node with <k neighbors remove it from graph, push on stack if all remaining nodes have >= k neighbors, then blocked: pick a node to spill (lowest cost) remove from graph, add to spill set if any nodes in spill set: inset spill codes for all spilled nodes, reconstruct interference graph and start over while stack not empty pop node from stack, allocate to register Coloring the Graph with pruning weight order: e a1 a2 c d a2 b b a1 d c e 1. Assume 3 registers available 2. Assume 2 registers available An Annoying Case A D B If only 2 register available: blocked must spill! C 16

Improvement #2: blocked!= spill (Briggs et al. 1989) Idea: just because node has k neighbors doesn t mean it will be spilled (neighbors can have overlapping colors) Algorithm: like Chaitin, except when removing blocked node, just push onto stack ( optimistic spilling ) when done removing nodes, pop nodes off stack and see if they can be allocated really spill if cannot be allocated at this stage Improvement #3: Prioritybased Coloring (Chow & Hennessy 1984) Live-range splitting when variable cannot be register-allocated, split into multiple subranges that can be allocated separately move instructions inserted at split points some live ranges in registers, some in memory selective spilling Based on variable live ranges can result in more conservative inteference graph than webs live range = set of basic blocks a variables is live in Live Range Example x := a+b y := x+c if y = 0 goto L1 z := y+d w := z L1: variables x,y,z,w x y z w 17

Improvement #4: Rematerialization Idea: instead of reloading value from memory, recompute instead if recomputation cheaper than reloading Simple strategy choose rematerialization over spilling if can recompute a value in a single instruction, and all operands will always be available Examples constants, address of a global, address of variable in stack frame Evaluation Rematerialization showed a reduction of -26% to 33% in spills Optimistic spilling showed a reduction of -2% to 48% Priority-based coloring may often not be worthwhile appears to be more expensive in practice than optimistic spilling 18