Introduction to Optimization, Instruction Selection and Scheduling, and Register Allocation

Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.

Traditional Three-pass Compiler

Source Code -> Front End -> IR -> Middle End -> IR -> Back End -> Machine code (errors reported from every phase)

Code Improvement (or Optimization)
- Analyzes IR and rewrites (or transforms) IR
- Primary goal is to reduce the running time of the compiled code
- May also improve space, power consumption, etc.
- Must preserve the meaning of the code, as measured by the values of named variables
- A course (or two) unto itself

The Optimizer (or Middle End)

Source Code -> Front End -> IR -> Middle End -> IR -> Back End -> Machine code

Internally, the middle end is a chain of passes: IR -> Opt 1 -> IR -> Opt 2 -> IR -> Opt 3 -> ... -> Opt n -> IR

Modern optimizers are structured as a series of passes, each of which consumes and produces IR; any pass may report errors.

The Optimizer (or Middle End): Typical Transformations
- Discover & propagate some constant value
- Move a computation to a less frequently executed place
- Specialize some computation based on context
- Discover a redundant computation & remove it
- Remove useless or unreachable code
- Encode an idiom in some particularly efficient form

The Role of the Optimizer
- The compiler can implement a procedure in many ways
- The optimizer tries to find an implementation that is better: speed, code size, data space, ...

To accomplish this, it
- Analyzes the code to derive knowledge about run-time behavior
  - Data-flow analysis, pointer disambiguation, ...
  - The general term is static analysis
- Uses that knowledge in an attempt to improve the code

Literally hundreds of transformations have been proposed, with a large amount of overlap between them. There is nothing optimal about optimization: proofs of optimality assume restrictive & unrealistic conditions.

Redundancy Elimination as an Example

An expression x+y is redundant if and only if, along every path from the procedure's entry, it has been evaluated and its constituent subexpressions (x and y) have not been redefined.

If the compiler can prove that an expression is redundant
- It can preserve the results of earlier evaluations
- It can replace the current evaluation with a reference

Two pieces to the problem
- Proving that x+y is redundant
- Rewriting the code to eliminate the redundant evaluation

One technique for accomplishing both is called value numbering.
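As a concrete illustration, here is a minimal sketch of local value numbering in Python over (op, arg1, arg2, result) tuples. The tuple encoding and the 'copy' pseudo-op are assumptions of this sketch, not part of the slides.

  def local_value_numbering(block):
      """Local value numbering over one basic block. Assumes
      single-assignment names, so a recorded holder of a value
      is never overwritten later in the block."""
      vn = {}        # name or expression key -> value number
      holder = {}    # value number -> a name known to hold that value
      counter = 0
      out = []

      def number_of(name):
          nonlocal counter
          if name not in vn:
              vn[name] = counter
              counter += 1
          return vn[name]

      for op, a, b, r in block:
          lhs, rhs = number_of(a), number_of(b)
          if op in ('add', 'mult'):            # commutative: canonical order
              lhs, rhs = min(lhs, rhs), max(lhs, rhs)
          key = (op, lhs, rhs)
          if key in vn:
              # redundant: replace the evaluation with a reference
              vn[r] = vn[key]
              out.append(('copy', holder[vn[r]], None, r))
          else:
              vn[key] = counter
              counter += 1
              vn[r] = vn[key]
              out.append((op, a, b, r))
          holder[vn[r]] = r
      return out

For example, on [('add','x','y','t1'), ('mult','t1','z','t2'), ('add','y','x','t3')] the pass rewrites the third operation as ('copy', 't1', None, 't3'): x+y has already been evaluated and neither x nor y has been redefined.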

Handling Larger Scopes: Extended Basic Blocks
- Initialize the table for block b_i with the table from its predecessor b_{i-1}
- With single-assignment naming, we can use a scoped hash table; otherwise, it is complex

(The slide's figure shows a CFG in which the EBB rooted at b1 contains b2 and b3, with b4 following b2; b5 and b6 have multiple predecessors and so head their own EBBs.)

The plan:
- Process b1, b2, b4
- Pop two levels
- Process b3 relative to b1
- Start clean with b5
- Start clean with b6

Using a scoped table makes processing the full tree of EBBs that share a common header efficient.
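The scoped hash table the slide relies on is easy to sketch. Below is a minimal Python version with push/pop of scopes, plus the recursive EBB-tree walk that produces exactly the plan above; the class and function names are illustrative, not from the slides.

  class ScopedTable:
      """A stack of hash tables; lookups search innermost scope first."""
      def __init__(self):
          self.scopes = [{}]
      def push(self):
          self.scopes.append({})
      def pop(self):
          self.scopes.pop()
      def insert(self, key, value):
          self.scopes[-1][key] = value
      def lookup(self, key):
          for scope in reversed(self.scopes):
              if key in scope:
                  return scope[key]
          return None

  def walk_ebb_tree(block, children, table, visit):
      """Process `block` with facts inherited from its EBB ancestors,
      then recur into each child with a fresh scope."""
      visit(block, table)                  # e.g. value-number this block
      for child in children.get(block, []):
          table.push()                     # child starts from block's table
          walk_ebb_tree(child, children, table, visit)
          table.pop()                      # discard the child's facts

With children = {'b1': ['b2', 'b3'], 'b2': ['b4']}, the walk visits b1, b2, b4, pops back, then visits b3 relative to b1, just as in the plan.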

Handling Larger Scopes
- To go further, we must deal with merge points
  - Our simple naming scheme falls apart in b4, a block with multiple predecessors
  - We need more powerful analysis tools, and the naming scheme becomes SSA
- This requires global data-flow analysis: compile-time reasoning about the run-time flow of values
  1. Build a model of control flow
  2. Pose questions as sets of simultaneous equations
  3. Solve the equations
  4. Use the solution to transform the code
- Examples: LIVE, REACHES, AVAIL

Traditional Three-pass Compiler

Source Code -> Front End -> IR -> Middle End -> IR -> Back End -> Machine code

The back end comprises instruction selection, register allocation, and instruction scheduling.

Instruction Selection: The Problem

Writing a compiler is a lot of work
- Would like to reuse components whenever possible
- Would like to automate construction of components

The goal: automating instruction selection
- Front-end construction is largely automated
- The middle end is largely hand-crafted
- (Parts of) the back end can be automated

Definitions

Instruction selection
- Mapping IR into assembly code
- Assumes a fixed storage mapping & code shape
- Combining operations, using address modes

Instruction scheduling
- Reordering operations to hide latencies
- Assumes a fixed program (set of operations)
- Changes the demand for registers

Register allocation
- Deciding which values will reside in registers
- Changes the storage mapping, may add false sharing
- Concerned with the placement of data & memory operations

The Problem

Modern computers (still) have many ways to do anything. Consider a register-to-register copy in ILOC:
- The obvious operation is i2i ri => rj
- Many others exist: addI ri, 0 => rj; subI ri, 0 => rj; lshiftI ri, 0 => rj; multI ri, 1 => rj; divI ri, 1 => rj; rshiftI ri, 0 => rj; orI ri, 0 => rj; xorI ri, 0 => rj; and others

A human would ignore all of these. An algorithm must look at all of them, find a low-cost encoding, and take context into account (busy functional unit?).

The Goal

We want to automate generation of instruction selectors: a machine description feeds a back-end generator, which produces tables that drive a pattern-matching engine in the back end.

This is description-based retargeting; the machine description can also help with scheduling & allocation.

The Big Picture

Need pattern-matching techniques
- Must produce good code (for some metric of "good")
- Must run quickly

A treewalk code generator runs quickly. How good was the code?

Tree: a mult node whose children are IDENT <a, ARP, 4> and IDENT <b, ARP, 8>

Treewalk code:
  loadI 4 => r5
  loadAO rarp, r5 => r6
  loadI 8 => r7
  loadAO rarp, r7 => r8
  mult r6, r8 => r9

Desired code:
  loadAI rarp, 4 => r5
  loadAI rarp, 8 => r6
  mult r5, r6 => r7

This one is pretty easy to fix. (See the first digression in Chapter 7 of EaC.)

The Big Picture

Need pattern-matching techniques
- Must produce good code (for some metric of "good")
- Must run quickly

A treewalk code generator runs quickly. How good was the code?

Tree: a mult node whose children are IDENT <a, ARP, 4> and NUMBER <2>

Treewalk code:
  loadI 4 => r5
  loadAO rarp, r5 => r6
  loadI 2 => r7
  mult r6, r7 => r8

Desired code:
  loadAI rarp, 4 => r5
  multI r5, 2 => r7

Must combine the load of the constant with the multiply. This is a nonlocal problem.

The Big Picture

Need pattern-matching techniques
- Must produce good code (for some metric of "good")
- Must run quickly

A treewalk code generator can meet the second criterion. How did it do on the first?

Tree: a mult node whose children are IDENT <c, @G, 4> and IDENT <d, @H, 4> (note the common offset)

Treewalk code:
  loadI @G => r5
  loadI 4 => r6
  loadAO r5, r6 => r7
  loadI @H => r8
  loadI 4 => r9
  loadAO r8, r9 => r10
  mult r7, r10 => r11

Desired code:
  loadI 4 => r5
  loadAI r5, @G => r6
  loadAI r5, @H => r7
  mult r6, r7 => r8

Again, a nonlocal problem: the matcher must recognize the common offset.

How do we perform this kind of matching?

Tree-oriented IR suggests pattern matching on trees
- Tree patterns as input, matcher as output
- Each pattern maps to a target-machine instruction sequence
- Use dynamic programming or bottom-up rewrite systems

Linear IR suggests using some sort of string matching
- Strings as input, matcher as output
- Each string maps to a target-machine instruction sequence
- Use text matching or peephole matching

In practice, both work well; the matchers are quite different.
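To make the treewalk baseline concrete, here is a sketch in Python of the naive treewalk code generator from the earlier slides, emitting ILOC-like strings; the tuple node format and the fresh-register counter are assumptions of this sketch. It produces the "treewalk code" column with the separate loadI/loadAO pair: folding the offset into a loadAI is precisely the pattern-matching problem the slide describes.

  import itertools

  regs = itertools.count(5)                 # fresh virtual registers r5, r6, ...

  def gen(node, code):
      """Naive treewalk; returns the register holding node's value."""
      if node[0] == 'IDENT':                # ('IDENT', name, base, offset)
          _, name, base, offset = node
          r_off, r_val = f'r{next(regs)}', f'r{next(regs)}'
          code.append(f'loadI {offset} => {r_off}')
          code.append(f'loadAO {base}, {r_off} => {r_val}')
          return r_val
      if node[0] == 'NUMBER':               # ('NUMBER', value)
          r = f'r{next(regs)}'
          code.append(f'loadI {node[1]} => {r}')
          return r
      op, left, right = node                # binary operator, e.g. ('mult', l, r)
      r1, r2 = gen(left, code), gen(right, code)
      r = f'r{next(regs)}'
      code.append(f'{op} {r1}, {r2} => {r}')
      return r

  code = []
  gen(('mult', ('IDENT', 'a', 'rarp', 4), ('IDENT', 'b', 'rarp', 8)), code)
  print('\n'.join(code))    # reproduces the five-operation treewalk code above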


What Makes Code Run Fast?
- Many operations have non-zero latencies
- Modern machines can issue several operations per cycle
- Execution time is order-dependent (and has been since the 60s)

Assumed latencies (conservative):

  Operation  Cycles
  load       3
  store      3
  loadI      1
  add        1
  mult       2
  fadd       1
  fmult      2
  shift      1
  branch     0 to 8

- Loads & stores may or may not block; non-blocking loads & stores let other operations fill those issue slots
- Branch costs vary with the path taken
- The scheduler should hide the latencies

Example: w <- w * 2 * x * y * z

Simple schedule (2 registers, 20 cycles):

  Cycle  Operation
  1      loadAI r0, @w => r1
  4      add r1, r1 => r1
  5      loadAI r0, @x => r2
  8      mult r1, r2 => r1
  9      loadAI r0, @y => r2
  12     mult r1, r2 => r1
  13     loadAI r0, @z => r2
  16     mult r1, r2 => r1
  18     storeAI r1 => r0, @w
  21     r1 is free

Schedule with loads moved early (3 registers, 13 cycles):

  Cycle  Operation
  1      loadAI r0, @w => r1
  2      loadAI r0, @x => r2
  3      loadAI r0, @y => r3
  4      add r1, r1 => r1
  5      mult r1, r2 => r1
  6      loadAI r0, @z => r2
  7      mult r1, r3 => r1
  9      mult r1, r2 => r1
  11     storeAI r1 => r0, @w
  14     r1 is free

Reordering operations for speed is called instruction scheduling.

Instruction Scheduling (The Engineer's View)

The Problem: Given a code fragment for some target machine and the latencies for each individual operation, reorder the operations to minimize execution time.

The Concept: slow code, together with a machine description, goes into the scheduler; fast code comes out.

The task:
- Produce correct code
- Minimize wasted cycles
- Avoid spilling registers
- Operate efficiently

Instruction Scheduling (The Abstract View)

To capture properties of the code, build a dependence graph G:
- Nodes n in G are operations, with type(n) and delay(n)
- An edge e = (n1, n2) is in G if and only if n2 uses the result of n1

The Code:
  a: loadAI r0, @w => r1
  b: add r1, r1 => r1
  c: loadAI r0, @x => r2
  d: mult r1, r2 => r1
  e: loadAI r0, @y => r2
  f: mult r1, r2 => r1
  g: loadAI r0, @z => r2
  h: mult r1, r2 => r1
  i: storeAI r1 => r0, @w

The Dependence Graph has edges a->b, b->d, c->d, d->f, e->f, f->h, g->h, and h->i; the loads a, c, e, and g are its roots.

Instruction Scheduling (Definitions)

A correct schedule S maps each n in N to a non-negative integer representing its cycle number, such that:
1. S(n) >= 0 for all n in N (obviously)
2. If (n1, n2) in E, then S(n1) + delay(n1) <= S(n2)
3. For each operation type t, there are no more operations of type t in any cycle than the target machine can issue

The length of a schedule S, denoted L(S), is

  L(S) = max over n in N of (S(n) + delay(n))

The goal is to find the shortest possible correct schedule. S is time-optimal if L(S) <= L(S') for every other schedule S'. A schedule might also be optimal in terms of registers, power, or space.
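These definitions translate directly into a checker. A small Python sketch, assuming the schedule, delays, edges, operation types, and per-type issue widths arrive as plain dictionaries (all names here are illustrative):

  from collections import Counter

  def is_correct(S, delay, edges, optype, width):
      """Check the three conditions for a schedule S: op -> cycle."""
      # 1. every operation gets a non-negative cycle number
      if any(S[n] < 0 for n in S):
          return False
      # 2. dependences: n2 cannot start before n1's result is available
      if any(S[n1] + delay[n1] > S[n2] for n1, n2 in edges):
          return False
      # 3. issue limits: per cycle, no more ops of type t than the machine issues
      per_cycle = Counter((S[n], optype[n]) for n in S)
      return all(c <= width[t] for (_, t), c in per_cycle.items())

  def length(S, delay):
      """L(S) = max over n of (S(n) + delay(n))."""
      return max(S[n] + delay[n] for n in S)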

Instruction Scheduling (What's so difficult?)

Critical points:
- All operands must be available
- Multiple operations can be ready
- Moving operations can lengthen register lifetimes
- Placing uses near definitions can shorten register lifetimes
- Operands can have multiple predecessors

Together, these issues make scheduling hard (NP-complete). Local scheduling is the simple case: restricted to straight-line code, with consistent and predictable latencies.

Instruction Scheduling: The big picture
1. Build a dependence graph, P
2. Compute a priority function over the nodes in P
3. Use list scheduling to construct a schedule, one cycle at a time
   a. Use a queue of operations that are ready
   b. At each cycle
      I. Choose a ready operation and schedule it
      II. Update the ready queue

Local list scheduling has been the dominant algorithm for twenty years: a greedy, heuristic, local technique.

Local List Scheduling

  Cycle <- 1
  Ready <- roots of P
  Active <- Ø
  while (Ready ∪ Active ≠ Ø)
      if (Ready ≠ Ø) then
          remove an op from Ready          // removal in priority order
          S(op) <- Cycle
          Active <- Active ∪ {op}
      Cycle <- Cycle + 1
      for each op in Active
          if (S(op) + delay(op) <= Cycle) then    // op has completed execution
              remove op from Active
              for each successor s of op in P
                  if (s is ready) then            // all of s's operands are available
                      Ready <- Ready ∪ {s}
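A runnable rendering of this loop in Python, restricted to one issue per cycle for simplicity; the dictionary-based graph representation and the priority tie-breaking are choices of this sketch, not part of the slides. Fed the dependence graph and the latency-weighted priorities from the next two slides, it reproduces the schedule of the final example.

  import heapq

  def list_schedule(delay, succs, preds, priority):
      """Greedy local list scheduling, one operation issued per cycle."""
      indeg = {n: len(preds[n]) for n in delay}
      ready = [(-priority[n], n) for n in delay if indeg[n] == 0]
      heapq.heapify(ready)                     # highest priority pops first
      S, active, cycle = {}, [], 1
      while ready or active:
          if ready:
              _, op = heapq.heappop(ready)     # removal in priority order
              S[op] = cycle
              active.append(op)
          cycle += 1
          for op in [o for o in active if S[o] + delay[o] <= cycle]:
              active.remove(op)                # op has completed execution
              for s in succs[op]:
                  indeg[s] -= 1
                  if indeg[s] == 0:            # all of s's operands are available
                      heapq.heappush(ready, (-priority[s], s))
      return S

  # The running example: w <- w * 2 * x * y * z
  delay    = dict(a=3, b=1, c=3, d=2, e=3, f=2, g=3, h=2, i=3)
  succs    = dict(a=['b'], b=['d'], c=['d'], d=['f'], e=['f'],
                  f=['h'], g=['h'], h=['i'], i=[])
  preds    = dict(a=[], b=['a'], c=[], d=['b', 'c'], e=[],
                  f=['d', 'e'], g=[], h=['f', 'g'], i=['h'])
  priority = dict(a=13, b=10, c=12, d=9, e=10, f=7, g=8, h=5, i=3)
  print(list_schedule(delay, succs, preds, priority))
  # -> {'a': 1, 'c': 2, 'e': 3, 'b': 4, 'd': 5, 'g': 6, 'f': 7, 'h': 9, 'i': 11}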

Scheduling Example

1. Build the dependence graph

The code and its dependence graph are the ones shown above: the chain a -> b -> d -> f -> h -> i, with the loads c, e, and g feeding d, f, and h respectively.

Scheduling Example

1. Build the dependence graph
2. Determine priorities: the longest latency-weighted path to the end of the code

  Node:     a   b   c   d   e   f   g   h   i
  Priority: 13  10  12  9   10  7   8   5   3

(For example, a's priority is delay(a) + priority(b) = 3 + 10 = 13.)

Scheduling Example

1. Build the dependence graph
2. Determine priorities: longest latency-weighted path
3. Perform list scheduling

  1)  a: loadAI r0, @w => r1
  2)  c: loadAI r0, @x => r2
  3)  e: loadAI r0, @y => r3    // a new register name is used
  4)  b: add r1, r1 => r1
  5)  d: mult r1, r2 => r1
  6)  g: loadAI r0, @z => r2
  7)  f: mult r1, r3 => r1
  9)  h: mult r1, r2 => r1
  11) i: storeAI r1 => r0, @w

Register Allocation

Part of the compiler's back end:

  IR -> Instruction Selection -> (m-register IR) -> Register Allocation -> (k-register IR) -> Instruction Scheduling -> Machine code

Critical properties:
- Produce correct code that uses k (or fewer) registers
- Minimize added loads and stores
- Minimize the space used to hold spilled values
- Operate efficiently: O(n), O(n log2 n), maybe O(n^2), but not O(2^n)

Global Register Allocation

The big picture: m-register code goes into the register allocator; k-register code comes out. Optimal global allocation is NP-complete under almost any assumptions.

At each point in the code:
1. Determine which values will reside in registers
2. Select a register for each such value

The goal is an allocation that minimizes running time.

Most modern global allocators use a graph-coloring paradigm:
- Build a conflict graph (or interference graph)
- Find a k-coloring for the graph, or change the code to a nearby problem that it can k-color

Global Register Allocation

Consider a value x that flows between blocks:

  ... store r4 => x        // end of one block
  load x => r1 ...         // start of a later block

This is an assignment problem, not an allocation problem!

What's harder across multiple blocks?
- Could replace a load with a move
- A good assignment would obviate the move
- Must build a control-flow graph to understand inter-block flow
- Can spend an inordinate amount of time adjusting the allocation

Global Register Allocation

A more complex scenario: a block with multiple predecessors in the control-flow graph, where each predecessor ends with

  ... store r4 => x

and the successor begins with

  load x => r1 ...

What if one predecessor has x in a register, but the other does not?
- Must get the right values in the right registers in each predecessor
- In a loop, a block can be its own predecessor
- This adds tremendous complications

Global Register Allocation

Taking a global approach:
- Abandon the distinction between local & global
- Make systematic use of registers or memory
- Adopt a general scheme to approximate a good allocation

The graph-coloring paradigm (Lavrov and, later, Chaitin):
1. Build an interference graph G_I for the procedure
   - Computing LIVE is harder than in the local case
   - G_I is not an interval graph
2. (Try to) construct a k-coloring
   - Minimal coloring is NP-complete
   - Spill placement becomes a critical issue
3. Map colors onto physical registers

Graph Coloring (A Background Digression)

The problem: a graph G is said to be k-colorable iff its nodes can be labeled with the integers 1 ... k so that no edge in G connects two nodes with the same label.

(The slide's examples show a 2-colorable graph and a 3-colorable graph.)

Each color can be mapped to a distinct physical register.

Building the Interference Graph

What is an interference (or conflict)?
- Two values interfere if there exists an operation where both are simultaneously live
- If x and y interfere, they cannot occupy the same register
- To compute interferences, we must know where values are live

The interference graph, G_I:
- Nodes in G_I represent values, or live ranges
- Edges in G_I represent individual interferences: for x, y in G_I, the edge <x, y> exists iff x and y interfere
- A k-coloring of G_I can be mapped into an allocation to k registers
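To make the construction concrete, here is a sketch of the backward walk that adds interference edges for one block, given its LIVEOUT set. The (defined, uses) pair encoding is an assumption of this sketch, and it omits the usual refinement of not adding an edge between the source and destination of a copy.

  def interference_edges(block, live_out):
      """One basic block of (defined_name, used_names) pairs, walked
      backward: each definition interferes with every other value
      live across it."""
      live = set(live_out)
      edges = set()
      for defd, uses in reversed(block):
          for v in live:
              if v != defd:
                  edges.add(frozenset((defd, v)))
          live.discard(defd)           # dead above its definition
          live.update(uses)            # operands are live before the op
      return edges

  block = [('r1', ['x']), ('r2', ['y']), ('r3', ['r1', 'r2'])]
  print(interference_edges(block, live_out={'r3'}))
  # -> edges r1-r2 and r1-y: each pair is simultaneously live at some op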

Observation on Coloring for Register Allocation
- Suppose you have k registers; look for a k-coloring
- Any vertex n that has fewer than k neighbors in the interference graph (degree(n) < k) can always be colored: pick any color not used by its neighbors; there must be one

Ideas behind Chaitin's algorithm:
- Pick any vertex n such that degree(n) < k and put it on the stack
- Remove that vertex and all edges incident to it from the interference graph
  - This may make some new nodes have fewer than k neighbors
- At the end, if some vertex n still has k or more neighbors, then spill the live range associated with n
- Otherwise, successively pop vertices off the stack and color each in the lowest color not used by some neighbor
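A compact sketch of these simplify/select phases in Python. The spill handling here just records the node and removes it, and the candidate choice is a placeholder (highest degree); Chaitin's actual cost/degree heuristic appears on a later slide.

  def chaitin_color(graph, k):
      """graph: node -> set of neighbors. Returns (colors, spilled)."""
      g = {n: set(adj) for n, adj in graph.items()}
      stack, spilled = [], []
      while g:
          n = next((x for x in g if len(g[x]) < k), None)
          if n is None:                           # no trivially colorable node
              n = max(g, key=lambda x: len(g[x])) # placeholder spill choice
              spilled.append(n)
          else:
              stack.append(n)
          for m in g.pop(n):                      # remove n and its edges
              g[m].discard(n)
      colors = {}
      for n in reversed(stack):                   # pop and color
          used = {colors[m] for m in graph[n] if m in colors}
          colors[n] = min(c for c in range(k) if c not in used)
      return colors, spilled

In a real allocator, a spill decision triggers insertion of spill code and the whole build-simplify-select cycle repeats; this sketch only reports which live ranges it would spill.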

Chaitin's Algorithm in Practice (3 registers)

(A sequence of slides works the algorithm on a small five-node interference graph with k = 3: nodes of degree less than 3 are removed one at a time and pushed onto the stack; once the graph is empty, the nodes are popped off the stack and each is assigned the lowest of the three colors not used by its already-colored neighbors. The graph drawings themselves are not reproduced here.)

Picking a Spill Candidate

When no n in G_I has degree(n) < k, simplify must pick a spill candidate.

Chaitin's heuristic: minimize (spill cost / current degree)
- If LRx has a negative spill cost, spill it pre-emptively: it is cheaper to spill it than to keep it in a register
- If LRx has an infinite spill cost, it cannot be spilled
  - No value dies between its definition & its use
  - No more than k definitions since the last value died (a safety valve)
- The spill cost is the weighted cost of the loads & stores needed to spill x

Bernstein et al. suggest repeating simplify, select, & spill with several different spill-choice heuristics and keeping the best result.
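A minimal sketch of the heuristic with the two special cases called out; the dictionary names are illustrative, and infinite cost is represented here with math.inf.

  import math

  def pick_spill_candidate(nodes, spill_cost, degree):
      """Chaitin's heuristic: choose the live range minimizing
      spill_cost / current degree."""
      for n in nodes:
          if spill_cost[n] < 0:
              return n                  # negative cost: spill pre-emptively
      # infinite-cost live ranges are never chosen: they cannot be spilled
      finite = [n for n in nodes if spill_cost[n] != math.inf]
      return min(finite, key=lambda n: spill_cost[n] / degree[n], default=None)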