Code generation for modern processors

Similar documents
Code generation for modern processors

Compiler Design. Register Allocation. Hwansoo Han

Global Register Allocation via Graph Coloring

Register Allocation. Global Register Allocation Webs and Graph Coloring Node Splitting and Other Transformations

Outline. Register Allocation. Issues. Storing values between defs and uses. Issues. Issues P3 / 2006

Register Allocation via Hierarchical Graph Coloring

CSC D70: Compiler Optimization Register Allocation

Introduction to Optimization, Instruction Selection and Scheduling, and Register Allocation

Register allocation. Overview

Register Allocation. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Register Allocation. Register Allocation. Local Register Allocation. Live range. Register Allocation for Loops

CS 406/534 Compiler Construction Instruction Selection and Global Register Allocation

Lecture Overview Register Allocation

register allocation saves energy register allocation reduces memory accesses.

Global Register Allocation via Graph Coloring The Chaitin-Briggs Algorithm. Comp 412

Register Allocation 3/16/11. What a Smart Allocator Needs to Do. Global Register Allocation. Global Register Allocation. Outline.

Register Allocation (wrapup) & Code Scheduling. Constructing and Representing the Interference Graph. Adjacency List CS2210

Global Register Allocation - Part 2

Fall Compiler Principles Lecture 12: Register Allocation. Roman Manevich Ben-Gurion University

k register IR Register Allocation IR Instruction Scheduling n), maybe O(n 2 ), but not O(2 n ) k register code

Topic 12: Register Allocation

Global Register Allocation

Register Allocation. Preston Briggs Reservoir Labs

Rematerialization. Graph Coloring Register Allocation. Some expressions are especially simple to recompute: Last Time

Redundant Computation Elimination Optimizations. Redundancy Elimination. Value Numbering CS2210

CS 406/534 Compiler Construction Putting It All Together

The C2 Register Allocator. Niclas Adlertz

CSE P 501 Compilers. Register Allocation Hal Perkins Autumn /22/ Hal Perkins & UW CSE P-1

Register allocation. Register allocation: ffl have value in a register when used. ffl limited resources. ffl changes instruction choices

Low-Level Issues. Register Allocation. Last lecture! Liveness analysis! Register allocation. ! More register allocation. ! Instruction scheduling

Global Register Allocation - Part 3

EECS 583 Class 15 Register Allocation

Register Allocation. Stanford University CS243 Winter 2006 Wei Li 1

Lecture 15 Register Allocation & Spilling

Register allocation. instruction selection. machine code. register allocation. errors

Liveness Analysis and Register Allocation. Xiao Jia May 3 rd, 2013

Intermediate Representations Part II

Agenda. CSE P 501 Compilers. Big Picture. Compiler Organization. Intermediate Representations. IR for Code Generation. CSE P 501 Au05 N-1

CS415 Compilers Register Allocation. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

Register Allocation & Liveness Analysis

CS415 Compilers. Intermediate Represeation & Code Generation

Compiler Passes. Optimization. The Role of the Optimizer. Optimizations. The Optimizer (or Middle End) Traditional Three-pass Compiler

Register allocation. CS Compiler Design. Liveness analysis. Register allocation. Liveness analysis and Register allocation. V.

Instruction Selection: Preliminaries. Comp 412

Compilers and Code Optimization EDOARDO FUSELLA

Compiler Architecture

Lecture 21 CIS 341: COMPILERS

Lecture 6. Register Allocation. I. Introduction. II. Abstraction and the Problem III. Algorithm

Today More register allocation Clarifications from last time Finish improvements on basic graph coloring concept Procedure calls Interprocedural

A Bad Name. CS 2210: Optimization. Register Allocation. Optimization. Reaching Definitions. Dataflow Analyses 4/10/2013

Intermediate Representations

Computational Geometry

How to efficiently use the address register? Address register = contains the address of the operand to fetch from memory.

Linear Scan Register Allocation. Kevin Millikin

Register Allocation. Michael O Boyle. February, 2014

CS5363 Final Review. cs5363 1

Intermediate Representation (IR)

SSA-Form Register Allocation

CHAPTER 3. Register allocation

Local Register Allocation (critical content for Lab 2) Comp 412

3 No-Wait Job Shops with Variable Processing Times

Topic I (d): Static Single Assignment Form (SSA)

MIT Spring 2011 Quiz 3

Code Generation. CS 540 George Mason University

Lecture Notes on Register Allocation

Improvements to Graph Coloring Register Allocation

Instruction Selection and Scheduling

Register Allocation. Introduction Local Register Allocators

CHAPTER 3. Register allocation

Static single assignment

What Compilers Can and Cannot Do. Saman Amarasinghe Fall 2009

CSc 553. Principles of Compilation. 23 : Register Allocation. Department of Computer Science University of Arizona

April 15, 2015 More Register Allocation 1. Problem Register values may change across procedure calls The allocator must be sensitive to this

Register Allocation Amitabha Sanyal Department of Computer Science & Engineering IIT Bombay

fast code (preserve flow of data)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency

Register Allocation. CS 502 Lecture 14 11/25/08

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

Superscalar Processors Ch 14

Code generation for superscalar RISCprocessors. Characteristics of RISC processors. What are RISC and CISC? Processors with and without pipelining

Lecture 25: Register Allocation

Instruction Scheduling

Goals of Program Optimization (1 of 2)

CSE P 501 Compilers. SSA Hal Perkins Spring UW CSE P 501 Spring 2018 V-1

Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION

Register Allocation in Just-in-Time Compilers: 15 Years of Linear Scan

Lecture 3: Art Gallery Problems and Polygon Triangulation

Dealing with Register Hierarchies

Administration CS 412/413. Instruction ordering issues. Simplified architecture model. Examples. Impact of instruction ordering

Towards an SSA based compiler back-end: some interesting properties of SSA and its extensions

Compiler Construction 2009/2010 SSA Static Single Assignment Form

Global Register Allocation - 2

Clustering Billions of Images with Large Scale Nearest Neighbor Search

Topic 14: Scheduling COS 320. Compiling Techniques. Princeton University Spring Lennart Beringer

Intermediate Representations & Symbol Tables

CS 701. Class Meets. Instructor. Teaching Assistant. Key Dates. Charles N. Fischer. Fall Tuesdays & Thursdays, 11:00 12: Engineering Hall

Chapter 13 Reduced Instruction Set Computers

Characteristics of RISC processors. Code generation for superscalar RISCprocessors. What are RISC and CISC? Processors with and without pipelining

Compilers. Lecture 2 Overview. (original slides by Sam

Transcription:

Code generation for modern processors Definitions (1 of 2) What are the dominant performance issues for a superscalar RISC processor? Refs: AS&U, Chapter 9 + Notes. Optional: Muchnick, 16.3 & 17.1 Instruction selection the process of mapping il into assembly code assumes a fixed storage mapping combining instructions, using address modes (code shape) Strategy il select il il il asm schedule allocate schedule k k regs regs regs regs regs Register allocation the process of deciding which values reside in registers changes storage mapping concern about placement of data (and the code) select is fairly simple allocate and schedule are complex (problem of the 80 s) Topic 12: Register Allocation p.1/30 Topic 12: Register Allocation p.2/30 Definitions (2 of 2) Register allocation Instruction scheduling the process of reordering instructions to hide latencies assumes a fixed program changes demand for registers Concept: m register il k register allocator k register il Each problem is NP-complete for a non-trivial target processor. The problems are tightly intertwined, but conventional wisdom says we can (and should) attack each one separately. Assumptions Load-store RISC architecture Three-address IL Previous analysis identifies values that are illegal to hold in registers. Which? Load into register before use Store back after def Goals Produce correct k register code Minimize added loads and stores Minimize memory space needed to hold spills Allocator must be efficient backtracking) ( no Topic 12: Register Allocation p.3/30 Topic 12: Register Allocation p.4/30

Register classes Allocation versus assignment Some architectures have multiple register classes: commonly: general-purpose (GP) and floating-point (FP) (and double-precision may use two FPs) PowerPC: condition code registers IA 64: predicate registers, branch target registers Problem: Interactions between classes If all classes were used independently, could allocate separately Arithmetic ops may produce condition codes, predicates, etc. to be set FP and other register spills may cause address arithmetic! The distinction: allocation : choosing what to keep in registers at each point assignment : choosing specific registers for values Complexity Local Allocation optimal, linear time methods for simplest case almost everything else is NP-complete Assignment uniform regs, no spilling linear time adjacent register pairs NP-complete Our Approach 1. Assume separate allocation for each class except where noted. 2. Allocate GPRs last. Global NP-complete for 1 register machine NP-complete for k register machine most subproblems are NP-complete NP-complete Topic 12: Register Allocation p.5/30 Topic 12: Register Allocation p.6/30 Register Allocation Preliminaries Two Approaches for Single Basic Block Definitions and observations 1. Spill virtual register (aka value) V Assign to a memory location; not a physical register Load just before each use Store just after each def Usually assign to a stack slot Some algorithms may spill a value in one interval and allocate to a register in another 2. Reserve registers to ensure feasibility default (a) must be able to compute addresses, load, & store (b) requires a minimal number of registers, F : feasible set (c) F depends on architecture 3. MAXLIVE Maximum number of values live at any instruction (in some region of code, e.g., a basic block) Common features of single-basic block allocation All values live in memory between basic blocks Therefore, even for values allocated a physical register: Load before first use in block Store after last def in block if MAXLIVE k, allocation is trivial; F irrelevant if MAXLIVE > k, must spill some values to memory (1): Top-down allocation : Usage Counts a register can hold only a single value in entire block sort variables by ( uses ) assign first k F names to registers spill all other values (load before each use; store after each def) Topic 12: Register Allocation p.7/30 Topic 12: Register Allocation p.8/30

Two Approaches for Single Basic Block Global Register Allocation: An Early Approach (2) Bottom-up allocation : Linear scan using live ranges Global extension of Usage Counts a register can hold different values at different statements keep a stack of free registers for each statement (v 3 = v 1 op v 2 ) assign free registers to v 1, v 2, v 3 (if not in register already) free a register at the end of a live range if no register available, spill some busy register Extend usage counts to account for loops, branches Insert load at block entry; store at block exit Some cross-block analysis: try to avoid load, store at block boundaries A few extra spills in the wrong places can be extremely expensive (live values only) Key question: Which register to spill? spill register used farthest in the future (Sheldon Best 1955) on a tie, favor value that need not be stored back to memory Topic 12: Register Allocation p.9/30 Topic 12: Register Allocation p.10/30 Global Register Allocation: The Modern Approach Graph coloring A fundamentally global approach: Graph Coloring Lavrov (?), Cocke (1971), Chaitin (1981) abandon the distinction between local and global model live ranges of entire procedure in a single graph reduce allocation problem to coloring nodes in the graph minimal coloring is NP-Complete use heuristics to choose coloring map colors onto physical registers The problem A graph G = (N, E) is said to be k-colorable if and only if the nodes can be labeled with integers 1,...,k so that no edge in G connects nodes with the same label. Examples diamond graph 2-colorable a more complex graph 3-colorable Application to allocation & assignment graphical representation of conflicts coloring corresponds to feasible assignment model machine constraints in graph Topic 12: Register Allocation p.11/30 Topic 12: Register Allocation p.12/30

Graph Coloring Register Allocation Global Live Ranges 4 major aspects: 1. Constructing global live ranges not the same as intuitive live range in straight-line code 2. Building interference graph for a procedure captures information about overlapping live ranges 3. Estimating spill costs important to consider when k-coloring fails 4. (Try to) construct a k-coloring if unsuccesful, choose values to spill and repeat spill placement becomes critical issue Let s take these one by one. Definition: Live Range A live range is a set of references (definitions and uses) s.t. for any use in the set, all defs that reach it are in the set too for any def in the set, all uses it reaches are in the set too Fundamental Invariant Also called a web All references in a live range are allocated to the same physical register. Information needed to compute live ranges: need all definitions that reach a single use, and vice versa SSA form provides exactly this information: each SSA variable has a single def and zero or more uses φ-functions show multiple defs that reach a use: x 3 φ(x 1,...x n ) Topic 12: Register Allocation p.13/30 Topic 12: Register Allocation p.14/30 Computing Global Live Ranges The concept of interference Idea: Partition the SSA references into disjoint sets: all references to a variable belong in the same set all arguments to a φ-function belong in the same set as the output variable of the function Algorithm: Use the disjoint set union-find algorithm: 1. initially: every name is a separate set 2. repeatedly merge sets that meet at a φ-function: X = φ(y, Z): Merge the sets currently holding Y and Z into the set holding X 3. Finally, treat all references in the same live range as a single virtual register Definition of Interference Idea: Two values cannot be in one register if used in overlapping intervals One way to define interfere : n i and n j interfere if they are simultaneously live Problem: What if both are not used in some basic block (where both are live)? Better way: n i and n j interfere if n i is defined at a point where n j is live (or vice versa) Representing the interference property model the problem with an interference graph, I nodes represent values any two values that interfere are connected by an edge a k-coloring for I fits in k registers Topic 12: Register Allocation p.15/30 Topic 12: Register Allocation p.16/30

Building the interference graph Copy Coalescing Algorithm: 1. Identify global live ranges 2. Build LIVE_IN, LIVE_OUT sets for each block 3. Walk backwards through each block separately: (a) Initialize live set: LiveNow LIVE_OUT (b) For each instruction: v 1 op v 2 v 3 (i) v 3 interferes with every value in LiveNow (ii) remove v 3 from LiveNow (ii) add v 1, v 2 to LiveNow Use two representations: 1. adjacency matrix: lower-diagonal bit matrix allows test for interference in O(1) time hash-table has same benefit with lower memory usage for large graphs 2. adjacency list: allows efficient iteration over neighbors of a node When to coalesce An important optimization folded into register allocation mov v i v j Coalesce if : 1. v i and v j do not interfere, OR 2. v i and v j are not modified after the copy i.e., values remain equal always Coalesce means... 1. Replace v j with v i 2. Remove copy instruction 3. Combine nodes in the interference graph Topic 12: Register Allocation p.17/30 Topic 12: Register Allocation p.18/30 Estimating spill costs Coloring by Graph Pruning: Observations Components of cost (per reference) for a spill : Load / store: Or Rematerialization: 1. address computation 1. recomputing the value 2. memory operation 2. estimated execution frequency 3. estimated execution frequency Address computation: minimize by keeping spilled values in activation record can load / store with (fp + offset) address Execution frequencies: Static estimation: weight by 10 d for loop depth d weight by branching probability if enclosed in branches Profiling: measure execution frequencies for representative inputs Two Key Observations 1. The degree < k rule: A graph having a node n with degree < k is k-colorable iff the graph with node n removed is k-colorable Proof: obvious given k-coloring of graph without node n, all neighbors of N use fewer than k colors. Pick a remaining color for n. 2. Consider a node n with degree >= k: Neighbors of N may still use fewer than k distinct colors defer spilling until you are sure it is needed Idea: Simplify graph by removing nodes Repeat until graph is empty: Repeatedly remove a node with degree < k from the graph When no such node exists: choose a candidate to spill and remove it Topic 12: Register Allocation p.19/30 Topic 12: Register Allocation p.20/30

Coloring by Graph Pruning: Algorithm Observations about Reserving Registers Algorithm while N is non-empty if node n with n < R, push n on stack else, pick n as possible spill candidate and push n on stack remove n from I (along with its incident edges) while stack is non-empty pop n, insert n into I, try to color n if fail to color n, mark n for spilling Spill all marked nodes (insert code) Rewrite code to use assigned physical registers Use stack of nodes Need registers! Observations If reserved registers for executing spill code: after spilling all nodes, we are done If did not reserve registers: use virtual registers for spilling repeat entire allocation algorithm more expensive but could give fewer spills Heuristic for choosing spill candidates is key Topic 12: Register Allocation p.21/30 Topic 12: Register Allocation p.22/30 Chaitin-Briggs register allocators renumber find live ranges and rename them build build the interference graph, I = (N, E) coalesce fold unneeded copies: l i l j & l i, l j E combine l i & l j spill costs estimate cost for spilling each live range simplify select spill while N is non-empty if n with n < k, push n on stack else, pick n for possible spilling & push it remove n from I while stack is non-empty pop n, insert n into I, & try to color it spill uncolored definitions and uses Incorporates deferred spilling (Briggs 1989) Picking a spill candidate When n N, n k, it must pick a spill candidate Chaitin s heuristic spill cost Chaitin says minimize current degree where current degree is the number of remaining neighbors spill cost of l i is defined as j refs(li) 10 d(j) loads defs(li) where refs(l i ) is defs(l i ) uses(l i ) d(j) is the nesting depth of j 2 10 d(j) stores uses(li) 2 10 d(j) Bernstein et al. suggested repeating simplify, select, & spill with different spill choice heuristics Topic 12: Register Allocation p.23/30 Topic 12: Register Allocation p.24/30

Improvements to Chaitin: Best of 3 spilling How does Chaitin-style do? The idea when allocator blocks, it chooses a value to spill determines which value spills & where code is inserted spill choice is the critical issue let Author area(lr) = n lr num_live(i) 5d(i) Ratio to minimize Chaitin cost current degree Bernstein cost current degree 2 Bernstein cost area Bernstein cost area 2 cost The implementation no metric dominates others (NP-noise) actual coloring is inexpensive; coalescing takes time run multiple colorings & use best result (20%) Strengths and Weaknesses precise interference graph strong coalescing mechanism handles register assignment well runs relatively quickly known to overspill in tight cases graph has no geography spills live range everywhere long blocks become spilling by use counts Is further improvement possible? rising spill costs aggressive transformations live range splitting better allocation for long blocks Remaining problems 1. spill code 2. spill code 3. spill code still room for improvement on this problem Topic 12: Register Allocation p.25/30 Topic 12: Register Allocation p.26/30 Improvements Pass-through live ranges A regional effect Situations that we see in practice 1. Pass-through live ranges value live but not referenced in block or region 2. Spill the wrong value some other value might lead to better code 3. Excessive demand for registers register pressure simply too high Improvements are both possible and desirable Allocator is final filter through which code must pass should be lowest priority for register can be heavily weighted by Chaitin want fair competition inside each loop This problem motivated Briggs s work on live range splitting do i 1 to n... do j 1 to m do k 1 to o x... do j 1 to m do k 1 to o no reference to x... do j 1 to m do k 1 to o x... Topic 12: Register Allocation p.27/30 Topic 12: Register Allocation p.28/30

Improvements to Chaitin (1 of 2) Overview Improvements to Chaitin (2 of 2) Overview Global: Experience suggests that there is no silver bullet Difficulties are not symptom of a single effect deferred spilling Briggs et al. 20% best of three Bernstein et al. 20% rematerialization Briggs et al. 20% optimal coloring expensive these are good things to do do not change underlying problem Regional: live range splitting Briggs et al. ±4 Koblenz & Callahan (?) scalar replacement Carr 20% to 3 move to near-by problem with a better solution Local: Best s method Sheldon Best (1955) naively optimal (et al. in 65,75,88,95) good spill decisions ignore surrounding context Topic 12: Register Allocation p.29/30 Topic 12: Register Allocation p.30/30