Lecture 19-23: Compiler Backend


Lecture 19-23 Compiler Backend Jianwen Zhu Electrical and Computer Engineering University of Toronto Jianwen Zhu 2009 - P. 1

Backend Tasks
  - Instruction selection: map virtual instructions to machine instructions
  - Scheduling: reorder instructions to exploit instruction-level parallelism (ILP)
  - Register allocation: map values (computed by instructions) to machine registers

Instruction Selection
  - Naïve selection
      - One-to-many: each virtual instruction maps to one or more machine instructions
      - Fast; employed by fast JIT compilers
      - Poor performance
  - Better selection
      - Many-to-one: map a subgraph of virtual instructions to one machine instruction
      - The subgraph is called a tile
      - Better performance
      - More challenging algorithm design

Example tiles (IR tile = MIPS code)
  IR tile                          MIPS code
  CONST(c)                         li   'd0, c
  +(e0,e1)                         add  'd0, 's0, 's1
  +(e0,CONST(c))                   add  'd0, 's0, c
  *(e0,e1)                         mult 'd0, 's0, 's1
  *(e0,CONST(2^k))                 sll  'd0, 's0, k
  MEM(e0)                          lw   'd0, ('s0)
  MEM(+(e0,CONST(c)))              lw   'd0, c('s0)
  MOVE(MEM(e0),e1)                 sw   's1, ('s0)
  MOVE(MEM(+(e0,CONST(c))),e1)     sw   's1, c('s0)
  JUMP(NAME(X))                    b    X
  JUMP(e0)                         jr   's0
  LABEL(X)                         X: nop
  Figure legend: operand subtree en of an IR tile is held in source register 'sn (e0 in 's0, e1 in 's1); the result goes to 'd0.

Problem Formulation
  - Input: subject graph
      - Nodes: virtual instructions
      - Edges: operand usages
  - Simplification
      - The subject graph is reduced to a forest of trees
      - Break whenever a value is used more than once
  - Input: patterns
      - Each machine instruction is modeled as a tree of virtual instructions

Problem Formulation
  - Find a covering of the subject graph (tree)
      - A non-overlapping partitioning of the graph
      - Each partition is a tile
      - Each tile matches a pattern (machine instruction)
  - Such that the cost of the cover is minimized
      - Cost = number of tiles (machine instructions)

Instruction Selection Challenge
  - There are many possible alternatives: exponentially many!
  - a[i] := x is:
        MOVE(MEM(+(MEM(+(TEMP(fp),CONST(20))), *(TEMP(i),CONST(4)))), MEM(+(TEMP(fp),CONST(10))))
  - The following are two possible tilings of the IR:
        Tiling 1              Tiling 2
        lw  r1, 20($fp)       add r1, $fp, 20
        lw  r2, i             lw  r1, (r1)
        sll r2, r2, 2         lw  r2, i
        add r1, r1, r2        sll r2, r2, 2
        lw  r2, 10($fp)       add r1, r1, r2
        sw  r2, (r1)          add r2, $fp, x
                              lw  r2, (r2)
                              sw  r2, (r1)
  - Which one is better?

Top-Down Greedy Algorithm: Maximum Munch
  - Start from the IR root; from all matching tiles, select the one with the maximum number of IR nodes
  - Go to the children of this tile and apply the algorithm recursively until you reach the tree leaves
  - Pros: fast
  - Cons: greedy, making decisions without knowing their impact
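The greedy procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: IR trees are nested tuples, `WILD` marks the operand subtrees (e0, e1) of a tile, the tile list is a subset of the example tile table, and the `TEMP` tile (a value assumed to sit in a register already) is hypothetical and not part of the slides.

```python
WILD = "_"                    # pattern wildcard: matches any operand subtree
LEAF_OPS = {"CONST", "TEMP"}  # leaves carry a payload instead of children

# (emitted text, pattern). Subset of the example tiles; the TEMP tile is an
# assumption made so that register leaves can be covered.
TILES = [
    ("li d0, c",        ("CONST",)),
    ("(temp in reg)",   ("TEMP",)),
    ("add d0, s0, s1",  ("+", WILD, WILD)),
    ("add d0, s0, c",   ("+", WILD, ("CONST",))),
    ("mult d0, s0, s1", ("*", WILD, WILD)),
    ("lw d0, (s0)",     ("MEM", WILD)),
    ("lw d0, c(s0)",    ("MEM", ("+", WILD, ("CONST",)))),
    ("sw s1, (s0)",     ("MOVE", ("MEM", WILD), WILD)),
    ("sw s1, c(s0)",    ("MOVE", ("MEM", ("+", WILD, ("CONST",))), WILD)),
]

def match(pattern, node):
    """Return the subtrees bound to wildcards, or None if there is no match."""
    if pattern == WILD:
        return [node]
    op, *kids = pattern
    if node[0] != op:
        return None
    if op in LEAF_OPS:                 # payload (constant, name) is ignored
        return []
    if len(kids) != len(node) - 1:
        return None
    bound = []
    for p, n in zip(kids, node[1:]):
        sub = match(p, n)
        if sub is None:
            return None
        bound += sub
    return bound

def size(pattern):
    """Number of IR nodes covered by a tile pattern."""
    return 0 if pattern == WILD else 1 + sum(size(k) for k in pattern[1:])

def munch(node, out):
    """Cover `node` top-down, always taking the largest matching tile."""
    best = None
    for name, pattern in TILES:
        bound = match(pattern, node)
        if bound is not None and (best is None or size(pattern) > size(best[1])):
            best = (name, pattern, bound)
    if best is None:
        raise ValueError(f"no tile matches {node[0]}")
    for child in best[2]:              # operands first ...
        munch(child, out)
    out.append(best[0])                # ... then the instruction itself
    return out

# a[i] := x, the IR tree from the "Instruction Selection Challenge" slide
TREE = ("MOVE",
        ("MEM", ("+", ("MEM", ("+", ("TEMP", "fp"), ("CONST", 20))),
                      ("*", ("TEMP", "i"), ("CONST", 4)))),
        ("MEM", ("+", ("TEMP", "fp"), ("CONST", 10))))

code = [t for t in munch(TREE, []) if t != "(temp in reg)"]
print("\n".join(code))
```

Because the `sll` tile is left out of this subset, the multiplication by 4 is covered by `li` plus `mult`; the cover still uses six instructions.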

Maximum Munch Example
  A  lw  r1, fp
  B  lw  r2, 8(r1)
  C  lw  r3, i
  D  sll r4, r3, 2
  E  add r5, r2, r4
  F  lw  r6, fp
  G  lw  r7, 16(r6)
  H  add r8, r7, 1
  I  sw  r8, (r5)

Bottom-up Optimal Algorithm
  - Works from the leaves to the root
  - Evaluate each matching tile
      - Calculate the cost of the children first
      - The cost is accumulated: cost of a tile = (number of nodes in the tile) + (total costs of all the tile's children)
      - Pick the best tile
  - Pros: optimal (if the subject graph is a tree)
  - Cons
      - A node may be visited many times as a child
      - The cost calculation is repeated
  - What have we learned before to solve that?

Dynamic Programming
  - Remember the solutions of the children!
  - Each cost is calculated only once
  - Use a table to remember
      - That's what "programming" means
  - Example
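A minimal sketch of the table-based idea, under assumptions that are not from the slides: IR trees are nested tuples, every machine instruction tile costs 1 (matching the "number of tiles" objective of the problem formulation), the hypothetical `TEMP` tile costs 0, and `functools.lru_cache` plays the role of the table so that each subtree is solved only once.

```python
from functools import lru_cache

WILD = "_"                    # pattern wildcard: matches any operand subtree
LEAF_OPS = {"CONST", "TEMP"}  # leaves carry a payload instead of children

# (emitted text, cost, pattern). The costs are assumptions: 1 per machine
# instruction, 0 for the hypothetical "already in a register" TEMP tile.
TILES = [
    ("li d0, c",        1, ("CONST",)),
    ("(temp in reg)",   0, ("TEMP",)),
    ("add d0, s0, s1",  1, ("+", WILD, WILD)),
    ("add d0, s0, c",   1, ("+", WILD, ("CONST",))),
    ("mult d0, s0, s1", 1, ("*", WILD, WILD)),
    ("lw d0, (s0)",     1, ("MEM", WILD)),
    ("lw d0, c(s0)",    1, ("MEM", ("+", WILD, ("CONST",)))),
    ("sw s1, (s0)",     1, ("MOVE", ("MEM", WILD), WILD)),
]

def match(pattern, node):
    """Return the subtrees bound to wildcards, or None if there is no match."""
    if pattern == WILD:
        return [node]
    op, *kids = pattern
    if node[0] != op:
        return None
    if op in LEAF_OPS:
        return []
    if len(kids) != len(node) - 1:
        return None
    bound = []
    for p, n in zip(kids, node[1:]):
        sub = match(p, n)
        if sub is None:
            return None
        bound += sub
    return bound

@lru_cache(maxsize=None)      # the "table": one entry per distinct subtree
def best_cover(node):
    """Return (cost, tiles in emission order) of a cheapest cover of node."""
    best = None
    for name, cost, pattern in TILES:
        bound = match(pattern, node)
        if bound is None:
            continue
        total, names = cost, ()
        for child in bound:                      # children are solved first
            child_cost, child_names = best_cover(child)
            total += child_cost
            names += child_names
        if best is None or total < best[0]:
            best = (total, names + (name,))
    if best is None:
        raise ValueError(f"no tile matches {node[0]}")
    return best

# a[i] := x, the IR tree from the "Instruction Selection Challenge" slide
TREE = ("MOVE",
        ("MEM", ("+", ("MEM", ("+", ("TEMP", "fp"), ("CONST", 20))),
                      ("*", ("TEMP", "i"), ("CONST", 4)))),
        ("MEM", ("+", ("TEMP", "fp"), ("CONST", 10))))

cost, tiles = best_cover(TREE)
print(cost, tiles)
```

Swapping in real per-instruction latencies only requires changing the cost column of `TILES`; the recursion itself stays the same.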

Step 1, Step 2, Step 3 [figures]

Why Scheduling
  - Pipelined processors
      - Hazards
      - Reordering could relieve hazards
  - VLIW
      - Used by DSPs (e.g. TI)
      - Issues multiple instructions per cycle
      - Issue is determined by the compiler
  - Superscalar
      - Used by desktop processors (e.g. Intel)
      - Issues multiple instructions per cycle
      - Issue is determined by a hardware scheduler
      - The compiler could increase the available ILP

Scheduling Problem
  - For each basic block, map each instruction v to a clock step S(v), an integer
  - Subject to data dependency constraints: S(u) < S(v) if v depends on u
  - Subject to resource constraints: # of instructions issued in a step <= # of available functional units (FUs)

Scheduling: Dependency Test
  - Why a dependency test?
      - Schedule the max # of instructions in a step
      - Preserve functional correctness
  - Sources of data dependency
      - Must:
            A = ...
            B = A
      - May (if A and B are aliased to the same location):
            (1)  A = ...   followed by   B = ...
            (2)  A = ...   followed by   ... = B
            (3)  ... = A   followed by   B = ...

Dependency Test for TinyIR
  - No pointers: the dependency test only consists of comparing symbol names
  - With pointers: pointer/alias analysis has to be performed
  - Test result: a precedence graph for each basic block

Precedence Graph Example
  Figure: (a) TinyC code translated to (b) TinyIR

Precedence Graph Example
  - For a basic block, draw each instruction's dependencies as edges in the precedence graph, e.g. (28) (26) (24),(25)
  - Instructions outside the basic block are named s and t
  - Figure: (a) chain of instructions; (b) precedence graph of B4

Unconstrained Scheduling
  - Assume an unlimited # of functional units
  - Problem: minimize the total # of steps, S(t) - S(s)
  - The source s must be scheduled earlier, and the sink t later, than any node in the basic block

As Soon As Possible (ASAP) Scheduling

ASAP in English
  - Iterative approach: in each iteration, a set of ready nodes (instructions) is scheduled to a control step
  - A node is ready when all its predecessors are scheduled
  - Key: how to efficiently decide if a node is ready?
      - Naïve approach: for each node being scheduled, visit each successor and check all predecessors of the successor
      - Better approach: keep a counter for each node representing its # of unscheduled predecessors
          - For each node being scheduled, visit each successor and decrement the successor's counter
          - When the counter reaches 0, the node is ready
          - O(|V| + |E|)
  - The ALAP (As Late As Possible) algorithm schedules in reverse order (from the sink)
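The counter-based approach described above might look as follows in Python. This is a sketch, not the course's implementation: the graph is a successor map whose edges are read off the Def/Use sets of the liveness example later in the lecture (15, 16 -> 17 -> 18 -> 20 and 24, 25 -> 26), the source s and sink t are left implicit, and ALAP is obtained by running the same routine on the reversed graph.

```python
def asap_schedule(succs):
    """Map each node to its earliest control step (unlimited functional units)."""
    pending = {v: 0 for v in succs}            # unscheduled-predecessor counters
    for u in succs:
        for v in succs[u]:
            pending[v] += 1
    ready = [v for v in succs if pending[v] == 0]
    schedule, step = {}, 0
    while ready:
        next_ready = []
        for u in ready:
            schedule[u] = step
            for v in succs[u]:                 # each edge is visited once
                pending[v] -= 1
                if pending[v] == 0:
                    next_ready.append(v)
        ready = next_ready
        step += 1
    return schedule

def alap_schedule(succs):
    """Latest control steps, found by scheduling the reversed graph from the sinks."""
    preds = {v: [] for v in succs}
    for u in succs:
        for v in succs[u]:
            preds[v].append(u)
    reverse = asap_schedule(preds)
    last = max(reverse.values())
    return {v: last - reverse[v] for v in succs}

# Precedence graph of the running example (edges are an inference from the
# example's Def/Use sets, not copied from a figure).
SUCCS = {15: [17], 16: [17], 17: [18], 18: [20], 20: [],
         24: [26], 25: [26], 26: []}

asap = asap_schedule(SUCCS)
alap = alap_schedule(SUCCS)
mobility = {v: alap[v] - asap[v] for v in SUCCS}
print(asap, alap, mobility, sep="\n")
```

On this graph the result agrees with the ASAP and ALAP schedules of the example: only nodes 24, 25 and 26 have nonzero mobility.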

ASAP and ALAP Example
  Step    (a) ASAP          (b) ALAP
  0       15 16 24 25       15 16
  1       17 26             17
  2       18                18 24 25
  3       20                20 26
  (c) Mobility, for nodes 15 16 17 18 20 24 25 26
  Both schedules run from the source s to the sink t. Mobility shows the scheduling flexibility of each node (the difference between its ASAP step and its ALAP step).

Resource-constrained Scheduling
  - Now consider a limited # of functional units
  - New constraint: the # of instructions scheduled at any control step must be <= the # of functional units
  Figure: the ASAP/ALAP example again, panels (a) (b) (c), with a control step marked "Violated!"

List Scheduling

List Scheduling in English
  - Modified version of ASAP
  - Like ASAP, a list of ready nodes is maintained
  - Unlike ASAP
      - Unit occupancy for the current control step needs to be maintained: reservation table (restab)
      - Only a subset of the ready nodes can be scheduled
  - Key question: how to choose the subset for better performance?
      - Classic NP-complete problem
      - Solution: assign priorities to ready nodes
      - Priority determined by heuristics

List Scheduling Heuristics
  - Less-flexible-first: assigns higher priority to nodes that have smaller mobility (mobility = ALAP step - ASAP step)
  - Distance to sink
  - # of successors
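A small Python sketch of list scheduling with the less-flexible-first heuristic. Several things here are assumptions rather than facts from the slides: the operation kinds (`alu` for + and -, `mul` for *) and the unit counts (two ALUs, one multiplier) are guesses that are consistent with the example schedules, the mobility values are the ALAP minus ASAP differences of the running example, and every operation is assumed to take one step on a unit of its kind (each kind must have at least one unit).

```python
def list_schedule(succs, kind, units, priority):
    """Resource-constrained schedule; a lower priority value is issued first."""
    pending = {v: 0 for v in succs}            # unscheduled-predecessor counters
    for u in succs:
        for v in succs[u]:
            pending[v] += 1
    ready = [v for v in succs if pending[v] == 0]
    schedule, step = {}, 0
    while ready:
        free = dict(units)                     # reservation table for this step
        issued, deferred = [], []
        for v in sorted(ready, key=lambda n: (priority[n], n)):
            if free[kind[v]] > 0:              # a unit of the right kind is free
                free[kind[v]] -= 1
                schedule[v] = step
                issued.append(v)
            else:
                deferred.append(v)             # stays ready for the next step
        for u in issued:
            for v in succs[u]:
                pending[v] -= 1
                if pending[v] == 0:
                    deferred.append(v)
        ready = deferred
        step += 1
    return schedule

# Running example; graph, kinds and unit counts are illustrative assumptions.
SUCCS = {15: [17], 16: [17], 17: [18], 18: [20], 20: [],
         24: [26], 25: [26], 26: []}
KIND = {15: "alu", 16: "alu", 17: "mul", 18: "mul",
        20: "alu", 24: "alu", 25: "mul", 26: "mul"}
UNITS = {"alu": 2, "mul": 1}
MOBILITY = {15: 0, 16: 0, 17: 0, 18: 0, 20: 0, 24: 2, 25: 2, 26: 2}

schedule = list_schedule(SUCCS, KIND, UNITS, MOBILITY)
for s in range(max(schedule.values()) + 1):
    print(s, sorted(v for v in schedule if schedule[v] == s))
```

Under these assumptions the sketch finishes in four steps (0 to 3); replacing `MOBILITY` with a different priority map is enough to try the other heuristics.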

Example
  (a) Uses less-flexible-first priority
  (b) Random1 did not choose 16 at step 0 (extra step)
  (c) Random2 did not choose 17 at step 2 (extra step)
  Step    (a)               (b)               (c)
  0       15+ 16- 25*       15+ 24+ 25*       15+ 16- 25*
  1       17* 24+           16- 26*           17* 24+
  2       18*               17*               26*
  3       20+ 26*           18*               18*
  4                         20+               20+
  Each schedule starts at the source s and ends at the sink t.

Register Allocation
  - Virtual instructions compute values
  - Values must be stored for later use
  - Store values in registers
  - Questions
      - How many registers are needed?
      - How can values be bound to registers?

Register Binding
  - Objective: minimize the number of registers
  - Maximize the sharing of registers between values
  - Observation: two values can share the same register only if they are not in use at the same time

Liveness Analysis
  - Objective: determine when variables are in use
  - What does it mean to be live?
  - Definition 5.5: A value (instruction) v is live at a control step s1 if there exists another control step s2 reachable from s1 such that v is used as an operand by one of the instructions scheduled at s2. The live set at control step s1 is the set of all values live at s1.

Example
  (a) Schedule and (b) live values at the beginning of each step
  Step    Instructions    Live values
  0       15 16 25        {4, 6}
  1       17 24           {4, 15, 16, 25}
  2       18              {17, 24, 25}
  3       20 26           {18, 24, 25}
  4                       {20, 26}

Live(s), Def(s) and Use(s)
  - Liveness analysis computes when variables are live
  - It uses three sets
      - Live(s): set of values live at the beginning of step s
      - Def(s): set of values defined at step s
      - Use(s): set of values used at step s

Basic Block Liveness Analysis
  - Relationship between the sets: Live(s) = Use(s) U [Live(s+1) - Def(s)]
  - Backward scan, since each step requires information from the next step
  - Idea: used variables become live; defined variables become dead
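The backward scan can be written directly from the equation Live(s) = Use(s) U [Live(s+1) - Def(s)]. The following Python sketch uses the Def and Use sets of the running example; the function name `block_liveness` and the list-of-sets layout are illustrative choices, not part of the lecture.

```python
def block_liveness(defs, uses, live_out):
    """Return live[s] for s = 0..n, where live[n] is the set live after the block."""
    n = len(defs)
    live = [set() for _ in range(n + 1)]
    live[n] = set(live_out)
    for s in range(n - 1, -1, -1):             # backward scan over the steps
        # Used values become live; defined values are dead before their definition.
        live[s] = uses[s] | (live[s + 1] - defs[s])
    return live

# Def/Use sets per control step, taken from the example table.
DEFS = [{15, 16, 25}, {17, 24}, {18}, {20, 26}]
USES = [{4, 6}, {4, 15, 16}, {17}, {18, 24, 25}]

live = block_liveness(DEFS, USES, live_out={20, 26})
for step, values in enumerate(live):
    print(step, sorted(values))
```

The live sets printed for steps 0 to 4 match the Live column of the example table.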

Basic Block Liveness Analysis
  Live(3) = {18, 24, 25} U [{20, 26} - {20, 26}]
  Step    Def             Use             Live
  0       {15, 16, 25}    {4, 6}
  1       {17, 24}        {4, 15, 16}
  2       {18}            {17}
  3       {20, 26}        {18, 24, 25}    {18, 24, 25}
  4                                       {20, 26}

Basic Block Liveness Analysis
  Live(2) = {17} U [{18, 24, 25} - {18}]
  Step    Def             Use             Live
  0       {15, 16, 25}    {4, 6}          {4, 6}
  1       {17, 24}        {4, 15, 16}     {4, 15, 16, 25}
  2       {18}            {17}            {17, 24, 25}
  3       {20, 26}        {18, 24, 25}    {18, 24, 25}
  4                                       {20, 26}

Basic Block Liveness Analysis

Liveness Analysis Algorithm
  - This algorithm can be extended to the whole control flow graph (CFG)
  - Caveats
      - The CFG has backward edges
      - A basic block (BB) can have multiple successors
  - Modifications
      - Repeatedly traverse the graph until the solution converges
      - Union the live sets from all successors before applying the equation
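The two modifications can be sketched as a fixed-point iteration in Python. The four-block CFG below (a counting loop) is a made-up illustration, and `use[b]` is assumed to hold only the variables read in block b before any definition of them in that block.

```python
def cfg_liveness(succs, use, defs):
    """Iterate live_in/live_out over all blocks until nothing changes."""
    live_in = {b: set() for b in succs}
    live_out = {b: set() for b in succs}
    changed = True
    while changed:                              # repeat until convergence
        changed = False
        for b in succs:
            # Union the live-in sets of all successors ...
            out = set().union(*(live_in[s] for s in succs[b]))
            # ... then apply the basic-block equation.
            inn = use[b] | (out - defs[b])
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out

# Hypothetical CFG:  B0: i = 0        B1: if i < n goto B2 else B3
#                    B2: i = i + 1    B3: return i
SUCCS = {"B0": ["B1"], "B1": ["B2", "B3"], "B2": ["B1"], "B3": []}
USE = {"B0": set(), "B1": {"i", "n"}, "B2": {"i"}, "B3": {"i"}}
DEFS = {"B0": {"i"}, "B1": set(), "B2": {"i"}, "B3": set()}

live_in, live_out = cfg_liveness(SUCCS, USE, DEFS)
for b in SUCCS:
    print(b, sorted(live_in[b]), sorted(live_out[b]))
```

The back edge from B2 to B1 is what forces more than one pass: `n` only becomes live out of B2 once B1's live-in set has been computed.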

Liveness Analysis Algorithm

Interference Graph
  - Liveness of variables has been computed
  - Construct a graph describing the interference between variables, i.e. two variables being live at the same time
  - Nodes represent variables
  - Edges represent two variables that are live at the same time
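One way to build such a graph is to connect every pair of values that appear together in some live set. The Python sketch below does this for the live sets of the basic-block example; the adjacency-set representation and the name `interference_graph` are assumptions made for illustration.

```python
from itertools import combinations

def interference_graph(live_sets):
    """Return {value: set of values that are live at the same time}."""
    adj = {}
    for live in live_sets:
        for v in live:
            adj.setdefault(v, set())            # isolated values still get a node
        for a, b in combinations(sorted(live), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj

# Live sets at the beginning of steps 0..4 of the running example.
LIVE_SETS = [{4, 6}, {4, 15, 16, 25}, {17, 24, 25}, {18, 24, 25}, {20, 26}]

graph = interference_graph(LIVE_SETS)
for v in sorted(graph):
    print(v, sorted(graph[v]))
```

Value 25 ends up with the most neighbors because it stays live across three consecutive steps.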

Interference Graph
  Figure: interference graph over the nodes 4, 6, 15, 16, 17, 18, 20, 24, 25, 26

Interference Graph Algorithm

Register Binding by Coloring
  - Recall: we want to assign variables to registers so as to minimize the number of registers used
  - This corresponds to the graph coloring problem: given a graph, color each node such that no edge joins nodes of the same color, using the fewest colors

Graph Coloring
  - Graph coloring is NP-complete
  - Heuristics are needed to solve the problem within a reasonable time
  - Overview
      - Given an order in which to visit the nodes (i.e. a vertex elimination order)
      - Visit each node in that order
      - Pick a color such that no neighbor has the same one
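The overview above amounts to a greedy pass over the nodes. In the Python sketch below, colors are small integers standing for registers, the edge list is derived from the live sets of the liveness example (it is not copied from the interference-graph figure), and the visiting order is the one that can be read off the vertex-elimination example stack; picking the lowest free color is an assumption, since the slides only require a color no neighbor has.

```python
def greedy_color(adj, order):
    """Visit nodes in `order`, giving each the lowest color no neighbor has."""
    color = {}
    for v in order:
        taken = {color[n] for n in adj[v] if n in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color

# Interference edges derived from the example's live sets (an inference).
EDGES = [(4, 6), (4, 15), (4, 16), (4, 25), (15, 16), (15, 25), (16, 25),
         (17, 24), (17, 25), (24, 25), (18, 24), (18, 25), (20, 26)]
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)

ORDER = [25, 16, 15, 4, 24, 18, 17, 26, 20, 6]

coloring = greedy_color(ADJ, ORDER)
registers_needed = max(coloring.values()) + 1
print(coloring, registers_needed)
```

Four colors are used here, which cannot be improved for this example because four values (4, 15, 16, 25) are live at the same time.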

Example
  Figure sequence: the graph is colored one node at a time. The nodes shown are 25 and 16; then 15 is added; then 4; then 24; then 17 and 18; and finally 26, 6 and 20.

Coloring Algorithm

Vertex Elimination Order
  - The previous example assumed that the order was given
  - For a basic block: use the left-edge algorithm to determine the order; it is optimal for interval graphs
  - Generally: use a heuristic, less-flexible-first, i.e. pick the nodes with the most neighbors first

Vertex Elimination
  - To generate the vertex order
      - Visit the node with the fewest neighbors, remove it from the graph, and push it onto a stack
      - Repeat until all nodes have been visited
  - The stack will now contain the vertex order, starting with the top node on the stack
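A compact Python sketch of this procedure. The edge list is derived from the live sets of the liveness example, and breaking ties by the smallest node number is an assumption; with that tie-break the sketch happens to reproduce the stack shown in the example that follows.

```python
def elimination_order(adj):
    """Repeatedly remove a fewest-neighbor node; return nodes from stack top down."""
    remaining = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    stack = []
    while remaining:
        # Fewest remaining neighbors first; ties go to the smallest node number.
        v = min(remaining, key=lambda n: (len(remaining[n]), n))
        for n in remaining.pop(v):
            remaining[n].discard(v)             # v no longer counts as a neighbor
        stack.append(v)
    return stack[::-1]                          # top of the stack comes first

# Interference edges derived from the example's live sets (an inference).
EDGES = [(4, 6), (4, 15), (4, 16), (4, 25), (15, 16), (15, 25), (16, 25),
         (17, 24), (17, 25), (24, 25), (18, 24), (18, 25), (20, 26)]
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)

order = elimination_order(ADJ)
print(order)
```

Nodes with many neighbors are removed last, so they sit on top of the stack and are visited first, which is the less-flexible-first heuristic.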

Example
  Figure sequence: nodes are removed from the interference graph and pushed onto the stack. Stack contents after each figure, top first:
  - 26 20 6
  - 18 17 26 20 6
  - 24 18 17 26 20 6
  - 4 24 18 17 26 20 6
  - 15 4 24 18 17 26 20 6
  - 25 16 15 4 24 18 17 26 20 6