Low-level optimization


Advanced Course on Compilers, Spring 2015 (III-V): Lecture 6
Vesa Hirvisalo, ESG/CSE/Aalto

Today Introduction to code generation finding the best translation Instruction selection matching the machine instructions optimization means the best match Code-generator compilers also known as code-generator generators LR-based, TWIG-like, and BURS-based as examples Register allocation overview of the problem interference graph coloring as a solution Instruction scheduling only an overview we will get back to this (is essential for parallelism!) 2/40

Introduction to code generation

Code generation

The two basic problems are:
- finding the (semantically) correct translations
- selecting the best one

We will mostly consider the latter, i.e., low-level optimization. But the former, finding correct translations, is also difficult. For both, we face severe problems:
- machine languages are large
- they are complex
- they are ambiguous
- parallelism is a real challenge (we will omit it in this introduction)

Low-level optimizations

Selecting the best (optimal) translation is NP-complete for typical target machines. Even selecting a good translation is difficult:
- huge search space
- complex cost functions, especially for memory (but we will omit such complications in our simplified introduction)

The classical solution is to split code generation into phases:
- register allocation (data)
- instruction selection (code)
- execution order (connecting the code and the data)

Ambiguity

Considered as a formal (context-free) language, a typical grammar describing machine language in terms of IL is massively ambiguous. This means that a huge number of (parse) trees exist for an IL program. Ambiguity leads to conflicts that must be resolved:
- dynamic resolution: conflicts are resolved during compilation, which can be slow; typical of soft-coded generators, where a program in IL is covered by ML patterns expressed in IL
- static resolution: conflicts are resolved when the compiler is built; typical of hard-coded generators, often based on LR parsing of a program in IL, where the parser state space is often huge

Instruction selection

Instruction selection

Instruction selection translates from IL to machine language:
- instruction selection methods are translation methods
- they are typically much more advanced (and complex) than the syntax-directed translation used in front ends

Code generation can be viewed as a tree rewriting process:
- an instruction is described by a tree pattern
- patterns may have nonterminal symbols
- nonterminals stand for intermediate results, i.e., they describe storage locations

Actually, a true IL representation is typically an SSA-wired graph consisting of basic-block DAGs; we simplify (once more) to introduce the basics.

Tree coverage

We cover the source (language) tree using the target-language patterns (matching); each match corresponds to a cover node. The cover tree is the translation result. A cover tree fulfills the following conditions:
- every cover node has a pattern that matches the original intermediate representation
- a cover node has as many sons as the pattern of the associated rule contains nonterminals
- the i-th nonterminal of the rule of a node n is equal to the result nonterminal of the rule of the i-th son of n

Example (1/3)

Let the machine instructions and their patterns in IL be:

ADD ri, rj, rk     BINOP(+, TEMP(rj), TEMP(rk))
MUL ri, rj, rk     BINOP(*, TEMP(rj), TEMP(rk))
SUB ri, rj, rk     BINOP(-, TEMP(rj), TEMP(rk))
DIV ri, rj, rk     BINOP(/, TEMP(rj), TEMP(rk))
ADDI ri, rj, c     BINOP(+, TEMP(rj), CONST(c))
ADDI ri, r0, c     CONST(c)
SUBI ri, rj, c     BINOP(-, TEMP(rj), CONST(c))
LOAD ri, rj[c]     MEM(BINOP(+, TEMP(rj), CONST(c)))
LOAD ri, r0[c]     MEM(CONST(c))
LOAD ri, rj[0]     MEM(TEMP(rj))
STORE rj[c], ri    MOVE(MEM(BINOP(+, TEMP(rj), CONST(c))), TEMP(ri))
STORE r0[c], ri    MOVE(MEM(CONST(c)), TEMP(ri))
STORE rj[0], ri    MOVE(MEM(TEMP(rj)), TEMP(ri))
MOVEM rj, ri       MOVE(MEM(TEMP(rj)), MEM(TEMP(ri)))

Note: Register r0 is always zero. In the above, we omitted the symmetric variants (e.g., switching the places of TEMPs, MEMs, and CONSTs).

Example (2/3)

Consider the following IL expression

MOVE(MEM(MEM(CONST(y))), MEM(BINOP(+, TEMP(FP), CONST(x))))

to be covered:

        MOVE
       /    \
     MEM     MEM
      |       |
     MEM      +
      |      / \
  CONST y  TEMP  CONST
            FP     x

Example (3/3)

The cover forms a tree (in the machine language):

MOVEM rj, ri
  LOAD rj, r0[y]
  ADDI ri, fp, x

Note that the register allocation and the execution order are left open:
- but instruction selection has already limited the choices for them
- doing instruction selection first has consequences

Note also that we did not describe how we selected this tree coverage.

Greedy selection

A simple approach is to take the maximal munch at each step:
- go top-down until the whole tree is covered
- at each IL node, select the largest fitting pattern
- continue from the uncovered border

For the root node of the example, we can select MOVE(MEM(TEMP(rj)), TEMP(ri)) or MOVE(MEM(TEMP(rj)), MEM(TEMP(ri))). The latter is bigger, thus it is selected. For the left branch we then select MEM(CONST(y)), and for the right BINOP(+, TEMP(FP), CONST(x)).
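To make the procedure concrete, here is a minimal Python sketch of maximal munch over a small subset of the pattern table above. The tuple encoding of IL trees and the munch/emit helpers are illustrative choices of this sketch, not notation from the slides.

```python
# Minimal maximal-munch sketch (illustrative; tuple-encoded IL trees).
# A tree is a tuple: ("MOVE", dst, src), ("MEM", t), ("BINOP", op, l, r),
# ("TEMP", name), or ("CONST", value).

def munch(t, emit):
    """Cover tree t top-down, always taking the largest fitting pattern."""
    kind = t[0]
    if kind == "MOVE":
        dst, src = t[1], t[2]
        if dst[0] == "MEM" and src[0] == "MEM":     # MOVEM rj, ri (largest)
            emit("MOVEM"); munch(dst[1], emit); munch(src[1], emit)
        elif dst[0] == "MEM":                       # STORE rj[0], ri
            emit("STORE"); munch(dst[1], emit); munch(src, emit)
    elif kind == "MEM":
        inner = t[1]
        if inner[0] == "BINOP" and inner[1] == "+" and inner[3][0] == "CONST":
            emit("LOAD ri, rj[c]"); munch(inner[2], emit)  # MEM(e + CONST c)
        elif inner[0] == "CONST":
            emit("LOAD ri, r0[c]")                         # MEM(CONST c)
        else:
            emit("LOAD ri, rj[0]"); munch(inner, emit)     # MEM(e)
    elif kind == "BINOP":
        op, l, r = t[1], t[2], t[3]
        if op == "+" and r[0] == "CONST":                  # ADDI ri, rj, c
            emit("ADDI"); munch(l, emit)
        else:                                              # three-register form
            emit({"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}[op])
            munch(l, emit); munch(r, emit)
    elif kind == "CONST":
        emit("ADDI ri, r0, c")                             # materialize constant
    elif kind == "TEMP":
        pass                                               # already in a register

tree = ("MOVE",
        ("MEM", ("MEM", ("CONST", "y"))),
        ("MEM", ("BINOP", "+", ("TEMP", "FP"), ("CONST", "x"))))
munch(tree, print)
# prints MOVEM, LOAD ri, r0[c], ADDI -- matching the cover of Example (3/3)
```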

Code-generator compilers

Automated Tools

The evolution of programming languages and computer architectures has created the need to automate the code-synthesis phase of compilers. Several techniques have been proposed:
- some of these techniques have become feasible
- they are gaining acceptance in practice

However, compared to front ends, where LL, LR, etc. parsers really do parse the whole input, automated code generators usually do only partial generation:
- they help the programmer with the trivial code
- difficult situations are usually coded manually
- this is laborious, as instruction sets are large

Challenges

Front ends are easy to write. What, then, is the problem with back ends? It is enough for a front-end translation to be correct; a back-end translation must also be "optimal". The success of a code-generator compiler is measured by the following key criteria:
- minimization of the machine-dependent modules
- reasonable speed and compactness of the target code produced by the generated code generator
- reasonable speed and compactness of the code generator itself
- reasonable speed of the code-generator compiler

Using context-free grammars

A code generator can be construed as a syntax-directed translator:
- the source representation is a linearized prefix form of the intermediate representation (for simplicity, we assume trees in our introduction)
- it is parsed by an LR parser built for a context-free grammar
- the grammar describes the target machine

Representing machines as LR grammars:
- every instruction variant of the target machine is described by a grammar rule
- the rules use a prefix form of the intermediate representation
- nonterminals stand for the results of instructions
- nonterminals stand for the addressing modes

Code generation with parsers

A parser attempts to find a parse tree for the entire source tree in the intermediate language:
- the tree is linearized without loss of information
- code is generated for it along with parsing
- the parse tree is the translation result in the machine language

Since there are (typically) several ways to parse the input, several machine-code sequences may correspond to it. The ambiguities are usually resolved by heuristics and simplifications.

Example (1/2): Prefix form of trees

The input tree is linearized in prefix order:

:= + a r2 + deref + deref + b r2 y deref + c r2

Figure: An example input, nodes numbered in their linearization order:

:= (1)
  + (2)
    a (3)
    r2 (4)
  + (5)
    deref (6)
      + (7)
        deref (8)
          + (9)
            b (10)
            r2 (11)
        y (12)
    deref (13)
      + (14)
        c (15)
        r2 (16)
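As a small illustration of the linearization step, the sketch below flattens a tuple-encoded version of this tree into its prefix token stream; the tuple encoding is an assumption of the sketch, not notation from the slides.

```python
# Prefix linearization of a tuple-encoded IL tree (illustrative sketch).

def prefix(t):
    """Yield the labels of tree t in prefix (preorder) order."""
    if isinstance(t, tuple):
        label, *children = t
        yield label
        for child in children:
            yield from prefix(child)
    else:
        yield t

tree = (":=",
        ("+", "a", "r2"),
        ("+", ("deref", ("+", ("deref", ("+", "b", "r2")), "y")),
              ("deref", ("+", "c", "r2"))))
print(" ".join(prefix(tree)))
# := + a r2 + deref + deref + b r2 y deref + c r2
```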

Example (2/2): Prefix form of patterns

An example of a code-generator rule is

reg.1 ::= + reg.1 deref + const.1 reg.2    emitting: add reg.1, const.1(reg.2)

The rule matches an addition instruction:

+ (result: reg.1)
  reg.1
  deref
    +
      const.1
      reg.2

Figure: The pattern for the addition instruction add reg,const(reg)

Code generation with an LR parser

This can be viewed as a form of tree rewriting:
- a parse corresponds to a covering of the tree with instruction templates
- the instruction templates are also trees
- however, we use linearized forms of both

We can use parsing because of the linearization, but this means that the input is seen only once. LR: L = left-to-right consumption of the input, R = rightmost derivation of the parse tree (in reverse). Like any parser generating output, the parser has actions:
- actions are performed at reductions
- code is emitted during the actions (reductions), as sketched below
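The fragment below is a hedged sketch of how such grammar rules can carry emission actions: the rules are written as plain data, and a reduce-time callback emits the instruction. The RULES table and on_reduce are illustrative names invented for this sketch; a real system would feed a table like this to an LR-parser generator.

```python
# Illustrative rule table for a parser-driven code generator.
# Each rule: (result nonterminal, prefix pattern, instruction template).
# A parser generator would turn the patterns into an LR automaton; here we
# only show the reduce-time action that emits code.

RULES = [
    ("reg", ["+", "reg", "deref", "+", "const", "reg"], "add {0}, {1}({2})"),
    ("reg", ["+", "reg", "reg"],                        "add {0}, {1}"),
    ("reg", ["deref", "+", "const", "reg"],             "load {0}, {1}({2})"),
    ("reg", ["const"],                                  "loadi {0}, {1}"),
]

def on_reduce(rule, operands, emit):
    """Action run when the parser reduces by `rule`: emit the instruction
    with the operand names collected from the parser stack."""
    _, _, template = rule
    emit(template.format(*operands))

# e.g., reducing by the first rule with operands r1, 8, r2:
on_reduce(RULES[0], ["r1", "8", "r2"], print)   # -> add r1, 8(r2)
```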

Problems with parser-based generation

With LR parsing there are several practical difficulties:
- a grammar describing a target machine can be quite large
- the resulting automaton can be of infeasible size
- an LR parser is left-operand biased

Another problem of a purely syntactic approach (a context-free grammar) is the difficulty of specifying target-machine architectural constraints:
- register restrictions on addressing modes
- instructions producing results in multiple locations
- condition codes

Generator based on tree rewriting

Code generation can be viewed as a global tree rewriting process:
- an instruction is described by a tree pattern
- patterns may have nonterminal symbols
- nonterminals stand for intermediate results

Instead of the simple greedy approach, global optimization can be done using global cost computations:
- a cover tree is called a minimal cover of an intermediate representation if no cheaper cover exists
- the cost of a cover tree is the sum of the costs of its cover nodes
- the cost of a node is the cost of the rule associated with it

The problem: how do we find the minimal-cost cover?
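A standard answer is bottom-up dynamic programming: compute, for every tree node, the cheapest cover rooted there, then read the instruction choices off afterwards. The sketch below illustrates the idea over tuple-encoded trees with made-up costs; it is a simplified sketch under those assumptions, not the algorithm of any particular tool.

```python
# Bottom-up dynamic programming for minimal-cost covers (illustrative).
# Trees are tuples as before; each candidate pattern has a cost, and the
# cost of a cover is the pattern cost plus the minimal costs of the
# subtrees left uncovered (the nonterminal positions of the pattern).

OP2INSTR = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}

def best(t):
    """Return (cost, instruction, uncovered subtrees) of the cheapest
    cover whose root pattern matches node t."""
    candidates = []
    kind = t[0]
    if kind == "BINOP":
        op, l, r = t[1], t[2], t[3]
        if op == "+" and r[0] == "CONST":
            candidates.append((1, "ADDI", [l]))        # covers the CONST too
        candidates.append((1, OP2INSTR[op], [l, r]))
    elif kind == "MEM":
        inner = t[1]
        if inner[0] == "BINOP" and inner[1] == "+" and inner[3][0] == "CONST":
            candidates.append((3, "LOAD ri, rj[c]", [inner[2]]))
        candidates.append((3, "LOAD ri, rj[0]", [inner]))
    elif kind == "TEMP":
        candidates.append((0, "TEMP", []))             # already in a register
    elif kind == "CONST":
        candidates.append((1, "ADDI ri, r0, c", []))   # materialize constant
    # total cost = own cost + cheapest covers of the uncovered subtrees
    # (recomputed recursively here; a real implementation would memoize)
    return min(((c + sum(best(s)[0] for s in subs), instr, subs)
                for c, instr, subs in candidates), key=lambda x: x[0])

tree = ("MEM", ("BINOP", "+", ("TEMP", "FP"), ("CONST", "x")))
cost, instr, _ = best(tree)
print(cost, instr)   # 3 LOAD ri, rj[c] -- cheaper than an ADDI plus a LOAD
```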

TWIG-like generators

The original Twig (by Aho et al.) is a code-generator compiler that uses a tree-rewriting scheme. Since then, other similar generators have been developed. Twig uses tree-rewrite rules of the form

replacement <- template { cost } = { action }

where replacement is a single node, template is a tree, cost is a code fragment that computes the cost of the template, and action is a code fragment.

TWIG pattern matching and cost computation

The tree-matching algorithm is a generalization of Aho and Corasick's linear-time keyword-matching algorithm, as suggested by Hoffmann and O'Donnell:
- the intermediate representation is treated as a string
- the patterns are treated as strings
- several matches are found at the same time
- the matching runs in linear time

Twig uses dynamic programming to find the minimal cover. The dynamic-programming algorithm is a simplification of Aho and Johnson's optimal code-generation algorithm: instead of a cost vector, Twig uses a single scalar cost. You can find descriptions of these in the literature (omitted from the slides of this course).

Register allocation

Introduction to register allocation

Register allocation can be divided into sub-problems:
- register recognition: which values need to be stored (pseudo-registers)
- register allocation: which of those values are stored in registers
- register assignment: where exactly they are stored

Register allocation is about finding storage for a computation using static methods:
- keeping data close to the functional units (caches do this dynamically)
- scratchpads store larger chunks of data (compared to registers); we will omit scratchpad allocation by compilers

Instructions using registers are cheaper than memory-based instructions (time-wise, energy-wise, etc.).

Fixed register allocations

Registers have different usages. Typically, some are reserved for special purposes and the others are allocated freely. Typical special uses:
- stack pointer
- frame pointer
- heap pointer
- indexing

Register allocators target the freely allocatable registers; however, there can be (and often are!) restrictions on their usage, and such restrictions complicate the allocation.

Registers and data flow

Register allocation is typically based on data-flow constraints and heuristic cost analysis. The allocation is usually divided into:
- local register allocation (basic blocks)
- global register allocation

Once again: optimizing loops is the key. Registers are allocated for values (not for variables):
- one variable can have several values
- variables can share values
- this is closely connected to liveness and SSA

Note: Typically we use liveness data (or next uses, or def-use chains), but backward chains can also be needed for the heuristics.
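As a minimal sketch of where the liveness data comes from, the backward pass below computes the live sets between the statements of a basic block; the (defs, uses) encoding of statements is an assumption of this sketch. Run on the basic block of the example that follows, it reproduces the live sets shown there.

```python
# Backward liveness over a basic block (illustrative sketch).
# Each statement is (set of defined values, set of used values), and
# live_before(s) = uses(s) | (live_after(s) - defs(s)).

def liveness(block, live_out):
    """Return the live sets before, between, and after the statements."""
    live = [set(live_out)]
    for defs, uses in reversed(block):
        live.append(uses | (live[-1] - defs))
    live.reverse()
    return live

# The basic block used in the example below:
block = [
    ({"v3"}, {"v1", "v2"}),        # S1: v3 <- v1 / v2
    ({"v5"}, {"v3"}),              # S2: v5 <- a[v3]
    (set(),  {"v7", "v2"}),        # S3: a[v7] <- v2
    ({"v4"}, {"v3", "v2"}),        # S4: v4 <- v3 / v2
    ({"v6"}, {"v5", "v4"}),        # S5: v6 <- v5 - v4
]
for live in liveness(block, live_out={"v6"}):
    print(sorted(live))
# reproduces the live sets of Example (1/4) below
```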

Interference graphs

Register allocation can be done by comparing the liveness regions of values. This approach is closely connected to SSA. We build an interference graph between the values:
- the nodes are the values
- there is an edge between two nodes if one is live while the other is defined
- two values cannot be put into the same register if there is an edge between their nodes

The allocation can be computed by coloring the graph. Note: Graph coloring is a classical NP-complete graph problem. Heuristic approaches used in compilers include priority-based (Chow-style) and bottom-up (Chaitin-style) methods; a small Chaitin-style sketch follows the worked example below.

Example (1/4)

The next example uses a heuristic. Consider the program:

S1: v3 <- v1 / v2
S2: v5 <- a[v3]
S3: a[v7] <- v2
S4: v4 <- v3 / v2
S5: v6 <- v5 - v4

Liveness:

before S1:      v1 v2 v7
between S1/S2:  v2 v3 v7
between S2/S3:  v2 v3 v5 v7
between S3/S4:  v2 v3 v5
between S4/S5:  v4 v5
after S5:       v6

Example (2/4)

Thus, we get the interference graph over v1, ..., v7 with the edges v1-v2, v1-v7, v2-v3, v2-v5, v2-v7, v3-v5, v3-v7, v5-v7, and v4-v5; v6 is isolated.

Figure: The interference graph.

Example (3/4)

We use a stack to do the coloring: we repeatedly stack nodes that have fewer neighbors than there are registers, or mark a node as a potential spill if there is no other choice. Assume three registers. We can stack, e.g., v6, v1, and v4; all the rest then have more than two neighbors, so let us select v5 as the potential spill. Thus, we get the stack (bottom to top):

v6 v1 v4 v5 (potential spill) v2 v3 v7

Popping the stack, we make the assignment v7 (R0), v3 (R1), v2 (R2). The neighbors of v5 now occupy all the registers (colors), thus v5 becomes an actual spill.

Example (4/4)

Based on the coloring, we end up with the allocation (in popping order; "edges to" lists the colors already taken by the node's neighbors):

node  edges to      selection
v7    -             R0
v3    R0            R1
v2    R0, R1        R2
v5    R0, R1, R2    mem
v4    -             R0
v1    R0, R2        R1
v6    -             R0
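A compact sketch of this Chaitin-style simplify/select loop is given below; it rebuilds the example's graph and reproduces the assignment above. The data structures and tie-breaking order are illustrative choices of this sketch, so a different stacking order (and thus a different valid coloring) is equally possible.

```python
# Chaitin-style graph coloring (illustrative simplify/select sketch).

def color(edges, nodes, k):
    """Return ({node: color index}, set of spilled nodes) for k registers."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)

    # Simplify: push low-degree nodes; mark a potential spill otherwise.
    stack, removed = [], set()
    while len(removed) < len(nodes):
        degree = lambda n: len(adj[n] - removed)
        rest = [n for n in nodes if n not in removed]
        low = [n for n in rest if degree(n) < k]
        pick = low[0] if low else max(rest, key=degree)  # spill candidate
        stack.append(pick); removed.add(pick)

    # Select: pop and give each node the lowest color its neighbors lack.
    coloring, spilled = {}, set()
    while stack:
        n = stack.pop()
        taken = {coloring[m] for m in adj[n] if m in coloring}
        free = [c for c in range(k) if c not in taken]
        if free:
            coloring[n] = free[0]
        else:
            spilled.add(n)                               # an actual spill
    return coloring, spilled

edges = [("v1", "v2"), ("v1", "v7"), ("v2", "v3"), ("v2", "v5"),
         ("v2", "v7"), ("v3", "v5"), ("v3", "v7"), ("v5", "v7"),
         ("v4", "v5")]
nodes = ["v6", "v1", "v4", "v5", "v2", "v3", "v7"]  # order steers stacking
print(color(edges, nodes, k=3))
# colors 0, 1, 2 correspond to R0, R1, R2:
# v7->R0, v3->R1, v2->R2, v4->R0, v1->R1, v6->R0; v5 is spilled
```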

Instruction scheduling

Introduction to instruction scheduling

The order of execution must be decided:
- this is often done in several phases (and using several levels of abstraction)
- in the following, we use our simple example machine

For a basic block, its DAG gives all the correct execution orders; any traversal respecting the dependences is valid:
- different orders can yield different performance
- selecting the best one is called instruction scheduling

Instruction scheduling is NP-complete in theory (and difficult in practice), so heuristics must be used:
- the issue is even more complex for parallel architectures
- scheduling constraints (etc.) must be added to the IL
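A minimal sketch of enumerating the valid execution orders is given below: it topologically sorts the dependence DAG of the example that follows. The node numbers and dependence edges are taken from that example; the enumeration itself is a textbook algorithm, not specific to these slides.

```python
# Valid execution orders = topological orders of the basic-block DAG
# (illustrative sketch over the computational nodes of the later example).

def topo_orders(nodes, deps):
    """Yield every execution order that respects the dependence edges.
    deps maps a node to the set of nodes it depends on."""
    def extend(order, done):
        if len(order) == len(nodes):
            yield list(order)
            return
        for n in nodes:
            if n not in done and deps[n] <= done:   # all dependences done
                order.append(n); done.add(n)
                yield from extend(order, done)
                order.pop(); done.remove(n)
    yield from extend([], set())

# Computational nodes 3, 4, 6, 7, 9 of the value DAG, with their
# dependences on each other (leaf operands omitted):
deps = {3: set(), 4: {3}, 6: {3}, 7: {6, 4}, 9: {7}}
for order in topo_orders([3, 4, 6, 7, 9], deps):
    print(order)
# prints [3, 4, 6, 7, 9] and [3, 6, 4, 7, 9]
```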

Example (1/4)

Consider the following basic block:

i <- i / j
m <- i
k <- i / j
i <- a[i]
n <- i - k
s <- m / j
m <- i - s
a[t] <- m

We apply the local value-numbering algorithm to the code.
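Since the slides apply local value numbering without spelling it out, here is a hedged sketch of the algorithm; the three-address encoding is an illustrative assumption, and the final store a[t] <- m is omitted for brevity. Its node sharing matches the value DAG on the next slide (k and s get the same value number, as do n and the second m), although the numbering itself differs.

```python
# Local value numbering over a basic block (illustrative sketch).
# Statements are (dest, op, arg1, arg2); plain copies use op "=".

def value_number(block):
    """Map each computation to a value number, reusing numbers for
    repeated (op, operands) combinations; the table is the value DAG."""
    num_of = {}      # current value number of each variable
    table = {}       # (op, vn1, vn2) -> value number
    nodes = []       # the DAG nodes, indexed by value number
    def vn(x):
        if x not in num_of:                # a leaf: an initial value
            nodes.append(("leaf", x)); num_of[x] = len(nodes) - 1
        return num_of[x]
    for dest, op, a, b in block:
        if op == "=":                      # copy: share the value number
            num_of[dest] = vn(a)
            continue
        key = (op, vn(a), vn(b))
        if key not in table:               # new computation: new DAG node
            nodes.append(key); table[key] = len(nodes) - 1
        num_of[dest] = table[key]
    return nodes, num_of

block = [
    ("i", "/", "i", "j"),      # i <- i / j
    ("m", "=", "i", None),     # m <- i
    ("k", "/", "i", "j"),      # k <- i / j
    ("i", "[]", "a", "i"),     # i <- a[i]
    ("n", "-", "i", "k"),      # n <- i - k
    ("s", "/", "m", "j"),      # s <- m / j   (shares k's node)
    ("m", "-", "i", "s"),      # m <- i - s   (shares n's node)
]
nodes, num_of = value_number(block)
print(num_of)   # k and s share a number; n and m share a number
```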

Example (2/4) 9 []= 7 - n1, m2 8 t1 6 [] i3 4 / k1, s1 5 a1 3 / i2, m1 1 2 i1 j1 Figure: The value DAG for the basic block 38/40

Example (3/4)

The final execution order is usually decided after doing the instruction selection (e.g., assume the DAG to be represented in our IL and covered by our machine-language templates). A topological ordering of the computational nodes gives the possible execution orders:

3 4 6 7 9
3 6 4 7 9

Figure: Execution orders for the DAG (using value-node numbers). The orders are read from left to right, and machine instructions are represented by the value numbers of the root nodes of the patterns they cover.

Example (4/4)

We get the following code if we select the topmost execution order and assume two registers are available (with R1 holding i and R2 holding j at the beginning of the basic block):

div R1, R1, R2    ; (3) note the order of the operands!
div R2, R1, R2    ; (4)
load R1, R1[a]    ; (6)
sub R1, R1, R2    ; (7)
load R2, R0[t]    ; remember the semantics of R0!
store R2[a], R1   ; (9)

Note that the choice of the order affected our register allocation. Our register allocator decided to keep the variable t in memory (i.e., it is loaded into a register just prior to its use).