Instruction Selection: Preliminaries. Comp 412


COMP 412, Fall 2017
Instruction Selection: Preliminaries
Chapter 11 in EaC2e

source code → Front End → Optimizer → Back End → target code

Copyright 2017, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

The Problem
- Writing a good compiler is a lot of work
  - We would like to reuse components whenever possible
  - We would like to automate the construction of components
- Next 2 lectures: automating instruction selection
  - Front-end construction is largely automated
  - The optimizer is largely hand crafted
  - (Parts of) the back end can be automated

Structure of the Compiler's Back End
- In discussing the compiler's back end, we will assume the model: Selection → Scheduling → Allocation, taking code with virtual registers down to asm with k physical registers
- Instruction selection is largely solved; it was a major problem of the 1970s & 1980s
- Register allocation & instruction scheduling are more complex; both are NP-complete in any real case
- Operation placement is not yet critical
- We assume a unified register set; partitioned register sets add another hard problem
- Parallelism either comes from the application, or is derived with techniques that are beyond the scope of this course (see COMP 515)

Structure of the Compiler's Back End
- What about the IR?
- The common case today is an ILOC-like IR: branches, memory tags, a hierarchy of memory operations, multiple ops/cycle
- clang's IR is ILOC-like but in SSA form; gcc's is a lower-level, linear IR
- Assume that several things are explicit: enough register names, the order of operations & the flow of values
- All of these techniques also work on a tree-like IR

Definitions
- Instruction selection (Chapter 11): maps the IR into assembly code (chooses operations). Assumes a fixed storage mapping & code shape. Combines operations, uses address modes.
- Instruction scheduling (Chapter 12): reorders the operations to hide latencies (moves defs & uses). Assumes a fixed choice of operations. Changes the demand for registers.
- Register allocation (Chapter 13): decides which values will reside in registers (adds ops & places them). Assumes a fixed order of existing operations. Changes the name space; may add false sharing.
These three problems are tightly coupled.

The Big Picture
- Conventional wisdom says that we lose little by solving these problems independently
- Instruction selection: use some form of pattern matching; assume enough registers, or target important values
- Instruction scheduling: within a block, local list scheduling is close to optimal (optimal for > 85% of blocks); across blocks, build a framework in which to apply list scheduling
- Register allocation: start from virtual registers & map enough of them into k physical registers; global allocation requires a global model or paradigm
- This slide is full of fuzzy terms
Local allocation & local scheduling have a similar algorithmic flavor. We don't have a good paradigm for global scheduling.

The Big Picture
- What are today's hard issues?
- Instruction selection: the problems are well solved & tools are available; open issues are the impact of choices on energy consumption & functional unit placement
- Instruction scheduling: modulo scheduling loops with control flow; schemes for scheduling long-latency operations; finding enough ILP to keep functional units (& cores) busy
- Register allocation: the cost of allocation (a particular problem for JITs); better spilling: choice, placement, & granularity; efficiency of the allocator
In the original HotSpot Server Compiler (a JIT), register allocation took 49% of compile time.

How Hard Are These Problems?
- Instruction selection: can make locally optimal choices quickly, with automated tools; global optimality is (undoubtedly) NP-complete, but no one talks about global optimality in selection
- Instruction scheduling: single basic block ⇒ approximate heuristics work quickly & well; the general problem, with control flow, is NP-complete; to move beyond a block, find a CFG path to treat as a single block
- Register allocation: single basic block, no spilling, & one register size ⇒ linear time; the whole-procedure problem is NP-complete; approximate global allocators are widely used and accepted

Definitions
- Instruction selection: mapping the IR into assembly code. Assumes a fixed storage mapping & code shape. Combines operations, uses address modes.
- Instruction scheduling: reordering operations to hide latencies. Assumes a fixed choice of operations. Changes the demand for registers.
- Register allocation: deciding which values will reside in registers. Assumes a fixed order of existing operations. Changes the name space; may add false sharing.

Back To Instruction Selection
- The fundamental problem is that modern computers have many ways to do any single thing
- Consider a register-to-register copy in ILOC. The obvious operation is i2i ri ⇒ rj. Many others exist (any algebraic identity will do):
  addi ri, 0 ⇒ rj      subi ri, 0 ⇒ rj      multi ri, 1 ⇒ rj     divi ri, 1 ⇒ rj
  lshifti ri, 0 ⇒ rj   rshifti ri, 0 ⇒ rj   ori ri, 0 ⇒ rj       xori ri, 0 ⇒ rj
  and others
- Some of these are cheap; some are expensive
- A human would ignore most, if not all, of these
- An algorithm will look at all of them & find the low-cost choice, taking context into account (is a functional unit busy?)
- And ILOC is much simpler than a real ISA
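
The search the slide describes can be sketched mechanically. The sketch below is illustrative only: the cost numbers are invented rather than taken from any real machine description, and a real selector would also weigh context such as functional-unit occupancy.

```python
# Hypothetical cost table for ILOC operations that all implement a
# register-to-register copy. The cycle counts are invented for illustration;
# a real selector reads them from a machine description.
COPY_PATTERNS = [
    ("i2i",     None, 1),   # the obvious choice
    ("addi",    0,    1),
    ("subi",    0,    1),
    ("ori",     0,    1),
    ("xori",    0,    1),
    ("lshifti", 0,    1),
    ("rshifti", 0,    1),
    ("multi",   1,    3),   # multiply: typically slower
    ("divi",    1,    20),  # divide: slower still
]

def cheapest_copy(src, dst):
    """Pick the lowest-cost operation that copies src into dst."""
    op, imm, _cost = min(COPY_PATTERNS, key=lambda p: p[2])
    if imm is None:
        return f"{op} {src} => {dst}"
    return f"{op} {src},{imm} => {dst}"

print(cheapest_copy("r1", "r2"))
```

With ties broken in table order, the search settles on i2i r1 => r2; a human writes that instinctively, while the algorithm arrives at it by exhaustively comparing costs.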

The Goal
- We want to automate the generation of instruction selectors, as we did with parsers & scanners
- A back-end generator consumes a machine description and produces either tables that drive a pattern-matching engine, or a hard-coded pattern matcher; either way, the result is a selector
- Description-based retargeting: the machine description should also help with scheduling & allocation
- Simplification: we will restrict our discussion to signed integers. Expanding to handle other types is straightforward; it adds more rules to the examples, not more complexity.

The Big Picture
- We need pattern-matching techniques that produce good code (under some metric for "good") and run quickly
- The simple schemes we saw earlier (ad hoc SDT, a treewalk on the AST) ran quickly. How good was the code?

Tree: IDENT <a,ARP,4> × IDENT <b,ARP,8>
(Notation: <b,ARP,8> is a variable named b, stored at offset 8 from the ARP. Assume the ARP is in r0.)

AHSDT code:
  loadi 4 ⇒ r5
  loadao r0, r5 ⇒ r6
  loadi 8 ⇒ r7
  loadao r0, r7 ⇒ r8
  mult r6, r8 ⇒ r9

Desired code:
  loadai r0, 4 ⇒ r5
  loadai r0, 8 ⇒ r6
  mult r5, r6 ⇒ r7

The Big Picture
- We need pattern-matching techniques that produce good code and run quickly. The simple schemes we saw earlier ran quickly; how good was the code?

Tree: IDENT <a,ARP,4> × IDENT <b,ARP,8> (assume the ARP is in r0)

AHSDT code:
  loadi 4 ⇒ r5
  loadao r0, r5 ⇒ r6
  loadi 8 ⇒ r7
  loadao r0, r7 ⇒ r8
  mult r6, r8 ⇒ r9

Desired code:
  loadai r0, 4 ⇒ r5
  loadai r0, 8 ⇒ r6
  mult r5, r6 ⇒ r7

This inefficiency is easy to fix. See the digression on page 346 of EaC2e.
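
One way to realize the fix mentioned above is a tiny peephole pass over the ILOC. This is a sketch under assumptions, not the book's code: it folds a loadi that feeds the offset operand of the next loadao into a single loadai, and it assumes the loadi's result register is not used anywhere else.

```python
import re

# Match "loadi c => rK" and "loadao rB,rK => rD" (ILOC spelled as plain text).
LOADI  = re.compile(r"loadi (\S+) => (\S+)$")
LOADAO = re.compile(r"loadao (\S+),(\S+) => (\S+)$")

def fold_loadai(block):
    """Fold adjacent loadi/loadao pairs into loadai where the loadi's
    result feeds the loadao's offset operand."""
    out, i = [], 0
    while i < len(block):
        m1 = LOADI.match(block[i]) if i + 1 < len(block) else None
        m2 = LOADAO.match(block[i + 1]) if m1 else None
        if m1 and m2 and m2.group(2) == m1.group(2):
            # loadi c => rK ; loadao rB,rK => rD   becomes   loadai rB,c => rD
            out.append(f"loadai {m2.group(1)},{m1.group(1)} => {m2.group(3)}")
            i += 2
        else:
            out.append(block[i])
            i += 1
    return out

block = ["loadi 4 => r5", "loadao r0,r5 => r6",
         "loadi 8 => r7", "loadao r0,r7 => r8",
         "mult r6,r8 => r9"]
print(fold_loadai(block))
```

On the AHSDT code above this produces loadai r0,4 => r6, loadai r0,8 => r8, and the mult, matching the desired code up to register naming.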

The Big Picture
- We need pattern-matching techniques that produce good code and run quickly. The simple schemes we saw earlier ran quickly; how good was the code?

Tree: IDENT <a,ARP,4> × NUMBER <2>

AHSDT code:
  loadi 4 ⇒ r5
  loadao r0, r5 ⇒ r6
  loadi 2 ⇒ r7
  mult r6, r7 ⇒ r8

Desired code:
  loadai r0, 4 ⇒ r5
  multi r5, 2 ⇒ r7

The Big Picture
- Same example: Tree: IDENT <a,ARP,4> × NUMBER <2>

AHSDT code:
  loadi 4 ⇒ r5
  loadao r0, r5 ⇒ r6
  loadi 2 ⇒ r7
  mult r6, r7 ⇒ r8

Desired code:
  loadai r0, 4 ⇒ r5
  multi r5, 2 ⇒ r7

To produce the multi, the matcher must use information from both of these nodes. The loadai example was purely local.

The Big Picture
- Pattern matching is not the answer to all problems with code quality. LVN finds this algebraic identity easily.

Tree: IDENT <a,ARP,4> × NUMBER <2>

AHSDT code:
  loadi 4 ⇒ r5
  loadao r0, r5 ⇒ r6
  loadi 2 ⇒ r7
  mult r6, r7 ⇒ r8

Another possibility, via an algebraic identity, that might take less time & energy:
  loadai r0, 4 ⇒ r5
  add r5, r5 ⇒ r7
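
The identity LVN exploits here is simple strength reduction: 2 × x = x + x. A minimal, hypothetical rewrite of that one case (not the course's LVN implementation, which works on hash-based value numbers) might look like:

```python
def reduce_strength(op, src, const, dst):
    """Rewrite 'multi src,2 => dst' as 'add src,src => dst'.
    All other operations pass through unchanged."""
    if op == "multi" and const == 2:
        return f"add {src},{src} => {dst}"
    return f"{op} {src},{const} => {dst}"

print(reduce_strength("multi", "r5", 2, "r7"))   # the slide's example
print(reduce_strength("multi", "r5", 3, "r7"))   # passes through unchanged
```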

How do we perform this kind of matching?
- A tree-oriented IR suggests pattern matching on trees: the process takes tree patterns as input and produces a matcher as output; each pattern maps to a sequence of target-machine ops; use dynamic programming or bottom-up rewrite systems
- A linear IR suggests using some sort of string matching: the process takes strings as input and produces a matcher as output; each string maps to a sequence of target-machine ops; use text matching (Aho-Corasick) or peephole matching
- In practice, both work well & produce efficient matchers. Either can be implemented as a table-driven or a hard-coded matcher.
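
To make the tree-matching idea concrete, here is a greedy "maximal munch" sketch: at each node it tries the largest applicable tile first (folding an address computation into a loadai-style load) before falling back to smaller tiles. Node names (DEREF, PLUS, VAL, NUM), opcodes, and costs are illustrative assumptions; a production matcher would use dynamic programming or a bottom-up rewrite system to guarantee a lowest-cost tiling rather than this greedy first-fit.

```python
import itertools

new_reg = itertools.count(1)          # fresh virtual register names

def tile(node, code):
    """Tile a low-level AST bottom-up; return (register, cost)."""
    label, kids = node[0], node[1:]
    if label == "NUM":                # constant leaf
        r = f"r{next(new_reg)}"
        code.append(f"loadi {kids[0]} => {r}")
        return r, 1
    if label == "VAL":                # value already in a register
        return kids[0], 0
    if label == "DEREF" and kids[0][0] == "PLUS":
        base, off = kids[0][1], kids[0][2]
        if base[0] == "VAL" and off[0] == "NUM":
            # larger tile: fold the address add into the load
            r = f"r{next(new_reg)}"
            code.append(f"loadai {base[1]},{off[1]} => {r}")
            return r, 3
    if label == "DEREF":              # smaller tile: explicit load
        a, c = tile(kids[0], code)
        r = f"r{next(new_reg)}"
        code.append(f"load {a} => {r}")
        return r, c + 3
    if label == "PLUS":
        a, ca = tile(kids[0], code)
        b, cb = tile(kids[1], code)
        r = f"r{next(new_reg)}"
        code.append(f"add {a},{b} => {r}")
        return r, ca + cb + 1
    raise ValueError(f"no tile for {label}")

code = []
ast = ("DEREF", ("PLUS", ("VAL", "r0"), ("NUM", 4)))  # load of a at ARP+4
reg, cost = tile(ast, code)
print(code, cost)
```

The big tile wins here: the whole subtree becomes a single loadai of cost 3, where the small tiles (loadi + add + load) would cost 1 + 1 + 3.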

Equivalence of Trees and Linear Code
- Trees and linear code are interchangeable for the kinds of pattern matching done in instruction selection
- To expose low-level detail, we need an AST with a low level of abstraction

Quads for a ← b − 2 × c:
  Op  Arg1  Arg2  Result
  ×   2     c     t1
  −   b     t1    a

(The slide also shows the low-level AST for a ← b − 2 × c: a GETS node whose left subtree computes a's address as VAL(ARP) + NUM(4), and whose right subtree subtracts from b's value, fetched by two dereferences of VAL(ARP) − NUM(16) because b is call-by-reference, the product of NUM(2) and a dereference of LAB(@G) + NUM(12).)
NOTE: all interior nodes have their values in registers.
Storage map: a at ARP+4; b at ARP−16 (call-by-reference); c at @G+12.

Equivalence of Trees and Linear Code
- The same low-level AST for a ← b − 2 × c, with the leaf node types spelled out:
  - VAL: a value already in a register (e.g., rarp)
  - NUM: a constant that fits in the immediate field of an operation (e.g., multi)
  - LAB: an ASM label, a constant that does not fit in an immediate field but fits in a loadi
- This kind of mundane detail now matters.
Storage map: a at ARP+4; b at ARP−16 (call-by-reference); c at @G+12.

Prefix Notation
- To describe low-level ASTs, we need a concise notation
- Write the operator first, then its operands in parentheses: a tree whose root is + with children VAL and Regi becomes +(VAL, Regi), the linear prefix form of the tree

Prefix Notation (corrected after class)
- To describe low-level ASTs, we need a concise notation
- For the low-level AST for a ← b − 2 × c, with ◆ denoting a memory dereference and subscripts added to create unique names:
  - a's address: +(VAL1, NUM1)
  - b's value: ◆(◆(−(VAL2, NUM2))), two dereferences because b is call-by-reference
  - 2 × c: ×(NUM3, ◆(+(LAB1, NUM4)))
- The whole tree, linearized:
  GETS(+(VAL1, NUM1), −(◆(◆(−(VAL2, NUM2))), ×(NUM3, ◆(+(LAB1, NUM4)))))
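
The linearization itself is a straightforward preorder walk. The sketch below uses my own notation, with DEREF standing in for the slide's memory-reference symbol and subscripts generated by a per-label counter, and reproduces the slide's unique names:

```python
from collections import defaultdict

def to_prefix(node, counts=None):
    """Emit the linear prefix form of a low-level AST; leaves get
    subscripts (VAL1, NUM2, ...) to create unique names."""
    if counts is None:
        counts = defaultdict(int)
    label, kids = node[0], node[1:]
    if label in ("VAL", "NUM", "LAB"):
        counts[label] += 1
        return f"{label}{counts[label]}"
    args = ",".join(to_prefix(k, counts) for k in kids)
    return f"{label}({args})"

# a <- b - 2 * c: a at ARP+4, b call-by-reference at ARP-16, c at @G+12
ast = ("GETS",
       ("PLUS", ("VAL", "arp"), ("NUM", 4)),
       ("MINUS",
        ("DEREF", ("DEREF", ("MINUS", ("VAL", "arp"), ("NUM", 16)))),
        ("MULT", ("NUM", 2),
         ("DEREF", ("PLUS", ("LAB", "@G"), ("NUM", 12))))))

print(to_prefix(ast))
```

A preorder walk numbers the leaves left to right, which is exactly how subscripts like VAL2 and NUM4 arise on the slide.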

Prefix Notation
- To describe the machine operations, we need a concise notation as well:
  - ri + cj becomes +(ri, cj)
  - cj + ri becomes +(cj, ri), the pattern for the commutative variant of +(ri, cj)
  - ri + rj becomes +(ri, rj)
- With each tree pattern, we associate a code template and a cost. The template shows how to implement the subtree; the cost is used to drive the process toward low-cost code sequences.
- Prefix notation quickly blurs the distinction between linear and tree IRs

Equivalence of Trees and Linear Code
- Of course, we can also build low-level trees that model low-level linear code
- Those trees look a lot like the dependence graphs built in your schedulers, and the construction algorithm is quite similar

Low-level linear code for a ← b − 2 × c:
  r09 ← 2
  r10 ← @G
  r11 ← 12           # @c
  r12 ← r10 + r11
  r13 ← MEM(r12)
  r14 ← r09 × r13
  r15 ← 16           # @b
  r16 ← rarp − r15
  r17 ← MEM(r16)
  r18 ← MEM(r17)
  r19 ← r18 − r14
  r20 ← 4            # @a
  r21 ← rarp + r20
  MEM(r21) ← r19

(The slide's figure shows the corresponding low-level tree, with each interior node labeled by the register that holds its value.)
Storage map: a at ARP+4; b at ARP−16 (call-by-reference); c at @G+12.
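
The construction the slide alludes to can be sketched with the same bookkeeping a scheduler uses for its dependence graph: walk the linear code, map each defined register to the subtree that computes it, and let the store root the tree. The opcode names and tuple encoding here are my assumptions, not the course's data structures.

```python
def build_tree(ops):
    """Rebuild an expression tree from linear code.
    Each op is (opcode, arg1, arg2_or_None, dest_register)."""
    defs = {}                          # register -> subtree that defines it
    def subtree(x):
        return defs.get(x, x)          # constants & live-in regs stay leaves
    root = None
    for op, a, b, dst in ops:
        if op == "store":              # MEM(dst) <- a roots the tree
            root = ("STORE", subtree(dst), subtree(a))
        elif b is None:
            defs[dst] = (op.upper(), subtree(a))
        else:
            defs[dst] = (op.upper(), subtree(a), subtree(b))
    return root

# The slide's code for a <- b - 2 * c
code = [
    ("loadi", 2, None, "r09"),
    ("loadi", "@G", None, "r10"),
    ("loadi", 12, None, "r11"),        # @c
    ("add", "r10", "r11", "r12"),
    ("load", "r12", None, "r13"),
    ("mult", "r09", "r13", "r14"),
    ("loadi", 16, None, "r15"),        # @b
    ("sub", "r_arp", "r15", "r16"),
    ("load", "r16", None, "r17"),
    ("load", "r17", None, "r18"),
    ("sub", "r18", "r14", "r19"),
    ("loadi", 4, None, "r20"),         # @a
    ("add", "r_arp", "r20", "r21"),
    ("store", "r19", None, "r21"),
]
tree = build_tree(code)
print(tree[0])
```

The resulting root is a STORE whose address subtree is ADD(r_arp, LOADI 4), mirroring the shape of the tree on the slide.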