Instruction Selection and Scheduling

The Problem
Writing a compiler is a lot of work. We would like to reuse components whenever possible and to automate the construction of components.
Front End -> Middle End -> Back End
Today's lecture: automating instruction selection and scheduling. Front-end construction is largely automated; the middle end is largely hand-crafted; (parts of) the back end can be automated.

Definitions
Instruction selection: mapping IR into assembly code. Assumes a fixed storage mapping & code shape. Combines operations, uses address modes.
Instruction scheduling: reordering operations to hide latencies. Assumes a fixed program (set of operations). Changes the demand for registers.
Register allocation: deciding which values will reside in registers. Changes the storage mapping, may add false sharing. Concerns the placement of data & memory operations.

The Problem
Modern computers (still) have many ways to do anything. Consider a register-to-register copy in ILOC. The obvious operation is i2i ri => rj, but many others exist:
addi ri,0 => rj      subi ri,0 => rj      lshifti ri,0 => rj
multi ri,1 => rj     divi ri,1 => rj      rshifti ri,0 => rj
ori ri,0 => rj       xori ri,0 => rj      ... and others

The Problem
Modern computers (still) have many ways to do anything. A human would ignore all of these; an algorithm must look at all of them & find a low-cost encoding, taking context into account (busy functional unit?).
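To make the context point concrete, here is a small sketch (not from the slides) of a selector that chooses among equivalent ILOC encodings of a copy based on which functional units are busy. The unit names and cycle costs are invented for illustration.

```python
# Sketch: choosing among equivalent encodings of a register-to-register copy.
# The functional-unit names and cycle costs below are hypothetical.
CANDIDATES = [
    # (opcode, functional unit it occupies, cost in cycles)
    ("i2i",   "copy", 1),
    ("addi",  "alu",  1),   # addi ri,0 => rj
    ("multi", "mult", 3),   # multi ri,1 => rj
]

def select_copy(busy_units):
    """Pick the cheapest copy encoding whose functional unit is free this cycle."""
    free = [(cost, op) for op, unit, cost in CANDIDATES if unit not in busy_units]
    return min(free, key=lambda t: t[0])[1]
```

With no units busy this picks i2i; if the copy unit is occupied it falls back to addi, and only uses the expensive multi when nothing else is free.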

The Goal
We want to automate the generation of instruction selectors. A machine description feeds a back-end generator, which emits tables that drive a pattern-matching engine in the back end. With this description-based retargeting, the machine description can also help with scheduling & allocation.

The Big Picture
We need pattern-matching techniques that produce good code (by some metric for "good") and run quickly. A treewalk code generator runs quickly. How good was the code?
Tree (a * b):
        x
  IDENT <a,arp,4>   IDENT <b,arp,8>
Treewalk Code:              Desired Code:
loadi 4 => r5               loadai rarp,4 => r5
loadao rarp,r5 => r6        loadai rarp,8 => r6
loadi 8 => r7               mult r5,r6 => r7
loadao rarp,r7 => r8
mult r6,r8 => r9

Tree-oriented IR Suggests pattern matching on trees Tree-patterns as input, matcher as output Each pattern maps to a target-machine instruction sequence Use bottom-up rewrite systems

Linear IR Suggests using some sort of string matching Strings as input, matcher as output Each string maps to a target-machine instruction sequence Use text matching or peephole matching

Peephole Matching
Basic idea: the compiler can discover local improvements locally. Look at a small set of adjacent operations; move a peephole over the code & search for improvements.
Classic example: store followed by load
Original code:              Improved code:
storeai r1 => rarp,8        storeai r1 => rarp,8
loadai rarp,8 => r15        i2i r1 => r15

Peephole Matching
Another class of rewrites: simple algebraic identities
Original code:              Improved code:
addi r2,0 => r7             mult r4,r2 => r10
mult r4,r7 => r10

Peephole Matching
A third class: jump to a jump
Original code:              Improved code:
jumpi -> L10                jumpi -> L11
L10: jumpi -> L11           L10: jumpi -> L11
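The first two rewrites above can be sketched as an executable pass. This toy version (the tuple encoding is ours, not the book's) slides a two-instruction window left to right; it assumes the register defined by a removed addi is dead after its one use, where a real simplifier would check liveness. The jump-to-jump rewrite is omitted because it needs a label table.

```python
def peephole(code):
    """One left-to-right pass with a two-instruction window.
    Ops are tuples: ("storeai", val, base, offset), ("loadai", base, offset, dst),
    ("addi", src, const, dst), ("mult", src1, src2, dst), ("i2i", src, dst).
    """
    out = []
    for op in code:
        prev = out[-1] if out else None
        # store followed by load from the same address: reuse the stored value
        if (prev and prev[0] == "storeai" and op[0] == "loadai"
                and (prev[2], prev[3]) == (op[1], op[2])):
            out.append(("i2i", prev[1], op[3]))
            continue
        # algebraic identity: addi x,0 => t feeding the next op; use x directly
        if prev and prev[0] == "addi" and prev[2] == 0:
            x, t = prev[1], prev[3]
            if t in op[1:-1]:
                out.pop()  # drop the identity op (assumes t is dead after this use)
                op = (op[0], *(x if a == t else a for a in op[1:-1]), op[-1])
        out.append(op)
    return out
```

On the slide's examples, the store/load pair becomes a store plus i2i, and the addi/mult pair collapses to a single mult on the original register.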

Peephole Matching
Implementing it: early systems used a limited set of hand-coded patterns; the window size ensured quick processing. Modern peephole instruction selectors break the problem into three tasks:
IR -> Expander -> LLIR -> Simplifier -> LLIR -> Matcher -> ASM

Peephole Matching
Expander: turns IR code into a low-level IR (LLIR) by operation-by-operation, template-driven rewriting. This causes a significant, albeit constant, expansion in code size.

Peephole Matching
Simplifier: looks at the LLIR through a window and rewrites it, using forward substitution, algebraic simplification, local constant propagation, and dead-effect elimination. It performs local optimization within the window. This is the heart of the peephole system; the benefit of peephole optimization shows up in this step.

Peephole Matching
Matcher: compares the simplified LLIR against a library of patterns, picks the low-cost pattern that captures the effects, and generates the assembly-code output.

Example
Original IR Code:
OP    Arg1  Arg2  Result
mult  2     y     t1
sub   x     t1    w
(t1 ends up in r14, w in r20)
Expand => LLIR Code:
r10 <- 2
r11 <- @y
r12 <- rarp + r11
r13 <- MEM(r12)
r14 <- r10 * r13
r15 <- @x
r16 <- rarp + r15
r17 <- MEM(r16)
r18 <- r17 - r14
r19 <- @w
r20 <- rarp + r19
MEM(r20) <- r18

Example
Simplify => LLIR Code:
r13 <- MEM(rarp + @y)
r14 <- 2 * r13
r17 <- MEM(rarp + @x)
r18 <- r17 - r14
MEM(rarp + @w) <- r18

Example
Match => ILOC (Assembly) Code:
loadai rarp,@y => r13
multi r13,2 => r14
loadai rarp,@x => r17
sub r17,r14 => r18
storeai r18 => rarp,@w
The process introduced all memory operations & temporary names, and turned out pretty good code.
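The simplifier's forward-substitution step can be sketched as follows. This toy version (our own encoding, not the book's) represents LLIR as (dest, expr-string) pairs and inlines a definition only when it is used exactly once and its right-hand side is a constant or simple additive address expression; it also assumes each register is defined only once, as in the example.

```python
import re
from collections import Counter

def simplify(llir):
    """Forward-substitute definitions used exactly once whose right-hand side
    is a constant or address expression (no loads, no non-additive arithmetic)."""
    uses = Counter()
    for _, expr in llir:
        uses.update(re.findall(r"r\d+", expr))
    inline, out = {}, []
    for dest, expr in llir:
        for reg, repl in inline.items():
            expr = re.sub(rf"\b{reg}\b", repl, expr)
        if dest and uses[dest] == 1 and re.fullmatch(r"[\w@ +]+", expr):
            inline[dest] = expr          # fold it into its single use
        else:
            out.append((dest, expr))
    return out

llir = [("r10", "2"), ("r11", "@y"), ("r12", "rarp + r11"),
        ("r13", "MEM(r12)"), ("r14", "r10 * r13"),
        ("r15", "@x"), ("r16", "rarp + r15"), ("r17", "MEM(r16)"),
        ("r18", "r17 - r14"), ("r19", "@w"), ("r20", "rarp + r19"),
        (None, "MEM(r20) <- r18")]
```

Running simplify on this twelve-operation LLIR reproduces the five-operation result on the slide: the address arithmetic and constants fold away, while the loads, the multiply, the subtract, and the store survive for the matcher.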

Making It All Work
Details: the LLIR is largely machine independent; the target machine is described as a set of LLIR-to-ASM patterns; the actual pattern matching uses a hand-coded pattern matcher. Several important compilers use this technology. It seems to produce good portable instruction selectors (gcc); its key strength appears to be late low-level optimization.

Definitions
Instruction selection: mapping IR into assembly code. Assumes a fixed storage mapping & code shape. Combines operations, uses address modes.
Instruction scheduling: reordering operations to hide latencies. Assumes a fixed program (set of operations). Changes the demand for registers.
Register allocation: deciding which values will reside in registers. Changes the storage mapping, may add false sharing. Concerns the placement of data & memory operations.

What Makes Code Run Fast? Operations have non-zero latencies Modern machines can issue several operations per cycle Execution time is order-dependent

What Makes Code Run Fast?
Assumed latencies (conservative):
Operation   Cycles
load        3
store       3
loadi       1
add         1
mult        2
fadd        1
fmult       2
shift       1
branch      0 to 8
Loads & stores may or may not block; non-blocking loads fill those issue slots. Branch costs vary with the path taken. The scheduler should hide the latencies.

Example: w <- w * 2 * x * y * z
Each load causes the following add or mult to stall.
Cycles  Simple schedule
 1      loadai r0,@w => r1
 4      add r1,r1 => r1
 5      loadai r0,@x => r2
 8      mult r1,r2 => r1
 9      loadai r0,@y => r2
12      mult r1,r2 => r1
13      loadai r0,@z => r2
16      mult r1,r2 => r1
18      storeai r1 => r0,@w
21      r1 is free
2 registers, 20 cycles

Example: w <- w * 2 * x * y * z
Schedule the loads early. Each load still takes 3 cycles, but the add no longer stalls, and the multiplies wait far less.
Cycles  Schedule loads early
 1      loadai r0,@w => r1
 2      loadai r0,@x => r2
 3      loadai r0,@y => r3
 4      add r1,r1 => r1
 5      mult r1,r2 => r1
 6      loadai r0,@z => r2
 7      mult r1,r3 => r1
 9      mult r1,r2 => r1
11      storeai r1 => r0,@w
14      r1 is free
3 registers, 13 cycles
Reordering operations to improve some metric is called instruction scheduling.
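The two cycle counts can be checked with a tiny single-issue pipeline simulator. This is our own simplification of the machine model: in-order issue, one operation per cycle, results available a fixed latency after issue.

```python
# Latencies from the table above (load/store 3, add 1, mult 2).
LAT = {"loadai": 3, "storeai": 3, "add": 1, "mult": 2}

def schedule_length(ops):
    """ops: list of (opcode, source_regs, dest_reg or None). Returns total cycles."""
    ready = {}                    # register -> cycle its value becomes available
    issue = finish = 0
    for opcode, srcs, dst in ops:
        # wait for the next issue slot and for all operands to be available
        issue = max(issue + 1, max((ready.get(r, 1) for r in srcs), default=1))
        if dst is not None:
            ready[dst] = issue + LAT[opcode]
        finish = max(finish, issue + LAT[opcode] - 1)
    return finish

simple = [("loadai", ["r0"], "r1"), ("add", ["r1", "r1"], "r1"),
          ("loadai", ["r0"], "r2"), ("mult", ["r1", "r2"], "r1"),
          ("loadai", ["r0"], "r2"), ("mult", ["r1", "r2"], "r1"),
          ("loadai", ["r0"], "r2"), ("mult", ["r1", "r2"], "r1"),
          ("storeai", ["r1", "r0"], None)]

early = [("loadai", ["r0"], "r1"), ("loadai", ["r0"], "r2"),
         ("loadai", ["r0"], "r3"), ("add", ["r1", "r1"], "r1"),
         ("mult", ["r1", "r2"], "r1"), ("loadai", ["r0"], "r2"),
         ("mult", ["r1", "r3"], "r1"), ("mult", ["r1", "r2"], "r1"),
         ("storeai", ["r1", "r0"], None)]
```

Under this model schedule_length(simple) is 20 and schedule_length(early) is 13, matching the cycle counts on the slides.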

Instruction Scheduling (The Engineer's View)
The Problem: given a code fragment and the latencies for each operation, reorder the operations to minimize execution time.
The Concept: slow code -> Scheduler (driven by a machine description) -> fast code
The task: produce correct code, minimize wasted cycles, avoid spilling registers, and operate efficiently.

Instruction Scheduling (The Abstract View)
To capture properties of the code, build a dependence graph G: the nodes n in G are operations; there is an edge e = (n1,n2) in G if n2 uses the result of n1.
The Code:
a: loadai r0,@w => r1
b: add r1,r1 => r1
c: loadai r0,@x => r2
d: mult r1,r2 => r1
e: loadai r0,@y => r2
f: mult r1,r2 => r1
g: loadai r0,@z => r2
h: mult r1,r2 => r1
i: storeai r1 => r0,@w
The Dependence Graph (edges run from definition to use):
a -> b, b -> d, c -> d, d -> f, e -> f, f -> h, g -> h, h -> i
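These def-use edges can be computed mechanically. A sketch, tracking only flow dependences (the anti- and output dependences on the reused r2 are assumed away by register renaming, as the deck later does with r3):

```python
def dependence_graph(code):
    """code: list of (label, dest_reg or None, source_regs).
    Returns the set of flow-dependence edges (definer_label, user_label)."""
    last_def = {}       # register -> label of the op that last defined it
    edges = set()
    for label, dest, srcs in code:
        for r in srcs:
            if r in last_def:
                edges.add((last_def[r], label))
        if dest is not None:
            last_def[dest] = label
    return edges

code = [("a", "r1", ["r0"]), ("b", "r1", ["r1", "r1"]),
        ("c", "r2", ["r0"]), ("d", "r1", ["r1", "r2"]),
        ("e", "r2", ["r0"]), ("f", "r1", ["r1", "r2"]),
        ("g", "r2", ["r0"]), ("h", "r1", ["r1", "r2"]),
        ("i", None, ["r1", "r0"])]
```

Applied to the nine operations above, this yields exactly the eight edges of the dependence graph on the slide.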

Instruction Scheduling (What's so difficult?)
All operands must be available before an operation can issue. Multiple operations can be ready at the same time. Moving operations can lengthen register lifetimes, which increases register pressure. Operands can have multiple predecessors. Together, these issues make scheduling hard (NP-complete).

Instruction Scheduling (Local list scheduling)
1. Build a dependence graph, P
2. Compute a priority function over the nodes in P
3. Use list scheduling to construct a schedule, one cycle at a time:
   a. Use a queue of operations that are ready
   b. At each cycle:
      i. choose a ready operation and schedule it
      ii. update the ready queue

Local List Scheduling
Cycle <- 1                                // initialize; the roots of P are ready to schedule
Active <- Ø
Ready <- roots of P
while (Ready ∪ Active ≠ Ø)
    if (Ready ≠ Ø) then                   // remove an op from Ready and schedule it (move to Active)
        remove an op from Ready
        S(op) <- Cycle
        Active <- Active ∪ {op}
    Cycle <- Cycle + 1                    // simulating the architecture; increment the cycle count
    for each op ∈ Active                  // retire ops from Active once their delay has elapsed
        if (S(op) + delay(op) ≤ Cycle) then
            remove op from Active
            for each successor s of op in P
                if (s is ready) then      // all of its operands are now available
                    Ready <- Ready ∪ {s}
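The pseudocode above translates almost line for line into Python. This sketch pins down one concrete policy that the pseudocode leaves open (pick the ready operation with the highest priority) and, like the pseudocode, issues at most one operation per cycle; the example data uses the dependence graph and latency-weighted priorities from the scheduling example that follows.

```python
from collections import defaultdict

def list_schedule(delay, succs, prio):
    """delay: op -> latency; succs: op -> successor list (the graph P);
    prio: op -> priority. Returns S: op -> issue cycle."""
    preds = defaultdict(set)
    for n, ss in succs.items():
        for s in ss:
            preds[s].add(n)
    cycle, S, active = 1, {}, set()
    ready = {n for n in delay if not preds[n]}          # roots of P
    while ready or active:
        if ready:
            op = max(ready, key=lambda n: prio[n])      # highest priority first
            ready.remove(op)
            S[op] = cycle
            active.add(op)
        cycle += 1
        for op in list(active):
            if S[op] + delay[op] <= cycle:              # op has completed
                active.remove(op)
                for s in succs.get(op, []):
                    # s is ready once every predecessor has completed
                    if all(p in S and S[p] + delay[p] <= cycle for p in preds[s]):
                        ready.add(s)
    return S

delay = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}
succs = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"]}
prio = {"a": 13, "b": 10, "c": 12, "d": 9, "e": 10, "f": 7, "g": 8, "h": 5, "i": 3}
```

On this example the scheduler issues a, c, e, b, d, g, f at cycles 1 through 7, h at cycle 9, and the store at cycle 11, finishing in 13 cycles.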

Scheduling Example
1. Build the dependence graph
2. Determine priorities: the longest latency-weighted path to the end of the graph
The Code (with priorities):
a: loadai r0,@w => r1     priority 13
b: add r1,r1 => r1        priority 10
c: loadai r0,@x => r2     priority 12
d: mult r1,r2 => r1       priority 9
e: loadai r0,@y => r2     priority 10
f: mult r1,r2 => r1       priority 7
g: loadai r0,@z => r2     priority 8
h: mult r1,r2 => r1       priority 5
i: storeai r1 => r0,@w    priority 3

Scheduling Example
1. Build the dependence graph
2. Determine priorities: longest latency-weighted path
3. Perform list scheduling (a new register name, r3, is used to shorten a lifetime)
 1) a: loadai r0,@w => r1
 2) c: loadai r0,@x => r2
 3) e: loadai r0,@y => r3
 4) b: add r1,r1 => r1
 5) d: mult r1,r2 => r1
 6) g: loadai r0,@z => r2
 7) f: mult r1,r3 => r1
 9) h: mult r1,r2 => r1
11) i: storeai r1 => r0,@w
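The priorities above can be reproduced by a longest latency-weighted path computation over the dependence graph: each node's priority is its own latency plus the largest priority among its successors. A sketch, with latencies taken from the table given earlier:

```python
def priorities(delay, succs):
    """Longest latency-weighted path from each node to the end of the graph."""
    memo = {}
    def longest(n):
        if n not in memo:
            memo[n] = delay[n] + max((longest(s) for s in succs.get(n, [])),
                                     default=0)
        return memo[n]
    for n in delay:
        longest(n)
    return memo

delay = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}
succs = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"]}
```

Working backward: i = 3, h = 2+3 = 5, f = 2+5 = 7, g = 3+5 = 8, d = 2+7 = 9, and so on up to a = 3+10 = 13, the critical path.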

More List Scheduling
List scheduling breaks down into two distinct classes.
Forward list scheduling: start with the available operations and work forward in time; an operation is ready once all of its operands are available.
Backward list scheduling: start with the operations that have no successors and work backward in time; an operation is ready once all uses of its result have been scheduled.