Instruction Scheduling Beyond Basic Blocks Extended Basic Blocks, Superblock Cloning, & Traces, with a quick introduction to Dominators.

Similar documents
CS 406/534 Compiler Construction Instruction Scheduling

Local Optimization: Value Numbering The Desert Island Optimization. Comp 412 COMP 412 FALL Chapter 8 in EaC2e. target code

Register Allocation. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Introduction to Optimization Local Value Numbering

Just-In-Time Compilers & Runtime Optimizers

Data Structures and Algorithms in Compiler Optimization. Comp314 Lecture Dave Peixotto

Combining Optimizations: Sparse Conditional Constant Propagation

Lab 3, Tutorial 1 Comp 412

Global Register Allocation via Graph Coloring

Intermediate Representations

CS 406/534 Compiler Construction Putting It All Together

Instruction Selection: Preliminaries. Comp 412

Implementing Control Flow Constructs Comp 412

Syntax Analysis, III Comp 412

Parsing II Top-down parsing. Comp 412

Local Register Allocation (critical content for Lab 2) Comp 412

Syntax Analysis, III Comp 412

fast code (preserve flow of data)

CS 406/534 Compiler Construction Instruction Scheduling beyond Basic Block and Code Generation

CS415 Compilers. Instruction Scheduling and Lexical Analysis

Parsing. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Code Shape Comp 412 COMP 412 FALL Chapters 4, 5, 6 & 7 in EaC2e. source code. IR IR target. code. Front End Optimizer Back End

Introduction to Optimization, Instruction Selection and Scheduling, and Register Allocation

Introduction to Optimization. CS434 Compiler Construction Joel Jones Department of Computer Science University of Alabama

Transla'on Out of SSA Form

The Software Stack: From Assembly Language to Machine Code

Lexical Analysis. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Intermediate Representations

CS553 Lecture Profile-Guided Optimizations 3

Arrays and Functions

The Processor Memory Hierarchy

Procedure and Function Calls, Part II. Comp 412 COMP 412 FALL Chapter 6 in EaC2e. target code. source code Front End Optimizer Back End

CSE P 501 Compilers. SSA Hal Perkins Spring UW CSE P 501 Spring 2018 V-1

Runtime Support for Algol-Like Languages Comp 412

Intermediate representation

Control Flow Analysis

Topic 14: Scheduling COS 320. Compiling Techniques. Princeton University Spring Lennart Beringer

Lazy Code Motion. Comp 512 Spring 2011

Loop Invariant Code Mo0on

Syntax Analysis, V Bottom-up Parsing & The Magic of Handles Comp 412

EECS 583 Class 2 Control Flow Analysis LLVM Introduction

Naming in OOLs and Storage Layout Comp 412

Intermediate Representations Part II

Code Merge. Flow Analysis. bookkeeping

Sustainable Memory Use Allocation & (Implicit) Deallocation (mostly in Java)

The ILOC Virtual Machine (Lab 1 Background Material) Comp 412

Computing Inside The Parser Syntax-Directed Translation. Comp 412 COMP 412 FALL Chapter 4 in EaC2e. source code. IR IR target.

Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit

CS293S Redundancy Removal. Yufei Ding

Loop Optimizations. Outline. Loop Invariant Code Motion. Induction Variables. Loop Invariant Code Motion. Loop Invariant Code Motion

Topic I (d): Static Single Assignment Form (SSA)

Overview Of Op*miza*on, 2

CS5363 Final Review. cs5363 1

Compiler Optimization Techniques

Challenges in the back end. CS Compiler Design. Basic blocks. Compiler analysis

The View from 35,000 Feet

Compiler Construction 2009/2010 SSA Static Single Assignment Form

Loops and Locality. with an introduc-on to the memory hierarchy. COMP 506 Rice University Spring target code. source code OpJmizer

CSC D70: Compiler Optimization Register Allocation

Syntax Analysis, VII One more LR(1) example, plus some more stuff. Comp 412 COMP 412 FALL Chapter 3 in EaC2e. target code.

Register Allocation. Global Register Allocation Webs and Graph Coloring Node Splitting and Other Transformations

EECS 583 Class 3 More on loops, Region Formation

Building a Parser Part III

Control-Flow Analysis

Handling Assignment Comp 412

Compiler Design. Register Allocation. Hwansoo Han

Lecture 18 List Scheduling & Global Scheduling Carnegie Mellon

ECE 5775 High-Level Digital Design Automation Fall More CFG Static Single Assignment

Predicated Software Pipelining Technique for Loops with Conditions

TECH. 9. Code Scheduling for ILP-Processors. Levels of static scheduling. -Eligible Instructions are

Example (cont.): Control Flow Graph. 2. Interval analysis (recursive) 3. Structural analysis (recursive)

CS415 Compilers. Lexical Analysis

Engineering a Compiler

Optimizer. Defining and preserving the meaning of the program

Computing Inside The Parser Syntax-Directed Translation. Comp 412

Compiler Design. Fall Control-Flow Analysis. Prof. Pedro C. Diniz

k register IR Register Allocation IR Instruction Scheduling n), maybe O(n 2 ), but not O(2 n ) k register code

CS415 Compilers. Intermediate Represeation & Code Generation

CS 512: Comments on Graph Search 16:198:512 Instructor: Wes Cowan

Generating Code for Assignment Statements back to work. Comp 412 COMP 412 FALL Chapters 4, 6 & 7 in EaC2e. source code. IR IR target.

Lecture 3 Local Optimizations, Intro to SSA

Lecture Compiler Backend

Computing Inside The Parser Syntax-Directed Translation, II. Comp 412 COMP 412 FALL Chapter 4 in EaC2e. source code. IR IR target.

Compiler Passes. Optimization. The Role of the Optimizer. Optimizations. The Optimizer (or Middle End) Traditional Three-pass Compiler

Compiler Optimisation

Lexical Analysis - An Introduction. Lecture 4 Spring 2005 Department of Computer Science University of Alabama Joel Jones

Hiroaki Kobayashi 12/21/2004

CS 432 Fall Mike Lam, Professor. Data-Flow Analysis

Syntax Analysis, VI Examples from LR Parsing. Comp 412

Lecture 7. Instruction Scheduling. I. Basic Block Scheduling II. Global Scheduling (for Non-Numeric Code)

CONTROL FLOW ANALYSIS. The slides adapted from Vikram Adve

Parsing II Top-down parsing. Comp 412

CS577 Modern Language Processors. Spring 2018 Lecture Optimization

Intermediate Representations. Reading & Topics. Intermediate Representations CS2210

Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones

Static single assignment

Runtime Support for OOLs Object Records, Code Vectors, Inheritance Comp 412

ECE 5775 (Fall 17) High-Level Digital Design Automation. Static Single Assignment

Global Register Allocation via Graph Coloring The Chaitin-Briggs Algorithm. Comp 412

Parsing Part II (Top-down parsing, left-recursion removal)

Transcription:

Instruction Scheduling Beyond Basic Blocks Extended Basic Blocks, Superblock Cloning, & Traces, with a quick introduction to Dominators Comp 412 COMP 412 FALL 2016 source code IR Front End Optimizer Back End IR target code Copyright 201, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved. Chapter 11 (& ) in EaC2e

Local Instruction Scheduling Greedy heuristic technique that operates over a single basic block 1. Rename registers to eliminate artificial constraints (anti-dependences) 2. Build a dependence graph 3. Compute one or more priority functions on the graph 4. Schedule all the operations with respect to the dependences & priorities As long as we stay within a single block List scheduling does well Underlying ideas remain easy to understand (& to implement) COMP 412, Fall 2016 1

Scheduling Larger Regions To achieve further improvement, the compiler can schedule over larger regions Extended Basic Block is a maximal set of blocks such that: The set has a single entry node, B i Each block other than B i has exactly one predecessor B j, and B j is in the set Example CFG to the right has three EBBs * Big EBB has two paths {,, } & {, } Other two EBBs are degenerate Each contains a single block { } & { } h i e f l B a b c d j k g COMP 412, Fall 2016 2

Scheduling Larger Regions Superlocal Scheduling Schedule entire paths through an EBB together In example, schedule {,, }, {, }, { }, & { } a b c d For example, first schedule {,, }, then {, }, then {B }, & { } c,e f c g Having in both {,, } and {, } causes conflicts - Moving an op out of causes problems on the other path - Must insert compensation code in unless c is dead in - Increases code space, may not help {, } h i l B j k c was moved for correctness not for speed! COMP 412, Fall 2016 3

Scheduling Larger Regions Superlocal Scheduling Schedule entire paths through an EBB together In example, schedule {,, }, {, }, { }, & { } a b c d,f For example, first schedule {,, }, then {, }, then { }, & { } e f undo f g Having in both {,, } and {, } causes conflicts - Moving an op into causes problems on the other path - May need compensation code in, although renaming may avoid it - Lengthens {, }, even without compensation code h i l B j k Compensation code makes the path even longer! COMP 412, Fall 2016 4

Scheduling Larger Regions Superlocal Scheduling How much improvement can we get? Schielke showed 11 to 12% speed ups Constrained away compensation code So, it is worth doing a b c d e f g h i B j k Cooper & Schielke, Non-local instruction scheduling with limited code growth, LCTES Workshop, June 199 COMP 412, Fall 2016 l

Scheduling Larger Regions Can we be More Aggressive? Create even more context for local scheduler to use Join points in the CFG create blocks that must work in multiple contexts Maybe we can eliminate join points Join points e f a b c d g h i B j k 2 paths Superblock cloning is a transformation that eliminates some of the join points in the CFG l 3 paths Has application to other optimizations COMP 412, Fall 2016 edges 6

Scheduling Larger Regions More Aggressive Superlocal Scheduling Clone blocks at each join point that does not involve a loop-closing branch Superblock Cloning Enabling transformation Start at the root and clone up to a backward branch Creates better conditions for other transformations - e.g., superlocal scheduling or superlocal value numbering Increases code size (replication) Other enabling transformations include: loop unrolling (..2) and inline substitution (..1) a h i l e f B a b j k l a b c d B b c g j k l COMP 412, Fall 2016 edges, but no path is longer

Scheduling Larger Regions More Aggressive Superlocal Scheduling Clone blocks at each join point that does not involve a loop-closing branch Some of the resulting blocks can combine Single successor, single predecessor a b c d e f g h i B a j k B b j k a l b l c l COMP 412, Fall 2016 edges

Scheduling Larger Regions More Aggressive Superlocal Scheduling Clone blocks at each join point that does not involve a loop-closing branch Some of the resulting blocks can combine Single successor, single predecessor Now, schedule the EBB paths {,, }, {,,B q }, {, } Pay attention to compensation code e f a b c d g j k l h i l B a j k l Works well for forward motion Backward motion still need compensation code COMP 412, Fall 2016 4 edges, every path is shorter 9

A Pointed Digression If you want to work in code optimization, you need to get good at manipulating graphs. Dr. Timothy Harvey How does the compiler identify a loop-closing branch? It uses the notion of dominance in a flow graph We can compute dominance using data flow analysis ( 9.2.1 in EaC2e) Definition In a flow graph, y dominates x iff every path from the graph s entry node to x includes y. x DOM(x) We often denote y dominates y as y DOM x.,, node x G, its set of dominators is called DOM(x) node x G, x DOM(x) DOM(x) 1 B,, B, B, Original idea: R.T. Prosser. Applications of Boolean matrices to the analysis of flow diagrams, COMP Proceedings 412, Fall of the 2016 Eastern Joint Computer Conference, Spartan Books, New York, pages 133-13, 199. 10

A Pointed Digression How does the compiler identify a loop-closing branch? B 0 By inspection, (, ) is a loop-closing branch. x DOM(x) B 0 B 0 B 0, B 0, B B 0,, B 0, A More Complex Graph B B B B 0,, B B 0, B B 0,, B, B A loop-closing branch runs from a node x to some node y in DOM(x) B B 0,, B, B B 0,, B, COMP The computation 412, Fall 2016of DOM sets is covered in 9.2.1 in EaC2e. It is straight forward. 11

Another Pointed Digression To improve scheduling, we turned to extended basic blocks and superblock cloning to give the scheduler more context We can value number over extended basic blocks, as well Treat each path in the EBB as a single block and apply LVN Creates the Superlocal Value Numbering algorithm (SVN) Finds more opportunities for improvement Simple approach reanalyzes blocks With SSA name space, can make this more efficient (See..1 in EaC2e) Superblock cloning can be applied before EBB construction in SVN Eliminates more branches and jumps Introduces some code growth These same tricks have been applied to several local optimizations COMP 412, Fall 2016 12

Scheduling Larger Regions Trace Scheduling Start with execution counts for edges Two Options: 1. Profile data instrument the code and run it on representative data interrupt periodically & sample Infer edge counts from perf counters e f a b c d 10 3 g 2. Static estimates use some simple heuristic and guess jumps x 1, branches x 0., backward branches x 10 See the digression on page 42 of EaC2e for more on gathering profile data h i l B 2 j k 3 Block counts can mislead us see B COMP 412, Fall 2016 13

Scheduling Larger Regions Trace Scheduling Start with execution counts for edges Obtained from profiling runs Pick the hot path A trace is a maximal length, acyclic path through the CFG The hot path is the trace that has, at each point, the highest count e f a b c d 10 3 g 2 3 h i B j k l Block counts can mislead us see B COMP 412, Fall 2016 14

Scheduling Larger Regions Trace Scheduling Start with execution counts for edges Obtained from profiling runs Pick the hot path A trace is a maximal length, acyclic path through the CFG The hot path is the trace that has, at each point, the highest count {,,, } Schedule the hot path Compensation code in & B if needed Get the hot path right! If we picked the right hot path, the other blocks do not matter as much Places a premium on quality profiles h i e f l B 2 a b c d j k 10 3 3 g COMP 412, Fall 2016

Scheduling Larger Regions Trace Scheduling the Entire CFG Pick and schedule the hot path Insert compensation code, as needed Remove hot path from CFG Repeat the process until CFG is empty a b c d 10 3 Example {,,, } then {,B } All other edges run between scheduled blocks Idea Hot paths matter most The farther we go off the hot path, the less it matters h i e f l B 2 j k 3 g COMP 412, Fall 2016 16

Scheduling Larger Regions Sketch of the Trace Selection Algorithm mark each edge as eligible or ineligible initialize the trace with the best eligible edge let x be the best eligible edge entering source(trace) let y be the best eligible edge leaving sink(trace) ( pick the hot path ) best implies highest weight while(one of x or y is non-null ) { let z be the best of x and y add z to the appropriate end of the trace let x be the best eligible edge entering source(trace) let y be the best eligible edge leaving sink(trace) } The details of finding the edges are complicated but straightforward Iterate over inbound edges to find max eligible edge Some Critical Definitions: An edge is ineligible if: (1) Both ends are already in a trace, or (2) It is a loop-closing branch (i.e., y dominates x) Best means the edge in a set of edges that has the highest execution frequency COMP 412, Fall 2016 1

Trace Construction on the More Complex Graph Consider the graph to the right Edges annotated with execution frequencies Can rank edges by frequency Sums make sense along paths 3 10 B B B Example Control-Flow Graph 0 COMP 412, Fall 2016 1

Building Traces First Trace Initial edge is <, > 3 10 B B B 0 COMP 412, Fall 2016 19

Building Traces First Trace Initial edge is <, > Algorithm looks at <, >, <,B >, and <, >. <0, > is a loop closing branch and, therefore, ineligible 3 10 B B B 0 COMP 412, Fall 2016 20

Building Traces First Trace Initial edge is <, > Algorithm looks at <, >, <,B >, and <, >. Adds <,B > 3 10 B B B 0 COMP 412, Fall 2016 21

Building Traces First Trace Initial edge is <, > Algorithm looks at <, >, <,B >, and <, >. Adds <,B > Algorithm looks at <, > and <B,0 > 3 10 B B B 0 COMP 412, Fall 2016 22

Building Traces First Trace Initial edge is <, > Algorithm looks at <, >, <,B >, and <, >. Adds <,B > Algorithm looks at <, > and <B,0 > Adds <B,0 > 3 B 10 B B 0 COMP 412, Fall 2016 23

Building Traces First Trace Initial edge is <, > Algorithm looks at <, >, <,B >, and <, >. Adds <,B > Algorithm looks at <, > and <B,0 > Adds <B,0 > Algorithm looks at <, > and adds it 3 B 10 B B The first trace is {,,,B, 0 }. 0 COMP 412, Fall 2016 24

Building Traces Second Trace Initial edge is one of <, > or <,0 > 3 10 B B B 0 COMP 412, Fall 2016 2

Building Traces Second Trace Initial edge is one of <, > or <,0 > Use <,0 > (arbitrary choice) 3 10 B B B 0 COMP 412, Fall 2016 26

Building Traces Second Trace Initial edge is one of <, > or <,0 > Use <,0 > Algorithm considers <, >; <0, > is ineligible. 3 10 B B B 0 COMP 412, Fall 2016 2

Building Traces Second Trace Initial edge is one of <, > or <,0 > Use <,0 > Algorithm considers <, >, <0, > is ineligible. Adds <, > 3 10 B B B 0 COMP 412, Fall 2016 2

Building Traces Second Trace Initial edge is one of <, > or <,0 > Use <,0 > Algorithm considers <, >, <0, > is ineligible. Adds <, > 3 The second trace is { }, a degenerate (single block) trace. Scheduler might get some context from the prior schedule of 10 B B B 0 COMP 412, Fall 2016 29

Building Traces Third Trace Initial edge is <B,B > 3 10 B B B 0 COMP 412, Fall 2016 30

Building Traces Third Trace Initial edge is <B,B > <,B > and <,B > are eligible 3 10 B B B 0 COMP 412, Fall 2016 31

Building Traces Third Trace Initial edge is <B,B > <,B > and <,B > are eligible Adds <,B > 3 10 B B B 0 COMP 412, Fall 2016 32

Building Traces Third Trace Initial edge is <B,B > <,B > and <,B > are eligible Adds <,B > No more edges are eligible 3 The third trace is {, B, B } 10 B Scheduler can change B and B, but has already been scheduled. B B Scheduler can use the previously scheduled for context. 0 COMP 412, Fall 2016 33

Building Traces Fourth Trace Initial edge is <, > 3 10 B B B 0 COMP 412, Fall 2016 34

Building Traces Fourth Trace Initial edge is <, > <, > and <,B > are eligible 3 10 B B B 0 COMP 412, Fall 2016 3

Building Traces Fourth Trace Initial edge is <, > <, > and <,B > are eligible Algorithm chooses <, > 3 10 B B B 0 COMP 412, Fall 2016 36

Building Traces Fourth Trace Initial edge is <, > <, > and <,B > are eligible Algorithm chooses <, > Only <,B > is eligible 3 10 B B B 0 COMP 412, Fall 2016 3

Building Traces Fourth Trace Initial edge is <, > <, > and <,B > are eligible Algorithm chooses <, > Only <,B > is eligible Algorithm chooses <,B > 3 10 B B B 0 COMP 412, Fall 2016 3

Building Traces Fourth Trace Initial edge is <, > <, > and <,B > are eligible Algorithm chooses <, > Only <,B > is eligible Algorithm chooses <,B > 3 Fourth Trace is {,,, B } 10 B Scheduler cannot change, but it schedules others with context from B B 0 COMP 412, Fall 2016 39

Trace Scheduling Scheduling the Traces The compiler then schedules the traces, in order of discovery: 1. {,,,B,0 } 2. {,, 0 } 3. {,B,B } 4. {,,, B } 3 And, it has selected and scheduled every block 10 B B B 0 COMP 412, Fall 2016 40

Trace Scheduling Trace scheduling applies list scheduling in larger contexts Power of the technique comes from longer code sequences Typical basic blocks, in real code (rather than lab test blocks), are short 3 to 10 operations Power of the technique comes from finding traces that execute often Code that executes infrequently has less impact on performance Cardinal principal of optimization: improve frequently executed code Trace scheduling has been used successfully in commercial practice Particularly useful in compilers for VLIW architectures Larger contexts create more instruction-level parallelism COMP 412, Fall 2016 41