CS 406/534 Compiler Construction: Instruction Scheduling beyond Basic Blocks and Code Generation


CS 406/534 Compiler Construction: Instruction Scheduling beyond Basic Blocks and Code Generation
Prof. Li Xu, Dept. of Computer Science, UMass Lowell, Fall 2004
Part of the course lecture notes are based on Prof. Keith Cooper, Prof. Ken Kennedy and Dr. Linda Torczon's teaching materials at Rice University. All rights reserved.

What We Did Last Time
- Motivation for instruction scheduling
- Instruction scheduling for basic blocks
- List scheduling
- Lab 3
CS406/534 Fall 2004, Prof. Li Xu

Today's Goals
- Instruction scheduling beyond basic blocks
- Code generation

Scheduling Larger Regions
Motivation: limited ILP within basic blocks.
Superlocal scheduling: work one EBB at a time. The example has four EBBs.
[CFG figure: B1 {a,b,c,d} branches to B2 {e,f} and B3 {g}; B2 branches to B4 {h,i} and B5 {j,k}; B3 falls into B5; B4 and B5 both reach B6 {l}.]
Basic block: straight-line code. EBB: a set B = {B1, B2, ..., Bn} in which each of B2..Bn has a unique predecessor in B.
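
The EBB definition above can be turned into code. The sketch below is my own illustration, not from the slides: an EBB path starts at the entry block or at a join point (a block with more than one predecessor) and extends through successors that have exactly one predecessor. The example CFG is the one in the slides' figure.

```python
def ebb_paths(succ, entry):
    """Enumerate the maximal EBB paths of a CFG given as a successor map."""
    # Count predecessors of every block.
    preds = {b: 0 for b in succ}
    for b in succ:
        for s in succ[b]:
            preds[s] += 1
    # EBB roots: the entry block plus every join point.
    roots = {b for b in succ if b == entry or preds[b] > 1}
    paths = []

    def walk(b, path):
        path = path + [b]
        # Extend only into successors that belong to this EBB.
        ext = [s for s in succ[b] if s not in roots]
        if not ext:
            paths.append(path)          # path is maximal
        for s in ext:
            walk(s, path)

    for r in sorted(roots):
        walk(r, [])
    return paths

cfg = {"B1": ["B2", "B3"], "B2": ["B4", "B5"], "B3": ["B5"],
       "B4": ["B6"], "B5": ["B6"], "B6": []}
print(ebb_paths(cfg, "B1"))
# [['B1', 'B2', 'B4'], ['B1', 'B3'], ['B5'], ['B6']]
```

This reproduces the slide's count: four EBB paths, of which only {B1,B2,B4} and {B1,B3} are nontrivial.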

Scheduling Larger Regions
Superlocal scheduling works on one EBB at a time. Of the example's four EBBs, only two have nontrivial paths: {B1,B2,B4} and {B1,B3}. Having B1 on both paths causes conflicts: moving an op out of B1 causes problems.
[CFG figure as on the previous slide.]

Scheduling Larger Regions
Moving an op out of B1 causes problems. Example: moving c from B1 down into B2 gives B2 {c,e,f}, but the path through B3 no longer computes c.
[CFG figure: B1 {a,b,d}; B2 {c,e,f}; B3 {g} — no c here!]

Scheduling Larger Regions
Moving an op out of B1 forces the compiler to insert compensation code in B3, which increases code space. (And that path wasn't the one being optimized for speed!)
[CFG figure: B2 {c,e,f}; B3 {c,g}, with the compensating copy of c.]

Scheduling Larger Regions
Moving an op into B1 also causes problems.
[CFG figure as before.]

Scheduling Larger Regions
Moving an op into B1 (say, f from B2) lengthens the path {B1,B3} and adds computation to it; it may need compensation code, too (an "undo f" in B3). Renaming may avoid the undo.
[CFG figure: B1 {a,b,c,d,f}; B3 gets "undo f" — this makes the off-trace path even longer!]

Scheduling Larger Regions
More aggressive superlocal scheduling: clone blocks to create more context. Join points create blocks that must work in multiple contexts: two paths from B1 reach B5, and three paths from B1 reach B6.
[CFG figure as before.]

Scheduling Larger Regions
Cloning: B5 becomes B5a and B5b, and B6 becomes B6a, B6b, and B6c, one copy per incoming path. Blocks with a single successor and a single predecessor can then combine.
[CFG figure: B2 → B4 and B5a; B3 → B5b; B4 → B6a; B5a → B6b; B5b → B6c.]

Scheduling Larger Regions
After cloning and combining, schedule the EBBs {B1,B2,B4}, {B1,B2,B5a}, and {B1,B3,B5b}, paying heed to compensation code. This works well for forward motion; backward motion still has off-path problems, and speeding up one path can slow down others (the undo problem).
[CFG figure: B4, B5a, and B5b have each absorbed a copy of l.]

Scheduling Larger Regions
Trace scheduling: start with execution counts for the edges, obtained by profiling.
[CFG figure as before.]

Scheduling Larger Regions
Trace scheduling: with the profiled edge counts in hand, pick the hot path.
[CFG figure with counts: 10 into B1; B1→B2: 7, B1→B3: 3; B2→B4: 5, B2→B5: 2; B3→B5: 3; B4→B6: 5; B5→B6: 5.]
Block counts could mislead us: see B5, whose count matches B4's even though it lies off the hot path.

Scheduling Larger Regions Trace Scheduling Start with execution counts for edges Obtained by profiling Pick the hot path B 1,B 2,B 4,B 6 Schedule it Compensation code in B 3,B 5 if needed Get the hot path right! B 4 If we picked the right path, the other blocks do not matter as much Places a premium on quality profiles h i B 2 5 5 B 6 e f l B 1 7 B 5 2 5 a b c d j k 10 B 3 3 3 g CS406/534 Fall 2004, Prof. Li Xu 16 16

Scheduling Larger Regions
Trace scheduling the entire CFG: pick and schedule the hot path, insert compensation code, remove the hot path from the CFG, and repeat the process until the CFG is empty. The idea: hot paths matter, and the farther off the hot path a block is, the less it matters.
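
The "pick the hot path" step can be sketched as a greedy walk over the profiled edge counts. This is my own illustration of one simple policy (follow the hottest outgoing edge from each block), not the slides' exact algorithm; the CFG and counts are the example from the figure.

```python
def pick_trace(succ, edge_count, start):
    """Greedily follow the highest-count outgoing edge from start."""
    trace, b = [], start
    while b is not None and b not in trace:   # stop at exit or back edge
        trace.append(b)
        hot, best = None, 0
        for s in succ[b]:
            c = edge_count.get((b, s), 0)
            if c > best:
                best, hot = c, s
        b = hot
    return trace

succ = {"B1": ["B2", "B3"], "B2": ["B4", "B5"], "B3": ["B5"],
        "B4": ["B6"], "B5": ["B6"], "B6": []}
counts = {("B1", "B2"): 7, ("B1", "B3"): 3, ("B2", "B4"): 5,
          ("B2", "B5"): 2, ("B3", "B5"): 3, ("B4", "B6"): 5,
          ("B5", "B6"): 5}
print(pick_trace(succ, counts, "B1"))   # ['B1', 'B2', 'B4', 'B6']
```

With the slide's profile this recovers the hot path B1, B2, B4, B6; a full trace scheduler would then remove those blocks and repeat from the hottest remaining block.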

Summary
Generic instruction scheduling framework:
- List scheduling: precedence graph, forward and backward scheduling, many heuristics; works well for basic blocks.
- List scheduling beyond basic blocks: create larger regions (EBBs, traces, cloning); compensation code.
- Other techniques: software pipelining for loop kernels.

Structure of a Compiler
Scanner (O(n)) → Parser (O(n)) → Analysis & Optimization (O(n log n)) → Instruction Selection (either fast or NP-complete) → Instruction Scheduling (NP-complete) → Register Allocation (NP-complete, down to k registers).
A compiler is a lot of fast stuff followed by some hard problems. The hard stuff is mostly in code generation and optimization. For superscalars, it is register allocation and scheduling that count.

Structure of a Compiler
For the rest of the course, we assume the following model: Analysis & Optimization → Instruction Selection → Instruction Scheduling → Register Allocation, each phase working on a register-based IR. Selection is relatively simple (the RISC factor); allocation and scheduling are complex, and operation placement can become a problem.
What about the IR? We use a low-level, RISC-like IR called ILOC. It has enough (virtual) registers, and it was designed to model modern architectures: branches, compares, and labels; memory tags; a hierarchy of loads and stores; and provision for multiple ops per cycle.

Definitions
- Instruction selection: mapping the IR into assembly code. Assumes a fixed storage mapping and code shape; combines operations and uses address modes.
- Instruction scheduling: reordering operations to hide latencies. Assumes a fixed program (set of operations); changes the demand for registers.
- Register allocation: deciding which values will reside in registers. Changes the storage mapping, may add false sharing; concerns the placement of data and memory operations.
These three problems are tightly coupled.

The Big Picture: How Hard Are These Problems?
- Instruction selection: can make locally optimal choices with an automated tool; global optimality is (undoubtedly) NP-complete.
- Instruction scheduling: single-basic-block heuristics work quickly; the general problem, with control flow, is NP-complete.
- Register allocation: a single basic block, with no spilling and one register size, is linear time; the whole-procedure problem is NP-complete.

The Big Picture
Conventional wisdom says that we lose little by solving these problems independently:
- Instruction selection: use some form of pattern matching; assume enough registers, or target important values.
- Instruction scheduling: within a block, list scheduling is close to optimal; across blocks, build a framework to apply list scheduling.
- Register allocation: start from virtual registers and map enough of them into the k physical registers; with targeting, focus on a good priority heuristic.
(This slide is full of fuzzy terms.)

Code Shape
Definition: all the nebulous properties of the code that impact performance and code quality. It includes the approach taken for different constructs, their costs, storage requirements and mapping, and the choice of operations. Code shape is the end product of many decisions, big and small.
Impact: code shape influences algorithm choice and results, and it can encode important facts or hide them. Rule of thumb: expose as much derived information as possible. Examples: explicit branch targets in ILOC simplify analysis; the hierarchy of memory operations in ILOC (see EaC).

Code Shape
Motivating example: x + y + z can be shaped three ways:
  x + y → t1, t1 + z → t2        (x + y) + z
  x + z → t1, t1 + y → t2        (x + z) + y
  y + z → t1, t1 + x → t2        (y + z) + x
What if x is 2 and z is 3? What if y + z is evaluated earlier? Addition is commutative and associative for integers, so the best shape for x + y + z depends on contextual knowledge, and there may be several conflicting options.
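
The point about contextual knowledge can be made concrete. In this sketch (my own illustration; shape_sum and its operand format are hypothetical), the compiler groups the operands whose values are known at compile time so they fold into a single constant, which is exactly why (x + z) + y is the best shape when x = 2 and z = 3:

```python
def shape_sum(operands):
    """Fold the known-constant operands of a sum; return (constant, leftovers).

    operands: list of (name, value) pairs, value None if unknown at
    compile time. Legal because integer + is commutative & associative.
    """
    const = sum(v for _, v in operands if v is not None)
    names = [n for n, v in operands if v is None]
    return const, names

# x + y + z with x = 2 and z = 3 known: fold x + z, leave y for run time.
print(shape_sum([("x", 2), ("y", None), ("z", 3)]))   # (5, ['y'])
```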

Code Shape
Another example: the case statement.
- Implement it as cascaded if-then-else statements: the cost depends on where your case actually occurs; O(number of cases).
- Implement it as a binary search: needs a sorted table of the case conditions to search; uniform O(log n) cost.
- Implement it as a jump table: look up an address in a table and jump to it; uniform (constant) cost, but requires a dense case set.
The compiler must choose the best implementation strategy.

Code Shape
The key code-quality issue is holding values in registers.
When can a value be safely allocated to a register? When only one name can reference its value. Pointers, parameters, aggregates, and arrays all cause trouble.
When should a value be allocated to a register? When it is both safe and profitable.
Encoding this knowledge into the IR: use code shape to make it known to every later phase. Assign a virtual register to anything that can go into one; load or store the others at each reference. ILOC has textual memory tags on loads, stores, and calls, and a hierarchy of loads and stores. This relies on a strong register allocator.

Generating Code for Expressions

  int expr(node) {
    int result, t1, t2;
    switch (type(node)) {
      case PLUS, MINUS, TIMES, DIVIDES:
        t1 = expr(left_child(node));
        t2 = expr(right_child(node));
        result = NextRegister();
        emit(op(node), t1, t2, result);
        break;
      case IDENTIFIER:
        t1 = base(node);
        t2 = offset(node);
        result = NextRegister();
        emit(loadAO, t1, t2, result);
        break;
      case NUMBER:
        result = NextRegister();
        emit(loadI, val(node), none, result);
        break;
    }
    return result;
  }

The concept: use a simple treewalk evaluator and bury the complexity in the routines it calls — base(), offset(), and val(). It implements the expected behavior: it visits and evaluates the children, emits code for the op itself, and returns the register holding the result. It works for simple expressions and is easily extended to other operators, but it does not handle control flow.

Generating Code for Expressions
Example: x + y. Using the treewalk above:
  expr(x) produces:    loadI  @x      => r1
                       loadAO r0, r1  => r2
  expr(y) produces:    loadI  @y      => r3
                       loadAO r0, r3  => r4
  NextRegister() gives r5, and emit(add, r2, r4, r5) produces:
                       add    r2, r4  => r5

Generating Code for Expressions
Example: x - 2 * y generates:
  loadI  @x      => r1
  loadAO r0, r1  => r2
  loadI  2       => r3
  loadI  @y      => r4
  loadAO r0, r4  => r5
  mult   r3, r5  => r6
  sub    r2, r6  => r7
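
The slides' treewalk pseudocode can be made runnable. In this Python rendering (my own sketch: the symbol table, register naming, and instruction tuples are simplified assumptions), each identifier is handled as in the pseudocode's IDENTIFIER case, i.e. a single loadAO from base register and offset tag:

```python
_next = 0      # register counter
code = []      # emitted instructions, as (op, src1, src2, dst) tuples

def next_register():
    global _next
    _next += 1
    return f"r{_next}"

def emit(op, a, b, dst):
    code.append((op, a, b, dst))

# name -> (base register, offset tag); a stand-in for base()/offset()
symtab = {"x": ("r0", "@x"), "y": ("r0", "@y")}

def expr(node):
    """Treewalk code generator: returns the register holding the result."""
    if isinstance(node, tuple):                   # operator: (op, left, right)
        op, l, r = node
        t1, t2 = expr(l), expr(r)                 # evaluate children first
        result = next_register()
        emit({"+": "add", "-": "sub", "*": "mult"}[op], t1, t2, result)
    elif isinstance(node, str):                   # IDENTIFIER
        base, offset = symtab[node]
        result = next_register()
        emit("loadAO", base, offset, result)
    else:                                         # NUMBER
        result = next_register()
        emit("loadI", node, None, result)
    return result

expr(("+", "x", "y"))
for instr in code:
    print(instr)
```

Notice that, as the slides say, the recursion works bottom-up: the add for the root is emitted only after both children have loaded their values.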

Generating Code in the Parser
We need to generate an initial IR form. Chapter 5 talks about ASTs and ILOC: we might generate an AST, use it for some high-level, near-source work (type checking, optimization), then traverse it and emit a lower-level IR similar to ILOC.
The big picture: the recursive algorithm really works bottom-up; actions on non-leaves occur after their children are done. We can encode the same basic structure into an ad-hoc syntax-directed translation (SDT) scheme: identifiers load themselves and stack their virtual register name; operators emit the appropriate code and stack the resulting VR name; assignment requires evaluation to an lvalue or an rvalue.

Ad-hoc SDT versus a Recursive Treewalk
The treewalk code above corresponds to this ad-hoc SDT scheme:

  Goal: Expr                { $$ = $1; } ;
  Expr: Expr PLUS Term      { t = NextRegister();
                              emit(add, $1, $3, t); $$ = t; }
      | Expr MINUS Term     { }
      | Term                { $$ = $1; } ;
  Term: Term TIMES Factor   { t = NextRegister();
                              emit(mult, $1, $3, t); $$ = t; }
      | Term DIVIDES Factor { }
      | Factor              { $$ = $1; } ;
  Factor: NUMBER            { t = NextRegister();
                              emit(loadI, val($1), none, t); $$ = t; }
        | ID                { t1 = base($1); t2 = offset($1);
                              t = NextRegister();
                              emit(loadAO, t1, t2, t); $$ = t; }

Handling Assignment (lhs ← rhs)
Strategy:
- Evaluate the rhs to a value (an rvalue).
- Evaluate the lhs to a location (an lvalue). If the lvalue is a register, move the rhs into it; if the lvalue is an address, store the rhs to it.
- If the rvalue and lvalue have different types, evaluate the rvalue to its natural type, then convert that value to the type of *lvalue.
Unambiguous scalars go into registers; ambiguous scalars or aggregates go into memory.

Handling Assignment
What if the compiler cannot determine the type of the rhs? This is a property of the language and of the specific program. If type safety is desired, the compiler must insert a run-time check: add a tag field to the data items to hold type information. The code for assignment becomes more complex. Generated code template:

  evaluate rhs
  if type(lhs) ≠ rhs.tag
    then convert rhs to type(lhs), or signal a run-time error
  lhs ← rhs

This is much more complex than it would be if the compiler knew the types.
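
The generated-code template above behaves, at run time, like the sketch below. This is my own illustration of the idea, not the slides' code: the Tagged record, checked_assign, and the single int-to-float conversion rule are all illustrative assumptions.

```python
class Tagged:
    """A run-time value carrying a type tag, as in the slide's scheme."""
    def __init__(self, tag, value):
        self.tag, self.value = tag, value

def checked_assign(lhs_type, rhs):
    """The run-time check the compiler must emit when types are unknown."""
    if rhs.tag == lhs_type:
        return rhs.value                       # tags agree: plain move
    if lhs_type == "float" and rhs.tag == "int":
        return float(rhs.value)                # a legal conversion
    raise TypeError(f"cannot assign {rhs.tag} to {lhs_type}")

print(checked_assign("float", Tagged("int", 3)))   # 3.0
```

Compile-time type checking, the topic of the next slide, aims to prove the tags agree so that both the tag field and this check disappear.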

Handling Assignment
Compile-time type checking: the goal is to eliminate both the check and the tag. Determine, at compile time, the type of each subexpression, and use the compile-time types to decide whether a run-time check is needed.
Optimization strategy:
- If the compiler knows the type, move the check to compile time.
- Unless tags are needed for garbage collection, eliminate them.
- If a check is needed, try to overlap it with other computation.
One can design the language so that all checks are static.

Handling Arrays
First, we must agree on a storage scheme.
- Row-major order (most languages): lay the array out as a sequence of consecutive rows; the rightmost subscript varies fastest: A[1,1], A[1,2], A[1,3], A[2,1], A[2,2], A[2,3].
- Column-major order (Fortran): lay the array out as a sequence of columns; the leftmost subscript varies fastest: A[1,1], A[2,1], A[1,2], A[2,2], A[1,3], A[2,3].
- Indirection vectors (Java): a vector of pointers to pointers to ... to values; takes much more space, trades indirection for arithmetic, and is not amenable to analysis.

Laying Out Arrays
The concept: for a 2×4 array A with elements A[1,1] through A[2,4], the storage layouts have distinct and different cache behavior.
- Row-major order: A = 1,1  1,2  1,3  1,4  2,1  2,2  2,3  2,4
- Column-major order: A = 1,1  2,1  1,2  2,2  1,3  2,3  1,4  2,4
- Indirection vectors: A points to two row vectors, (1,1  1,2  1,3  1,4) and (2,1  2,2  2,3  2,4)

Computing an Array Address
For A[i]: @A + (i - low) × sizeof(A[1]); in general, base(A) + (i - low) × sizeof(A[1]).

Computing an Array Address
For A[i]: @A + (i - low) × sizeof(A[1]). Example: int A[1:10] has low = 1. Making low 0 gives faster access (it saves a subtraction). The element size is almost always a power of 2, known at compile time, so use a shift for speed.

Computing an Array Address
What about A[i1, i2]? This stuff looks expensive — lots of implicit +, -, and × ops!
- Row-major order, two dimensions: @A + ((i1 - low1) × (high2 - low2 + 1) + i2 - low2) × sizeof(A[1])
- Column-major order, two dimensions: @A + ((i2 - low2) × (high1 - low1 + 1) + i1 - low1) × sizeof(A[1])
- Indirection vectors, two dimensions: *(A[i1])[i2], where A[i1] is, itself, a 1-D array reference

Optimizing the Array Address Calculation
In row-major order, with w = sizeof(A[1,1]):
  @A + (i - low1) × (high2 - low2 + 1) × w + (j - low2) × w
which can be factored into
  @A + i × (high2 - low2 + 1) × w + j × w - (low1 × (high2 - low2 + 1) × w) - (low2 × w)
If low_i, high_i, and w are known, the last two terms are a compile-time constant. Define @A0 as
  @A - (low1 × (high2 - low2 + 1) × w + low2 × w)
and len2 as (high2 - low2 + 1). Then the address expression becomes
  @A0 + (i × len2 + j) × w
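
The two address expressions above can be checked against each other directly. This sketch transcribes both formulas; the byte address 1000, the bounds, and the 4-byte element size are illustrative assumptions:

```python
def addr_2d(A, i, j, low1, high1, low2, high2, w):
    """Row-major address of A[i, j], from the original polynomial."""
    len2 = high2 - low2 + 1
    return A + ((i - low1) * len2 + (j - low2)) * w

def addr_2d_precomputed(A0, len2, i, j, w):
    """The same address via the optimized form @A0 + (i*len2 + j)*w."""
    return A0 + (i * len2 + j) * w

# A[1:2, 1:4] of 4-byte elements, stored at byte address 1000.
A, w = 1000, 4
len2 = 4 - 1 + 1
A0 = A - (1 * len2 * w + 1 * w)       # fold the low bounds into @A0
assert addr_2d(A, 2, 3, 1, 2, 1, 4, w) == addr_2d_precomputed(A0, len2, 2, 3, w)
print(addr_2d(A, 2, 3, 1, 2, 1, 4, w))   # 1024
```

Because @A0 and len2 are compile-time constants, the per-reference work drops to one multiply, one add of j, one multiply by w (a shift), and one add of @A0.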

Array References in Procedure Calls
What about arrays as actual parameters? For whole arrays passed as call-by-reference parameters, the callee needs dimension information, so build a dope vector (@A, low1, high1, low2, high2): store the values in the calling sequence, pass the address of the dope vector in the parameter slot, and generate the complete address polynomial at each reference. Some improvement is possible: save len_i and low_i rather than low_i and high_i, and pre-compute the fixed terms in the prologue sequence.
What about call by value? Most call-by-value languages pass arrays by reference; this is a language design issue.

Array Address Calculations in a Loop

  DO J = 1, N
    A[I,J] = A[I,J] + B[I,J]
  END DO

Naïve approach: perform the address calculation twice per iteration.

  DO J = 1, N
    R1 = @A0 + (J × len1 + I) × floatsize
    R2 = @B0 + (J × len1 + I) × floatsize
    MEM(R1) = MEM(R1) + MEM(R2)
  END DO

Array Address Calculations in a Loop
Sophisticated approach: move the common calculations out of the loop.

  R1 = I × floatsize
  c  = len1 × floatsize      ! compile-time constant
  R2 = @A0 + R1
  R3 = @B0 + R1
  DO J = 1, N
    a  = J × c
    R4 = R2 + a
    R5 = R3 + a
    MEM(R4) = MEM(R4) + MEM(R5)
  END DO

Array Address Calculations in a Loop
Very sophisticated approach: convert the multiply to an add (operator strength reduction).

  R1 = I × floatsize
  c  = len1 × floatsize      ! compile-time constant
  R2 = @A0 + R1
  R3 = @B0 + R1
  DO J = 1, N
    R2 = R2 + c
    R3 = R3 + c
    MEM(R2) = MEM(R2) + MEM(R3)
  END DO

See, for example, Cooper, Simpson, & Vick, "Operator Strength Reduction," ACM TOPLAS, Sept. 2001.
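
That the strength-reduced loop visits exactly the same addresses as the naïve one can be verified with a small simulation (my own sketch; the base addresses, I, N, len1, and element width are illustrative assumptions):

```python
def naive(A0, B0, I, N, len1, w):
    """Addresses touched by the naive loop: full polynomial per iteration."""
    addrs = []
    for J in range(1, N + 1):
        r1 = A0 + (J * len1 + I) * w
        r2 = B0 + (J * len1 + I) * w
        addrs.append((r1, r2))
    return addrs

def strength_reduced(A0, B0, I, N, len1, w):
    """Addresses touched after hoisting and operator strength reduction."""
    r1 = I * w
    c = len1 * w                  # compile-time constant
    r2, r3 = A0 + r1, B0 + r1
    addrs = []
    for J in range(1, N + 1):
        r2 += c                   # the multiply became an add
        r3 += c
        addrs.append((r2, r3))
    return addrs

assert naive(1000, 2000, 3, 5, 10, 4) == strength_reduced(1000, 2000, 3, 5, 10, 4)
```

The induction step is the whole trick: adding c = len1 × w once per iteration reproduces the effect of recomputing J × len1 × w from scratch.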

Boolean & Relational Values
How should the compiler represent them? The answer depends on the target machine. There are two classic approaches: a numerical representation and a positional (implicit) representation. The correct choice depends on both context and the ISA.

Boolean & Relational Values
Numerical representation: assign values to TRUE and FALSE, use the hardware AND, OR, and NOT operations, and use a comparison to get a boolean from a relational expression. Examples:

  x < y                  becomes   cmp_LT rx, ry => r1

  if (x < y) then stmt1  becomes   cmp_LT rx, ry => r1
  else stmt2                       cbr r1 -> _stmt1, _stmt2

Boolean & Relational Values
What if the ISA uses a condition code? Then we must use a conditional branch to interpret the result of a compare, which necessitates branches in the evaluation. Example:

  x < y  becomes        cmp    rx, ry => cc1
                        cbr_LT cc1 -> LT, LF
                  LT:   loadI  1 => r2
                        br -> LE
                  LF:   loadI  0 => r2
                  LE:   ... other stmts ...

Condition codes are an architect's hack: they allow the ISA to avoid some comparisons, but they complicate the code for simple cases. This positional representation is much more complex.

Boolean & Relational Values
The last example actually encodes the result in the PC. If the result is only used to control an operation, this may be enough. Example: if (x < y) then a = c + d else a = e + f. Variations on the ILOC branch structure:

Straight condition codes:
        cmp    rx, ry => cc1
        cbr_LT cc1 -> L1, L2
  L1:   add    rc, rd => ra
        br -> LOUT
  L2:   add    re, rf => ra
        br -> LOUT
  LOUT: nop

Boolean compares:
        cmp_LT rx, ry => r1
        cbr    r1 -> L1, L2
  L1:   add    rc, rd => ra
        br -> LOUT
  L2:   add    re, rf => ra
        br -> LOUT
  LOUT: nop

The condition-code version does not directly produce (x < y); the boolean version does. Still, there is no significant difference in the code produced.

Boolean & Relational Values
Conditional move and predication both simplify this code. Example: if (x < y) then a = c + d else a = e + f. Other architectural variations:

Conditional move:
  cmp    rx, ry => cc1
  add    rc, rd => r1
  add    re, rf => r2
  i2i_<  cc1, r1, r2 => ra

Predicated execution:
  cmp_LT rx, ry => r1
  (r1)?  add rc, rd => ra
  (!r1)? add re, rf => ra

Both versions avoid the branches, and both are shorter than the condition-code or boolean-compare versions.

Boolean & Relational Values
Consider the assignment x = (a < b) AND (c < d). Variations on the ILOC branch structure:

Straight condition codes:
        cmp    ra, rb => cc1
        cbr_LT cc1 -> L1, L2
  L1:   cmp    rc, rd => cc2
        cbr_LT cc2 -> L3, L2
  L2:   loadI  0 => rx
        br -> LOUT
  L3:   loadI  1 => rx
        br -> LOUT
  LOUT: nop

Boolean compares:
  cmp_LT ra, rb => r1
  cmp_LT rc, rd => r2
  and    r1, r2 => rx

Here, the boolean compare produces much better code.

Boolean & Relational Values
Conditional move and predication help here, too, for x = (a < b) AND (c < d). Other architectural variations:

Conditional move:
  cmp    ra, rb => cc1
  i2i_<  cc1, rT, rF => r1
  cmp    rc, rd => cc2
  i2i_<  cc2, rT, rF => r2
  and    r1, r2 => rx

Predicated execution:
  cmp_LT ra, rb => r1
  cmp_LT rc, rd => r2
  and    r1, r2 => rx

Conditional move is worse than the boolean compares; predication is identical to them. Context and hardware determine the appropriate choice.

Control Flow: If-then-else
Follow the model for evaluating relationals and booleans with branches. The choice between branching and predication (e.g., on IA-64) depends on:
- Frequency of execution: with an uneven distribution, do what it takes to speed the common case.
- Amount of code in each case: unequal amounts mean predication may waste issue slots.
- Control flow inside the construct: any branching activity within the base cases complicates the predicates and makes branches attractive.

Control Flow: Loops
Evaluate the condition before the loop (if needed), evaluate the condition again after the loop body, and branch back to the top (if needed). This merges the test with the last block of the loop body. The shape: pre-test, loop body, post-test, next block. While, for, do, and until loops all fit this basic model.

Loop Implementation Code
For the loop: for (i = 1; i < 100; i++) { body } next statement

      loadI  1   => r1        ! initialization
      loadI  1   => r2
      loadI  100 => r3
      cmp_GE r1, r3 => r4     ! pre-test
      cbr    r4 -> L2, L1
  L1: body
      add    r1, r2 => r1
      cmp_LT r1, r3 => r5     ! post-test
      cbr    r5 -> L1, L2
  L2: next statement

Break Statements
Many modern programming languages include a break, which exits from the innermost control-flow statement: out of the innermost loop, or out of a case statement. It translates into a jump whose target is the statement outside the control-flow construct, creating a multiple-exit construct. A skip in a loop goes to the next iteration instead.
[Figure: a loop with body blocks B1 and B2; a break in B1 jumps to the next block, while a skip in B2 jumps to the post-test.]
Breaks only make sense if the loop has more than one block.

Control Flow: Case Statements
1. Evaluate the controlling expression.
2. Branch to the selected case.
3. Execute the code for that case.
4. Branch to the statement after the case.
Parts 1, 3, and 4 are well understood; part 2 is the key.

Control Flow: Case Statements
1. Evaluate the controlling expression.
2. Branch to the selected case.
3. Execute the code for that case.
4. Branch to the statement after the case (use break).
Parts 1, 3, and 4 are well understood; part 2 is the key. Strategies for part 2:
- Linear search (nested if-then-else constructs)
- Build a table of the case expressions and binary-search it
- Directly compute an address (requires a dense case set)
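
The third strategy — directly computing an address — can be sketched as follows. This is my own illustration: the dense jump table is modeled as a Python list of handlers indexed by (case value − lowest case value), which mirrors the table of branch targets a compiler would emit:

```python
def dispatch_jump_table(table, low, value, default):
    """O(1) case dispatch for a dense case set in [low, low + len(table))."""
    idx = value - low                 # compute the table index
    if 0 <= idx < len(table):         # bounds check replaces the search
        return table[idx]()           # "jump" through the table entry
    return default()                  # value outside the case set

# Cases 10, 11, 12, each with its own handler.
handlers = [lambda: "case 10", lambda: "case 11", lambda: "case 12"]
print(dispatch_jump_table(handlers, 10, 11, lambda: "default"))   # case 11
```

The table costs space proportional to (high − low + 1), which is why this strategy requires a dense case set; for sparse sets, the binary-search table wins.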

Procedure Linkages
Standard procedure linkage:
- Each procedure has a standard prolog and a standard epilog
- Each call involves a pre-call sequence and a post-return sequence
- These are completely predictable from the call site: they depend on the number & type of the actual parameters

[diagram: procedure p's pre-call sequence branches to procedure q's prolog; q's epilog returns control to p's post-return sequence]

Implementing Procedure Calls
If p calls q:
- In the code for p, the compiler emits the pre-call sequence
  - Evaluates each parameter & stores it appropriately
  - Loads the return address from a label
  - (with access links) sets up q's access link
  - Branches to the entry of q
- In the code for p, the compiler emits the post-return sequence
  - Copies the return value into the appropriate location
  - Frees q's AR, if needed
  - Resumes p's execution

Invariant parts of the pre-call sequence might be moved into the prolog.
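The division of labor between pre-call, epilog, and post-return can be modeled with explicit activation records on a Python list acting as the run-time stack. The field names (`params`, `ret_addr`, `retval`) and the example callee are invented for the sketch.

```python
stack = []   # the run-time stack of activation records

def pre_call(actuals, return_addr):
    # pre-call in p: evaluate & store each actual, record the return
    # address, then "branch" to q (here the caller just invokes q)
    stack.append({"params": list(actuals),
                  "ret_addr": return_addr,
                  "locals": {},
                  "retval": None})

def epilog(value):
    # epilog in q: store the return value before control returns
    stack[-1]["retval"] = value

def post_return():
    # post-return in p: copy the return value out, then free q's AR
    ar = stack.pop()
    return ar["retval"]

def q():                      # callee body: add its two parameters
    ar = stack[-1]
    epilog(ar["params"][0] + ar["params"][1])

pre_call([3, 4], return_addr="p.after_call")
q()
result = post_return()        # result == 7, and q's AR has been freed
```

Everything in `pre_call` and `post_return` is determined at the call site by the number and types of the actuals, which is why these sequences are completely predictable.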

Implementing Procedure Calls
If p calls q:
- In the prolog, q must
  - Set up its execution environment
  - (with a display) update the display entry for its lexical level
  - Allocate space for its (AR &) local variables & initialize them
  - Save the return address, if q calls other procedures
  - Establish addressability for static data area(s)
- In the epilog, q must
  - Store the return value (unless a return statement already did so)
  - (with a display) restore the display entry for its lexical level
  - Restore the return address (if saved)
  - Begin restoring p's environment
  - Load the return address and branch to it

Implementing Procedure Calls
If p calls q, one of them must:
- Preserve register values (caller-saves versus callee-saves)
  - Caller-saves registers are stored & restored by p, in p's AR
  - Callee-saves registers are stored & restored by q, in q's AR
- Allocate the AR
  - Heap allocation: the callee allocates its own AR
  - Stack allocation: caller & callee cooperate to allocate the AR

Space tradeoff:
- Pre-call & post-return occur on every call; prolog & epilog occur once per procedure
- There are more calls than procedures, so moving operations into the prolog/epilog saves space
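The two register-save conventions can be sketched with a dictionary standing in for the register file. The register names and function names are invented; the point is only who does the save/restore, and in whose AR the saved values live.

```python
REGS = {"r1": 0, "r2": 0}   # stand-in for the machine register file

def caller_saves_call(callee, live_across_call):
    # caller-saves: p saves the registers that are live across the call
    # (in p's AR), lets q clobber them, and restores them on return
    saved = {r: REGS[r] for r in live_across_call}
    callee()
    REGS.update(saved)

def callee_saves(body, used_registers):
    # callee-saves: q saves the registers it uses in its prolog and
    # restores them in its epilog, in q's own AR
    def wrapped():
        saved = {r: REGS[r] for r in used_registers}
        body()
        REGS.update(saved)
    return wrapped

REGS["r1"] = 7
caller_saves_call(lambda: REGS.update(r1=99), live_across_call=["r1"])
r1_after = REGS["r1"]          # 7: p restored it after the call

REGS["r2"] = 5
callee_saves(lambda: REGS.update(r2=99), used_registers=["r2"])()
r2_after = REGS["r2"]          # 5: q restored it in its epilog
```

Caller-saves pays per call site but only for registers actually live there; callee-saves pays once in the prolog/epilog for every register the callee touches.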

Implementing Procedure Calls
Evaluating parameters:
- Call by reference: evaluate the parameter to an lvalue
- Call by value: evaluate the parameter to an rvalue & store it
- Aggregates, arrays, & strings are usually c-b-r (language-definition issues)
  - The alternative is copying them at each procedure call
- Small structures can be passed in registers (in & out)
- Can pass large c-b-v objects c-b-r and copy on modification (AIX does this for C)
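The visible difference between the two conventions, and the copy cost that makes compilers avoid c-b-v for aggregates, can be sketched in Python. (Python itself passes object references, so call-by-value is modeled with an explicit copy; the function names are invented.)

```python
import copy

def double_first_by_value(vec):
    vec = copy.copy(vec)   # c-b-v: store a copy of the rvalue --
    vec[0] *= 2            # this copy is the per-call cost aggregates would pay
    return vec

def double_first_by_reference(vec):
    vec[0] *= 2            # c-b-r: write through the lvalue;
                           # the caller sees the change

a = [1, 2, 3]
b = double_first_by_value(a)    # a is unchanged, b == [2, 2, 3]
double_first_by_reference(a)    # a becomes [2, 2, 3]
```

The copy-on-modification scheme the slide mentions simply defers `copy.copy` until the first write, so calls that never modify the aggregate never pay for the copy.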

Implementing Procedure Calls
Evaluating parameters:
- Call by reference: evaluate the parameter to an lvalue
- Call by value: evaluate the parameter to an rvalue & store it

Procedure-valued parameters:
- Must pass the starting address of the procedure
- With access links, need the lexical level as well
  - The procedure value is a tuple <level, address>
  - This lets the caller set up the appropriate access link
- May also need shared data areas (file-level scopes)
  - In-file & out-of-file calls have (slightly) different costs

What About Calls in an OOL? (Dispatch)
- In an OOL, most calls are indirect calls (virtual calls)
  - Compiled code does not contain the address of the callee
  - It finds the address by indirection through the class method table
  - This indirection is required to make virtual calls find the right methods: code compiled in class C cannot know of subclass methods that override methods in C and C's superclasses
- In the general case, this needs dynamic dispatch
  - Map the method name to a search key
  - Perform a run-time search through the hierarchy: start with the object's class and search for the 1st occurrence of the key
  - This can be expensive, so use a method cache to speed the search
    - The cache holds <key, class, method pointer>
    - How big? Bigger means more hits & a longer search; smaller means fewer hits but a faster search
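The run-time search and the method cache can be sketched as follows. The class hierarchy, table layout, and names are invented for the example; each class records its superclass and only the methods it defines itself.

```python
# class -> (superclass, methods defined in that class)
CLASSES = {
    "C": (None, {"area": lambda: "C.area"}),
    "D": ("C",  {}),                          # inherits area from C
    "E": ("D",  {"area": lambda: "E.area"}),  # overrides area
}
method_cache = {}   # <key, class> -> method pointer

def dispatch(cls, key):
    hit = method_cache.get((cls, key))
    if hit is not None:
        return hit                        # cache hit: no search
    c = cls
    while c is not None:                  # search up the hierarchy
        superclass, methods = CLASSES[c]
        if key in methods:
            method_cache[(cls, key)] = methods[key]
            return methods[key]
        c = superclass
    raise AttributeError(f"{cls} has no method {key}")

d_area = dispatch("D", "area")()   # "C.area": found by searching up to C
e_area = dispatch("E", "area")()   # "E.area": the override wins
```

After the first call per <class, key> pair, the cache turns the hierarchy walk into a single table probe; the size tradeoff in the slide is the usual cache tension between hit rate and probe cost.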

What About Calls in an OOL? (Dispatch)
Improvements are possible in special cases:
- If a class has no subclasses, can generate a direct call
  - The class structure must be static, or the class must be FINAL
- If the class structure is static
  - Can generate a complete method table for each class
  - A single indirection through the class pointer (1 or 2 operations) keeps the overhead low
- If the class structure changes infrequently
  - Build complete method tables at run time: at initialization & any time the class structure changes
- If the running program can create new classes, well, not all things can be done quickly
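When the class structure is static (or changes infrequently), the hierarchy search can be done once, ahead of time, by building a complete method table per class; dispatch then costs a single table lookup. A sketch, with an invented hierarchy where method entries stand in for code pointers:

```python
# class -> (superclass, methods defined in that class)
CLASSES = {
    "C": (None, {"area": "C.area", "name": "C.name"}),
    "D": ("C",  {"name": "D.name"}),
    "E": ("D",  {"area": "E.area"}),
}

def build_vtables(classes):
    # complete method table per class: copy the superclass's table,
    # then overlay the methods this class defines itself
    vtables = {}
    def table(c):
        if c not in vtables:
            superclass, own = classes[c]
            vt = dict(table(superclass)) if superclass else {}
            vt.update(own)
            vtables[c] = vt
        return vtables[c]
    for c in classes:
        table(c)
    return vtables

VT = build_vtables(CLASSES)
# dispatch is now one indirection: VT[cls][key]
```

If the class structure changes infrequently, the same `build_vtables` pass simply reruns whenever it does; only a program that creates classes at run time is forced back to searching.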

Summary
- Instruction scheduling beyond basic blocks
- Code shape and code generation
  - Expressions
  - Assignment
  - Array access
  - Boolean values
  - Control flow (if-else, loops, case)
  - Procedure calls
  - Dynamic dispatch for OOLs

Next Class
- Instruction selection through pattern matching
- Register allocation