CS 406/534 Compiler Construction Instruction Scheduling


CS 406/534 Compiler Construction: Instruction Scheduling
Prof. Li Xu, Dept. of Computer Science, UMass Lowell, Fall 2004
Part of the course lecture notes are based on Prof. Keith Cooper, Prof. Ken Kennedy, and Dr. Linda Torczon's teaching materials at Rice University. All rights reserved.

Administrivia
- Lab 2 due; class presentation by a student today
- Lab 3 has been posted on the web
  - Due on 12/5, in 3 weeks
  - Instruction scheduling for a single basic block
  - Implement 2 algorithms and compare the results
- Today's lecture covers the necessary background
- Simulator and test files are in ~cs406/lab3

What We Did Last Time
- Procedure activation records and run-time structures for OO languages
- Change of schedule today to jump-start Lab 3
- Back to code generation in the next class

Today's Goals
- Instruction scheduling
  - Motivation
  - List scheduling
  - Scheduling for larger regions

The Big Picture

  Source Code -> Front End -> IR -> Middle End -> IR -> Back End -> Machine Code
  (each stage may report errors)

- So far, we have covered the front end: parsing and intermediate representations
- Middle end: data-flow analysis and optimization (covered later)
- Back end: generate machine code and deal with the processor architecture
  - Instruction scheduling is today's topic

The Big Picture

  IR -> Instruction Selection -> IR -> Register Allocation -> IR -> Instruction Scheduling -> Machine Code

- The back end has become increasingly important due to hardware architecture
  - Multiple issue and deep pipelining in modern micro-architectures
  - A memory hierarchy to deal with the CPU-memory gap
- Critical for application performance on modern machines

Recap from Computer Architecture
- Instruction-level parallelism
  - Execute multiple instructions in parallel
  - Employ multiple functional units and pipelining
- Typical paradigms
  - Out-of-order superscalar
    - Hardware dynamically reorders instructions
    - Compiler scheduling can still boost performance
    - Examples: Alpha, SPARC, MIPS and other RISCs, even the Pentium
  - VLIW
    - The compiler statically schedules parallel instructions into a long instruction word
    - Depends on the compiler to generate correct code, so correctness is at stake, not just performance
    - Examples: Intel Itanium (IA-64), TI C6xxx

Motivation for Instruction Scheduling
- Operations have non-uniform latencies due to pipelining
- Multiple-issue architectures can issue several operations per cycle
- As a result, execution time is order-dependent
- Instruction scheduling: decide on an instruction execution order that minimizes execution cycles

Assumed latencies (conservative):

  Operation   Cycles
  load          3
  store         3
  loadI         1
  add           1
  mult          2
  fadd          1
  fmult         2
  shift         1
  branch      0 to 8

- Loads & stores may or may not block
  - Non-blocking: fill those issue slots with other work
- Branch costs vary with the path taken
- Branches typically have delay slots
  - Fill the slots with unrelated operations
  - Percolate the branch upward
- The scheduler should hide the latencies
- Lab 3 will build a local scheduler

Example: w <- w * 2 * x * y * z

Simple schedule (2 registers, 20 cycles):

   1  loadAI  r0, @w => r1
   4  add     r1, r1 => r1
   5  loadAI  r0, @x => r2
   8  mult    r1, r2 => r1
   9  loadAI  r0, @y => r2
  12  mult    r1, r2 => r1
  13  loadAI  r0, @z => r2
  16  mult    r1, r2 => r1
  18  storeAI r1     => r0, @w
  21  (r1 is free)

Schedule the loads early (3 registers, 13 cycles):

   1  loadAI  r0, @w => r1
   2  loadAI  r0, @x => r2
   3  loadAI  r0, @y => r3
   4  add     r1, r1 => r1
   5  mult    r1, r2 => r1
   6  loadAI  r0, @z => r2
   7  mult    r1, r3 => r1
   9  mult    r1, r2 => r1
  11  storeAI r1     => r0, @w
  14  (r1 is free)

Reordering operations for speed is called instruction scheduling.

The Problem
Given a code fragment for some target machine and the latencies for each individual operation, reorder the operations to minimize execution time.

The concept:

  slow code -> [ Scheduler + machine description ] -> fast code

The task:
- Produce correct code
- Minimize the number of execution cycles
- Avoid spilling registers
- Operate efficiently

Instruction Scheduling
To capture the properties of the code, build a precedence graph G:
- Nodes n in G are operations, with type(n) and delay(n)
- An edge e = (n1, n2) is in G if and only if n2 uses the result of n1

The code:

  a: loadAI  r0, @w => r1
  b: add     r1, r1 => r1
  c: loadAI  r0, @x => r2
  d: mult    r1, r2 => r1
  e: loadAI  r0, @y => r2
  f: mult    r1, r2 => r1
  g: loadAI  r0, @z => r2
  h: mult    r1, r2 => r1
  i: storeAI r1     => r0, @w

The precedence graph (an edge from each operation to the operations that use its result):

  a -> b,  b -> d,  c -> d,  d -> f,  e -> f,  f -> h,  g -> h,  h -> i
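As a concrete sketch, the dependence edges above can be recovered from the block with a single pass that tracks the most recent writer of each register. This is illustrative Python, not the course's code: the tuple encoding and function name are my own, r0 (the base register, never redefined here) is omitted from the read sets, and memory dependences are ignored.

```python
# Build a precedence graph for a straight-line block: add an edge
# (d, u) whenever operation u reads a register that operation d wrote.
# Hypothetical encoding: (label, opcode, reads, writes).
BLOCK = [
    ("a", "loadAI",  [], ["r1"]),
    ("b", "add",     ["r1", "r1"], ["r1"]),
    ("c", "loadAI",  [], ["r2"]),
    ("d", "mult",    ["r1", "r2"], ["r1"]),
    ("e", "loadAI",  [], ["r2"]),
    ("f", "mult",    ["r1", "r2"], ["r1"]),
    ("g", "loadAI",  [], ["r2"]),
    ("h", "mult",    ["r1", "r2"], ["r1"]),
    ("i", "storeAI", ["r1"], []),
]

def build_precedence_graph(block):
    edges = set()
    last_def = {}                       # register -> label of most recent writer
    for label, _op, reads, writes in block:
        for r in set(reads):
            if r in last_def:           # true (flow) dependence only
                edges.add((last_def[r], label))
        for r in writes:
            last_def[r] = label
    return edges

edges = build_precedence_graph(BLOCK)
print(sorted(edges))
```

Note that this records only true dependences; anti and output dependences on r1 and r2 vanish once registers are renamed, which is why the graph above is so sparse.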

Instruction Scheduling
A correct schedule S maps each n in N to a non-negative integer representing its cycle number, such that:
1. S(n) >= 0 for all n in N
2. If (n1, n2) in E, then S(n1) + delay(n1) <= S(n2)
3. For each operation type t, no cycle contains more operations of type t than the target machine can issue

The length of a schedule S, denoted L(S), is

  L(S) = max over n in N of ( S(n) + delay(n) )

The goal is to find the shortest possible correct schedule. S is time-optimal if L(S) <= L(S1) for all other schedules S1. A schedule might also be optimal in terms of registers, power, or space.
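The three conditions translate directly into a mechanical checker. A minimal sketch, assuming the example's graph and latencies (the function and data names are mine), which validates the 13-cycle schedule from the earlier example:

```python
# Check the three conditions for a correct schedule S.
DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}
EDGES = [("a","b"), ("b","d"), ("c","d"), ("d","f"), ("e","f"),
         ("f","h"), ("g","h"), ("h","i")]

def is_correct(S, edges=EDGES, delay=DELAY, issue_width=1):
    # 1. every operation gets a non-negative cycle
    if any(S[n] < 0 for n in delay):
        return False
    # 2. S(n1) + delay(n1) <= S(n2) for every edge (n1, n2)
    if any(S[n1] + delay[n1] > S[n2] for n1, n2 in edges):
        return False
    # 3. no cycle issues more operations than the machine allows
    per_cycle = {}
    for n in delay:
        per_cycle[S[n]] = per_cycle.get(S[n], 0) + 1
    return all(k <= issue_width for k in per_cycle.values())

def length(S, delay=DELAY):
    return max(S[n] + delay[n] for n in delay)

# The 13-cycle schedule from the example (with e writing r3 after renaming):
S = {"a": 1, "c": 2, "e": 3, "b": 4, "d": 5, "g": 6, "f": 7, "h": 9, "i": 11}
print(is_correct(S), length(S))   # -> True 14
```

L(S) = 14 here because the store issues in cycle 11 and completes at the start of cycle 14, matching "(r1 is free)" at cycle 14 in the example.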

Critical Points
- All operands must be available
- Multiple operations can be ready
- Moving operations can lengthen register lifetimes
- Placing uses near definitions can shorten register lifetimes
- Operands can have multiple predecessors
- Together, these issues make scheduling hard (NP-complete)
- Local scheduling is the simple case
  - Restricted to straight-line code
  - Consistent and predictable latencies

Instruction Scheduling: The Big Picture
1. Build a precedence graph, P
2. Compute a priority function over the nodes in P
3. Use list scheduling to construct a schedule, one cycle at a time
   a. Use a queue of operations that are ready
   b. At each cycle:
      i.  Choose a ready operation and schedule it
      ii. Update the ready queue

Local list scheduling has been the dominant algorithm for twenty years: a greedy, heuristic, local technique.

Local List Scheduling

  Cycle := 1
  Ready := leaves of P
  Active := empty set
  while (Ready ∪ Active is not empty)
      if (Ready is not empty) then
          remove an op from Ready          // removal in priority order
          S(op) := Cycle
          Active := Active ∪ { op }
      Cycle := Cycle + 1
      for each op in Active
          if (S(op) + delay(op) <= Cycle) then   // op has completed execution
              remove op from Active
              for each successor s of op in P
                  if (s is ready) then           // all of s's operands are available
                      Ready := Ready ∪ { s }
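The pseudocode above translates almost directly into Python. This is a sketch under simplifying assumptions (a single-issue machine, caller-supplied priorities; all names are mine). Run on the running example with the latency-weighted priorities, it reproduces the 13-cycle schedule.

```python
def list_schedule(nodes, edges, delay, priority):
    """Greedy local list scheduling over a precedence graph (single issue)."""
    succs = {n: [] for n in nodes}
    preds = {n: [] for n in nodes}
    for a, b in edges:
        succs[a].append(b)
        preds[b].append(a)

    S = {}                                      # op -> issue cycle
    cycle = 1
    ready = {n for n in nodes if not preds[n]}  # leaves of P
    active = set()
    while ready or active:
        if ready:
            op = max(ready, key=lambda n: priority[n])  # removal in priority order
            ready.remove(op)
            S[op] = cycle
            active.add(op)
        cycle += 1
        for op in list(active):
            if S[op] + delay[op] <= cycle:      # op has completed execution
                active.remove(op)
                for s in succs[op]:             # s is ready once every operand is done
                    if s not in S and all(p in S and S[p] + delay[p] <= cycle
                                          for p in preds[s]):
                        ready.add(s)
    return S

DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}
EDGES = [("a","b"), ("b","d"), ("c","d"), ("d","f"), ("e","f"),
         ("f","h"), ("g","h"), ("h","i")]
PRIORITY = {"a": 13, "b": 10, "c": 12, "d": 9, "e": 10, "f": 7, "g": 8, "h": 5, "i": 3}

S = list_schedule(DELAY.keys(), EDGES, DELAY, PRIORITY)
print(S)   # a:1 c:2 e:3 b:4 d:5 g:6 f:7 h:9 i:11
```

The readiness test uses issue cycles and delays rather than the Active set, so a successor whose predecessors all finish in the same cycle is detected correctly.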

Scheduling Example
1. Build the precedence graph (the code and precedence graph shown above)

Scheduling Example
1. Build the precedence graph
2. Determine priorities: longest latency-weighted path to the root

  Node:      a   b   c   d   e   f   g   h   i
  Priority: 13  10  12   9  10   7   8   5   3

Each node's priority is its own delay plus the largest priority among its successors (load/store = 3 cycles, mult = 2, add = 1; the store i is the root, with priority 3).
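The priority table can be computed in one backward pass over the graph. A sketch (the names are mine): process nodes in reverse topological order, setting each node's priority to its delay plus the maximum priority of its successors.

```python
# Longest latency-weighted path to the root: each node's priority is its own
# delay plus the maximum priority among its successors (0 for the root).
DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}
EDGES = [("a","b"), ("b","d"), ("c","d"), ("d","f"), ("e","f"),
         ("f","h"), ("g","h"), ("h","i")]

def latency_weighted_priority(delay, edges):
    succs = {n: [] for n in delay}
    indeg = {n: 0 for n in delay}
    for a, b in edges:
        succs[a].append(b)
        indeg[b] += 1
    # Kahn's algorithm builds a topological order; appending while
    # iterating a Python list is safe here.
    order = [n for n in delay if indeg[n] == 0]
    for n in order:
        for s in succs[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                order.append(s)
    prio = {}
    for n in reversed(order):          # process a node after all its successors
        prio[n] = delay[n] + max((prio[s] for s in succs[n]), default=0)
    return prio

print(latency_weighted_priority(DELAY, EDGES))
```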

Scheduling Example
1. Build the precedence graph
2. Determine priorities: longest latency-weighted path
3. Perform list scheduling (a new register name, r3, removes a false dependence)

  1)  a: loadAI  r0, @w => r1
  2)  c: loadAI  r0, @x => r2
  3)  e: loadAI  r0, @y => r3
  4)  b: add     r1, r1 => r1
  5)  d: mult    r1, r2 => r1
  6)  g: loadAI  r0, @z => r2
  7)  f: mult    r1, r3 => r1
  9)  h: mult    r1, r2 => r1
  11) i: storeAI r1     => r0, @w

Detailed Scheduling Algorithm
Idea: keep a collection of worklists W[c], one per cycle. We need MaxC = max delay + 1 such worklists.

  // Initialization
  for each n in N do begin
      count[n] := 0; earliest[n] := 0
  end
  for each (n1, n2) in E do begin
      count[n2] := count[n2] + 1;
      successors[n1] := successors[n1] ∪ { n2 };
  end
  for i := 0 to MaxC - 1 do W[i] := ∅;
  Wcount := 0;
  for each n in N do
      if count[n] = 0 then begin
          W[0] := W[0] ∪ { n }; Wcount := Wcount + 1;
      end
  c := 0;      // c is the cycle number
  cW := 0;     // cW is the number of the worklist for cycle c
  instr[c] := ∅;

Detailed Scheduling Algorithm

  // Main loop
  while Wcount > 0 do begin
      while W[cW] = ∅ do begin
          c := c + 1; instr[c] := ∅;
          cW := mod(cW + 1, MaxC);
      end
      nextc := mod(c + 1, MaxC);
      while W[cW] ≠ ∅ do begin
          select and remove an instruction x from W[cW];   // in priority order
          if a free issue unit of type(x) exists on cycle c then begin
              instr[c] := instr[c] ∪ { x };
              Wcount := Wcount - 1;
              for each y in successors[x] do begin
                  count[y] := count[y] - 1;
                  earliest[y] := max(earliest[y], c + delay(x));
                  if count[y] = 0 then begin
                      loc := mod(earliest[y], MaxC);
                      W[loc] := W[loc] ∪ { y }; Wcount := Wcount + 1;
                  end
              end
          end
          else W[nextc] := W[nextc] ∪ { x };
      end
  end

More List Scheduling
List scheduling breaks down into two distinct classes:
- Forward list scheduling
  - Start with the available operations
  - Work forward in time
  - Ready: all operands available
- Backward list scheduling
  - Start with the operations that have no successors
  - Work backward in time
  - Ready: latency covers all uses

Variations on list scheduling:
- Prioritize the critical path(s)
- Schedule the last use of a value as soon as possible
- Depth first in the precedence graph (minimizes registers)
- Breadth first in the precedence graph (minimizes interlocks)
- Prefer the operation with the most successors

Local Scheduling
- As long as we stay within a single block, list scheduling does well
- The problem is hard, so tie-breaking matters
  - Prefer the operation with more descendants in the dependence graph
  - Prefer an operation with a last use over one with none
  - Breadth first makes progress on all paths
    - Tends toward more ILP & fewer interlocks
  - Depth first tries to complete the uses of a value
    - Tends to use fewer registers
- The classic work on this is Gibbons & Muchnick
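Because tie-breaking matters, schedulers often rank ready operations with a composite key. A small illustration (the operations and field values below are invented for demonstration): Python tuples compare lexicographically, so max() applies each tie-breaker in order.

```python
# Composite tie-breaker: rank ready operations by latency-weighted priority,
# then by descendant count, then by whether the op contains a last use.
def schedule_key(op, prio, n_descendants, has_last_use):
    return (prio[op], n_descendants[op], has_last_use[op])

# Hypothetical ready set: x and y tie on priority; y has more descendants.
prio          = {"x": 5, "y": 5, "z": 4}
n_descendants = {"x": 1, "y": 3, "z": 0}
has_last_use  = {"x": False, "y": False, "z": True}

ready = ["x", "y", "z"]
best = max(ready, key=lambda op: schedule_key(op, prio, n_descendants, has_last_use))
print(best)   # -> y (tie on priority broken by descendant count)
```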

Local Scheduling
Forward and backward list scheduling can produce different results.

Example block from the SPEC benchmark go (priorities are latencies to the cbr; subscripts identify individual operations):
- An lshift and four loadI operations (priority 8) each feed one of the adds/addI (priority 7)
- Each add feeds one of the five stores (priority 5)
- The stores feed a cmp (priority 2), which feeds the cbr (priority 1)

Operation latencies:

  Operation  Latency
  load         4
  loadI        1
  add          2
  addI         1
  store        4
  cmp          1

Local Scheduling
Forward and backward schedules for this block, using latency to the root as the priority, on a machine with two integer units and one memory unit. The forward schedule issues the priority-8 operations (the loadIs and the lshift) first and finishes with the cmp and cbr in cycles 12 and 13. The backward schedule works upward from the cbr, issuing the lshift and the loadIs late and in reverse order, so that each store issues as soon as its operand is ready. (The cycle-by-cycle schedule tables did not survive transcription.)

Local Scheduling: Schielke's RBF Algorithm
- RBF: Randomized Backward & Forward
- Run 5 passes of forward list scheduling and 5 passes of backward list scheduling
- Break each tie randomly
- Keep the best schedule found
  - Shortest time to completion
  - Other metrics are possible (e.g., shortest time plus fewest registers)
- In practice, this does very well
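RBF can be sketched as a wrapper around randomized list scheduling. This sketch (all names are mine) randomizes only forward passes for brevity; a faithful implementation also runs five passes over the reversed graph and compares the results.

```python
import random

DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}
EDGES = [("a","b"), ("b","d"), ("c","d"), ("d","f"), ("e","f"),
         ("f","h"), ("g","h"), ("h","i")]
PRIO  = {"a": 13, "b": 10, "c": 12, "d": 9, "e": 10, "f": 7, "g": 8, "h": 5, "i": 3}

def schedule_length(S):
    return max(S[n] + DELAY[n] for n in DELAY)

def one_forward_pass(rng):
    """One single-issue list-scheduling pass; priority ties break randomly."""
    succs = {n: [] for n in DELAY}
    preds = {n: [] for n in DELAY}
    for a, b in EDGES:
        succs[a].append(b)
        preds[b].append(a)
    S, cycle = {}, 1
    ready = {n for n in DELAY if not preds[n]}
    active = set()
    while ready or active:
        if ready:
            best = max(PRIO[n] for n in ready)
            op = rng.choice(sorted(n for n in ready if PRIO[n] == best))
            ready.remove(op)
            S[op] = cycle
            active.add(op)
        cycle += 1
        for op in list(active):
            if S[op] + DELAY[op] <= cycle:          # op completed
                active.remove(op)
                for s in succs[op]:
                    if s not in S and all(p in S and S[p] + DELAY[p] <= cycle
                                          for p in preds[s]):
                        ready.add(s)
    return S

def rbf(passes=5, seed=0):
    """Run several randomized passes; keep the shortest schedule."""
    rng = random.Random(seed)
    return min((one_forward_pass(rng) for _ in range(passes)), key=schedule_length)

print(schedule_length(rbf()))   # -> 14 (store issues in cycle 11, completes by 14)
```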

Scheduling Larger Regions: Superlocal Scheduling
- Work one extended basic block (EBB) at a time
- The example CFG has four EBBs:

  B1: a b c d   (branches to B2 and B3)
  B2: e f       (branches to B4 and B5)
  B3: g         (flows to B5)
  B4: h i       (flows to B6)
  B5: j k       (flows to B6)
  B6: l

Scheduling Larger Regions: Superlocal Scheduling
- Only two of the EBB paths are nontrivial: {B1, B2, B4} and {B1, B3}
- Having B1 on both paths causes conflicts
- Moving an op out of B1 causes problems

Scheduling Larger Regions: Superlocal Scheduling
- Moving an op out of B1 causes problems: if c moves from B1 into B2, the path through B3 no longer computes c ("no c here!")

Scheduling Larger Regions: Superlocal Scheduling
- Moving c from B1 into B2 means we must insert compensation code (a copy of c) into B3
- This increases code space, and the B3 path pays the cost even though it was not the path being optimized for speed

Scheduling Larger Regions: Superlocal Scheduling
- Moving an op into B1 causes problems as well

Scheduling Larger Regions: Superlocal Scheduling
- Moving f from B2 up into B1 lengthens the path {B1, B3} and adds computation to it
- It may need compensation code too: an "undo f" on the off path makes that path even longer
- Renaming may avoid the need to undo f

Scheduling Larger Regions: Superlocal Scheduling
How much can we get?
- Schielke saw 11 to 12% speedups when compensation code was constrained away
- (Value numbering is the best case for superlocal scope)

Scheduling Larger Regions: More Aggressive Superlocal Scheduling
- Clone blocks to create more context
- Join points create blocks that must work in multiple contexts: B5 is reached along 2 paths and B6 along 3

Scheduling Larger Regions: More Aggressive Superlocal Scheduling
- Cloning splits B5 into B5a and B5b, and B6 into B6a, B6b, and B6c
- Some blocks can then combine: a block with a single successor merges with that successor when it is the successor's single predecessor


Scheduling Larger Regions: More Aggressive Superlocal Scheduling
- After cloning and combining (a copy of l merges into each of B4, B5a, and B5b), schedule the EBBs {B1, B2, B4}, {B1, B2, B5a}, and {B1, B3, B5b}
- Pay heed to compensation code
- This works well for forward motion
- Backward motion still has off-path problems: speeding up one path can slow down others (undo)

Scheduling Larger Regions: Trace Scheduling
- Start with execution counts for the CFG edges, obtained by profiling

Scheduling Larger Regions: Trace Scheduling
- Start with profiled edge counts, then pick the hot path
- In the example, B1 executes 10 times; edge counts: B1->B2 = 7, B1->B3 = 3, B2->B4 = 5, B2->B5 = 2, B3->B5 = 3, B4->B6 = 5, B5->B6 = 5
- Block counts could mislead us: B5 executes 5 times, as often as B4, yet no single edge into it is hot

Scheduling Larger Regions: Trace Scheduling
- Pick the hot path, B1, B2, B4, B6, and schedule it
- Insert compensation code in B3 and B5 if needed
- Get the hot path right! If we picked the right path, the other blocks do not matter as much
- This places a premium on quality profiles

Scheduling Larger Regions: Trace Scheduling the Entire CFG
- Pick and schedule the hot path, inserting compensation code as needed
- Remove the hot path from the CFG
- Repeat the process until the CFG is empty
- The idea: hot paths matter, and the farther a block lies off the hot path, the less it matters

Interaction with Register Allocation
- Dependence constraints arise from register names:
  - True dependence:   a writes r1, then b reads r1    (a: ... => r1;  b: r1, ... => ...)
  - Anti dependence:   a reads r1, then b writes r1    (a: r1, ... => ...;  b: ... => r1)
  - Output dependence: a writes r1, then b writes r1   (a: ... => r1;  b: ... => r1)
- Renaming can remove anti and output dependences
- Renaming usually increases register pressure
- Solution: pre- and post-scheduling (schedule both before and after register allocation)
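Renaming every definition to a fresh register, then rewriting later uses, removes the anti and output dependences while preserving the true ones. A sketch (the function, the 3-tuple encoding, and the fresh-name scheme are mine; real ILOC renaming must also respect memory operations):

```python
# Give every definition a fresh name and rewrite later uses, so that only
# true dependences remain after renaming.
def rename(block):
    """block: list of (opcode, reads, writes), registers as strings."""
    fresh = 100                       # fresh names start at r101 (arbitrary)
    current = {}                      # original register -> current fresh name
    out = []
    for op, reads, writes in block:
        reads = [current.get(r, r) for r in reads]
        new_writes = []
        for r in writes:
            fresh += 1
            current[r] = "r%d" % fresh
            new_writes.append(current[r])
        out.append((op, reads, new_writes))
    return out

# Both loads write r2, creating anti and output dependences with the mults;
# after renaming, each definition gets its own register and the loads can move.
block = [
    ("loadAI", [], ["r2"]),
    ("mult",   ["r1", "r2"], ["r1"]),
    ("loadAI", [], ["r2"]),
    ("mult",   ["r1", "r2"], ["r1"]),
]
for ins in rename(block):
    print(ins)
```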

Lab 3
- Statically schedule for a machine model
- Run "iloc_sim -s 3 < file.i" to time the non-scheduled code
- Implement two schedulers for basic blocks in ILOC
  - One must be list scheduling with the specified priority
  - The other can use a second priority, or a different algorithm
- Same ILOC subset as in Lab 1 (plus NOP)
- Different execution model
  - Two asymmetric functional units
  - Latencies differ from Lab 1
  - The simulator differs from Lab 1

Lab 3: Specific Questions
1. Are we in over our heads?
   No. The hard part of this lab should be trying to get good code. The programming is not bad; you can reuse some material from Lab 1. You have all the specifications & tools you need. Jump in and start programming.
2. How many registers can we use?
   Assume that you have as many registers as you need. If the simulator runs out, we'll get more. Rename registers to avoid false dependences & conflicts.
3. What about a store followed by a load?
   If you can show that the two operations must refer to different memory locations, the scheduler can overlap their execution. Otherwise, the store must complete before the load issues.
4. What about test files?
   We will put some test files online. We will test your lab on files you do not see. We have had problems (in the past) with people optimizing their labs for the test data. (Not that you would do that!)

Lab 3: Hints
- Begin by renaming registers to eliminate false dependences
- Pay attention to loads and stores: when can they be reordered?
- Understand the simulator
- Inserting NOPs may be necessary to get correct code
- Tie-breakers can make a difference

Summary
- Instruction scheduling
- List scheduling
- Scheduling larger regions

Next Class
- Code shape
- Code generation