CS415 Compilers. Instruction Scheduling and Lexical Analysis

Similar documents
Introduction to Optimization, Instruction Selection and Scheduling, and Register Allocation

CS 406/534 Compiler Construction Instruction Scheduling

Instruction Selection and Scheduling

CS415 Compilers Register Allocation. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

Instruction Scheduling

CS5363 Final Review. cs5363 1

Lab 3, Tutorial 1 Comp 412

CS415 Compilers. Lexical Analysis

Instruction Selection, II Tree-pattern matching

Lecture 12: Compiler Backend

fast code (preserve flow of data)

Instruction scheduling. Advanced Compiler Construction Michel Schinz

CS 406/534 Compiler Construction Putting It All Together

Instruction Selection: Peephole Matching. Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.

CS415 Compilers. Intermediate Represeation & Code Generation

Instruction Selection: Preliminaries. Comp 412

CSE P 501 Compilers. Register Allocation Hal Perkins Autumn /22/ Hal Perkins & UW CSE P-1

Chapter 06: Instruction Pipelining and Parallel Processing

CS 406/534 Compiler Construction Instruction Selection and Global Register Allocation

CS415 Compilers. Procedure Abstractions. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

The ILOC Virtual Machine (Lab 1 Background Material) Comp 412

Code Generation. CS 540 George Mason University

Lecture 7. Instruction Scheduling. I. Basic Block Scheduling II. Global Scheduling (for Non-Numeric Code)

CS 432 Fall Mike Lam, Professor. Data-Flow Analysis

Instruction Scheduling Beyond Basic Blocks Extended Basic Blocks, Superblock Cloning, & Traces, with a quick introduction to Dominators.

The ILOC Simulator User Documentation

Compiler Design. Register Allocation. Hwansoo Han

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )

Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit

Lecture 19: Instruction Level Parallelism

Compiler Architecture

Register Allocation. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

The ILOC Simulator User Documentation

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions

: Advanced Compiler Design. 8.0 Instruc?on scheduling

Code Shape II Expressions & Assignment

The ILOC Simulator User Documentation

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

CS425 Computer Systems Architecture

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Local Register Allocation (critical content for Lab 2) Comp 412

A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle.

Global Register Allocation via Graph Coloring

Static vs. Dynamic Scheduling

CS152 Computer Architecture and Engineering. Complex Pipelines

Topic 6 Basic Back-End Optimization

Generating Code for Assignment Statements back to work. Comp 412 COMP 412 FALL Chapters 4, 6 & 7 in EaC2e. source code. IR IR target.

Code Shape Comp 412 COMP 412 FALL Chapters 4, 5, 6 & 7 in EaC2e. source code. IR IR target. code. Front End Optimizer Back End

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

6.823 Computer System Architecture

Lecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining. CSC Friday, November 6, 2015

Handout 2 ILP: Part B

Hardware-based Speculation

Local Optimization: Value Numbering The Desert Island Optimization. Comp 412 COMP 412 FALL Chapter 8 in EaC2e. target code

CS 406/534 Compiler Construction Instruction Scheduling beyond Basic Block and Code Generation

Superscalar Processors

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

The basic structure of a MIPS floating-point unit

Parallelism. Execution Cycle. Dual Bus Simple CPU. Pipelining COMP375 1

Lecture-13 (ROB and Multi-threading) CS422-Spring

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)

k register IR Register Allocation IR Instruction Scheduling n), maybe O(n 2 ), but not O(2 n ) k register code

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

Administration CS 412/413. Instruction ordering issues. Simplified architecture model. Examples. Impact of instruction ordering

Simple Machine Model. Lectures 14 & 15: Instruction Scheduling. Simple Execution Model. Simple Execution Model

Introduction to Compiler

Advanced Computer Architecture

Instruction Pipelining Review

Out of Order Processing

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

Procedure and Function Calls, Part II. Comp 412 COMP 412 FALL Chapter 6 in EaC2e. target code. source code Front End Optimizer Back End

Getting CPI under 1: Outline

Background: Pipelining Basics. Instruction Scheduling. Pipelining Details. Idealized Instruction Data-Path. Last week Register allocation

CS415 Compilers Overview of the Course. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

5008: Computer Architecture

Lexical Analysis - An Introduction. Lecture 4 Spring 2005 Department of Computer Science University of Alabama Joel Jones

Instruction Level Parallelism

Intermediate Representations

Hardware-Based Speculation

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012

Intermediate Representations

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

Runtime Support for Algol-Like Languages Comp 412

Software Tools for Lab 1

CS415 Compilers. Syntax Analysis. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

COSC4201 Instruction Level Parallelism Dynamic Scheduling

Lecture 11: Out-of-order Processors. Topics: more ooo design details, timing, load-store queue

COSC 6385 Computer Architecture - Pipelining

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

Course on Advanced Computer Architectures

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

Preventing Stalls: 1

Transcription:

CS415 Compilers Instruction Scheduling and Lexical Analysis These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

Instruction Scheduling (Engineer s View) The Problem Given a code fragment for some target machine and the latencies for each individual operation, reorder the operations to minimize execution time The Concept Machine description The task Produce correct code Minimize wasted (idle) cycles slow code Operate efficiently fast Scheduler code 2

Data Dependences (stmt./instr. level) Dependences defined on memory locations / registers and not values Statement/instruction b depends on statement/instruction a if there exists: of flow dependence a writes a location/register that b later reads dependence a reads a location/register that b later writes output dependence a writes a location/register that b later writes (RAW conflict) (WAR conflict) (WAW conflict) Dependences define ORDER CONSTRAINTS that need to be respected in order to generate correct code. a = = a = a a = output a = a = 3

Instruction Scheduling (The Abstract View) To capture properties of the code, build a dependence graph G Nodes n G are operations with type(n) and delay(n) An edge e = (n 1,n 2 ) G iff n 2 depends on n 1 a: loadai r0,@w r1 b: add r1,r1 r1 c: loadai r0,@x r2 d: mult r1,r2 r1 e: loadai r0,@y r3 f: mult r1,r3 r1 g: loadai r0,@z r2 h: mult r1,r2 r1 i: storeai r1 r0,@w a b d f c h i e g The Code The Dependence Graph (all output dependences are covered, i.e., are satisfied through other 4 dependences)

Instruction Scheduling The big picture 1. Build a dependence graph, P 2. Compute a priority function over the nodes in P 3. Use list scheduling to construct a schedule, one cycle at a time (can only issue/schedule at most one instructions per cycle) a. Use a queue of operations that are ready b. At each cycle I. Choose a ready operation and schedule it II. Update the ready queue Local list scheduling The dominant algorithm for twenty years A greedy, heuristic, local technique 5

Local (Forward) List Scheduling Cycle 1 Ready leaves of P Active Ø while (Ready Active Ø) if (Ready Ø) then remove an op from Ready S(op) Cycle Active Active op Cycle Cycle + 1 for each op Active if (S(op) + delay(op) Cycle) then remove op from Active for each successor s of op in P if (s is ready) then Ready Ready s Removal in priority order op has completed execution If successor s operands are ready, put it on Ready 6

Scheduling Example Operation Cycles load 3 loadi 1 loadai 3 store 3 storeai 3 add 1 mult 2 fadd 1 fmult 2 shift 1 branch 0 to Loads & stores may or may not block > Non-blocking fill those issue slots Branches typically have delay slots > Fill slots with operations unrelated to branch condition evaluation > Percolates branch upward Branch Prediction may hide branch latencies (hardware feature) Build a simple local scheduler (basic block) - non-blocking loads & stores - different latencies load/store vs. arith. etc. operations - different heuristics - forward / backward scheduling 7

Scheduling Example 1. Build the dependence graph 1 4 5 9 12 13 16 1 21 20 cycles a: loadai r0,@w r1 b: add r1,r1 r1 c: loadai r0,@x r2 d: mult r1,r2 r1 e: loadai r0,@y r3 f: mult r1,r3 r1 g: loadai r0,@z r2 h: mult r1,r2 r1 i: storeai r1 r0,@w The Code a b d f c h i e g The Dependence Graph

Scheduling Example 1. Build the dependence graph 2. Determine priorities: longest latency-weighted path a: loadai r0,@w r1 b: add r1,r1 r1 c: loadai r0,@x r2 d: mult r1,r2 r1 e: loadai r0,@y r3 f: mult r1,r3 r1 g: loadai r0,@z r2 h: mult r1,r2 r1 i: storeai r1 r0,@w 11 14 a b d 7 13 c e g f 5 h 3 i The Code The Dependence Graph 9

Scheduling Example 1. Build the dependence graph 2. Determine priorities: longest latency-weighted path a: loadai r0,@w r1 b: add r1,r1 r1 c: loadai r0,@x r2 d: mult r1,r2 r1 e: loadai r0,@y r3 f: mult r1,r3 r1 g: loadai r0,@z r2 h: mult r1,r2 r1 i: storeai r1 r0,@w The Code 11 14 a b d 7 13 c e g f 5 h 3 i The Dependence Graph Note: Here we assume that operation has to finish to satisfy an dependence. Our ILOC simulator takes only one cycle to satisfy an dependence since read-stage is executed before write stage

Scheduling Example 1. Build the dependence graph 2. Determine priorities: longest latency-weighted path a: loadai r0,@w r1 b: add r1,r1 r1 c: loadai r0,@x r2 d: mult r1,r2 r1 e: loadai r0,@y r3 f: mult r1,r3 r1 g: loadai r0,@z r2 h: mult r1,r2 r1 i: storeai r1 r0,@w The Code 13 a b 9 d 7 12 c e g f 5 h 3 i The Dependence Graph Note: Here we assume that operation has to finish to satisfy an dependence. Our ILOC simulator takes only one cycle to satisfy an dependence since read-stage is executed before write stage (EaC). 11

Scheduling Example 1. Build the dependence graph 2. Determine priorities: longest latency-weighted path 3. Perform list scheduling (forward) a: loadai r0,@w r1 b: add r1,r1 r1 c: loadai r0,@x r2 d: mult r1,r2 r1 e: loadai r0,@y r3 f: mult r1,r3 r1 g: loadai r0,@z r2 h: mult r1,r2 r1 i: storeai r1 r0,@w The Code 11 14 a b d 7 13 c e g f 5 h i 3 The Dependence Graph 12

Scheduling Example 1 2 3 4 5 7 12 15 14 cycles 1. Build the dependence graph 2. Determine priorities: longest latency-weighted path 3. Perform list scheduling (forward) a: c: e: b: d: g: f: l o ad A I r 0, @w r 1 l o ad A I r 0, @ x r 2 l o ad A I r 0, @ y r 3 a d d r 1, r 1 r 1 m ul t r 1, r 2 r 1 l o ad A I r 0, @z r 2 m ul t r 1, r 3 r 1 h: m ul t r 1, r 2 r 1 i: s t o r e A I r 1 r 0, @w The Code 11 14 a b d 7 13 c e g f 5 h i 3 The Dependence Graph Our ILOC simulator takes only one cycle to satisfy an dependence 13

More on Scheduling Forward list scheduling start with available ops work forward ready all operands available Backward list scheduling start with no successors work backward ready latency covers operands Different heuristics (forward) based on Dependence Graph 1. Longest latency weighted path to root ( critical path) 2. Highest latency instructions ( more overlap) 3. Most immediate successors ( create more candidates) 4. Most descendents ( create more candidates) 5. Interactions with register allocation perform dynamic register renaming ( may require spill code) move life ranges around ( may remove or require spill code) 14

Next class More Lexical Analysis; Syntax Analysis Read EaC: Chapters 2.1 2.5; 3.1 3.3 Homework Problem Set 2 is posted. 15