TECH. 9. Code Scheduling for ILP-Processors. Levels of static scheduling. -Eligible Instructions are

Similar documents
Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas

Lecture 10: Static ILP Basics. Topics: loop unrolling, static branch prediction, VLIW (Sections )

CS553 Lecture Profile-Guided Optimizations 3

Instruction Scheduling. Software Pipelining - 3

High-Level Synthesis (HLS)

Simple Machine Model. Lectures 14 & 15: Instruction Scheduling. Simple Execution Model. Simple Execution Model

Instruction Scheduling

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines

fast code (preserve flow of data)

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines

Computer Science 246 Computer Architecture

Topics. Digital Systems Architecture EECE EECE Software Approaches to ILP Part 2. Ideas To Reduce Stalls. Processor Case Studies


CMSC411 Fall 2013 Midterm 2 Solutions

CE431 Parallel Computer Architecture Spring Compile-time ILP extraction Modulo Scheduling

Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors

計算機結構 Chapter 4 Exploiting Instruction-Level Parallelism with Software Approaches

HIGH-LEVEL SYNTHESIS

Instruction Scheduling

Lecture 19. Software Pipelining. I. Example of DoAll Loops. I. Introduction. II. Problem Formulation. III. Algorithm.

Topic 14: Scheduling COS 320. Compiling Techniques. Princeton University Spring Lennart Beringer

Instruction Scheduling Beyond Basic Blocks Extended Basic Blocks, Superblock Cloning, & Traces, with a quick introduction to Dominators.

Unit 2: High-Level Synthesis

Generic Software pipelining at the Assembly Level

CS 351 Final Exam Solutions

CS 152, Spring 2011 Section 10

Computer Architecture and Engineering. CS152 Quiz #5. April 23rd, Professor Krste Asanovic. Name: Answer Key

Hardware-based Speculation

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines

Multiple Instruction Issue. Superscalars

Predicated Software Pipelining Technique for Loops with Conditions

CS425 Computer Systems Architecture

CS152 Computer Architecture and Engineering. Complex Pipelines

Loop Unrolling. Source: for i := 1 to 100 by 1 A[i] := A[i] + B[i]; endfor

Chapter 3 (Cont III): Exploiting ILP with Software Approaches. Copyright Josep Torrellas 1999, 2001, 2002,

Computer Science 146. Computer Architecture

Problem Set 4 Solutions

Hardware-Software Codesign. 9. Worst Case Execution Time Analysis

(Refer Slide Time: 00:02:04)

Lecture 7. Instruction Scheduling. I. Basic Block Scheduling II. Global Scheduling (for Non-Numeric Code)

Hiroaki Kobayashi 12/21/2004

CS 701. Class Meets. Instructor. Teaching Assistant. Key Dates. Charles N. Fischer. Fall Tuesdays & Thursdays, 11:00 12: Engineering Hall

CS 654 Computer Architecture Summary. Peter Kemper

HW 2 is out! Due 9/25!

Software De-Pipelining Technique

Instruction-Level Parallelism and Its Exploitation

Lecture: Static ILP. Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)

VLIW/EPIC: Statically Scheduled ILP

EECS 583 Class 2 Control Flow Analysis LLVM Introduction

Floating Point/Multicycle Pipelining in DLX

Lecture 13 - VLIW Machines and Statically Scheduled ILP

Register Allocation (wrapup) & Code Scheduling. Constructing and Representing the Interference Graph. Adjacency List CS2210

G.1 Introduction: Exploiting Instruction-Level Parallelism Statically G-2 G.2 Detecting and Enhancing Loop-Level Parallelism G-2 G.

Modulo-Variable Expansion Sensitive Scheduling

SCHEDULING II Giovanni De Micheli Stanford University

Lecture 21. Software Pipelining & Prefetching. I. Software Pipelining II. Software Prefetching (of Arrays) III. Prefetching via Software Pipelining

Introduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation

Computer Architecture and Engineering CS152 Quiz #4 April 11th, 2012 Professor Krste Asanović

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

Conditional Elimination through Code Duplication

Advanced Computer Architecture

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Register Allocation. CS 502 Lecture 14 11/25/08

CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines

Lecture 8: Compiling for ILP and Branch Prediction. Advanced pipelining and instruction level parallelism

CS425 Computer Systems Architecture

CPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation

EE 4683/5683: COMPUTER ARCHITECTURE

ECE 5775 (Fall 17) High-Level Digital Design Automation. More Pipelining

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Full Datapath. Chapter 4 The Processor 2

Last time: forwarding/stalls. CS 6354: Branch Prediction (con t) / Multiple Issue. Why bimodal: loops. Last time: scheduling to avoid stalls

CMSC 611: Advanced Computer Architecture

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS 152 Computer Architecture and Engineering. Lecture 13 - VLIW Machines and Statically Scheduled ILP

EECS 583 Class 13 Software Pipelining

An Approach for Integrating Basic Retiming and Software Pipelining

Lecture 9: Multiple Issue (Superscalar and VLIW)

Lecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S

CS252 Graduate Computer Architecture Midterm 1 Solutions

ECE 505 Computer Architecture

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?

Modern Processors. RISC Architectures

Chapter 4 The Processor (Part 4)

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

Dependence Analysis. Hwansoo Han

Compiler Architecture

Multiple Issue ILP Processors. Summary of discussions

CS 152 Computer Architecture and Engineering. Lecture 16 - VLIW Machines and Statically Scheduled ILP

Evaluating Inter-cluster Communication in Clustered VLIW Architectures

Dynamic Scheduling. CSE471 Susan Eggers 1

Code Generation. CS 540 George Mason University

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

Computer Architectures. Chapter 4. Tien-Fu Chen. National Chung Cheng Univ.

Integrated Modulo Scheduling and Cluster Assignment for TMS320C64x+ Architecture 1

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.

Four Steps of Speculative Tomasulo cycle 0

Transcription:

9. Code Scheduling for ILP-Processors Typical layout of compiler: traditional, optimizing, pre-pass parallel, post-pass parallel {Software! compilers optimizing code for ILP-processors, including VLIW} 9.1 Introduction 9.2 Basic block scheduling 9.3 Loop scheduling 9.4 Global scheduling CH01 TECH Computer Science Levels of static scheduling Contrasting the performance of {software/hardware} static code schedulers and ILP-processor 9.1 Basic Block Schedulers: List scheduler: in each step Basic Block: straight-line code, can only enter at beginning and left at end -Eligible Instructions are dependency-free no predecessors in the Data Dependency Graph or data to be produced by a predecessor node is already available f timing: checking when the data will be ready hardware required resources are available

-Looks for the Best Choice from the eligible set of instructions {critical path in Data Dependency Graph} Choose First: Schedule instructions in the critical path Choose Next: others Heuristic Rule for choosing the next instructions Set of criteria 1. whether there is an immediate successor that is dependent on the instruction in question 2. how many successors the instruction has 3. how long is the longest path from the instruction in question to the leaves Case Study: compiler for IBM Power1 Basic block approach: list scheduler Using Data Dependency Graph Critical path the longest (in term of execution time) Earliest time when check the data for the instruction will be available Latency value (on each arc of DDG) how many time units the successors node have to wait before the result of the predecessor node becomes available Example program: The intermediate code and the corresponding DDG The first three steps in scheduling the example DDG using Warren s algoruthm

Successive steps in scheduling the example DDG Successive steps in scheduling the example DDG 9.3 Loop Scheduling -Loop Unrolling for I = 1 to 3 do { b(i) = 2 * a(i) } Loop Unrolled: basic concept b(1) = 2 * a(1) b(2) = 2 * a(2) b(3) = 2 * a(3) save execution time at the expense of code length omitting inter-iteration loop updating code performance improvement enlarging the basic block size can lead to more effective schedule with considerable speed up Simply unrolling is not practicable when a loop has to be executed a large number of times solution: unroll the loop a given number of time (e.g. 3) larger basic block Inter-iteration dependency only feasible if the required data becomes available in due time Speed-up produced by loop unrolling for the first 14 lawrence livermore loops

-Software Pipelining Software pipelining is an analogous term to hardware pipelining Software pipelining each pipeline stage uses one EU each pipeline cycle execute one instruction E.G. for I = 1 to 7 do { b(i) = 2 * a(i) } Basic methods for software pipelining -Principle of the unrolling-based URPR sofeware pipelining method E.G. loop 7 times, 4 EU s VLIW, fmul takes 3 cycles -Modulo scheduling // find the repetitive character of the schedule guess the minimal required length of the new {partially unrolled} loop period of repetition try schedule for this interval {period} taking into account data and resource dependencies if the schedule is not feasible then the length {period} is increased try again 9.4 Global Scheduling: Scheduling of individual instructions beyond basic blocks

Moving up an independent instruction beyond a basic block boundary along both paths of a conditional branch Moving up an independent instruction beyond a basic block boundary along a single path of a conditional branch {guess T} Trace Scheduling: flow graph, guess 3 Bookkeeping Code [B] Finite Resource Global Scheduling: solving long compilation times and code explosion For VLIW 3.7 Speed-up over Power1 49% Compilation time over PL.8 2.1 Code explosion over RISC