Lecture 7. Instruction Scheduling. I. Basic Block Scheduling II. Global Scheduling (for Non-Numeric Code)


Lecture 7: Instruction Scheduling
I. Basic Block Scheduling
II. Global Scheduling (for Non-Numeric Code)
Reading: Chapter 10.3, 10.4
CS243: Instruction Scheduling

Scheduling Constraints
- Data dependences: the scheduled operations must generate the same results as the corresponding ones in the original program.
- Control dependences: all operations executed in the original program must be executed in the optimized program.
- Resource constraints: no over-subscription of resources.

Data Dependence
Must maintain the order of accesses to potentially the same location.
- True dependence: write -> read (RAW hazard)
      a = ...
      ... = a
- Output dependence: write -> write (WAW hazard)
      a = ...
      a = ...
- Anti-dependence: read -> write (WAR hazard)
      ... = a
      a = ...
Data dependence graph:
- Nodes: operations
- Edges: n1 -> n2 if n2 is data dependent on n1, labeled by the execution length of n1
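The classification above can be sketched as code: a hypothetical dependence-graph builder (not the lecture's code) that compares def/use sets of straight-line operations pairwise; the operation names, fields, and latencies are illustrative.

```python
# Hypothetical sketch: build the data dependence graph for straight-line code
# by comparing def/use sets pairwise.  Each edge is labeled with the
# execution length (latency) of its source operation.
from collections import namedtuple

Op = namedtuple("Op", "name defs uses latency")

def dependence_graph(ops):
    """Edges (src, dst, kind, latency of src) for every pairwise dependence."""
    edges = []
    for i, a in enumerate(ops):
        for b in ops[i + 1:]:
            if a.defs & b.uses:
                edges.append((a.name, b.name, "RAW", a.latency))
            if a.defs & b.defs:
                edges.append((a.name, b.name, "WAW", a.latency))
            if a.uses & b.defs:
                edges.append((a.name, b.name, "WAR", a.latency))
    return edges

ops = [
    Op("i1", defs={"a"}, uses={"x"}, latency=2),  # a = load x
    Op("i2", defs={"y"}, uses={"a"}, latency=1),  # y = a   (RAW on a)
    Op("i3", defs={"a"}, uses={"z"}, latency=1),  # a = z   (WAW with i1, WAR with i2)
]
print(dependence_graph(ops))
# [('i1', 'i2', 'RAW', 2), ('i1', 'i3', 'WAW', 2), ('i2', 'i3', 'WAR', 1)]
```

Memory operations would additionally need the alias analysis discussed next, since their "locations" are not visible as register names.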

Analysis on Memory Variables
Undecidable in general:
      read x
      read y
      A[x] = ...
      ... = A[y]
Two memory accesses can potentially refer to the same location unless proven otherwise.
Classes of analysis:
- simple: is base+offset1 = base+offset2?
- data dependence analysis: array accesses whose indices are affine expressions of loop indices (can A[2i] alias A[2i+1]?)
- interprocedural analysis: can a global alias a parameter?
- pointer analysis: can pointer1 equal pointer2?
Data dependence analysis is useful for many other purposes.
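One standard test for affine subscripts, the GCD test, settles the A[2i] vs A[2i+1] question above. This is a textbook test, not part of the lecture:

```python
# The classic GCD test: A[a*i + b] and A[c*j + d] can touch the same element
# only if gcd(a, c) divides d - b.  It ignores loop bounds, so True means
# "dependence possible" while False is a proof of independence.
from math import gcd

def gcd_test(a, b, c, d):
    return (d - b) % gcd(a, c) == 0

print(gcd_test(2, 0, 2, 1))  # A[2i] vs A[2j+1] -> False: provably independent
print(gcd_test(2, 0, 4, 2))  # A[2i] vs A[4j+2] -> True: dependence possible
```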

Resource Constraints
Each instruction type has a resource-reservation table: rows are time slots (0, 1, 2, ...), columns are the functional units (ld, st, alu, fmpy, fadd, br), and a marked entry means the instruction occupies that unit at that cycle relative to issue.
- Pipelined functional units: occupy only one time slot
- Non-pipelined functional units: occupy multiple time slots
- Instructions may use more than one resource
- There may be multiple units of the same resource
- Limited instruction-issue slots may also be managed like a resource
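A minimal sketch of reservation tables as data (the unit names and table shapes are illustrative, not the lecture's machine): each instruction class lists the units it occupies at each cycle relative to issue, and the scheduler keeps a global map of committed units per cycle. This models one unit of each resource; multiple identical units would need counts instead of sets.

```python
# Reservation tables: per instruction class, the set of units occupied at
# each cycle relative to issue time.
RESERVATIONS = {
    "alu":  [{"alu"}],             # pipelined: busy only in its issue cycle
    "ld":   [{"mem"}],             # pipelined load
    "fmpy": [{"fmpy"}, {"fmpy"}],  # non-pipelined: busy 2 consecutive cycles
}

def can_issue(commits, kind, t):
    """True if issuing `kind` at cycle t over-subscribes no unit."""
    return all(not (units & commits.get(t + dt, set()))
               for dt, units in enumerate(RESERVATIONS[kind]))

def issue(commits, kind, t):
    """Commit the resources of one `kind` instruction issued at cycle t."""
    for dt, units in enumerate(RESERVATIONS[kind]):
        commits.setdefault(t + dt, set()).update(units)

commits = {}
issue(commits, "fmpy", 0)
print(can_issue(commits, "fmpy", 1))  # False: the multiplier is still busy
print(can_issue(commits, "alu", 1))   # True: the ALU is free
```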

Example of a Machine Model
Each machine cycle can issue 2 operations:
- 1 ALU or branch operation
      OP dst, src1, src2: executes in 1 clock
- 1 load or store operation
      LD dst, addr: result is available in 2 clocks; the load unit is pipelined, so another LD can issue the next clock
      ST src, addr: executes in 1 clock

Basic Block Scheduling: Example

      i1: LD  R2 <- 0(R1)
      i2: ST  4(R1) <- R2
      i3: LD  R3 <- 8(R1)
      i4: ADD R3 <- R3,R4
      i5: ADD R3 <- R3,R2
      i6: ST  12(R1) <- R3
      i7: ST  0(R7) <- R7

(The slide draws this block's data dependence graph and schedules it onto one alu and one mem issue slot per cycle; load edges carry latency 2, ALU edges latency 1.)

With Resource Constraints
NP-complete in general, so heuristics are used.

List Scheduling:
      READY = nodes with 0 predecessors
      Loop until READY is empty {
          Let n be the node in READY with highest priority
          Schedule n in the earliest slot that satisfies
              precedence + resource constraints
          Update predecessor count of n's successor nodes
          Update READY
      }
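The loop above can be sketched as a runnable list scheduler on the i1..i7 block from the earlier example. The edges below are one plausible dependence graph for that block, assuming its memory references do not alias; the machine is the 2-issue model (one alu/branch slot plus one mem slot per cycle), each edge carries the latency of its source, and the priority is simple source order.

```python
# List scheduling over a DAG with one "alu" and one "mem" issue slot per cycle.
import heapq

# op -> required issue slot; edges (u, v, latency of u), assuming no aliasing
SLOT = {"i1": "mem", "i2": "mem", "i3": "mem", "i4": "alu",
        "i5": "alu", "i6": "mem", "i7": "mem"}
EDGES = [("i1", "i2", 2), ("i1", "i5", 2), ("i3", "i4", 2),
         ("i4", "i5", 1), ("i5", "i6", 1)]

def list_schedule(slot_of, edges, priority):
    preds = {n: 0 for n in slot_of}
    succs = {n: [] for n in slot_of}
    for u, v, lat in edges:
        preds[v] += 1
        succs[u].append((v, lat))
    earliest = {n: 0 for n in slot_of}      # earliest start from dependences
    ready = [(-priority[n], n) for n in slot_of if preds[n] == 0]
    heapq.heapify(ready)
    busy, sched = set(), {}
    while ready:
        _, n = heapq.heappop(ready)         # highest-priority ready node
        t = earliest[n]
        while (slot_of[n], t) in busy:      # earliest slot without a clash
            t += 1
        busy.add((slot_of[n], t))
        sched[n] = t
        for v, lat in succs[n]:             # release successors
            earliest[v] = max(earliest[v], t + lat)
            preds[v] -= 1
            if preds[v] == 0:
                heapq.heappush(ready, (-priority[v], v))
    return sched

prio = {n: p for p, n in enumerate(reversed(list(SLOT)), 1)}  # source order
print(list_schedule(SLOT, EDGES, prio))
```

On this graph the schedule finishes with i6 issuing at cycle 5, i.e. the block completes in 6 cycles; i7 slides into the otherwise idle mem slot at cycle 3.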

List Scheduling
- Scope: DAGs
- Schedules operations in topological order
- Never backtracks
Variations: the priority function for a node n
- critical path: max clocks from n to any node
- resource requirements
- source order
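The critical-path priority can be computed by a memoized walk over the DAG: for each node, the maximum latency-weighted distance to any other node. The graph below is made up for illustration; succs maps node -> [(successor, latency)].

```python
# Critical-path priority: longest latency-weighted path from each node.
def critical_path(succs):
    memo = {}
    def cp(n):
        if n not in memo:
            memo[n] = max((lat + cp(v) for v, lat in succs[n]), default=0)
        return memo[n]
    return {n: cp(n) for n in succs}

dag = {"a": [("b", 2), ("c", 1)], "b": [("d", 1)], "c": [("d", 1)], "d": []}
print(critical_path(dag))  # {'a': 3, 'b': 1, 'c': 1, 'd': 0}
```

Scheduling nodes with larger critical-path values first tends to shorten the overall schedule, since they have the least slack.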

II. Introduction to Global Scheduling
Assume each clock can execute 2 operations of any kind.

Source:
      if (a==0) goto L
      c = b
      L: e = d + d

Compiled code:
      B1:    LD R6 <- 0(R1)
             nop
             BEQZ R6, L
      B2:    LD R7 <- 0(R2)
             nop
             ST 0(R3) <- R7
      B3: L: LD R8 <- 0(R4)
             nop
             ADD R8 <- R8,R8
             ST 0(R5) <- R8

Result of Code Scheduling
      B1:      LD R6 <- 0(R1) ; LD R8 <- 0(R4)
               LD R7 <- 0(R2)
               ADD R8 <- R8,R8 ; BEQZ R6, L
      B3:   L: ST 0(R5) <- R8
      B2+B3:   ST 0(R5) <- R8 ; ST 0(R3) <- R7
The loads and the ADD from B2 and B3 are hoisted speculatively into B1; B3's store is duplicated so that it executes on both the taken and the fall-through path.

Terminology
- Control equivalence: two operations o1 and o2 are control equivalent if o1 is executed if and only if o2 is executed.
- Control dependence: an operation o2 is control dependent on operation o1 if the execution of o2 depends on the outcome of o1.
- Speculation: an operation o is speculatively executed if it is executed before all the operations it is control dependent on have been executed.
  Requirements: o raises no exception and satisfies the data dependences.

Code Motions
Goal: shorten execution time probabilistically.
Moving instructions up:
- move the instruction to a cut set (from the entry)
- speculation: the instruction executes even when its result is not needed
Moving instructions down:
- move the instruction to a cut set (from the exit)
- may execute extra instructions
- may duplicate code

A Note on Data Dependences
(The slide's figure contrasts two writes, a = 0 and a = 1, in different basic blocks: data dependences must be respected across blocks, not just within one.)

General-Purpose Applications
- Lots of data dependences
- Key performance factor: memory latencies
- Move memory fetches up; speculative memory fetches can be expensive
- Control-intensive: get an execution profile
Static estimation:
- innermost loops are frequently executed
- back edges are likely to be taken
- edges that branch to exit or exception routines are unlikely to be taken
Dynamic profiling:
- instrument the code and measure using representative data

A Basic Global Scheduling Algorithm
- Schedule innermost loops first
- Only upward code motion
- No creation of copies
- Only one level of speculation

Program Representation
A region in a control flow graph is a set of basic blocks and all the edges connecting these blocks, such that control from outside the region must enter through a single entry block.
A function is represented as a hierarchy of regions:
- the whole control flow graph is a region
- each natural loop in the flow graph is a region
- natural loops are hierarchically nested
Schedule regions from inner to outer:
- treat an inner loop as a black box unit: code can be scheduled around it but not into it
- ignore all the loop back edges, which yields an acyclic graph
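The natural loops that define this region hierarchy can be found from dominators: an edge u -> v is a back edge when v dominates u, and its natural loop is v plus every node that reaches u without passing through v. A self-contained sketch on a made-up CFG:

```python
# Finding natural loops via dominators, on a hypothetical 4-block CFG.
def dominators(succs, entry):
    """Iterative dataflow: dom[n] = {n} | intersection of dom[p] over preds p."""
    nodes = set(succs)
    preds = {n: set() for n in nodes}
    for u, vs in succs.items():
        for v in vs:
            preds[v].add(u)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def natural_loop(succs, u, v):
    """Natural loop of back edge u -> v: v plus all nodes reaching u avoiding v."""
    preds = {n: set() for n in succs}
    for a, bs in succs.items():
        for b in bs:
            preds[b].add(a)
    loop, work = {v, u}, [u]
    while work:
        for p in preds[work.pop()]:
            if p not in loop:
                loop.add(p)
                work.append(p)
    return loop

cfg = {"entry": ["header"], "header": ["body", "exit"],
       "body": ["header"], "exit": []}
dom = dominators(cfg, "entry")
assert "header" in dom["body"]                      # body -> header is a back edge
print(sorted(natural_loop(cfg, "body", "header")))  # ['body', 'header']
```

Nesting the loops found this way (a loop is nested in another if its header is inside the other's body) gives the inner-to-outer scheduling order used below.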

Algorithm

      Compute data dependences;
      For each region from inner to outer {
          For each basic block B in prioritized topological order {
              CandBlocks = ControlEquiv{B} U Dominated-Successors{ControlEquiv{B}};
              CandInsts = ready operations in CandBlocks;
              For (t = 0, 1, ... until all operations from B are scheduled) {
                  For (n in CandInsts in priority order) {
                      if (n has no resource conflicts at time t) {
                          S(n) = <B, t>;
                          Update resource commitments;
                          Update data dependences;
                      }
                  }
                  Update CandInsts;
              }
          }
      }

Priority function: schedule non-speculative operations before speculative ones.

Extensions
Prepass before scheduling: loop unrolling.
This is especially important for moving operations up across loop back edges.

Summary
- List scheduling
- Global scheduling
- Legal code motions
- Heuristics