Instruction scheduling. Advanced Compiler Construction. Michel Schinz, 2015-05-21.


Instruction ordering When a compiler emits the instructions corresponding to a program, it imposes a total order on them. However, that order is usually not the only valid one, in the sense that it can be changed without modifying the program's behavior. For example, if two instructions i₁ and i₂ appear sequentially in that order and are independent, then it is possible to swap them.

Instruction scheduling Among all the valid permutations of the instructions composing a program, i.e. those that preserve the program's behavior, some can be more desirable than others. For example, one order might lead to a faster program on some machine because of architectural constraints. The aim of instruction scheduling is to find a valid order that optimizes some metric, like execution speed.

Pipeline stalls Modern pipelined architectures can usually issue at least one instruction per clock cycle. However, an instruction can be executed only if the data it needs is ready. Otherwise, the pipeline stalls for one or several cycles. Stalls can appear because some instructions, e.g. division, require several cycles to complete, or because data has to be fetched from memory.

Scheduling example The following example illustrates how proper scheduling can reduce the time required to execute a piece of RTL code. We assume the following delays for instructions:

Instruction kind       RTL notation                         Delay
Memory load or store   Ra ← Mem[Rb+c]  /  Mem[Rb+c] ← Ra    3
Multiplication         Ra ← Rb * Rc                         2
Addition               Ra ← Rb + Rc                         1

Scheduling example Before scheduling:

Cycle  Instruction
1      R1 ← Mem[RSP]
4      R1 ← R1 + R1
5      R2 ← Mem[RSP+1]
8      R1 ← R1 * R2
9      R2 ← Mem[RSP+2]
12     R1 ← R1 * R2
13     R2 ← Mem[RSP+3]
16     R1 ← R1 * R2
18     Mem[RSP+4] ← R1

After scheduling (including renaming):

Cycle  Instruction
1      R1 ← Mem[RSP]
2      R2 ← Mem[RSP+1]
3      R3 ← Mem[RSP+2]
4      R1 ← R1 + R1
5      R1 ← R1 * R2
6      R2 ← Mem[RSP+3]
7      R1 ← R1 * R3
9      R1 ← R1 * R2
11     Mem[RSP+4] ← R1

The last instruction is now issued at cycle 11 instead of 18!

Instruction dependences An instruction i₂ depends on an instruction i₁ when it is not possible to execute i₂ before i₁ without changing the behavior of the program. The most common reason for dependence is data dependence: i₂ uses a value that is computed by i₁. However, as we will see, there are other kinds of dependences.

Data dependences We distinguish three kinds of dependences between two instructions i₁ and i₂:
1. true dependence: i₂ reads a value written by i₁ (read after write or RAW),
2. antidependence: i₂ writes a value read by i₁ (write after read or WAR),
3. output dependence: i₂ writes a value written by i₁ (write after write or WAW).
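To make the three kinds concrete, here is a minimal Python sketch, not part of the original slides: it models an instruction as a hypothetical pair (writes, reads) of register-name sets, with address registers omitted for brevity, and classifies the dependence of i₂ on i₁.

```python
def classify_dependence(i1, i2):
    """Classify the dependence of i2 on i1. Each instruction is
    modelled as a pair (writes, reads) of sets of register names."""
    writes1, reads1 = i1
    writes2, reads2 = i2
    if writes1 & reads2:
        return "RAW"  # true dependence: i2 reads a value written by i1
    if reads1 & writes2:
        return "WAR"  # antidependence: i2 overwrites a value read by i1
    if writes1 & writes2:
        return "WAW"  # output dependence: both write the same location
    return None       # independent, at least as far as registers go

# Example: a load into R1, an addition reading R1, a second load into R1.
load1 = ({"R1"}, set())          # R1 <- Mem[RSP]
add   = ({"R4"}, {"R4", "R1"})   # R4 <- R4 + R1
load2 = ({"R1"}, set())          # R1 <- Mem[RSP+1]
print(classify_dependence(load1, add))    # RAW
print(classify_dependence(add, load2))    # WAR
print(classify_dependence(load1, load2))  # WAW
```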

Antidependences Antidependences and output dependences are not real dependences, in the sense that they do not arise from the flow of data. They are due to a single location being used to store different values. Most of the time, they can be removed by renaming locations, e.g. registers. In the example below, the program on the left contains a WAW dependence between the two memory load instructions, which can be removed by renaming the second use of R1.

R1 ← Mem[RSP]        R1 ← Mem[RSP]
R4 ← R4 + R1         R4 ← R4 + R1
R1 ← Mem[RSP+1]      R2 ← Mem[RSP+1]
R4 ← R4 + R1         R4 ← R4 + R2

Computing dependences Identifying dependences among instructions that only access registers is easy. Instructions that access memory are harder to handle: in general, it is not possible to know whether two such instructions refer to the same memory location. Conservative approximations, not examined in detail here, therefore have to be used.
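For illustration only, the crudest such approximation assumes that any two memory accesses may touch the same location unless both are loads. A Python sketch of that rule follows; the representation (a list of kind strings) and the function name are mine, not the slides'.

```python
def memory_dependence_edges(kinds):
    """Conservatively order memory instructions: a store conflicts
    with every other memory access, while two loads never conflict.
    `kinds` is a list of strings in {"load", "store", "other"};
    returns edges (i, j) meaning instruction j must stay after i."""
    edges = []
    for j, kind_j in enumerate(kinds):
        for i in range(j):
            kind_i = kinds[i]
            if "other" in (kind_i, kind_j):
                continue  # not a memory access
            if kind_i == "store" or kind_j == "store":
                edges.append((i, j))  # possible conflict through memory
    return edges

print(memory_dependence_edges(["load", "store", "load"]))
# [(0, 1), (1, 2)] -- both loads are ordered w.r.t. the store, not each other
```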

Dependence graph The dependence graph is a directed graph representing dependences among instructions. Its nodes are the instructions to schedule, and there is an edge from node n₁ to node n₂ iff the instruction of n₂ depends on that of n₁. Any topological sort of the nodes of this graph represents a valid way to schedule the instructions.

Dependence graph example

Name  Instruction
a     R1 ← Mem[RSP]
b     R1 ← R1 + R1
c     R2 ← Mem[RSP+1]
d     R1 ← R1 * R2
e     R2 ← Mem[RSP+2]
f     R1 ← R1 * R2
g     R2 ← Mem[RSP+3]
h     R1 ← R1 * R2
i     Mem[RSP+4] ← R1

[Dependence graph: true-dependence edges a→b, b→d, c→d, d→f, e→f, f→h, g→h, h→i; antidependence edges d→e and f→g.]
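Such a graph can be built directly from the registers each instruction reads and writes. The following Python sketch is my own illustration (the Instr type and dependence_graph name are assumptions): it keeps, for each register, only the latest definition and its readers, so redundant transitive edges are avoided, and it ignores memory, treating the stack slots of the example as distinct, as the slides do.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "name writes reads")

def dependence_graph(block):
    """Return edges (n1, n2, kind): instruction n2 depends on n1.
    Only register dependences are considered."""
    edges = []
    last_writer = {}   # register -> latest instruction writing it
    readers = {}       # register -> readers of that latest value
    for ins in block:
        for r in ins.reads:            # RAW: read the latest definition
            if r in last_writer:
                edges.append((last_writer[r], ins.name, "true"))
            readers.setdefault(r, []).append(ins.name)
        for r in ins.writes:
            for n in readers.get(r, []):
                if n != ins.name:      # WAR: overwrite a value still read
                    edges.append((n, ins.name, "anti"))
            if r in last_writer and not readers.get(r):
                edges.append((last_writer[r], ins.name, "anti"))  # WAW
            last_writer[r] = ins.name
            readers[r] = []
    return edges

block = [
    Instr("a", {"R1"}, set()),          # R1 <- Mem[RSP]
    Instr("b", {"R1"}, {"R1"}),         # R1 <- R1 + R1
    Instr("c", {"R2"}, set()),          # R2 <- Mem[RSP+1]
    Instr("d", {"R1"}, {"R1", "R2"}),   # R1 <- R1 * R2
    Instr("e", {"R2"}, set()),          # R2 <- Mem[RSP+2]
    Instr("f", {"R1"}, {"R1", "R2"}),   # R1 <- R1 * R2
    Instr("g", {"R2"}, set()),          # R2 <- Mem[RSP+3]
    Instr("h", {"R1"}, {"R1", "R2"}),   # R1 <- R1 * R2
    Instr("i", set(), {"R1"}),          # Mem[RSP+4] <- R1
]
for edge in dependence_graph(block):
    print(edge)   # the true dependences a->b, b->d, c->d, d->f, e->f,
                  # f->h, g->h, h->i, plus the antidependences d->e, f->g
```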

Difficulty of scheduling Optimal instruction scheduling is NP-complete. As always, this implies that we will use techniques based on heuristics to find a good, but sometimes not optimal, solution to that problem. List scheduling is a technique for scheduling the instructions of a single basic block. Its basic idea is to simulate the execution of the instructions, and to schedule an instruction only when all its operands can be used without stalling the pipeline.

List scheduling algorithm The list scheduling algorithm maintains two lists: ready, the list of instructions that could be scheduled without stalling, ordered by priority, and active, the list of instructions currently being executed. At each step, the highest-priority instruction from ready is scheduled and moved to active, where it stays for a time equal to its delay. Before scheduling is performed, renaming is done to remove all antidependences that can be removed. A sketch of the core loop follows.
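The Python sketch below shows the idea under simplifying assumptions that are mine, not the slides': at most one instruction is issued per cycle, there are no functional-unit conflicts, and the dependences are those remaining after renaming.

```python
def list_schedule(deps, delay, priority):
    """Greedy list scheduling of one basic block.
    deps:     dict instruction -> set of instructions it depends on
    delay:    dict instruction -> latency in cycles
    priority: dict instruction -> scheduling priority
    Returns a dict mapping each instruction to its issue cycle."""
    result_ready = {}            # instruction -> cycle its result is usable
    issue_cycle = {}
    remaining = set(delay)
    cycle = 1
    while remaining:
        # ready: unscheduled instructions whose operands are all
        # available at the current cycle.
        ready = [n for n in remaining
                 if all(d in result_ready and result_ready[d] <= cycle
                        for d in deps.get(n, ()))]
        if ready:
            best = max(ready, key=priority.get)  # highest priority first
            issue_cycle[best] = cycle
            result_ready[best] = cycle + delay[best]
            remaining.remove(best)
        cycle += 1                               # otherwise: stall
    return issue_cycle
```

With the post-renaming dependences, delays and priorities of the running example, this sketch issues the final store at cycle 11, matching the trace shown below.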

Prioritizing instructions Nodes (i.e. instructions) are sorted by priority in the ready list. Several schemes exist to compute the priority of a node, which can be equal to:
- the length of the longest latency-weighted path from it to a root of the dependence graph,
- the number of its immediate successors,
- the number of its descendants,
- its latency, etc.
Unfortunately, no single scheme is better for all cases. The first scheme, which is the one used in the example that follows, is sketched below.
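Here is a small Python sketch of that first scheme, computed by a memoized backwards traversal of the dependence graph; following the example, succs is assumed to contain only the true-dependence successors of each node.

```python
from functools import lru_cache

def latency_weighted_priorities(succs, delay):
    """Priority of a node = length of the longest latency-weighted
    path from it to a root (a node with no successors) of the
    dependence graph."""
    @lru_cache(maxsize=None)
    def prio(n):
        followers = succs.get(n, ())
        if not followers:
            return delay[n]
        return delay[n] + max(prio(s) for s in followers)
    return {n: prio(n) for n in delay}

# True-dependence successors and delays of the running example:
succs = {"a": ("b",), "b": ("d",), "c": ("d",), "d": ("f",),
         "e": ("f",), "f": ("h",), "g": ("h",), "h": ("i",)}
delay = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3,
         "f": 2, "g": 3, "h": 2, "i": 3}
print(latency_weighted_priorities(succs, delay))
# {'a': 13, 'b': 10, 'c': 12, 'd': 9, 'e': 10,
#  'f': 7, 'g': 8, 'h': 5, 'i': 3}
```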

List scheduling example A node's priority is the length of the longest latency-weighted path from it to a root of the dependence graph:

Node      a   b   c   d   e   f   g   h   i
Priority  13  10  12   9  10   7   8   5   3

Simulating the algorithm on the example block produces the trace below; priorities are shown in parentheses, and the antidependences have already been removed by renaming, which is why e and g are ready from the first cycle.

Cycle  ready                          active
1      [a(13), c(12), e(10), g(8)]    [a]
2      [c(12), e(10), g(8)]           [a, c]
3      [e(10), g(8)]                  [a, c, e]
4      [b(10), g(8)]                  [b, c, e]
5      [d(9), g(8)]                   [d, e]
6      [g(8)]                         [d, g]
7      [f(7)]                         [f, g]
8      []                             [f, g]
9      [h(5)]                         [h]
10     []                             [h]
11     [i(3)]                         [i]
12     []                             [i]
13     []                             [i]
14     []                             []

Scheduling conflicts It is hard to decide whether scheduling should be done before or after register allocation. If register allocation is done first, it can introduce antidependences when reusing registers. If scheduling is done first, register allocation can introduce spill code, destroying the schedule. A common solution is to schedule first, then allocate registers, and schedule once more if spill code was introduced.