Software Pipelining by Modulo Scheduling. Philip Sweany, University of North Texas

Overview: Instruction-Level Parallelism; Instruction Scheduling; Opportunities for Loop Optimization; Software Pipelining; Modulo Scheduling; Resource and Dependence Constraints; Scheduling Pipelines; Limitations; Enhancements; Conclusions.

Instruction Scheduling. Motivated by instruction-level parallelism: place operations in the minimum number of instructions. Complications: resource conflicts and the precedence relation among operations. List scheduling is a popular heuristic that operates on a data dependence DAG (DDD).

Data Dependence DAG. Nodes represent operations to schedule. Edges represent dependences needed to preserve program semantics. Edges are weighted with <Min,Max> times to represent the complex timing of the architecture: resource latencies, pipelines, transient resources. One possible representation is sketched below.
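
A minimal C sketch of one way such a DDD could be represented; the type and field names (ddd_node, ddd_edge, min_delay, max_delay) are illustrative rather than taken from the talk:

    struct ddd_edge;                 /* forward declaration                     */

    /* One node per operation to be scheduled. */
    struct ddd_node {
        int id;                      /* operation number                        */
        int num_succs;
        struct ddd_edge *succs;      /* outgoing dependence edges               */
    };

    /* Each edge carries the <Min,Max> timing weight between two operations. */
    struct ddd_edge {
        struct ddd_node *head;       /* the dependent (later) operation         */
        int min_delay;               /* minimum legal separation in cycles      */
        int max_delay;               /* maximum legal separation in cycles      */
    };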

List Scheduling. Initialize the Data Ready Set (DRS) to those nodes in the DDD with no predecessors. Maintain an array of scheduled instructions, SC, and initialize the index I to 1. While the DDD is not empty: choose a node N from the DRS; if N can be placed in SC(I), schedule N, then remove N and its arcs from the DDD; for each of N's successors S, if S now has no predecessors, add S to the DRS; when none of the nodes in the DRS can be scheduled in SC(I), increment I. A C sketch of this loop follows.
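
A compact C sketch of the loop just described; all of the names (NOPS, npreds, succ, choose_ready) are illustrative, and choose_ready stands in for both the priority heuristic and the resource check:

    #define NOPS 64                      /* number of DDD nodes (example size)  */

    int npreds[NOPS];                    /* unscheduled-predecessor counts      */
    int nsuccs[NOPS];
    int succ[NOPS][NOPS];                /* successor lists                     */
    int sched_cycle[NOPS];               /* result: cycle assigned to each op   */

    /* Return an index into drs[] of a node that can be placed in cycle I,
       or -1 if none fits.  (Hypothetical helper.) */
    extern int choose_ready(const int *drs, int ndrs, int I);

    void list_schedule(int nops)
    {
        int drs[NOPS], ndrs = 0, remaining = nops, I = 0;  /* I is 0-based here */

        for (int n = 0; n < nops; n++)
            if (npreds[n] == 0)
                drs[ndrs++] = n;         /* DRS = nodes with no predecessors    */

        while (remaining > 0) {
            int k = choose_ready(drs, ndrs, I);
            if (k < 0) { I++; continue; }   /* nothing in DRS fits: next instr  */

            int N = drs[k];
            sched_cycle[N] = I;             /* place N in SC(I)                 */
            drs[k] = drs[--ndrs];           /* remove N from the DRS            */
            remaining--;

            for (int s = 0; s < nsuccs[N]; s++) {
                int S = succ[N][s];
                if (--npreds[S] == 0)       /* N's arcs removed: newly ready    */
                    drs[ndrs++] = S;        /* successors enter the DRS         */
            }
        }
    }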

Local vs. Global Instruction Scheduling. Local instruction scheduling orders instructions within a basic block only. Global instruction scheduling considers multiple blocks when scheduling instructions. Local instruction scheduling is easier but generally less effective at using available parallelism. Tjaden and Flynn report parallelism of less than 2 within basic blocks; Nicolau and Fisher report parallelism of 90 when inter-block motion is allowed.

Why Bother? Instruction-level parallelism (ILP) requires scheduling. 90% of execution time is spent in loops. Loops can exhibit significant parallelism across iterations. Local and global instruction scheduling don't address parallelism across loop iterations.

Example Code.

int a[4][4]; int b[4][4]; int c[4][4];

main() {
    int i, j, k;

    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++) {
            c[i][j] = 0;
            for (k = 0; k < 4; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}

Machine Model (for example). Long instruction word (LIW). Separate add and multiply pipes. Add produces a result in one machine cycle; multiply requires two machine cycles to produce a result. Multiply is pipelined.

Local Schedule (192 cycles).
    1. t = a[i][k] * b[k][j]
    2. nop
    3. c[i][j] = t + c[i][j]
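
For reference, the figure works out as follows: the 3-cycle body runs once per innermost iteration, and there are 4 × 4 × 4 = 64 of them, so 64 × 3 = 192 cycles (ignoring the c[i][j] = 0 initializations and loop overhead).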

A Pipelined Schedule (144 cycles).
Prelude (executed once):
    sum = a[i][0] * b[0][j]
Loop Body (executed for k = 1 to 3):
    nop
    t = a[i][k] * b[k][j]   #   sum += t
Postlude (executed once):
    nop
    sum += t
Note: can actually get 96 cycles by overlapping two iterations of the loop in a single 2-cycle loop body.
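
Accounting for the 144 cycles, assuming the prelude issues in a single cycle: each of the 4 × 4 = 16 (i,j) pairs costs 1 (prelude) + 3 × 2 (loop body) + 2 (postlude) = 9 cycles, and 16 × 9 = 144.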

Software Pipelining Methods. Allan et al. (Computing Surveys, 1995) define two major approaches to software pipelining. Kernel Recognition unrolls the innermost loop an appropriate number of times and searches for a repeating pattern; this repeating pattern then becomes the loop body. Modulo Scheduling selects a (minimum) schedule for one loop iteration such that, when the schedule is repeated, no constraints are violated. The length of this schedule is called the initiation interval, II.

Iterative Modulo Scheduling. Goal: schedule the loop with the smallest possible II. Build a data dependence graph (DDG): nodes represent operations to be performed; weighted, directed arcs represent precedence constraints between nodes. Compute the resource constraint, resII: the smallest II that provides enough resources (functional units) to satisfy all the resources required by one loop iteration. Compute the recurrence constraint, recII: the longest dependence cycle in the DDG. minII = max(resII, recII). Try to schedule with II = minII, minII + 1, minII + 2, ... until a valid schedule is found. A sketch of this driver loop follows.
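
A minimal C sketch of the driver loop; the helpers res_ii, rec_ii, and try_modulo_schedule are hypothetical stand-ins for the computations described above:

    extern int res_ii(void);                 /* resource-constrained lower bound   */
    extern int rec_ii(void);                 /* recurrence-constrained lower bound */
    extern int try_modulo_schedule(int ii);  /* 1 on success, 0 on failure         */

    int modulo_schedule(int max_ii)
    {
        int resii = res_ii();
        int recii = rec_ii();
        int ii = resii > recii ? resii : recii;   /* minII = max(resII, recII)     */

        for (; ii <= max_ii; ii++)                /* II = minII, minII + 1, ...    */
            if (try_modulo_schedule(ii))
                return ii;                        /* smallest II that scheduled    */

        return -1;                                /* give up on modulo scheduling  */
    }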

Scheduling a Pipeline. How do you try to schedule with II = minII, minII + 1, minII + 2, ... until a valid schedule is found? Use a slightly modified local scheduling algorithm (list scheduling, but with slightly modified heuristics for prioritizing DDG nodes). Set the schedule length to the II we're attempting to schedule. Attempt to fit the operations to be scheduled into II cycles. If no schedule is found, backtrack and try again within II cycles. After a sufficient number of backtracking attempts, admit failure and move on to the next larger II.
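
The requirement that the schedule repeat every II cycles is usually enforced with a modulo reservation table: resource use is recorded at cycle mod II, so placing an operation also accounts for every later repetition of the kernel. A minimal C sketch, with made-up sizes and names (MAX_II, NRES, mrt):

    #define MAX_II 64
    #define NRES    2                  /* e.g., one add pipe and one multiply pipe */

    static int mrt[MAX_II][NRES];      /* 1 = resource busy in that modulo slot    */

    int slot_free(int cycle, int res, int ii)
    {
        return mrt[cycle % ii][res] == 0;
    }

    void reserve_slot(int cycle, int res, int ii)
    {
        mrt[cycle % ii][res] = 1;
    }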

Dependence. Operation A is dependent upon operation B iff B precedes A in execution and both operations access the same memory location. Four basic types of dependence: true dependence (RAW), anti-dependence (WAR), output dependence (WAW), and input dependence (RAR). Input dependence is usually ignored in scheduling.
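
A small illustration (not from the talk) of the four kinds on straight-line C code; the statement labels S1-S5 exist only for the comments:

    void dep_example(int a, int b, int c)
    {
        int x, y, z;

        x = a + b;   /* S1: writes x, reads a and b                         */
        y = x + 1;   /* S2 reads the x written by S1:   true (RAW)          */
        a = 2;       /* S3 writes the a read by S1:     anti (WAR)          */
        x = c * 3;   /* S4 writes the x written by S1:  output (WAW)        */
        z = c + a;   /* S5 reads the c also read by S4: input (RAR)         */

        (void)y; (void)z;   /* silence unused-variable warnings             */
    }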

Dependence (cont.) Dependences are represented in the DDG. Two groups of dependences: loop-independent and loop-carried. DDG arcs are decorated with two integers, (dist, min): dist is the difference in iterations between the head and tail of the dependence, and min is the latency required between the two nodes. Dependence analysis is quite complicated, but techniques do exist.
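
An illustration of the (dist, min) annotation on a small C loop; the latency values are made up to match the example machine (one-cycle add, two-cycle multiply):

    void carried(int *a, int *b, int *c)
    {
        int k, t, s = 0;

        for (k = 1; k < 4; k++) {
            t = a[k] * b[k];     /* S1                                              */
            s = s + t;           /* S2 reads the t produced by S1 in the same
                                    iteration: loop-independent, e.g. (dist=0, min=2) */
            c[k] = c[k-1] + s;   /* S3 reads the c[k-1] written by the previous
                                    iteration's S3: loop-carried, e.g. (dist=1, min=1) */
        }
    }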

Limitations of Modulo Scheduling. Control flow makes things difficult: modulo scheduling requires a single DDG. If-conversion (Warter) can be used, but the graph explodes quickly; Lam's hierarchical reduction is another method, but not without problems. Software pipelining in general increases register needs dramatically, and modulo scheduling exacerbates this problem. Techniques exist to minimize the required registers but cannot guarantee fitting in the machine's registers; to address register problems, some approaches increase II, others spill. Compile time is a problem, particularly computing the distance matrix.

Improved (?) Modulo Scheduling. Unroll-and-jam on nested loops can significantly shorten the execution time of a loop at the cost of code size and register usage. Use of a cache-reuse model can give better schedules than assuming all cache accesses are hits, and can reduce register requirements compared with assuming all accesses are cache misses. A scheme for software pipelining with a fixed number of registers, akin to spilling while scheduling.

Unroll-and-Jam. Requires nested loops: unroll the outer loop and jam the resulting inner loops back together. This introduces more parallelism into the inner loop and does not lengthen any recurrence cycles. Using unroll-and-jam on 26 FORTRAN nested loops before performing modulo scheduling led to: decreased execution time of up to 94.2%, and of 56.9% on average; significantly increased register requirements, often by a factor of 5.

Example Code with Unroll-and-Jam.

for (i = 0; i < 4; i++)
    for (j = 0; j < 4; j += 2) {
        c[i][j] = 0;
        c[i][j+1] = 0;
        for (k = 0; k < 4; k++) {
            c[i][j] += a[i][k] * b[k][j];
            c[i][j+1] += a[i][k] * b[k][j+1];
        }
    }

Using Cache Reuse Information. Caches lead to latencies that are uncertain (to the compiler). Local scheduling typically assumes all cache accesses are hits; assuming all hits leads to stalls and slower execution. Typical modulo schedulers assume all cache accesses are misses (modulo scheduling can effectively hide the long latency of a miss); assuming all misses stretches register lifetimes and thus leads to greater register requirements.

Cache Reuse Experiment. Using a simple cache-reuse model, our modulo scheduler improved execution time by roughly 11% over an all-hit assumption, with little change in register usage, and used 17.9% fewer registers than an all-miss assumption while generating 8% slower code.

Summary Modulo scheduling can lead to dramatic speed improvements for single-block loops Modulo scheduling does have limitations, namely difficulty with control flow and (potentially) massive register requirements We have implemented some enhancements to modulo scheduling to make it even more effective and practical, and in one case improved upon the optimal schedule.