Tiling: A Data Locality Optimizing Algorithm

Similar documents
Tiling: A Data Locality Optimizing Algorithm

Tiling: A Data Locality Optimizing Algorithm

Control flow graphs and loop optimizations. Thursday, October 24, 13

Review. Loop Fusion Example

The Polyhedral Model (Transformations)

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT

Loop Transformations! Part II!

Linear Loop Transformations for Locality Enhancement

Loop Transformations, Dependences, and Parallelization

Legal and impossible dependences

Null space basis: mxz. zxz I

Simone Campanoni Loop transformations

Compiling for Advanced Architectures

Program Transformations for the Memory Hierarchy

Outline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis

Polyhedral Operations. Algorithms needed for automation. Logistics

A Crash Course in Compilers for Parallel Computing. Mary Hall Fall, L2: Transforms, Reuse, Locality

CS553 Lecture Dynamic Optimizations 2

Compiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7

Loops. Lather, Rinse, Repeat. CS4410: Spring 2013

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization

Fourier-Motzkin and Farkas Questions (HW10)

CSC D70: Compiler Optimization Memory Optimizations

Unroll-and-Jam Using Uniformly Generated Sets

Cache-oblivious Programming

6.189 IAP Lecture 11. Parallelizing Compilers. Prof. Saman Amarasinghe, MIT IAP 2007 MIT

Compiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7

Coarse-Grained Parallelism

Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas

Introduction to optimizations. CS Compiler Design. Phases inside the compiler. Optimization. Introduction to Optimizations. V.

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

Essential constraints: Data Dependences. S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2

Module 18: Loop Optimizations Lecture 35: Amdahl s Law. The Lecture Contains: Amdahl s Law. Induction Variable Substitution.

Module 18: Loop Optimizations Lecture 36: Cycle Shrinking. The Lecture Contains: Cycle Shrinking. Cycle Shrinking in Distance Varying Loops

CS 293S Parallelism and Dependence Theory

CUDA Performance Considerations (2 of 2)

Parallelisation. Michael O Boyle. March 2014

Supercomputing in Plain English Part IV: Henry Neeman, Director

Lecture 11 Loop Transformations for Parallelism and Locality

Preface. Intel Technology Journal Q4, Lin Chao Editor Intel Technology Journal

Module 13: INTRODUCTION TO COMPILERS FOR HIGH PERFORMANCE COMPUTERS Lecture 25: Supercomputing Applications. The Lecture Contains: Loop Unswitching

USC 227 Office hours: 3-4 Monday and Wednesday CS553 Lecture 1 Introduction 4

Plan for Today. Concepts. Next Time. Some slides are from Calvin Lin s grad compiler slides. CS553 Lecture 2 Optimizations and LLVM 1

A Layout-Conscious Iteration Space Transformation Technique

Transforming Imperfectly Nested Loops

SORTING. Insertion sort Selection sort Quicksort Mergesort And their asymptotic time complexity

PERFORMANCE OPTIMISATION

Administration. Prerequisites. Meeting times. CS 380C: Advanced Topics in Compilers

high-speed-high-capacity memory

Compiler Algorithms for Optimizing Locality and Parallelism on Shared and Distributed-Memory Machines

Exploring Parallelism At Different Levels

Outline. Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities

Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.

ECE 5730 Memory Systems

CS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010

Computing and Informatics, Vol. 36, 2017, , doi: /cai

The Polyhedral Model (Dependences and Transformations)

RICE UNIVERSITY. Transforming Complex Loop Nests For Locality by Qing Yi

Loops and Locality. with an introduc-on to the memory hierarchy. COMP 506 Rice University Spring target code. source code OpJmizer

Parallelization. Saman Amarasinghe. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Loop Modifications to Enhance Data-Parallel Performance

CS4961 Parallel Programming. Lecture 10: Data Locality, cont. Writing/Debugging Parallel Code 09/23/2010

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

UMIACS-TR December, CS-TR-3192 Revised April, William Pugh. Dept. of Computer Science. Univ. of Maryland, College Park, MD 20742

Lecture 2. Memory locality optimizations Address space organization

A polyhedral loop transformation framework for parallelization and tuning

Rechnerarchitektur (RA)

Manual and Compiler Optimizations

c Nawaaz Ahmed 000 ALL RIGHTS RESERVED

Introduction. No Optimization. Basic Optimizations. Normal Optimizations. Advanced Optimizations. Inter-Procedural Optimizations

Data-centric Transformations for Locality Enhancement

Reducing Memory Requirements of Nested Loops for Embedded Systems

Affine and Unimodular Transformations for Non-Uniform Nested Loops

Program Transformations for Cache Locality Enhancement on Shared-memory Multiprocessors. Naraig Manjikian

Unrolling parallel loops

Pipeline Parallelism and the OpenMP Doacross Construct. COMP515 - guest lecture October 27th, 2015 Jun Shirako

Lecture Notes on Loop Transformations for Cache Optimization

CSE P 501 Compilers. Loops Hal Perkins Spring UW CSE P 501 Spring 2018 U-1

ECE 669 Parallel Computer Architecture

A Quantitative Algorithm for Data. IRISA, University of Rennes. Christine Eisenbeis INRIA. Abstract

Optimal ILP and Register Tiling: Analytical Model and Optimization Framework

Page 1. Parallelization techniques. Dependence graph. Dependence Distance and Distance Vector

CS553 Lecture Profile-Guided Optimizations 3

Polyèdres et compilation

Kismet: Parallel Speedup Estimates for Serial Programs

Run-time Reordering. Transformation 2. Important Irregular Science & Engineering Applications. Lackluster Performance in Irregular Applications


Cache Performance II 1

Machine Learning for Software Engineering

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

Alternative Array Storage Layouts for Regular Scientific Programs

Code Merge. Flow Analysis. bookkeeping

Auto-tuning a High-Level Language Targeted to GPU Codes. By Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, John Cavazos

Loop Nest Optimizer of GCC. Sebastian Pop. Avgust, 2006

Lecture 12: Instruction Execution and Pipelining. William Gropp

Data Dependence Analysis

A Loop Transformation Theory and an Algorithm to Maximize Parallelism

write-through v. write-back write-through v. write-back write-through v. write-back option 1: write-through write 10 to 0xABCD CPU RAM Cache ABCD: FF

Transcription:

Tiling: A Data Locality Optimizing Algorithm Announcements Monday November 28th, Dr. Sanjay Rajopadhye is talking at BMAC Friday December 2nd, Dr. Sanjay Rajopadhye will be leading CS553 Last Monday Kelly & Pugh transformation framework Loop fusion and fission Brief intro to scheduling Alpha programs Today Unroll and Jam and Tiling Review of the paper A Data Locality Optimizing Algorithm by Michael E. Wolf and Monica S. Lam CS553 Lecture Tiling 2 Loop Unrolling Motivation Reduces loop overhead Improves effectiveness of other transformations Code scheduling CSE The Transformation Make n copies of the loop: n is the unrolling factor Adjust loop bounds accordingly CS553 Lecture Tiling 3 1

Loop Unrolling (cont) Example do i=1,n do i=1,n by 2 A(i) = B(i) + C(i) A(i) = B(i) + C(i) A(i+1) = B(i+1) + C(i+1) Details When is loop unrolling legal? Handle end cases with a cloned copy of the loop Enter this special case if the remaining number of iteration is less than the unrolling factor CS553 Lecture Tiling 4 Loop Balance Problem We d like to produce loops with the right balance of memory operations and floating point operations The ideal balance is machine-dependent e.g. How many load-store units are connected to the L1 cache? e.g. How many functional units are provided? Example do j = 1,2*n The inner loop has 1 memory operation per iteration and 1 floating A(j) = A(j) + B(i) point operation per iteration If our target machine can only support 1 memory operation for every two floating point operations, this loop will be memory bound What can we do? CS553 Lecture Tiling 5 2

Unroll and Jam Idea Restructure loops so that loaded values are used many times per iteration Unroll and Jam Unroll the outer loop some number of times Fuse (Jam) the resulting inner loops Example do j = 1,2*n A(j) = A(j) + B(i) Unroll the Outer Loop do j = 1,2*n by 2 A(j) = A(j) + B(i) A(j+1) = A(j+1) + B(i) CS553 Lecture Tiling 6 Unroll and Jam Example (cont) Unroll the Outer Loop do j = 1,2*n by 2 A(j) = A(j) + B(i) A(j+1) = A(j+1) + B(i) The inner loop has 1 load per iteration and 2 floating point operations per iteration We reuse the loaded value of B(i) The Loop Balance matches the machine balance Jam the inner loops do j = 1,2*n by 2 A(j) = A(j) + B(i) A(j+1) = A(j+1) + B(i) CS553 Lecture Tiling 7 3

Unroll and Jam (cont) Legality When is Unroll and Jam legal? Disadvantages What limits the degree of unrolling? CS553 Lecture Tiling 8 Unroll and Jam IS Tiling (followed by inner loop unrolling) Original Loop do j = 1,2*n by 2 A(j)= A(j) + B(i) j i After Tiling do jj = 1,2*n by 2 do j = jj, jj+2-1 A(j)= A(j)+B(i) After Unroll and Jam do jj = 1,2*n by 2 A(j)= A(j)+B(i) A(j+1)= A(j+1)+B(i) CS553 Lecture Tiling 9 4

Paper Critique and Presentation Format Critique: 1-2 pages with a paragraph answering each of the following questions What problem did the paper address? Is the problem important/interesting? What is the approach used to solve the problem? How does the paper support or justify the conclusions it reaches? What problems are explicitly or implicitly left as future research questions? Presentation: 10-15 slides that present the answers to the critique questions (example for the paper A Data Locality Optimizing Algorithm by Michael E. Wolf and Monica S. Lam follows) CS553 Lecture Tiling 10 What is the problem the paper addresses? How can we apply loop interchange, skewing, and reversal to generate a loop that is legally tilable (ie. fully permutable) a loop that when tiled will result in improved data locality Original Loop do j = 1,2*n by 2 A(j)= A(j) + B(i) j i CS553 Lecture Tiling 11 5

Is the problem important/interesting? Performance improvements due to tiling can be significant For matrix multiply, 2.75 speedup on a single processor Enables better scaling on parallel processors Tiling Loops More Complex than MM requires making loops permutable goal is to make loops exhibiting reuse permutable CS553 Lecture Tiling 12 What is the approach used to solve the problem? Create a unimodular transformation that results in loops experiencing reuse becoming fully permutable and therefore tilable Formulation of the data locality optimization problem (the specific problem their approach solves) For a given iteration space with a set of dependence vectors, and uniformly generate reference sets the data locality optimization problem is to find the unimodular and/or tiling transform, subject to data dependences, that minimizes the number of memory accesses per iteration. The problem is hard Just finding a legal unimodular transformation is exponential in the number of loops. CS553 Lecture Tiling 13 6

Terminology Dependence vector - a generalization of distance and direction vectors Reuse versus Locality Localized vector space Uniformly generated reference sets CS553 Lecture Tiling 14 Heuristic for solving data locality optimization problem Perform reuse analysis to determine innermost tile (ie. localized vector space) only consider elementary vectors as reuse vectors For the localized vector space, break problem into all possible tiling combinations Apply SRP algorithm in an attempt to make loops fully permutable (S)kew transformations, (R)eversal transformation, and (P)ermutation Definitely works when dependences are lexicographically positive distance vectors O(n 2 *d) where n is the loop nest depth and d is the number of dependence vectors CS553 Lecture Tiling 15 7

How does the paper support the conclusion it reaches? The algorithm... is successful in optimizing codes such as matrix multiplication, successive over-relaxation (SOR), LU decomposition without pivoting, and Givens QR factorization. They implement their algorithm in the SUIF compiler They have the compiler generate serial and parallel code for the SGI 4D/380 They perform some optimization by hand register allocation of array elements loop invariant code motion unrolling the innermost loop Benchmarks and parameters LU kernel on 1, 4, and 8 processors using a matrix of size 500x500 and tile sizes of 32x32 and 64x64 SOR kernel on 500x500 matrix, 30 time steps CS553 Lecture Tiling 16 SOR Transformations Variations of 2D (data) SOR wavefront version, theoretically great parallelism, but bad locality 2D tiling, better than wavefront, doesn t exploit temporal reuse 3D tile version, best performance Picture for 1D (data) SOR CS553 Lecture Tiling 17 8

What problems are left as future research? Explicitly stated future work The authors suggest that their SRP algorithm may have its performance improved with a branch and bound formulation. Questions left unanswered in the paper How should the tile sizes be selected? After performing tiling, what algorithm should be used to determine further transformations for improved performance? They perform inner loop unrolling and other, but do not perform a model for which transformations should be performed and what their parameters should be. What is the relationship between storage reuse, data locality, and parallelism? CS553 Lecture Tiling 18 Concepts Unroll and Jam is the same as Tiling with the inner loop unrolled Tiling can improve... loop balance spatial locality data locality CS553 Lecture Tiling 19 9

Next Time (November 28th) Student Surveys Lecture Compiling object-oriented languages CS553 Lecture Tiling 20 10