Tiling: A Data Locality Optimizing Algorithm


Previously
- Unroll and Jam

Homework
- PA3 is due Monday November 2nd

Today
- Unroll and Jam is tiling
- Code generation for fixed-size tiles
- Paper writing and critique, using Wolf and Lam, "A Data Locality Optimizing Algorithm," as an example

CS 553 Tiling 1

Unroll and Jam

Idea
- Restructure loops so that loaded values are used many times per iteration

Unroll and jam
- Unroll the outer loop some number of times
- Fuse (jam) the resulting inner loops

Example
  do j = 1, 2*n
    do i = 1, n
      A(j) = A(j) + B(i)

Unroll the outer loop
  do j = 1, 2*n by 2
    do i = 1, n
      A(j) = A(j) + B(i)
    do i = 1, n
      A(j+1) = A(j+1) + B(i)

Unroll and Jam Example (cont)

Unroll the outer loop
  do j = 1, 2*n by 2
    do i = 1, n
      A(j) = A(j) + B(i)
    do i = 1, n
      A(j+1) = A(j+1) + B(i)

Jam the inner loops
  do j = 1, 2*n by 2
    do i = 1, n
      A(j) = A(j) + B(i)
      A(j+1) = A(j+1) + B(i)

- After jamming, the inner loop has 1 load per iteration and 2 floating point operations per iteration
- We reuse the loaded value of B(i)
- Aim to match the machine balance: how many floating point ops per load?

Tiling

A non-unimodular transformation that...
- groups iteration points into tiles that are executed atomically
- can improve spatial and temporal data locality
- can expose larger granularities of parallelism

Implementing tiling
- how can we specify tiling?
- when is tiling legal?
- how do we generate tiled code?

Example tiled code:
  do ii = 1, 6 by 2
    do jj = 1, 5 by 2
      do i = ii, ii+2-1
        do j = jj, min(jj+2-1, 5)
          A(i,j) = ...

[figure: 6x5 iteration space grouped into 2x2 tiles]

Unroll and Jam IS Tiling (followed by inner loop unrolling)

Original loop
  do j = 1, 2*n
    do i = 1, n
      A(j) = A(j) + B(i)

After tiling (strip-mine j, then move the point loop j inside i)
  do jj = 1, 2*n by 2
    do i = 1, n
      do j = jj, jj+2-1
        A(j) = A(j) + B(i)

After unrolling the inner j loop (= unroll and jam)
  do jj = 1, 2*n by 2
    do i = 1, n
      A(jj) = A(jj) + B(i)
      A(jj+1) = A(jj+1) + B(i)

Code Generation for Tiling

Fixed-size tiles
- Omega library
- Cloog
- for rectangular spaces and tiles, straightforward:

  do ii = 1, 6 by 2
    do jj = 1, 5 by 2
      do i = ii, ii+2-1
        do j = jj, min(jj+2-1, 5)
          A(i,j) = ...

Parameterized tile sizes
- "Parameterized Tiled Loops for Free", PLDI 2007
- HiTLOG, a tiled loop generator that is part of AlphaZ

Overview of the decoupled approach
- find a polyhedron that may contain all tile origins
- generate code that traverses that polyhedron
- post-process the code to start at tile origins and step by the tile size
- generate loops over the points in each tile, staying within the original iteration space and within the tile

Specifying Tiling

Rectangular tiling
- tile size vector
- tile offset

Possible transformation mappings
- creating a tile space
- keeping tile iterators in the original iteration space

[figure: tiled iteration space]

Using iscc to do code generation for tiling

Iteration space:
  S := { s[i,j] : 1 <= i <= 6 && 1 <= j <= 5 };

Tiling specification:
  T := { s[i,j] -> [ti,tj,i,j] : ti = (i-1)/2 && tj = (j-1)/2 };

This gives output, but it is not correct: the division is treated as exact division, not integer (floor) division.

Getting rid of the integer division:
  ti = (i-1)/2 becomes 0 <= ri < 2 && (i-1) = 2*ti + ri
  tj = (j-1)/2 becomes 0 <= rj < 2 && (j-1) = 2*tj + rj

  T := { s[i,j] -> [ti,tj,i,j] : exists ri,rj :
         0 <= ri < 2 && i-1 = 2*ti + ri &&
         0 <= rj < 2 && j-1 = 2*tj + rj };

Let's try another specification, one that doesn't create a new tile space.

Concepts

- Unroll and jam is the same as tiling with the inner loop unrolled
- Tiling can improve...
  - loop balance
  - spatial and temporal data locality
  - computation-to-communication ratio
- Implementing tiling
  - specification
  - checking legality
  - code generation

Paper Writing and Critique in General

Papers should answer the following questions:
- What problem did the paper address?
- Is the problem important/interesting?
- What is the approach used to solve the problem?
- How does the paper support or justify the conclusions it reaches?
- What problems are explicitly or implicitly left as future research questions?

"A Data Locality Optimizing Algorithm" by Michael E. Wolf and Monica S. Lam, 1991

Specific problem: How can we apply loop interchange, skewing, and reversal to generate
- a loop nest that is legally tilable (i.e., fully permutable)
- a loop nest that, when tiled, will result in improved data locality

[figure: example loop and its iteration space]

Is the problem important/interesting?

- Performance improvements due to tiling can be significant
  - for matrix multiply, a 2.75x speedup on a single processor
  - enables better scaling on parallel processors
- Tiling loops more complex than MM
  - requires making the loops permutable
  - goal is to make the loops exhibiting reuse permutable

What is the approach used to solve the problem?

Create a unimodular transformation that makes the loops experiencing reuse fully permutable, and therefore tilable.

Formulation of the data locality optimization problem (the specific problem their approach solves):
  For a given iteration space with a set of dependence vectors and uniformly generated reference sets, the data locality optimization problem is to find the unimodular and/or tiling transform, subject to the data dependences, that minimizes the number of memory accesses per iteration.

The problem is hard
- Just finding a legal unimodular transformation is exponential in the number of loops.

Heuristic for solving the data locality optimization problem

- Perform reuse analysis to determine the innermost tile (i.e., the localized vector space)
  - only consider elementary vectors as reuse vectors
- For the localized vector space, break the problem into all possible tiling combinations
- Apply the SRP algorithm in an attempt to make the loops fully permutable
  - (S)kew transformations, (R)eversal transformations, and (P)ermutations
  - definitely works when the dependences are lexicographically positive distance vectors
  - O(n^2 * d), where n is the loop nest depth and d is the number of dependence vectors

How does the paper support the conclusions it reaches?

The algorithm "...is successful in optimizing codes such as matrix multiplication, successive over-relaxation (SOR), LU decomposition without pivoting, and Givens QR factorization."

- They implement their algorithm in the SUIF compiler
- They have the compiler generate serial and parallel code for the SGI 4D/380
- They perform some optimizations by hand
  - register allocation of array elements
  - loop-invariant code motion
  - unrolling the innermost loop
- Benchmarks and parameters
  - LU kernel on 1, 4, and 8 processors, using a 500x500 matrix and tile sizes of 32x32 and 64x64
  - SOR kernel on a 500x500 matrix, 30 time steps

SOR Transformations

Variations of 2D (data) SOR
- wavefront version: theoretically great parallelism, but bad locality
- 2D tile version: better than wavefront, but doesn't exploit temporal reuse
- 3D tile version: best performance

[figure: picture for 1D (data) SOR]

What problems are left as future research?

Explicitly stated future work
- The authors suggest that the performance of their SRP algorithm might be improved with a branch-and-bound formulation.

Questions left unanswered in the paper
- How should the tile sizes be selected?
- After tiling, what algorithm should be used to determine further transformations for improved performance? They perform inner loop unrolling and other transformations by hand, but do not provide a model for deciding which transformations to perform and what their parameters should be.
- What is the relationship between storage reuse, data locality, and parallelism?