Tiling: A Data Locality Optimizing Algorithm

Similar documents
Tiling: A Data Locality Optimizing Algorithm

CS671 Parallel Programming in the Many-Core Era

Tiling: A Data Locality Optimizing Algorithm

Polyhedral Operations. Algorithms needed for automation. Logistics

Polyhedral Compilation Foundations

The Polyhedral Model (Transformations)

Review. Loop Fusion Example

The Polyhedral Compilation Framework

Control flow graphs and loop optimizations. Thursday, October 24, 13

Iterative Optimization in the Polyhedral Model

Loop Transformations! Part II!

Fourier-Motzkin and Farkas Questions (HW10)

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT

Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time

Compiling for Advanced Architectures

Loop Transformations, Dependences, and Parallelization

Polly Polyhedral Optimizations for LLVM

TOBIAS GROSSER Parkas Group, Computer Science Department, Ècole Normale Supèrieure / INRIA 45 Rue d Ulm, Paris, 75005, France

PolyOpt/C. A Polyhedral Optimizer for the ROSE compiler Edition 0.2, for PolyOpt/C March 12th Louis-Noël Pouchet

Program transformations and optimizations in the polyhedral framework

PoCC. The Polyhedral Compiler Collection package Edition 0.3, for PoCC 1.2 February 18th Louis-Noël Pouchet

A polyhedral loop transformation framework for parallelization and tuning

Loop Nest Optimizer of GCC. Sebastian Pop. Avgust, 2006

More Data Locality for Static Control Programs on NUMA Architectures

Legal and impossible dependences

CS 293S Parallelism and Dependence Theory

A Crash Course in Compilers for Parallel Computing. Mary Hall Fall, L2: Transforms, Reuse, Locality

Essential constraints: Data Dependences. S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2

Simone Campanoni Loop transformations

The Polyhedral Model (Dependences and Transformations)

CS4961 Parallel Programming. Lecture 10: Data Locality, cont. Writing/Debugging Parallel Code 09/23/2010

Alan LaMielle, Michelle Strout Colorado State University March 16, Technical Report CS

Polyhedral Compilation Foundations

6.189 IAP Lecture 11. Parallelizing Compilers. Prof. Saman Amarasinghe, MIT IAP 2007 MIT

UNIVERSITÉ DE PARIS-SUD 11 U.F.R. SCIENTIFIQUE D ORSAY


PLuTo: A Practical and Fully Automatic Polyhedral Program Optimization System

Computing and Informatics, Vol. 36, 2017, , doi: /cai

Clint: A Direct Manipulation Tool for Parallelizing Compute-Intensive Program Parts

Polyhedral-Based Data Reuse Optimization for Configurable Computing

Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures

PERFORMANCE OPTIMISATION

UNIVERSITÉ DE PARIS-SUD 11 U.F.R. SCIENTIFIQUE D ORSAY

Parallelisation. Michael O Boyle. March 2014

Compiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7

Outline. L5: Writing Correct Programs, cont. Is this CUDA code correct? Administrative 2/11/09. Next assignment (a homework) given out on Monday

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization

Oil and Water Can Mix: An Integration of Polyhedral and AST-based Transformations

Parallel Numerical Algorithms

Language and compiler parallelization support for Hash tables

Auto-Vectorization of Interleaved Data for SIMD

Polyhedral Search Space Exploration in the ExaStencils Code Generator

An Approach for Code Generation in the Sparse Polyhedral Framework

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Outline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis

Automatic Polyhedral Optimization of Stencil Codes

Loops and Locality. with an introduc-on to the memory hierarchy. COMP 506 Rice University Spring target code. source code OpJmizer

Introduction. No Optimization. Basic Optimizations. Normal Optimizations. Advanced Optimizations. Inter-Procedural Optimizations

PARALLEL TILED CODE GENERATION WITH LOOP PERMUTATION WITHIN TILES

Program Transformations for the Memory Hierarchy

Compiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7

CSC D70: Compiler Optimization Memory Optimizations

Loop Modifications to Enhance Data-Parallel Performance

The Polyhedral Model Is More Widely Applicable Than You Think

Fusion of Loops for Parallelism and Locality

Parametric Multi-Level Tiling of Imperfectly Nested Loops*

Automatic Tuning of Scientific Applications. Apan Qasem Ken Kennedy Rice University Houston, TX

Module 18: Loop Optimizations Lecture 35: Amdahl s Law. The Lecture Contains: Amdahl s Law. Induction Variable Substitution.

GRAPHITE: Polyhedral Analyses and Optimizations

EE 4683/5683: COMPUTER ARCHITECTURE

CS560 Lecture Parallel Architecture 1

Introducing the GCC to the Polyhedron Model

A Framework for Automatic OpenMP Code Generation

Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework

Loops. Lather, Rinse, Repeat. CS4410: Spring 2013

Outline. Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities

CS4230 Parallel Programming. Lecture 12: More Task Parallelism 10/5/12

Far Fetched Prefetching? Tomofumi Yuki INRIA Rennes Antoine Morvan ENS Cachan Bretagne Steven Derrien University of Rennes 1

Optimising for the p690 memory system

Parallelization. Saman Amarasinghe. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

SE-292 High Performance Computing. Memory Hierarchy. R. Govindarajan Memory Hierarchy

LooPo: Automatic Loop Parallelization

Run-time Reordering. Transformation 2. Important Irregular Science & Engineering Applications. Lackluster Performance in Irregular Applications

FADA : Fuzzy Array Dataflow Analysis

Module 13: INTRODUCTION TO COMPILERS FOR HIGH PERFORMANCE COMPUTERS Lecture 25: Supercomputing Applications. The Lecture Contains: Loop Unswitching

Day 6: Optimization on Parallel Intel Architectures

Preface. Intel Technology Journal Q4, Lin Chao Editor Intel Technology Journal

Generation of parallel synchronization-free tiled code

Language and Compiler Parallelization Support for Hashtables

ECE 5730 Memory Systems

CS4961 Parallel Programming. Lecture 7: Introduction to SIMD 09/14/2010. Homework 2, Due Friday, Sept. 10, 11:59 PM. Mary Hall September 14, 2010

Performance Comparison Between Patus and Pluto Compilers on Stencils

Offload acceleration of scientific calculations within.net assemblies

ECE 563 Spring 2012 First Exam

USC 227 Office hours: 3-4 Monday and Wednesday CS553 Lecture 1 Introduction 4

Module 16: Data Flow Analysis in Presence of Procedure Calls Lecture 32: Iteration. The Lecture Contains: Iteration Space.

Cache Performance II 1

CSC D70: Compiler Optimization Prefetching

Polly - Polyhedral optimization in LLVM

Transcription:

Tiling: A Data Locality Optimizing Algorithm Previously Performance analysis of existing codes Data dependence analysis for detecting parallelism Specifying transformations using frameworks Today Usefulness of the polyhedral model Using Pluto to automatically transform code Unroll and Jam and Tiling Specifying tiling in the Kelly and Pugh transformation framework Status of code generation for tiling Tiling 1 Usefulness of the Polyhedral Model Loop transformations and parallelization can have a significant impact on performance Other than new algorithms, data locality and parallelization is necessary for future improvements Recall that loop permutation in smithwaterman.c resulted in an order of magnitude execution time difference 350 300 Execution time for varying problem sizes User time System time Elapsed time Execution time for varying problem sizes after switching lines 207 and 208 20 18 16 User time System time Elapsed time 250 14 Time in seconds 200 150 Time in seconds 12 10 8 100 6 4 50 2 0 5000 10000 15000 20000 25000 30000 Problem size 0 5000 10000 15000 20000 25000 30000 Tiling Problem size 2

Usefulness of the Polyhedral Model, cont Existing tools in industry that use the polyhedral model IBM XL/Poly Reservoir Labs R-Stream Apolent Polyhedral model in open source and research Graphite GCC, au auto parallelization pass Pluto and Orio PoCC (Clan-Candl-LetSee-Pluto-Cloog-Polylib-PIPLib-ISL-FM) AlphaZ, POET, Chill, URUK, Omega, Loopo Programming language abstractions that are related Domains in Chapel Regions in X10 Tiling 3 Getting Started with Pluto Pluto has been installed on the CS linux machines Try out some examples git clone git://repo.or.cz/pluto.git will create a directory called pluto Test and examples directory contains example input files polycc test/seidel.c --tile parallel Read some documentation more pluto/doc/doc.txt http://pluto-compiler.sourceforge.net/ Using it on your own code Put in #pragma scop and #pragma endscop around the code to be transformed <show using polycc with the smithwaterman.c example> Tiling 4

Static Control Programs Static Control Programs Definition Symbolic constants are variables that are not modified in the loop Loop bounds are affine combinations of loop variables and symbolic constants Array accesses are affine combinations of loop variables and symbolic constants If conditions are affine functions of loop variables and symbolic constants All function calls are pure References Pluto_general_notes.txt (will post with this lecture) Pluto/doc/DOC.txt Section 2.2 of the Feautrier paper Tiling 5 Control over Pluto transformations (see doc/doc.txt) polycc infile.c --parallel, Parallelize code with OpenMP --tile, Tile code. Can control tile sizes with an input file in a specific format. --maxfuse, Do the maximum amount of loop fusion --unroll, does unroll and jam up to two loops --prevector, make code amenable to vectorization --debug, Intermediate files are kept so can see dependences, etc. What is tiling? Unroll and Jam and Tiling Specifying tiling in the Kelly and Pugh transformation framework Status of code generation for tiling Tiling 6

Loop Unrolling Motivation Reduces loop overhead Improves effectiveness of other transformations Code scheduling CSE The Transformation Make n copies of the loop: n is the unrolling factor Adjust loop bounds accordingly Tiling 7 Loop Unrolling (cont) Example do i=1,n do i=1,n-1 by 2 A(i) = B(i) + C(i) A(i) = B(i) + C(i) A(i+1) = B(i+1) + C(i+1) if (i=n) A(i) = B(i) + C(i) Details When is loop unrolling legal? Handle end cases with a cloned copy of the loop Enter this special case if the remaining number of iteration is less than the unrolling factor Tiling 8

Loop Balance Problem We d like to produce loops with the right balance of memory operations and floating point operations The ideal balance is machine-dependent e.g. How many load-store units are connected to the L1 cache? e.g. How many functional units are provided? Example do j = 1,2*n The inner loop has 1 memory operation per iteration and 1 floating A(j) = A(j) + B(i) point operation per iteration If our target machine can only support 1 memory operation for every two floating point operations, this loop will be memory bound What can we do? Tiling 9 Unroll and Jam Idea Restructure loops so that loaded values are used many times per iteration Unroll and Jam Unroll the outer loop some number of times Fuse (Jam) the resulting inner loops Example do j = 1,2*n A(j) = A(j) + B(i) Unroll the Outer Loop do j = 1,2*n by 2 A(j) = A(j) + B(i) A(j+1) = A(j+1) + B(i) Tiling 10

Unroll and Jam Example (cont) Unroll the Outer Loop do j = 1,2*n by 2 A(j) = A(j) + B(i) A(j+1) = A(j+1) + B(i) The inner loop has 1 load per iteration and 2 floating point operations per iteration We reuse the loaded value of B(i) The Loop Balance matches the machine balance Jam the inner loops do j = 1,2*n by 2 A(j) = A(j) + B(i) A(j+1) = A(j+1) + B(i) Tiling 11 Unroll and Jam (cont) Legality When is Unroll and Jam legal? Disadvantages What limits the degree of unrolling? Tiling 12

Tiling A non-unimodular transformation that... groups iteration points into tiles that are executed atomically can improve spatial and temporal data locality can expose larger granularities of parallelism Implementing tiling how can we specify tiling? when is tiling legal? how do we generate tiled code? j i do ii = 1,6, by 2 do jj = 1, 5, by 2 do i = ii, ii+2-1 do j = jj, min(jj+2-1,5) A(i,j) =... Tiling 13 Specifying Tiling Rectangular tiling tile size vector tile offset, Possible Transformation Mappings creating a tile space j i keeping tile iterators in original iteration space Tiling 14

Legality of Tiling A legal rectangular tiling each tile executed atomically no dependence cycles between tiles Check legality by verifying that transformed data dependences are lexicographically positive j i Fully permutable loops rectangular tiling is legal on fully permutable loops j i Tiling 15 Code Generation for Tiling Fixed-size Tiles Omega library Cloog for rectangular space and tiles, straight-forward Parameterized tile sizes Parameterized tiled loops for free, PLDI 2007 HiTLOG - A Tiled Loop Generator that is part of AlphaZ do ii = 1,6, by 2 do jj = 1, 5, by 2 do i = ii, ii+2-1 do j = jj, min(jj+2-1,5) A(i,j) =... Overview of decoupled approach find polyhedron that may contain any loop origins generate code that traverses that polyhedron post process the code to start a tile origins and step by tile size generate loops over points in tile to stay within original iteration space and within tile Tiling 16

Unroll and Jam IS Tiling (followed by inner loop unrolling) Original Loop do j = 1,2*n A(j)= A(j) + B(i) j i After Tiling do jj = 1,2*n by 2 do j = jj, jj+2-1 A(j)= A(j)+B(i) After Unroll and Jam do jj = 1,2*n by 2 A(j)= A(j)+B(i) A(j+1)= A(j+1)+B(i) Tiling 17 Concepts Unroll and Jam is the same as Tiling with the inner loop unrolled Tiling can improve... loop balance spatial locality data locality computation to communication ratio Implementing tiling specification checking legality code generation Tiling 18

Next Time Lecture Review of transformations covered so far Run-time reordering transformations Schedule Project intermediate report due March 28 th April 3 rd will be a lab day during class, Manaf will help people with AlphaZ and Pluto. Distance students can email questions or use the discussion board. HW6 and HW7 will BOTH be due April 4 th Tiling 19