Compiling for Advanced Architectures

In this lecture, we concentrate on compilation issues for scientific codes. Typically, scientific codes
- use arrays as their main data structures, and
- have loops that contain most of the computation in the program.

As a result, advanced optimizing transformations concentrate on loop-level optimizations. Most loop-level optimizations are source-to-source, i.e., they reshape loops at the source level. Today we will talk briefly about
- dependence analysis
- vectorization
- parallelization
- blocking (tiling) for cache
- distributed-memory compilation

Dependence Overview

Dependence relation: describes all statement-to-statement execution orderings for a sequential program that must be preserved if the meaning of the program is to remain the same. There are two sources of dependences.

Data dependence:
  S1:  pi = 3.14
  S2:  r = 5.0
  S3:  area = pi * r**2
(S3 depends on both S1 and S2.)

Control dependence:
  S1:  if (t .ne. 0.0) then
  S2:     a = a / t
       endif
(S2 depends on S1.)

How do we preserve the meaning of these programs? Execute the statements in an order that preserves the original load/store order.

Dependence Overview

Definition: there is a data dependence from statement S1 to statement S2 (S1 δ S2) if
  1. both statements access the same memory location, and
  2. there is a run-time execution path from S1 to S2.

Data dependence classification (S2 depends on S1, written S1 δ S2):
- true (flow) dependence: S1 writes a memory location that S2 later reads
- anti dependence: S1 reads a memory location that S2 later writes
- output dependence: S1 writes a memory location that S2 later writes
- input dependence: S1 reads a memory location that S2 later reads

Note: input dependences do not restrict statement (load/store) order!
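
A small, self-contained illustration of the four classes (the variable names are invented for this sketch and are not from the original slides):

  program dependence_kinds
    implicit none
    real :: x, y, z, c
    y = 2.0
    c = 3.0
    x = y + c          ! S1: writes x, reads y and c
    z = x*2.0 + y      ! S2: reads x        -> true (flow) dependence S1 δ S2
    y = 1.0            ! S3: writes y       -> anti dependence: S2 reads y, S3 writes it later
    x = c              ! S4: writes x again -> output dependence with S1;
                       !     reads c again  -> input dependence with S1 (no ordering constraint)
    print *, x, y, z
  end program dependence_kinds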

Dependence: Where Do We Need It?

We restrict our discussion to data dependence for scalar and subscripted variables (no pointers and no control dependence). Examples:

  do I = 1, 100
    do J = 1, 100
      A(I,J) = A(I,J) + 1

  do I = 1, 99
    do J = 1, 100
      A(I,J) = A(I+1,J) + 1

Vectorization:
  A(1:100:1, 1:100:1) = A(1:100:1, 1:100:1) + 1
  A(1:99, 1:100)      = A(2:100, 1:100) + 1

Parallelization:
  doall I = 1, 100
    doall J = 1, 100
      A(I,J) = A(I,J) + 1
  (implicit barrier sync at the end of each doall)

  do I = 1, 99
    doall J = 1, 100
      A(I,J) = A(I+1,J) + 1
  (implicit barrier sync at the end of the doall; the I loop stays sequential because it carries a dependence through A(I+1,J))
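
A modern way to express the doall in the second example is an OpenMP parallel loop. This is a sketch, not from the slides; it assumes an OpenMP-capable Fortran compiler, and the 100x100 array size is taken from the example above.

  program openmp_doall
    implicit none
    real :: A(100,100)
    integer :: I, J
    A = 0.0
    do I = 1, 99                    ! stays sequential: carries a dependence through A(I+1,J)
       !$omp parallel do
       do J = 1, 100
          A(I,J) = A(I+1,J) + 1.0   ! iterations of J are independent of each other
       end do
       !$omp end parallel do        ! implicit barrier sync
    end do
    print *, A(1,1)
  end program openmp_doall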

Dependence Analysis

Question: do two variable references never / maybe / always access the same memory location?

Benefits:
- improves alias analysis
- enables loop transformations

Motivation:
- classic optimizations
- instruction scheduling
- data locality (register/cache reuse)
- vectorization, parallelization

Obstacles:
- array references
- pointer references

Vectorization vs. Parallelization

- Vectorization: find parallelism in innermost loops; fine-grain parallelism.
- Parallelization: find parallelism in outermost loops; coarse-grain parallelism.

Parallelization is considered more complex than vectorization, since finding coarse-grain parallelism requires more analysis (e.g., interprocedural analysis). Automatic vectorizers have been very successful.

Dependence Analysis for Array References

  do I = 1, 100
    A(I)  = ...
    ...   = A(I-1)      (loop-carried dependence: source and sink on different iterations)

  do I = 1, 100
    A(I)  = ...
    ...   = A(I)        (loop-independent dependence: source and sink on the same iteration)

A loop-independent dependence exists regardless of the loop structure: the source and sink of the dependence occur on the same loop iteration. A loop-carried dependence is induced by the iterations of a loop: the source and sink of the dependence occur on different loop iterations. Loop-carried dependences can inhibit parallelization and loop transformations.

Dependence Testing

Given the nest

  do i_1 = L_1, U_1
    ...
    do i_n = L_n, U_n
      S1:  A(f_1(i_1,...,i_n), ..., f_m(i_1,...,i_n)) = ...
      S2:  ... = A(g_1(i_1,...,i_n), ..., g_m(i_1,...,i_n))

a dependence between statements S1 and S2, denoted S1 δ S2, indicates that S1, the source, must be executed before S2, the sink, on some iteration of the nest. Let α and β be vectors of n integers within the ranges of the lower and upper bounds of the n loops. The dependence question is: do there exist α ≤ β such that f_k(α) = g_k(β) for all k, 1 ≤ k ≤ m?
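
As a worked instance of this question (an illustration, not from the slides): take a single loop i = 1, 100 in which S1 writes A(2*i) and S2 reads A(2*i + 1), so f_1(i) = 2i and g_1(i) = 2i + 1. A dependence would require integers α and β in [1, 100] with 2α = 2β + 1; the left-hand side is always even and the right-hand side always odd, so no solution exists and the two references are independent. This is exactly the kind of reasoning the GCD test discussed below automates.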

Iteration Space

  do I = 1, 5
    do J = I, 6
      ...

(The iteration space is the triangular set of points with 1 ≤ I ≤ 5 and I ≤ J ≤ 6.)

The lexicographical (sequential) order for the above iteration space is
  (1,1), (1,2), ..., (1,6), (2,2), (2,3), ..., (2,6), ..., (5,5), (5,6).

Given I = (i_1, ..., i_n) and I' = (i'_1, ..., i'_n), I < I' iff there is a k ≥ 0 such that (i_1, ..., i_k) = (i'_1, ..., i'_k) and i_{k+1} < i'_{k+1}.
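
The definition of lexicographic order transcribes directly into code; the following comparison function is a sketch with invented names, not part of the slides.

  program lex_order
    implicit none
    integer :: a(2), b(2)
    a = (/ 1, 6 /)
    b = (/ 2, 2 /)
    print *, lex_less(a, b, 2)      ! .true.: (1,6) precedes (2,2) in sequential order
  contains
    logical function lex_less(x, y, n)
      integer, intent(in) :: n
      integer, intent(in) :: x(n), y(n)
      integer :: k
      lex_less = .false.
      do k = 1, n
         if (x(k) < y(k)) then      ! first differing position decides the order
            lex_less = .true.
            return
         else if (x(k) > y(k)) then
            return
         end if
      end do
    end function lex_less
  end program lex_order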

Distance & Direction Vectors

  do I = 1, N
    do J = 1, N
S1:   A(I,J) = A(I,J-1)

  do I = 1, N
    do J = 1, N
S2:   A(I,J) = A(I-1,J-1)
S3:   B(I,J) = B(I-1,J+1)

Distance vector = number of iterations between accesses to the same location.
Direction vector = direction in iteration space (=, <, >).

              distance vector    direction vector
  S1 δ S1         (0, 1)             (=, <)
  S2 δ S2         (1, 1)             (<, <)
  S3 δ S3         (1, -1)            (<, >)

Which Loops Are Parallel?

  do I = 1, N
    do J = 1, N
S1:   A(I,J) = A(I,J-1)

  do I = 1, N
    do J = 1, N
S2:   A(I,J) = A(I-1,J-1)

  do I = 1, N
    do J = 1, N
S3:   B(I,J) = B(I-1,J+1)

A dependence D = (d_1, ..., d_k) is carried at level i if d_i is the first nonzero element of the distance/direction vector.

A loop L_i is parallel if no dependence D_j is carried at level i, i.e., every dependence D_j satisfies one of

                              distance vector             direction vector
  carried at an outer level:  (d_1, ..., d_{i-1}) > 0     some element of (d_1, ..., d_{i-1}) is <
  not carried at level i:     (d_1, ..., d_i) = 0         (d_1, ..., d_i) = (=, ..., =)
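
Applying this criterion to the examples above (a worked note, not on the original slide): S1 has distance vector (0,1), carried at level 2, so its I loop is parallel and its J loop is not. S2 has (1,1) and S3 has (1,-1), both carried at level 1, so in those nests the outer I loop must stay sequential while the inner J loop is parallel.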

Approaches to Dependence Testing

- Can we solve this problem exactly? What does "conservative" mean in this framework?
- Restrict the problem to index and bound expressions that are linear functions; even then, solving a general system of linear equations over the integers is NP-hard.

Solution methods:
- inexact methods: Greatest Common Divisor (GCD) test, Banerjee's inequalities
- cascade of exact, efficient tests, falling back on inexact methods if needed (Rice, Stanford)
- exact general tests (integer programming)
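
A minimal sketch of the GCD test (an illustration, not from the slides): for a write to A(a*i + b) and a read of A(c*i + d) in the same loop, the dependence equation a*α + b = c*β + d has integer solutions only if gcd(a, c) divides d - b, so if it does not, the references are independent. The coefficients below are made up for the example and match the worked instance above.

  program gcd_test
    implicit none
    integer :: a, b, c, d
    a = 2; b = 0          ! write of A(2*i)
    c = 2; d = 1          ! read  of A(2*i + 1)
    if (mod(d - b, gcd_of(a, c)) == 0) then
       print *, 'dependence possible (the test is conservative)'
    else
       print *, 'no dependence: gcd(a, c) does not divide d - b'
    end if
  contains
    integer function gcd_of(x, y)
      integer, intent(in) :: x, y
      integer :: p, q, r
      p = abs(x); q = abs(y)
      do while (q /= 0)       ! Euclid's algorithm
         r = mod(p, q)
         p = q
         q = r
      end do
      gcd_of = p
    end function gcd_of
  end program gcd_test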

Loop Transformations

Goal: modify the execution order of loop iterations while preserving data dependence constraints.

Motivation:
- data locality (increase reuse of registers, cache)
- parallelism (eliminate loop-carried dependences, increase granularity)

Taxonomy:
- loop interchange: change the order of loops in a nest
- loop fusion: merge the bodies of adjacent loops
- loop distribution: split the body of a loop into adjacent loops
- strip-mine and interchange (tiling, blocking): split a loop into nested loops, then interchange

Loop Interchange

     do I = 1, N
       do J = 1, N
S1:      A(I,J) = A(I,J-1)
S2:      B(I,J) = B(I-1,J-1)

  == loop interchange ==>

     do J = 1, N
       do I = 1, N
S1:      A(I,J) = A(I,J-1)
S2:      B(I,J) = B(I-1,J-1)

Loop interchange is safe iff it does not create a lexicographically negative direction vector; interchange permutes the vector's elements, so, for example, a dependence with distance vector (1,-1) would become (-1,1), which is illegal.

Benefits:
- may expose parallel loops, increase granularity
- reordering iterations may improve reuse
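
One common reuse-motivated use of interchange (a sketch, not from the slides): Fortran stores arrays column-major, so placing the I loop innermost makes the accesses to A(I,J) stride-1 in memory. The array size below is made up.

  program interchange_locality
    implicit none
    integer, parameter :: N = 1000
    real :: A(N,N)
    integer :: I, J
    A = 0.0
    ! With the J loop innermost, consecutive iterations touch A(I,J) with stride N.
    ! After interchange, shown here, the inner I loop walks down a column with stride 1.
    do J = 1, N
       do I = 1, N
          A(I,J) = A(I,J) + 1.0
       end do
    end do
    print *, A(1,1)
  end program interchange_locality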

Loop Fusion

S1:  do i = 2, N
       A(i) = B(i)
S2:  do i = 2, N
       B(i) = A(i-1)

  == loop fusion ==>

     do i = 2, N
S1:    A(i) = B(i)
S2:    B(i) = A(i-1)

Loop fusion is safe iff no loop-independent dependence between the nests is converted into a backward loop-carried dependence. (Would fusion be safe if S2 referenced A(i+1)?)

Benefits:
- reduces loop overhead
- improves reuse between loop nests
- increases the granularity of parallel loops

Loop Distribution (Fission)

     do i = 2, N
S1:    A(i) = B(i)
S2:    B(i) = A(i-1)

  == loop distribution ==>

S1:  do i = 2, N
       A(i) = B(i)
S2:  do i = 2, N
       B(i) = A(i-1)

Loop distribution is safe iff statements involved in a cycle of true dependences (a recurrence) remain in the same loop, and any dependence between two statements placed in different loops is forward, i.e., runs from the earlier loop to the later one.

Benefits:
- necessary for vectorization
- may enable partial or full parallelization
- may enable other loop transformations
- may reduce register/cache pressure
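
Distribution is what enables vectorization here: each of the resulting loops can be written as a single array assignment. A minimal self-contained sketch (N and the initial values are made up):

  program distributed_vectorized
    implicit none
    integer, parameter :: N = 100
    real :: A(N), B(N)
    A = 1.0
    B = 2.0
    A(2:N) = B(2:N)        ! distributed loop 1 as one vector statement
    B(2:N) = A(1:N-1)      ! distributed loop 2 as one vector statement
    print *, A(N), B(N)
  end program distributed_vectorized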

Data Locality

Why locality? Memory accesses are expensive; we exploit the higher levels of the memory hierarchy by reusing registers, cache lines, TLB entries, etc. Locality of reference enables reuse.

Locality:
- temporal locality: reuse of a specific location
- spatial locality: reuse of adjacent locations (cache lines, pages)

What locality/reuse occurs in this loop nest?

  do i = 1, N
    do j = 1, M
      A(i) = A(i) + B(j)

Strip-Mine and Interchange (Tiling)

  do i = 1, N
    do j = 1, M
      A(i) = A(i) + B(j)

  == strip-mine ==>

  do i = 1, N
    do jj = 1, M, T
      do j = jj, jj+T-1
        A(i) = A(i) + B(j)

  == interchange ==>

  do jj = 1, M, T
    do i = 1, N
      do j = jj, jj+T-1
        A(i) = A(i) + B(j)

Strip mining is always safe; combined with interchange it changes the shape of the iteration space and can exploit reuse for multiple loops.
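
A self-contained version of the final tiled nest (a sketch; N, M, and the tile size T are made-up values, and M is assumed to be a multiple of T so no cleanup loop is needed):

  program tiled_sum
    implicit none
    integer, parameter :: N = 1024, M = 4096, T = 256
    real :: A(N), B(M)
    integer :: i, j, jj
    A = 0.0
    B = 1.0
    do jj = 1, M, T              ! tile loop over strips of B
       do i = 1, N
          do j = jj, jj + T - 1
             A(i) = A(i) + B(j)  ! the strip B(jj:jj+T-1) stays in cache across all i
          end do
       end do
    end do
    print *, A(1)                ! expect M * 1.0
  end program tiled_sum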

Loop Transformations to Improve Reuse

Assumptions (a simple cache architecture): one-word cache lines, LRU replacement policy, fully associative cache, and M > cache size.

Analysis of the original loop nest:
- A(i): N cache misses, one for each outer iteration (cold misses); reuse across the inner iterations.
- B(j): N*M cache misses due to the LRU policy (capacity misses).

Loop Transformations to Improve Reuse

Analysis of the transformed loop nest (after strip-mining and interchange):
- A(i): N*M/T cache misses (conservative).
- B(j): M cache misses; once an element is in cache, it stays there until all computation using it is done.

Comparison: N + N*M misses for the original nest vs. M + N*M/T misses for the tiled nest; choosing the tile size T is a trade-off decision.

Notes:
- strip-mine and interchange is similar to unroll-and-jam
- typically, cache architectures are more complex than assumed here
- scalar replacement is the analogous transformation for registers
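
To see the magnitude of the win, plug in made-up numbers (not from the slides): with N = M = 1000 and tile size T = 100, the original nest takes about N + N*M = 1,001,000 misses, while the tiled nest takes about M + N*M/T = 11,000 misses, roughly a 90x reduction. A larger T reduces the misses on A further, but only as long as a strip of T elements of B (plus the portion of A in use) still fits in the cache; that is the trade-off mentioned above.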

Compiling for Distributed Memory Multiprocessors

Assumptions:
- Data parallelism has been specified via a data layout, for example:
    align A(i) with B(i)
    distribute A(block) on 4 procs
- The compiler generates SPMD code (single program, multiple data).
- The compiler uses the owner-computes rule.

Compiler challenges:
- manage local name spaces (there is no global name space!)
- insert the necessary communication

Goal: minimize the cost of communication while maximizing parallelism.

Compiling for Distributed Memory Multiprocessors: Example

Source program:
  real A(100), B(100)
  do i = 1, 99
    A(i) = A(i+1) + B(i+1)

Block distribution of A and B over 4 processors:
  Proc 0: elements 1-25, Proc 1: 26-50, Proc 2: 51-75, Proc 3: 76-100.

Generated SPMD code (each processor holds 25 owned elements plus a 1-element overlap to the right):

  real A(26), B(26)                 // 1 element overlap to the right
  if my$proc == 3 then
    my$up = 24
  else
    my$up = 25
  endif
  if my$proc > 0 then
    send( A(1), my$proc - 1 )       // send to left neighbor
    send( B(1), my$proc - 1 )       // send to left neighbor
  endif
  if my$proc < 3 then
    receive( A(26), my$proc + 1 )   // receive from right neighbor
    receive( B(26), my$proc + 1 )   // receive from right neighbor
  endif
  do i = 1, my$up
    A(i) = A(i+1) + B(i+1)
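
For comparison, a hand-written MPI rendering of the same SPMD pattern (a sketch under the slide's assumptions of 4 processes, block distribution, and a one-element overlap; the MPI formulation and the placeholder initial values are not part of the original slides):

  program spmd_overlap
    use mpi
    implicit none
    integer :: rank, nprocs, ierr, my_up, i
    integer :: status(MPI_STATUS_SIZE)
    real :: A(26), B(26)
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)   ! assumed to be 4
    A = real(rank)                                      ! placeholder local data
    B = 1.0
    if (rank == nprocs - 1) then
       my_up = 24          ! last processor owns elements 76-100; element 100 is not written
    else
       my_up = 25
    end if
    ! Exchange the overlap element: send my first element to the left neighbor,
    ! receive the right neighbor's first element into local position 26.
    if (rank > 0) then
       call MPI_Send(A(1), 1, MPI_REAL, rank - 1, 0, MPI_COMM_WORLD, ierr)
       call MPI_Send(B(1), 1, MPI_REAL, rank - 1, 1, MPI_COMM_WORLD, ierr)
    end if
    if (rank < nprocs - 1) then
       call MPI_Recv(A(26), 1, MPI_REAL, rank + 1, 0, MPI_COMM_WORLD, status, ierr)
       call MPI_Recv(B(26), 1, MPI_REAL, rank + 1, 1, MPI_COMM_WORLD, status, ierr)
    end if
    do i = 1, my_up
       A(i) = A(i+1) + B(i+1)
    end do
    call MPI_Finalize(ierr)
  end program spmd_overlap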