CS 293S Parallelism and Dependence Theory

CS 293S Parallelism and Dependence Theory
Yufei Ding
Reference book: Optimizing Compilers for Modern Architectures by Allen & Kennedy
Slides adapted from Louis-Noël Pouche, Mary Hall

End of Moore's Law necessitates parallel computing
The end of Moore's Law necessitates a means of increasing performance beyond simply producing more complex chips. One such method is to employ cheaper and less complex chips in parallel architectures.

Amdahl's Law
If f is the fraction of the code parallelized, and if the parallelized version runs on a p-processor machine with no communication or parallelization overhead, the speedup is

    speedup = 1 / ((1 - f) + f/p)

If f = 50%, what would the maximum speedup be? (As p grows, the speedup approaches 1/(1 - f) = 2.)
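The short C sketch below is not from the slides; it simply evaluates the formula for f = 0.5 and increasing processor counts, showing the speedup approach the 1/(1 - f) = 2 bound.

#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - f) + f/p) */
static double amdahl_speedup(double f, double p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    double f = 0.5;                      /* fraction of the code parallelized */
    int procs[] = {1, 2, 4, 16, 1024};
    for (int i = 0; i < 5; i++)
        printf("p = %4d  speedup = %.3f\n", procs[i], amdahl_speedup(f, procs[i]));
    return 0;                            /* the speedup never exceeds 1/(1-f) = 2 */
}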

Data locality
Temporal locality occurs when the same data is used several times within a short time period. Spatial locality occurs when different data elements that are located near each other are used within a short period of time. Better locality → fewer cache misses. An important form of spatial locality occurs when all the elements that appear on one cache line are used together.
1. Parallelism and data locality are often correlated.
2. The same or similar sets of techniques are used for exploiting parallelism and for maximizing data locality.

Data locality
Kernels can often be written in many semantically equivalent ways, but with widely varying data locality and performance.

for (i=1; i<n; i++)
    C[i] = A[i] - B[i];
for (i=1; i<n; i++)
    C[i] = C[i] * C[i];

for (i=1; i<n; i++) {
    C[i] = A[i] - B[i];
    C[i] = C[i] * C[i];
}

The fused version reuses C[i] while it is still in a register or cache, giving better temporal locality.

Data locality
Kernels can often be written in many semantically equivalent ways, but with widely varying data locality and performance.

for (j=1; j<n; j++)
    for (i=1; i<n; i++)
        A[i, j] = 0;
(a) Zeroing an array column-by-column.

for (i=1; i<n; i++)
    for (j=1; j<n; j++)
        A[i, j] = 0;
(b) Zeroing an array row-by-row.

b = ceil(n/M);
for (i = b*p; i < min(n, b*(p+1)); i++)
    for (j=1; j<n; j++)
        A[i, j] = 0;
(c) Zeroing an array row-by-row in parallel: processor p of M executes this block of rows.

How to get efficient parallel programs?
Programmer: writing correct and efficient sequential programs is not easy; writing parallel programs that are correct and efficient is even harder (data locality, data dependences), and debugging is hard.
Compiler? Correctness vs. efficiency.
Simplifying assumption: no pointers and no pointer arithmetic.
Affine programs: affine loops + affine array accesses.

Affine Array Accesses
Common patterns of data accesses (i, j, k are loop indexes): A[i], A[j], A[i-1], A[0], A[i+j], A[2*i], A[2*i+1], A[i,j], A[i-1, j+1]
Array indexes are affine expressions of the surrounding loop indexes:
    Loop indexes: i_n, i_(n-1), ..., i_1
    Integer constants: c_n, c_(n-1), ..., c_0
    Array index: c_n*i_n + c_(n-1)*i_(n-1) + ... + c_1*i_1 + c_0
An affine expression is a linear expression plus a constant term (c_0).
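To make the distinction concrete, here is a small C sketch (my illustration, not from the slides); the function name and the arrays A and idx are assumptions used only for the example, and the arrays are assumed large enough.

void affine_examples(int n, const int A[], const int idx[]) {
    int x = 0, y = 0, z = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            x = A[2*i + 3*j + 1];   /* affine: coefficients (2, 3), constant term 1 */
            y = A[i*j];             /* not affine: product of two loop indexes */
            z = A[idx[i]];          /* not affine: indirect access through another array */
        }
    }
    (void)x; (void)y; (void)z;      /* silence unused-value warnings */
}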

Affine loops
All loop bounds and contained control conditions have to be expressible as linear affine expressions in the containing loop index variables.
Affine array accesses
No pointers, and no possible aliasing (e.g., overlap of two arrays) between statically distinct base addresses.

Loop/Array Parallelism

for (i=1; i<n; i++)
    C[i] = A[i] + B[i];

The loop is parallelizable because each iteration accesses a different set of data. We can execute the loop on a computer with M processors by giving each processor a unique ID p = 0, 1, ..., M-1 and having each processor execute the same code: C[p] = A[p] + B[p];
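A minimal SPMD-style sketch of the same idea for arbitrary n, assuming some runtime hands each of the M workers its ID p (the function name and parameters are illustrative, not from the slides); each worker computes only its own block of iterations, as in version (c) of the array-zeroing example.

/* Executed by every processor p = 0, 1, ..., M-1. */
void vector_add_block(double C[], const double A[], const double B[],
                      int n, int p, int M) {
    int b  = (n + M - 1) / M;      /* b = ceil(n/M): block size per processor */
    int lo = b * p;                /* this processor's first iteration */
    int hi = b * (p + 1);          /* one past its last iteration */
    if (lo < 1) lo = 1;            /* the original loop starts at i = 1 */
    if (hi > n) hi = n;
    for (int i = lo; i < hi; i++)
        C[i] = A[i] + B[i];
}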

Parallelism & Dependence

for (i=1; i<n; i++)
    A[i] = A[i-1] + B[i];

A[1] = A[0] + B[1];
A[2] = A[1] + B[2];
A[3] = A[2] + B[3];
...
Each iteration reads the value written by the previous iteration, so the iterations cannot run in parallel.

Focus of this lecture
Data dependence: true, anti-, and output dependence
Source and sink
Distance vector, direction vector
Relation between reordering transformations and direction vectors
Loop-carried dependence
Loop-independent dependence
Dependence graph

Dependence Concepts
Assume statement S2 depends on statement S1.
1. True dependence (RAW hazard): read after write. Denoted by S1 δ S2
2. Antidependence (WAR hazard): write after read. Denoted by S1 δ^-1 S2
3. Output dependence (WAW hazard): write after write. Denoted by S1 δ^0 S2
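A tiny straight-line C example (my illustration, not from the slides) showing all three kinds of dependence among scalar statements:

void dependence_kinds(int b, int c) {
    int a, d, e;
    a = b + c;     /* S1 */
    d = a * 2;     /* S2: true dependence S1 δ S2 (S2 reads a after S1 writes it)      */
    a = c + 2;     /* S3: antidependence S2 δ^-1 S3 (S3 writes a after S2 reads it)    */
                   /*     and output dependence S1 δ^0 S3 (S1 and S3 both write a)     */
    e = d + a;     /* S4: true dependences S2 δ S4 (through d) and S3 δ S4 (through a) */
    (void)e;
}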

Dependence Concepts: Source and Sink
Source: the statement (instance) executed earlier.
Sink: the statement (instance) executed later.
Graphically, a dependence is an edge from source to sink.

S1  PI = 3.14
S2  R = 5.0
S3  AREA = PI * R ** 2

S1 and S2 are the sources; S3 is the sink.

Dependence in Loops
Let us look at two different loops:

      DO I = 1, N
S1      A(I+1) = A(I) + B(I)

      DO I = 1, N
S1      A(I+2) = A(I) + B(I)

In both cases, statement S1 depends on itself. However, there is a significant difference, and we need a formalism to describe and distinguish such dependences.

Data Dependence Analysis
Objective: compute the set of statement instances which are dependent.
Possible approaches:
- Distance vector: compute an indicator of the distance between two dependent iterations.
- Dependence polyhedron: compute the list of sets of dependent instances, with a set of dependence polyhedra for each pair of statements.

Program Abstraction Level
Statement:
    for (i = 1; i <= 10; i++)
        A[i] = A[i-1] + 1;
Instance of a statement:
    A[4] = A[3] + 1

Iteration Domain: Iteration Vector
An n-level loop nest can be represented as an n-entry vector, each component corresponding to one loop level's iterator:

for (x_1 = L_1; x_1 < U_1; x_1++)
  for (x_2 = L_2; x_2 < U_2; x_2++)
    ...
      for (x_n = L_n; x_n < U_n; x_n++)
        <some statement S1>

The iteration vector (2, 1, ...) denotes the instance of S1 executed during the 2nd iteration of the x_1 loop and the 1st iteration of the x_2 loop.

Iteration Domain
Dimension of the iteration domain: decided by the loop nesting level.
Bounds of the iteration domain: decided by the loop bounds, expressed using inequalities.

for (i=1; i<=n; i++)
    for (j=1; j<=n; j++)
        if (i <= n+2-j)
            b[j] = b[j] + a[i];

Here the iteration domain is the set of integer points (i, j) with 1 <= i <= n, 1 <= j <= n, and i + j <= n + 2.

Modeling Iteration Domains
Representing the iteration bounds as an affine function: the iteration domain is the set of integer vectors x satisfying A·x + b >= 0, where the integer matrix A and vector b are derived from the loop bounds and guards.
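The slide's figure is not reproduced in this transcription; as a reconstruction (my own formulation), the triangular domain of the previous example can be written in matrix form as

\begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \\ -1 & -1 \end{pmatrix}
\begin{pmatrix} i \\ j \end{pmatrix}
+
\begin{pmatrix} -1 \\ n \\ -1 \\ n \\ n+2 \end{pmatrix}
\ge \mathbf{0}

which encodes exactly 1 <= i <= n, 1 <= j <= n, and i + j <= n + 2.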

Loop Normalization
Algorithm: replace the loop bounds and step,

for (i = L; i <= U; i = i + S)   →   for (i' = 1; i' <= (U-L+S)/S; i' = i' + 1)

and replace each reference to the original loop variable i with: i'*S - S + L

Examples: Loop Normalization

for (i=4; i<=n; i+=6)
    for (j=0; j<=n; j+=2)
        A[i] = 0;

becomes

for (ii=1; ii<=(n+2)/6; ii++)
    for (jj=1; jj<=(n+2)/2; jj++) {
        i = ii*6 - 6 + 4;
        j = jj*2 - 2;
        A[i] = 0;
    }

Distance/Direction Vectors
The distance vector is a vector d(i,j) such that d(i,j)_k = j_k - i_k, i.e., the difference between the iteration vectors of sink and source (sink - source!).
The direction vector is a vector D(i,j) such that D(i,j)_k is
    "<" if d(i,j)_k > 0,
    ">" if d(i,j)_k < 0,
    "=" otherwise.
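A small helper, written for illustration and not taken from the slides, that computes the distance and direction vectors from a source and a sink iteration vector:

#include <stdio.h>

/* Distance vector: sink - source, component by component.
   Direction vector: '<' if the distance component is > 0,
                     '>' if it is < 0, and '=' if it is 0.  */
void distance_and_direction(const int source[], const int sink[],
                            int depth, int dist[], char dir[]) {
    for (int k = 0; k < depth; k++) {
        dist[k] = sink[k] - source[k];
        dir[k]  = (dist[k] > 0) ? '<' : (dist[k] < 0) ? '>' : '=';
    }
}

int main(void) {
    /* Example 2 below: write A(I+1, J, K-1) at (i, j, k), read A(I, J, K) at (i+1, j, k-1). */
    int source[3] = {1, 1, 2};
    int sink[3]   = {2, 1, 1};
    int dist[3];  char dir[3];
    distance_and_direction(source, sink, 3, dist, dir);
    printf("distance = (%d, %d, %d), direction = (%c, %c, %c)\n",
           dist[0], dist[1], dist[2], dir[0], dir[1], dir[2]);   /* (1, 0, -1), (<, =, >) */
    return 0;
}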

Example 1:

      DO I = 1, N
S1      A(I+1) = A(I) + B(I)

Dependence distance vector of the true dependence: the source is the write A(I+1); the sink is the read A(I).
Consider a memory location A(x):
    iteration vector of the source: (x-1)
    iteration vector of the sink: (x)
Distance vector: (x) - (x-1) = (1)
Direction vector: (<)

Example 2:

      DO I = 1, N
        DO J = 1, M
          DO K = 1, L
S1          A(I+1, J, K-1) = A(I, J, K) + 10

What is the distance vector of the true dependence?
What is the distance vector of the antidependence?

Example 2:

      DO I = 1, N
        DO J = 1, M
          DO K = 1, L
S1          A(I+1, J, K-1) = A(I, J, K) + 10

For the true dependence:
    Distance vector: (1, 0, -1)
    Direction vector: (<, =, >)
For the antidependence:
    Distance vector: (-1, 0, 1)
    Direction vector: (>, =, <)
    The sink would happen before the source: the assumed antidependence is invalid!

Example 3:

      DO K = 1, L
        DO J = 1, M
          DO I = 1, N
S1          A(I+1, J, K-1) = A(I, J, K) + 10

What is the distance vector of the true dependence?
What is the distance vector of the antidependence?

Example 3:

      DO K = 1, L
        DO J = 1, M
          DO I = 1, N
S1          A(I+1, J, K-1) = A(I, J, K) + 10

For the true dependence:
    Distance vector: (-1, 0, 1)
    Direction vector: (>, =, <)
For the antidependence:
    Distance vector: (1, 0, -1)
    Direction vector: (<, =, >)
The assumed true dependence is invalid!

Example 2:
      DO I = 1, N
        DO J = 1, M
          DO K = 1, L
S1          A(I+1, J, K-1) = A(I, J, K) + 10

Example 3:
      DO K = 1, L
        DO J = 1, M
          DO I = 1, N
S1          A(I+1, J, K-1) = A(I, J, K) + 10

The true dependence turns into an antidependence: write-then-read turns into read-then-write. This is reflected in the direction vector of the true dependence: (<, =, >) turns into (>, =, <).

Example 4:

      DO J = 1, M
        DO I = 1, N
          DO K = 1, L
S1          A(I+1, J, K-1) = A(I, J, K) + 10

What is the distance vector of the true dependence?
What is the distance vector of the antidependence?
Is this program equivalent to Example 2?

Example 2:
      DO I = 1, N
        DO J = 1, M
          DO K = 1, L
S1          A(I+1, J, K-1) = A(I, J, K) + 10

Example 4:
      DO J = 1, M
        DO I = 1, N
          DO K = 1, L
S1          A(I+1, J, K-1) = A(I, J, K) + 10

Consider the true dependence:
    Example 2: distance vector (1, 0, -1), direction vector (<, =, >); the source is the write, the sink is the read.
    Example 4: distance vector (0, 1, -1), direction vector (=, <, >); the source is still the write, the sink is still the read.
So it is still a true dependence: write-then-read stays write-then-read. This is reflected in the direction vector of the true dependence: (<, =, >) turns into (=, <, >).

Reordering Transformations
Definition: a reordering transformation merely changes the order of execution of the code; it adds or deletes nothing.
A reordering transformation does not eliminate dependences. However, it can change the execution order of the original sink and source, causing incorrect behavior.

Any reordering transformation that preserves every dependence in a program preserves the meaning of that program.
---- Fundamental Theorem of Dependence

Theorem of Loop Reordering (Direction Vector Transformation)
Let T be a reordering transformation that is applied to a loop nest and that does not rearrange the statements in the body of the loop. Then the transformation is valid if, after it is applied, none of the direction vectors for dependences with source and sink in the original nest has a leftmost non-'=' component that is '>'.
This follows from the Fundamental Theorem of Dependence:
    All dependences still exist.
    None of the dependences have been reversed.

Procedure to Check the Validity of a Loop Reordering
1. List the direction vectors of all data dependences (of all types) in the original program.
2. According to the new order of the loops, permute the elements of the direction vectors to derive the new direction vectors.
3. If every direction vector has '<' as its first non-'=' entry, the transformation is valid.
An all-'=' vector stays an all-'=' vector; it does not affect the correctness of the loop reordering.
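This procedure is mechanical enough to code directly. Below is a small C sketch of it (an illustration under my own naming, not from the slides): given the direction vectors of the original nest and a permutation describing the new loop order, it permutes each vector and checks that the leftmost non-'=' entry is '<'.

#include <stdbool.h>
#include <stdio.h>

/* dirs: ndeps direction vectors over a depth-deep loop nest, entries '<', '=', '>'.
   perm: perm[k] is the original loop level that ends up at position k in the new nest. */
bool reordering_is_valid(int ndeps, int depth,
                         const char dirs[][8], const int perm[]) {
    for (int d = 0; d < ndeps; d++) {
        for (int k = 0; k < depth; k++) {
            char c = dirs[d][perm[k]];   /* entry k of the permuted direction vector */
            if (c == '<') break;         /* leftmost non-'=' is '<': dependence preserved */
            if (c == '>') return false;  /* leftmost non-'=' is '>': dependence reversed */
            /* c == '=': keep scanning to the right */
        }
    }
    return true;
}

int main(void) {
    /* True-dependence direction vectors for A(I+1, J) = A(I, 5) + B (see the
       interchange example near the end of the lecture): (<,<), (<,=), (<,>). */
    const char dirs[3][8] = { "<<", "<=", "<>" };
    int interchange[2] = { 1, 0 };       /* swap the I and J loops */
    printf("interchange valid? %s\n",
           reordering_is_valid(3, 2, dirs, interchange) ? "yes" : "no");  /* prints "no" */
    return 0;
}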

Example

      DO H = 1, 10
        DO I = 1, 10
          DO J = 1, 10
            DO K = 1, 10
S             A(H, I+1, J-2, K+3) = A(H, I, J, K) + B

Can the I and J loops be interchanged?

      DO H = 1, 10
        DO J = 1, 10
          DO I = 1, 10
            DO K = 1, 10
S             A(H, I+1, J-2, K+3) = A(H, I, J, K) + B

(The true dependence has direction vector (=, <, >, <); after the interchange it becomes (=, >, <, <), whose leftmost non-'=' entry is '>', so the interchange is not valid.)

Loop-Carried and Loop-Independent Dependences
If statement S2 depends on statement S1 inside a loop, there are two possible ways for the dependence to occur:
    The source and sink occur on different iterations: this is called a loop-carried dependence.
    S1 and S2 execute on the same iteration: this is called a loop-independent dependence.

Loop-Carried Dependence
Example:
      DO I = 1, N
S1      A(I+1) = F(I)
S2      F(I+1) = A(I)

Loop-Carried Dependence
Dependence level: the level of a loop-carried dependence is the index of the leftmost non-'=' entry of D(i,j) for the dependence. For instance:

      DO I = 1, 10
        DO J = 1, 10
          DO K = 1, 10
S1          A(I, J, K+1) = A(I, J, K)

The direction vector for S1 is (=, =, <), so the level of the dependence is 3.
A level-k true dependence between S1 and S2 is denoted by S1 δ_k S2.
The iterations of a loop can be executed in parallel if the loop carries no dependences.
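Because the dependence above is carried only at level 3 (the K loop), the I and J loops carry no dependences and their iterations can run in parallel; only K must stay sequential. A sketch of this using an OpenMP annotation (my illustration, not part of the slides):

void update(double A[11][11][12]) {
    /* The (=, =, <) dependence is carried by the K loop only, so the
       I and J iterations are independent and can be distributed across threads. */
    #pragma omp parallel for collapse(2)
    for (int i = 1; i <= 10; i++)
        for (int j = 1; j <= 10; j++)
            for (int k = 1; k <= 10; k++)   /* the K loop must remain sequential */
                A[i][j][k+1] = A[i][j][k];
}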

Loop-Independent Dependences
Example:
      DO I = 1, 10
S1      A(I) = ...
S2      ... = A(I)

A more complicated example:
      DO I = 1, 9
S1      A(I) = ...
S2      ... = A(10-I)

(In the second loop, S1 and S2 touch the same location in the same iteration only when I = 5; the other matching accesses give loop-carried dependences.)

Loop-Independent Dependences
Theorem 2.5. If there is a loop-independent dependence from S1 to S2, any reordering transformation that does not move statement instances between iterations and that preserves the relative order of S1 and S2 in the loop body preserves that dependence.
A loop-independent true dependence of S2 on S1 is denoted by S1 δ_∞ S2.
The direction vector has entries that are all '=' for a loop-independent dependence.

Is the reordering legal?

      DO I = 1, 100
        DO J = 1, 100
          A(I+1, J) = A(I, 5) + B
Direction vectors: (<, <), (<, =), (<, >)

      DO J = 1, 100
        DO I = 1, 100
          A(I+1, J) = A(I, 5) + B
Direction vectors after the interchange: (<, <), (=, <), (>, <)

The vector (>, <) has '>' as its leftmost non-'=' entry, so the interchange is not legal.

Finding out dependences

      DO I = 1, 100
S1      D(I) = A(5, I)
        DO J = 1, 100
S2        A(J, I-1) = B(I) + C

Only consider the common loops! From S1 to S2: (<), a level-1 antidependence.

      DO I = 1, 100
S1      D(I) = A(102, I)
        DO J = 1, 100
S2        A(J, I-1) = B(I) + C

No dependence: the first subscript of the write, J, never reaches 102, so S1 and S2 never access the same element.
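For small constant loop bounds like these, one can confirm the result by brute-force enumeration of all statement instances. The C sketch below (an illustration, not how a real dependence test works) checks the first loop nest: S1 at iteration i reads A(5, i), and S2 at iteration (i2, j2) writes A(j2, i2-1).

#include <stdio.h>

int main(void) {
    int found = 0;
    for (int i = 1; i <= 100; i++)              /* S1's iteration of the I loop */
        for (int i2 = 1; i2 <= 100; i2++)       /* S2's iteration of the I loop */
            for (int j2 = 1; j2 <= 100; j2++)   /* S2's iteration of the J loop */
                if (j2 == 5 && i2 - 1 == i && !found) {
                    /* Same element A(5, i); S1's read (iteration i) precedes S2's
                       write (iteration i2 = i + 1), so it is an antidependence
                       carried by the I loop with distance i2 - i = 1. */
                    printf("antidependence S1 -> S2, I-distance %d\n", i2 - i);
                    found = 1;
                }
    if (!found) printf("no dependence\n");
    return 0;
}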

Dependence Graph

      DO I = 1, 100
S1      D(I) = A(5, I)
        DO J = 1, 100
S2        A(J, I-1) = B(I) + C

From S1 to S2: (<), a level-1 antidependence.

The dependence graph has nodes for statements and edges for data dependences; the labels on the edges give the dependence levels and types. Here it has the single edge S1 → S2 labeled δ_1^-1 (level-1 antidependence).

      DO I = 1, 100
S1      X(I) = Y(I) + 10
        DO J = 1, 100
S2        B(J) = A(J, N)
          DO K = 1, 100
S3          A(J+1, K) = B(J) + C(J, K)
S4        Y(I+J) = A(J+1, N)