Enhancing Parallelism

CSC 255/455 Software Analysis and Improvement
Enhancing Parallelism
Instructor: Chen Ding
Chapter 5, Allen and Kennedy
www.cs.rice.edu/~ken/comp515/lectures/

Where Does Vectorization Fail?

   procedure vectorize (L, k, D)
      // L is the maximal loop nest containing the statement.
      // k is the current loop level.
      // D is the dependence graph for statements in L.
      find the set {S1, S2, ..., Sm} of SCCs in D
      construct Lp from L by reducing each Si to a single node
      use topological sort to order the nodes in Lp into {p1, p2, ..., pm}
      for i = 1 to m do begin
         if pi is a dependence cycle then
            generate a level-k DO statement
            construct Di from the dependence edges of pi in D at level k+1 or greater
            codegen (pi, k+1, Di)
            generate the level-k ENDDO statement
         else
            vectorize pi with respect to every loop containing it
      end
   end vectorize

Fine-Grained Parallelism

Techniques to enhance fine-grained parallelism:
   Loop Interchange
   Scalar Expansion
   Scalar Renaming
   Array Renaming

Motivational Example

   DO J = 1, N
      DO I = 1, N
         T = 0.0
         DO K = 1, N
            T = T + A(I,K) * B(K,J)
         ENDDO
         C(I,J) = T
      ENDDO
   ENDDO

Motivational Example II

Scalar expansion of T over the I loop gives us:

   DO J = 1, N
      DO I = 1, N
         T$(I) = 0.0
         DO K = 1, N
            T$(I) = T$(I) + A(I,K) * B(K,J)
         ENDDO
         C(I,J) = T$(I)
      ENDDO
   ENDDO

Loop Distribution then gives us:

   DO J = 1, N
      DO I = 1, N
         T$(I) = 0.0
      ENDDO
      DO I = 1, N
         DO K = 1, N
            T$(I) = T$(I) + A(I,K) * B(K,J)
         ENDDO
      ENDDO
      DO I = 1, N
         C(I,J) = T$(I)
      ENDDO
   ENDDO
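As a sanity check of these first two steps, here is a minimal self-contained program, assuming square N x N matrices with N = 4 and made-up input data (both choices are purely illustrative); TX stands in for the slides' T$, since $ is not a legal identifier character in standard Fortran. The expanded-and-distributed version must compute exactly the same C as the original kernel.

   PROGRAM expand_distribute
      INTEGER, PARAMETER :: N = 4
      REAL :: A(N,N), B(N,N), C(N,N), D(N,N), TX(N)
      REAL :: T
      INTEGER :: I, J, K

      ! Made-up input data.
      DO J = 1, N
         DO I = 1, N
            A(I,J) = I + 0.5*J
            B(I,J) = 2.0*I - J
         END DO
      END DO

      ! Original kernel: a scalar accumulator T.
      DO J = 1, N
         DO I = 1, N
            T = 0.0
            DO K = 1, N
               T = T + A(I,K) * B(K,J)
            END DO
            C(I,J) = T
         END DO
      END DO

      ! After scalar expansion of T into TX(I) and distribution of the I loop,
      ! each statement sits in its own I loop.
      DO J = 1, N
         DO I = 1, N
            TX(I) = 0.0
         END DO
         DO I = 1, N
            DO K = 1, N
               TX(I) = TX(I) + A(I,K) * B(K,J)
            END DO
         END DO
         DO I = 1, N
            D(I,J) = TX(I)
         END DO
      END DO

      PRINT *, 'max difference =', MAXVAL(ABS(C - D))
   END PROGRAM expand_distribute

Because each distributed I loop now holds a single statement (or a simple nest around one), the vectorization shown on the next slide falls out directly.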

Motivational Example III

Finally, interchanging the I and K loops and vectorizing over I, we get:

   DO J = 1, N
      T$(1:N) = 0.0
      DO K = 1, N
         T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
      ENDDO
      C(1:N,J) = T$(1:N)
   ENDDO

A couple of new transformations were used:
   Loop interchange
   Scalar expansion

Loop Interchange

   DO I = 1, N
      DO J = 1, N
         S: A(I,J+1) = A(I,J) + B
      ENDDO
   ENDDO

   DV: (=, <)

Loop interchange is a reordering transformation. Why? Think of each statement instance as being parameterized with its iteration vector: loop interchange merely changes the execution order of these instances. It does not create new instances or delete existing ones.

   DO I = 1, N
      DO J = 1, N
         S: <some statement>
      ENDDO
   ENDDO

If the loops are interchanged, S(2, 1) will execute before S(1, 2).

Loop Interchange: Safety

Not all loop interchanges are safe:

   DO J = 1, N
      DO I = 1, N
         A(I,J+1) = A(I+1,J) + B
      ENDDO
   ENDDO

If we interchange the loops, will we violate a dependence?

A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.
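To make the safety question concrete, the following small experiment runs the second loop nest in both orders and compares the results; N = 5 (so 6 x 6 arrays), zero initialization, and B = 1.0 are illustrative assumptions. Because the dependence here has direction (<, >), interchanging the loops reorders its endpoints and changes the answer.

   PROGRAM interchange_unsafe
      INTEGER, PARAMETER :: N = 5
      REAL, PARAMETER :: B = 1.0
      REAL :: A1(N+1,N+1), A2(N+1,N+1)
      INTEGER :: I, J

      A1 = 0.0
      A2 = 0.0

      ! Original order: J outer, I inner; the dependence has direction (<, >).
      DO J = 1, N
         DO I = 1, N
            A1(I,J+1) = A1(I+1,J) + B
         END DO
      END DO

      ! Interchanged order: I outer, J inner; the read of A(I+1,J) now happens
      ! before the write that should have produced its value.
      DO I = 1, N
         DO J = 1, N
            A2(I,J+1) = A2(I+1,J) + B
         END DO
      END DO

      ! A nonzero difference confirms that the interchange is not safe here.
      PRINT *, 'max difference after interchange =', MAXVAL(ABS(A1 - A2))
   END PROGRAM interchange_unsafe

By contrast, the first example, A(I,J+1) = A(I,J) + B with direction vector (=, <), can be interchanged safely: the direction vector becomes (<, =), which is still lexicographically positive.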

Scalar Expansion

Can we vectorize the following code?

   DO I = 1, N
      T = A(I)
      A(I) = B(I)
      B(I) = T
   ENDDO

However, scalar expansion is not always profitable. Consider:

   DO I = 1, N
      T = T + A(I) + A(I+1)
      A(I) = T
   ENDDO

Scalar expansion gives us:

   T$(0) = T
   DO I = 1, N
      T$(I) = T$(I-1) + A(I) + A(I+1)
      A(I) = T$(I)
   ENDDO
   T = T$(N)

Scalar Expansion: Safety

Scalar expansion is always safe. When is it profitable?
   Naïve approach: expand all scalars, vectorize, then shrink all unnecessary expansions.
   However, we would rather predict when expansion is profitable.

Distinguish dependences due to reuse of a memory location from dependences due to a flow of values:
   Dependences due to flows of values must be preserved.
   Dependences due to reuse of a memory location can be deleted by expansion.

A definition X of a scalar S is a covering definition for loop L if a definition of S placed at the beginning of L reaches no uses of S that occur past X.

A covering definition does not always exist. In SSA terms: there is no covering definition for a variable T if the edge out of the first assignment to T goes to a φ-function later in the loop that merges its value with those from another control-flow path through the loop.

We will therefore consider a collection of covering definitions. There is a collection C of covering definitions for T in a loop if either:
   there is no φ-function at the beginning of the loop that merges versions of T from outside the loop with versions defined in the loop, or
   the φ-function within the loop has no SSA edge to any φ-function, including itself.
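Returning to the first question in this section: for the swap loop, expansion does pay off. Below is a minimal runnable check, assuming N = 8 and made-up data (illustrative choices); TX again stands in for T$.

   PROGRAM expand_swap
      INTEGER, PARAMETER :: N = 8
      REAL :: A(N), B(N), A2(N), B2(N), TX(N)
      REAL :: T
      INTEGER :: I

      DO I = 1, N
         A(I) = I
         B(I) = 10*I
      END DO
      A2 = A
      B2 = B

      ! Original loop: the scalar T creates a dependence cycle between the
      ! statements, so the loop cannot be vectorized as written.
      DO I = 1, N
         T = A(I)
         A(I) = B(I)
         B(I) = T
      END DO

      ! After scalar expansion the cycle is broken and each statement becomes
      ! a vector (array) assignment.
      TX(1:N) = A2(1:N)
      A2(1:N) = B2(1:N)
      B2(1:N) = TX(1:N)
      T = TX(N)          ! preserve the scalar's final value for code after the loop

      PRINT *, 'A matches:', ALL(A == A2), '  B matches:', ALL(B == B2)
   END PROGRAM expand_swap

In the second loop, by contrast, the expanded statement T$(I) = T$(I-1) + A(I) + A(I+1) still carries a true dependence on itself, so expansion alone exposes no parallelism there.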

Remember the loop which had no covering definition:

   DO I = 1, N
      IF (...) THEN
         T = X(I)
      ENDIF
      ... = T
   ENDDO

To form a collection of covering definitions, we can insert dummy assignments:

   DO I = 1, N
      IF (...) THEN
         T = X(I)
      ELSE
         T = T
      ENDIF
      ... = T
   ENDDO

Algorithm to insert dummy assignments and compute the collection C of covering definitions:
   Central idea: look for parallel paths to a φ-function following the first assignment, until no more exist.
   Algorithm: see the textbook.

Deletable Dependences

After inserting the covering definitions as above, scalar expansion gives:

   T$(0) = T
   DO I = 1, N
      IF (...) THEN
         T$(I) = X(I)
      ELSE
         T$(I) = T$(I-1)
      ENDIF
      ... = T$(I)
   ENDDO

Uses of T before the covering definitions are expanded as T$(I-1); all other uses are expanded as T$(I).

The deletable dependences are:
   Backward carried antidependences
   Backward carried output dependences
   Forward carried output dependences
   Loop-independent antidependences into the covering definition
   Loop-carried true dependences from a covering definition

Scalar Expansion: Drawbacks

Expansion increases memory requirements. Solutions:
   Expand in a single loop only
   Strip mine the loop before expansion
   Forward substitution, for example:

      T = A(I) + A(I+1)
      A(I) = T + B(I)

   becomes

      A(I) = A(I) + A(I+1) + B(I)
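The strip-mining remedy can be made concrete with a small sketch: expand the scalar only within a fixed-size strip, so the temporary needs STRIP elements rather than one per iteration. The loop body, N = 10000, and STRIP = 64 below are all illustrative assumptions.

   PROGRAM strip_mine_expand
      INTEGER, PARAMETER :: N = 10000, STRIP = 64
      REAL :: A(N), B(N), C(N), CREF(N), TX(STRIP)
      REAL :: T
      INTEGER :: I, IS, LEN

      DO I = 1, N
         A(I) = 0.001*I
         B(I) = 1.0/I
      END DO

      ! Reference: scalar temporary T, sequential loop.
      DO I = 1, N
         T = A(I) + B(I)
         CREF(I) = T * T
      END DO

      ! Strip-mined loop: expand T only within each strip, so the temporary
      ! array is STRIP elements long instead of N.
      DO IS = 1, N, STRIP
         LEN = MIN(STRIP, N - IS + 1)
         TX(1:LEN) = A(IS:IS+LEN-1) + B(IS:IS+LEN-1)    ! vectorizable
         C(IS:IS+LEN-1) = TX(1:LEN) * TX(1:LEN)         ! vectorizable
      END DO

      PRINT *, 'max difference =', MAXVAL(ABS(C - CREF))
   END PROGRAM strip_mine_expand

TX is reused across strips, so the extra storage no longer grows with the trip count, at the cost of an extra strip loop around the vector statements.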

Scalar and Array Renaming

Scalar Renaming

   DO I = 1, 100
      S1: T = A(I) + B(I)
      S2: C(I) = T + T
      S3: T = D(I) - B(I)
      S4: A(I+1) = T * T
   ENDDO

Renaming scalar T:

   DO I = 1, 100
      S1: T1 = A(I) + B(I)
      S2: C(I) = T1 + T1
      S3: T2 = D(I) - B(I)
      S4: A(I+1) = T2 * T2
   ENDDO

Scalar renaming (followed by expansion and distribution) will lead to:

   S3: T2$(1:100) = D(1:100) - B(1:100)
   S4: A(2:101) = T2$(1:100) * T2$(1:100)
   S1: T1$(1:100) = A(1:100) + B(1:100)
   S2: C(1:100) = T1$(1:100) + T1$(1:100)
       T = T2$(100)

Scalar Renaming

The renaming algorithm partitions all definitions and uses into equivalence classes, each of which can occupy a different memory location. Using the definition-use graph:
   Pick a definition.
   Add all uses that the definition reaches to its equivalence class.
   Add all definitions that reach any of those uses.
   ... repeat until a fixed point is reached.

Scalar Renaming: Profitability

Scalar renaming will break recurrences in which a loop-independent output dependence or antidependence is a critical element of a cycle.
Renaming is relatively cheap to apply; it is usually done by compilers when calculating live ranges for register allocation.

Array Renaming

   DO I = 1, N
      S1: A(I) = A(I-1) + X
      S2: Y(I) = A(I) + Z
      S3: A(I) = B(I) + C
   ENDDO

Dependences: a loop-independent true dependence from S1 to S2, a loop-independent antidependence from S2 to S3, a loop-independent output dependence from S1 to S3, and a loop-carried true dependence from S3 to S1.

Rename A(I) to A$(I) in S1 and S2:

   DO I = 1, N
      S1: A$(I) = A(I-1) + X
      S2: Y(I) = A$(I) + Z
      S3: A(I) = B(I) + C
   ENDDO

Dependences remaining: the loop-independent true dependence from S1 to S2 and the loop-carried true dependence from S3 to S1.

Array Renaming: Profitability

Examining the dependence graph and determining the minimum set of critical edges to break a recurrence is NP-complete!
Solution: determine the edges that are removed by array renaming and analyze their effect on the dependence graph.

procedure array_partition:
   Assumes no control flow in the loop body.
   Identifies collections of references to arrays that refer to the same value.
   Identifies deletable output dependences and antidependences.

Use this procedure to generate code, and minimize the amount of copying back to the original array at the beginning and the end of the loop.
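Here is a runnable version of the scalar renaming example from the start of this section, assuming a trip count of 100 as on the slides, A(1) = 3.0, and made-up B and D (the data values are illustrative); T1X and T2X stand in for T1$ and T2$.

   PROGRAM scalar_rename
      INTEGER, PARAMETER :: N = 100
      REAL :: A(N+1), C(N), A2(N+1), C2(N)
      REAL :: B(N), D(N), T1X(N), T2X(N), T
      INTEGER :: I

      DO I = 1, N
         B(I) = I
         D(I) = 2*I + 1
      END DO
      A = 0.0
      A(1) = 3.0
      A2 = A

      ! Original loop: a single scalar T carries two unrelated live ranges,
      ! which creates a spurious dependence cycle.
      DO I = 1, N
         T = A(I) + B(I)
         C(I) = T + T
         T = D(I) - B(I)
         A(I+1) = T * T
      END DO

      ! After renaming T into T1/T2, expanding both, distributing, and
      ! reordering the statements, every statement is a vector assignment.
      T2X(1:N) = D(1:N) - B(1:N)
      A2(2:N+1) = T2X(1:N) * T2X(1:N)
      T1X(1:N) = A2(1:N) + B(1:N)
      C2(1:N) = T1X(1:N) + T1X(1:N)
      T = T2X(N)              ! final value the original scalar T would have

      PRINT *, 'A matches:', ALL(A == A2), '  C matches:', ALL(C == C2)
   END PROGRAM scalar_rename

Note that the statement order S3, S4, S1, S2 matters: A(2:101) must be written before S1 reads A(1:100), which is exactly the order the dependence graph dictates.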

Node Splitting

Can we vectorize the following loop?

   DO I = 1, N
      S1: A(I) = X(I+1) + X(I)
      S2: X(I+1) = B(I) + 32
   ENDDO

The loop-independent antidependence from S1 to S2 and the loop-carried true dependence from S2 to S1 form a cycle. Break the critical antidependence by making a copy of the node from which the antidependence emanates:

   DO I = 1, N
      S1': X$(I) = X(I+1)
      S1:  A(I) = X$(I) + X(I)
      S2:  X(I+1) = B(I) + 32
   ENDDO

The recurrence is broken, and the loop can be vectorized to:

   X$(1:N) = X(2:N+1)
   X(2:N+1) = B(1:N) + 32
   A(1:N) = X$(1:N) + X(1:N)

Algorithm

Takes a critical, loop-independent antidependence D:
   Add a new assignment x: T$ = source(D)
   Insert x before source(D)
   Replace source(D) with T$
   Make the corresponding changes in the dependence graph

Determining the minimal set of critical antidependences to split is NP-complete, so a perfect job of node splitting is difficult.
Heuristic: select an antidependence, delete it, and test whether the graph becomes acyclic; if it does, apply node splitting to it.

Summary

Enhancing fine-grained parallelism means breaking dependence cycles:
   Pick out the false dependences, those caused by reuse of memory locations rather than by the flow of values; these can be removed by renaming.
   Not all dependence cycles can be made acyclic.
Transformations discussed: loop interchange, scalar expansion, scalar renaming, array renaming, and node splitting.
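A small check of the node-splitting result, assuming N = 16 and made-up B and X (illustrative values); XS stands in for X$.

   PROGRAM node_split
      INTEGER, PARAMETER :: N = 16
      REAL :: A(N), A2(N), B(N), X(N+1), X2(N+1), XS(N)
      INTEGER :: I

      DO I = 1, N
         B(I) = 3*I
      END DO
      DO I = 1, N+1
         X(I) = I*I
      END DO
      X2 = X

      ! Original loop: the antidependence S1 -> S2 on X(I+1) plus the carried
      ! true dependence S2 -> S1 form a cycle, so the loop cannot be vectorized.
      DO I = 1, N
         A(I) = X(I+1) + X(I)          ! S1
         X(I+1) = B(I) + 32            ! S2
      END DO

      ! After node splitting: copy the values read across the antidependence
      ! into XS first, which breaks the cycle; the result vectorizes.
      XS(1:N) = X2(2:N+1)
      X2(2:N+1) = B(1:N) + 32
      A2(1:N) = XS(1:N) + X2(1:N)

      PRINT *, 'A matches:', ALL(A == A2), '  X matches:', ALL(X == X2)
   END PROGRAM node_split

The copy statement reads X(2:N+1) before any element of X is overwritten, which is precisely how the antidependence is satisfied in the vector code.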