Parallelization Techniques


The parallelization techniques for loops normally follow three steps:

1. Perform a data dependence test to detect potential parallelism. These tests may be performed over the statements in a loop, over consecutive iterations of the loop, or even over more than one loop of the original program.

2. Restructure the loop into one of the possible forms that represent total, partial or no parallelism: DOALL, DOACROSS or DOSEQ. We may also want to restructure the internal layout of the statements to find an order more suitable for other transformations, for example vectorization. Using different transformations we can obtain the greatest degree of parallelism in a program.

3. Generate parallel code for a particular computer and/or architecture by scheduling the iterations on specific processors, and then synthesizing a convenient mechanism for achieving parallelism in a shared-memory system or in a distributed-memory system.

Dependence Graph

A dependence graph is a precedence graph where nodes are statements and arcs are dependences: a directed graph G(V, E), where V = {S1, S2, ..., Sn} is a set of nodes corresponding to statements in a program, and E = {eij = (Si, Sj) | Si, Sj in V} is a set of arcs representing data dependences between statements.

    S1:   X = Y + 1
    L1:   DO I = 2, 20
    S2:     C(I) = X + B(I)
    S3:     A(I) = C(I - 1) + Z
    S4:     C(I + 1) = B(I) + A(I)
    L2:     DO J = 2, 20
    S5:       F(I, J) = F(I, J - 1) + X
    S6:   Z = Y + 2

Dependence Distance and Distance Vector

Suppose a statement S is inside a nested loop L. Let the first instance of S occur when the loop index is I1, and the second instance occur when the index is I2, where iteration I1 is the source of a dependence and I2 is the sink of the dependence relation. The dependence distance is I2 - I1 for this dependence.

    L1:   DO I = 1, N
    L2:     DO J = 1, N
    S1:       A(I, J) = B(I, J) + C(I, J)
    S2:       B(I, J + 1) = A(I, J) + B(I, J)

The distances for the above example are 0 and 1 for the I and J loops, respectively.
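To make the distance vector concrete, here is a small brute-force check written as a C sketch (our own; the bound N = 6 and all names are hypothetical). It enumerates pairs of iterations of the loop nest above, finds the pairs where S2's write of B(I, J+1) matches S1's read of B(I, J), and prints the resulting distance vector.

    #include <stdio.h>

    #define N 6  /* hypothetical loop bound, just for illustration */

    /* Brute-force dependence test for:
     *   S1: A(I,J) = B(I,J) + C(I,J)
     *   S2: B(I,J+1) = A(I,J) + B(I,J)
     * S2 writes B(i1, j1+1) and S1 reads B(i2, j2); a flow dependence
     * exists when the subscripts match (which here already implies that
     * iteration (i1,j1) precedes (i2,j2)). */
    int main(void) {
        for (int i1 = 1; i1 <= N; i1++)
            for (int j1 = 1; j1 <= N; j1++)
                for (int i2 = 1; i2 <= N; i2++)
                    for (int j2 = 1; j2 <= N; j2++)
                        if (i1 == i2 && j1 + 1 == j2) {
                            /* every matching pair yields the same vector (0, 1) */
                            printf("distance vector: (%d, %d)\n", i2 - i1, j2 - j1);
                            return 0;
                        }
        return 0;
    }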

Vectorization

The aim of vectorization is the automatic transformation of a sequential structure into code suitable for vector machines. To do this, the compiler must check all the dependences existing inside the loop. In the simplest case, when no dependences exist, the compiler can distribute the loop around each statement of the loop and create a vector statement for each one.

          DO I = 1, N
    S1:     A(I) = B(I) * C(I)
    S2:     D(I) = B(I) * K

    MODIFIED LOOP
    S1:   A(1:N) = B(1:N) * C(1:N)
    S2:   D(1:N) = K * B(1:N)

Vectorization of the simple loop (without data dependences).

Vectorization

          DO I = 1, N
    S1:     A(I+1) = B(I-1) + C(I)
    S2:     B(I) = A(I) * K
    S3:     C(I) = B(I) - 1

Program and corresponding dependence graph. [Graph: S1 and S2 are joined by flow (t) and anti (a) arcs in both directions; S3 is reached by arcs in one direction only.]

A simple analysis of the graph shows that statements S1 and S2 are strongly connected, because data dependences exist in more than one direction, and they cannot be vectorized. Statement S3, on the other hand, can be vectorized, because that condition does not hold between S3 and the other statements.

          DO I = 1, N
    S1:     A(I+1) = B(I-1) + C(I)
    S2:     B(I) = A(I) * K

    S3:   C(1:N) = B(1:N) - 1

Program after the vectorization transformation.

Vectorization (1/2) - Loop Reordering

          DO I = 1, 100
    S1:     D(I) = A(I-1) * D(I)
    S2:     A(I) = B(I) + C(I)

Program and corresponding dependence graph. After the statement reordering transformation, the program and the dependence graph are as follows:

          DO I = 1, 100
    S2:     A(I) = B(I) + C(I)
    S1:     D(I) = A(I-1) * D(I)

Modified program and corresponding dependence graph. Note that a dependence relation still exists between S2 and S1, but it is now in one direction only (from S2 to S1), which is what makes the next transformation legal.

Vectorization (2/2) - Loop Reordering

The next step of the transformation is loop distribution:

    L2:   DO I = 1, 100
    S2:     A(I) = B(I) + C(I)
    L3:   DO I = 1, 100
    S1:     D(I) = A(I-1) * D(I)

Now there are no dependences within L2 or L3, and vectorization can be done:

    A(1:100) = B(1:100) + C(1:100)
    D(1:100) = A(0:99) * D(1:100)

Program after vectorization.

Loop Fusion (also known as Loop Jamming) (1/2)

Loop fusion merges two separate loops into a single one. A data dependence test must be performed between the statements inside the two loops, to ensure that no dependence relation is being violated or created by the fusion of the loops.

    L1:   DO I = 1, N
    S1:     A(I) = B(I) + C(I+1)
    L2:   DO I = 1, N
    S2:     C(I) = A(I+2)

    FUSED LOOP
          DO I = 1, N
    S1:     A(I) = B(I) + C(I+1)
    S2:     C(I) = A(I+2)

Two original loops that can NOT be fused into one loop: in the original program S2 reads values of A produced by the whole first loop, but in the fused loop S2 would read A(I+2) two iterations before S1 computes it, changing the semantics.

Loop Fusion (also known as Loop Jamming) (2/2)

    L1:   DOALL I = 1, N
    S1:     D(I) = E(I) + F(I) + X(I)
    L2:   DOALL J = 1, N
    S2:     E(J) = D(J) * F(J)

    FUSED LOOP
    L1:   DOALL I = 1, N
    S1:     D(I) = E(I) + F(I) + X(I)
    S2:     E(I) = D(I) * F(I)

Two original loops that CAN be fused into one loop; a C sketch of this case follows.
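Where the slides use abstract DOALL loops, the legal fusion case can be rendered in C; the sketch below is our own (the function and array names are hypothetical, and the OpenMP pragma stands in for DOALL). S2 only consumes the D(I) produced by S1 in the same iteration, so fusing keeps every dependence inside one iteration and the loop remains fully parallel.

    /* Sketch of the legal fusion case, assuming arrays of length n.
     * The OpenMP pragma plays the role of the slides' DOALL. */
    void fused_doall(long n, double *D, double *E,
                     const double *F, const double *X) {
        #pragma omp parallel for
        for (long i = 0; i < n; i++) {
            D[i] = E[i] + F[i] + X[i];  /* S1 */
            E[i] = D[i] * F[i];         /* S2: uses D[i] from this iteration */
        }
    }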

Loop Distribution

The idea of this transformation is to distribute, or separate, a complete loop around each statement in its body, or around modules inside the loop. In general, this distribution of statements is legal if there is no data dependence between a pair of statements, or if the data dependences go in one direction only.

In the example, a loop with three statements is analyzed and some dependences are found; the distribution is made taking into account the dependences found in this previous phase. In the original loop there is a flow dependence from S2 to S1: in each iteration, the value used as B(I-1) in S1 is the value calculated in the previous iteration by S2. There is also a flow dependence from S1 to S2, because the value of A(I) used in S2 was computed by the assignment to A(I+1) in S1 during the previous iteration. An antidependence exists between S1 and S3, and between S2 and S3 there is a flow dependence. These two latter dependences are directed towards S3, and in one direction only; thus, we can make the separation.

    DOSEQ I = 1, N
    S1:   A(I + 1) = B(I - 1) + C(I)
    S2:   B(I) = A(I) * K
    S3:   C(I) = B(I) - 1
    END DO

    TRANSFORMED LOOP
    DOSEQ I = 1, N
    S1:   A(I + 1) = B(I - 1) + C(I)
    S2:   B(I) = A(I) * K
    END DO
    DOALL I = 1, N
    S3:   C(I) = B(I) - 1

A loop with three statements inside, and the program after the transformation.

Loop Interchange (1/3)

The loop interchange of two nested loops is a permutation of the loop statements so that the outer loop becomes the inner loop and vice versa. Naturally, the transformation can be applied repeatedly to interchange more than two loops when the program is composed of a set of nested loops. When interchange is used for parallelization on a non-vector machine, in contrast, the most suitable strategy is to bring the parallelizable loop to the outermost position, to achieve maximum parallelism. The rule in this case, contrary to the previous one, is that the further out we bring the parallel loop, the more iterations can be launched in parallel.

    L1:   DOALL I = 2, N
    L2:     DOSEQ J = 2, M
    S1:       A(I, J) = A(I, J-1) + 1

    TRANSFORMED LOOP
    L1:   DOSEQ J = 2, M
    L2:     DOALL I = 2, N
    S1:       A(I, J) = A(I, J-1) + 1

Loop Interchange (2/3)

The original loop is not vectorizable, since the innermost loop must be executed serially, but the outer loop is a parallel one. Interchanging makes vectorization possible:

    LOOP AFTER VECTORIZATION
    L1:   DOSEQ J = 2, M
    VS:     A(2:N, J) = A(2:N, J - 1) + 1

The inner loop is transformed into one vector statement; a C rendering of this interchange follows.
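In C (our own rendering, assuming a row-major array a[N][M] with hypothetical extents), the interchanged form of the example looks as follows. The J loop carries the dependence, so it stays sequential on the outside, while the dependence-free I loop moves inside, where a compiler can vectorize it.

    #define N 1024  /* hypothetical extents for this sketch */
    #define M 1024
    static double a[N][M];

    /* Interchanged loop nest: J (which carries a(i,j) <- a(i,j-1)) is the
     * sequential outer loop; the inner I loop has no carried dependence
     * and corresponds to the vector statement A(2:N, J) = A(2:N, J-1) + 1. */
    void interchanged(void) {
        for (int j = 1; j < M; j++)        /* DOSEQ J */
            for (int i = 1; i < N; i++)    /* DOALL I: vectorizable */
                a[i][j] = a[i][j - 1] + 1.0;
    }

Note that in row-major C the inner i loop strides by M elements; a column-major (Fortran) layout would give unit stride, but the legality argument for the interchange is the same.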

Loop Interchange (3/3)

This example shows how to use loop interchange to achieve maximum parallelism. The interchange here is done to place the DOALL loop in the outermost position, so that everything inside this loop can be launched in parallel. Bringing this loop to the outermost position increases the grain of what is executed in parallel: the further out we bring the parallel loop, the more statements and loops lie inside it, and the more code can be executed in parallel.

    L1:   DOSEQ I = 2, N
    L2:     DOALL J = 1, N
    S1:       A(I,J) = A(I-1,J) + B(I)

    TRANSFORMED LOOP
    L1:   DOALL J = 1, N
    L2:     DOSEQ I = 2, N
    S1:       A(I,J) = A(I-1,J) + B(I)

Node Splitting - Loop Partitioning

Two ideas are important when considering the partitioning of a loop. One possibility is to split the statements that form a loop into parts, to eliminate some kind of data dependence existing between them. The other idea consists in simply partitioning the statements of a loop to convert the problem into independent smaller problems.

    L1:   DOSEQ I = 1, N
    S1:     A(I) = B(I) + C(I)
    S2:     D(I) = A(I-1) * A(I+1)

A loop with two statements inside, where a dependence cycle exists: in every iteration A is updated, and the value is used in the next iteration. The distance of the dependence is 1.

    INTRODUCING TEMPORARY VARIABLE AND RENAMING
    L1:   DOSEQ I = 1, N
    S3:     TEMP(I) = A(I+1)
    S1:     A(I) = B(I) + C(I)
    S2:     D(I) = A(I-1) * TEMP(I)

    REORDERING STATEMENTS
    L1:   DOSEQ I = 1, N
    S1:     A(I) = B(I) + C(I)
    S3:     TEMP(I) = A(I+1)
    S2:     D(I) = A(I-1) * TEMP(I)

Node Splitting - Loop Partitioning

The two steps of the transformation: in the first step a new variable is introduced and a renaming is done; in the second step a reordering is performed. After these steps we arrive at a form suitable for performing the distribution of the loop. The distribution creates three DOALL loops, that is, three loops that can be executed totally in parallel; a C sketch of the result follows.

    AFTER DISTRIBUTION
    L1:   DOALL I = 1, N
    S3:     TEMP(I) = A(I+1)
    L2:   DOALL I = 1, N
    S1:     A(I) = B(I) + C(I)
    L3:   DOALL I = 1, N
    S2:     D(I) = A(I-1) * TEMP(I)
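Rendered in C (a sketch under our own naming; the arrays are assumed to be allocated with valid indices 0 .. n+1), the distributed result looks like this. Copying the old A(I+1) into TEMP before A is overwritten breaks the antidependence, so each of the three loops is free of carried dependences:

    /* Node splitting, after distribution: three dependence-free loops.
     * Arrays are assumed indexed 0 .. n+1 so A[i-1] and A[i+1] are valid. */
    void node_split(int n, double *A, const double *B, const double *C,
                    double *D, double *TEMP) {
        for (int i = 1; i <= n; i++)   /* L1 (DOALL): capture the old A(I+1) */
            TEMP[i] = A[i + 1];
        for (int i = 1; i <= n; i++)   /* L2 (DOALL) */
            A[i] = B[i] + C[i];
        for (int i = 1; i <= n; i++)   /* L3 (DOALL) */
            D[i] = A[i - 1] * TEMP[i];
    }

Each loop can now run under, for example, an OpenMP parallel for, since the only remaining dependences are between the loops, not inside them.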

Node Splitting

The next example presents a special possibility of node splitting. Here we have a dependence between instances of a statement with a distance of two between them.

    L1:   DO I = 1, M
    S1:     A(I) = A(I - 2) - 2

A loop with a dependence cycle of distance 2. The iterations where a dependence exists must be performed serially, but we can perform in parallel the two groups of independent statement instances that exist in the loop:

    S1:   A(I)     <-  A(I - 2)
    S2:   A(I + 1) <-  A(I - 1)
    S3:   A(I + 2) <-  A(I)
    S4:   A(I + 3) <-  A(I + 1)
    S5:   A(I + 4) <-  A(I + 2)

Consecutive instances of the statement and the values each one reads.

Node Splitting

As the figure shows, there is a dependence relation between instance S1 and instance S3, and in general every other instance S(2i+1); the same situation holds for instances S2, S4 and so on. But the two groups are independent of each other, so we perform a split of the original loop into two independent loops:

    NODE SPLITTING, PART ONE
    DO I = 1, (M - 1)/2 * 2 + 1, STEP 2
      A(I) = A(I - 2) - 2

    NODE SPLITTING, PART TWO
    DO I = 2, M/2 * 2, STEP 2
      A(I) = A(I - 2) - 2

The loop is split into two loops, each of them performing half of the original work. We must take care with the loop indexes in this transformation.

Loop Shrinking

The purpose of loop shrinking, also known in the literature as cycle shrinking, can be considered similar to a partial loop partition. The difference is that loop shrinking always gives results at least as good as partitioning, as we will see later. Let us consider the following example to introduce the technique and how it is performed. The following loop with K statements is involved in a dependence cycle of the form

    S1 -> S2 -> ... -> Sk -> S1

where φi is the distance of the i-th dependence in the cycle (i = 1, 2, ..., k).

    DO I = 1, N
      S1
      S2
      ...
      SK

Original loop for the example.

Loop Shrinking

    DOALL J = 1, g
      DO I = J, N, g
        S1
        S2
        ...
        SK

Modified loop after a Partition transformation, where g = GCD(φ1, φ2, ..., φk) is the greatest common divisor of all k distances in the dependence cycle.

The same loop is transformed by cycle shrinking into the following loop, where λ = min(φ1, φ2, ..., φk):

    DO J = 1, N, λ
      DOALL I = J, J + λ - 1
        S1
        S2
        ...
        SK

Modified loop after a Loop Shrinking transformation.

Comparing Loop Shrinking and Loop Partitioning

In other terms, the size of the DOALL loop created by cycle shrinking is always greater than or equal to the size of the DOALL created by the Partition transformation, since g divides every distance and therefore g <= λ. The philosophy of each transformation is different:

Partition tries to group together all iterations of a DO loop that form a dependence chain. Each such group is executed serially, while different groups can execute in parallel. Dependences are confined within the iterations of each group, and dependences across groups do not exist.

Cycle shrinking groups together independent iterations and executes them in parallel. Dependences exist only across groups and are satisfied by executing the different groups in their natural sequential order.

Comparing Loop Shrinking and Loop Partitioning

    L1:   DOSEQ I = 4, N
    S1:     A(I) = B(I-2) - 1
    S2:     B(I) = A(I-3) * K

    MODIFIED LOOP
    L1:   DOSEQ J = 4, N, 2
    L2:     DOALL I = J, J+1
    S1:       A(I) = B(I-2) - 1
    S2:       B(I) = A(I-3) * K

An example of the Loop Shrinking transformation: the distances are 2 and 3, so λ = min(2, 3) = 2 and each pair of consecutive iterations executes in parallel. A C sketch of this example follows. [Iteration space graphs for the above example.]
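As a C sketch (our own rendering of the example above, with the OpenMP pragma standing in for DOALL and arrays assumed indexed so that B[i-2] and A[i-3] are valid): the loop advances in sequential blocks of λ = 2 iterations, and the two iterations inside each block are independent of each other.

    /* Cycle shrinking with lambda = min(2, 3) = 2. */
    void cycle_shrink(int n, double *A, double *B, double K) {
        for (int j = 4; j <= n; j += 2) {        /* DOSEQ J = 4, N, 2 */
            int hi = (j + 1 <= n) ? j + 1 : n;   /* guard the last block */
            #pragma omp parallel for
            for (int i = j; i <= hi; i++) {      /* DOALL I = J, J+1 */
                A[i] = B[i - 2] - 1.0;           /* S1 */
                B[i] = A[i - 3] * K;             /* S2 */
            }
        }
    }

Inside one block, iteration i+1 reads only B[i-1] and A[i-2], both produced in earlier blocks, so the two iterations never conflict; all dependences cross block boundaries and are satisfied by the sequential outer loop.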

Loop Skewing

Loop skewing extracts parallelism from multiple nested loops, in many cases where parallelism cannot be found in any single loop.

    DOSEQ I = 2, N-1
      DOSEQ J = 2, N-1
        A(I,J) = (A(I+1,J) + A(I-1,J) + A(I,J+1) + A(I,J-1)) / 4

[Iteration Space Graph for the example: the diagonal lines correspond to the wavefronts found in this iteration space.]

Loop Skewing

The transformation consists in a shift of the index set of the original loop, creating a rhomboid iteration space out of what was a square. The corresponding restructured code is the following:

    MODIFIED LOOP
    DOALL I = 2, N-1
      DO J = I+2, I+N-1
        A(I,J-I) = (A(I+1,J-I) + A(I-1,J-I) + A(I,J+1-I) + A(I,J-1-I)) / 4

[Iteration Space Graph for the modified program using the loop skewing technique: iterations on a vertical line are executed concurrently.]
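The DOALL-outer form above relies on synchronization between the skewed rows; a common executable rendering, sketched below in C (our own; N, the array, and the helper functions are assumptions of this sketch), runs the wavefronts J' = I + J sequentially and all iterations on one wavefront in parallel, which is exactly the concurrency shown by the vertical lines of the skewed iteration space graph.

    #define N 512  /* hypothetical problem size for this sketch */
    static double A[N + 1][N + 1];

    static int imax(int a, int b) { return a > b ? a : b; }
    static int imin(int a, int b) { return a < b ? a : b; }

    /* Wavefront execution of the skewed stencil: all points with equal
     * J' = I + J are mutually independent, because each point reads its
     * north/west neighbors from wavefront J'-1 and its old south/east
     * neighbors, which are not rewritten until wavefront J'+1. */
    void skewed_wavefront(void) {
        for (int jp = 4; jp <= 2 * (N - 1); jp++) {   /* sequential wavefronts */
            int lo = imax(2, jp - (N - 1));
            int hi = imin(N - 1, jp - 2);
            #pragma omp parallel for
            for (int i = lo; i <= hi; i++) {          /* parallel within a front */
                int j = jp - i;                        /* back to original coords */
                A[i][j] = (A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1]) / 4.0;
            }
        }
    }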
