Parallelization techniques. Dependence graph. Dependence Distance and Distance Vector
- Tobias Lyons
Parallelization techniques

The parallelization of loops normally follows three steps:

1. Perform a Data Dependence Test to detect potential parallelism. These tests may be performed over the statements in a loop, over consecutive iterations of the loop, or even over more than one loop of the original program.

2. Restructure the loop into one of the possible forms that represent total, partial or no parallelism: DOALL, DOACROSS or DOSEQ. We may also want to restructure the internal layout of the statements to find a more suitable order for other transformations, for example vectorization. Using different transformations we can obtain the greatest degree of parallelism in a program.

3. Generate parallel code for a particular computer and/or architecture by scheduling the iterations on specific processors and then synthesizing a convenient mechanism for achieving parallelism in a Shared Memory system or in a Distributed Memory system.

Dependence graph

A dependence graph is a precedence graph where nodes are statements and arcs are dependences: a directed graph G(V,E) where V = {S1, S2, ..., Sn} is a set of nodes corresponding to statements in a program, and E = {eij = (Si,Sj) | Si,Sj in V} is a set of arcs representing data dependences between statements.

S1: X = Y + 1
L1: DO I = 2, 20
S2:   C(I) = X + B(I)
S3:   A(I) = C(I-1) + Z
S4:   C(I+1) = B(I) + A(I)
L2:   DO J = 2, 20
S5:     F(I,J) = F(I,J-1) + X
S6: Z = Y + 2

Dependence Distance and Distance Vector

Suppose a statement S is inside a nested loop L. Let the first instance of S occur when the loop index is i1, and the second instance occur when the index is i2, where iteration i1 is the source of a dependence and i2 is the sink of the dependence relation. The dependence distance is i2 - i1 for this dependence.

L1: DO I = 1, 3
L2:   DO J = 1, 3
S1:     A(I,J) = B(I,J) + C(I,J)
S2:     B(I,J+1) = A(I,J) + B(I,J)

The distances for the above example are 0 and 1 for the I and J loops, respectively.
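As a quick sanity check (not part of the original slides), the distance vector of the nested-loop example can be computed by brute force: enumerate the iteration space, record which iteration writes each element of B (statement S2 writes B(I,J+1)) and which later iteration reads it (statement S1 reads B(I,J)), and subtract index vectors. The loop bound here is an arbitrary small value.

```python
# Enumerate the iteration space and record, for each element of B, the
# iteration (i, j) that writes it. When a later iteration reads the same
# element, the element-wise difference of the index vectors (sink - source)
# is the distance vector of that flow dependence.
N = 4

writes = {}       # element of B -> iteration (i, j) that wrote it (S2: B(i, j+1) = ...)
distances = set()

for i in range(1, N + 1):
    for j in range(1, N + 1):
        # S1 reads B(i, j); if an earlier iteration wrote it, record the distance.
        if (i, j) in writes:
            src = writes[(i, j)]
            distances.add((i - src[0], j - src[1]))
        # S2 writes B(i, j + 1).
        writes[(i, j + 1)] = (i, j)

print(distances)  # {(0, 1)}: distance 0 in the I loop, 1 in the J loop
```

Every dependence found has distance vector (0, 1), matching the slide's statement that the distances are 0 and 1 for the I and J loops.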
Vectorization

The aim of vectorization is the automatic transformation of a sequential structure into code suitable for vector machines. To do this, the compiler must check all the dependences existing inside the loop. In the simplest case, when no dependences exist, the compiler distributes the loop around each statement and creates a vector statement for each one.

DO I = 1, N
S1:  A(I) = B(I) * C(I)
S2:  D(I) = B(I) * K

MODIFIED LOOP
S1: A(1:N) = B(1:N) * C(1:N)
S2: D(1:N) = K * B(1:N)

Vectorization of a simple loop (without data dependences).

Vectorization

DO I = 1, N
S1:  A(I+1) = B(I-1) + C(I)
S2:  B(I) = A(I) * K
S3:  C(I) = B(I) - 1

Program and corresponding Dependence Graph: S1 and S2 form a dependence cycle through arrays A and B, while S3 only receives dependences from the other statements.

A simple analysis of the graph shows that statements S1 and S2 are strongly connected, because data dependence exists in more than one direction, so they cannot be vectorized. Statement S3, however, can be vectorized, because that condition does not hold between S3 and the other statements.

DO I = 1, N
S1:  A(I+1) = B(I-1) + C(I)
S2:  B(I) = A(I) * K
S3: C(1:N) = B(1:N) - 1

Program after the vectorization transformation.

Vectorization (1/2) - Loop Reordering

DO I = 1, 100
S1:  D(I) = A(I-1) * D(I)
S2:  A(I) = B(I) + C(I)

Program and corresponding Dependence Graph.

After the Statement Reordering transformation, the program and the Dependence Graph are as follows:

DO I = 1, 100
S2:  A(I) = B(I) + C(I)
S1:  D(I) = A(I-1) * D(I)

Modified program and corresponding Dependence Graph.

Note that a dependence relation still exists between S2 and S1, but it is only in one direction.
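The dependence-free case can be checked with a small simulation (not from the original slides): the sequential loop and the "vector" form, written here with whole-list operations to mimic Fortran's A(1:N) = B(1:N) * C(1:N) notation, must produce identical arrays. The array values are arbitrary.

```python
# Sequential loop vs. vector (whole-array) form for the dependence-free case.
N = 8
B = [float(i) for i in range(N)]
C = [float(2 * i) for i in range(N)]
K = 3.0

# Sequential version: DO I = 1, N
A_seq = [0.0] * N
D_seq = [0.0] * N
for i in range(N):
    A_seq[i] = B[i] * C[i]   # S1
    D_seq[i] = B[i] * K      # S2

# Vectorized version: each statement becomes one whole-array operation,
# which is legal here because S1 and S2 carry no dependence.
A_vec = [b * c for b, c in zip(B, C)]   # S1: A(1:N) = B(1:N) * C(1:N)
D_vec = [K * b for b in B]              # S2: D(1:N) = K * B(1:N)

assert A_vec == A_seq and D_vec == D_seq
print("vector form matches sequential form")
```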
Vectorization (2/2) - Loop Reordering

L1: DO I = 1, 100
S2:   A(I) = B(I) + C(I)
L2: DO I = 1, 100
S1:   D(I) = A(I-1) * D(I)

The next step of the transformation is Loop Distribution. Now there are no dependences within L1 and L2, and vectorization can be done.

Loop after Vectorization:

A(1:100) = B(1:100) + C(1:100)
D(1:100) = A(0:99) * D(1:100)

Program after vectorization.

Loop Fusion (also known as Loop Jamming) (1/2)

Loop fusion merges two separate loops into a single one. A data dependence test must be performed between the statements inside the two loops to ensure that no dependence relation is violated by the fusion of the loops.

L1: DO I = 1, N
S1:   A(I) = B(I) + C(I+1)
L2: DO I = 1, N
S2:   C(I) = A(I+2)

FUSED LOOP
DO I = 1, N
S1:  A(I) = B(I) + C(I+1)
S2:  C(I) = A(I+2)

Two original loops in a program that can NOT be fused into one loop: in the fused loop, S2 at iteration I would read A(I+2) before S1 has computed it.

Loop Fusion (also known as Loop Jamming) (2/2)

L1: DOALL I = 1, N
S1:   D(I) = E(I) + F(I) + X(I)
L2: DOALL J = 1, N
S2:   E(J) = D(J) * F(J)

FUSED LOOP
L1: DOALL I = 1, N
S1:   D(I) = E(I) + F(I) + X(I)
S2:   E(I) = D(I) * F(I)

Two original loops in a program that CAN be fused into one loop.
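Why the first pair of loops must not be fused can be demonstrated directly (this check is not from the original slides; the read index A(I+2) is reconstructed from the garbled example): in the fused loop, S2 at iteration i reads A(i+2) before S1 has written it, so the fused and unfused versions produce different results.

```python
# Illegal fusion demo: fusing the two loops changes the values observed by S2.
N = 6

def run(fused):
    A = [0.0] * (N + 3)
    B = [float(i) for i in range(N + 3)]
    C = [float(10 + i) for i in range(N + 3)]
    if fused:
        for i in range(N):
            A[i] = B[i] + C[i + 1]   # S1
            C[i] = A[i + 2]          # S2 reads A(i+2) before S1 writes it
    else:
        for i in range(N):
            A[i] = B[i] + C[i + 1]   # S1 (first loop runs to completion)
        for i in range(N):
            C[i] = A[i + 2]          # S2 (second loop sees the new A values)
    return C

print(run(fused=False) != run(fused=True))  # True: fusion changes the result
```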
Loop Distribution

The idea of this transformation is to distribute or separate a complete loop around each statement in its body, or around modules inside the loop. In general, this distribution of statements is legal if there is no data dependence between a pair of statements, or if the data dependences go in only one direction.

In the example, a loop with three statements in its body is analyzed and some dependences are found. The distribution is made taking into account the dependences found in the previous phase. In the original loop there is a flow dependence between S2 and S1: in each iteration, the value used by B(I-1) in S1 is the value calculated in the previous iteration by S2. There is also a flow dependence between S1 and S2, because the value of A(I) used in S2 is the one assigned through A(I+1) by S1 in the previous iteration. An antidependence exists between S1 and S3, and between S2 and S3 we find a flow dependence. These two latter dependences point towards S3, in only one direction. Thus, we can make the separation.

DOSEQ I = 1, N
S1:  A(I+1) = B(I-1) + C(I)
S2:  B(I) = A(I) * K
S3:  C(I) = B(I) - 1
END DO

TRANSFORMED LOOP
DOSEQ I = 1, N
S1:  A(I+1) = B(I-1) + C(I)
S2:  B(I) = A(I) * K
DOALL I = 1, N
S3:  C(I) = B(I) - 1

A loop with three statements in its body, and the program after the transformation.

Loop Interchange (1/3)

Loop interchange between two nested loops is a permutation of the loop statements so that the outer loop becomes the inner loop and vice versa. Naturally, the transformation can be applied repeatedly to interchange more than two loops when the program is composed of a set of nested loops. In contrast, when we use it for parallelization purposes on a machine that is not a vector machine, the most suitable strategy is to bring the parallelizable loop to the outermost position, to achieve maximum parallelism.
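The legality of this distribution can be checked with a small simulation (not from the original slides; initial array values are arbitrary): running S1 and S2 in one sequential loop and S3 in a separate loop gives the same arrays as the original single loop, because all dependences into S3 point in one direction.

```python
# Loop distribution equivalence check for the three-statement loop.
N = 10
K = 2.0

def original():
    A = [1.0] * (N + 2); B = [2.0] * (N + 2); C = [3.0] * (N + 2)
    for i in range(1, N + 1):
        A[i + 1] = B[i - 1] + C[i]   # S1
        B[i] = A[i] * K              # S2
        C[i] = B[i] - 1              # S3
    return A, B, C

def distributed():
    A = [1.0] * (N + 2); B = [2.0] * (N + 2); C = [3.0] * (N + 2)
    for i in range(1, N + 1):        # DOSEQ: S1 and S2 stay together
        A[i + 1] = B[i - 1] + C[i]
        B[i] = A[i] * K
    for i in range(1, N + 1):        # DOALL: all dependences point into S3
        C[i] = B[i] - 1
    return A, B, C

assert original() == distributed()
print("distribution preserves the result")
```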
The rule in this case, contrary to the previous one, is that the further out we bring the parallel loop, the more iterations can be launched in parallel.

L1: DOALL I = 2, N
L2:   DOSEQ J = 2, M
S1:     A(I,J) = A(I,J-1) + 1

TRANSFORMED LOOP
L1: DOSEQ J = 2, M
L2:   DOALL I = 2, N
S1:     A(I,J) = A(I,J-1) + 1

Loop Interchange (2/3)

The loop is not vectorizable as written, since the innermost loop must be executed serially, but the outer loop is a parallel one. Interchanging makes vectorization possible.

LOOP AFTER VECTORIZATION
L1: DOSEQ J = 2, M
VS1:  A(2:N, J) = A(2:N, J-1) + 1

The inner loop is transformed into one vector statement.
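The legality of this interchange can be verified with a small simulation (not from the original slides; bounds and initial values are arbitrary): since only the J loop carries a dependence, swapping the two loops preserves the result, and the inner I loop of the swapped version is the one that becomes a vector statement.

```python
# Loop interchange equivalence check for A(I,J) = A(I,J-1) + 1.
N, M = 5, 4

def ij_order():
    A = [[1.0] * (M + 1) for _ in range(N + 1)]
    for i in range(2, N + 1):         # DOALL I (outer)
        for j in range(2, M + 1):     # DOSEQ J (inner, carries the dependence)
            A[i][j] = A[i][j - 1] + 1
    return A

def ji_order():
    A = [[1.0] * (M + 1) for _ in range(N + 1)]
    for j in range(2, M + 1):         # DOSEQ J (now outermost)
        for i in range(2, N + 1):     # DOALL I (vectorizable)
            A[i][j] = A[i][j - 1] + 1
    return A

assert ij_order() == ji_order()
print("interchange preserves the result")
```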
Loop Interchange (3/3)

The example shows how to use loop interchange to achieve maximum parallelism. The interchange here is done to place the DOALL loop in the outermost position, so that everything inside this loop can be launched in parallel. Bringing this loop to the outermost position increases the grain of what is going to be executed in parallel. The further out we bring the parallel loop, the more statements and loops it will contain, and much more code can be executed in parallel.

L1: DOSEQ I = 2, N
L2:   DOALL J = 1, N
S1:     A(I,J) = A(I-1,J) + B(I)

TRANSFORMED LOOP
L1: DOALL J = 1, N
L2:   DOSEQ I = 2, N
S1:     A(I,J) = A(I-1,J) + B(I)

Node Splitting - Loop Partitioning

Two ideas are important when considering the partitioning of a loop. One possibility is to separate statements that form a cycle, to eliminate the data dependences existing between them. The other idea consists in simply partitioning the statements of a loop to convert the problem into independent smaller problems.

L1: DOSEQ I = 1, N
S1:   A(I) = B(I) + C(I)
S2:   D(I) = A(I-1) * A(I+1)

A loop with two statements where a dependence cycle exists: in every iteration A is updated, and the value is used in the next iteration. The distance of the dependence is 1.

INTRODUCING TEMPORARY VARIABLE AND RENAMING
L1: DOSEQ I = 1, N
S3:   TEMP(I) = A(I+1)
S1:   A(I) = B(I) + C(I)
S2:   D(I) = A(I-1) * TEMP(I)

REORDERING STATEMENTS
L1: DOSEQ I = 1, N
S1:   A(I) = B(I) + C(I)
S3:   TEMP(I) = A(I+1)
S2:   D(I) = A(I-1) * TEMP(I)

Node Splitting - Loop Partitioning

Two steps of the transformation: in the first step a new variable is introduced and a reordering is done; in the second step a second reordering is performed. After these steps we arrive at a form suitable for distributing the loop. The distribution creates three DOALL loops, that is, three loops with the possibility of total parallel execution.

AFTER DISTRIBUTION
L1: DOALL I = 1, N
S3:   TEMP(I) = A(I+1)
L2: DOALL I = 1, N
S1:   A(I) = B(I) + C(I)
L3: DOALL I = 1, N
S2:   D(I) = A(I-1) * TEMP(I)
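The node-splitting idea can be checked with a small simulation (not from the original slides; initial values are arbitrary): copying the not-yet-overwritten A(I+1) into TEMP breaks the antidependence, and the distributed version computes the same D as the original loop.

```python
# Node splitting equivalence check: TEMP captures the old A(i+1) values.
N = 10

def original():
    A = [float(i) for i in range(N + 2)]
    B = [1.0] * (N + 2); C = [2.0] * (N + 2)
    D = [0.0] * (N + 2)
    for i in range(1, N + 1):
        A[i] = B[i] + C[i]             # S1
        D[i] = A[i - 1] * A[i + 1]     # S2 reads the OLD A(i+1)
    return D

def split():
    A = [float(i) for i in range(N + 2)]
    B = [1.0] * (N + 2); C = [2.0] * (N + 2)
    D = [0.0] * (N + 2)
    TEMP = [0.0] * (N + 2)
    for i in range(1, N + 1):          # DOALL: copy old A(i+1) values
        TEMP[i] = A[i + 1]
    for i in range(1, N + 1):          # DOALL
        A[i] = B[i] + C[i]
    for i in range(1, N + 1):          # DOALL
        D[i] = A[i - 1] * TEMP[i]
    return D

assert original() == split()
print("node splitting preserves D")
```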
Node Splitting

The next example presents a special possibility of node splitting. Here we have a dependence between a pair of statements with a distance of two.

L1: DO I = 1, M
S1:   A(I) = A(I-2) - 1

The loop has a dependence cycle of distance 2. The iterations where a dependence exists must be performed serially, but we can perform in parallel the two groups of independent-operation statements that exist in the loop:

S1: A(I)   <- A(I-2)
S2: A(I+1) <- A(I-1)
S3: A(I+2) <- A(I)
S4: A(I+3) <- A(I+1)
S5: A(I+4) <- A(I+2)

As the figure shows, there is a dependence relation between statement S1, statement S3 and every other statement S(2i+1). The same situation holds for S2, S4 and so on. But each of the two groups can be done in parallel. So we split the original loop into two independent loops.

NODE SPLITTING, PART ONE
DO I = 1, ((M-1)/2)*2 + 1, STEP 2
  A(I) = A(I-2) - 1

NODE SPLITTING, PART TWO
DO I = 2, (M/2)*2, STEP 2
  A(I) = A(I-2) - 1

The loop is split into two loops, each performing half of the original work. In this transformation we must take care with the indexes of the loop.

Loop Shrinking

The purpose of Loop Shrinking, also known in the literature as Cycle Shrinking, could be considered similar to a partial Loop Partition. The difference is that Loop Shrinking always gives results at least as good as partitioning, as we will see later. Let us consider the following example to introduce the technique and how it is performed. The following loop with k statements is involved in a dependence cycle of the type:

S1 --d1--> S2 --d2--> ... --d(k-1)--> Sk --dk--> S1

where di is the distance of dependence i (i = 1, 2, ..., k).

DO I = 1, N
  S1
  S2
  ...
  Sk

Original loop for the example.
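The parity split can be checked with a small simulation (not from the original slides; to keep array indices in bounds, the loop here starts at I = 3): the recurrence A(I) = A(I-2) - 1 only links iterations of the same parity, so the odd-index and even-index loops are independent of each other and produce the same result as the original order.

```python
# Distance-2 split check: odd and even iterations form independent chains.
M = 11

def original():
    A = [float(i) for i in range(M + 1)]
    for i in range(3, M + 1):
        A[i] = A[i - 2] - 1
    return A

def split():
    A = [float(i) for i in range(M + 1)]
    for i in range(3, M + 1, 2):       # part one: odd indices
        A[i] = A[i - 2] - 1
    for i in range(4, M + 1, 2):       # part two: even indices
        A[i] = A[i - 2] - 1
    return A

assert original() == split()
print("odd/even split preserves the result")
```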
Loop Shrinking

DOALL J = 1, g
  DO I = J, J + ((N-J)/g)*g, g
    S1
    ...
    Sk
ENDOALL

Modified loop after a Partition transformation, where g = GCD(d1, d2, ..., dk) is the greatest common divisor of all k distances in the dependence cycle.

The same loop is transformed by Cycle Shrinking into the following loop, where L = min(d1, d2, ..., dk):

DO J = 1, N, L
  DOALL I = J, J + L - 1
    S1
    ...
    Sk
ENDOALL

Modified loop after a Loop Shrinking transformation.

Comparing Loop Shrinking and Loop Partitioning

In other terms, the size of the DOALL loop created by Cycle Shrinking is always greater than or equal to the size of the DOALL created by the Partition transformation. The philosophy of each transformation is different:

Partition tries to group together all iterations of a DO loop that form a dependence chain. Each such group is executed serially, while different groups can execute in parallel. Dependences are confined within the iterations of each group, and dependences across groups do not exist.

Cycle Shrinking groups together independent iterations and executes them in parallel. Dependences exist only across groups and are satisfied by executing the different groups in their natural sequential order.

Comparing Loop Shrinking and Loop Partitioning

L1: DOSEQ I = 3, N
S1:   A(I) = B(I-2) - 1
S2:   B(I) = A(I-3) * K

MODIFIED LOOP
L1: DOSEQ J = 3, N, 2
L2:   DOALL I = J, J+1
S1:     A(I) = B(I-2) - 1
S2:     B(I) = A(I-3) * K

An example of a Loop Shrinking transformation (N = 7, L = 2). Iteration Space Graphs for the above example.
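The shrinking of this example can be checked with a small simulation (not from the original slides; initial values are arbitrary): with distances 2 and 3, L = min(2, 3) = 2, so executing the iterations in blocks of 2, where each block could run as a DOALL, matches the fully sequential loop.

```python
# Cycle shrinking equivalence check: A(I) = B(I-2) - 1; B(I) = A(I-3) * K.
N = 15
K = 2.0

def sequential():
    A = [float(i) for i in range(N + 1)]
    B = [float(2 * i) for i in range(N + 1)]
    for i in range(3, N + 1):
        A[i] = B[i - 2] - 1       # S1, distance 2
        B[i] = A[i - 3] * K       # S2, distance 3
    return A, B

def shrunk(lam=2):
    A = [float(i) for i in range(N + 1)]
    B = [float(2 * i) for i in range(N + 1)]
    for j in range(3, N + 1, lam):                 # DOSEQ over blocks
        for i in range(j, min(j + lam, N + 1)):    # DOALL within a block
            A[i] = B[i - 2] - 1
            B[i] = A[i - 3] * K
    return A, B

assert sequential() == shrunk()
print("cycle shrinking preserves the result")
```

Within each block of size 2, iteration i+1 only reads values written at least 2 iterations earlier, so the block's iterations are mutually independent and the DOALL is legal.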
Loop Skewing

Loop Skewing extracts parallelism from multiple nested loops, in many cases where parallelism cannot be found in any single loop.

DOSEQ I = 2, N-1
  DOSEQ J = 2, N-1
    A(I,J) = (A(I+1,J) + A(I-1,J) + A(I,J+1) + A(I,J-1)) / 4

Iteration Space Graph for the example: in the graph you can see diagonal lines that correspond to the wave fronts found in this iteration space.

Loop Skewing

The transformation consists of a shift of the index set of the original loop, creating a rhomboid Iteration Space out of what was a square. The corresponding restructured code is the following:

MODIFIED LOOP
DOALL I = 2, N-1
  DO J = I+2, I+N-1
    A(I,J-I) = (A(I+1,J-I) + A(I-1,J-I) + A(I,J-I+1) + A(I,J-I-1)) / 4

Iteration Space Graph for the modified program using the Loop Skewing technique: iterations on a vertical line are executed concurrently.
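The re-indexing step of skewing can be checked with a small simulation (not from the original slides; initial values are arbitrary, and the outer loop is kept sequential here, so this verifies only that the substitution J' = I + J, i.e. J = J' - I, preserves the computation, not the concurrent schedule):

```python
# Skewing re-indexing check for the Gauss-Seidel-style stencil.
N = 8

def original():
    A = [[float(i + j) for j in range(N + 2)] for i in range(N + 2)]
    for i in range(2, N):            # I = 2 .. N-1
        for j in range(2, N):        # J = 2 .. N-1
            A[i][j] = (A[i + 1][j] + A[i - 1][j]
                       + A[i][j + 1] + A[i][j - 1]) / 4
    return A

def skewed():
    A = [[float(i + j) for j in range(N + 2)] for i in range(N + 2)]
    for i in range(2, N):
        for jp in range(i + 2, i + N):   # J' = I + J runs over I+2 .. I+N-1
            j = jp - i                   # recover the original column index
            A[i][j] = (A[i + 1][j] + A[i - 1][j]
                       + A[i][j + 1] + A[i][j - 1]) / 4
    return A

assert original() == skewed()
print("skewing preserves the result")
```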
Memory Cache Memory Locality cpu cache memory Memory hierarchies take advantage of memory locality. Memory locality is the principle that future memory accesses are near past accesses. Memory hierarchies
More information! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationChapter 13: Query Processing Basic Steps in Query Processing
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationIntroduction. No Optimization. Basic Optimizations. Normal Optimizations. Advanced Optimizations. Inter-Procedural Optimizations
Introduction Optimization options control compile time optimizations to generate an application with code that executes more quickly. Absoft Fortran 90/95 is an advanced optimizing compiler. Various optimizers
More informationFusion of Loops for Parallelism and Locality
Fusion of Loops for Parallelism and Locality Naraig Manjikian and Tarek S. Abdelrahman Department of Electrical and Computer Engineering The University of Toronto Toronto, Ontario, Canada M5S 1A4 email:
More informationModule 2: Classical Algorithm Design Techniques
Module 2: Classical Algorithm Design Techniques Dr. Natarajan Meghanathan Associate Professor of Computer Science Jackson State University Jackson, MS 39217 E-mail: natarajan.meghanathan@jsums.edu Module
More informationEssential constraints: Data Dependences. S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2
Essential constraints: Data Dependences S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2 Essential constraints: Data Dependences S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2 S2
More informationLecture 15: Iteration and Recursion
Lecture 15: and Recursion The University of North Carolina at Chapel Hill Spring 2002 Lecture 15: and Recursion Feb 13/15 1 Control Flow Mechanisms Sequencing Textual order, Precedence in Expression Selection
More informationData-centric Transformations for Locality Enhancement
Data-centric Transformations for Locality Enhancement Induprakas Kodukula Keshav Pingali September 26, 2002 Abstract On modern computers, the performance of programs is often limited by memory latency
More informationParallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville
Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information
More informationFor searching and sorting algorithms, this is particularly dependent on the number of data elements.
Looking up a phone number, accessing a website and checking the definition of a word in a dictionary all involve searching large amounts of data. Searching algorithms all accomplish the same goal finding
More information12/1/2016. Sorting. Savitch Chapter 7.4. Why sort. Easier to search (binary search) Sorting used as a step in many algorithms
Sorting Savitch Chapter. Why sort Easier to search (binary search) Sorting used as a step in many algorithms Sorting algorithms There are many algorithms for sorting: Selection sort Insertion sort Bubble
More informationLecture 57 Dynamic Programming. (Refer Slide Time: 00:31)
Programming, Data Structures and Algorithms Prof. N.S. Narayanaswamy Department of Computer Science and Engineering Indian Institution Technology, Madras Lecture 57 Dynamic Programming (Refer Slide Time:
More informationTiling: A Data Locality Optimizing Algorithm
Tiling: A Data Locality Optimizing Algorithm Previously Performance analysis of existing codes Data dependence analysis for detecting parallelism Specifying transformations using frameworks Today Usefulness
More informationMemories. CPE480/CS480/EE480, Spring Hank Dietz.
Memories CPE480/CS480/EE480, Spring 2018 Hank Dietz http://aggregate.org/ee480 What we want, what we have What we want: Unlimited memory space Fast, constant, access time (UMA: Uniform Memory Access) What
More informationChapter 12: Query Processing. Chapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join
More informationSorting Pearson Education, Inc. All rights reserved.
1 19 Sorting 2 19.1 Introduction (Cont.) Sorting data Place data in order Typically ascending or descending Based on one or more sort keys Algorithms Insertion sort Selection sort Merge sort More efficient,
More informationDependence Analysis. Hwansoo Han
Dependence Analysis Hwansoo Han Dependence analysis Dependence Control dependence Data dependence Dependence graph Usage The way to represent dependences Dependence types, latencies Instruction scheduling
More informationLoops. Lather, Rinse, Repeat. CS4410: Spring 2013
Loops or Lather, Rinse, Repeat CS4410: Spring 2013 Program Loops Reading: Appel Ch. 18 Loop = a computation repeatedly executed until a terminating condition is reached High-level loop constructs: While
More information6/12/2013. Introduction to Algorithms (2 nd edition) Overview. The Sorting Problem. Chapter 2: Getting Started. by Cormen, Leiserson, Rivest & Stein
Introduction to Algorithms (2 nd edition) by Cormen, Leiserson, Rivest & Stein Chapter 2: Getting Started (slides enhanced by N. Adlai A. DePano) Overview Aims to familiarize us with framework used throughout
More informationDi Zhao Ohio State University MVAPICH User Group (MUG) Meeting, August , Columbus Ohio
Di Zhao zhao.1029@osu.edu Ohio State University MVAPICH User Group (MUG) Meeting, August 26-27 2013, Columbus Ohio Nvidia Kepler K20X Intel Xeon Phi 7120 Launch Date November 2012 Q2 2013 Processor Per-processor
More informationMaximum Loop Distribution and Fusion for Two-level Loops Considering Code Size
Maximum Loop Distribution and Fusion for Two-level Loops Considering Code Size Meilin Liu Qingfeng Zhuge Zili Shao Chun Xue Meikang Qiu Edwin H.-M. Sha Department of Computer Science Department of Computing
More informationLoops and Locality. with an introduc-on to the memory hierarchy. COMP 506 Rice University Spring target code. source code OpJmizer
COMP 506 Rice University Spring 2017 Loops and Locality with an introduc-on to the memory hierarchy source code Front End IR OpJmizer IR Back End target code Copyright 2017, Keith D. Cooper & Linda Torczon,
More informationLecture 9: Improving Cache Performance: Reduce miss rate Reduce miss penalty Reduce hit time
Lecture 9: Improving Cache Performance: Reduce miss rate Reduce miss penalty Reduce hit time Review ABC of Cache: Associativity Block size Capacity Cache organization Direct-mapped cache : A =, S = C/B
More informationEvaluation of Relational Operations: Other Techniques
Evaluation of Relational Operations: Other Techniques [R&G] Chapter 14, Part B CS4320 1 Using an Index for Selections Cost depends on #qualifying tuples, and clustering. Cost of finding qualifying data
More informationLecture 19 Sorting Goodrich, Tamassia
Lecture 19 Sorting 7 2 9 4 2 4 7 9 7 2 2 7 9 4 4 9 7 7 2 2 9 9 4 4 2004 Goodrich, Tamassia Outline Review 3 simple sorting algorithms: 1. selection Sort (in previous course) 2. insertion Sort (in previous
More information