Parallel Processing: Why, When, How? SimLab2010, Belgrade, October 5, 2010. Rodica Potolea

Parallel Processing: Why, When, How? Why? Problems too costly to be solved with the classical approach; the need for results in a specific (or reasonable) time. When? Are there sequences that are better suited for parallel implementation? Are there situations when it is better NOT to parallelize? How? What are the constraints (if any) when we need to pass from a sequential to a parallel implementation?

Outline: Computation paradigms; Constraints in parallel computation; Dependence analysis; Techniques for dependence removal; Systolic architectures & simulations; Increase the parallel potential; Examples on GRID.

Computation paradigms. Sequential model: referred to as the Von Neumann architecture; the basis of comparison. He simultaneously proposed other models; due to the technological constraints at that time, only the sequential model was implemented. Von Neumann could therefore be considered the founder of parallel computers as well (at the conceptual level).

Computation paradigms contd. Flynn's taxonomy (according to single/multiple instruction and data streams): SISD; SIMD, where a single control unit dispatches instructions to each processing unit; MISD, where the same data is analyzed from different points of view (multiple instructions); MIMD, where different instructions run on different data.

Parallel computation. Why parallel solutions? To increase speed, process huge amounts of data, solve problems in real time, and solve problems in due time.

Parallel computation: Efficiency. Parameters to be evaluated (in the sequential view): time and memory. Cases to be considered: best, worst, average. Optimal algorithms: Ω() = O(); an algorithm is optimal if its worst-case running time is of the same order of magnitude as the lower bound Ω of the problem to be solved.

Parallel computation: Efficiency contd. Order of magnitude: a function that expresses the running time and the lower bound Ω respectively; for polynomial functions, the polynomials should have the same maximum degree, regardless of the multiplicative constant. Ex: if Ω() = c1*n^p and O() = c2*n^p, they are considered equal, written Ω(n^p) = O(n^p), e.g. Ω(n^2) = O(n^2). Algorithms have O() >= Ω in the worst case; O() below Ω can hold for the best case, or even for the average case. Ex (sorting): Ω(n lg n) in the worst case, while the best case is O(n) for some algorithms.

Parallel computation: Efficiency contd. Do we really need parallel computers? Are computationally greedy algorithms better suited for parallel implementation? Consider two classes of algorithms: Alg1, polynomial; Alg2, exponential. Assume we design a new system that increases the speed of execution V times. How does the maximum dimension of the problem that can be solved on the new system increase? I.e., express n2 = f(V).

Parallel computation: Efficiency contd. Alg1: O(n^k). In time T, the old machine M1 performs n1^k operations; the new machine M2, being V times faster, performs V*n1^k operations in the same time T. Setting n2^k = V*n1^k = (V^(1/k)*n1)^k gives n2 = V^(1/k)*n1, i.e. n2 = c*n1.

Parallel computation: Efficiency contd. Alg2: O(2^n). In time T, M1 performs 2^n1 operations; M2 performs V*2^n1 operations in the same time. Setting 2^n2 = V*2^n1 = 2^(lg V + n1) gives n2 = n1 + lg V, i.e. n2 = c + n1.

Parallel computation: Efficiency contd. Speed of the new machine in terms of the speed of the old machine: V2 = V*V1. Alg1, O(n^k): n2 = V^(1/k)*n1. Alg2, O(2^n): n2 = n1 + lg V. Conclusion: BAD NEWS! (A worked example follows below.)
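
To make the comparison concrete, here is a worked instance of the two formulas; the speed factor V = 1024 and the exponent k = 2 are illustrative values of my choosing, not taken from the slides.

```latex
% Illustrative values (not from the slides): V = 1024, k = 2.
\begin{align*}
  \text{Alg1, } O(n^k),\ k = 2 :&\quad n_2 = V^{1/k} n_1 = 1024^{1/2}\, n_1 = 32\, n_1 \\
  \text{Alg2, } O(2^n) :&\quad n_2 = n_1 + \lg V = n_1 + \lg 1024 = n_1 + 10
\end{align*}
```

Even a thousand-fold faster machine only adds about 10 to the solvable problem size for the exponential algorithm, which is the "bad news" of this slide.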

Parallel computation: Efficiency contd. Are there algorithms that really need exponential running time? Backtracking: avoid it, never use it! Are there problems that are inherently exponential? NP-complete and NP-hard problems: difficult problems, and don't we face them often in real-world situations? Examples: Hamiltonian cycle, the (subset) sum problem, ...

Parallel computation: Efficiency contd. Parameters to be evaluated for parallel processing: time, memory, and the number of processors involved. Cost = time * number of processors (c = t * n). Efficiency E = O()/cost, with E <= 1.

Parallel computation: Efficiency contd. Speedup Δ = best sequential execution time / parallel execution time, with Δ <= n. Optimal algorithms: E = 1, Δ = n, cost = O(). A parallel algorithm is optimal if its cost is of the same order of magnitude as the worst-case running time of the best known sequential algorithm for the problem.

Efficiency contd. Parallel running time = computation time + communication time (data transfer, partial-results transfer, information exchange). Computation time: as the number of processors involved increases, the computation time decreases. Communication time: quite the opposite!

Efficiency contd. [Figure: diagram of running time T versus number of processors N — the parallel running time first decreases, then increases as communication dominates.] A toy cost model reproducing this shape is sketched below.
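
This is a minimal sketch assuming a simple model T(p) = t_comp/p + t_comm*(p-1) with made-up constants; it illustrates the shape of the diagram, not the measurements behind it.

```c
#include <stdio.h>

/* Toy cost model (an assumption for illustration, not the measured data):
 * the work t_comp is divided evenly among p processors, while the
 * communication overhead grows linearly with p.                          */
static double run_time(double t_comp, double t_comm, int p)
{
    return t_comp / p + t_comm * (p - 1);
}

int main(void)
{
    const double t_comp = 1000.0;   /* hypothetical total computation time  */
    const double t_comm = 5.0;      /* hypothetical per-processor comm cost */
    int best_p = 1;
    double best_t = run_time(t_comp, t_comm, 1);

    for (int p = 1; p <= 40; p++) {
        double t = run_time(t_comp, t_comm, p);
        printf("p = %2d   T(p) = %8.2f\n", p, t);
        if (t < best_t) { best_t = t; best_p = p; }
    }
    /* The minimum of the curve marks the optimal number of processors. */
    printf("optimal p = %d, T = %.2f\n", best_p, best_t);
    return 0;
}
```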

Outline: Computation paradigms; Constraints in parallel computation; Dependence analysis; Techniques for dependence removal; Systolic architectures & simulations; Increase the parallel potential; Examples on GRID.

Constraints in parallel computation. Data/code must be distributed among the different processing units in the system. Observe the precedence relations between statements/variables. Avoid conflicts (data and control dependencies). Remove constraints (where possible).

Constraints in parallel computation contd. Data parallelism: input data is split to be processed by different CPUs; characterizes the SIMD model of computation. Control parallelism: different sequences are processed by different CPUs; characterizes the MISD and MIMD models of computation. Conflicts should be avoided: if a constraint can be removed, then the sequence of code runs in parallel, else it runs sequentially (see the sketch below).
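
As a rough illustration (not from the slides), the two forms of parallelism map naturally onto OpenMP constructs in C; the array size, the arrays themselves, and the stage1/stage2 functions are placeholders.

```c
#define N 1000                      /* placeholder problem size */

double a[N], b[N], c[N];            /* placeholder data */

void data_parallel(void)
{
    /* Data parallelism (SIMD-style): the same operation is applied to
     * different slices of the input data by different threads.        */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

void stage1(void) { /* ... some sequence of code ... */ }
void stage2(void) { /* ... a different sequence ...   */ }

void control_parallel(void)
{
    /* Control parallelism (MISD/MIMD-style): different code sequences
     * run concurrently on different processors.                       */
    #pragma omp parallel sections
    {
        #pragma omp section
        stage1();
        #pragma omp section
        stage2();
    }
}
```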

Outline: Computation paradigms; Constraints in parallel computation; Dependence analysis; Techniques for dependence removal; Systolic architectures & simulations; Increase the parallel potential; Examples on GRID.

Dependence analysis. Dependences represent the major limitation on parallelism. Data dependences refer to read/write conflicts: flow dependence, antidependence, and output dependence. Control dependences exist between statements.

Dependence analysis: Data flow dependence. Indicates a write-before-read ordering of operations which must be satisfied. It can NOT be avoided and therefore represents the limit of the parallelism's potential. Ex (assume the sequence is inside a loop on i): S1: a(i) = x(i) - 3 // assign a(i), i.e. write to its memory location; S2: b(i) = a(i)/c(i) // use a(i), i.e. read from the corresponding memory location. a(i) must first be calculated in S1 and only then used in S2; therefore, S1 and S2 cannot be processed in parallel.

Parallel algorithms cont.: Dependence analysis. Data antidependence. Indicates a read-before-write ordering of operations which must be satisfied. It can be avoided (there are techniques to remove this type of dependence). Ex (assume the sequence is inside a loop on i): S1: a(i) = x(i) - 3 // written later; S2: b(i) = c(i) + a(i+1) // read first. a(i+1) must first be used (with its former value) in S2 and only then computed in S1, at the next execution of the loop; therefore, several iterations of the loop cannot be processed in parallel.

Parallel algorithms cont.: Dependence analysis. Output dependence. Indicates a write-before-write ordering of operations which must be satisfied. It can be avoided (there are techniques to remove this type of dependence). Ex (assume the sequence is inside a loop on i): S1: c(i+4) = b(i) + a(i+1) // assigned first here; S2: c(i+1) = x(i) // and here after 3 executions of the loop. c(i+4) is assigned in S1; after 3 more executions of the loop, the same array element is assigned in S2. Therefore, several iterations of the loop cannot be processed in parallel.

Parallel algorithms cont.: Dependence analysis. Each dependence defines a dependence vector, given by the distance between the loop iterations where it occurs. Data flow dependence: S1: a(i) = x(i) - 3; S2: b(i) = a(i)/c(i) — d = 0 on variable a. Data antidependence: S1: a(i) = x(i) - 3; S2: b(i) = c(i) + a(i+1) — d = 1 on variable a. Output dependence: S1: c(i+4) = b(i) + a(i+1); S2: c(i+1) = x(i) — d = 3 on variable c. All dependence vectors within a sequence define a dependence matrix (see below).
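
Collecting the three distances above gives the dependence matrix for this single-index example; writing it as a column of distance vectors, one row per dependence, is one common convention assumed here.

```latex
% Dependence matrix assembled from the distances above
% (one row per dependence; a single column because there is one loop index i).
D =
\begin{pmatrix}
  0 \\  % flow dependence on a
  1 \\  % antidependence on a
  3     % output dependence on c
\end{pmatrix}
```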

Parallel algorithms cont.: Dependence analysis. Data dependences: flow dependence (write before read) cannot be removed; antidependence (read before write) can be removed; output dependence (write before write) can be removed.

Outline: Computation paradigms; Constraints in parallel computation; Dependence analysis; Techniques for dependence removal; Systolic architectures & simulations; Increase the parallel potential; Examples on GRID.

Techniques for dependence removal. Many dependences are introduced artificially by programmers. Output dependences and antidependences are false dependences: they occur not because data is passed, but because the same memory location is used in more than one place. Removing (some of) those dependences increases the parallelization potential.

Techniques for dependence removal contd.: Renaming. S1: a = b * c; S2: d = a + 1; S3: a = a * d. Dependence analysis: S1->S2 (data flow on a); S2->S3 (data flow on d); S1->S3 (data flow on a); S1->S3 (output dep. on a); S2->S3 (antidependence on a).

Techniques for dependence removal contd.: Renaming. S1: a = b * c; S2: d = a + 1; S'3: newa = a * d. Dependence analysis: S1->S2 (data flow on a) holds; S2->S'3 (data flow on d) holds; S1->S'3 (data flow on a) holds; S1->S'3 (output dep. on a) removed; S2->S'3 (antidependence on a) removed.

Techniques for dependence removal contd.: Expansion. for i = 1 to n: S1: x = b(i) - 2; S2: y = c(i) * b(i); S3: a(i) = x + y. Dependence analysis: S1->S3 (data flow on x, same iteration); S2->S3 (data flow on y, same iteration); S1->S1 (output dep. on x, consecutive iterations); S2->S2 (output dep. on y, consecutive iterations); S3->S1 (antidep. on x, consecutive iterations); S3->S2 (antidep. on y, consecutive iterations).

Techniques for dependence removal contd.: Expansion (loop iterations become independent). for i = 1 to n: S'1: x(i) = b(i) - 2; S'2: y(i) = c(i) * b(i); S'3: a(i) = x(i) + y(i). Dependence analysis: S'1->S'3 (data flow on x) holds; S'2->S'3 (data flow on y) holds; the output dependences on x and y and the antidependences on x and y are removed. Only the within-iteration flow dependences remain, so different iterations can run in parallel (see the sketch below).
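
A minimal C rendering of the expanded loop; the array names follow the slide, while the OpenMP pragma and the dynamic allocation of the expanded temporaries are my additions to show that the iterations can now run in parallel.

```c
#include <stdlib.h>

/* Scalar expansion: the temporaries x and y become the arrays x[i], y[i],
 * so every iteration owns its own storage and iterations are independent. */
void expanded(const double *b, const double *c, double *a, int n)
{
    double *x = malloc(n * sizeof *x);   /* expanded temporaries */
    double *y = malloc(n * sizeof *y);

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        x[i] = b[i] - 2;        /* S'1 */
        y[i] = c[i] * b[i];     /* S'2 */
        a[i] = x[i] + y[i];     /* S'3 */
    }

    free(x);
    free(y);
}
```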

Techniques for dependence removal contd.: Splitting. for i = 1 to n: S1: a(i) = b(i) + c(i); S2: d(i) = a(i) + 2; S3: f(i) = d(i) + a(i+1). Dependence analysis: S1->S2 (data flow on a(i), same iteration); S2->S3 (data flow on d(i), same iteration); S3->S1 (antidep. on a(i+1), consecutive iterations). We have a cycle of dependences: S1->S2->S3->S1.

Techniques for dependence removal contd.: Splitting (cycle removed). for i = 1 to n: S0: newa(i) = a(i+1) // save the old value of a(i+1) before it is overwritten; S1: a(i) = b(i) + c(i); S2: d(i) = a(i) + 2; S'3: f(i) = d(i) + newa(i). Dependence analysis: S1->S2 (data flow on a(i), same iteration); S2->S'3 (data flow on d(i), same iteration); the antidep. S'3->S1 on a(i+1) is removed; S0->S'3 (data flow on a(i+1), same iteration); S0->S1 (antidep. on a(i+1), consecutive iterations). The dependence cycle is broken (see the sketch below).
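
A C sketch of the same idea, with one extra step not on the slide: the copy statement S0 is hoisted into its own loop (loop distribution), so that both resulting loops are fully parallel. The array a is assumed to have at least n+1 elements.

```c
/* Node splitting: save the old values of a(i+1) in newa first, so the
 * statement that read a(i+1) no longer depends on a later iteration.
 * Hoisting the copy into its own loop (loop distribution) breaks the
 * cycle and leaves two fully parallel loops.                          */
void split(const double *b, const double *c, double *a,
           double *d, double *f, double *newa, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)          /* S0: copy old a(i+1)        */
        newa[i] = a[i + 1];              /* a needs >= n+1 elements    */

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];              /* S1  */
        d[i] = a[i] + 2;                 /* S2  */
        f[i] = d[i] + newa[i];           /* S'3 */
    }
}
```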

Outline: Computation paradigms; Constraints in parallel computation; Dependence analysis; Techniques for dependence removal; Systolic architectures & simulations; Increase the parallel potential; Examples on GRID.

Systolic architectures & simulations. A systolic architecture is a special-purpose architecture: a pipe of processing units called cells. Advantages: faster, scalable. Limitations: expensive, highly specialized for particular applications, difficult to build — hence simulations.

Systolic architectures & simulations. Goals of simulation: verification and validation, tests and availability of solutions, efficiency, proof of concept.

Systolic architectures & simulations. Models of systolic architectures: many problems can be expressed in terms of systems of linear equations. Linear arrays (from special-purpose to multi-purpose architectures): computing the values of a polynomial, convolution product, vector-matrix multiplication. Mesh arrays: matrix multiplication, speech recognition (simplified mesh array).

Systolic architectures & simulations. All but one were simulated within a software model in PARLOG: easy to implement, ready to run, a specification language. Speech recognition was done in Pascal/C on a sub-mesh array: a vocabulary of thousands of words, real-time recognition.

Systolic architectures & simulations. PARLOG example (a cell of the convolution-product array):
cel([], [], _, _, [], []).
cel([X|Tx], [Y|Ty], B, W, [Xo|Txo], [Yo|Tyo]) <-
    Yo is Y + W*X,
    Xo is B,
    cel(Tx, Ty, X, W, Txo, Tyo).

Outline: Computation paradigms; Constraints in parallel computation; Dependence analysis; Techniques for dependence removal; Systolic architectures & simulations; Increase the parallel potential; Examples on GRID.

Program transformations. Loop transformations allow the execution of parallel loops; the techniques are borrowed from compiler techniques.

Program transformations contd.: Cycle shrinking. Applies when the dependence cycle of a loop involves dependences between iterations that are far apart (say, at distance d); the loop is then executed in parallel within that distance d (a C sketch follows below). for i = 1 to n: a(i+d) = b(i) - 1; b(i+d) = a(i) + c(i). This transforms into: for i1 = 1 to n step d: parallel for i = i1 to i1+d-1: a(i+d) = b(i) - 1; b(i+d) = a(i) + c(i).
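
A C sketch of the shrunk loop, assuming d is the minimum dependence distance and that the arrays a and b have at least n+d elements (0-based indices here).

```c
/* Cycle shrinking: all cross-iteration dependences have distance d, so
 * any d consecutive iterations are mutually independent; the outer loop
 * walks over blocks serially, the inner loop runs one block in parallel. */
void cycle_shrinking(double *a, double *b, const double *c, int n, int d)
{
    for (int i1 = 0; i1 < n; i1 += d) {              /* serial over blocks */
        #pragma omp parallel for
        for (int i = i1; i < i1 + d && i < n; i++) { /* parallel in block  */
            a[i + d] = b[i] - 1;                     /* a, b: n+d elements */
            b[i + d] = a[i] + c[i];
        }
    }
}
```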

Program transformations contd.: Loop interchanging. Consists of interchanging the loop indexes; it requires the new dependence vector to remain positive. for j = 1 to n: for i = 1 to n: a(i,j) = a(i-1,j) + b(i-2,j) — the inner loop cannot be vectorized (executed in parallel). After interchange: for i = 1 to n: parallel for j = 1 to n: a(i,j) = a(i-1,j) + b(i-2,j) (see the C sketch below).
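
In C with OpenMP the interchanged nest could look as follows; the array size N is a placeholder, and the i loop starts at 2 so that the 0-based references to i-1 and i-2 stay in bounds.

```c
#define N 1024                       /* placeholder matrix size */

double A[N][N], B[N][N];

/* Loop interchange: the dependence A[i][j] <- A[i-1][j] is carried only
 * by the i loop, so with i outermost the inner j loop carries no
 * dependence and can be vectorized / run in parallel.                  */
void interchanged(void)
{
    for (int i = 2; i < N; i++) {        /* serial: carries the dependence  */
        #pragma omp parallel for
        for (int j = 0; j < N; j++)      /* parallel: no carried dependence */
            A[i][j] = A[i - 1][j] + B[i - 2][j];
    }
}
```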

Program transformations contd.: Loop fusion. Transforms two adjacent loops into a single one, reducing loop overhead; it must not violate the ordering imposed by the data dependences. for i = 1 to n: a(i) = b(i) + c(i+1); then for i = 1 to n: c(i) = a(i-1) // write before read. Fused: for i = 1 to n: a(i) = b(i) + c(i+1); c(i) = a(i-1).

Program transformations contd.: Loop collapsing. Combines two nested loops into one, increasing the vector length. for i = 1 to n: for j = 1 to m: a(i,j) = a(i,j-1) + 1. Collapsed: for ij = 1 to n*m: i = ij div m; j = ij mod m; a(i,j) = a(i,j-1) + 1 (a C sketch with the exact index recovery follows below).
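
A C sketch of the collapsed loop with the exact 0-based index recovery; the guard on the first column is my addition to keep the references in bounds.

```c
/* Loop collapsing: the two nested loops become one loop over
 * ij = 0 .. n*m-1; the original indices are recovered with div/mod
 * (0-based, so i = ij / m and j = ij % m are exact).                */
void collapsed(double *a, int n, int m)    /* a: n x m, row-major */
{
    for (int ij = 0; ij < n * m; ij++) {
        int i = ij / m;
        int j = ij % m;
        if (j > 0)                         /* column 0 has no left neighbour */
            a[i * m + j] = a[i * m + (j - 1)] + 1;
    }
}
```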

Program transformations contd.: Loop partitioning. Partitions the loop into independent problems. for i = 1 to n: a(i) = a(i-g) + 1. The loop can be split into g independent chains, one for each k = 1..g: for i = k to n step g: a(i) = a(i-g) + 1. The g chains are independent of each other and can be executed in parallel (see the sketch below).
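
A C sketch of the partitioned loop; the array is shifted so that a[0..g-1] hold the initial values (an assumption to keep the 0-based indices in bounds). Each chain is sequential internally, but the g chains run in parallel with each other.

```c
/* Loop partitioning: a[i] depends on a[i-g], so indices with the same
 * residue modulo g form one chain; each chain is sequential internally,
 * but the g chains touch disjoint elements and run in parallel.        */
void partitioned(double *a, int n, int g)  /* a has n+g elements;
                                              a[0..g-1] hold initial values */
{
    #pragma omp parallel for
    for (int k = 0; k < g; k++)            /* one chain per residue class */
        for (int i = k + g; i < n + g; i += g)
            a[i] = a[i - g] + 1;
}
```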

Outline: Computation paradigms; Constraints in parallel computation; Dependence analysis; Techniques for dependence removal; Systolic architectures & simulations; Increase the parallel potential; Examples on GRID.

GRID. It provides access to computational power and storage capacity.

Examples on GRID: matrix multiplication, matrix power, Gaussian elimination, sorting, graph algorithms.

Examples on GRID: Matrix multiplication cont. Analysis 1D (absolute analysis). [Chart: Matrix Multiplication 1D — time (log scale) vs. number of processors (0-40), for matrix dimensions 100, 300, 700, 1000.] Optimal solution: p = f(n).

Examples on GRID: Matrix multiplication cont. Analysis 1D (relative analysis). [Chart: Matrix Multiplication 1D — time (%) vs. number of processors (0-40), for dimensions 100, 300, 700, 1000.] For small dimensions the computation time is small compared to the communication time; hence the 1D parallel implementation is efficient on very large matrix dimensions.

Examples on GRID: Matrix multiplication cont. Analysis 2D (absolute analysis). [Chart: Matrix Multiplication 2D — time (log scale) vs. number of processors (1, 4, 9, 16, 25, 36), for matrix sizes 100x100, 300x300, 700x700, 1000x1000.]

Examples on GRID: Matrix multiplication cont. Analysis 2D (relative analysis). [Chart: Matrix Multiplication 2D — time (%) vs. number of processors (0-40), for dimensions 100, 300, 700, 1000.]

Examples on GRID: Matrix multiplication cont. Comparison 1D-2D. [Chart: time (log scale) vs. number of processors (0-40), for the 1D and 2D decompositions at dimensions 100, 300, 700, 1000.]

Examples on GRID: Graph algorithms. Minimum Spanning Trees. [Figure: a weighted graph with 7 vertices; edge weights 7, 2, 3, 15, 1, 8, 10, 5, 7, 2, 5.] A graph with 7 vertices: the MST will have 6 edges with the minimum sum of weights. Possible solutions: MST weight = 20.

Examples on GRID: Graph algorithms cont. [Charts: MST absolute analysis (time, log scale) and MST relative analysis (time %) vs. number of processors (1-25), for graphs with 500, 1000, 2000, 5000 vertices.] Observe that the minimum shifts to the right as the graph's dimension increases, and that the running time improves by one order of magnitude (compared to the sequential approach) for graphs with n = 5000.

Examples on GRID: Graph algorithms cont. Transitive closure, source-partitioned: each vertex v_i is assigned to a process, and each process computes the shortest paths from the vertices in its set to every other vertex in the graph using the sequential algorithm (hence, data partitioning). Requires no inter-process communication (a sketch follows below). [Chart: Transitive closure, source-partitioned — time (log scale) vs. number of processors (1-20), for graphs with 200, 500, 1000, 2000 vertices.]
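
A minimal sketch of the source-partitioned formulation, assuming an adjacency-matrix representation and shared memory (OpenMP threads standing in for the processes of the slide); each source is explored by an ordinary sequential search, so the parallel loop needs no communication between iterations.

```c
#include <stdbool.h>
#include <string.h>

#define N 512                              /* number of vertices (placeholder) */

/* Source-partitioned transitive closure: the sources are split among the
 * workers; each worker runs an ordinary sequential reachability search
 * from its own sources, so no communication between workers is needed.  */
void transitive_closure(bool adj[N][N], bool reach[N][N])
{
    #pragma omp parallel for
    for (int s = 0; s < N; s++) {          /* each source handled independently */
        bool *r = reach[s];
        memset(r, 0, N * sizeof *r);
        int stack[N], top = 0;
        stack[top++] = s;
        r[s] = true;
        while (top > 0) {                  /* sequential DFS from source s */
            int u = stack[--top];
            for (int v = 0; v < N; v++)
                if (adj[u][v] && !r[v]) {
                    r[v] = true;
                    stack[top++] = v;
                }
        }
    }
}
```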

Examples on GRID: Graph algorithms cont. Transitive closure, source-parallel: assigns each vertex to a set of processes and uses the parallel formulation of the single-source algorithm (similar to MST). Requires inter-process communication. [Chart: Transitive closure, source-parallel — time (log scale) vs. number of processors (1-16), for graphs with 200, 500, 1000, 2000 vertices.]

Examples on GRID: Graph algorithms cont. Graph isomorphism: exponential running time! Search whether two graphs match; the trivial solution generates all n! permutations of one graph and matches each against the other graph. The n! permutations split into groups of (n-1)!, and each process is responsible for computing (n-1)! of them (data-parallel approach).

Examples on GRID: Graph algorithms cont. [Chart: Graph isomorphism — time (logarithmic scale) vs. number of processors (0-14), for graphs with 9, 10, 11, and 12 vertices.]

Conclusions. The need for parallel processing (why): the existence of problems that require a large amount of time to be solved (like NP-complete and NP-hard problems). The constraints in parallel processing (how): dependence analysis defines the boundaries of the problem. The particularities of the solution give the amount of parallelism (when): the parabolic shape of the running time, more pronounced when heavy communication is involved, and the shift of the optimal number of processors. How parallelization is done matters: running time decreases by up to 2 orders of magnitude.

Thank you for your attention! Questions?