Lecture 12: SIMD machines and data parallelism; dependency analysis for automatic vectorization and parallelization of serial programs

Part 1

SIMD: Single Instruction, Multiple Data
  Parallelism through simultaneous operations on different data
  Fine-grained parallelism
  Systolic arrays
  Parallel SIMD machines with 10k or more processors
  Vector/pipeline units

Systolic Array
  Network of processors with memory around it
  Gains performance by doing all computations before storing results back to memory
  Often a hardware implementation that solves one problem
  Special topologies
  [Figure: processor network surrounded by memory]

SIMD Machine
  Front end
    Ordinary von Neumann machine
    Runs the application program
  Processor array
    Synchronous: every processor performs the same operation at the same time, or is idle
    Extends the FPU's instructions
    Small memory per processor ("smart memory")
  I/O
  Examples: ILLIAC IV, IBM GF11, MasPar, CM200 (Bellman, 16k)
  [Figure: host and controller in front of the processor array]

Data Parallel Programming
  Idea: update the elements of an array at the same time
  Divides the work between the programmer and the compiler
  The programmer solves the problem in their model
    Concentrates on structure and concepts at a high level
    Collective operations on large data structures
    Keeps data in large arrays with mapping information
  The compiler maps the program onto a physical machine
    Fills in all the details (and gladly receives hints from the user)
    Optimizes computation and communication

Building Blocks in Data Parallel Programming
  The user controls the placement of data on processors
    Minimize communication; keep all processors busy
  Operations on whole arrays
    Apply one operation to each element of the array in parallel
  Methods to access parts of an array
    Operations can operate on these parts
    Example: where element < 0, element := 1
  Reduction operations on arrays
    Produce a result from a combination of many array elements: sum, max, min, ...
  Shift operations along the axes of multidimensional arrays
  Scan operations (prefix/suffix operations)
  Generalized communication
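The building blocks listed above can be imitated on ordinary Python lists. This is only a sequential sketch of the semantics, not a parallel implementation; the helper names `elementwise`, `masked_assign`, and `shift` are made up for this illustration.

```python
from itertools import accumulate
import operator


def elementwise(op, xs, ys):
    """Whole-array operation: apply op to each pair of elements."""
    return [op(x, y) for x, y in zip(xs, ys)]


def masked_assign(xs, mask, value):
    """Masked assignment: set value only at positions where the mask is true."""
    return [value if m else x for x, m in zip(xs, mask)]


def shift(xs, offset, fill=0):
    """Shift along the axis: result[i] = xs[i - offset], fill outside the array."""
    n = len(xs)
    return [xs[i - offset] if 0 <= i - offset < n else fill for i in range(n)]


data = [3, -1, 4, -1, 5]

doubled = elementwise(operator.add, data, data)          # operation on whole arrays
clamped = masked_assign(data, [x < 0 for x in data], 1)  # where element < 0, element := 1
total, largest = sum(data), max(data)                    # reductions
shifted = shift(data, 1)                                 # shift one position up
prefix = list(accumulate(data))                          # scan (prefix sums)
```

On a data parallel machine each of these calls would be a single collective operation over the distributed array; here the list comprehensions merely stand in for that.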

C*
  Supports broadcast, reduction, and interprocessor communication
  Parallel variables have a type and a shape
    The shape defines the number of elements and their organization
        shape [16384] employees;     /* 1-D */
        shape [512][512] image;      /* 2-D */
  Left indexing: indexing that refers to parallel variables
    The first dimension is axis 0, the second is axis 1, and so on
        int: employees employee_id;
    [2]employee_id refers to the third element of employee_id
        shape [16384] employees;
        struct date {
            int month;
            int day;
            int year;
        };
        struct date: employees birthday;
    Each element of the parallel variable birthday contains a date.
    birthday.month specifies all month fields in the parallel variable birthday.

C* - Parallel Operations
  Overloading
    x = y + z adds y and z in each position of the shape
  New operations (a and b may be scalar or parallel)
    a <? b    minimum of the two variables
    a >? b    maximum of the two variables
  Selection of shape (with)
        shape [16384] numbers;
        int: numbers x, y, z;
        with (numbers)
            x = y + z;

C* - Setting the Context
  where limits the positions at which the operation is performed
        with (numbers)
            where (z != 0)     /* sets active positions */
                x = y / z;
            else               /* reverses active positions */
                x = y;
  everywhere makes all positions active, independently of the earlier context

C* - Communication
  Grid communication
    pcoord (roughly "my node") gives my index along an axis of the shape
    Example: send the value of source to the element of dest that is one position higher up
        [pcoord(0) + 1]dest = source;
    A dot (.) is sometimes used instead of pcoord
        [. + 1]dest = source;
        [. + 1][. - 2]dest = source;

Compute Pi in C*
  $\pi \approx \frac{1}{N} \sum_{i=0}^{N-1} \frac{4}{1 + x_i^2}$, where $x_i = (i + 1/2)/N$

        #define N ...
        shape [N] chunk;
        double: chunk x;

        main()
        {
            double sum;
            double width;

            width = 1.0 / N;
            with (chunk) {                          /* in parallel */
                x = (pcoord(0) + 0.5) * width;
                sum = (+= (4.0 / (1.0 + x * x)));
            }
            sum = sum * width;
            printf("Estimate of Pi = %14.12f\n", sum);
        }
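For readers without a C* compiler, the same midpoint-rule estimate of Pi can be written as a short sequential Python sketch. The function name `estimate_pi` and the choice of 1000 intervals are assumptions made for this illustration; the generator expression plays the role of the parallel variable x, and `sum` plays the role of the `+=` reduction.

```python
import math


def estimate_pi(n):
    """Midpoint rule for the integral of 4 / (1 + x^2) over [0, 1]."""
    width = 1.0 / n
    # x_i = (i + 1/2) / n, one value per "position" of the shape
    xs = ((i + 0.5) * width for i in range(n))
    # the reduction: add up the contributions of all positions
    total = sum(4.0 / (1.0 + x * x) for x in xs)
    return total * width


pi_estimate = estimate_pi(1000)
print(f"Estimate of Pi = {pi_estimate:14.12f}")
print(f"Error          = {abs(pi_estimate - math.pi):.3e}")
```

The error of the midpoint rule shrinks roughly with the square of the number of intervals, so the estimate improves quickly as n grows.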

Compute Partial Sums in an Array (C*)

        #define N 1024
        shape [N] ArrayShape;
        int: ArrayShape x;
        int i;

        main()
        {
            with (ArrayShape)                                   /* select shape */
                for (i = 0; i < log2(N); i++)
                    where (pcoord(0) >= pow(2, i))              /* active positions */
                        x += [pcoord(0) - pow(2, i)]x;          /* left indexing */
        }

High Performance Fortran (HPF)
  Data parallel language (many similarities to CM Fortran)
  For SIMD and MIMD (NUMA) machines
  Based on Fortran 90 (Fortran 77)
    Array operations
    User-defined data types
    Recursion and dynamic memory allocation
    Pointers
  Control of data distribution
  Parallel constructs
    Data mapping directives
    FORALL statements and constructs
    INDEPENDENT directive, etc.
  [Figure: HPF -> F77 + message passing (SPMD) -> executable file]

The PROCESSORS directive
  Declares an abstract processor arrangement onto which data is mapped
  Each element of this arrangement corresponds to a node of the physical machine
  The declarations are often parametrized with the intrinsic function NUMBER_OF_PROCESSORS
        !HPF$ PROCESSORS p(NUMBER_OF_PROCESSORS()/2, 2)
  The directive has the form of a comment

The DISTRIBUTE directive
  Controls the mapping of data onto processors
  BLOCK distribution
    Each processor stores a consecutive block of the array
        REAL a(16)
        !HPF$ PROCESSORS p(4)
        !HPF$ DISTRIBUTE a(BLOCK) ONTO p
  BLOCK, BLOCK distribution
    For multidimensional arrays, separate blocking in each dimension
        REAL a(7,7)
        !HPF$ PROCESSORS p(2,2)
        !HPF$ DISTRIBUTE a(BLOCK, BLOCK) ONTO p
  CYCLIC distribution
        REAL a(16)
        !HPF$ PROCESSORS p(4)
        !HPF$ DISTRIBUTE a(CYCLIC) ONTO p
  CYCLIC, BLOCK distribution
    It is not necessary to have the same distribution in all dimensions
        REAL a(7,7)
        !HPF$ PROCESSORS p(2,2)
        !HPF$ DISTRIBUTE a(CYCLIC, BLOCK) ONTO p
  CYCLIC, CYCLIC distribution
        REAL a(7,7)
        !HPF$ PROCESSORS p(2,2)
        !HPF$ DISTRIBUTE a(CYCLIC, CYCLIC) ONTO p
  BLOCK, CYCLIC distribution
        !HPF$ DISTRIBUTE a(BLOCK, CYCLIC) ONTO p
  [Figures: array elements marked with their owning processors P1 to P4 for each distribution]
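Two ideas from these slides are easy to check by hand in Python: the recursive-doubling partial sums, and which processor owns an element under BLOCK and CYCLIC distribution. The sketch below is sequential and the function names are invented for this example. It assumes 1-based element and processor numbers as in the Fortran declarations, and a block size of ceil(n / number of processors) for BLOCK.

```python
def inclusive_scan(values):
    """Partial sums by recursive doubling, about log2(n) rounds."""
    x = list(values)
    n = len(x)
    step = 1
    while step < n:
        # Every active position (p >= step) adds the value `step` places to its left.
        # Building a new list mimics "all positions update at the same time".
        x = [x[p] + x[p - step] if p >= step else x[p] for p in range(n)]
        step *= 2
    return x


def block_owner(index, n, nprocs):
    """Processor (1-based) holding element `index` of an n-element BLOCK array."""
    block = -(-n // nprocs)  # ceiling division
    return (index - 1) // block + 1


def cyclic_owner(index, nprocs):
    """Processor (1-based) holding element `index` under CYCLIC distribution."""
    return (index - 1) % nprocs + 1


print(inclusive_scan([3, 1, 4, 1, 5]))
print([block_owner(i, 16, 4) for i in range(1, 17)])
print([cyclic_owner(i, 4) for i in range(1, 17)])
```

For `REAL a(16)` on `p(4)`, BLOCK gives each processor four consecutive elements, while CYCLIC deals the elements out round-robin.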

The ALIGN directive
  Describes mapping relations between interacting objects
  Both objects are allocated on the same processor
        REAL a(6), b(6)
        !HPF$ ALIGN a(i) WITH b(i)
    a(1)  a(2)  a(3)  a(4)  a(5)  a(6)
    b(1)  b(2)  b(3)  b(4)  b(5)  b(6)
        REAL a(4,4), b(4,10)
        !HPF$ ALIGN a(i,j) WITH b(i, 2*j+1)
    The elements of a are placed with
    b(1,3)  b(1,5)  b(1,7)  b(1,9)
    b(2,3)  b(2,5)  b(2,7)  b(2,9)
    b(3,3)  b(3,5)  b(3,7)  b(3,9)
    b(4,3)  b(4,5)  b(4,7)  b(4,9)

Example: Simple Matrix Multiplication

        PROGRAM ABmult
          INTEGER, PARAMETER :: N = 100
          INTEGER, DIMENSION (N,N) :: A, B, C
          INTEGER :: i, j
        !HPF$ PROCESSORS SQ(2,2)
        !HPF$ DISTRIBUTE C(BLOCK,BLOCK) ONTO SQ
        !HPF$ ALIGN A(i,*) WITH C(i,*)
        ! replicate copies of row A(i,*) onto processors which compute C(i,j)
        !HPF$ ALIGN B(*,j) WITH C(*,j)
        ! replicate copies of column B(*,j) onto processors which compute C(i,j)
          A = 1
          B = 2
          C = 0
          DO i = 1, N
            DO j = 1, N
              ! All the work is local due to ALIGNs
              C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
            END DO
          END DO
        END
  [Figure: layout of C, A, and B on the processors]

The FORALL statement
  Generalization of array assignment and masked array assignment (NOT a loop)
  Single-statement FORALL
        FORALL (index, mask) forall-assignment
    Equivalent to array assignment in Fortran 90
      For every index, evaluate the mask
      Compute the right-hand side for the unmasked values
      Carry out the assignments to the left-hand side
  Multiple-statement FORALL
        FORALL (index, mask)
          forall-body-list
        END FORALL
    A forall-body can be a FORALL, a WHERE, or an ordinary forall-assignment
    Abbreviation of a series of single-statement FORALLs

The INDEPENDENT directive
  States that no iteration affects any other iteration in any way
  Gives the compiler extra information about the execution of a DO or FORALL
  Applied to DO: states that there are no loop-carried dependencies
  Applied to FORALL: states that no index value assigns to a location that is used for any other index value
        !HPF$ INDEPENDENT
        DO I = 1, N
          A(INDX(I)) = B(I)
        END DO

The INDEPENDENT directive (example)
  Assume that R1(3) and R2(1) take longer because of communication
  Without the directive, every step is completed for all indices before the next step starts:
        FORALL (I=1:3)
          L1(I) = R1(I)
          L2(I) = R2(I)
        END FORALL
    R1(1) R1(2) R1(3) | Sync | L1(1) L1(2) L1(3) | Sync | R2(1) R2(2) R2(3) | Sync | L2(1) L2(2) L2(3)
  With the directive, each index proceeds on its own, and time is gained:
        !HPF$ INDEPENDENT
        FORALL (I=1:3)
          L1(I) = R1(I)
          L2(I) = R2(I)
        END FORALL
    I = 1:  R1(1)  L1(1)  R2(1)  L2(1)
    I = 2:  R1(2)  L1(2)  R2(2)  L2(2)
    I = 3:  R1(3)  L1(3)  R2(3)  L2(3)

Game of LIFE

        INTEGER LIFE(64,64), NCOUNT(64,64)
        !HPF$ ALIGN LIFE WITH NCOUNT
        !HPF$ DISTRIBUTE LIFE(BLOCK, BLOCK)
        ...
        ! INIT LIFE
        ...
        NCOUNT = 0
        DO M = 1, NUMBER_OF_GENERATIONS
          FORALL (I=2:63, J=2:63)
            NCOUNT(I,J) = SUM(LIFE(I-1:I+1,J-1:J+1)) - LIFE(I,J)
          END FORALL
          ! Create next generation
          WHERE ((LIFE.EQ.0).AND.(NCOUNT.EQ.3))
            LIFE = 1
          END WHERE
          WHERE ((LIFE.EQ.1).AND.(NCOUNT.NE.2).AND.(NCOUNT.NE.3))
            LIFE = 0
          END WHERE
        END DO
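One generation of the Game of LIFE slide can be traced in plain Python. This is a sequential sketch, not HPF: the function name `life_step` and the 5 x 5 test grid are assumptions for this example, and it applies the usual Life rules (a dead cell with exactly three live neighbours is born; a live cell survives only with two or three). As in the FORALL, all neighbour counts are computed from the old grid before any cell is changed, and border cells keep a count of zero.

```python
def life_step(life):
    """Return the next generation of a rectangular 0/1 grid."""
    rows, cols = len(life), len(life[0])

    # FORALL part: neighbour counts for interior cells, all read from the old grid.
    ncount = [[0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            block = sum(life[a][b] for a in (i - 1, i, i + 1) for b in (j - 1, j, j + 1))
            ncount[i][j] = block - life[i][j]

    # WHERE part: masked updates that create the next generation.
    nxt = [row[:] for row in life]
    for i in range(rows):
        for j in range(cols):
            if life[i][j] == 0 and ncount[i][j] == 3:
                nxt[i][j] = 1  # birth
            elif life[i][j] == 1 and ncount[i][j] not in (2, 3):
                nxt[i][j] = 0  # death by isolation or overcrowding
    return nxt


# A "blinker": three cells in a row that flip between horizontal and vertical.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
for row in life_step(blinker):
    print(row)
```

Computing every count before any update is what separates FORALL from an ordinary DO loop that overwrites cells as it goes.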

Summary
  Data parallelism
    Scalable
    Data parallel programming is simpler than message passing
  Data parallel languages
    C*, CM Fortran, HPF
    SIMD style: single program, single instruction flow
    SPMD style: single program, multiple data; different instruction flows locally
  Machines: SIMD (CM2, MasPar, ...) or MIMD with SPMD programming

Lecture 12b: Dependency analysis for automatic vectorization and parallelization of serial programs

Automatic parallelization
  Loops are the largest source of parallelism
  Loop parallelization
    Different iterations on different processors
    Different tasks within an iteration on different processors
  Vectorization/pipelining
    Pipeline: breaks instructions down into substeps that are overlapped
    Vector: the pipelined instructions are carried out on a vector register of fixed length

Content
  Vector hardware
  Data dependency analysis
    Dependency graphs
    Dependency tests
  Vectorization
    Standard transformations
    Vector code generation
  Parallelization
    Loop scheduling

Vector Supercomputer (Register-to-Register)
  [Figure: host computer, mass storage, and I/O; main memory (program and data); control units for scalar and vector instructions; registers and pipes]

Transformation of a loop to a sequence of vector instructions
        do I = 1, N
          C(I) = A(I) + B(I)
        end do
  Vectorization gives
        C[1:N] = A[1:N] + B[1:N]
  Vector instructions
              L      G0, N          Load vector length N
              LA     G3, C          Load address of C
              LA     G2, B          Load address of B
              LA     G1, A          Load address of A
        LOOP  VLVCU  G0             Set up loop for 128 elements
              VLD    V1, G1         Load 128 elements of A into V1
              VLD    V2, G2         Load 128 elements of B into V2
              VAD    V3, V1, V2     A + B -> V3
              VSTD   V3, G3         V3 -> C
              BC     2, LOOP        If more elements, loop
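The vector instruction sequence above processes the loop in strips of 128 elements. A rough Python sketch of that strip-mining pattern follows; the constant `VECTOR_LENGTH`, the function name `vector_add`, and the returned strip count are assumptions for illustration, and the list slices only stand in for vector registers.

```python
VECTOR_LENGTH = 128  # register length used on the slide


def vector_add(a, b):
    """Compute C = A + B one strip of at most VECTOR_LENGTH elements at a time."""
    n = len(a)
    c = [0] * n
    strips = 0
    for start in range(0, n, VECTOR_LENGTH):       # LOOP / VLVCU: set up the next strip
        end = min(start + VECTOR_LENGTH, n)
        v1 = a[start:end]                          # VLD  V1: load a strip of A
        v2 = b[start:end]                          # VLD  V2: load a strip of B
        v3 = [x + y for x, y in zip(v1, v2)]       # VAD  V3 = V1 + V2
        c[start:end] = v3                          # VSTD: store the strip into C
        strips += 1                                # BC: loop while elements remain
    return c, strips


a = list(range(300))
b = [2 * x for x in a]
c, strips = vector_add(a, b)
print(strips, c[:5], c[-1])
```

With 300 elements the loop body runs three times: two full strips of 128 and a final strip of 44.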

Speedup, Expected Speedup

  Scalar loop
        do I = 1, N
          C(I) = A(I) + B(I)
        end do

        Instruction                        Cycles
        Load A(I) into a register             1
        Load B(I) into a register             1
        ADD A(I) + B(I)                       3
        Store C(I) from a register            1
        Decrement counter                     1
        Total for length 128              7 * 128

  Vectorized loop
        C[1:N] = A[1:N] + B[1:N]

        Instruction                        Cycles
        Load A(1:128)                       128
        Load B(1:128)                       128
        ADD A(1:128) + B(1:128)             128
        Store C(1:128)                      128
        Total for length 128              4 * 128

  Speedup = (7 * 128) / (4 * 128) = 7/4 = 1.75

What can be vectorized?
  Only DO (for) loops can be vectorized
  Only one loop in a loop nest can be vectorized
  Vectorizable loops may NOT contain
    Data dependencies
    Jumps in or out, ENTRY, or STOP
    Loop variables other than integers
    I/O statements
    Side effects
    Calls to external subprograms
  In some cases the compiler can rewrite the loop and then vectorize it partially

Different Types of Dependencies
  True (flow) dependence: a variable is defined before it is used (DEF, USE)
        S1: A = B + C
        S2: D = A + 2
        S3: E = A * 3
    (S1 δ^t S2, S1 δ^t S3)
  Anti dependence: a variable is used before it is defined
        S1: A = B + C
        S2: B = X * 3
    (S1 δ^a S2)
  Output dependence: a variable is assigned a value several times
        S1: A = B + C
        S2: A = X * 3
    (S1 δ^o S2)

Data Dependency
  Execution order
    S(i, j, k) << S'(i', j', k') iff (i, j, k) < (i', j', k')
  Input and output sets
    DEF(S) = the set of all variables defined by the statement S
    USE(S) = the set of all variables used by the statement S
  There is a data dependency between two statements S and T (S δ T) if
    S << T,
    there exists a variable v such that
      v is in both DEF(S) and USE(T), or
      v is in both USE(S) and DEF(T), or
      v is in both DEF(S) and DEF(T),
    and there does not exist a statement SI such that S << SI << T and v is in DEF(SI)

Data Dependency in Loops
  Independent loops: no iteration depends on data from any other iteration
  Dependent loops: statement S is dependent on statement S_k if the execution of S_k must occur before the execution of S
  Loop-carried dependency: the dependency depends on a loop index
  Loop-independent dependency: the dependency does not depend on a loop index

Basic Concepts
  Iteration vector
    Points to a specific iteration of a loop nest: i = (i_1, i_2, ..., i_n), where i_1 belongs to the outermost loop
  Distance vector
    The distance between two iteration vectors, i' - i
  Dependency distance vector
    If S and S' are instances of statements in a loop nest and S(i) δ S'(i'), then the dependency distance vector is dist(i, i') = i' - i
  Dependency direction vector
    The same as the dependency distance vector, but only the direction is shown: (<, =, >) corresponds to (+, 0, -)
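The DEF/USE definition can be turned into a small classifier. The sketch below is a simplification: it assumes S executes before T and that no statement between them redefines the shared variable, so the last condition of the definition is not checked. The function name and the string labels are made up for this example.

```python
def classify_dependences(def_s, use_s, def_t, use_t):
    """Return the kinds of dependence from statement S to a later statement T."""
    kinds = set()
    if def_s & use_t:
        kinds.add("flow")    # defined in S, used in T    (S delta^t T)
    if use_s & def_t:
        kinds.add("anti")    # used in S, redefined in T  (S delta^a T)
    if def_s & def_t:
        kinds.add("output")  # defined in both S and T    (S delta^o T)
    return kinds


# S1: A = B + C    S2: D = A + 2
flow = classify_dependences({"A"}, {"B", "C"}, {"D"}, {"A"})
# S1: A = B + C    S2: B = X * 3
anti = classify_dependences({"A"}, {"B", "C"}, {"B"}, {"X"})
# S1: A = B + C    S2: A = X * 3
output = classify_dependences({"A"}, {"B", "C"}, {"A"}, {"X"})
# S1: A = B + C    S2: D = X * 3 (no shared variables, so no dependence)
none = classify_dependences({"A"}, {"B", "C"}, {"D"}, {"X"})

print(flow, anti, output, none)
```

A full dependence test would also track statements executed between S and T, and for array references it would compare subscripts rather than whole variable names.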

Dependency Distance, Distance and Direction Vectors
  Loop-carried dependency
        S1: A(i) = B(i) + C(i)
        S2: D(i) = A(i-1)
    i = 2:  S1: A(2) = B(2) + C(2)    S2: D(2) = A(1)
    i = 3:  S1: A(3) = B(3) + C(3)    S2: D(3) = A(2)
    DEF, USE -> S1 δ^t S2, distance 3 - 2 = 1, direction <
  Loop-independent dependency
        S1: A(i) = B(i) + C(i)
        S2: D(i) = A(i)
    i = 2:  S1: A(2) = B(2) + C(2)    S2: D(2) = A(2)
    i = 3:  S1: A(3) = B(3) + C(3)    S2: D(3) = A(3)
    DEF, USE -> S1 δ^t S2, distance 2 - 2 = 0, direction =

Representation of Data Dependency
  Dependency graph
    A directed graph G(V, E), where V is a set of statements and E is a set of edges representing dependencies
  Dependency cycles
    Dependencies starting and ending at the same statement S
  Example
        S1: A = B + E
        S2: B = C
        S3: C = A
    V = {S1, S2, S3}
    E = {(S1, S2), (S1, S3), (S2, S3)}
    S1 δ^a S2, S2 δ^a S3, S1 δ^t S3

Loop Dependencies
        S2: D(i) = A(i-1)
        S2: D(i) = A(i+1)
        do j = 2, 99
          S1: A(i+1,j-1) = A(i,j) + C(i,j)

Review questions
  Which dependencies exist in the code snippets on the previous page?
  Direction vectors?
  What do the dependency graphs look like?
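For a single loop, the distance and direction of a dependence can be found by brute force: enumerate the iterations and compare the element written by S1 with the element read by S2. The sketch below assumes that S1 (the write to A) comes before S2 (the read of A) in the loop body; the function name, the loop bounds 2 to 10, and the tuple layout are choices made for this example only.

```python
def loop_dependences(write_index, read_index, lo, hi):
    """Find dependences between S1: A(write_index(i)) = ... and S2: ... = A(read_index(i)).

    Returns a set of (kind, distance, direction) tuples.
    """
    found = set()
    for i in range(lo, hi + 1):          # iteration that writes
        for j in range(lo, hi + 1):      # iteration that reads
            if write_index(i) != read_index(j):
                continue
            if i < j:
                found.add(("flow", j - i, "<"))   # written first, read in a later iteration
            elif i > j:
                found.add(("anti", i - j, "<"))   # read first, overwritten in a later iteration
            else:
                found.add(("flow", 0, "="))       # same iteration, S1 runs before S2
    return found


# S1: A(i) = B(i) + C(i)    S2: D(i) = A(i-1)
carried = loop_dependences(lambda i: i, lambda i: i - 1, 2, 10)
# S1: A(i) = B(i) + C(i)    S2: D(i) = A(i)
independent = loop_dependences(lambda i: i, lambda i: i, 2, 10)
# S1: A(i) = B(i) + C(i)    S2: D(i) = A(i+1)
anti = loop_dependences(lambda i: i, lambda i: i + 1, 2, 10)

print(carried, independent, anti)
```

Real compilers do not enumerate iterations; they solve the subscript equations symbolically with dependency tests. The enumeration is only practical for checking small examples like the review questions.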

CSE 262 Spring Scott B. Baden. Lecture 4 Data parallel programming

CSE 262 Spring Scott B. Baden. Lecture 4 Data parallel programming CSE 262 Spring 2007 Scott B. Baden Lecture 4 Data parallel programming Announcements Projects Project proposal - Weds 4/25 - extra class 4/17/07 Scott B. Baden/CSE 262/Spring 2007 2 Data Parallel Programming

More information

Parallel Programming. March 15,

Parallel Programming. March 15, Parallel Programming March 15, 2010 1 Some Definitions Computational Models and Models of Computation real world system domain model - mathematical - organizational -... computational model March 15, 2010

More information

High Performance Fortran Kwai Lam Wong 1 Overview HPF : High Performance FORTRAN A language specification standard by High Performance FORTRAN Forum (HPFF), a

More information

HPF commands specify which processor gets which part of the data. Concurrency is defined by HPF commands based on Fortran90

HPF commands specify which processor gets which part of the data. Concurrency is defined by HPF commands based on Fortran90 149 Fortran and HPF 6.2 Concept High Performance Fortran 6.2 Concept Fortran90 extension SPMD (Single Program Multiple Data) model each process operates with its own part of data HPF commands specify which

More information

High Performance Fortran. James Curry

High Performance Fortran. James Curry High Performance Fortran James Curry Wikipedia! New Fortran statements, such as FORALL, and the ability to create PURE (side effect free) procedures Compiler directives for recommended distributions of

More information

Synchronous Computation Examples. HPC Fall 2008 Prof. Robert van Engelen

Synchronous Computation Examples. HPC Fall 2008 Prof. Robert van Engelen Synchronous Computation Examples HPC Fall 2008 Prof. Robert van Engelen Overview Data parallel prefix sum with OpenMP Simple heat distribution problem with OpenMP Iterative solver with OpenMP Simple heat

More information

Loop Transformations! Part II!

Loop Transformations! Part II! Lecture 9! Loop Transformations! Part II! John Cavazos! Dept of Computer & Information Sciences! University of Delaware!! Loop Unswitching Hoist invariant control-flow

More information

Synchronous Shared Memory Parallel Examples. HPC Fall 2012 Prof. Robert van Engelen

Synchronous Shared Memory Parallel Examples. HPC Fall 2012 Prof. Robert van Engelen Synchronous Shared Memory Parallel Examples HPC Fall 2012 Prof. Robert van Engelen Examples Data parallel prefix sum and OpenMP example Task parallel prefix sum and OpenMP example Simple heat distribution

More information

Synchronous Shared Memory Parallel Examples. HPC Fall 2010 Prof. Robert van Engelen

Synchronous Shared Memory Parallel Examples. HPC Fall 2010 Prof. Robert van Engelen Synchronous Shared Memory Parallel Examples HPC Fall 2010 Prof. Robert van Engelen Examples Data parallel prefix sum and OpenMP example Task parallel prefix sum and OpenMP example Simple heat distribution

More information

Module 18: Loop Optimizations Lecture 35: Amdahl s Law. The Lecture Contains: Amdahl s Law. Induction Variable Substitution.

Module 18: Loop Optimizations Lecture 35: Amdahl s Law. The Lecture Contains: Amdahl s Law. Induction Variable Substitution. The Lecture Contains: Amdahl s Law Induction Variable Substitution Index Recurrence Loop Unrolling Constant Propagation And Expression Evaluation Loop Vectorization Partial Loop Vectorization Nested Loops

More information

Lecture V: Introduction to parallel programming with Fortran coarrays

Lecture V: Introduction to parallel programming with Fortran coarrays Lecture V: Introduction to parallel programming with Fortran coarrays What is parallel computing? Serial computing Single processing unit (core) is used for solving a problem One task processed at a time

More information

Compiling for Advanced Architectures

Compiling for Advanced Architectures Compiling for Advanced Architectures In this lecture, we will concentrate on compilation issues for compiling scientific codes Typically, scientific codes Use arrays as their main data structures Have

More information

Parallelizing The Matrix Multiplication. 6/10/2013 LONI Parallel Programming Workshop

Parallelizing The Matrix Multiplication. 6/10/2013 LONI Parallel Programming Workshop Parallelizing The Matrix Multiplication 6/10/2013 LONI Parallel Programming Workshop 2013 1 Serial version 6/10/2013 LONI Parallel Programming Workshop 2013 2 X = A md x B dn = C mn d c i,j = a i,k b k,j

More information

Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville

Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information

More information


Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT Autotuning John Cavazos University of Delaware What is Autotuning? Searching for the best code parameters, code transformations, system configuration settings, etc. Search can be Quasi-intelligent: genetic

More information

CSL 730: Parallel Programming. Algorithms

CSL 730: Parallel Programming. Algorithms CSL 73: Parallel Programming Algorithms First 1 problem Input: n-bit vector Output: minimum index of a 1-bit First 1 problem Input: n-bit vector Output: minimum index of a 1-bit Algorithm: Divide into

More information

CSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization

CSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization CSE 160 Lecture 10 Instruction level parallelism (ILP) Vectorization Announcements Quiz on Friday Signup for Friday labs sessions in APM 2013 Scott B. Baden / CSE 160 / Winter 2013 2 Particle simulation

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms [ 9 ] Shared Memory Performance Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture

More information

Lecture 4. Instruction Level Parallelism Vectorization, SSE Optimizing for the memory hierarchy

Lecture 4. Instruction Level Parallelism Vectorization, SSE Optimizing for the memory hierarchy Lecture 4 Instruction Level Parallelism Vectorization, SSE Optimizing for the memory hierarchy Partners? Announcements Scott B. Baden / CSE 160 / Winter 2011 2 Today s lecture Why multicore? Instruction

More information

Light HPF for PC Clusters

Light HPF for PC Clusters Light HPF for PC Clusters Hidetoshi Iwashita Fujitsu Limited November 12, 2004 2 Background Fujitsu had developed HPF compiler product. For VPP5000, a distributed-memory vector computer.

More information

Parallel & Concurrent Programming: ZPL. Emery Berger CMPSCI 691W Spring 2006 AMHERST. Department of Computer Science UNIVERSITY OF MASSACHUSETTS

Parallel & Concurrent Programming: ZPL. Emery Berger CMPSCI 691W Spring 2006 AMHERST. Department of Computer Science UNIVERSITY OF MASSACHUSETTS Parallel & Concurrent Programming: ZPL Emery Berger CMPSCI 691W Spring 2006 Department of Computer Science Outline Previously: MPI point-to-point & collective Complicated, far from problem abstraction

More information

Parallel Processing: October, 5, 2010

Parallel Processing: October, 5, 2010 Parallel Processing: Why, When, How? SimLab2010, Belgrade October, 5, 2010 Rodica Potolea Parallel Processing Why, When, How? Why? Problems too costly to be solved with the classical approach The need

More information

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning

More information

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning

More information

Module 16: Data Flow Analysis in Presence of Procedure Calls Lecture 32: Iteration. The Lecture Contains: Iteration Space.

Module 16: Data Flow Analysis in Presence of Procedure Calls Lecture 32: Iteration. The Lecture Contains: Iteration Space. The Lecture Contains: Iteration Space Iteration Vector Normalized Iteration Vector Dependence Distance Direction Vector Loop Carried Dependence Relations Dependence Level Iteration Vector - Triangular

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical

More information

Performance Issues in Parallelization Saman Amarasinghe Fall 2009

Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian University of Southern California 1 Outline From last class

More information


Blocking SEND/RECEIVE Message Passing Blocking SEND/RECEIVE : couple data transfer and synchronization - Sender and receiver rendezvous to exchange data P P SrcP... x : =... SEND(x, DestP)... DestP... RECEIVE(y,SrcP)... M F

More information

CS 293S Parallelism and Dependence Theory

CS 293S Parallelism and Dependence Theory CS 293S Parallelism and Dependence Theory Yufei Ding Reference Book: Optimizing Compilers for Modern Architecture by Allen & Kennedy Slides adapted from Louis-Noël Pouche, Mary Hall End of Moore's Law

More information

Enhancing Parallelism

Enhancing Parallelism CSC 255/455 Software Analysis and Improvement Enhancing Parallelism Instructor: Chen Ding Chapter 5,, Allen and Kennedy Where Does Vectorization Fail? procedure vectorize

More information

Numerical Algorithms

Numerical Algorithms Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

1. (a) O(log n) algorithm for finding the logical AND of n bits with n processors

1. (a) O(log n) algorithm for finding the logical AND of n bits with n processors 1. (a) O(log n) algorithm for finding the logical AND of n bits with n processors on an EREW PRAM: See solution for the next problem. Omit the step where each processor sequentially computes the AND of

More information

Data Dependence Analysis

Data Dependence Analysis CSc 553 Principles of Compilation 33 : Loop Dependence Data Dependence Analysis Department of Computer Science University of Arizona Copyright c 2011 Christian Collberg Data Dependence

More information

Performance Issues in Parallelization. Saman Amarasinghe Fall 2010

Performance Issues in Parallelization. Saman Amarasinghe Fall 2010 Performance Issues in Parallelization Saman Amarasinghe Fall 2010 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries

More information

Introduction to parallel computing. Seminar Organization

Introduction to parallel computing. Seminar Organization Introduction to parallel computing Rami Melhem Department of Computer Science 1 Seminar Organization 1) Introductory lectures (probably 4) 2) aper presentations by students (2/3 per short/long class) -

More information

Transportation problem

Transportation problem Transportation problem It is a special kind of LPP in which goods are transported from a set of sources to a set of destinations subjects to the supply and demand of the source and destination, respectively,

More information

CS 2461: Computer Architecture 1

CS 2461: Computer Architecture 1 Next.. : Computer Architecture 1 Performance Optimization CODE OPTIMIZATION Code optimization for performance A quick look at some techniques that can improve the performance of your code Rewrite code

More information

HPF High Performance Fortran

HPF High Performance Fortran Table of Contents 270 Introduction to Parallelism Introduction to Programming Models Shared Memory Programming Message Passing Programming Shared Memory Models Cilk TBB HPF -- influential but failed Chapel

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical

More information

CSL 730: Parallel Programming

CSL 730: Parallel Programming CSL 73: Parallel Programming General Algorithmic Techniques Balance binary tree Partitioning Divid and conquer Fractional cascading Recursive doubling Symmetry breaking Pipelining 2 PARALLEL ALGORITHM

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming Outline OpenMP Shared-memory model Parallel for loops Declaring private variables Critical sections Reductions

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #4 1/24/2018 Xuehai Qian University of Southern California 1 Announcements PA #1

More information

Vector an ordered series of scalar quantities a one-dimensional array. Vector Quantity Data Data Data Data Data Data Data Data

Vector an ordered series of scalar quantities a one-dimensional array. Vector Quantity Data Data Data Data Data Data Data Data Vector Processors A vector processor is a pipelined processor with special instructions designed to keep the (floating point) execution unit pipeline(s) full. These special instructions are vector instructions.

More information

Auto-Vectorization with GCC

Auto-Vectorization with GCC Auto-Vectorization with GCC Hanna Franzen Kevin Neuenfeldt HPAC High Performance and Automatic Computing Seminar on Code-Generation Hanna Franzen, Kevin Neuenfeldt (RWTH) Auto-Vectorization with GCC Seminar

More information

Example of a Parallel Algorithm

Example of a Parallel Algorithm -1- Part II Example of a Parallel Algorithm Sieve of Eratosthenes -2- -3- -4- -5- -6- -7- MIMD Advantages Suitable for general-purpose application. Higher flexibility. With the correct hardware and software

More information

High Performance Computing Lecture 41. Matthew Jacob Indian Institute of Science

High Performance Computing Lecture 41. Matthew Jacob Indian Institute of Science High Performance Computing Lecture 41 Matthew Jacob Indian Institute of Science Example: MPI Pi Calculating Program /Each process initializes, determines the communicator size and its own rank MPI_Init

More information

Essential constraints: Data Dependences. S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2

Essential constraints: Data Dependences. S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2 Essential constraints: Data Dependences S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2 Essential constraints: Data Dependences S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2 S2

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

In context with optimizing Fortran 90 code it would be very helpful to have a selection of

In context with optimizing Fortran 90 code it would be very helpful to have a selection of 1 ISO/IEC JTC1/SC22/WG5 N1186 03 June 1996 High Performance Computing with Fortran 90 Qualiers and Attributes In context with optimizing Fortran 90 code it would be very helpful to have a selection of

More information

Embedded Systems Design with Platform FPGAs

Embedded Systems Design with Platform FPGAs Embedded Systems Design with Platform FPGAs Spatial Design Ron Sass and Andrew G. Schmidt rsass University of North Carolina at Charlotte Spring 2011 Embedded Systems Design with

More information

Overpartioning with the Rice dhpf Compiler

Overpartioning with the Rice dhpf Compiler Overpartioning with the Rice dhpf Compiler Strategies for Achieving High Performance in High Performance Fortran Ken Kennedy Rice University

More information

Lecture 17: Array Algorithms

Lecture 17: Array Algorithms Lecture 17: Array Algorithms CS178: Programming Parallel and Distributed Systems April 4, 2001 Steven P. Reiss I. Overview A. We talking about constructing parallel programs 1. Last time we discussed sorting

More information

Programming for Electrical and Computer Engineers. Pointers and Arrays

Programming for Electrical and Computer Engineers. Pointers and Arrays Programming for Electrical and Computer Engineers Pointers and Arrays Dr. D. J. Jackson Lecture 12-1 Introduction C allows us to perform arithmetic addition and subtraction on pointers to array elements.

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication Nur Dean PhD Program in Computer Science The Graduate Center, CUNY 05/01/2017 Nur Dean (The Graduate Center) Matrix Multiplication 05/01/2017 1 / 36 Today, I will talk about matrix

More information

ECE 669 Parallel Computer Architecture

Lecture 23: Parallel Compilation. Two approaches to compilation: Parallelize a program manually; Sequential code converted to parallel code; Develop

Computer Science & Engineering 150A Problem Solving Using Computers

Lecture 06, Stephen Scott, adapted from Christopher M. Bourke, Fall 2009. Chapter 8: 8.1 Declaring and 8.2 Array Subscripts 8.3 Using

Chapter 3. Fortran Statements

This chapter describes each of the Fortran statements supported by the PGI Fortran compilers. Each description includes a brief summary of the statement, a syntax description,

Continuations provide a novel way to suspend and reexecute

Continuations provide a novel way to suspend and reexecute computations. 2. ML (Meta Language): Strong, compile-time type checking. Types are determined by inference rather than declaration. Naturally

Dr. Joe Zhang PDC-3: Parallel Platforms

CSC630/CSC730: Parallel & Distributed Computing. Parallel Computing Platforms, Chapter 2 (2.3). Content: Communication models of Logical organization (a programmer's view) Control structure Communication model

Parallel Paradigms & Programming Models. Lectured by: Pham Tran Vu Prepared by: Thoai Nam

Outline: Parallel programming paradigms; Programmability issues; Parallel programming models; Implicit parallelism; Explicit

Parallelization Principles. Sathish Vadhiyar

Parallel Programming and Challenges: Recall the advantages and motivation of parallelism. But parallel programs incur overheads not seen in sequential programs

CS4961 Parallel Programming. Lecture 5: Data and Task Parallelism, cont. 9/8/09. Administrative. Mary Hall September 8, 2009.

Administrative: Homework 2 posted, due September 10 before class - Use the handin program on the CADE machines - Use the following

Extrinsic Procedures. Section 6

This chapter defines the mechanism by which HPF programs may call non-HPF subprograms as extrinsic procedures. It provides the information needed to write

6.189 IAP Lecture 5. Parallel Programming Concepts. Dr. Rodric Rabbah, IBM IAP 2007 MIT

Recap: Two primary patterns of multicore architecture design. Shared memory, Ex: Intel Core 2 Duo/Quad. One copy of data shared among

Data parallel algorithms 1

Data parallel algorithms (Guy Steele): The data-parallel programming style is an approach to organizing programs suitable for execution on massively parallel computers. In this lecture, we will characterize

Lecture 32: Partitioned Global Address Space (PGAS) programming models

COMP 322: Fundamentals of Parallel Programming. Zoran Budimlić and Mack Joyner

John Mellor-Crummey Department of Computer Science Center for High Performance Software Research Rice University

Co-Array Fortran and High Performance Fortran. LACSI Symposium, October 2006. The Problem: Petascale

Automatic Translation of Fortran Programs to Vector Form. Randy Allen and Ken Kennedy

The problem: New (as of 1987) vector machines such as the Cray-1 have proven successful. Most Fortran code is written

Arrays, Vectors Searching, Sorting

Arrays: char s[200]; // array of 200 characters; different type than class string; can be accessed as s[0], s[1], ..., s[199]. s[0]='H'; s[1]='e'; s[2]='l'; s[3]='l'; s[4]=

Parallelisation. Michael O'Boyle. March 2014

Lecture Overview: Parallelisation for fork/join; Mapping parallelism to shared memory multi-processors; Loop distribution and fusion; Data Partitioning and SPMD

High Performance Computing in C and C++

Rita Borgo, Computer Science Department, Swansea University. Announcement: No change in lecture schedule. Timetable remains the same: Monday 1 to 2 Glyndwr C, Friday

Lecture 12 (Last): Parallel Algorithms for Solving a System of Linear Equations. Reference: Introduction to Parallel Computing Chapter 8.

CZ4102 High Performance Computing. Dr Tay Seng Chuan. Topic Overview

Data Parallel Execution Model

CS/EE 217 GPU Architecture and Parallel Programming, Lecture 3: Kernel-Based Data Parallel Execution Model. David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013. Objective: To understand the organization and scheduling

Parallelization. Saman Amarasinghe. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Outline: Why Parallelism; Parallel Execution; Parallelizing Compilers

Parallel Programs. EECC756 - Shaaban. Parallel Random-Access Machine (PRAM) Example: Asynchronous Matrix Vector Product on a Ring

Conditions of Parallelism: Data Dependence, Control Dependence, Resource Dependence. Bernstein's Conditions. Asymptotic Notations for Algorithm Analysis. Parallel Random-Access Machine (PRAM)

Algorithms and Applications

Areas done in textbook: Sorting Algorithms, Numerical Algorithms, Image Processing, Searching and Optimization. Chapter 10: Sorting Algorithms - rearranging a list of numbers

Compiler techniques for leveraging ILP

Purshottam and Sajith (IU), October 12, 2011. Parallelism in your pocket LINPACK

Workloads Programmierung Paralleler und Verteilter Systeme (PPV)

Summer 2015. Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze. Hardware / software execution environment

OpenMP. A parallel language standard that supports both data and functional parallelism on a shared memory system

Used by system programmers more than application programmers. Considered low-level primitives


EE 4683/5683: COMPUTER ARCHITECTURE. Lecture 4A: Instruction Level Parallelism - Static Scheduling. Avinash Kodi. Agenda: Dependences (RAW, WAR, WAW); Static Scheduling; Loop-carried Dependence

Lecture 5: Outline. I. Multi- dimensional arrays II. Multi- level arrays III. Structures IV. Data alignment V. Linked Lists

Multidimensional arrays: 2D. Declaration: int a[3][4]; /* Conceptually 2D matrix

Name: PID: CSE 160 Final Exam SAMPLE Winter 2017 (Kesden)

Cache Performance (Questions from 15-213 @ CMU. Thanks!) 1. This problem requires you to analyze the cache behavior of a function that sums

Coarse-Grained Parallelism

Variable Privatization, Loop Alignment, Loop Fusion, Loop Interchange and Skewing, Loop Strip-mining. cs6363. Introduction: Our previous loop transformations target vector and

Lecture 21. Software Pipelining & Prefetching. I. Software Pipelining II. Software Prefetching (of Arrays) III. Prefetching via Software Pipelining

[ALSU 10.5, 11.11.4] Phillip B. Gibbons, 15-745: Software

Practice problems Set 2

1) Write a program to obtain the transpose of a 4 x 4 matrix. The transpose of a matrix is obtained by exchanging the elements of each row with the elements of the corresponding column.

Supercomputing in Plain English Part IV: Henry Neeman, Director

OU Supercomputing Center for Education & Research, University of Oklahoma, Wednesday September 19 2007. Outline: Dependency Analysis; What is

Vectorization Using Reversible Data Dependences

Peiyi Tang and Nianshu Gao. Technical Report ANU-TR-CS-94-08, October 21, 1994. Peiyi Tang, Department of Computer

Outline. Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities

Moore's Law. From Hennessy and Patterson, Computer Architecture:

Chapel Introduction and

Lecture 24: Chapel Introduction and Overview of X10 and Fortress. John Cavazos, Dept of Computer & Information Sciences, University of Delaware. But before that Created a simple

10th August Part One: Introduction to Parallel Computing

10th August 2007. Part 1 - Contents: Reasons for parallel computing; Goals and limitations; Criteria for High Performance Computing; Overview of parallel computer

Program Optimization Through Loop Vectorization

María Garzarán, Saeed Maleki, William Gropp and David Padua, Department of Computer Science, University of Illinois at Urbana-Champaign

A Crash Course in Compilers for Parallel Computing. Mary Hall Fall, L2: Transforms, Reuse, Locality

Mary Hall, Fall 2008. Overview of Crash Course: L1: Data Dependence Analysis and Parallelization (Oct. 30); L2 & L3: Loop Reordering Transformations, Reuse

J. E. Smith. Automatic Parallelization Vector Architectures Cray-1 case study. Data Parallel Programming CM-2 case study

SIMD Computers. ECE/CS 757, Spring 2007, J. E. Smith. Copyright (C) 2007 by James E. Smith (unless noted otherwise). All rights reserved. Except for use in ECE/CS 757, no part of these notes may be

Declaration and Initialization

6. Arrays. a1 = sqrt(a1); a2 = sqrt(a2); a100 = sqrt(a100). real :: a(100); do i = 1, 100; a(i) = sqrt(a(i)). Declaring arrays: real, dimension(100) :: a; real :: a(100); real :: a(1:100) !

Lecture 10: Static ILP Basics. Topics: loop unrolling, static branch prediction, VLIW (Sections 4.1-4.4)

Static vs Dynamic Scheduling. Arguments against dynamic scheduling: requires complex structures

G22.2110-001 Programming Languages, Spring 2010, Lecture 4. Robert Grimm, New York University

Review: Last week: Control Structures, Selection, Loops. Outline: Subprograms, Calling Sequences, Parameter Passing

Parallel Sorting. Sathish Vadhiyar

Parallel Sorting Problem: The input sequence of size N is distributed across P processors. The output is such that the elements in each processor P_i are sorted; elements in P

COMP 633 - Parallel Computing. SMM (2) OpenMP Programming Model

Lecture 7, September 12, 2017. Reading for next time: look through sections 7-9 of the OpenMP tutorial. Topics: OpenMP shared-memory parallel