SIMD - Single Instruction Multiple Data
Lecture 12: SIMD machines & data parallelism; dependency analysis for automatic vectorizing and parallelizing of serial programs. Part 1.

Fine-Grain Parallelism
- Parallelism through simultaneous operations on different data
- Systolic arrays
- Parallel SIMD machines, 10k++ processors
- Vector/pipeline units

Systolic Array
- Network of processors, with memory around it
- Performance by doing all computations before storing back
- Often hardware implementations solving one problem
- Special topologies

SIMD Machine
- Front end: a normal von Neumann machine; runs the application program
- Processor array: synchronous - every processor performs the same operation at the same time, or is idle; extends the FPU's instructions
- Small memory per processor
- Smart memory I/O
- Host and controller
- Examples: ILLIAC IV, IBM GF 11, MasPar, CM200 (Bellman, 16k)

Data Parallel Programming
- Idea: update the elements of an array at the same time
- Divides the work between the programmer and the compiler
- The programmer solves the problem in their model: concentrates on structure and concepts at a high level, uses collective operations on large data structures, keeps data in large arrays with mapping information
- The compiler maps the program onto a physical machine: fills in all the details (gladly receives hints from the user) and optimizes computations and communications

Building Blocks in Data Parallel Programming
- The user controls the placement of data on processors: minimize communication, keep all processors busy
- Operations on whole arrays: apply one operation to each element of the array in parallel
- Methods to access parts of an array; operations can operate on these parts. Example: where element < 0, set element := 1
- Reduction operations on arrays produce a result from a combination of many array elements: sum, max, min, ...
- Shift operations along the axes of multidimensional arrays
- Scan operations (prefix/suffix operations)
- Generalized communication
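To make these building blocks concrete, here is a minimal sketch in standard Fortran 90 (the array language that HPF, introduced later in this lecture, builds on) showing a whole-array operation, a masked update, and reductions. The array contents are illustrative, not from the slides:

      program building_blocks
        implicit none
        real :: a(8)
        a = (/ 3.0, -1.0, 4.0, -1.0, 5.0, -9.0, 2.0, 6.0 /)
        a = a * 2.0                ! one operation applied to every element
        where (a < 0.0) a = 1.0    ! masked update: only the negative elements
        print *, 'sum =', sum(a), ' max =', maxval(a)   ! reductions
      end program building_blocks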
C*
- C* supports broadcast, reduction and interprocessor communication
- Parallel variables have a type and a shape; the shape defines the number of elements and their organization:
      shape [16384] employees;   /* 1-D */
      shape [512][512] image;    /* 2-D */
- Left indexing: indexing that refers to parallel variables; the 1st dimension is axis 0, the 2nd is axis 1, etc.
      int:employees employee_id;
      [2]employee_id   /* refers to the 3rd element in employee_id */
- Parallel structures:
      shape [16384] employees;
      struct date {
          int month;
          int day;
          int year;
      };
      struct date:employees birthday;
  Each element of the parallel variable birthday contains a date; birthday.month specifies all month fields in the parallel variable birthday.

C* - Parallel Operations
- Overloading: x = y + z adds y and z in each position of the shape
- New operators (a, b scalar or parallel):
      a <? b   /* min of two variables */
      a >? b   /* max of two variables */
- Selection of shape (with):
      shape [16384] numbers;
      int:numbers x, y, z;
      with (numbers)
          x = y + z;

where - C* Setting the Context
- Limits the area where an operation is performed:
      with (numbers)
          where (z != 0)   /* sets the active positions */
              x = y/z;
          else             /* reverses the active positions */
              x = y;
- everywhere: all positions active, independently of earlier context

C* - Communication
- Grid communication: pcoord(axis) gives my index along an axis of the shape
- Example: send the value of source to the element dest one position higher up:
      [pcoord(0) + 1]dest = source;
- Dot (.) is sometimes used instead of pcoord:
      [. + 1]dest = source;
      [. + 1][. - 2]dest = source;

Compute Pi in C*
Pi ≈ (1/N) * Σ_{i=0}^{N-1} 4/(1 + x_i * x_i), where x_i = (i + 1/2)/N

      #define N 16384   /* value assumed; matches the shapes above */
      shape [N] chunk;
      double:chunk x;

      main() {
          double sum;
          double width;
          width = 1.0/N;
          with (chunk) {                        /* in parallel */
              x = (pcoord(0) + 0.5) * width;
              sum = (+= (4.0/(1.0 + x*x)));     /* reduction over the shape */
          }
          sum = sum * width;
          printf("Estimate of Pi = %14.12f\n", sum);
      }
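For comparison, a Fortran 90 counterpart of the Pi program above, using an array expression and the SUM reduction instead of a parallel shape (a sketch; n = 16384 mirrors the shapes used earlier, and any value works):

      program pi_estimate
        implicit none
        integer, parameter :: n = 16384
        integer :: i
        real(kind=8) :: width, x(n), pi_est
        width = 1.0d0 / n
        x = (/ ( (i - 0.5d0) * width, i = 1, n ) /)   ! midpoints x_i
        pi_est = sum(4.0d0 / (1.0d0 + x*x)) * width   ! reduction, then scale
        print '(a,f14.12)', 'Estimate of Pi = ', pi_est
      end program pi_estimate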
Compute Partial Sums in an Array (C*)

      #define N 1024
      shape [N] ArrayShape;      /* select shape */
      int:ArrayShape x;
      int i;

      main() {
          with (ArrayShape)
              for (i = 1; i <= log2(N); i++)
                  where (pcoord(0) >= pow(2, i-1))        /* active positions */
                      x += [pcoord(0) - pow(2, i-1)]x;    /* left indexing */
      }

High Performance Fortran
- Data parallel language (many similarities to CM FORTRAN)
- For SIMD and MIMD (NUMA) machines
- Based on F90, which extends F77 with array operations, user-defined data types, recursion and dynamic memory allocation, and pointers
- HPF adds control of data distribution: data mapping directives, FORALL statements and constructs, the INDEPENDENT directive, etc.
- (Slide figure: F77 -> F90 -> HPF; the HPF compiler produces an SPMD executable, e.g. F77 plus message passing.)

The PROCESSORS directive
- Declares an abstract processor arrangement onto which data is mapped
- Each element of this arrangement corresponds to a node of the physical machine
- The declarations are often parametrized with the intrinsic function NUMBER_OF_PROCESSORS:
      !HPF$ PROCESSORS p(NUMBER_OF_PROCESSORS()/2, 2)
- To an ordinary Fortran compiler, the !HPF$ prefix is just a comment

The DISTRIBUTE directive
- Controls the mapping of data onto processors
- BLOCK distribution: each processor stores a consecutive block of the array
      REAL a(16)
      !HPF$ PROCESSORS p(4)
      !HPF$ DISTRIBUTE a(BLOCK) ONTO p
- (BLOCK, BLOCK) distribution: for multidimensional arrays, separate blocking in each dimension
      REAL a(7,7)
      !HPF$ PROCESSORS p(2,2)
      !HPF$ DISTRIBUTE a(BLOCK, BLOCK) ONTO p
- CYCLIC distribution: elements are dealt out to the processors round-robin
      REAL a(16)
      !HPF$ PROCESSORS p(4)
      !HPF$ DISTRIBUTE a(CYCLIC) ONTO p
- It is not necessary to have the same distribution in all dimensions:
      REAL a(7,7)
      !HPF$ PROCESSORS p(2,2)
      !HPF$ DISTRIBUTE a(CYCLIC, BLOCK) ONTO p
      !HPF$ DISTRIBUTE a(CYCLIC, CYCLIC) ONTO p
      !HPF$ DISTRIBUTE a(BLOCK, CYCLIC) ONTO p
- (Slide figures show which elements land on processors P1..P4 under each distribution.)
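The element-to-processor mapping in the lost figures is easy to recompute. A small sketch (my own, not from the slides) that prints the owner of each element of a 16-element array on 4 processors under BLOCK and CYCLIC, with processors numbered P1..P4 as in the figures:

      program distributions
        implicit none
        integer, parameter :: n = 16, p = 4
        integer :: i
        do i = 0, n - 1
           ! BLOCK: consecutive chunks of n/p; CYCLIC: round-robin
           print '(a,i2,a,i1,a,i1)', 'a(', i + 1, ')  block: P', &
                i/(n/p) + 1, '  cyclic: P', mod(i, p) + 1
        end do
      end program distributions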
The ALIGN directive
- Describes mapping relations between interacting objects; aligned objects are allocated on the same processor
      REAL a(6), b(6)
      !HPF$ ALIGN a(i) WITH b(i)

      REAL a(4,4), b(4,10)
      !HPF$ ALIGN a(i,j) WITH b(i, 2*j+1)
- (Slide figure: with the first alignment, a(1)..a(6) line up with b(1)..b(6); with the second, a(i,1)..a(i,4) are stored with b(i,3), b(i,5), b(i,7), b(i,9).)

Example: Simple Matrix Multiplication

      PROGRAM ABmult
        INTEGER, PARAMETER :: N = 100
        INTEGER, DIMENSION (N,N) :: A, B, C
        INTEGER :: i, j
      !HPF$ PROCESSORS SQ(2,2)
      !HPF$ DISTRIBUTE C(BLOCK,BLOCK) ONTO SQ
      !HPF$ ALIGN A(i,*) WITH C(i,*)
      ! replicate copies of row A(i,*) onto processors which compute C(i,j)
      !HPF$ ALIGN B(*,j) WITH C(*,j)
      ! replicate copies of column B(*,j) onto processors which compute C(i,j)
        A = 1
        B = 2
        C = 0
        DO i = 1, N
          DO j = 1, N
            ! All the work is local due to the ALIGNs
            C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
          END DO
        END DO
      END

The FORALL statement
- Generalization of array assignment and masked array assignment (NOT a loop)
- Single-statement FORALL: FORALL (index, mask) forall-assignment
  Equivalent to array assignment in F90: for every index, check the mask; compute the right-hand side for the unmasked values; then carry out the assignments to the left-hand side
- Multi-statement FORALL:
      FORALL (index, mask)
          forall-body
      END FORALL
  The forall-body can contain FORALL, WHERE, or ordinary forall assignments; semantically an abbreviation of a series of single-statement FORALLs

The INDEPENDENT directive
- States that no iteration affects any other iteration in any way
- Used to give the compiler extra information about the execution of a DO or FORALL
- Applied to a DO: states that there are no loop-carried dependencies
- Applied to a FORALL: states that no index points to an address used by any other object
      !HPF$ INDEPENDENT
      DO I = 1, N
          A(INDX(I)) = B(I)
      END DO
- Timing example: in
      FORALL (I=1:3)
          L1(I) = R1(I)
          L2(I) = R2(I)
      END FORALL
  assume that R1(3) and R2(1) take longer due to communication. Without INDEPENDENT there is a synchronization after each statement, so every index waits for R1(3) and then for R2(1); with !HPF$ INDEPENDENT in front of the FORALL, each index proceeds through both assignments on its own, and time is gained. (Slide figure contrasts the two schedules.)

Game of LIFE

      INTEGER LIFE(64,64), NCOUNT(64,64)
      !HPF$ ALIGN LIFE WITH NCOUNT
      !HPF$ DISTRIBUTE LIFE(BLOCK, BLOCK)
      ...  ! init LIFE
      NCOUNT = 0
      DO M = 1, NUMBER_OF_GENERATIONS
        FORALL (I=2:63, J=2:63)
          NCOUNT(I,J) = SUM(LIFE(I-1:I+1, J-1:J+1)) - LIFE(I,J)
        END FORALL
        ! Create next generation
        WHERE ((LIFE.EQ.0) .AND. (NCOUNT.EQ.3))
          LIFE = 1
        END WHERE
        WHERE ((LIFE.EQ.1) .AND. (NCOUNT.NE.2) .AND. (NCOUNT.NE.3))
          LIFE = 0
        END WHERE
      END DO
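The "NOT a loop" point deserves a concrete example. In a FORALL, all right-hand sides are evaluated before any assignment takes place, so the result can differ from the corresponding DO loop. A small Fortran sketch (my own illustration, not from the slides):

      program forall_vs_do
        implicit none
        integer :: a(5), b(5), i
        a = (/ 1, 2, 3, 4, 5 /)
        b = a
        forall (i = 2:5) a(i) = a(i-1)   ! uses only old values: a = 1 1 2 3 4
        do i = 2, 5
           b(i) = b(i-1)                 ! propagates b(1):     b = 1 1 1 1 1
        end do
        print *, a
        print *, b
      end program forall_vs_do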
Summary: Data Parallelism
- Scalable
- Data parallel programming is simpler than message passing
- Data parallel languages: C*, CM Fortran, HPF
- SIMD-style: single program, single instruction flow
- SPMD-style: single program, multiple data - different instruction flows locally
- Machines: SIMD (CM2, MasPar, ...), or MIMD with SPMD programming

Lecture 12b: Dependency analysis for automatic vectorization and parallelization of serial programs

Automatic Parallelization
- Loops are the largest source of parallelism
- Loop parallelization: different iterations on different processors; different tasks within an iteration on different processors
- Vectorization/pipelining:
  - Pipeline: breaks instructions down into substeps that are overlapped
  - Vector: the pipelined instructions operate on a vector register of fixed length

Contents
- Vector hardware
- Data dependency analysis: dependency graphs, dependency tests
- Vectorization: standard transformations, vector code generation
- Parallelization: loop scheduling

Vector Supercomputer (Register-to-Register)
(Slide figure: a host computer and control unit issue instructions; data moves from mass storage over I/O data pipes into main memory (program & data), then through vector registers into the arithmetic pipes.)

Transformation of a Loop to a Sequence of Vector Instructions

      do I = 1, N
          C(I) = A(I) + B(I)
      end do

Vectorization gives C[1:N] = A[1:N] + B[1:N], generated as:

      L     G0, N         Load vector length N
      LA    G3, C         Load addr for C
      LA    G2, B         Load addr for B
      LA    G1, A         Load addr for A
LOOP  VLVCU G0            Set up loop for 128 elements
      VLD   V1, G1        Load 128 A in V1
      VLD   V2, G2        Load 128 B in V2
      VAD   V3, V1, V2    A + B -> V3
      VSTD  V3, G3        V3 -> C
      BC    2, LOOP       If more elements, loop
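The hardware loop above processes the array in strips of at most 128 elements, one strip per trip through LOOP. The same strip-mining idea expressed as a Fortran sketch (array names and sizes are illustrative):

      program stripmine
        implicit none
        integer, parameter :: n = 1000, vl = 128
        real :: a(n), b(n), c(n)
        integer :: i, len
        a = 1.0
        b = 2.0
        do i = 1, n, vl
           len = min(vl, n - i + 1)                     ! current vector length
           c(i:i+len-1) = a(i:i+len-1) + b(i:i+len-1)   ! one vector operation
        end do
        print *, c(1), c(n)
      end program stripmine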
Expected Speedup

      do I = 1, N
          C(I) = A(I) + B(I)
      end do

Scalar code, per element (7 cycles, i.e. 7*128 for 128 elements):

      instruction                    cycles
      Load A(I) into register          1
      Load B(I) into register          1
      ADD A(I) + B(I)                  3
      Store C(I) from register         1
      Decrement counter                1

Vector code C[1:N] = A[1:N] + B[1:N], per strip of 128 elements (4*128 cycles):

      instruction                    cycles
      Load A(1:128)                   128
      Load B(1:128)                   128
      ADD A(1:128) + B(1:128)         128
      Store C(1:128)                  128

Speedup = 7/4 = 1.75

What Can Be Vectorized?
- Only DO (for) loops can be vectorized
- Only one loop in a loop nest can be vectorized
- Vectorizable loops may NOT contain: data dependencies; jumps into/out of the loop (entry/stop); loop variables other than integers; I/O statements; side effects; calls to external subprograms
- In some cases the compiler can rewrite the loop and then vectorize it partially

Different Types of Dependencies
- True/flow dependence: a variable is defined before it is used (DEF -> USE)
      S1: A = B + C
      S2: D = A + 2
      S3: E = A * 3
  (S1 δt S2, S1 δt S3)
- Anti dependence: a variable is used before it is defined
      S1: A = B + C
      S2: B = X * 3
  (S1 δa S2)
- Output dependence: a variable is assigned a value several times
      S1: A = B + C
      S2: A = X * 3
  (S1 δo S2)

Execution Order and Data Dependency
- Execution order: S(i,j,k) << S'(i',j',k') iff (i,j,k) < (i',j',k')
- Input & output sets:
      DEF(S) = the set of all variables defined by the statement S
      USE(S) = the set of all variables used by the statement S
- Data dependency between two statements S and T (S δ T) if:
  - S << T,
  - there exists a variable v such that v is in both DEF(S) and USE(T), or v is in both USE(S) and DEF(T), or v is in both DEF(S) and DEF(T), and
  - there does not exist a statement S1 such that S << S1 << T and v is in DEF(S1)

Data Dependency in Loops
- Independent loops: no iteration depends on data from any other iteration
- Dependent loops: statement S_l is dependent on statement S_k if the execution of S_k must occur before the execution of S_l
- Loop-carried dependency: the dependency depends on a loop index
- Loop-independent dependency: the dependency does not depend on a loop index

Basic Concepts
- Iteration vector: points to a specific iteration of a loop nest, i = (i_1, i_2, ..., i_n), where i_1 is the outermost index
- Distance vector: the distance between two iteration vectors, i' - i
- Dependency distance vector: if S and S' are instances of statements in a loop nest and S(i) δ S'(i'), then the dependency distance vector is dist(i, i') = i' - i
- Dependency direction vector: the same as the dependency distance vector, but only the direction is shown; (<, =, >) corresponds to (+, 0, -)
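The three dependence types can be seen in one piece of straight-line code. A self-contained Fortran sketch (statement labels follow the slide; the numeric values are illustrative):

      program dependences
        implicit none
        real :: a, b, c, d, e, x
        b = 1.0
        c = 2.0
        x = 3.0
        a = b + c      ! S1
        d = a + 2.0    ! S2: true (flow) dependence on S1 - A defined, then used
        b = x * 3.0    ! S3: anti dependence on S1 - B used by S1, then redefined
        a = x * 3.0    ! S4: output dependence on S1 - A assigned twice
        e = a * 3.0    ! S5: flow dependence on S4, the latest definition of A
        print *, a, b, d, e
      end program dependences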
Dependency Distance & Direction

      do i = 2, 100
          S1: A(i) = B(i) + C(i)
          S2: D(i) = A(i-1)
      end do

Loop-carried dependency:
      i = 2:  S1: A(2) = B(2) + C(2)
              S2: D(2) = A(1)
      i = 3:  S1: A(3) = B(3) + C(3)
              S2: D(3) = A(2)
DEF, USE -> S1 δt S2, distance 3 - 2 = 1, direction <

      do i = 2, 100
          S1: A(i) = B(i) + C(i)
          S2: D(i) = A(i)
      end do

Loop-independent dependency:
      i = 2:  S1: A(2) = B(2) + C(2)
              S2: D(2) = A(2)
      i = 3:  S1: A(3) = B(3) + C(3)
              S2: D(3) = A(3)
DEF, USE -> S1 δt S2, distance 2 - 2 = 0, direction =

Representation of Data Dependency
- Dependency graph: a directed graph G(V, E), where V is a set of statements and E is a set of edges representing dependencies
- Dependency cycles: chains of dependencies starting and ending at the same statement S
      S1: A = B + E
      S2: B = C
      S3: C = A
  V = {S1, S2, S3}, E = {(S1, S2), (S1, S3), (S2, S3)}: S1 δa S2, S2 δa S3, S1 δt S3

Loop Dependencies

      do i = 2, 100
          S1: A(i) = B(i) + C(i)
          S2: D(i) = A(i-1)
      end do

      do i = 2, 100
          S1: A(i) = B(i) + C(i)
          S2: D(i) = A(i+1)
      end do

      do i = 2, 100
          do j = 2, 99
              S1: A(i+1, j-1) = A(i, j) + C(i, j)
          end do
      end do

Review Questions
- Which dependencies exist in the code snippets on the previous slide?
- What are their direction vectors?
- What do the dependency graphs look like?
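As a hint for the third snippet (a worked sketch of my own, not from the slides): the write to A(i+1, j-1) in iteration (i, j) is read as A(i', j') in iteration (i', j') = (i+1, j-1), so there is a true dependence with distance vector (1, -1) and direction vector (<, >), carried by the outer loop:

      program nested_dep
        implicit none
        real :: a(101, 100), c(100, 100)
        integer :: i, j
        a = 1.0
        c = 2.0
        do i = 2, 100
           do j = 2, 99
              a(i+1, j-1) = a(i, j) + c(i, j)   ! S1: source at (i,j), sink at (i+1,j-1)
           end do
        end do
        print *, a(50, 50)
      end program nested_dep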