Parallel Programming (March 15, 2010)
1 Parallel Programming, March 15, 2010
2 Some Definitions
- Computational Models and Models of Computation
[Figure: real-world system -> domain model (mathematical, organizational, ...) -> computational model]
3 Models
- A computational model is a model of a (real) system which can be solved by a computer.
- A model of computation is the way a computational model (problem) is solved by a computer.
- A model of computation defines the functionality of a virtual machine.
4 Elements of Models of Concurrent Computation
- Schedulable units of computation and data: operators, functions, programs, arrays, objects, ...
- Bindings between schedulable units of computation (values to names): global variables, communicated variables [Brown 86]
5 Example
    program T1:  var A, global;  A := 5;  fork(T2, T3);
    program T2:  var B, local;   B := A;  ...
    program T3:  var C, global;  C := 1;  A := C;
- Schedulable computations: T1, T2, T3.
- Schedulable data: local variables are aligned with the Ts; global variables are not.
- Binding of computation: control flow.
- Name-to-value binding: may lead to non-determinism.
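To make the non-determinism concrete: below is a minimal C/OpenMP sketch (not from the slides) of the same pattern. One section plays T2 and reads the global A while the other plays T3 and overwrites it; the race on A is deliberate, and B may come out as 5 or 1 depending on the interleaving.

    #include <stdio.h>

    int A = 5;                      /* global: name-to-value binding shared by the tasks */

    int main(void) {
        int B = 0;
        #pragma omp parallel sections
        {
            #pragma omp section
            { B = A; }              /* T2: B := A */
            #pragma omp section
            { A = 1; }              /* T3: C := 1; A := C */
        }
        printf("B = %d\n", B);      /* prints 5 or 1, depending on the schedule */
        return 0;
    }

Compile with -fopenmp (gcc) or the equivalent flag; without OpenMP support the pragmas are ignored and the program runs sequentially.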
6 From Application to Machine
- (Concurrent) computation can be hierarchically specified by selecting refined models of concurrent computation.
- The refinement is done by specifying the elements of the model of computation and the mapping of those elements between successive virtual machines.
- For each element in the model of computation there may exist alternative choices.
7 Major Levels
- Specification
- Design
- Implementation
- System
- Machines
8 Example
- Application level: element assembly, node ordering, linear system solver
- Design level: element assembly, e.g. Cholesky with nested dissection
9 Example (cont'd)
- Implementation level: sequential programs with message passing, explicit mapping
- System level: MPI
10 Another view
- Decompose: find schedulable units of computation (i.e. tasks) and data.
- Cluster: cluster tasks and data into sequential allocatable units of work. Goal: establishing the right level of granularity, e.g. define processes or threads of execution.
- Embed in programming environment: select a parallel programming language; add communication and synchronization constructs.
- Map: map allocatable units of work to processors, binding processes or threads to physical processors.
11 Decompose
- Decide what level of parallelism to use.
- Parallelism may be static or dynamic.
- The level of parallelism may depend on problem size.
- Determine the number of processors that can effectively be used given a certain problem size.
- More problem dependent than architecture dependent.
12 Cluster
- Bring tasks (and sometimes data) together in conceptually executable programs (processes or threads).
- Various mechanisms: cluster iterations; data-driven decomposition; use heuristic clustering algorithms.
- More problem dependent than architecture dependent.
13 Embed in programming environment
- Select a parallel programming method.
- Add communication and synchronization constructs.
- Add access mechanisms to shared data.
- Design distributed data structures.
- Optimize the program to reduce the cost of parallelism: combine communication messages; remove redundant synchronizations; reduce parallel loop overhead.
- Partly dependent on architecture.
14 Map
- Assign processes or threads to processors.
- Try to schedule processes that communicate a lot onto the same processor.
- Dependent on architecture and run-time system.
15 Languages and Machines
- (Parallel) language, or: sequential language + library
- Compiler
- Run-time system
- Machines
16 Basic Notions
[Figure: task dependency graph over tasks T1..T5 with per-task costs (3, 5, ...); critical path = 18, max concurrency = 3]
- But: data handling is not included in a task dependency graph.
17 How to find parallelism?
- Start with the sequential code base:
  - Do profiling to identify computationally intensive kernels.
  - Code inspection (identify computationally intensive loops).
  - Inspect data structures and, if necessary, unbundle them.
  - Rewrite parts of the code using parallel constructs or library calls.
- Change the algorithm(s): sometimes algorithms are too sequential; find parallel alternatives.
- Start all over again: working with an existing code base is sometimes inefficient; recode from the specification.
18 Algorithm Change [1]
- Stencil operation: nearest neighbor, Gauss-Seidel solver.
    for(i=1;i<dim;i++){
      for(j=1;j<dim;j++){
        a(i,j)=0.2*(a(i,j)+a(i,j-1)+a(i-1,j)+a(i,j+1)+a(i+1,j))
    }}
19 Algorithm Change [2]
- Do a coordinate transform on the iteration space: parallelism along the anti-diagonals.
- Load balancing problem (the anti-diagonals differ in length); one possible choice of bounds is sketched below.
    for(i=1;i<dim-1;i++){
      parfor(j=??){
        a(i,j)=0.2*(a(i,j)+a(i,j-1)+a(i-1,j)+a(i,j+1)+a(i+1,j))
    }}
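One way to fill in the j=?? bounds, as a C/OpenMP sketch (assuming interior points (i,j) with 1 <= i,j <= dim-1 and C array syntax): sweep the anti-diagonals d = i+j and parallelize within each diagonal, since no point depends on another point of the same anti-diagonal.

    /* Wavefront sweep over anti-diagonals d = i+j. */
    for (int d = 2; d <= 2*(dim-1); d++) {
        int jlo = (d - (dim-1) > 1) ? d - (dim-1) : 1;    /* clip j to [1, dim-1] */
        int jhi = (d - 1 < dim-1) ? d - 1 : dim-1;
        #pragma omp parallel for
        for (int j = jlo; j <= jhi; j++) {
            int i = d - j;
            a[i][j] = 0.2*(a[i][j]+a[i][j-1]+a[i-1][j]+a[i][j+1]+a[i+1][j]);
        }
    }

The diagonals near the corners are short, which is exactly the load balancing problem noted above.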
20 Algorithm Change [3]
- If all (i,j) iterations are done simultaneously: a Jacobi solver (needs a duplicate data structure).
- If not, the program might become nondeterministic!
    parallel for(i=1;i<dim;i++){
      parallel for(j=1;j<dim;j++){
        a(i,j)=0.2*(a(i,j)+a(i,j-1)+a(i-1,j)+a(i,j+1)+a(i+1,j))
    }}
21 Algorithm Change [4]
- Alternative scheme: red-black ordering. All red elements can be updated in parallel, then all black elements.
- Different schemes have different numerical convergence!
    parallel for( red elements ){
      a(i,j)=0.2*(a(i,j)+a(i,j-1)+a(i-1,j)+a(i,j+1)+a(i+1,j))
    }
    parallel for( black elements ){
      a(i,j)=0.2*(a(i,j)+a(i,j-1)+a(i-1,j)+a(i,j+1)+a(i+1,j))
    }
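A C/OpenMP sketch of one red-black sweep (coloring by the parity of i+j; array and bounds assumed as before). Every neighbour of a red point is black, so each half-sweep is race-free; the implicit barrier at the end of the first parallel loop orders the two phases.

    #pragma omp parallel for                      /* red points: (i+j) even */
    for (int i = 1; i < dim; i++)
        for (int j = 1 + (i+1)%2; j < dim; j += 2)
            a[i][j] = 0.2*(a[i][j]+a[i][j-1]+a[i-1][j]+a[i][j+1]+a[i+1][j]);

    #pragma omp parallel for                      /* black points: (i+j) odd */
    for (int i = 1; i < dim; i++)
        for (int j = 1 + i%2; j < dim; j += 2)
            a[i][j] = 0.2*(a[i][j]+a[i][j-1]+a[i-1][j]+a[i][j+1]+a[i+1][j]);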
22 Important Machine Classes
- Shared memory: physically shared address space.
- Distributed memory: physically local address spaces.
23 Shared Memory
[Figure: processors P connected to memory modules a, b, c through an interconnection network; the assignments a(1)=.., a(2)=.., a(3)=.. execute on different processors]
- All processors have equal access to all memory locations.
    for(i=1;i<dim+1;i++){
      a(i)=b(i)+c(i)
    }
  translates to:
    parallel for(i=1;i<dim+1;i++){
      a(i)=b(i)+c(i)
    }
  (or is executed in parallel directly, if possible)
24 Distributed Memory
[Figure: processors with local memories a, b, c connected by an interconnection network; remote reads appear as requests such as P?b(1) and P?b(3)]
- Every processor has its own local address space.
- Non-local data can only be accessed through a request to the processor owning the data.
- Access times may depend on distance.
25 Virtual Shared Memory
- Global address space, but the memories are physically distributed.
- Local access is faster than remote access.
- Hardware (or software) hides the memory distribution.
26 Problems
- Shared memory model: machines have scalability problems; automatic parallel program generation is difficult.
- Distributed memory model: data distribution is important; automatic parallel program generation is very difficult.
27 Models of parallel computation
- For sequential programming, structured programming techniques have evolved to aid the programmer. What do we have for parallel programming?
- Conceptual level: farmer/worker; divide and conquer; data parallelism; function parallelism; bulk synchronous.
- System (implementation) level: (logical) data spaces and programs; communication and synchronization.
28 Farmer/Worker
- 1. Farmer/Worker
[Figure: farmer process handing jobs to a pool of workers]
- A pool of identical processes is formed, each of which is given a piece of work by the farmer process. When finished, a worker asks the farmer for another job, until nothing is left.
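In OpenMP the farmer/worker pattern can be approximated with a dynamically scheduled loop: the runtime plays the farmer, handing one job at a time to whichever thread becomes idle. A sketch (do_work and njobs are hypothetical placeholders):

    #pragma omp parallel for schedule(dynamic, 1)
    for (int job = 0; job < njobs; job++)
        do_work(job);    /* a finished worker automatically asks for the next job */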
29 Divide and Conquer
- 2. Divide and Conquer
- A task is recursively split into smaller tasks, which at some level of granularity are handed to workers. When they finish, the results are recursively moved back up until the root task is reached.
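A sketch of divide and conquer with OpenMP tasks; the summation problem and the GRAIN cutoff are illustrative assumptions, not from the slides.

    #define GRAIN 1024

    long dc_sum(const int *v, int lo, int hi) {
        if (hi - lo <= GRAIN) {               /* granularity reached: hand to a worker */
            long s = 0;
            for (int i = lo; i < hi; i++) s += v[i];
            return s;
        }
        long left, right;
        int mid = lo + (hi - lo)/2;
        #pragma omp task shared(left)         /* split: left half becomes a new task */
        left = dc_sum(v, lo, mid);
        right = dc_sum(v, mid, hi);           /* right half stays in this task */
        #pragma omp taskwait                  /* results move back up recursively */
        return left + right;
    }

The first call goes inside #pragma omp parallel followed by #pragma omp single, so one thread starts the recursion and the task pool supplies the workers.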
30 Data parallelism
- 3. Data Parallelism
[Figure: block distribution of a data structure over processor-memory pairs connected by a communication network]
- The data structures are subdivided among the memories of the distributed processors. Computations follow the given data distributions.
31 Function or Task Parallelism
- 4. Function Parallelism
[Figure: functions f, g, h arranged in a pipeline]
- Functions act on arguments. There are a number of subclasses in this model, due to differences in execution semantics: pipelining; data flow; demand driven / control driven.
32 Bulk synchronous
- 5. Bulk Synchronous
[Figure: repeated phases (computation, communication, decision, communication) with a global sync between phases]
- Parallel computation consists of repeated cycles of: a compute phase, where all calculations are done locally; a communication phase; a decision phase (e.g. a convergence test); a communication phase.
33 Communication
- a in global memory (shared memory), e.g. thread-1 executes d := e + a while thread-2 executes a := 5. Typical mechanisms: semaphores, critical sections.
- a in local memory (distributed memory). Typical mechanism: message passing.
34 Communication in shared memory
- s in global memory; mutual exclusion is required, and the lock statement must be atomic.
    /* code thread-1 */        /* code thread-2 */
    ...                        ...
    lock s;                    lock s;
    s = s + 4;                 s = s + 5;
    unlock s;                  unlock s;
    ...                        ...
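The lock/unlock pairs map directly onto OpenMP's critical construct (a named critical section here; a single update like this could also use atomic). A minimal runnable sketch:

    #include <stdio.h>

    int main(void) {
        int s = 0;
        #pragma omp parallel sections
        {
            #pragma omp section
            {
                #pragma omp critical(s_lock)   /* lock s; s = s + 4; unlock s; */
                s = s + 4;
            }
            #pragma omp section
            {
                #pragma omp critical(s_lock)   /* lock s; s = s + 5; unlock s; */
                s = s + 5;
            }
        }
        printf("s = %d\n", s);                 /* always 9, in either order */
        return 0;
    }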
35 Synchronization in shared memory
- Some parallel models, like the bulk synchronous model, require global synchronization.
- Barrier construct: barrier(name, processor_group). Every thread executes the same skeleton:
    /* code thread-1, thread-2, ... */
    ...
    /* compute code */
    barrier(5, procs);
    /* communication, or updating shared variables */
    barrier(6, procs);
    /* decision code */
    barrier(7, procs);
    ...
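In OpenMP the named barriers become the (unnamed) barrier directive. A fragment of the same compute / communicate / decide cycle, assuming shared arrays cur and nxt and bounds N and MAXITER declared outside the parallel region:

    #pragma omp parallel
    for (int iter = 0; iter < MAXITER; iter++) {
        #pragma omp for nowait            /* compute code (local work) */
        for (int i = 1; i <= N; i++)
            nxt[i] = 0.5*(cur[i-1] + cur[i+1]);
        #pragma omp barrier               /* barrier(5, procs) */
        #pragma omp for nowait            /* updating shared variables */
        for (int i = 1; i <= N; i++)
            cur[i] = nxt[i];
        #pragma omp barrier               /* barrier(6, procs) */
        /* decision code (e.g. a convergence test) would go here */
        #pragma omp barrier               /* barrier(7, procs) */
    }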
36 Message Passing
- Basic building blocks (not all parameters shown, for convenience):
    send(void *sendbuf, int n_elems, int p_dest)
    receive(void *recbuf, int n_elems, int p_source)
- Example:
    /* code P0 */          /* code P1 */
    ...                    ...
    a=100;                 receive(&a,1,0);
    send(&a,1,1);          printf("%d\n", a);
    a=0;                   ...
    ...
- The a on P0 and the a on P1 are not the same variable. (Cf. MPI.)
37 Message Passing Discipline
    /* code P0 */          /* code P1 */
    send(&a,1,1);          send(&b,1,0);
    receive(&b,1,1);       receive(&a,1,0);
- send must be non-blocking; receive must be blocking (otherwise both processes block in send and the exchange deadlocks).
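In MPI, the standard-mode MPI_Send may or may not block, so the safe way to code this exchange is MPI_Sendrecv, which pairs each send with a receive and cannot deadlock. A sketch for two processes:

    #include <mpi.h>

    /* Exchange a and b between ranks 0 and 1. */
    void exchange(double *a, double *b) {
        int me;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        if (me == 0)
            MPI_Sendrecv(a, 1, MPI_DOUBLE, 1, 0,     /* send a to P1 */
                         b, 1, MPI_DOUBLE, 1, 0,     /* receive b from P1 */
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        else
            MPI_Sendrecv(b, 1, MPI_DOUBLE, 0, 0,     /* send b to P0 */
                         a, 1, MPI_DOUBLE, 0, 0,     /* receive a from P0 */
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }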
38 Matrix vector example
- y = A x; for a 4x4 example, y_i = a_i1*x_1 + a_i2*x_2 + a_i3*x_3 + a_i4*x_4.
    for(i=1;i<dim+1;i++){
      y(i)=0;
      for(j=1;j<dim+1;j++){
        y(i)=y(i) + a(i,j) * x(j)
      }
    }
39 Shared memory model: OpenMP
- Extension to Fortran, C, and C++.
- Basic features: parallel loops; parallel sections.
- Synchronization: critical sections.
- Communication: shared variables.
40 The OpenMP execution model
[Figure: base process forks n threads at a parallel construct; a worksharing construct divides the work among them while the rest stay idle; the threads join at the end of the parallel construct]
- The parallel construct creates a number of worker processes (or threads).
- A worksharing construct hands out work to the available workers (threads).
41 Format OMP
    #pragma omp directive [clause list]
- Example:
    #pragma omp parallel [clause list]
    /* structured block */
42 Parallel DO
    #pragma omp parallel
    #pragma omp for schedule(static,2)
    for(i=1;i<dim+1;i++){
      ...
    }
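A complete, runnable version of this slide (the loop body is an assumption: the a = b + c example from slide 23):

    #include <stdio.h>
    #define DIM 16

    int main(void) {
        double a[DIM+1], b[DIM+1], c[DIM+1];
        for (int i = 1; i < DIM+1; i++) { b[i] = i; c[i] = 2.0*i; }
        #pragma omp parallel
        #pragma omp for schedule(static,2)   /* chunks of 2 iterations, dealt round robin */
        for (int i = 1; i < DIM+1; i++)
            a[i] = b[i] + c[i];
        printf("a[1] = %f\n", a[1]);
        return 0;
    }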
43 Parallel Sections
- PARALLEL SECTIONS combines the parallel and worksharing constructs.
- Allows independent execution of subprograms.
    #pragma omp parallel sections
    {
      #pragma omp section
      { task_a(); }
      #pragma omp section
      { task_b(); }
    }
44 Locks and Critical Sections
- Sequential:
    for(i=1;i<dim+1;i++){
      sum = sum + b(i)**2
    }
- Parallel, with a reduction (each thread computes its b(i)**2 terms locally and the partial sums are combined into sum):
    #pragma omp parallel for private(i) reduction(+:sum) schedule(dynamic,1)
    for(i=1;i<dim+1;i++){
      sum = sum + b(i)**2
    }
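The same reduction in plain C (Fortran-style b(i)**2 becomes b[i]*b[i]): each thread accumulates a private partial sum, and OpenMP combines the partial sums into sum at the end.

    #include <stdio.h>
    #define DIM 1000

    int main(void) {
        double b[DIM+1], sum = 0.0;
        for (int i = 1; i < DIM+1; i++) b[i] = 1.0;
        #pragma omp parallel for reduction(+:sum) schedule(dynamic,1)
        for (int i = 1; i < DIM+1; i++)
            sum = sum + b[i]*b[i];
        printf("sum = %f\n", sum);    /* 1000.0 */
        return 0;
    }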
45 MVP example
    #pragma omp parallel for private(j)
    for(i=1;i<dim+1;i++){
      y(i)=0;
      for(j=1;j<dim+1;j++){
        y(i)= y(i) + a(i,j) * x(j)
      }
    }
46 Distributed Memory model: HPF
- High Performance Fortran (HPF).
- Based on adding annotations to a sequential Fortran 90 base program.
- Main features: parallel constructs; worksharing based on data distribution statements.
47 Mapping model
[Figure: array a is (optionally) aligned to a template tem; the result is distributed onto abstract processors P; the P -> real-processor mapping is done by the compiler]
    REAL a(n,n)
    !HPF$ TEMPLATE tem(n,n)
    !HPF$ ALIGN a(:,:) WITH tem(:,:)
    !HPF$ PROCESSORS p(5)
    !HPF$ DISTRIBUTE a(block,*) ONTO p
48 Virtual processors
- Create a grid of virtual processors. Examples:
    !HPF$ PROCESSORS p(3,3)
    !HPF$ PROCESSORS q(m)
    !HPF$ PROCESSORS powerpc
49 Distribution
- Distributes the array elements over the virtual processors.
- Various predefined distribution functions: block, block(m); cyclic, cyclic(m); * (all elements in that dimension go to the same processor).
- Examples:
    !HPF$ DISTRIBUTE a(block,block) ONTO p
    !HPF$ DISTRIBUTE a(cyclic,*) ONTO q
    !HPF$ DISTRIBUTE a(*,*) ONTO powerpc
50 Block distribution
- BLOCK means: give equal-sized contiguous blocks of array elements to the processors.
- Suited for problems with an equal computational load per element.
    REAL a(16)
    !HPF$ PROCESSORS p(4)
    !HPF$ DISTRIBUTE a(block) ONTO p
  or
    !HPF$ DISTRIBUTE a(block(8)) ONTO p
- block gives 4 elements each to P0, P1, P2, P3; block(8) gives 8 elements each to P0 and P1.
51 Alignment
- Arrays can be ALIGNed to each other; aligned arrays are collectively distributed through the top array.
- The top array can be specified as a fictitious array, called a TEMPLATE.
- Purpose: specification of locality (two aligned array elements reside on the same processor); allowing replication and collapsing of dimensions; a convenient means to distribute a group of arrays with a single DISTRIBUTE statement.
52 Alignment, example (1)
- One-to-one alignment:
    REAL a(16), b(16)
    !HPF$ ALIGN b(:) WITH a(:)
    !HPF$ DISTRIBUTE a(block) ONTO p
- a and b are distributed identically over P0..P3.
53 Alignment, example (2)
- Collapsed alignment:
    REAL a(16), b(16,3)
    !HPF$ ALIGN b(:,*) WITH a(:)
    !HPF$ DISTRIBUTE a(block) ONTO p
- More alignment functions are available.
54 Parallelism
- F90 array assignment statements
- FORALL statement (FORALL is a new language feature, not a directive)
55 Examples
- Array statements:
    REAL, DIMENSION(N) :: a,b,c
    a = b+c
    a(1:10:2) = b(6:10) + c(1:5)
- FORALL statements:
    FORALL (i=1:n, j=1:m, a(i,j).ne.0) a(i,j) = 1.0 / a(i,j)
- FORALL has copy-in copy-out semantics.
56 MVP in HPF (1)
[Figure: y, A and x distributed by block rows over P0..P3]
    REAL a(m,n), y(m), x(n)
    !HPF$ PROCESSORS p(4)
    !HPF$ ALIGN a(:,*) WITH y(:)
    !HPF$ ALIGN x(:) WITH y(:)
    !HPF$ DISTRIBUTE y(block) ONTO p
    !HPF$ INDEPENDENT
    DO j=1,n
      FORALL (i=1:m)
        y(i) = y(i) + a(i,j)*x(j)
      END FORALL
    END DO
  or
    FORALL (i=1:m) y(i) = DOT_PRODUCT(a(i,:),x(:))
57 MVP in HPF (2)
[Figure: y and A distributed by block rows over P0..P3; x replicated on every processor]
    REAL a(m,n), y(m), x(n)
    !HPF$ PROCESSORS p(4)
    !HPF$ ALIGN a(:,*) WITH y(:)
    !HPF$ ALIGN x(*) WITH y(*)
    !HPF$ DISTRIBUTE y(block) ONTO p
    !HPF$ INDEPENDENT
    FORALL (i=1:m) y(i) = DOT_PRODUCT(a(i,:),x(:))
58 Distributed Memory Model: Message Passing
- No notion of global data: every processor has its own private arrays and variables.
- No separate synchronization constructs: synchronization is implicit in message transfer; barrier synchronization is done by an all-to-all broadcast.
- A program can only identify itself by (locally) calling a function that returns the processor number it is executing on.
- The usual way of programming is SPMD (Single Program, Multiple Data): each processor runs a copy of the same program and parameterizes itself according to the processor identifier, as in the sketch below.
- Difficult to get the code error free (lots of details to handle in the code).
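A minimal SPMD sketch in MPI: every process runs this same program and parameterizes itself by its rank (the row partitioning is an illustrative assumption).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int me, procs, m = 100;               /* m: assumed global problem size */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);   /* "what is my processor number?" */
        MPI_Comm_size(MPI_COMM_WORLD, &procs);
        int rows = m / procs;                 /* parameterize by the identifier */
        printf("process %d of %d handles rows %d..%d\n",
               me, procs, me*rows, (me+1)*rows - 1);
        MPI_Finalize();
        return 0;
    }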
59 Local data structures
[Figure: per processor, y_local = a_local * x_local; x_local is sent to the other processors (first loop), the x parts from the other processors are gathered into temp (second loop), and the local matrix-vector product is the third loop]
60 MVP in Message Passing
    float a_local[m/procs][n], y_local[m/procs], x_local[n/procs], temp[n];
    int mypid;
    mypid = proc();
    k = n/procs;
    for(i=0; i<procs; i++){               /* (1) send my part of x to the others */
      if(mypid != i) send(&x_local[0], k, i);
    }
    for(i=0; i<procs; i++){               /* (2) gather all parts of x into temp */
      if(mypid != i)
        receive(&temp[i*k], k, i);
      else {
        for(j=0; j<k; j++){
          temp[i*k+j] = x_local[j];
        }
      }
    }
    for(i=0; i<k; i++){                   /* (3) local MVP over the k local rows (m = n assumed) */
      for(j=0; j<n; j++){
        y_local[i] = y_local[i] + a_local[i][j]*temp[j];
      }
    }
61 MVP in Message Passing [2]
- We need a local buffer (temp[]) of global size n anyway!
- Whether the previous message passing scheme is efficient is determined by how x_local[] changes during the rest of the computation the MVP is embedded in.
- If it changes locally on the processors, we need to resend it each cycle. But in that case we might as well put all elements of temp in a buffer x_local[] of global size n, saving a local copy loop.
62 Local data structures
[Figure: as before, but x_local now has global size n; the local part of x_local is sent to the other processors (first loop), the missing parts arrive from the other processors (second loop), and the local matrix-vector product is the third loop]
63 MVP in Message Passing [3]
- Resulting in:
    float a_local[m/procs][n], y_local[m/procs], x_local[n];
    int mypid;
    mypid = proc();
    k = n/procs;
    for(i=0; i<procs; i++){               /* (1) send my slice of x to the others */
      if(mypid != i) send(&x_local[mypid*k], k, i);
    }
    for(i=0; i<procs; i++){               /* (2) receive the other slices in place */
      if(mypid != i) receive(&x_local[i*k], k, i);
    }
    for(i=0; i<k; i++){                   /* (3) local matrix-vector product */
      for(j=0; j<n; j++){
        y_local[i] = y_local[i] + a_local[i][j]*x_local[j];
      }
    }
64 Message Passing Observations
- A general observation: with message passing, a local copy of remote data often needs to be created. The only way to avoid that is element-wise transfers, but those are very inefficient.
- An HPF compiler needs to do the same, but there it is hidden from the programmer.
- For example, in stencil-type operations we need to extend the local array with so-called ghost rows.
65 Ghost rows
[Figure: each processor's block of rows is extended with copies of the neighbouring processors' boundary rows]
- Tedious hand-coding is required to get the details right.
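A hedged MPI sketch of a ghost-row exchange for a block-row distributed stencil (all sizes and names are assumptions): each process swaps its boundary rows with its neighbours, and MPI_PROC_NULL turns the transfers at the outer edges into no-ops.

    #include <mpi.h>
    #define N    1024    /* columns (assumed) */
    #define ROWS 64      /* real rows per process (assumed) */

    /* local[0][] and local[ROWS+1][] are the ghost rows. */
    void exchange_ghost_rows(double local[ROWS+2][N]) {
        int me, procs;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &procs);
        int up   = (me > 0)         ? me - 1 : MPI_PROC_NULL;
        int down = (me < procs - 1) ? me + 1 : MPI_PROC_NULL;
        /* first real row goes up; bottom ghost row arrives from below */
        MPI_Sendrecv(local[1],      N, MPI_DOUBLE, up,   0,
                     local[ROWS+1], N, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* last real row goes down; top ghost row arrives from above */
        MPI_Sendrecv(local[ROWS],   N, MPI_DOUBLE, down, 1,
                     local[0],      N, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }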