OpenMP and MPI
Parallel and Distributed Computing
Department of Computer Science and Engineering (DEI)
Instituto Superior Técnico
November 16, 2011

Outline

OpenMP vs MPI
Hybrid programming: OpenMP and MPI together

OpenMP vs MPI

                                        OpenMP   MPI
Suitable for multiprocessors            Yes      Yes
Suitable for multicomputers             No       Yes
Supports incremental parallelization    Yes      No
Minimal extra code                      Yes      No
Explicit control of memory              No       Yes

OpenMP: Advantages / Disadvantages

Advantages:
- Applications are relatively easy to implement.
- Low latency, high bandwidth.
- Allows run-time scheduling and dynamic load balancing.
- Both fine- and coarse-grain parallelism are effective.
- Implicit communication.

Disadvantages:
- Concurrent access to memory may degrade performance.
- Overheads can become an issue when the parallel loops are too small.
- Coarse-grain parallelism often requires a parallelization strategy similar to an MPI strategy, making the implementation more complicated.
- Explicit synchronization is required.

MPI: Advantages / Disadvantages

Advantages:
- MPI involves explicit parallelism, which often provides better performance.
- A number of optimized collective communication routines are available.
- In principle, communication and computation can be overlapped.
- Parallel access to memory.
- Communication often causes synchronization naturally, reducing overheads.

Disadvantages:
- Decomposition, development, and debugging of applications can be a considerable overhead.
- Communication can often create a large overhead, which needs to be minimized.
- The granularity often has to be large; fine-grain granularity can create a large number of communications.
- Dynamic load balancing is often difficult.

Clusters of SMPs

A cluster of SMPs: several SMP nodes connected by a network; each node has its own memory, shared by several CPUs.

Using only MPI: one MPI process runs on every CPU of every node.

Using MPI together with OpenMP: one MPI process per SMP node, with one OpenMP thread per CPU inside each process. Best of both worlds!

Hello World, in OpenMP+MPI

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int iam = 0, np = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }

    MPI_Finalize();
}

Hello World, in OpenMP+MPI

Compiling:

$ mpicc -fopenmp HelloWorld.c -o HelloWorld

Running:

$ ./HelloWorld
Hello from thread 0 out of 2 from process 0 out of 1 on markov
Hello from thread 1 out of 2 from process 0 out of 1 on markov

$ mpirun -np 3 ./HelloWorld
Hello from thread 0 out of 2 from process 0 out of 3 on markov
Hello from thread 1 out of 2 from process 0 out of 3 on markov
Hello from thread 0 out of 2 from process 2 out of 3 on markov
Hello from thread 1 out of 2 from process 2 out of 3 on markov
Hello from thread 0 out of 2 from process 1 out of 3 on markov
Hello from thread 1 out of 2 from process 1 out of 3 on markov
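The two threads per process seen above are typically controlled by the OMP_NUM_THREADS environment variable (or by a call to omp_set_num_threads). Whether mpirun forwards the variable to every rank depends on the MPI implementation, so the invocation below is only a sketch:

$ export OMP_NUM_THREADS=2
$ mpirun -np 3 ./HelloWorld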

A Common Execution Scenario

1. a single MPI process is launched on each SMP node in the cluster;
2. each process spawns N threads on its SMP node;
3. at some global synchronization point, the master threads of the processes communicate with one another;
4. the threads of each process continue until another synchronization point or until completion.

MPI Handling of Threads

Replace MPI_Init by:

int MPI_Init_thread(
    int *argc,       /* command line arguments               */
    char ***argv,
    int required,    /* desired level of thread support      */
    int *provided    /* level of thread support provided     */
)
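A minimal usage sketch (the error check against the returned provided value is our addition, not part of the lecture code):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int provided;

    /* Request MPI_THREAD_FUNNELED: only the main thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    /* The library may grant less than requested; the support levels are
       ordered, so a simple comparison is enough. */
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library provides insufficient thread support\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... hybrid MPI + OpenMP work goes here ... */

    MPI_Finalize();
    return 0;
}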

MPI Handling of Threads

Thread support levels:

MPI_THREAD_SINGLE: only one thread will execute.

MPI_THREAD_FUNNELED: the process may be multi-threaded, but only the main thread will make MPI calls (all MPI calls are funneled to the main thread). (Default)

MPI_THREAD_SERIALIZED: the process may be multi-threaded and multiple threads may make MPI calls, but only one at a time: MPI calls are not made concurrently from two distinct threads (all MPI calls are serialized).

MPI_THREAD_MULTIPLE: multiple threads may call MPI, with no restrictions.

MPI Handling of Threads

MPI communication is process oriented:
- MPI_THREAD_MULTIPLE requires very careful organization of threads;
- it is safer to have all communication handled outside OpenMP parallel regions.
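A minimal self-contained sketch of that safest pattern (the summation example is ours, not from the lecture): all the OpenMP work happens inside the parallel region, and the single MPI call that combines the per-process results is made outside it, so MPI_THREAD_FUNNELED support suffices.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define N 1000000

int main(int argc, char *argv[]) {
    int provided, rank;
    double local = 0.0, global = 0.0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Compute phase: all threads busy, no MPI calls inside the region. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++)
        local += 1.0;

    /* Communication phase: outside the parallel region, one thread per process. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %.0f\n", global);

    MPI_Finalize();
    return 0;
}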

MPI Handling of Threads

MPI_THREAD_FUNNELED:
- use omp barrier, since there is no implicit barrier in the master construct;
- all the other threads sleep in the meantime.

#include <mpi.h>

int main(int argc, char *argv[]) {
    ...
    MPI_Init(&argc, &argv);
    ...
    #pragma omp parallel
    {
        #pragma omp barrier
        #pragma omp master
        {
            MPI_<whatever>
        }
        #pragma omp barrier
    }
    ...
}

MPI Handling of Threads

MPI_THREAD_SERIALIZED:
- only the omp barrier at the beginning is needed, since there is an implicit barrier at the end of the single worksharing construct;
- all the other threads will be sleeping.

#include <mpi.h>

int main(int argc, char *argv[]) {
    ...
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
    ...
    #pragma omp parallel
    {
        #pragma omp barrier
        #pragma omp single
        {
            MPI_<whatever>
        }
    }
    ...
}

MPI Handling of Threads

More sophisticated: one thread communicates while the others keep working during the communication.
- needs at least MPI_THREAD_FUNNELED support;
- can be difficult to manage and load balance!

#include <mpi.h>

int main(int argc, char *argv[]) {
    ...
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
    ...
    #pragma omp parallel
    {
        if (thread == x)
            MPI_<whatever>
        else
            work();
    }
    ...
}

MPI Handling of Threads

Even more sophisticated: use the thread id and the process rank to direct communication to a given core.
- needs MPI_THREAD_MULTIPLE support;
- the level of complexity increases significantly.

#include <mpi.h>

int main(int argc, char *argv[]) {
    ...
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &p);
    ...
    #pragma omp parallel
    {
        tid = omp_get_thread_num();
        if (p == x)
            MPI_Send(smsg, N, MPI_CHAR, 1, tid, MPI_COMM_WORLD);
        else
            MPI_Recv(rmsg, N, MPI_CHAR, 0, tid, MPI_COMM_WORLD, &stat);
    }
    ...
}

Hybrid MPI/OpenMP

The hybrid MPI/OpenMP paradigm is the software trend for clusters of SMP architectures.
- Elegant in concept and architecture: MPI across nodes, OpenMP within nodes.
- Good use of the shared memory system resources (memory, latency, and bandwidth).
- Avoids the extra communication overhead of MPI within a node.
- OpenMP adds fine granularity and allows increased and/or dynamic load balancing.
- Some problems have two-level parallelism naturally.
- Some problems can only use a restricted number of MPI tasks.
- Can have better scalability than both pure MPI and pure OpenMP.

Hybrid Parallelization Strategies

- From sequential code: decompose with MPI first, then add OpenMP.
- From OpenMP code: treat it as serial code.
- From MPI code: add OpenMP. The simplest and least error-prone way is to use MPI outside parallel regions and to allow only the master thread to communicate between MPI tasks.

Adding OpenMP to MPI

- This is the easier of the two options, because program state synchronization is already handled explicitly;
- the benefits depend on how many simple loops can be work-shared;
- the number of MPI processes per SMP node will depend on how many threads one wants to use per process.

Adding MPI to OpenMP

- Not as straightforward as adding OpenMP to an MPI application, because global program state must be explicitly handled with MPI;
- requires careful thought about how each process will communicate with the others;
- may require a complete reformulation of the parallelization, possibly redesigning it from the ground up;
- however, successfully retro-fitting an OpenMP application with MPI usually yields greater improvements in performance and scaling, presumably because the original shared-memory program already takes good advantage of the entire SMP node.

Example: Conjugate Gradient Method

The Conjugate Gradient Method is an iterative method for efficiently solving linear systems of equations

    Ax = b

It can be shown that the function

    q(x) = (1/2) x^T A x - x^T b + c

has a unique minimum, which is the solution to Ax = b.
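As a quick check of that claim (our addition; CG assumes A is symmetric and positive definite), setting the gradient of q to zero recovers the original system:

\nabla q(x) = \tfrac{1}{2}\,(A + A^{T})\,x - b = Ax - b \quad \text{(since } A = A^{T}\text{)},
\qquad \nabla q(x) = 0 \;\Longleftrightarrow\; Ax = b.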

Example: Conjugate Gradient Method

Algorithm for the Conjugate Gradient Method:

1. compute the gradient: g(t) = A x(t-1) - b
2. if g(t)^T g(t) < ε, stop
3. compute the direction vector:
       d(t) = -g(t) + [ g(t)^T g(t) / g(t-1)^T g(t-1) ] d(t-1)
4. compute the step size:
       s(t) = - d(t)^T g(t) / d(t)^T A d(t)
5. compute the new approximation of x: x(t) = x(t-1) + s(t) d(t)

Example: Conjugate Gradient Method

for (it = 0; it < n; it++) {
    denom1 = dot_product(g, g, n);
    matrix_vector_product(id, p, n, a, x, g);
    for (i = 0; i < n; i++)
        g[i] -= b[i];
    num1 = dot_product(g, g, n);

    /* When g is sufficiently close to 0, it is time to halt */
    if (num1 < EPSILON)
        break;

    for (i = 0; i < n; i++)
        d[i] = -g[i] + (num1 / denom1) * d[i];
    num2 = dot_product(d, g, n);
    matrix_vector_product(id, p, n, a, d, tmpvec);
    denom2 = dot_product(d, tmpvec, n);
    s = -num2 / denom2;
    for (i = 0; i < n; i++)
        x[i] += s * d[i];
}

Example: Conjugate Gradient Method

MPI implementation: the program is based on a rowwise block decomposition of matrix A and replication of vector b.

Results of profiling:

Function                 1 CPU    8 CPUs
matrix_vector_product    99.5%    97.5%
dot_product               0.2%     1.1%
cg                        0.3%     1.4%

Example: Conjugate Gradient Method

void matrix_vector_product(int id, int p, int n,
                           double **a, double *b, double *c)
{
    int i, j;
    double tmp;   /* Accumulates sum */

    for (i = 0; i < BLOCK_SIZE(id,p,n); i++) {
        tmp = 0.0;
        for (j = 0; j < n; j++)
            tmp += a[i][j] * b[j];
        piece[i] = tmp;
    }
    new_replicate_block_vector(id, p, piece, n, (void *) c, MPI_DOUBLE);
}
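new_replicate_block_vector is what turns each process's local block piece into the fully replicated result vector c. A plausible sketch of such a helper, assuming the usual BLOCK_LOW/BLOCK_HIGH macros that accompany BLOCK_SIZE and an MPI_Allgatherv-based implementation (the textbook's actual routine may differ):

#include <stdlib.h>
#include <mpi.h>

/* Assumed macros for a rowwise block decomposition of n elements over p processes. */
#define BLOCK_LOW(id, p, n)   ((id) * (n) / (p))
#define BLOCK_HIGH(id, p, n)  (BLOCK_LOW((id) + 1, p, n) - 1)
#define BLOCK_SIZE(id, p, n)  (BLOCK_HIGH(id, p, n) - BLOCK_LOW(id, p, n) + 1)

/* Gather every process's block and leave the complete vector on all processes. */
void new_replicate_block_vector(int id, int p, void *piece, int n,
                                void *full, MPI_Datatype dtype)
{
    int *counts = malloc(p * sizeof(int));   /* block length of each process   */
    int *displs = malloc(p * sizeof(int));   /* offset of each block in 'full' */

    for (int i = 0; i < p; i++) {
        counts[i] = BLOCK_SIZE(i, p, n);
        displs[i] = BLOCK_LOW(i, p, n);
    }

    /* Every process contributes its block and receives the complete vector. */
    MPI_Allgatherv(piece, BLOCK_SIZE(id, p, n), dtype,
                   full, counts, displs, dtype, MPI_COMM_WORLD);

    free(counts);
    free(displs);
}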

Example: Conjugate Gradient Method

Adding OpenMP directives: make the outermost loop parallel.

The outer loop may be executed in parallel if each thread has a private copy of tmp and j:

#pragma omp parallel for private(j,tmp)
for (i = 0; i < BLOCK_SIZE(id,p,n); i++)

Specify the number of active threads per process:

omp_set_num_threads(atoi(argv[3]));
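Putting the directive in place, the hybrid routine would look roughly as follows: a sketch that keeps the same piece buffer, BLOCK_SIZE macro, and replication helper as before, with those external pieces declared here only so the fragment stands on its own.

#include <mpi.h>

/* From the rest of the program (previous slides): */
extern double *piece;   /* holds this process's block of the result vector */
void new_replicate_block_vector(int id, int p, void *piece, int n,
                                void *full, MPI_Datatype dtype);

void matrix_vector_product(int id, int p, int n,
                           double **a, double *b, double *c)
{
    int i, j;
    double tmp;   /* per-row accumulator, private to each thread */

    /* Threads split this process's block of rows among themselves. */
    #pragma omp parallel for private(j, tmp)
    for (i = 0; i < BLOCK_SIZE(id,p,n); i++) {
        tmp = 0.0;
        for (j = 0; j < n; j++)
            tmp += a[i][j] * b[j];
        piece[i] = tmp;
    }

    /* Still a single MPI call per process, outside the parallel region. */
    new_replicate_block_vector(id, p, piece, n, (void *) c, MPI_DOUBLE);
}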

Example: Conjugate Gradient Method

Target system: a commodity cluster with four dual-processor nodes.
- The pure MPI program executes on 1, 2, ..., 8 CPUs; on 1, 2, 3, 4 CPUs each process runs on a different node, maximizing the memory bandwidth per CPU.
- The MPI+OpenMP program executes on 1, 2, 3, 4 processes, each with two threads (i.e., the program executes on 2, 4, 6, 8 threads).

Example: Conjugate Gradient Method

Benchmarking results: (chart comparing the pure MPI and MPI+OpenMP execution times; discussed below)

Example: Conjugate Gradient Method

- The MPI+OpenMP program is slower on 2 and 4 CPUs because its threads share memory bandwidth, while the MPI processes do not.
- The MPI+OpenMP program is faster on 6 and 8 CPUs because it has a lower communication cost.

Review

OpenMP vs MPI
Hybrid programming: OpenMP and MPI together

Next Class

Monte Carlo Methods
Search Algorithms