Chip Multiprocessors COMP35112 Lecture 9 - OpenMP & MPI

Graham Riley
14 February 2018

Today's Lecture

Dividing work to be done in parallel between threads in Java (as you are doing in the labs) is rather awkward, especially if you want to do it more than once in a program. Many programs requiring parallelism are not written in Java anyway, but in Fortran or C or C++. There are two standard parallel programming notations already in widespread use:
- OpenMP = Open MultiProcessing
- MPI = Message Passing Interface

OpenMP

Used to create parallel programs for shared-memory computers. A collection of compiler directives and library functions: http://www.openmp.org
Based on the fork-join model (i.e. very similar to the way we wrote the Java code earlier).
Goals (besides parallel performance):
- Sequential equivalence
- Incremental parallelism

Parallel Regions

Having identified a part of a sequential program which would benefit, annotate it so that multiple threads execute this parallel region and join up again at the end.
Start the region with a pragma (in C) or a comment (in Fortran):

#pragma omp parallel

The code block that follows is then executed in parallel.

Fortran Aside

Fortran is not block-structured, so it uses:

!$OMP PARALLEL
  ...
!$OMP END PARALLEL

This issue crops up with other features and is dealt with similarly. We will stick with C.

Shared and Private Variables

As you would expect, variables declared before the parallel region are shared by default (this can be controlled in the pragma), whereas those declared within it are private (local to each thread).
The following example shows this, and one way to specify the number of threads to be used.
Also note that the code block following a pragma must be a structured block, which means it has a single point of entry (at the top) and a single point of exit (at the bottom). There is an implicit barrier at the end.

Example

#include <stdio.h>
#include <omp.h>

int main() {
    int j = 42 ;                          // shared variable
    #pragma omp parallel num_threads(3)
    {
        int c ;                           // private to each thread
        c = omp_get_thread_num() ;
        printf("c = %d, j = %d\n", c, j) ;
    }
}

Output from Example

c = 0, j = 42
c = 2, j = 42
c = 1, j = 42

Note: the order of the threads' outputs is not defined!
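The slides note that the sharing defaults "can be controlled in the pragma" but do not show how. The fragment below is a minimal sketch, not from the original lecture, using the data-sharing clauses default(none), shared() and private() on the same example; the variable names are just those of the example above.

#include <stdio.h>
#include <omp.h>

int main() {
    int j = 42 ;    // listed as shared below
    int c = 0 ;     // each thread gets its own (uninitialised) private copy below
    // default(none) forces every variable used in the region to be listed
    // explicitly, which is a common way to avoid accidental sharing
    #pragma omp parallel num_threads(3) default(none) shared(j) private(c)
    {
        c = omp_get_thread_num() ;
        printf("c = %d, j = %d\n", c, j) ;
    }
    return 0 ;
}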

Worksharing

Typically we don't want all threads to do the same thing; instead we want them to share the work between them. The most commonly used form of this is loop-splitting:
- Find a time-consuming loop
- Restructure, if necessary, to make the iterations independent of each other
- Split parts of the iteration space across different threads - confusingly, this is known as (loop) scheduling, and there are many different ways of doing it

Loop-splitting Example

double answer, res ;
answer = 0.0 ;
for (j = 0 ; j < N ; j++) {
    res = big_comp(j) ;       // big_comp takes a lot of time
    combine(answer, res) ;    // combine is fairly quick
}

To decouple the loop iterations, one way is to make res an array (or declare it locally, or make it private); then we can compute all the res values in parallel.

Split-loop Version

double answer, res[N] ;    // N hash-defined earlier
answer = 0.0 ;

// If we don't specify num_threads, a default number is used
// (from the environment variable OMP_NUM_THREADS).
// The loop scheduling scheme also defaults (to a block schedule).
#pragma omp parallel for
for (j = 0 ; j < N ; j++) {    // Special case: the loop variable j is NOT shared!
    res[j] = big_comp(j) ;
}

for (j = 0 ; j < N ; j++)      // combine sequentially afterwards
    combine(answer, res[j]) ;

Reduction Operations

A common case is to combine the result of each iteration of a loop using a binary associative operator, e.g. + * & max min (the last two are not operators in C).
- The order of application does not matter
- The initial value to use is implied by the operator
- OpenMP has a reduction clause for this

Reduction Example

double x, pi, step, sum = 0.0 ;
step = 1.0/(double) num_steps ;

#pragma omp parallel for private(x) reduction(+:sum)
for (j = 0 ; j < num_steps ; j++) {
    x = (j + 0.5) * step ;
    sum += 4.0/(1 + x * x) ;
}
pi = step * sum ;

This computes pi by integrating 4/(1+x²) between 0 and 1 using the midpoint (or rectangle) rule.

Synchronization Constructs

- flush: used to ensure memory consistency between threads!
- critical: indicates a critical section (for mutual exclusion); can be named to provide multiple locks
- barrier: can be added explicitly

These are all done using pragmas - an example follows. There are also locks, provided by library routines: omp_init_lock(&l), omp_set_lock(&l), omp_unset_lock(&l)

#pragma omp parallel for private(res)
for (j = 0 ; j < N ; j++) {
    res = big_comp(j) ;
    #pragma omp critical
    combine(answer, res) ;
}

This is how to do the earlier example if combine isn't short but can't safely be executed in parallel.
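The lock routines mentioned above are not demonstrated on the slides. Here is a minimal sketch, not from the original lecture, of the same combine pattern using an explicit omp_lock_t instead of a critical section; like the slide fragments, it assumes answer, res, j, N, big_comp and combine are defined in the enclosing function.

#include <omp.h>

omp_lock_t l ;
omp_init_lock(&l) ;

#pragma omp parallel for private(res)
for (j = 0 ; j < N ; j++) {
    res = big_comp(j) ;
    omp_set_lock(&l) ;      // acquire the lock: only one thread combines at a time
    combine(answer, res) ;
    omp_unset_lock(&l) ;    // release the lock
}

omp_destroy_lock(&l) ;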

Loop Scheduling

Users can affect the scheduling of loop iterations to threads (i.e. which iterations are given to which thread) and whether this is done statically or dynamically. For example:

#pragma omp parallel for schedule(dynamic,10)

indicates that the iteration space should be broken into chunks of 10 iterations, and that each thread should take one chunk at a time and come back for more when it has finished.
A static allocation divides the chunks evenly between the threads at the start - this is a mistake if the iterations vary considerably in the computation they require.

Static Loop Scheduling

There are three commonly used static schedules:
- Block schedule: divide the iterations into contiguous chunks of (near) equal size - the default in OpenMP
- Cyclic schedule: give one iteration to each thread in turn, returning to the first when you run out of threads - useful when the work per iteration changes in an orderly fashion
- Block-cyclic schedule: give a fixed-size chunk of iterations to each thread in turn, returning to the first when you run out of threads - useful for using the cache in a friendly way
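Not on the original slides: a minimal sketch of how these three schedules are typically written with the schedule clause. The chunk size 8 and the loop body work(j) are placeholders.

// Block schedule (what schedule(static) with no chunk size gives you):
// the iterations are split into one contiguous chunk per thread.
#pragma omp parallel for schedule(static)
for (j = 0 ; j < N ; j++) work(j) ;

// Cyclic schedule: chunk size 1, dealt out to the threads round-robin.
#pragma omp parallel for schedule(static,1)
for (j = 0 ; j < N ; j++) work(j) ;

// Block-cyclic schedule: fixed-size chunks (here 8) dealt out round-robin.
#pragma omp parallel for schedule(static,8)
for (j = 0 ; j < N ; j++) work(j) ;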

MPI

MPI can be used to program a wide range of architectures:
- Distributed-memory machines, i.e. clusters and MPPs
- Shared-memory machines

It is a library for C and Fortran (and others), and the most widely used API for parallel programming.

Messages, Processes, Communicators

- Messages are identified by the (rank of the) sending process, the receiving process, and an integer tag
- These are defined within a communication context to avoid confusion (so the same identifiers can be reused, for example in library code, in a different context)
- Processes are grouped initially within a single group, but the group can be split later
- A communication context plus a process group forms a communicator (e.g. MPI_COMM_WORLD)

MPI Hello World

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char ** argv) {
    int num_procs ;    // number of processes
    int ID ;           // will be in range 0 <= ID < num_procs
    if (MPI_Init(&argc, &argv) != MPI_SUCCESS) exit(-1) ;
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs) ;    // ask how many processes there are
    MPI_Comm_rank(MPI_COMM_WORLD, &ID) ;           // ask which one am I!
    printf("Hello World from process %i of %i\n", ID, num_procs) ;
    MPI_Finalize() ;
}
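The previous slide mentioned that the initial process group can be split; the slides do not show this, so here is a minimal sketch (not from the original lecture) using MPI_Comm_split to divide MPI_COMM_WORLD into two sub-communicators by rank parity.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char ** argv) {
    int world_rank, sub_rank ;
    MPI_Comm sub_comm ;
    MPI_Init(&argc, &argv) ;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank) ;
    // Processes with the same "color" (here: rank parity) end up in the same
    // new communicator; the "key" (here: world_rank) orders ranks within it.
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm) ;
    MPI_Comm_rank(sub_comm, &sub_rank) ;
    printf("world rank %i has rank %i in its sub-communicator\n", world_rank, sub_rank) ;
    MPI_Comm_free(&sub_comm) ;
    MPI_Finalize() ;
}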

Blocking Send and Recv

int MPI_Send(buff, count, MPI_type, dest, tag, COMM) ;
int MPI_Recv(buff, count, MPI_type, source, tag, COMM, &stat) ;

- buff: a pointer to a buffer
- count: the number of items of type MPI_type
- MPI_type: e.g. MPI_DOUBLE, MPI_INT, ...
- dest: the rank of the receiving process
- source: the rank of the sending process (can be MPI_ANY_SOURCE in Recv)
- tag: an integer (can be MPI_ANY_TAG in Recv)
- COMM: the MPI communicator (e.g. MPI_COMM_WORLD)
- stat: a structure holding status information

Send can also behave in a non-blocking way - this is implementation-dependent - and you can also allocate an explicit buffer to make it non-blocking. A concrete send/receive sketch follows.
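A minimal sketch, not from the slides, showing the blocking calls in use: rank 0 sends one double to rank 1, which receives and prints it. It assumes the program is run with at least two processes.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char ** argv) {
    int ID ;
    double value = 3.14 ;
    MPI_Status stat ;
    MPI_Init(&argc, &argv) ;
    MPI_Comm_rank(MPI_COMM_WORLD, &ID) ;
    if (ID == 0) {
        // send one double to the process with rank 1, using tag 0
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD) ;
    } else if (ID == 1) {
        // receive one double from rank 0 (blocks until the message arrives)
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &stat) ;
        printf("process 1 received %f\n", value) ;
    }
    MPI_Finalize() ;
}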

Collective Operations

Besides the various send and receive operations there are collective operations, e.g.
- MPI_Barrier: barrier synchronisation
- MPI_Bcast: broadcast
- MPI_Scatter: broadcast with patterns of data movement
- MPI_Gather: the inverse of scatter
- MPI_Reduce: a reduction operation

Using such operations, many programs make little use of the more primitive Send and Recv operations.
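To connect this to the earlier OpenMP reduction example, here is a minimal sketch (not from the slides) in which each process computes a partial sum of the pi integration and MPI_Reduce combines the partial sums on rank 0. The value of num_steps is chosen only for illustration.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char ** argv) {
    int ID, num_procs, j ;
    int num_steps = 1000000 ;
    double x, pi, step, sum = 0.0 ;
    MPI_Init(&argc, &argv) ;
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs) ;
    MPI_Comm_rank(MPI_COMM_WORLD, &ID) ;
    step = 1.0/(double) num_steps ;
    // each process handles a cyclic slice of the iteration space
    for (j = ID ; j < num_steps ; j += num_procs) {
        x = (j + 0.5) * step ;
        sum += 4.0/(1 + x * x) ;
    }
    // combine the partial sums with + onto the root process (rank 0)
    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD) ;
    if (ID == 0) printf("pi = %f\n", pi * step) ;
    MPI_Finalize() ;
}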

Other MPI Features

- Asynchronous, i.e. non-blocking, communication: MPI_Isend and MPI_Irecv - with these, you need MPI_Wait and/or MPI_Test (see the sketch below)
- Persistent communication
- Modes of communication: standard, synchronous, buffered
- MPI also supports one-sided messaging: MPI_Put(), MPI_Get()
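A minimal non-blocking sketch, not from the slides: rank 0 posts an MPI_Isend and rank 1 an MPI_Irecv, and each waits on its request before reusing or reading the buffer. It assumes at least two processes.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char ** argv) {
    int ID ;
    double value = 2.718 ;
    MPI_Request req ;
    MPI_Status stat ;
    MPI_Init(&argc, &argv) ;
    MPI_Comm_rank(MPI_COMM_WORLD, &ID) ;
    if (ID == 0) {
        MPI_Isend(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req) ;
        // ... computation could be overlapped with the communication here ...
        MPI_Wait(&req, &stat) ;    // the send buffer may only be reused after this
    } else if (ID == 1) {
        MPI_Irecv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req) ;
        // ... computation could be overlapped with the communication here ...
        MPI_Wait(&req, &stat) ;    // the data is only guaranteed to have arrived after this
        printf("process 1 received %f\n", value) ;
    }
    MPI_Finalize() ;
}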

Next Lecture

Using locks is inherently pessimistic: to be safe, anything which could cause a problem if executed in parallel is made to execute sequentially. You pay a high price for what might actually be very rare. So why not be optimistic? Speculate, i.e. do things which might not be right, and then check whether you need to throw them away (hoping this happens only occasionally).