MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016


Message passing vs. Shared memory [Diagram: clients exchanging messages (send(msg)/recv(msg)) via IPC vs. clients sharing a memory region] Message passing: exchange data explicitly via IPC. Application developers define the protocol, the exchange format, the number of participants, and each exchange. Shared memory: allow multiple processes to share data via memory. Applications must locate and map shared memory regions to exchange data.

Architectures Uniform Memory Access (UMA), e.g., Cray 2. Non-Uniform Memory Access (NUMA), e.g., SGI Altix 3700. Massively parallel distributed memory, e.g., Blue Gene/L. The architecture is orthogonal to the programming model.

MPI MPI = Message Passing Interface: a library standard defined by a committee of vendors, implementers, and parallel programmers. Used to create parallel programs based on message passing. Portable: one standard, many implementations. Available on almost all parallel machines, in C and Fortran. De facto standard platform for the HPC community.

Groups, Communicators, Contexts Group: a fixed, ordered set of k processes, i.e., 0, 1, ..., k-1. Communicator: specifies the scope of communication, either between processes in a group or between two disjoint groups. Context: partitions the communication space; a message sent in one context cannot be received in another context. [Image credit: Writing Message Passing Parallel Programs with MPI, Course Notes, Edinburgh Parallel Computing Centre, The University of Edinburgh]
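A hedged sketch (not from the original slides) of how new communicators, each with its own group and context, can be derived from MPI_COMM_WORLD; the even/odd split used here is an illustrative assumption.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Split MPI_COMM_WORLD into two disjoint groups: even and odd ranks.
     * Each new communicator carries its own context, so a message sent on
     * sub_comm cannot be received on MPI_COMM_WORLD. */
    int color = world_rank % 2;
    MPI_Comm sub_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    int sub_rank, sub_size;
    MPI_Comm_rank(sub_comm, &sub_rank);
    MPI_Comm_size(sub_comm, &sub_size);
    printf("world rank %d -> rank %d of %d in group %d\n",
           world_rank, sub_rank, sub_size, color);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}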

Synchronous vs. Asynchronous Message Passing A synchronous communication is not complete until the message has been received. An asynchronous communication completes before the message is received.

Communication Modes Synchronous: completes once the acknowledgment is received by the sender. Asynchronous: 3 modes. Standard send: completes once the message has been sent, which may or may not imply that the message has arrived at its destination. Buffered send: completes immediately; if the receiver is not ready, MPI buffers the message locally. Ready send: completes immediately; if the receiver is ready for the message it will get it, otherwise the message is dropped silently.
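A minimal sketch (an addition, not from the slides) of how these modes map onto MPI calls; the two-rank setup and tag value are illustrative assumptions.

#include <mpi.h>
#include <stdlib.h>

/* Assumes at least 2 ranks: rank 0 sends to rank 1 with tag 0. */
void send_modes_demo(int rank) {
    int value = 42;
    if (rank == 0) {
        /* Synchronous send: returns only after the matching receive has started. */
        MPI_Ssend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Buffered send: returns immediately; if the receiver is not ready,
         * MPI copies the message into the user-attached buffer. */
        int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        void *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);
        MPI_Bsend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
    } else if (rank == 1) {
        int recvd;
        MPI_Recv(&recvd, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&recvd, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}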

Blocking vs. Non-Blocking Blocking means the program will not continue until the communication is completed. Examples: synchronous communication; barriers, which wait for every process in the group to reach a point in execution. Non-blocking means the program will continue without waiting for the communication to be completed.
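A hedged sketch (not part of the original slides) of non-blocking communication with MPI_Isend/MPI_Irecv; the neighbor-exchange pattern is an illustrative assumption.

#include <mpi.h>

/* Exchange one int with a neighbor rank without blocking, then overlap
 * communication with independent local work before waiting. */
void exchange_with_neighbor(int my_value, int neighbor, int *their_value) {
    MPI_Request reqs[2];
    MPI_Irecv(their_value, 1, MPI_INT, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&my_value, 1, MPI_INT, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... independent computation can run here while messages are in flight ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* block until both complete */
}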

MPI library: huge (125 functions); basic (6 functions).

MPI Basic Many parallel programs can be written using just these six functions, only two of which are non-trivial: MPI_INIT, MPI_FINALIZE, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_SEND, MPI_RECV.

Skeleton MPI Program (C)

#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    /* main part of the program */
    /* Use MPI function calls depending on your data
     * partitioning and the parallelization architecture */

    MPI_Finalize();
    return 0;
}

A minimal MPI program (C)

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    printf("Hello, world!\n");
    MPI_Finalize();
    return 0;
}

A minimal MPI program (C) #include "mpi.h" provides basic MPI definitions and types. MPI_Init starts MPI. MPI_Finalize exits MPI. Notes: Non-MPI routines are local; this printf runs on each process. MPI functions return error codes or MPI_SUCCESS.

Error handling By default, an error causes all processes to abort. The user can install his/her own error handling routines. Some custom error handlers are available for downloading from the net.
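A small sketch (an assumption, not from the slides) of switching MPI_COMM_WORLD from the default abort-on-error behavior to returning error codes that the program can inspect.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    /* Replace the default MPI_ERRORS_ARE_FATAL handler so MPI calls on this
     * communicator return an error code instead of aborting all processes. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int size, data = 42;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Deliberately erroneous send: destination rank 'size' does not exist. */
    int err = MPI_Send(&data, 1, MPI_INT, size, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "MPI_Send failed: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}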

Improved Hello (C)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);

    /* rank of this process in the communicator */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* get the size of the group associated with the communicator */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Improved Hello (C)

/* Find out rank, size */
int world_rank, world_size;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

int number;
if (world_rank == 0) {
    number = -1;
    /* arguments: buffer, number of elements, type, rank of destination,
     * tag to identify the message, communicator (default) */
    MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
    /* arguments: buffer, count, type, rank of source, tag,
     * communicator, status */
    MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process 1 received number %d from process 0\n", number);
}

Many other functions MPI_Bcast: send the same piece of data to all processes in the group. MPI_Scatter: send different pieces of an array to different processes (i.e., partition an array across processes). From: http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/
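A hedged sketch (not from the slides) of MPI_Bcast and MPI_Scatter; the chunk size, data type, and root rank are illustrative assumptions.

#include <mpi.h>

/* Root broadcasts a parameter to everyone, then scatters one chunk of an
 * array to each process. 'full' needs size*chunk elements on the root;
 * every rank receives 'chunk' elements into 'part'. */
void distribute(int rank, double *full, double *part, int chunk) {
    double alpha = 0.0;
    if (rank == 0) alpha = 2.5;               /* value known only to the root */
    MPI_Bcast(&alpha, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Scatter(full, chunk, MPI_DOUBLE,      /* send buffer (used by root)   */
                part, chunk, MPI_DOUBLE,      /* receive buffer (all ranks)   */
                0, MPI_COMM_WORLD);           /* root rank, communicator      */
}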

Many other functions MPI_Gather: takes elements from many processes and gathers them at one single process. E.g., parallel sorting, searching. From: http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/
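Continuing the sketch above (again an assumption, not from the slides), the root can collect the processed chunks back with MPI_Gather.

#include <mpi.h>

/* Each rank contributes 'chunk' doubles from 'part'; the root receives all
 * contributions, concatenated in rank order, into 'full'. */
void collect(double *part, double *full, int chunk) {
    MPI_Gather(part, chunk, MPI_DOUBLE,   /* what each rank sends        */
               full, chunk, MPI_DOUBLE,   /* receive buffer (root only)  */
               0, MPI_COMM_WORLD);        /* root rank, communicator     */
}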

Many other functions MPI_Reduce: takes an array of input elements on each process and returns an array of output elements to the root process, combined with a specified operation. MPI_Allreduce: like MPI_Reduce, but distributes the result to all processes. From: http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/
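A minimal sketch (not from the slides) of MPI_Reduce summing per-process partial results; the variable names and root rank are assumptions.

#include <mpi.h>

/* Sum each rank's local partial result. Only the root gets the total with
 * MPI_Reduce; MPI_Allreduce would leave the total on every rank. */
double global_sum(double local, int rank) {
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Alternative: MPI_Allreduce(&local, &total, 1, MPI_DOUBLE,
     *                            MPI_SUM, MPI_COMM_WORLD); */
    return (rank == 0) ? total : 0.0;
}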

MPI Discussion Gives full control to the programmer: exposes the number of processes, and communication is explicit, driven by the program. Assumes long-running processes and homogeneous (same performance) processors. Little support for failures, no straggler mitigation. Summary: achieves high performance by hand-optimizing jobs, but requires experts to do so, and offers little support for fault tolerance.

OpenMP Based on the Introduction to OpenMP presentation: (webcourse.cs.technion.ac.il/236370/winter2009.../openmplecture.ppt)

Motivation Multicore CPUs are everywhere: servers with over 100 cores today, and even smartphone CPUs have 8 cores. Multithreading is the natural programming model: all processors share the same memory, threads in a process see the same address space, and many shared-memory algorithms have been developed.

But Multithreading is hard: lots of expertise necessary; deadlocks and race conditions; non-deterministic behavior makes it hard to debug.

Example Parallelize the following code using threads:

for (i = 0; i < n; i++) {
    sum = sum + sqrt(sin(data[i]));
}

Why hard? Need a mutex to protect the accesses to sum. Different code for the serial and parallel versions. No built-in tuning (# of processors?).

OpenMP A language extension with constructs for parallel programming: critical sections, atomic access, private variables, barriers. Parallelization is orthogonal to functionality: if the compiler does not recognize the OpenMP directives, the code remains functional (albeit single-threaded). Industry standard: supported by Intel, Microsoft, IBM, HP.

OpenMP execution model Fork and join: the master thread spawns a team of worker threads as needed. [Diagram: the master thread forks into parallel regions executed by worker threads and joins back after each region.]
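A minimal fork-join sketch (an addition, not from the original slides); the number of threads printed depends on the runtime and environment.

#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("before: only the master thread runs\n");

    #pragma omp parallel   /* FORK: a team of threads executes this block */
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                      /* JOIN: implicit barrier, master continues alone */

    printf("after: back to the master thread\n");
    return 0;
}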

OpenMP memory model Shared memory model: threads communicate by accessing shared variables. The sharing is defined syntactically: any variable that is seen by two or more threads is shared; any variable that is seen by one thread only is private. Race conditions are possible: use synchronization to protect from conflicts, or change how data is stored to minimize the synchronization.
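A hedged sketch (an addition, not from the slides) of protecting a shared update with OpenMP synchronization; the histogram example and bin layout are illustrative assumptions (the reduction clause shown later in this lecture is often the better fix when the shared update is a simple accumulation).

#include <omp.h>

/* 'hist' is shared by all threads; concurrent increments would race unless
 * each update is protected. Assumes 0 <= data[i] < number of bins. */
void histogram(const int *data, int n, int *hist) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic        /* make the shared update atomic */
        hist[data[i]]++;
    }
}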

OpenMP: Work sharing example

answer1 = long_computation_1();
answer2 = long_computation_2();
if (answer1 != answer2) { ... }

How to parallelize?

OpenMP: Work sharing example

answer1 = long_computation_1();
answer2 = long_computation_2();
if (answer1 != answer2) { ... }

How to parallelize?

#pragma omp sections
{
    #pragma omp section
    answer1 = long_computation_1();
    #pragma omp section
    answer2 = long_computation_2();
}
if (answer1 != answer2) { ... }

OpenMP: Work sharing example Sequential code for (int i=0; i<n; i++) { a[i]=b[i]+c[i]; }

OpenMP: Work sharing example (Semi) manual parallelization:

#pragma omp parallel             /* launch nt threads */
{
    /* each thread uses its id and nt to operate on a
     * different segment of the arrays */
    int id = omp_get_thread_num();
    int nt = omp_get_num_threads();
    int i_start = id * n / nt, i_end = (id + 1) * n / nt;
    for (int i = i_start; i < i_end; i++) { a[i] = b[i] + c[i]; }
}

OpenMP: Work sharing example Automatic parallelization of the for loop using #parallel for:

#pragma omp parallel
{
    #pragma omp for schedule(static)
    for (int i = 0; i < n; i++) { a[i] = b[i] + c[i]; }
}

The parallelized loop must have canonical form: one signed loop variable; initialization: var = init; comparison: var op last, where op is <, >, <=, or >=; increment: var++, var--, var += incr, or var -= incr.

Challenges of #parallel for Load balancing: if all iterations execute at the same speed, the processors are used optimally; if some iterations are faster, some processors may become idle, reducing the speedup. We don't always know the distribution of work and may need to re-distribute it dynamically. Granularity: thread creation and synchronization take time; assigning work to threads at per-iteration resolution may take more time than the execution itself, so the work must be coalesced into coarse chunks to overcome the threading overhead. There is a trade-off between load balancing and granularity.

Schedule: controlling work distribution schedule(static [, chunksize]): default; chunks of approximately equal size, one to each thread; if there are more chunks than threads, they are assigned round-robin to the threads. Why might one want to use chunks of different size? schedule(dynamic [, chunksize]): threads receive chunk assignments dynamically; default chunk size = 1. schedule(guided [, chunksize]): start with large chunks; threads receive chunks dynamically; chunk size decreases exponentially, down to chunksize.
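A short sketch (an addition, not from the slides) contrasting static and dynamic scheduling for a loop with uneven per-iteration cost; work() and the chunk size of 16 are illustrative assumptions.

#include <omp.h>

/* Hypothetical per-iteration work whose cost varies with i. */
extern double work(int i);

double run(int n, double *out) {
    double total = 0.0;

    /* Iterations have very different costs, so hand out chunks of 16
     * iterations on demand instead of using a fixed static partition. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; i++) {
        out[i] = work(i);
        total += out[i];
    }
    return total;
}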

OpenMP: Data Environment Shared memory programming model: most variables (including locals declared outside the parallel region) are shared by threads.

{
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        sum += i;          /* sum is shared across threads; i is private */
}

Global variables are shared. Some variables can be private: variables inside the statement block, variables in the called functions, and variables explicitly declared as private.

Overriding storage attributes private: a copy of the variable is created for each thread; there is no connection between the original variable and the private copies (can achieve the same using variables inside { }). firstprivate: same, but the initial value of the variable is copied from the main copy. lastprivate: same, but the last value of the variable is copied back to the main copy.

int i;
#pragma omp parallel for private(i)
for (i = 0; i < n; i++) { ... }

int idx = 1;
int x = 10;
#pragma omp parallel for \
    firstprivate(x) lastprivate(idx)
for (i = 0; i < n; i++) {
    if (data[i] == x)
        idx = i;
}

Reduction

for (j = 0; j < n; j++) {
    sum = sum + a[j] * b[j];
}

How to parallelize this code? sum is not private, but accessing it atomically is too expensive. Instead, have a private copy of sum in each thread, then add them up: use the reduction clause.

#pragma omp parallel for reduction(+: sum)

Any associative operator can be used: +, -, *, &&, ||, etc. The private value is initialized automatically (to 0, 1, ~0, ... depending on the operator).

#pragma omp reduction

float dot_prod(float *a, float *b, int N) {
    float sum = 0.0;
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

Conclusions OpenMP: a framework for code parallelization. Available for C, C++, and Fortran. Based on a standard, with implementations from a wide selection of vendors. Relatively easy to use: write (and debug!) code first, parallelize later; parallelization can be incremental; parallelization can be turned off at runtime or compile time; the code is still correct for a serial machine.