Implementation of Parallelization

Implementation of Parallelization: OpenMP, PThreads and MPI
Jascha Schewtschenko
Institute of Cosmology and Gravitation, University of Portsmouth
May 9, 2018

Outline
1 OpenMP
2 POSIX Threads
3 MPI
4 Debugging

OpenMP

OpenMP - Goals
- Standardization: provide a standard for a variety of platforms/shared-memory architectures
- Lean and Mean: simple and limited set of directives; very few uses of directives needed
- Ease of Use: a program can be parallelized incrementally (the source stays the same except for added directives); supports both coarse-grain and fine-grain parallelism
- Portability: public API, implementations for C, C++ and Fortran

OpenMP - Structure & Implementations
- supported by/shipped with various compilers for various platforms (e.g. Intel and GNU compilers for Linux); to compile, simply add the corresponding option, e.g. gcc -fopenmp
- uses a fork-join model
- comprised of 3 API components: Compiler Directives, Runtime Library Routines, Environment Variables

OpenMP - Compiler Directives
We will focus here on the C/C++ syntax; the Fortran syntax is slightly different:
#pragma omp <directive name> [<clauses>]   (C/C++)
!$omp [END] <directive name> [<clauses>]   (Fortran)
Used for:
- defining parallel regions / spawning threads
- distributing loop iterations or sections of code between threads
- serializing sections of code (e.g. for access to I/O or shared variables)
- synchronizing threads
You can find a reference sheet for the C/C++ API for OpenMP 4.0 in the source code archive for this workshop.

OpenMP - Runtime Library Routines
These routines, provided by the OpenMP library, are used to configure and monitor the multithreading during execution, e.g.
- omp_get_num_threads: returns the number of threads in the current team
- omp_in_parallel: checks whether we are in a parallel region
- omp_set_schedule: modifies the scheduler policy
There are further routines for locks for synchronization/access control (see later) as well as timing routines for recording the elapsed time for each thread.
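A short, illustrative sketch (not from the original slides) of how such runtime routines are typically called from C:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("in parallel region? %d\n", omp_in_parallel());  /* 0 in the serial part */

        omp_set_num_threads(4);            /* request a team of 4 threads */

        #pragma omp parallel
        {
            #pragma omp single
            printf("team size: %d\n", omp_get_num_threads());
        }

        double t0 = omp_get_wtime();       /* wall-clock timing routine */
        /* ... work ... */
        printf("elapsed: %f s\n", omp_get_wtime() - t0);
        return 0;
    }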

OpenMP - Environment Variables
As for most programs in the UNIX world, environment variables are used to store configurations needed for running the program. In OpenMP, they are used for setting e.g. the number of threads per team (OMP_NUM_THREADS), the maximum number of threads (OMP_THREAD_LIMIT) or the scheduler policy (OMP_SCHEDULE).
While most of these settings can also be made using clauses in the compiler directives or runtime library routines, environment variables give the user an easy way to change these crucial settings without the need for an additional config file (parsed by your program) or even rewriting/recompiling the OpenMP-enhanced program.

OpenMP - Worksharing

OpenMP - Worksharing (examples)
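The worksharing examples on the original slide are not reproduced in this transcription; as an illustrative stand-in, the sketch below shows the two most common worksharing constructs, a parallel for loop and parallel sections:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000

    int main(void)
    {
        double a[N], b[N];

        /* Loop worksharing: iterations are divided among the threads of the team. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        /* Section worksharing: each independent section is executed by one thread. */
        #pragma omp parallel sections
        {
            #pragma omp section
            { for (int i = 0; i < N; i++) a[i] *= 0.5; }

            #pragma omp section
            { for (int i = 0; i < N; i++) b[i] = 1.0 * i; }
        }

        printf("a[1] = %f, b[1] = %f\n", a[1], b[1]);
        return 0;
    }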

OpenMP - Advanced Worksharing (Tasks)
- defines explicit tasks, similar to sections, that are generated (usually by a single thread) and then deferred to any thread in the team via a queue/scheduler
- tasks are not necessarily tied to a single thread; they can e.g. be postponed or migrated to other threads
- allows for defining dependencies among tasks (e.g. task X has to finish before any thread can work on task Y)

OpenMP - Advanced Worksharing (example)
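The task example on the original slide is not reproduced here; as an illustrative stand-in (a sketch, not the slide's code), this is the classic recursive Fibonacci computation expressed with OpenMP tasks:

    #include <stdio.h>
    #include <omp.h>

    /* Naive recursive Fibonacci, parallelized with explicit tasks. */
    long fib(int n)
    {
        long x, y;
        if (n < 2) return n;

        #pragma omp task shared(x)
        x = fib(n - 1);          /* deferred to any thread in the team */

        #pragma omp task shared(y)
        y = fib(n - 2);

        #pragma omp taskwait     /* wait for the two child tasks */
        return x + y;
    }

    int main(void)
    {
        long result;
        #pragma omp parallel
        {
            #pragma omp single   /* one thread generates the tasks */
            result = fib(20);
        }
        printf("fib(20) = %ld\n", result);
        return 0;
    }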

OpenMP - Synchronization / Flow Control
In the Introduction to Parallelization, we discussed the need to control the execution of threads at certain points, e.g. to synchronize them to exchange intermediate results, or to protect resources from being accessed simultaneously with a non-deterministic outcome ("race condition"). OpenMP provides two ways to do this:
Compiler directives:
- (for general parallel regions) e.g. cancel, single, master, critical, atomic, barrier
- (for loops) ordered
- (for tasks) taskwait, taskyield
Runtime library routines: omp_set_lock, omp_unset_lock, omp_test_lock

OpenMP - Synchronization / Flow Control (RESTRICTION)
[diagram: BARRIER synchronizes all threads of the team; SINGLE carries an implicit BARRIER at its end, while MASTER has NO implicit BARRIER]

OpenMP - Synchronization / Flow Control (MUTEX)
[diagram: threads take turns executing a CRITICAL/ATOMIC region one at a time (#1 executes, #1 exits, #3 enters, #3 executes, #2 executes, ...)]
- CRITICAL and ATOMIC regions are exclusive for ALL threads, not just the team
- CRITICAL regions can be named; regions with the same name are treated as the same region
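An illustrative sketch (not from the original slides) combining atomic, a named critical region, a barrier and master:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int counter = 0;
        double sum = 0.0;

        #pragma omp parallel
        {
            /* atomic: lightweight protection of a single memory update */
            #pragma omp atomic
            counter++;

            /* critical: only one thread at a time executes this block */
            #pragma omp critical(accumulate)
            {
                sum += omp_get_thread_num();
            }

            #pragma omp barrier   /* all threads wait here before continuing */

            #pragma omp master    /* only the master thread prints the result */
            printf("threads: %d, sum of thread ids: %f\n", counter, sum);
        }
        return 0;
    }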

OpenMP - Memory Management (CLAUSES)
- certain clauses of the compiler directives allow us to specify how data is shared (e.g. shared, private, threadprivate) and how it is initialized (e.g. firstprivate, copyin)
- others, like copyprivate, allow for broadcasting the content of private variables from one thread to all others
- similarly, the reduction clause provides an elegant way to gather private data from the threads when joining them
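A brief sketch (not from the original slides) showing the firstprivate and reduction clauses on a parallel loop:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        double sum = 0.0;
        double scale = 0.5;   /* read-only input, copied into each thread */

        /* Each thread keeps a private partial sum; the reduction clause
           combines the partial sums into the shared variable on join. */
        #pragma omp parallel for firstprivate(scale) reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += scale * i;

        printf("sum = %f\n", sum);
        return 0;
    }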

OpenMP - Memory Management (FLUSHING DATA)
- even if shared, a variable may sometimes not be updated in the global view, e.g. if it is kept in a register or cache of a CPU instead of the shared memory
- while many directives (e.g. for, section, critical) implicitly flush variables to synchronize them with other threads, sometimes explicit flushing using the flush directive may be necessary

OpenMP - Memory Management (STACK)
The OpenMP standard does not specify a default stack size for each thread, so it depends on the compiler, e.g.
- icc/ifort (Linux): approx. stack limit 4 MB
- gcc/gfortran (Linux): approx. stack limit 2 MB
If the stack allocation is exceeded, this may result in a segmentation fault or (worse) data corruption.
The environment variable OMP_STACKSIZE allows you to set the stack size prior to execution. So if your program needs a significant amount of data on the stack, make sure to adapt the stack size this way!

POSIX Threads

PThreads - History/Goals
- standardized API for multithreading to allow for portable threaded applications
- first defined in the IEEE POSIX standard 1003.1c in 1995, but undergoes continuous evolution/revision
- historically, implementations focused on Unix as the OS, but implementations now also exist for others, e.g. for Windows

PThreads - Compiling & Running
- like OpenMP, POSIX Threads are included in most recent compiler suites by default
- to enable the included libraries, use e.g. icc -pthread for Intel (Linux) or gcc -pthread for GNU (Linux)

PThreads - API
The subroutines defined in the API can be classified into four major groups:
- Thread management: for creating new threads, checking their properties, and joining/destroying them at the end of their lifecycle (pthread_*, pthread_attr_*)
- Mutexes: for creating mutex locks to control access to exclusive resources (pthread_mutex_*, pthread_mutexattr_*)
- Condition variables: routines for managing condition variables to allow for easy communication between threads that share a mutex (pthread_cond_*, pthread_condattr_*)
- Synchronization: barriers, read/write locks (pthread_barrier_*, pthread_rwlock_*)

PThreads - Thread Management: Creation & Termination
POSIX threads (pthread_t) are created explicitly using pthread_create(thread, attr, start_routine, arg), where
- attr is a thread attribute structure containing settings for creating/running the thread
- start_routine is a procedure that serves as the starting point for the thread
- arg is a pointer to the argument for the starting routine (it can point to a single data element, an array or a custom data structure)
They terminate when finishing their starting routine, by calling pthread_exit(status) to return a status flag, by another thread calling pthread_cancel(thread) with thread pointing to them, or by the host process finishing first (without a pthread_exit() call).

PThreads - Thread Management: Example 1
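The code of Example 1 is not reproduced in this transcription; a minimal sketch along the same lines, creating a few threads with pthread_create and waiting for them with pthread_join (names such as hello and NUM_THREADS are illustrative; compile with gcc -pthread):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_THREADS 4

    /* Starting routine: receives its argument via a void pointer. */
    void *hello(void *arg)
    {
        long id = (long)arg;
        printf("Hello from thread %ld\n", id);
        pthread_exit(NULL);          /* terminate this thread */
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];

        for (long t = 0; t < NUM_THREADS; t++) {
            if (pthread_create(&threads[t], NULL, hello, (void *)t) != 0) {
                fprintf(stderr, "pthread_create failed\n");
                exit(EXIT_FAILURE);
            }
        }

        /* Wait for all worker threads before the process exits. */
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);

        return 0;
    }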

PThreads - Thread Management: Joining & Detaching
- joining threads allows the master thread to synchronize with its worker threads on completion of their task
- threads can be declared joinable on creation
- a thread's data (Thread Control Block) remains in memory after its completion until pthread_join is called on this dead thread and the clean-up is triggered
- detached threads do not keep such (potentially unnecessary) data, i.e. they get cleaned up directly on completion

PThreads - Mutexes
Mutexes work in a similar way to OpenMP locks: once a mutex is claimed by one thread, other threads encountering it will be held until the mutex is released again.

PThreads - Joining & Mutexes: Example
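The example code from this slide is not reproduced here; a minimal sketch of joining plus a mutex-protected shared counter (illustrative only, compile with gcc -pthread):

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define ITERATIONS  100000

    static long counter = 0;                       /* shared resource */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *work(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERATIONS; i++) {
            pthread_mutex_lock(&lock);             /* claim the mutex */
            counter++;                             /* exclusive access */
            pthread_mutex_unlock(&lock);           /* release it again */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];

        for (int t = 0; t < NUM_THREADS; t++)
            pthread_create(&threads[t], NULL, work, NULL);

        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);        /* synchronize on completion */

        printf("counter = %ld (expected %d)\n", counter, NUM_THREADS * ITERATIONS);
        pthread_mutex_destroy(&lock);
        return 0;
    }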

PThreads - Condition Variables
- condition variables control the flow of threads, like mutexes
- instead of claiming a lock, they allow threads to wait (pthread_cond_wait()) until another thread sends a signal (pthread_cond_signal()) through the condition variable to continue
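A compact producer/consumer sketch (not from the original slides) showing the usual pattern of a condition variable guarded by a mutex:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
    static int data_available = 0;

    void *producer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        data_available = 1;                 /* change the shared state ... */
        pthread_cond_signal(&ready);        /* ... and wake up one waiting thread */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    void *consumer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!data_available)             /* guard against spurious wake-ups */
            pthread_cond_wait(&ready, &lock);   /* releases the mutex while waiting */
        printf("consumer: data is available\n");
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t prod, cons;
        pthread_create(&cons, NULL, consumer, NULL);
        pthread_create(&prod, NULL, producer, NULL);
        pthread_join(prod, NULL);
        pthread_join(cons, NULL);
        return 0;
    }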

PThreads - Synchronization: Barriers
- POSIX Threads also feature a synchronization barrier similar to OpenMP's
- since there is no team structure like in OpenMP, the number of threads that have to reach the barrier before any of them is allowed to pass is defined when the barrier is created
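An illustrative sketch (not from the slides) of a POSIX barrier whose participant count is fixed at initialization:

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    static pthread_barrier_t barrier;

    void *phase_worker(void *arg)
    {
        long id = (long)arg;
        printf("thread %ld: phase 1 done\n", id);
        /* No thread proceeds until NUM_THREADS threads have reached this point. */
        pthread_barrier_wait(&barrier);
        printf("thread %ld: phase 2 starts\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];

        /* The number of participating threads is fixed at creation time. */
        pthread_barrier_init(&barrier, NULL, NUM_THREADS);

        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&threads[t], NULL, phase_worker, (void *)t);
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);

        pthread_barrier_destroy(&barrier);
        return 0;
    }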

PThreads - Memory Management
- as for OpenMP, POSIX does not dictate the (default) stack size for a thread, so it can vary greatly
- it is therefore better to explicitly allocate enough stack to ensure portability and avoid segmentation faults or data corruption
- use pthread_attr_setstacksize to set the desired stack size in the attribute object used for creating the thread

MPI

MPI - History/Goals
Goals:
- Standardization: de-facto industry standard for message passing
- Portability: runs on a huge variety of platforms; allows for parallelization on very heterogeneous clusters
First presented at the Supercomputing conference in 1993; initial releases in 1994 (MPI-1), 1998 (MPI-2), 2012 (MPI-3).
Many popular implementations, e.g. OpenMPI (free), Intel MPI, MPICH.

MPI - Compiling & Running
For compiling MPI programs, each implementation comes with specific wrapper scripts for the compilers, e.g.
- OpenMPI: C: mpicc; C++: mpicc/mpic++/mpicxx; Fortran: mpifort
- Intel MPI (GNU compilers): C: mpicc/mpigcc; C++: mpi{cc,c++,cxx}/mpigxx; Fortran: mpifort
- Intel MPI (Intel compilers): C: mpicc/mpiicc; C++: mpi{cc,c++,cxx}/mpiicpc; Fortran: mpifort*/mpiifort
For running an MPI program, we use mpirun, which starts as many copies of the program as requested on the nodes provided by the batch system, e.g. mpirun -np 4 my_program

MPI - Init & Finalize
- before using any MPI routine (and preferably as early as possible), the MPI framework must be initialized by calling MPI_Init(&argc, &argv), which also broadcasts the command-line arguments to all processes
- at the end of your program, always call MPI_Finalize() to properly terminate/clean up the MPI execution environment

MPI - Communicators
- MPI uses communicators to define which processes may communicate with each other; in many cases this is the predefined MPI_COMM_WORLD, which includes all MPI processes
- each process has a unique rank within the communicator; you can get the rank of a process with MPI_Comm_rank(comm, &rank) as well as the total size of the communicator with MPI_Comm_size(comm, &size)

MPI - Example 1
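The code of Example 1 is not reproduced in this transcription; a minimal sketch that initializes MPI and queries rank and size (compile with mpicc, run e.g. with mpirun -np 4 ./a.out):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                    /* initialize the MPI environment */

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of processes */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();                            /* clean up the MPI environment */
        return 0;
    }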

MPI - Communication
- in MPI there are routines for point-to-point communication (i.e. from one process to exactly one other) as well as for collective communication
- a point-to-point communication always consists of a send and a matching receive (or a combined send/recv) routine
- these routines can be blocking or non-blocking, and synchronous or non-synchronous

MPI - Communication: Buffering

MPI - Communication: Blocking vs Non-Blocking
Blocking: a blocking send (MPI_Send(...)) waits until the message has been processed by the local MPI library (this does not mean that the message has been received by the other process!); to wait for confirmed processing by the recipient, use the synchronous blocking send (MPI_Ssend(...)). A blocking receive waits until the data has been received and is ready for use.
Non-blocking: non-blocking send/receive routines (MPI_Isend(...), MPI_Irecv(...), MPI_Issend(...)) work like their blocking counterparts, but only request the operation and do not wait for its completion. Instead, they return a request object that can be used to test/wait (e.g. MPI_Wait(...), MPI_Probe(...)) until the operation has been processed or a certain status has been reached, for one or more requests simultaneously.

MPI - Communication: Syntax
MPI_Isend(&buffer, count, datatype, dest, tag, comm, &request)
MPI_Recv(&buffer, count, datatype, src, tag, comm, &status)
with
- buffer: memory block to send/receive data from/to
- count: number of data elements to be sent / maximum number to be received (see MPI_Get_count() for the received amount)
- datatype: one of the predefined elementary MPI data types or a derived data type
- dest/src: rank of the communication partner (within the shared communicator used); wildcard MPI_ANY_SOURCE
- tag: arbitrary non-negative (short) integer; the same for send & receive (unless the wildcard MPI_ANY_TAG is used for the receive)
- comm: communicator
- request: allocated request structure used to communicate the progress of the communication for non-blocking routines
- status: allocated status structure containing source & tag for receive routines
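An illustrative sketch (not from the original slides) pairing a non-blocking send on rank 0 with a blocking receive on rank 1; it requires at least two processes:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size < 2) {
            if (rank == 0) fprintf(stderr, "needs at least 2 processes\n");
            MPI_Finalize();
            return 1;
        }

        if (rank == 0) {
            value = 42;
            MPI_Request request;
            /* Non-blocking send: returns immediately with a request handle. */
            MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
            /* ... other work could overlap with the communication here ... */
            MPI_Wait(&request, MPI_STATUS_IGNORE);  /* buffer may be reused now */
        } else if (rank == 1) {
            MPI_Status status;
            /* Blocking receive: waits until the data has arrived. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d from rank %d\n", value, status.MPI_SOURCE);
        }

        MPI_Finalize();
        return 0;
    }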

MPI - Collective Communication
- more efficient data exchange with multiple processes
- always involves all processes in one communicator
- can only be used with predefined datatypes
- can be blocking or non-blocking (since MPI-3)

MPI - Collective Communication [Broadcast, Scatter, Gather]

MPI - Collective Communication [Allgather, Alltoall]

MPI - Collective Communication [Reduce, Allreduce]

MPI - Collective Communication: Example
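The collective-communication example from this slide is not reproduced here; a minimal sketch using MPI_Reduce to sum one value per process onto rank 0 (illustrative only):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int local = rank + 1;   /* each process contributes one value */
        int total = 0;

        /* Combine local values with a sum reduction; the root (rank 0) gets the result. */
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d processes: %d\n", size, total);

        MPI_Finalize();
        return 0;
    }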

MPI - Multithreading
As shown in the Introduction talk, you can combine multithreading and multiprocessing, BUT you have to check whether your MPI implementation is thread-safe. MPI libraries vary in their level of thread support:
- MPI_THREAD_SINGLE: no multithreading supported
- MPI_THREAD_FUNNELED: only the main thread may make MPI calls
- MPI_THREAD_SERIALIZED: MPI calls are serialized, i.e. cannot be processed concurrently
- MPI_THREAD_MULTIPLE: thread-safe
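A brief sketch (not from the original slides) of requesting a thread-support level with MPI_Init_thread and checking what the library actually provides:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* Request full thread support and check the level actually provided. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided < MPI_THREAD_MULTIPLE)
            printf("warning: MPI library only provides thread level %d\n", provided);

        /* ... hybrid MPI + OpenMP/pthreads work would go here ... */

        MPI_Finalize();
        return 0;
    }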

Debugging

- as with analyzing and tuning parallel program performance, debugging can be much more challenging for parallel programs than for serial programs (in particular for MPI programs)
- again, unfortunately, covering this topic in any detail would go beyond the scope of this introduction to parallel programming
- while popular open-source debuggers like gdb provide facilities for debugging multi-threaded programs, MPI debugging relies on commercial solutions like DDT or TotalView