Parallel Programming Libraries and Implementations

Reusing this material This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us This means you are free to copy and redistribute the material and adapt and build on the material under the following terms: you must give appropriate credit, provide a link to the license and indicate if changes were made. If you adapt or build on the material you must distribute your work under the same license as the original. Note that this presentation contains images owned by others. Please seek their permission before reusing these images.

h"p://www.archer.ac.uk support@archer.ac.uk

Outline: MPI, the de facto standard for distributed-memory programming; OpenMP, the de facto standard for shared-memory programming; CUDA, the dominant GPGPU programming model and libraries; and other approaches, including PGAS and SHMEM.

MPI: distributed-memory parallelism using message passing

Message-passing concepts: processes cannot access each other's memory spaces; variables are private to each process; processes communicate data by passing messages.

What is MPI? MPI stands for Message Passing Interface. MPI is not a programming language, and there is no such thing as an MPI compiler. MPI is available as a library of function/subroutine calls; the library implements a communications protocol and follows an agreed-upon standard (see the next slide). The C or Fortran compiler you invoke knows nothing about what MPI actually does; it only knows the prototypes/interfaces of the function/subroutine calls.

The MPI standard: MPI is a standard, agreed upon through the extensive joint effort of ~100 representatives from ~40 different organisations (the MPI Forum), including academics, industry experts, vendors, application developers and users. The first version (MPI 1.0) was drafted in 1993; the standard is now on version 3, with version 4 being drafted.

MPI libraries: the MPI Forum defines the standard, and vendors/open-source developers create the libraries that actually implement versions of the standard. There are a number of different implementations, but all should support the MPI standard (version 2 or 3). As with different compilers, there will be variations in implementation details, but all the features specified in the standard should work. Examples: MPICH2 and OpenMPI, and Cray-MPICH on ARCHER (optimised for the interconnect on Cray machines).

Features of MPI: MPI is a portable library used for writing parallel programs using the message-passing model, and you can expect MPI to be available on any HPC platform you use. It is based on a number of processes running independently in parallel; the HPC resource provides a command to launch multiple processes simultaneously (e.g. mpiexec, aprun). You can think of each process as an instance of your executable communicating with the other instances.

Explicit parallelism: in message passing all the parallelism is explicit. The program includes specific instructions for each communication: what to send or receive, when to send or receive, and synchronisation. It is up to the developer to design the parallel decomposition and implement it. How will you divide up the problem? When will you need to communicate between processes?

Point-to-point communications: a message is sent by one process and received by another; both processes are actively involved in the communication, although not necessarily at the same time. A wide variety of semantics is provided: blocking vs. non-blocking; ready vs. synchronous vs. buffered; tags, communicators and wild-cards; built-in and custom data types. Point-to-point messages can be used to implement any communication pattern, although collective operations, where applicable, can be more efficient. A minimal example follows below.
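As an illustrative sketch (not from the original slides), the following complete C program exchanges a single integer between ranks 0 and 1 using blocking MPI_Send and MPI_Recv; the tag value of 0 and the message contents are arbitrary choices, and the program assumes it is launched with at least two processes:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Blocking send of one int to rank 1, tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive of one int from rank 0, tag 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}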

Collective communications: a communication that involves all processes, meaning all within a communicator, i.e. a defined sub-set of all processes. Each collective operation implements a particular communication pattern; collectives are easier to program than lots of point-to-point messages and should be more efficient than lots of point-to-point messages. Commonly used examples: broadcast, gather, reduce, all-to-all. A short example follows below.
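Again as a hedged sketch rather than part of the original slides, this C program broadcasts a value from rank 0 to every process with MPI_Bcast and then sums the ranks onto rank 0 with MPI_Reduce; the value 10 is an arbitrary choice:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int rank, n = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) n = 10;

    /* Broadcast: rank 0's value of n is copied to every process */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Reduce: add up every process's rank, result stored on rank 0 */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("n = %d on every rank, sum of ranks = %d\n", n, sum);

    MPI_Finalize();
    return 0;
}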

Example: MPI Hello World

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int size, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("hello world - I'm rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

OpenMP: shared-memory parallelism using directives

Shared-memory concepts: threads communicate by having access to the same memory space; any thread can alter any bit of data; there are no explicit communications between the parallel tasks.

OpenMP: OpenMP stands for Open Multi-Processing and is an Application Program Interface (API) for shared-memory programming. OpenMP is a set of extensions to Fortran, C and C++: compiler directives, runtime library routines and environment variables. It is not a library interface, unlike MPI. A directive is a special line of source code with meaning only to certain compilers, thanks to keywords (sentinels); directives are ignored if the code is compiled as regular sequential Fortran/C/C++. OpenMP is also a standard (see http://openmp.org/). A minimal example showing all three components follows below.
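The following is a minimal illustrative sketch (not taken from the slides) showing a compiler directive, two runtime library routines and the role of an environment variable; -fopenmp is the GCC spelling of the OpenMP compile option:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Compiled with OpenMP enabled (e.g. gcc -fopenmp), the directive
       below creates a team of threads; the team size can be set with
       the OMP_NUM_THREADS environment variable. Compiled without
       OpenMP, the directive is simply ignored. */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}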

Features of OpenMP: directives define parallel regions in the code, within which OpenMP threads divide the work done in the region. The programmer should decide which variables are private to each thread and which are shared. The compiler needs to know what OpenMP actually does, as it is responsible for producing the OpenMP-parallel code; OpenMP is supported by all common compilers used in HPC, and compilers should implement the standard. Parallelism is less explicit than for MPI: you specify which parts of the program you want to parallelise and the compiler produces a parallel executable. OpenMP is also used for programming the Intel Xeon Phi.

Loop-based parallelism: a very common form of OpenMP parallelism is to parallelise the work in a loop. The OpenMP directives tell the compiler to divide the iterations of the loop between the threads:

#pragma omp parallel shared(a,b,c,chunk) private(i)
{
    #pragma omp for schedule(dynamic,chunk) nowait
    for (i=0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}

Addition example:

asum = 0.0;
#pragma omp parallel \
    shared(a,N) private(i) \
    reduction(+:asum)
{
    #pragma omp for
    for (i=0; i < N; i++) {
        asum += a[i];
    }
}
printf("asum = %f\n", asum);

(The accompanying diagram on the slide shows each thread accumulating a private partial sum, myasum, over its own block of iterations istart to istop, with the partial sums then combined into asum by the reduction.)

CUDA: programming GPGPU accelerators

CUDA: CUDA is an Application Program Interface (API) for programming NVIDIA GPU accelerators. It is proprietary software provided by NVIDIA and should be available on all systems with NVIDIA GPU accelerators. You write GPU-specific functions called kernels and launch them using special syntax within standard C programs; the API includes functions to shift data between CPU and GPU memory. It is similar to OpenMP programming in many ways, in that the parallelism is implicit in the kernel design and launch. More recent versions of CUDA include ways to communicate directly between multiple GPU accelerators (GPUdirect).

Example:

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x + threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

// Called with: vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
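To show where the data movement mentioned above fits in, here is a hedged host-side sketch (not from the original slides) that allocates device memory, copies the inputs to the GPU, launches the vecAdd kernel and copies the result back; the array size and block size of 256 are arbitrary illustrative choices:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Assumes the vecAdd kernel from the slide above is defined in the same file.
int main(void)
{
    int n = 100000;
    size_t bytes = n * sizeof(double);

    // Allocate and initialise host (CPU) arrays
    double *h_a = (double*)malloc(bytes);
    double *h_b = (double*)malloc(bytes);
    double *h_c = (double*)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = i; h_b[i] = 2.0*i; }

    // Allocate device (GPU) arrays
    double *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);

    // Copy the inputs from CPU memory to GPU memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n elements
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy the result back from GPU memory to CPU memory
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", h_c[10]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}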

OpenCL: an open, cross-platform standard for programming accelerators, including GPUs (e.g. from both NVIDIA and AMD) as well as the Xeon Phi, digital signal processors and more. It comprises a language plus a library. It is harder to write than CUDA if you have NVIDIA GPUs, but it is portable across multiple platforms, although maintaining performance is difficult.

Others: niche and future implementations

Other parallel implementations: Partitioned Global Address Space (PGAS) approaches such as Coarray Fortran, Unified Parallel C and Chapel; Cray SHMEM and OpenSHMEM, single-sided communication libraries; and OpenACC, a directive-based approach for programming accelerators (a brief sketch follows below).
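As a hedged illustration of the directive-based OpenACC style (not part of the original slides), the loop below is offloaded to an accelerator when built with an OpenACC-capable compiler, e.g. the NVIDIA HPC compilers with the -acc flag:

#include <stdio.h>
#define N 100000

int main(void)
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0*i; }

    /* The compiler generates accelerator code for this loop and manages
       the data movement; without OpenACC the pragma is ignored and the
       loop runs on the CPU. */
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[10] = %f\n", c[10]);
    return 0;
}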

Summary

Parallel implementations: distributed memory is programmed using MPI; shared memory is programmed using OpenMP; GPU accelerators are most often programmed using CUDA. Hybrid programming approaches are very common in HPC, especially MPI + X (where X is usually OpenMP), since a hybrid approach matches the hardware layout more closely. A number of other, more experimental approaches are available.