Hybrid Model Parallel Programs

Hybrid Model Parallel Programs
Charlie Peck
Intermediate Parallel Programming and Cluster Computing Workshop
University of Oklahoma/OSCER, August 2010

Well, How Did We Get Here?

Almost all of the clusters being provisioned now and for the foreseeable future are constellations; that is, they are composed of nodes, each with one or more sockets holding CPUs with two or more cores, possibly with GPGPUs, connected by a high-speed network fabric. With their additional levels of memory hierarchy, these constellation clusters are more difficult to use efficiently than their predecessors, yet they offer the promise of significantly more cycles available for science. The Blue Waters computational resource is a good example of this trend. Often a hybrid approach, using two or more of MPI, OpenMP, pthreads, and/or GPGPU techniques, can exploit these computational resources more efficiently and effectively than any one of them on its own.

Common Parallel Programming Paradigms, Strengths and Weaknesses

OpenMP
1. Strengths - relatively easy to use; portable; relatively easy to adapt to existing serial software; a genuine, widely implemented standard; low-latency/high-bandwidth communication; implicit communication model; dynamic load balancing
2. Weaknesses - shared-memory model only, i.e. one node; support for C/C++ and Fortran only; little explicit control over thread creation and teardown
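To make the OpenMP model concrete, here is a minimal sketch (not taken from the workshop materials; the loop and variable names are illustrative) of a parallel loop with a reduction, showing the implicit communication model:

#include <omp.h>
#include <stdio.h>

int main(void) {
    const int N = 1000000;
    double sum = 0.0;

    /* The runtime divides the iterations among threads; the reduction
       clause combines the per-thread partial sums at the end. */
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < N; i++) {
        sum += 1.0 / (i + 1.0);
    }

    printf("sum = %f, computed with up to %d threads\n",
           sum, omp_get_max_threads());
    return 0;
}

Compiled with an OpenMP-aware compiler (e.g. -fopenmp), the same source runs serially or in parallel with no other changes.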

Message Passing Interface (MPI)
1. Strengths - support (via bindings) for C/C++, Fortran, Perl, Python, and other languages; scales beyond one shared-memory node; widely used in the scientific software community; lets you harness more compute cycles and more memory, which translates to bigger science
2. Weaknesses - something of a non-standard standard, in that implementations vary; can support a shared-memory model for intra-node communication, but not all bindings do this efficiently; can be difficult to program; high-latency/low-bandwidth communication (compared to shared memory); explicit communication model; load balancing can be difficult
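A minimal MPI sketch (illustrative only, not part of the original slides) showing the explicit, message-based communication between separate processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int my_rank, world_size, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    if (my_rank == 0) {
        value = 42;   /* illustrative payload */
        /* Explicit communication: rank 0 sends the value to every other rank. */
        for (int dest = 1; dest < world_size; dest++)
            MPI_Send(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank %d received %d\n", my_rank, value);
    }

    MPI_Finalize();
    return 0;
}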

GPGPU with CUDA
1. Strengths - ability to harness significant computational cycles; becoming widely used in the scientific software community
2. Weaknesses - can be difficult to program; non-portable, proprietary language; requires rethinking many problems to take advantage of the high degree of parallel cardinality

pthreads
1. Strengths - finer-grained control over the threading model than OpenMP; fairly portable; explicit control over thread creation and teardown
2. Weaknesses - can be difficult to program; only fairly portable
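For comparison, a minimal pthreads sketch (illustrative only) showing the explicit thread creation and teardown referred to above:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Each thread runs this function; the argument carries its logical id. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];

    /* Explicit creation ... */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    /* ... and explicit teardown: join each thread before exiting. */
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}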

Field Programmable Gate Array (FPGA)
1. Strengths - can encode essentially any algorithm in hardware
2. Weaknesses - can be difficult to program; not portable; expensive hardware

Hybrid Parallel Algorithms

OpenMP + CUDA
OpenMP + FPGA
MPI + OpenMP
MPI + CUDA
MPI + FPGA
MPI + pthreads
MPI + OpenMP + CUDA
MPI + pthreads + CUDA

Designing, Building and Debugging Hybrid Parallel Programs

The basic approach is to find the most efficient way to do the work, whether that is OpenMP (CPU), CUDA (GPGPU), pthreads, or a blend of them, and then the most efficient way to distribute the data and harvest the results via MPI. Study the communication pattern(s) to ensure that the algorithm maps onto the architecture:
1. Embarrassingly parallel
2. Loosely coupled through tightly coupled
Look for opportunities to overlap computation and communication; this is a key attribute of efficient hybrid parallel programs, as sketched below.
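One common way to get that overlap is to post non-blocking MPI operations, compute on data that does not depend on them, and only then wait. The following is a sketch under assumed conditions (a ring-style halo exchange; the buffer names and the compute() stand-in are illustrative, not from the slides):

#include <mpi.h>
#include <omp.h>

#define COUNT 1024

static double compute(double x) { return x * 0.5 + 1.0; }  /* stand-in work */

int main(int argc, char *argv[]) {
    double halo_send[COUNT], halo_recv[COUNT], interior[COUNT];
    int my_rank, world_size;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    for (int i = 0; i < COUNT; i++)
        halo_send[i] = interior[i] = my_rank + i;

    /* Exchange halos with ring neighbors using non-blocking calls. */
    int dest   = (my_rank + 1) % world_size;
    int source = (my_rank - 1 + world_size) % world_size;
    MPI_Irecv(halo_recv, COUNT, MPI_DOUBLE, source, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_send, COUNT, MPI_DOUBLE, dest,   0, MPI_COMM_WORLD, &reqs[1]);

    /* Interior work that does not depend on the incoming halo
       overlaps with the communication posted above. */
    #pragma omp parallel for
    for (int i = 0; i < COUNT; i++)
        interior[i] = compute(interior[i]);

    /* Block only when the boundary work actually needs the messages. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}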

Don't design or implement an algorithm that requires more than one type of parallelism to be enabled for it to run; this will make testing and debugging much harder than it needs to be. It is better to add {OpenMP, CUDA, pthreads} to your MPI code than the other way around: MPI is harder, so do that first, then work on the on-node parallelism. The simplest and least error-prone software architecture is to use MPI calls only outside of any parallel regions {OpenMP, pthreads} and to allow only the master thread to communicate between MPI processes (this is known as funneling). It is also possible to use MPI calls within parallel regions if you are using a thread-safe MPI binding (not all of them are).
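A sketch of how the funneled style is typically requested: MPI_Init_thread and the MPI_THREAD_FUNNELED constant are part of the MPI standard, while the surrounding scaffolding here is illustrative only.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int provided;

    /* Ask the MPI library for MPI_THREAD_FUNNELED: the process may be
       multi-threaded, but only the master thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library does not provide funneled threading\n");
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    /* ... OpenMP or pthreads parallel regions go here; keep all MPI
       calls outside them, or restricted to the master thread ... */

    MPI_Finalize();
    return 0;
}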

Debug by running on one node and testing the {OpenMP, CUDA, pthreads} portion first, then on 2 to n nodes with just the MPI enabled to verify the data transfer. Take care to have no more than about one process/thread doing network communication at the same time; contention for the network port will quickly become a bottleneck unless there is a minimal amount of communication. Take care to have no more than about two processes/threads performing CUDA calculations at the same time; contention for the I/O bandwidth to/from the GPGPU card and for the GPGPU's cores can become a bottleneck.

Simple MPI + OpenMP Example

#include <omp.h>
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int world_size, my_rank;
    int thread_count = 0;               /* reduction target */
    const int NUM_THREADS = 4;

    omp_set_num_threads(NUM_THREADS);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Each thread contributes omp_get_num_threads() to the sum. */
    #pragma omp parallel reduction(+: thread_count)
    {
        thread_count = omp_get_num_threads();
    }

    printf("MPI Rank %d has reported %d\n", my_rank, thread_count);

    MPI_Finalize();
    return 0;
}

Questions?

Lab Exercises

To compile MPI + OpenMP programs on Sooner, add -fopenmp to your mpicc command line. To run MPI + OpenMP programs on Sooner, copy and modify the example bsub script found here: charliep/ncsi2010/hybrid-mpi-openmp/example_parallel_hybrid.bsub
1. Copy, build, run, and explore charliep/ncsi2010/hybrid-mpi-openmp/mpi-openmp-first.c
2. Copy, build, run, and explore charliep/ncsi2010/hybrid-mpi-openmp/mpi-openmp-second.c. Fix the total-threads problem.
3. Copy, add the appropriate OpenMP directives, then build, run, and explore charliep/ncsi2010/hybrid-mpi-openmp/area_under_curve_mpi.c
4. Using the input file found in the same directory, explore the efficiency of using more or fewer MPI processes and more or fewer OpenMP threads spread across different numbers of nodes for that problem size. What is the most efficient mapping?