Hybrid MPI+OpenMP+CUDA Programming

Hybrid MPI+OpenMP+CUDA Programming
Aiichiro Nakano
Collaboratory for Advanced Computing & Simulations
Department of Computer Science
Department of Physics & Astronomy
Department of Chemical Engineering & Materials Science
Department of Biological Sciences
University of Southern California
Email: anakano@usc.edu
https://hpcc.usc.edu/support/documentation/gpucluster/

Using MPI+CUDA on HPC

Set up the environment on the front-end (ssh to hpc-login3.usc.edu):

If tcsh:
source /usr/usc/openmpi/1.6.4/share/setup-cuda-gnu.csh
source /usr/usc/cuda/6.5/setup.csh

If bash:
source /usr/usc/openmpi/1.6.4/share/setup-cuda-gnu.sh
source /usr/usc/cuda/6.5/setup.sh

Compilation:
mpicc -o hypi hypi.cu
(Try mpicc --showme to see the underlying compiler invocation.)

Submit the following PBS script using the qsub command:

#!/bin/bash
#PBS -l nodes=2:ppn=1:gpus=1
#PBS -l walltime=00:00:59
#PBS -o hypi.out
#PBS -j oe
#PBS -N hypi
#PBS -A lc_an2
source /usr/usc/openmpi/1.6.4/share/setup-cuda-gnu.sh
source /usr/usc/cuda/6.5/setup.sh
cd /home/rcf-proj/an2/your_folder
mpirun -np 2 -machinefile $PBS_NODEFILE ./hypi

MPI+CUDA Calculation of π

Spatial decomposition: each MPI process integrates over a range of width 1/nproc, as a discrete sum of nbin bins, each of width step.

Interleaving: within each MPI process, NUM_BLOCK*NUM_THREAD CUDA threads perform part of the sum in an interleaved fashion.

(Figure: bins within each MPI process assigned alternately to CUDA threads t0, t1, t0, t1, ...)
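For reference, the quantity being computed is the midpoint-rule quadrature of the integral of 4/(1+x^2) from 0 to 1, which equals π. A minimal serial version (not part of the course code, shown only to make the decomposition concrete; the MPI and CUDA layers partition this single loop) might look like:

#include <stdio.h>
#define NBIN 10000000
int main(void) {
	double step = 1.0/NBIN, sum = 0.0;
	int i;
	for (i = 0; i < NBIN; i++) {
		double x = (i + 0.5)*step;   // Midpoint of bin i
		sum += 4.0/(1.0 + x*x);      // Integrand 4/(1+x^2)
	}
	printf("pi = %f\n", sum*step);
	return 0;
}

The spatial decomposition gives each MPI rank a contiguous sub-range of bins (via offset), while the CUDA interleaving strides the remaining loop across threads.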

Calculate Pi with MPI+CUDA: hypi.cu (1)

#include <stdio.h>
#include <mpi.h>
#include <cuda.h>
#define NBIN 10000000  // Number of bins
#define NUM_BLOCK 13   // Number of thread blocks
#define NUM_THREAD 192 // Number of threads per block

// Kernel that executes on the CUDA device
__global__ void cal_pi(float *sum,int nbin,float step,float offset,int nthreads,int nblocks) {
	int i;
	float x;
	int idx = blockIdx.x*blockDim.x+threadIdx.x; // Sequential thread index across blocks
	for (i=idx; i<nbin; i+=nthreads*nblocks) {   // Interleaved bin assignment to threads
		x = offset+(i+0.5)*step;
		sum[idx] += 4.0/(1.0+x*x);
	}
}

The offset argument implements the MPI spatial decomposition; the strided loop implements the CUDA thread interleaving.

Calculate Pi with MPI+CUDA: hypi.cu (2)

int main(int argc,char **argv) {
	int myid,nproc,nbin,tid;
	float step,offset,pi=0.0,pig;
	dim3 dimGrid(NUM_BLOCK,1,1);   // Grid dimensions (only use 1D)
	dim3 dimBlock(NUM_THREAD,1,1); // Block dimensions (only use 1D)
	float *sumHost,*sumDev;        // Pointers to host & device arrays
	MPI_Init(&argc,&argv);
	MPI_Comm_rank(MPI_COMM_WORLD,&myid);  // My MPI rank
	MPI_Comm_size(MPI_COMM_WORLD,&nproc); // Number of MPI processes
	nbin = NBIN/nproc;              // Number of bins per MPI process
	step = 1.0/(float)(nbin*nproc); // Step size with redefined number of bins
	offset = myid*step*nbin;        // Quadrature-point offset
	size_t size = NUM_BLOCK*NUM_THREAD*sizeof(float); // Array memory size
	sumHost = (float *)malloc(size);    // Allocate array on host
	cudaMalloc((void **) &sumDev,size); // Allocate array on device
	cudaMemset(sumDev,0,size);          // Reset array in device to 0
	// Calculate on device (call CUDA kernel)
	cal_pi <<<dimGrid,dimBlock>>> (sumDev,nbin,step,offset,NUM_THREAD,NUM_BLOCK);
	// Retrieve result from device and store it in host array
	cudaMemcpy(sumHost,sumDev,size,cudaMemcpyDeviceToHost);
	// Reduction over CUDA threads
	for(tid=0; tid<NUM_THREAD*NUM_BLOCK; tid++) pi += sumHost[tid];
	pi *= step;
	// CUDA cleanup
	free(sumHost);
	cudaFree(sumDev);
	printf("myid = %d: partial pi = %f\n",myid,pi);
	// Reduction over MPI processes
	MPI_Allreduce(&pi,&pig,1,MPI_FLOAT,MPI_SUM,MPI_COMM_WORLD);
	if (myid==0) printf("PI = %f\n",pig);
	MPI_Finalize();
	return 0;
}

Interactive Run

Here we assume that you have included the source commands (on the second slide) that set up the interoperable Open MPI & CUDA environments in your .cshrc or .bashrc in your home directory.

[anakano@hpc-login3 ~]$ qsub -I -l nodes=2:ppn=1:gpus=1,walltime=00:29:59
qsub: waiting for job 24925066.hpc-pbs1.hpcc.usc.edu to start
qsub: job 24925066.hpc-pbs1.hpcc.usc.edu ready
----------------------------------------
Begin PBS Prologue Fri Oct 13 08:58:35 PDT 2017
Job ID: 24925066.hpc-pbs1.hpcc.usc.edu
Username: anakano
...
Nodes: hpc3025 hpc3026
Scratch is: /scratch
TMPDIR: /tmp/24925066.hpc-pbs1.hpcc.usc.edu
End PBS Prologue Fri Oct 13 08:59:14 PDT 2017
----------------------------------------
[anakano@hpc3025 ~]$ cd /home/rcf-proj/an/hpc/cs596
[anakano@hpc3025 cs596]$ mpicc -o hypi hypi.cu
[anakano@hpc3025 cs596]$ mpirun -np 2 -machinefile $PBS_NODEFILE ./hypi
myid = 0: partial pi = 1.854596
myid = 1: partial pi = 1.287001
PI = 3.141597

Variation: Using 2 GPUs per Node (1)

Run multiple MPI processes on each node, and assign different GPUs to different processes.

hypi_setdevice.cu:

int main(int argc,char **argv) {
	int dev_used;
	...
	MPI_Comm_rank(MPI_COMM_WORLD,&myid); // My MPI rank
	cudaSetDevice(myid%2);               // Pick one of the 2 GPUs (0 or 1)
	...
	cudaGetDevice(&dev_used);            // Find which GPU is being used
	printf("myid = %d: device used = %d; partial pi = %f\n",myid,dev_used,pi);
	...
}
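If the number of GPUs per node is not fixed at 2, the same round-robin assignment can be generalized with a standard CUDA runtime query. A small sketch (not part of the original hypi_setdevice.cu; it assumes ranks on the same node are numbered consecutively, as with ppn ranks per node list entry):

int ndev;
cudaGetDeviceCount(&ndev);  // Number of GPUs visible on this node
cudaSetDevice(myid % ndev); // Round-robin over the local GPUs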

Variation: Using 2 GPUs per Node (2)

hypi_setdevice.pbs:

#!/bin/bash
#PBS -l nodes=2:ppn=2:gpus=2
#PBS -l walltime=00:00:59
#PBS -o hypi_setdevice.out
#PBS -j oe
#PBS -N hypi_setdevice
#PBS -A lc_an2
source /usr/usc/openmpi/1.6.4/share/setup-cuda-gnu.sh
source /usr/usc/cuda/6.5/setup.sh
cd /home/rcf-proj2/an2/anakano
mpirun -np 4 -machinefile $PBS_NODEFILE ./hypi_setdevice

hypi_setdevice.out:

----------------------------------------
Begin PBS Prologue Fri Oct 13 09:19:37 PDT 2017
...
Nodes: hpc3025 hpc3026   (ranks 0 & 1 on hpc3025; ranks 2 & 3 on hpc3026)
----------------------------------------
myid = 0: device used = 0; partial pi = 0.979926
myid = 1: device used = 1; partial pi = 0.874671
myid = 2: device used = 0; partial pi = 0.719409
myid = 3: device used = 1; partial pi = 0.567582
PI = 3.141588

Using MPI+OpenMP+CUDA on HPC

Set up the environment on the front-end (ssh to hpc-login3.usc.edu):

If tcsh:
source /usr/usc/cuda/7.5/setup.csh
source /usr/usc/openmpi/1.6.4/share/setup-cuda-gnu.csh

If bash:
source /usr/usc/cuda/7.5/setup.sh
source /usr/usc/openmpi/1.6.4/share/setup-cuda-gnu.sh

Compilation:

$ mpicc --showme
nvcc -I/usr/usc/openmpi/1.6.4/include -L/usr/usc/openmpi/1.6.4/lib -lmpi -lmpi_cxx -lrt -lnsl -lutil -lm -ldl
$ mpicc -o pi3 -Xcompiler -fopenmp pi3.cu

Here -Xcompiler is the nvcc option that passes the following option (-fopenmp) through to gcc.

MPI+OpenMP+CUDA Computation of π

Write a triple-decker MPI+OpenMP+CUDA program, pi3.cu, by inserting an OpenMP layer into the double-decker MPI+CUDA program hypi_setdevice.cu.

Launch one MPI rank per node; each rank spawns two OpenMP threads that run on different CPU cores and use different GPU devices.

MPI+OpenMP Spatial Decompositions

#include <omp.h>
#define NUM_DEVICE 2 // Number of GPU devices = number of OpenMP threads
...
// In main()
MPI_Comm_rank(MPI_COMM_WORLD,&myid);  // My MPI rank
MPI_Comm_size(MPI_COMM_WORLD,&nproc); // Number of MPI processes
omp_set_num_threads(NUM_DEVICE);      // One OpenMP thread per GPU device
nbin = NBIN/(nproc*NUM_DEVICE);       // Number of bins per OpenMP thread
step = 1.0/(float)(nbin*nproc*NUM_DEVICE);
#pragma omp parallel private(/* list the variables that need private copies */)
{
	mpid = omp_get_thread_num();
	offset = (NUM_DEVICE*myid+mpid)*step*nbin; // Quadrature-point offset
	cudaSetDevice(mpid%2);
	...
}

For the CUDA layer, leave the interleaved assignment of quadrature points to CUDA threads in hypi_setdevice.cu as it is.
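A minimal sketch of how the private clause might be filled in (the exact list depends on how you structure pi3.cu; the names mpid, offset, sumHost, sumDev, and tid are assumptions carried over from hypi_setdevice.cu):

// Sketch only: thread-private copies of the per-thread work variables
#pragma omp parallel private(mpid,offset,sumHost,sumDev,tid)
{
	mpid = omp_get_thread_num();               // OpenMP thread ID (0 or 1)
	offset = (NUM_DEVICE*myid+mpid)*step*nbin; // Quadrature-point offset for this thread
	cudaSetDevice(mpid%2);                     // Each thread drives its own GPU
	// ... per-thread allocation, kernel launch, and CUDA-thread reduction go here ...
}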

Data Privatization

Circumvent the race condition on the variable pi by defining one private accumulator per OpenMP thread (i.e., per GPU device):

float pid[NUM_DEVICE];

Use the array elements as dedicated accumulators for the OpenMP threads. Upon exiting the OpenMP parallel section, perform a reduction over the elements of pid[] to obtain the partial sum, pi, for each MPI rank.

Alternatively, use #pragma omp parallel reduction(+:pi).
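A minimal sketch of the manual privatization described above, under the same naming assumptions as the previous slides (pid[], mpid, sumHost, sumDev, tid):

float pid[NUM_DEVICE] = {0.0}; // One accumulator per OpenMP thread, initialized to 0
#pragma omp parallel private(mpid,offset,sumHost,sumDev,tid)
{
	mpid = omp_get_thread_num();
	// ... per-thread CUDA work as before ...
	for (tid=0; tid<NUM_THREAD*NUM_BLOCK; tid++) pid[mpid] += sumHost[tid];
	pid[mpid] *= step;             // Partial sum for this OpenMP thread
}
// Reduction over the OpenMP threads gives this MPI rank's partial sum
for (mpid=0; mpid<NUM_DEVICE; mpid++) pi += pid[mpid];
// MPI_Allreduce over the ranks then yields the global value of pi, as before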

Output

To report which of the two GPUs was used for the run, insert the following lines within the OpenMP parallel block:

cudaGetDevice(&dev_used);
printf("myid = %d; mpid = %d: device used = %d; partial pi = %f\n",
       myid, mpid, dev_used, pi);

Here myid is the MPI rank, mpid is the OpenMP thread ID, dev_used is the ID of the GPU device (0 or 1) that was used, and the last value is the partial sum per OpenMP thread (or pid[mpid] if the data are privatized manually).

Output:

myid = 0; mpid = 0: device used = 0; partial pi = 0.979926
myid = 0; mpid = 1: device used = 1; partial pi = 0.874671
myid = 1; mpid = 0: device used = 0; partial pi = 0.719409
myid = 1; mpid = 1: device used = 1; partial pi = 0.567582
PI = 3.141588

Using MPI+OpenMP+CUDA on HPC

Submit the following PBS script using the qsub command:

#!/bin/bash
#PBS -l nodes=2:ppn=2:gpus=2
#PBS -l walltime=00:00:59
#PBS -o pi3.out
#PBS -j oe
#PBS -N pi3
#PBS -A lc_an2
source /usr/usc/cuda/7.5/setup.sh
source /usr/usc/openmpi/1.6.4/share/setup-cuda-gnu.sh
cd /home/rcf-proj/an2/your_folder
cat $PBS_NODEFILE | uniq > nodefile
mpirun --bind-to none -np 2 -machinefile nodefile ./pi3

The uniq step keeps one node-file entry per node so that only one MPI rank is launched per node, and --bind-to none allows each rank's OpenMP threads to run on different CPU cores.