Hybrid MPI+OpenMP Parallel MD

Aiichiro Nakano
Collaboratory for Advanced Computing & Simulations
Department of Computer Science
Department of Physics & Astronomy
Department of Chemical Engineering & Materials Science
University of Southern California
Email: anakano@usc.edu

Hybrid MPI+OpenMP Programming
Each MPI process spawns multiple OpenMP threads.
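A minimal, self-contained sketch of this model (illustration only; the file name hello_hybrid.c and the printed labels are not part of the course code): compile with mpicc plus the compiler's OpenMP flag (-openmp in the setup used below), run with mpirun, and every MPI rank prints one line per OpenMP thread it spawns.

/* hello_hybrid.c: minimal hybrid MPI+OpenMP sketch (illustration, not course code) */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>
int main(int argc, char **argv) {
  int myid, nproc;
  MPI_Init(&argc, &argv);                 /* one MPI process per rank */
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
  #pragma omp parallel                    /* each rank spawns OMP_NUM_THREADS threads */
  {
    printf("rank %d of %d: thread %d of %d\n",
           myid, nproc, omp_get_thread_num(), omp_get_num_threads());
  }
  MPI_Finalize();
  return 0;
}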

MPI+OpenMP Calculation of π
Each MPI process integrates over a range of width 1/nproc, as a discrete sum of nbin bins, each of width step. Within each MPI process, nthreads OpenMP threads perform part of the sum, as in omp_pi.c.
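For reference, the quantity computed is the midpoint-rule approximation of π = ∫_0^1 4/(1+x^2) dx: with step = 1/(nbin*nproc) and x = (i+0.5)*step, π ≈ Σ_i 4/(1+x^2)*step, where each (rank, thread) pair accumulates a disjoint subset of the bin indices i.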

MPI+OpenMP Calculation of π: hpi.c

#include <stdio.h>
#include <mpi.h>
#include <omp.h>
#define NBIN 100000
#define MAX_THREADS 8
void main(int argc,char **argv) {
  int nbin,myid,nproc,nthreads,tid;
  double step,sum[MAX_THREADS]={0.0},pi=0.0,pig;
  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&myid);
  MPI_Comm_size(MPI_COMM_WORLD,&nproc);
  nbin = NBIN/nproc;
  step = 1.0/(nbin*nproc);
  #pragma omp parallel private(tid)
  {
    int i;
    double x;
    nthreads = omp_get_num_threads();
    tid = omp_get_thread_num();
    for (i=nbin*myid+tid; i<nbin*(myid+1); i+=nthreads) {
      x = (i+0.5)*step;
      sum[tid] += 4.0/(1.0+x*x);
    }
    printf("rank tid sum = %d %d %e\n",myid,tid,sum[tid]);
  }
  for (tid=0; tid<nthreads; tid++) pi += sum[tid]*step;
  MPI_Allreduce(&pi,&pig,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
  if (myid==0) printf("PI = %f\n",pig);
  MPI_Finalize();
}

MPI+OpenMP Example: hpi.c

Compilation on hpc.usc.edu
source /usr/usc/mpich/default/gm-intel/setup.csh
mpicc -o hpi hpi.c -openmp

PBS script
#!/bin/bash
#PBS -l nodes=2:ppn=1
#PBS -l walltime=00:00:59
#PBS -o hpi.out
#PBS -j oe
#PBS -N hpi
source /usr/usc/mpich/default/gm-intel/setup.sh
OMP_NUM_THREADS=2
export OMP_NUM_THREADS
WORK_HOME=/auto/rcf-12/anakano/hpc/cs596/
cd $WORK_HOME
np=$(cat $PBS_NODEFILE | wc -l)
mpirun -np $np -machinefile $PBS_NODEFILE ./hpi

Output
rank tid sum = 1 1 6.434981e+04
rank tid sum = 1 0 6.435041e+04
rank tid sum = 0 0 9.272972e+04
rank tid sum = 0 1 9.272932e+04
PI = 3.141593

Hybrid MPI+OpenMP Parallel MD
OpenMP threads handle blocks of linked-list cells in each MPI process (= spatial-decomposition subsystem).

Linked-List Cell Block

Variables
vthrd[0|1|2] = # of OpenMP threads per MPI process in the x|y|z direction.
nthrd = # of OpenMP threads = vthrd[0]*vthrd[1]*vthrd[2].
thbk[3]: thbk[0|1|2] is the # of linked-list cells in the x|y|z direction that each thread is assigned.

/* Compute the # of cells for linked-list cells */
for (a=0; a<3; a++) lc[a] = al[a]/rcut;      /* Cell size >= potential cutoff */
/* Size of the cell block that each thread is assigned */
for (a=0; a<3; a++) thbk[a] = lc[a]/vthrd[a];
/* # of cells = integer multiple of the # of threads */
for (a=0; a<3; a++) {
  lc[a] = thbk[a]*vthrd[a];  /* Adjust the # of cells per MPI process */
  rc[a] = al[a]/lc[a];       /* Linked-list cell length */
}
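As a worked example (numbers chosen only for illustration, not taken from the assignment): if al[0] = 12.0, rcut = 2.5, and vthrd[0] = 2, then lc[0] = (int)(12.0/2.5) = 4 cells, thbk[0] = 4/2 = 2 cells per thread, the adjusted lc[0] = 2*2 = 4, and rc[0] = 12.0/4 = 3.0, which remains no smaller than the cutoff.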

OpenMP Threads for Cell Blocks

Variables
std = scalar thread index.
vtd[3]: vtd[0|1|2] is the x|y|z element of the vector thread index.
mofst[3]: mofst[0|1|2] is the x|y|z offset cell index of the cell block.

std = omp_get_thread_num();
vtd[0] = std/(vthrd[1]*vthrd[2]);
vtd[1] = (std/vthrd[2])%vthrd[1];
vtd[2] = std%vthrd[2];
for (a=0; a<3; a++) mofst[a] = vtd[a]*thbk[a];

Call omp_get_thread_num() within an OpenMP parallel block.
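The mapping can be checked with a small standalone program (illustration only; vthrd[] = {2,2,1} and thbk[] = {3,3,6} are assumed values, not from the assignment):

/* Illustration: enumerate vector thread indices and cell-block offsets
   for assumed vthrd[] = {2,2,1} and thbk[] = {3,3,6}. */
#include <stdio.h>
int main(void) {
  int vthrd[3] = {2,2,1}, thbk[3] = {3,3,6};
  int nthrd = vthrd[0]*vthrd[1]*vthrd[2];
  int std, a, vtd[3], mofst[3];
  for (std=0; std<nthrd; std++) {
    vtd[0] = std/(vthrd[1]*vthrd[2]);   /* same mapping as above */
    vtd[1] = (std/vthrd[2])%vthrd[1];
    vtd[2] = std%vthrd[2];
    for (a=0; a<3; a++) mofst[a] = vtd[a]*thbk[a];
    printf("std=%d vtd=(%d,%d,%d) mofst=(%d,%d,%d)\n",
           std, vtd[0], vtd[1], vtd[2], mofst[0], mofst[1], mofst[2]);
  }
  return 0;
}

With these values the four threads get offsets (0,0,0), (0,3,0), (3,0,0), and (3,3,0), i.e., non-overlapping cell blocks.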

Threads Processing of Cell Blocks
Start with the MPI parallel MD program, pmd.c.
Within each MPI process, parallelize the outer loops over the central linked-list cells, mc[], in the force-computation function, compute_accel(), using OpenMP threads.
If each thread needs a separate copy of a variable (e.g., the loop index mc[]), declare it as private in the OpenMP parallel block.

#pragma omp parallel private(mc,...)
{
  ...
  for (mc[0]=mofst[0]+1; mc[0]<=mofst[0]+thbk[0]; (mc[0])++)
  for (mc[1]=mofst[1]+1; mc[1]<=mofst[1]+thbk[1]; (mc[1])++)
  for (mc[2]=mofst[2]+1; mc[2]<=mofst[2]+thbk[2]; (mc[2])++) {
    /* Each thread handles thbk[0]*thbk[1]*thbk[2] cells independently */
  }
  ...
}

Avoiding Critical Sections
A cell block should be at least 3 cells wide, so that no two threads access the same atoms simultaneously.
Remove the critical section

  if (bintra) lpe += vVal; else lpe += 0.5*vVal;

by defining an array, lpe[nthrd], where each array element stores the partial sum of the potential energy computed by one thread.
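A minimal sketch of the per-thread-accumulator idea (illustration only; the names lpe_td and MAX_THRD and the dummy loop are assumptions, not the actual hmd code): each thread adds into its own array element inside the parallel region, and the elements are summed serially afterwards, so no omp critical is needed.

/* Sketch: per-thread partial sums replace a critical section (assumes <= MAX_THRD threads). */
#include <stdio.h>
#include <omp.h>
#define MAX_THRD 8
int main(void) {
  double lpe_td[MAX_THRD] = {0.0};   /* one partial sum per thread */
  double lpe = 0.0;
  int nthrd = 1, tid;
  #pragma omp parallel
  {
    int t = omp_get_thread_num();
    int i;
    nthrd = omp_get_num_threads();
    #pragma omp for
    for (i=0; i<1000; i++) {
      double vVal = 0.001*i;         /* stand-in for a pair-potential value */
      lpe_td[t] += 0.5*vVal;         /* each thread updates only its own slot */
    }
  }
  for (tid=0; tid<nthrd; tid++) lpe += lpe_td[tid];  /* serial reduction */
  printf("lpe = %f\n", lpe);
  return 0;
}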

Running HMD at HPC
1. Interactively submit a PBS job, and wait until you are allocated nodes. (Note that you will be automatically logged in to one of the allocated nodes.)

hpc-master(18): qsub -I -l nodes=2:ppn=4 -l walltime=00:29:00
qsub: waiting for job 1932791.hpc-pbs.usc.edu to start
qsub: job 1932791.hpc-pbs.usc.edu ready
----------------------------------------
Begin PBS Prologue Tue Oct 24 12:57:44 PDT 2006
Job ID:        1932791.hpc-pbs.usc.edu
Username:      anakano
Group:         m-csci
Name:          STDIN
Queue:         quick
Shared Access: no
Nodes:         hpc0097 hpc0104
PVFS:          /scratch (128G)
TMPDIR:        /tmp/1932791.hpc-pbs.usc.edu
End PBS Prologue Tue Oct 24 12:57:47 PDT 2006
----------------------------------------
[anakano@hpc0104 ~]$

Running HMD at HPC
2. Type the following sequence of commands. (In this example, hpc/cs596 is my working directory, where the executable hmd is located.)

[anakano@hpc0104 ~]$ bash
bash-2.05b$ source /usr/usc/mpich/default/gm-intel/setup.sh
bash-2.05b$ OMP_NUM_THREADS=4
bash-2.05b$ export OMP_NUM_THREADS
bash-2.05b$ cd hpc/cs596
bash-2.05b$ cp $PBS_NODEFILE nodefile

3. Edit nodefile, which originally consisted of 8 lines, to delete 6 lines.

(Original nodefile)
hpc0104
hpc0104
hpc0104
hpc0104
hpc0097
hpc0097
hpc0097
hpc0097

(Edited nodefile)
hpc0104
hpc0097

4. Run the two-process MPI program (named hmd); each MPI process will spawn 4 OpenMP threads.

bash-2.05b$ mpirun -np 2 -machinefile nodefile ./hmd

Running HMD at HPC
5. While the job is running, you can open another window and log in to both nodes to check that all processors on each node are busy.

hpc-master(2): rlogin hpc0104
[anakano@hpc0104 ~]$ top
13:00:36 up 93 days, 14:03, 1 user, load average: 1.41, 0.37, 0.23
70 processes: 65 sleeping, 5 running, 0 zombie, 0 stopped
CPU states:  cpu    user   nice  system  irq   softirq  iowait  idle
             total  86.8%  0.0%  0.0%    0.0%  0.0%     0.0%    13.1%
             cpu00  89.1%  0.0%  0.0%    0.0%  0.0%     0.0%    10.8%
             cpu01  88.1%  0.0%  0.0%    0.0%  0.0%     0.0%    11.8%
             cpu02  82.1%  0.0%  0.0%    0.0%  0.0%     0.0%    17.8%
             cpu03  88.1%  0.0%  0.0%    0.0%  0.0%     0.0%    11.8%
Mem:  3976844k av, 1747492k used, 2229352k free, 0k shrd, 155128k buff
      1411876k active, 92572k inactive
Swap: 1052248k av, 6792k used, 1045456k free   1387084k cached

  PID USER     PRI  NI  SIZE  RSS  SHARE STAT %CPU %MEM  TIME CPU COMMAND
24918 anakano   25   0 45436  44M  34200 R    23.7  1.1  0:35   3 hmd
24941 anakano   25   0 45436  44M  34200 R    22.2  1.1  0:33   0 hmd
24940 anakano   25   0 45436  44M  34200 R    22.0  1.1  0:30   1 hmd
24942 anakano   25   0 45436  44M  34200 R    18.8  1.1  0:32   2 hmd
    1 root      15   0   296  276    224 S     0.0  0.0  1:04   2 init
...