
Parallel Applications on Distributed Memory Systems Le Yan HPC User Services @ LSU

Outline
Distributed memory systems
Message Passing Interface (MPI)
Parallel applications

Shared memory model
All threads can access the global address space.
Data sharing is achieved by writing to and reading from the same memory location.
Example: OpenMP (a minimal sketch follows below).
(Diagram: several cores (C) attached to a single memory (M) holding the shared data.)
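The following minimal OpenMP sketch (illustrative, not part of the original slides) shows the idea: all threads read and write the same array in the shared address space. Compile with something like gcc -fopenmp.

/* Minimal shared-memory sketch: one array, shared by all OpenMP threads. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    double data[16];                  /* shared by every thread */
    #pragma omp parallel for
    for (int i = 0; i < 16; i++)
        data[i] = 2.0 * i;            /* threads write different parts of the same array */
    printf("data[15] = %g (up to %d threads)\n", data[15], omp_get_max_threads());
    return 0;
}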

Distributed memory systems (1)
Each process has its own address space.
Data is local to each process.
Data sharing is achieved via explicit message passing (over the network).
Example: MPI (Message Passing Interface)
(Diagram: each core (C) has its own memory (M); the nodes are connected by a node interconnect.)

Distributed memory systems (2)
Advantages: memory is scalable with the number of processors; cost-effective compared to big shared memory systems.
Disadvantages: explicit data transfer means programmers are responsible for moving data around.
The Top 500 list is dominated by distributed memory systems nowadays.

Distributed Memory Systems (3)
Login nodes: users use these nodes to access the system.
Compute nodes: run user jobs; not accessible from outside.
I/O nodes: serve files stored on disk arrays over the network; not accessible from outside either.

Distributed Memory System: Shelob
LSU HPC system with 32 nodes.
Each node is equipped with two 8-core Intel Sandy Bridge processors, 64 GB of memory, and two NVIDIA K20 Kepler GPUs.
FDR InfiniBand interconnect (56 Gbps).

Distributed Memory System: SuperMIC
LSU HPC system with 380 nodes.
Each node is equipped with two 10-core Intel Ivy Bridge processors, 64 GB of memory, and two Intel Xeon Phi coprocessors.
FDR InfiniBand network interface (56 Gbps).

Distributed Memory System: Stampede
Texas Advanced Computing Center system with 6,400 nodes.
Intel Sandy Bridge processors and Xeon Phi coprocessors.
InfiniBand interconnect with 75 miles of network cable.
Source: https://www.tacc.utexas.edu/stampede/

HPC Jargon Soup
Nodes, sockets, processors, cores, memory, cache, registers... what?
Source: https://portal.tacc.utexas.edu/userguides/stampede


Inside the Intel Ivy Bridge Processor
Source: http://www.theregister.co.uk/2013/09/10/intel_ivy_bridge_xeon_e5_2600_v2_launch/

NUMA Architecture
(Diagram: node shelob001 with socket 1 attached to memory plane 1 and socket 2 attached to memory plane 2.)
[lyan1@shelob1 ~]$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32739 MB
node 0 free: 29770 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 26576 MB
node distances:
node   0   1
  0:  10  11
  1:  11  10
NUMA = Non-Uniform Memory Access; these days it is most likely ccNUMA (cache-coherent NUMA).
Not all cores can access every memory location at the same speed.

Memory Hierarchy
(Diagram: registers, cache, memory, disk, and the memory of other nodes.)
Data link    Latency             Bandwidth
Cache        O(1 ns - 10 ns)     O(100 GB/s)
Memory       O(100 ns)           O(10 GB/s)
Disk I/O     O(1 millisecond)    O(100 MB/s)
Network      O(1 microsecond)    O(1 GB/s)

Memory Hierarchy (continued)
Moving data slows a program down.
The rule of locality applies:
Temporal locality: keep data where it is as long as possible.
Spatial locality: move adjacent data together and avoid random access if possible (a short C sketch follows below).
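As an illustrative C sketch (not from the slides) of why spatial locality matters: a 2-D array in C is stored row by row, so the first loop walks memory contiguously while the second strides through it and wastes cache.

/* Spatial-locality sketch: same arithmetic, very different memory behavior. */
#include <stdio.h>
#define N 1024

double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Cache-friendly: consecutive iterations touch adjacent memory. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Cache-unfriendly: strided access, more cache misses for the same result. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("sum = %f\n", sum);
    return 0;
}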

Distributed Memory Systems Are Hierarchical
A cluster has N nodes, each with M sockets (processors), each with L cores, plus K accelerators (GPUs or Xeon Phis) per node.
There are many different programming models and types of parallel applications.
There is no silver bullet for all problems.

Potential Performance Bottlenecks on Distributed Memory Systems
Intra-node: memory bandwidth; the link between sockets (CPU to CPU); intra-CPU communication (core to core); the link between the CPU and accelerators.
Inter-node: communication over the network.

Outline
Distributed memory systems
Message Passing Interface (MPI)
Parallel applications

Message Passing
Context: distributed memory systems.
Each processor has its own memory space and cannot access the memory of other processors.
Any data to be shared must be explicitly transferred from one processor to another as messages, hence the name "message passing".

Message Passing Interface
MPI defines a standard API for message passing.
The standard includes: what functions are available; the syntax and semantics of those functions; the expected outcome when those functions are called.
The standard does NOT include: implementation details (e.g. how the data transfer occurs); runtime details (e.g. how many processes the code will run with).
MPI provides C/C++ and Fortran bindings. A minimal example of the C API follows below.
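The following minimal MPI program in C is an illustrative sketch (not taken from the workshop materials); it uses only the environment management calls every MPI code needs, and can be built and run with the mpicc/mpirun commands shown later.

/* Minimal MPI skeleton: initialize, ask "who am I?", finalize. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank          */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes    */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}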

MPI Functions
Point-to-point communication functions: message transfer from one process to another.
Collective communication functions: message transfer involving all processes in a communicator.
Environment management functions: initialization and termination; process groups and topology.
(A short example of the first two categories follows below.)
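As an illustrative sketch only (it assumes at least two MPI processes, e.g. mpirun -np 2), the code below uses one point-to-point pair (MPI_Send/MPI_Recv) and one collective call (MPI_Bcast).

/* Point-to-point vs. collective communication, side by side. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Point-to-point: rank 0 sends one integer to rank 1. */
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Collective: rank 0 broadcasts the integer to every process. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d has value %d\n", rank, value);

    MPI_Finalize();
    return 0;
}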

Example of MPI Functions
Source: Practical MPI Programming, IBM Redbook

Why MPI?
Standardized: with efforts to keep it evolving (the MPI 3.0 draft came out in 2010).
Portability: MPI implementations are available on almost all platforms.
Scalability: it is not limited by the number of processors that can access the same memory space.
Popularity: the de facto programming model for distributed memory systems.

Implementations of MPI (1)
There are many different implementations, but only a few are being actively developed.
MPICH and derivatives: MPICH, MVAPICH2, Intel MPI (not open source).
OpenMPI (not OpenMP!).
We will focus on OpenMPI today, which is the default implementation on Shelob.
They all comply with the MPI standard, but differ in implementation details.

Implementations of MPI (2)
MPICH: developed as an example implementation by a group of people involved in making the MPI standard.
MVAPICH2: based on MPICH; mainly targets HPC systems with InfiniBand interconnects.
OpenMPI: grew out of four other implementations.

Outline
Distributed memory systems
Message Passing Interface (MPI)
Parallel applications

Parallel Applications on Distributed Systems
Pure MPI applications: CPU only, or CPU + Xeon Phi.
Hybrid applications (MPI + X): CPU only (MPI + OpenMP); CPU + GPU (MPI + GPU or MPI + OpenACC); CPU + Xeon Phi (MPI + OpenMP).
Others: applications that do not use MPI to launch and manage processes on multiple hosts.

Pure MPI Applications
The launcher starts one copy of the program on each available processing core.
The runtime daemon monitors all MPI processes.
(Diagram: MPI processes 0, 1 and 2 running side by side.)

MPI Runtime
Starts multiple copies of the program.
Maps MPI processes to physical cores.
Coordinates communication between MPI processes.
Sends/receives signals to/from MPI processes (e.g. when Ctrl-C is pressed by the user).
Traps errors.

Running MPI Programs
Use mpirun or mpiexec to launch MPI programs on multiple hosts.
You need to provide a list of hosts so that the launcher knows where to start the program:
either with the -host <host1>,<host2>,... flag,
or by listing all hosts in a host file and pointing to it with the -hostfile flag.

Example: MPI Laplace Solver
[lyan1@shelob001 par2015]$ mpicc -o lap.mpi laplace_solver_mpi_v3.c
[lyan1@shelob001 par2015]$ mpirun -np 4 -host shelob001,shelob001,shelob001,shelob001 ./lap.mpi 1000 1000 2 100000 10000 0.00001
pstree output (flattened), showing mpirun with four lap.mpi processes, each with helper threads:
bash(21490) mpirun(25141) lap.mpi(25142) pstree(25154) {lap.mpi}(25149) {lap.mpi}(25153) lap.mpi(25143) {lap.mpi}(25148) {lap.mpi}(25152) lap.mpi(25144) {lap.mpi}(25146) {lap.mpi}(25151) lap.mpi(25145) {lap.mpi}(25147) {lap.mpi}(25150) pbs_demux(21491)

Host File
The MPI launcher spawns processes on remote hosts according to the host file.
One host name per line.
If N processes are desired on a certain host, the host name needs to be repeated N times (N lines).
The host file can be modified to control the number of MPI processes started on each node.

[lyan1@shelob001 par2015]$ cat hostlist
shelob001
shelob001
shelob002
shelob002
[lyan1@shelob001 par2015]$ mpirun -np 4 -hostfile ./hostlist ./lap.mpi 1000 1000 2 100000 10000 0.00001
pstree output on shelob001 (flattened), showing two lap.mpi processes under mpirun:
bash(21490) mpirun(28736) lap.mpi(28737) pstree(28743) {lap.mpi}(28740) {lap.mpi}(28742) lap.mpi(28738) {lap.mpi}(28739) {lap.mpi}(28741) pbs_demux(21491)
On the second node, the other two processes run under the orted daemon:
[lyan1@shelob002 ~]$ pstree -p -u lyan1
orted(21757) lap.mpi(21758) {lap.mpi}(21761) {lap.mpi}(21762) lap.mpi(21759) {lap.mpi}(21760) {lap.mpi}(21763)

Why Do We Need Hybrid Applications?
Adding OpenMP threading:
may reduce the memory footprint of MPI processes (MPI processes on a node replicate memory);
may reduce MPI communication between processes;
may reduce programming effort for certain sections of the code;
may achieve better load balancing;
allows programmers to take advantage of accelerators.

MPI + OpenMP
MPI programs with OpenMP parallel regions.
MPI communication can occur within an OpenMP parallel region, but this needs support from the MPI library (see the sketch below).
(Diagram: MPI processes 0, 1 and 2, each spawning multiple OpenMP threads.)
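A minimal hybrid sketch (illustrative, not from the slides): MPI_Init_thread requests the thread support level the program needs; MPI_THREAD_FUNNELED means only the master thread makes MPI calls, which is enough for the common pattern of threaded compute regions followed by MPI communication outside them.

/* MPI + OpenMP: threaded work inside each rank, MPI calls from the master thread. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int rank, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)        /* OpenMP threads share this loop */
        local += 1.0e-6;

    /* Only the master thread (outside the parallel region) calls MPI here. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}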

Running MPI+OpenMP Hybrid Applications
Use mpirun or mpiexec to start the application.
Set OMP_NUM_THREADS to indicate how many threads per MPI process.
Indicate how many MPI processes per node: modify the host file, or use the -perhost (MPICH-based) or -npernode (OpenMPI) flag.

Example: NPB BT-MZ Benchmark
BT-MZ is part of NPB, a parallel benchmark suite developed by NASA.
It is a finite difference solver of the Navier-Stokes equations.
There are a few different problem sizes (classes); we will run the C class (relatively small).

Running BT-MZ Hybrid Benchmark
[lyan1@shelob012 run]$ OMP_NUM_THREADS=4 mpirun -np 16 -npernode 4 -host shelob012,shelob013,shelob014,shelob015 ../bin/bt-mz.C.16
Use the default load factors with threads
Total number of threads: 64 (4.0 threads/process)
Calculated speedup = 63.89
BT-MZ Benchmark Completed.
Class = C
Size = 480x 320x 28
Iterations = 200
Time in seconds = 14.51
Total processes = 16

Running BT-MZ Hybrid Benchmark
These two commands are equivalent to the one on the last slide:
[lyan1@shelob012 run]$ OMP_NUM_THREADS=4 mpirun -np 16 -npernode 4 -hostfile $PBS_NODEFILE ../bin/bt-mz.C.16
[lyan1@shelob012 run]$ OMP_NUM_THREADS=4 mpirun -np 16 -hostfile modified_hostfile ../bin/bt-mz.C.16
For MPICH-based implementations, use:
[lyan1@shelob012 run]$ mpirun -env OMP_NUM_THREADS 4 -np 16 -perhost 4 -host shelob012,shelob013,shelob014,shelob015 ../bin/bt-mz.C.16

How Many Processes? How Many Threads?
In our example we have 4 nodes with 16 cores each at our disposal. Should we run 4 MPI processes, each with 16 threads, or 64 MPI processes, each with 1 thread, or anywhere in between?

How Many Processes? How Many Threads?
It depends. In many cases pure MPI applications are faster than their naïvely written/run hybrid counterparts.

How Many Processes? How Many Threads?
[lyan1@shelob012 run]$ OMP_NUM_THREADS=1 mpirun -np 64 -npernode 16 -host shelob012,shelob013,shelob014,shelob015 ../bin/bt-mz.C.64
[lyan1@shelob012 run]$ OMP_NUM_THREADS=2 mpirun -np 32 -npernode 8 -host shelob012,shelob013,shelob014,shelob015 ../bin/bt-mz.C.32
[lyan1@shelob012 run]$ OMP_NUM_THREADS=4 mpirun -np 16 -npernode 4 -host shelob012,shelob013,shelob014,shelob015 ../bin/bt-mz.C.16
[lyan1@shelob012 run]$ OMP_NUM_THREADS=8 mpirun -np 8 -npernode 2 -host shelob012,shelob013,shelob014,shelob015 ../bin/bt-mz.C.8
[lyan1@shelob012 run]$ OMP_NUM_THREADS=16 mpirun -np 4 -npernode 1 -host shelob012,shelob013,shelob014,shelob015 ../bin/bt-mz.C.4

How Many Processes? How Many Threads?
Results of the runs on the previous slide:
# of processes    # of threads per proc    Time (seconds)
 4                16                       18.61
 8                 8                       16.11
16                 4                       14.51
32                 2                       13.05
64                 1                       13.28

Affinity Matters (1)
MPI process/OpenMP thread placement will affect performance, because suboptimal placement may cause:
unnecessary inter-node communication (communication between MPI processes could cause network contention);
unnecessary cross-socket memory access (the ccNUMA issue).

Affinity Matters (2)
[lyan1@shelob012 run]$ OMP_NUM_THREADS=1 mpirun -np 64 -npernode 16 -host shelob012,shelob013,shelob014,shelob015 -bynode -bind-to-socket -report-bindings ../bin/bt-mz.C.64
[shelob012:04434] MCW rank 0 bound to socket 0[core 0-7]: [B B B B B B B B][........]
[shelob012:04434] MCW rank 4 bound to socket 0[core 0-7]: [B B B B B B B B][........]
[shelob012:04434] MCW rank 8 bound to socket 0[core 0-7]: [B B B B B B B B][........]
[shelob012:04434] MCW rank 12 bound to socket 0[core 0-7]: [B B B B B B B B][........]
[shelob014:25593] MCW rank 2 bound to socket 0[core 0-7]: [B B B B B B B B][........]
Time in seconds = 57.81 -- 4X slower!

Affinity Matters (3)
Process placement (how to map MPI processes to processing elements): OpenMPI: the -by{slot|socket|node} options; MPICH/MVAPICH2: -map-by {socket|core|...}.
Process binding (how to bind MPI processes to physical cores): OpenMPI: the -bind-to-{slot|socket|node} options; MPICH/MVAPICH2: -bind-to {socket|core|...}.
Thread binding: compiler-specific runtime settings; Intel: KMP_AFFINITY; GNU: GOMP_CPU_AFFINITY. An example combination is sketched below.
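For illustration only (exact flag spellings vary between MPI implementations and versions, and ./a.out stands for any MPI binary), a run that places two processes per node, binds each to a socket, and lets the OpenMP runtime pin the threads might look like:

export OMP_NUM_THREADS=8
export KMP_AFFINITY=compact          # Intel OpenMP: pack threads onto adjacent cores
export GOMP_CPU_AFFINITY="0-15"      # GNU OpenMP equivalent: pin threads to cores 0-15
mpirun -np 8 -npernode 2 -bysocket -bind-to-socket -report-bindings ./a.out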

Affinity Matters (4)
Pure MPI applications can be affected by affinity as well. BT-MZ benchmark, class C:
By        Bind          Time (seconds)
Core      No binding    13.28
Node      No binding    13.13
Node      Core          12.87
Core      Core          13.20
Socket    Socket        13.04
In this case the code is not very demanding on memory and network, but there is still a ~3% difference.

MPI + CUDA
Use the MPI launcher to start the program; each MPI process can launch CUDA kernels (see the sketch below).
Example: NAMD, a very popular molecular dynamics code.
NAMD is based on Charm++, a parallel programming system that can function with or without MPI; on Shelob, Charm++ is built on top of MVAPICH2.
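A sketch of the usual pattern (illustrative; this is not NAMD's actual code): each MPI rank selects a GPU and then does its own CUDA work. A production code would map ranks to GPUs using the node-local rank rather than the simple modulo used here.

/* MPI + CUDA: one GPU per rank, picked by rank modulo the device count. */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[]) {
    int rank, ndev = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ndev);                 /* GPUs visible on this node    */
    if (ndev > 0) cudaSetDevice(rank % ndev);  /* simple rank-to-GPU mapping   */
    printf("rank %d using GPU %d of %d\n", rank, ndev > 0 ? rank % ndev : -1, ndev);

    /* ... allocate device memory, launch kernels, exchange results via MPI ... */

    MPI_Finalize();
    return 0;
}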

Running NAMD with MPI + CUDA
[lyan1@shelob001 stmv]$ mpirun -np 32 -hostfile $PBS_NODEFILE `which namd2` stmv.namd
Charm++> Running on MPI version: 2.1
Charm++> Running on 2 unique compute nodes (16-way SMP).
Pe 1 physical rank 1 will use CUDA device of pe 4
Pe 14 physical rank 14 will use CUDA device of pe 8
Pe 0 physical rank 0 will use CUDA device of pe 4
Pe 3 physical rank 3 will use CUDA device of pe 4
Pe 12 physical rank 12 will use CUDA device of pe 8
Pe 4 physical rank 4 binding to CUDA device 0 on shelob001: 'Tesla K20Xm' Mem: 5759MB Rev: 3.5
<simulation starts>

MPI + Xeon Phi
Use the MPI launcher to start the program; each MPI process can offload computation to the Xeon Phi device(s) (a sketch of the offload model follows below).
Example: LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator), another popular molecular dynamics code.
On SuperMIC, it is built with Intel MPI.
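The sketch below illustrates the offload model only; it uses the Intel compiler's offload pragma and is not LAMMPS's actual code. Each MPI rank runs on the host CPU and ships one loop to the coprocessor.

/* MPI rank offloading a loop to a Xeon Phi (Intel compiler offload pragma). */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    const int n = 1000000;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < n; i++) a[i] = i;

    /* With the Intel compiler this loop runs on the coprocessor;
       the in()/out() clauses describe the data copied over PCIe. */
    #pragma offload target(mic:0) in(a:length(n)) out(b:length(n))
    for (i = 0; i < n; i++)
        b[i] = 2.0 * a[i];

    if (rank == 0) printf("b[n-1] = %f\n", b[n-1]);

    free(a); free(b);
    MPI_Finalize();
    return 0;
}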

Running LAMMPS with Xeon Phi
[lyan1@smic043 TEST]$ mpiexec.hydra -env OMP_NUM_THREADS=1 -n 80 -perhost 20 -f $PBS_NODEFILE `which lmp_intel_phi` -in in.intel.rhodo -log none -v b -1 -v s intel
LAMMPS (21 Jan 2015)
using 1 OpenMP thread(s) per MPI task
Intel Package: Affinitizing MPI Tasks to 1 Cores Each
Using Intel Coprocessor with 4 threads per core, 24 threads per task
Precision: mixed
Setting up run...
Memory usage per processor = 86.7087 Mbytes
<simulation starts>

Non-MPI Parallel Applications
Example: NAMD on SuperMIC.
Charm++ is built using IB verbs, the InfiniBand communication library, so no MPI is involved.
It automatically detects and utilizes the Xeon Phi devices.

Running NAMD without MPI
[lyan1@smic031 stmv]$ for node in `cat $PBS_NODEFILE | uniq`; do echo host $node; done > hostfile
[lyan1@smic031 stmv]$ cat hostfile
host smic031
host smic037
host smic039
host smic040
[lyan1@smic031 stmv]$ charmrun ++p 80 ++nodelist ./hostfile ++remote-shell ssh `which namd2` stmv.namd
Charmrun> IBVERBS version of charmrun
Pe 1 physical rank 0 binding to MIC device 0 on smic277: 240 procs 240 threads
Pe 8 physical rank 2 will use MIC device of pe 32
Pe 25 physical rank 6 will use MIC device of pe 1
Pe 0 physical rank 0 will use MIC device of pe 32
<simulation starts>

Wrap-up
Distributed memory systems are dominant in the HPC world.
MPI is the de facto programming model on those systems; there are a few different implementations around.
That said, node-level architecture awareness is also important.
If you are a developer: keep in mind all the bottlenecks that could hurt the performance of your code.
If you are not a developer: at least try a few different combinations of processes, threads and bindings.

Questions?