MVAPICH2-X 2.1 User Guide


MVAPICH TEAM
NETWORK-BASED COMPUTING LABORATORY
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
THE OHIO STATE UNIVERSITY

Copyright (c) Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved.

Last revised: April 3, 2015

Contents

1 Overview of the MVAPICH2-X Project
2 Features
3 Download and Installation Instructions
  3.1 Example downloading RHEL6 package
  3.2 Example installing RHEL6 package
      Installing with local Berkeley UPC translator support
      Installing with GNU UPC translator support
      Installing CAF with OpenUH Compiler
4 Basic Usage Instructions
  4.1 Compile Applications
      4.1.1 Compile using mpicc for MPI or MPI+OpenMP Applications
      4.1.2 Compile using oshcc for OpenSHMEM or MPI+OpenSHMEM applications
      4.1.3 Compile using upcc for UPC or MPI+UPC applications
      4.1.4 Compile using uhcaf for CAF and MPI+CAF applications
  4.2 Run Applications
      4.2.1 Run using mpirun_rsh
      4.2.2 Run using oshrun
      4.2.3 Run using upcrun
      4.2.4 Run using cafrun
      4.2.5 Run using Hydra (mpiexec)
5 Hybrid (MPI+PGAS) Applications
  5.1 MPI+OpenSHMEM Example
  5.2 MPI+UPC Example
6 OSU PGAS Benchmarks
  6.1 OSU OpenSHMEM Benchmarks
  6.2 OSU UPC Benchmarks
7 Runtime Parameters
  7.1 UPC Runtime Parameters
      7.1.1 UPC_SHARED_HEAP_SIZE
  7.2 OpenSHMEM Runtime Parameters
      7.2.1 OOSHM_USE_SHARED_MEM
      7.2.2 OOSHM_SYMMETRIC_HEAP_SIZE
      7.2.3 OSHM_USE_CMA
8 FAQ and Troubleshooting with MVAPICH2-X
  8.1 General Questions and Troubleshooting
      8.1.1 Compilation Errors with upcc
      8.1.2 Unresponsive upcc
      8.1.3 Shared memory limit for OpenSHMEM / MPI+OpenSHMEM programs
      8.1.4 Install MVAPICH2-X to a specific location

1 Overview of the MVAPICH2-X Project

Message Passing Interface (MPI) has been the most popular programming model for developing parallel scientific applications. Partitioned Global Address Space (PGAS) programming models are an attractive alternative for designing applications with irregular communication patterns. They improve programmability by providing a shared memory abstraction while exposing the locality control required for performance. It is widely believed that a hybrid programming model (MPI+X, where X is a PGAS model) is optimal for many scientific computing problems, especially for exascale computing.

MVAPICH2-X provides a unified high-performance runtime that supports both MPI and PGAS programming models on InfiniBand clusters. It enables developers to port parts of large MPI applications that are suited to the PGAS programming model. This minimizes the development overhead that has been a huge deterrent to porting MPI applications to PGAS models. The unified runtime also delivers superior performance compared to using separate MPI and PGAS libraries by optimizing the use of network and memory resources.

MVAPICH2-X supports Unified Parallel C (UPC), OpenSHMEM, and CAF as PGAS models. It can be used to run pure MPI, MPI+OpenMP, pure PGAS (UPC/OpenSHMEM/CAF), as well as hybrid MPI(+OpenMP) + PGAS applications. MVAPICH2-X derives from the popular MVAPICH2 library and inherits many of its features for performance and scalability of MPI communication. It takes advantage of the RDMA features offered by the InfiniBand interconnect to support UPC/OpenSHMEM/CAF data transfer and OpenSHMEM atomic operations. It also provides a high-performance shared memory channel for multi-core InfiniBand clusters.

[Figure 1: MVAPICH2-X Architecture. Applications (MPI, PGAS, or hybrid MPI+PGAS) issue MPI and PGAS communication calls through the respective MPI and PGAS interfaces, which are served by the unified MVAPICH2-X runtime (InfiniBand channel, Shared Memory channel) over the InfiniBand network.]

The MPI implementation of MVAPICH2-X is based on MVAPICH2, which supports MPI-3 features. The UPC implementation is UPC Language Specification 1.2 standard compliant and is based on Berkeley UPC. The OpenSHMEM implementation is OpenSHMEM 1.0 standard compliant and is based on the OpenSHMEM Reference Implementation 1.0h. The CAF implementation is Coarray Fortran (Fortran 2015) standard compliant and is based on the UH CAF Reference Implementation.

The current release supports the InfiniBand transport interface (inter-node) and the Shared Memory interface (intra-node). The overall architecture of MVAPICH2-X is shown in Figure 1.

This document contains the necessary information for users to download, install, test, use, tune and troubleshoot MVAPICH2-X 2.1. We continuously fix bugs and update this document as per user feedback. Therefore, we strongly encourage you to refer to our web page for updates.

2 Features

MVAPICH2-X supports pure MPI programs, MPI+OpenMP programs, UPC programs, OpenSHMEM programs, CAF programs, as well as hybrid MPI(+OpenMP) + PGAS (UPC/OpenSHMEM/CAF) programs. The current version of MVAPICH2-X 2.1 supports UPC, OpenSHMEM and CAF as the PGAS models. High-level features of MVAPICH2-X 2.1 are listed below.

MPI Features

- Support for MPI-3 features
- (NEW) Based on MVAPICH2 2.1 (OFA-IB-CH3 interface). MPI programs can take advantage of all the features enabled by default in the OFA-IB-CH3 interface of MVAPICH2 2.1:
  - High performance two-sided communication scalable to multi-thousand nodes
  - Optimized collective communication operations
    - Shared-memory optimized algorithms for barrier, broadcast, reduce and allreduce operations
    - Optimized two-level designs for scatter and gather operations
    - Improved implementation of allgather and alltoall operations
  - High-performance and scalable support for one-sided communication
    - Direct RDMA-based designs for one-sided communication
    - Shared memory backed windows for one-sided communication
    - Support for truly passive locking for intra-node RMA in shared memory backed windows
  - Multi-threading support
    - Enhanced support for multi-threaded MPI applications

Unified Parallel C (UPC) Features

- UPC Language Specification 1.2 standard compliance
- (NEW) Based on Berkeley UPC (contains changes/additions in preparation for the UPC 1.3 specification)
- Optimized RDMA-based implementation of UPC data movement routines
- Improved UPC memput design for small/medium size messages
- Support for GNU UPC translator
- Optimized UPC collectives (improved performance for upc_all_broadcast, upc_all_scatter, upc_all_gather, upc_all_gather_all, and upc_all_exchange)

OpenSHMEM Features

- OpenSHMEM 1.0 standard compliance
- (NEW) Based on OpenSHMEM reference implementation v1.0h
- (NEW) Support for on-demand establishment of connections
- (NEW) Improved job start-up and memory footprint
- Optimized RDMA-based implementation of OpenSHMEM data movement routines
- Support for OpenSHMEM shmem_ptr functionality
- Efficient implementation of OpenSHMEM atomics using RDMA atomics
- Optimized OpenSHMEM put routines for small/medium message sizes
- Optimized OpenSHMEM collectives (improved performance for shmem_collect, shmem_fcollect, shmem_barrier, shmem_reduce and shmem_broadcast)
- Optimized shmalloc routine
- Improved intra-node communication performance using shared memory and CMA designs

(NEW) CAF Features

- (NEW) Based on the University of Houston CAF implementation
- (NEW) Efficient point-to-point read/write operations
- (NEW) Efficient CO_REDUCE and CO_BROADCAST collective operations

Hybrid Program Features

- Supports hybrid programming using MPI(+OpenMP), MPI(+OpenMP)+UPC, MPI(+OpenMP)+OpenSHMEM and MPI(+OpenMP)+CAF
- Compliance to MPI-3, UPC 1.2, OpenSHMEM 1.0 and CAF Fortran 2015 standards
- Optimized network resource utilization through the unified communication runtime
- Efficient deadlock-free progress of MPI and UPC/OpenSHMEM/CAF calls

Unified Runtime Features

- Based on MVAPICH2 2.1 (OFA-IB-CH3 interface). MPI, UPC, OpenSHMEM, CAF and hybrid programs benefit from the features listed below:
  - Scalable inter-node communication with highest performance and reduced memory usage
    - Integrated RC/XRC design to get the best performance on large-scale systems with reduced/constant memory footprint
    - RDMA Fast Path connections for efficient small message communication
    - Shared Receive Queue (SRQ) with flow control to significantly reduce the memory footprint of the library
    - AVL tree-based resource-aware registration cache
    - Automatic tuning based on network adapter and host architecture
  - Optimized intra-node communication support by taking advantage of shared-memory communication
    - Efficient buffer organization for memory scalability of intra-node communication
    - Automatic intra-node communication parameter tuning based on platform
  - Flexible CPU binding capabilities
    - Portable Hardware Locality (hwloc v1.9) support for defining CPU affinity
    - Efficient CPU binding policies (bunch and scatter patterns, socket and numanode granularity) to specify CPU binding per job for modern multi-core platforms
    - Allow user-defined flexible processor affinity
  - Two modes of communication progress
    - Polling
    - Blocking (enables running multiple processes/processor)
- Flexible process manager support
  - Support for mpirun_rsh, hydra, upcrun and oshrun process managers

3 Download and Installation Instructions

The MVAPICH2-X package can be downloaded from the MVAPICH project web page. Select the link for your distro. All MVAPICH2-X RPMs are relocatable. As an initial technology preview, we are providing RHEL6 and RHEL5 RPMs. We provide RPMs compatible with either OFED and Mellanox OFED 1.5.3, or OFED 3.5, Mellanox OFED 2.2, and the stock RHEL distro InfiniBand packages.

3.1 Example downloading RHEL6 package

Below are the steps to download MVAPICH2-X RPMs for RHEL6:

$ wget <download-url>/mvapich2-x-<version>.rhel6.mlnx-2.2.tar.gz
$ tar xzf mvapich2-x-<version>.rhel6.mlnx-2.2.tar.gz
$ cd mvapich2-x-<version>.rhel6.mlnx-2.2
$ ls
install.sh
mvapich2-x_gnu-<version>.el6.x86_64.rpm
mvapich2-x_intel-<version>.el6.x86_64.rpm
openshmem-osu_gnu-<version>.el6.x86_64.rpm
openshmem-osu_intel-<version>.el6.x86_64.rpm
berkeley_upc-osu_gnu-<version>.el6.x86_64.rpm
berkeley_upc-osu_gnu-gupc-<version>.el6.x86_64.rpm
berkeley_upc-osu_intel-<version>.el6.x86_64.rpm
libcaf-osu_gnu-<version>.el6.x86_64.rpm
osu-micro-benchmarks_gnu-<version>.el6.x86_64.rpm
osu-micro-benchmarks_intel-<version>.el6.x86_64.rpm
README

3.2 Example installing RHEL6 package

Running the install.sh script will install the software in /opt/mvapich2-x. The /opt/mvapich2-x/gnu directory contains the software built using the gcc distributed with RHEL6. The /opt/mvapich2-x/intel directory contains the software built using the Intel 13 compilers. The install.sh script runs the following command:

rpm -Uvh *.rpm --force --nodeps

This will upgrade any prior versions of MVAPICH2-X that may be present, as well as ignore any dependency issues that may be present with the Intel RPMs (some dependencies are available only after sourcing the env scripts provided by the Intel compiler). These RPMs are relocatable and advanced users may skip the install.sh script to directly use alternate commands to install the desired RPMs.
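For instance, a manual install of just the GNU MPI and OpenSHMEM pieces could look roughly like the following sketch (file names abbreviated with placeholders; the full set of RPM names is listed next):

$ rpm -Uvh mvapich2-x_gnu-<version>.el6.x86_64.rpm openshmem-osu_gnu-<version>.el6.x86_64.rpm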

GCC RPMs:

mvapich2-x_gnu-<version>.el6.x86_64.rpm
openshmem-osu_gnu-<version>.el6.x86_64.rpm
berkeley_upc-osu_gnu-<version>.el6.x86_64.rpm
berkeley_upc-osu_gnu-gupc-<version>.el6.x86_64.rpm
libcaf-osu_gnu-<version>.el6.x86_64.rpm
osu-micro-benchmarks_gnu-<version>.el6.x86_64.rpm

Intel RPMs:

mvapich2-x_intel-<version>.el6.x86_64.rpm
openshmem-osu_intel-<version>.el6.x86_64.rpm
berkeley_upc-osu_intel-<version>.el6.x86_64.rpm
osu-micro-benchmarks_intel-<version>.el6.x86_64.rpm

Installing with local Berkeley UPC translator support

By default, MVAPICH2-X UPC uses the online UPC-to-C translator, just as Berkeley UPC does. If your install environment cannot access the Internet, upcc will not work. In this situation, a local translator should be installed. The local Berkeley UPC-to-C translator can be downloaded from the Berkeley UPC website. After installing it, you should edit the upcc configuration file (/opt/mvapich2-x/gnu/etc/upcc.conf or $HOME/.upccrc) and set the translator option to the path of the local translator (e.g. /usr/local/berkeley_upc_translator-<version>/targ), as shown below.
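For example, the translator setting in upcc.conf (or $HOME/.upccrc) would look roughly like the following line; the exact path depends on where the local translator was installed:

translator = /usr/local/berkeley_upc_translator-<version>/targ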

Installing with GNU UPC translator support

GNU UPC (GUPC) translator support is included in the MVAPICH2-X GNU UPC RPM (berkeley_upc-osu_gnu-gupc-<version>.el6.x86_64.rpm) as an optional install item. If it is installed, applications can be compiled directly using upcc, and gupc will be invoked internally. To enable GUPC support, GUPC should be installed separately as a prerequisite; it can be downloaded from the GUPC website. By default, MVAPICH2-X expects GUPC to be installed in /usr/local/gupc.

Installing CAF with OpenUH Compiler

The CAF implementation of MVAPICH2-X is based on the OpenUH CAF compiler. Thus an installation of the OpenUH compiler is needed. Here are the detailed steps to build CAF support in MVAPICH2-X.

Installing OpenUH Compiler:

$ mkdir openuh-install; cd openuh-install
$ wget <site>/openuh/download/packages/openuh-<version>-x86_64-bin.tar.bz2
$ tar xjf openuh-<version>-x86_64-bin.tar.bz2

Export the PATH of OpenUH (<path-to>/openuh-install/openuh-<version>/bin) into the environment.

Installing MVAPICH2-X CAF:

Install the MVAPICH2-X RPMs (simply use install.sh, and the software will be installed in /opt/mvapich2-x).

Copy the libcaf directory from MVAPICH2-X into OpenUH:

$ cp -a /opt/mvapich2-x/gnu/lib64/gcc-lib <path-to>/openuh-install/openuh-<version>/lib
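Putting the environment setup together, the exports typically look like the following sketch (install paths are placeholders; the MVAPICH2-X entries are the same ones referenced again in Section 4.2.4):

$ export PATH=<path-to>/openuh-install/openuh-<version>/bin:$PATH
$ export PATH=/opt/mvapich2-x/gnu/bin:$PATH
$ export LD_LIBRARY_PATH=/opt/mvapich2-x/gnu/lib64:$LD_LIBRARY_PATH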

Please email us at mvapich-help@cse.ohio-state.edu if your distro does not appear on the list or if you experience any trouble installing the package on your system.

4 Basic Usage Instructions

4.1 Compile Applications

MVAPICH2-X supports MPI applications, PGAS (OpenSHMEM, UPC or CAF) applications and hybrid (MPI+OpenSHMEM, MPI+UPC or MPI+CAF) applications. Users should choose the corresponding compiler according to the application. These compilers (oshcc, upcc, uhcaf and mpicc) can be found under the <MVAPICH2-X_INSTALL>/bin folder.

4.1.1 Compile using mpicc for MPI or MPI+OpenMP Applications

Please use mpicc for compiling MPI and MPI+OpenMP applications. Below are examples of building MPI applications using mpicc:

$ mpicc -o test test.c

This command compiles the test.c program into the executable test using mpicc.

$ mpicc -fopenmp -o hybrid mpi_openmp_hybrid.c

This command compiles an MPI+OpenMP program mpi_openmp_hybrid.c into the executable hybrid using mpicc, when MVAPICH2-X is built with the GCC compiler. For Intel compilers, use -openmp instead of -fopenmp; for PGI compilers, use -mp instead of -fopenmp.

4.1.2 Compile using oshcc for OpenSHMEM or MPI+OpenSHMEM applications

Below is an example of building an MPI, an OpenSHMEM or a hybrid application using oshcc:

$ oshcc -o test test.c

This command compiles the test.c program into the executable test using oshcc. For MPI+OpenMP hybrid programs, add the compile flag -fopenmp, -openmp or -mp according to the compiler, as mentioned in the mpicc usage examples.
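As a quick sanity check of the tool chain, a minimal OpenSHMEM program along the lines of the following sketch (not part of the MVAPICH2-X distribution; file name hello_oshm.c is hypothetical) can be compiled with the oshcc command above. It uses the OpenSHMEM 1.0-style start_pes/my_pe/num_pes calls that are also referenced in Section 5.1.

#include <stdio.h>
#include <shmem.h>

int main(void)
{
    /* initialize the OpenSHMEM runtime */
    start_pes(0);

    /* query this PE's rank and the total number of PEs */
    int me = my_pe();
    int npes = num_pes();

    printf("Hello from PE %d of %d\n", me, npes);

    /* synchronize all PEs before exiting */
    shmem_barrier_all();
    return 0;
}

It can be compiled with "$ oshcc -o hello_oshm hello_oshm.c" and launched with oshrun as described in Section 4.2.2.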

4.1.3 Compile using upcc for UPC or MPI+UPC applications

Below is an example of building a UPC or a hybrid MPI+UPC application using upcc:

$ upcc -o test test.c

This command compiles the test.c program into the executable test using upcc.

Notes:

(1) The UPC compiler generates the following warning if MPI symbols are found in the source code:

upcc: warning: MPI * symbols seen at link time: should you be using --uses-mpi

This warning message can be safely ignored.

(2) upcc requires a C compiler as the back-end, whose version should be the same as the compiler used to build the MVAPICH2-X libraries. Taking the MVAPICH2-X RHEL6 GCC RPMs as an example, the C compiler should be the GCC distributed with RHEL6. You need to install this version of GCC before using upcc.

4.1.4 Compile using uhcaf for CAF and MPI+CAF applications

Below is an example of building an MPI, a CAF or a hybrid application using uhcaf.

Download the UH test example:

$ wget <site>/openuh/download/packages/caf-runtime-<version>-src.tar.bz2
$ tar xjf caf-runtime-<version>-src.tar.bz2
$ cd caf-runtime-<version>/regression-tests/cases/singles/should-pass

Compilation using uhcaf: to take advantage of MVAPICH2-X, users need to specify MVAPICH2-X as the conduit during compilation.

$ uhcaf --layer=gasnet-mvapich2x -o event_test event_test.caf

4.2 Run Applications

This section provides instructions on how to run applications with MVAPICH2-X. Please note that on modern multi-core architectures, process-to-core placement has an impact on performance. MVAPICH2-X inherits its process-to-core binding capabilities from MVAPICH2. Please refer to the MVAPICH2 User Guide for process mapping options on multi-core nodes.

4.2.1 Run using mpirun_rsh

The MVAPICH team suggests that users use this mode of job start-up. mpirun_rsh provides fast and scalable job start-up, scaling to multi-thousand node clusters. It can be used to launch MPI, OpenSHMEM, UPC, CAF and hybrid applications.

Prerequisites:

- Either ssh or rsh should be enabled between the front nodes and the computing nodes. In addition to this setup, you should be able to log in to the remote nodes without any password prompts.
- All host names should resolve to the same IP address on all machines. For instance, if a machine's host name resolves to the local loopback address due to the default /etc/hosts on some Linux distributions, it leads to incorrect behavior of the library.

Jobs can be launched using mpirun_rsh by specifying the target nodes as part of the command, as shown below:

$ mpirun_rsh -np 4 n0 n0 n1 n1 ./test

This command launches test on nodes n0 and n1, two processes per node. By default, ssh is used.

$ mpirun_rsh -rsh -np 4 n0 n0 n1 n1 ./test

This command launches test on nodes n0 and n1, two processes per node, using rsh instead of ssh.

The target nodes can also be specified using a hostfile.

$ mpirun_rsh -np 4 -hostfile hosts ./test

The list of target nodes must be provided in the file hosts, one per line. MPI or OpenSHMEM ranks are assigned in the order the hosts are listed in the hosts file or in the order they are passed to mpirun_rsh. That is, if the nodes are listed as n0 n1 n0 n1, then n0 will have two processes, rank 0 and rank 2, whereas n1 will have rank 1 and rank 3. This rank distribution is known as cyclic. If the nodes are listed as n0 n0 n1 n1, then n0 will have ranks 0 and 1, whereas n1 will have ranks 2 and 3. This rank distribution is known as block.

The mpirun_rsh hostfile format allows users to specify a multiplier to reduce redundancy. It also allows users to specify the HCA to be used for communication. The multiplier saves typing by letting you specify a blocked distribution of MPI ranks using one line per hostname. The HCA specification allows you to force an MPI rank to use a particular HCA. The optional components are delimited by a ":". Comments and empty lines are also allowed. Comments start with "#" and continue to the next newline. Below are a few examples of hostfile formats:

$ cat hosts
# sample hostfile for mpirun_rsh
host1           # rank 0 will be placed on host1
host2:2         # ranks 1 and 2 will be placed on host2
host3:hca1      # rank 3 will be on host3 and will use hca1
host4:4:hca2    # ranks 4 through 7 will be on host4 and use hca2
# if the number of processes specified for this job is greater than 8
# then the additional ranks will be assigned to the hosts in a cyclic
# fashion. For example, rank 8 will be on host1 and ranks 9 and 10
# will be on host2.

Many parameters of the MPI library can be configured at run time using environment variables. In order to pass any environment variable to the application, simply put the variable names and values just before the executable name, as in the following example:

$ mpirun_rsh -np 4 -hostfile hosts ENV1=value ENV2=value ./test

Note that the environment variables should be put immediately before the executable. Alternatively, you may also place environment variables in your shell environment (e.g. .bashrc). These will be automatically picked up when the application starts executing.

4.2.2 Run using oshrun

MVAPICH2-X provides oshrun, which can be used to launch applications as shown below.

$ oshrun -np 2 ./test

This command launches two processes of test on the localhost. A list of target nodes where the processes should be launched can be provided in a hostfile and used as shown below. The oshrun hostfile can be in either of the two formats outlined for mpirun_rsh earlier in this document.

$ oshrun -f hosts -np 2 ./test

4.2.3 Run using upcrun

MVAPICH2-X provides upcrun to launch UPC and MPI+UPC applications. To use upcrun, we suggest that users set the following environment variable:

$ export MPIRUN_CMD='<path-to-mvapich2-x-install>/bin/mpirun_rsh -np %N -hostfile hosts %P %A'

A list of target nodes where the processes should be launched can be provided in the hostfile named hosts. The hostfile hosts should follow the same format as for mpirun_rsh, described in Section 4.2.1. Then upcrun can be used to launch applications as shown below.

$ upcrun -n 2 ./test

4.2.4 Run using cafrun

Similar to UPC and OpenSHMEM, a CAF application can be run using cafrun or mpirun_rsh. Export the PATH and LD_LIBRARY_PATH of the GNU version of MVAPICH2-X (/opt/mvapich2-x/gnu/bin, /opt/mvapich2-x/gnu/lib64) into the environment.

$ export UHCAF_LAUNCHER_OPTS="-hostfile hosts"
$ cafrun -n 16 -v ./event_test
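Since mpirun_rsh can also launch CAF binaries, as noted above, an equivalent launch without cafrun would look roughly like the following sketch (the process count and hostfile name are illustrative):

$ mpirun_rsh -np 16 -hostfile hosts ./event_test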

4.2.5 Run using Hydra (mpiexec)

MVAPICH2-X also distributes the Hydra process manager along with mpirun_rsh. Hydra can be used via either mpiexec or mpiexec.hydra. The following is an example of running a program using it:

$ mpiexec -f hosts -n 2 ./test

This process manager has many features. Please refer to the Hydra process manager web page for more details.

5 Hybrid (MPI+PGAS) Applications

MVAPICH2-X supports hybrid programming models. Applications can be written using both MPI and PGAS constructs. Rather than using a separate runtime for each of the programming models, MVAPICH2-X supports hybrid programming using a unified runtime and thus provides better resource utilization and superior performance.

5.1 MPI+OpenSHMEM Example

A simple example of a hybrid MPI+OpenSHMEM program is shown below. It uses both MPI and OpenSHMEM constructs to print the sum of the ranks of all processes.

 1  #include <stdio.h>
 2  #include <shmem.h>
 3  #include <mpi.h>
 4
 5  static int sum = 0;
 6  int main(int c, char *argv[])
 7  {
 8      int rank, size;
 9
10      /* SHMEM init */
11      start_pes(0);
12
13      /* get rank and size */
14      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
15      MPI_Comm_size(MPI_COMM_WORLD, &size);
16
17      /* SHMEM barrier */
18      shmem_barrier_all();
19
20      /* fetch-and-add at root */
21      shmem_int_fadd(&sum, rank, 0);
22
23      /* MPI barrier */
24      MPI_Barrier(MPI_COMM_WORLD);
25
26      /* root broadcasts sum */
27      MPI_Bcast(&sum, 1, MPI_INT, 0, MPI_COMM_WORLD);
28
29      /* print sum */
30      fprintf(stderr, "(%d): Sum: %d\n", rank, sum);
31
32      shmem_barrier_all();

33      return 0;
34  }

start_pes in line 10 initializes the runtime for MPI and OpenSHMEM communication. An explicit call to MPI_Init is not required. The program uses the MPI calls MPI_Comm_rank and MPI_Comm_size to get the process rank and size, respectively (lines 14-15). MVAPICH2-X assigns the same rank for the MPI and PGAS models. Thus, alternatively, the OpenSHMEM constructs my_pe and num_pes can also be used to get rank and size, respectively. In line 17, every process does a barrier using the OpenSHMEM construct shmem_barrier_all. After this, every process does a fetch-and-add of its rank to the variable sum in process 0. The sample program uses the OpenSHMEM construct shmem_int_fadd (line 21) for this. Following the fetch-and-add, every process does a barrier using MPI_Barrier (line 24). Process 0 then broadcasts sum to all processes using MPI_Bcast (line 27). Finally, all processes print the variable sum. An explicit MPI_Finalize is not required.

The program outputs the following for a four-process run:

$ mpirun_rsh -np 4 -hostfile ./hostfile ./hybrid_mpi_shmem
(0): Sum: 6
(1): Sum: 6
(2): Sum: 6
(3): Sum: 6

The above sample hybrid program is available at <MVAPICH2-X_INSTALL>/<gnu|intel>/share/examples/hybrid_mpi_shmem.c. It can be compiled with oshcc as shown below.
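For instance (substitute your actual install path for the placeholder):

$ oshcc -o hybrid_mpi_shmem <MVAPICH2-X_INSTALL>/<gnu|intel>/share/examples/hybrid_mpi_shmem.c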

5.2 MPI+UPC Example

A simple example of a hybrid MPI+UPC program is shown below. Similarly to the previous example, it uses both MPI and UPC constructs to print the sum of the ranks of all UPC threads.

 1  #include <stdio.h>
 2  #include <upc.h>
 3  #include <mpi.h>
 4
 5  shared [1] int A[THREADS];
 6  int main() {
 7      int sum = 0;
 8      int rank, size;
 9
10      /* get MPI rank and size */
11      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
12      MPI_Comm_size(MPI_COMM_WORLD, &size);
13
14      /* UPC barrier */
15      upc_barrier;
16
17      /* initialize global array */
18      A[MYTHREAD] = rank;
19      int *local = ((int *)(&A[MYTHREAD]));
20
21      /* MPI barrier */
22      MPI_Barrier(MPI_COMM_WORLD);
23
24      /* sum up the value with each UPC thread */
25      MPI_Allreduce(local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
26
27      /* print sum */
28      if (MYTHREAD == 0)
29          fprintf(stderr, "(%d): Sum: %d\n", rank, sum);
30
31      upc_barrier;
32
33      return 0;
34  }

An explicit call to MPI_Init is not required. The program uses the MPI calls MPI_Comm_rank and MPI_Comm_size to get the process rank and size, respectively (lines 11-12). MVAPICH2-X assigns the same rank for the MPI and PGAS models. Thus, MYTHREAD and THREADS contain the UPC thread rank and the number of UPC threads, respectively, which are equal to the values returned by MPI_Comm_rank and MPI_Comm_size. In line 15, every UPC thread does a barrier using the UPC construct upc_barrier. After this, every UPC thread stores its MPI rank in one element of the global shared array A; this element A[MYTHREAD] has affinity with the UPC thread that set it (line 18). A local pointer is then set to the global shared array element so that it can be passed to MPI collective functions (line 19). Then every UPC thread does a barrier using MPI_Barrier (line 22). After the barrier, MPI_Allreduce is called (line 25) to sum up all the rank values and return the result to every UPC thread in the sum variable. Finally, all processes print the variable sum. An explicit MPI_Finalize is not required.

The program can be compiled using upcc:

$ upcc hybrid_mpi_upc.c -o hybrid_mpi_upc

The program outputs the following for a four-process run:

$ mpirun_rsh -n 4 -hostfile hosts ./hybrid_mpi_upc
(0): Sum: 6
(3): Sum: 6
(1): Sum: 6
(2): Sum: 6

The above sample hybrid program is available at <MVAPICH2-X_INSTALL>/<gnu|intel>/share/examples/hybrid_mpi_upc.c.
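The same binary can also be launched through upcrun (Section 4.2.3), provided MPIRUN_CMD has been set as described there; a sketch:

$ upcrun -n 4 ./hybrid_mpi_upc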

6 OSU PGAS Benchmarks

6.1 OSU OpenSHMEM Benchmarks

We have extended the OSU Micro-Benchmarks (OMB) suite with tests to measure the performance of OpenSHMEM operations. OSU Micro-Benchmarks (OMB-4.4.1) includes OpenSHMEM data movement and atomic operation benchmarks. The complete benchmark suite is available along with the MVAPICH2-X binary package, in the folder <MVAPICH2-X_INSTALL>/libexec/osu-micro-benchmarks. A brief description of each of the newly added benchmarks is provided below.

Put Latency (osu_oshm_put): This benchmark measures the latency of a shmem_putmem operation for different data sizes. The user is required to select whether the communication buffers should be allocated in global memory or heap memory, through a parameter. The test requires exactly two PEs. PE 0 issues shmem_putmem to write data at PE 1 and then calls shmem_quiet. This is repeated for a fixed number of iterations, depending on the data size. The average latency per iteration is reported. A few warm-up iterations are run without timing to ignore any start-up overheads. Both PEs call shmem_barrier_all after the test for each message size.

Get Latency (osu_oshm_get): This benchmark is similar to the one above except that PE 0 does a shmem_getmem operation to read data from PE 1 in each iteration. The average latency per iteration is reported.

Put Operation Rate (osu_oshm_put_mr): This benchmark measures the aggregate uni-directional operation rate of OpenSHMEM Put between pairs of PEs, for different data sizes. The user should select whether the communication buffers are in global memory or heap memory, as with the earlier benchmarks. This test requires the number of PEs to be even. The PEs are paired, with PE 0 pairing with PE n/2 and so on, where n is the total number of PEs. The first PE in each pair issues back-to-back shmem_putmem operations to its peer PE. The total time for the put operations is measured and the operation rate per second is reported. All PEs call shmem_barrier_all after the test for each message size.

Atomics Latency (osu_oshm_atomics): This benchmark measures the performance of the atomic fetch-and-operate and atomic operate routines supported in OpenSHMEM for the integer datatype. The buffers can be selected to be in heap memory or global memory. The PEs are paired as in the Put Operation Rate benchmark, and the first PE in each pair issues back-to-back atomic operations of a type to its peer PE. The average latency per atomic operation and the aggregate operation rate are reported. This is repeated for each of the fadd, finc, add, inc, cswap and swap routines.
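For reference, these point-to-point benchmarks are launched like any other OpenSHMEM program; a typical two-PE run looks like the following (append the buffer-placement parameter described above as required, and adjust the install path for your system):

$ oshrun -np 2 <MVAPICH2-X_INSTALL>/libexec/osu-micro-benchmarks/osu_oshm_put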

Collective Latency Tests: The latest OMB version includes the following benchmarks for various OpenSHMEM collective operations (shmem_collect, shmem_fcollect, shmem_broadcast, shmem_reduce and shmem_barrier):

- osu_oshm_collect - OpenSHMEM Collect Latency Test
- osu_oshm_fcollect - OpenSHMEM FCollect Latency Test
- osu_oshm_broadcast - OpenSHMEM Broadcast Latency Test
- osu_oshm_reduce - OpenSHMEM Reduce Latency Test
- osu_oshm_barrier - OpenSHMEM Barrier Latency Test

These benchmarks work in the following manner. Suppose users run the osu_oshm_broadcast benchmark with N processes; the benchmark measures the min, max and average latency of the shmem_broadcast operation across N processes, for various message lengths, over a number of iterations. In the default version, these benchmarks report the average latency for each message length. Additionally, the benchmarks support the following options:

- -f can be used to report additional statistics of the benchmark, such as min and max latencies and the number of iterations.
- -m can be used to set the maximum message length to be used in a benchmark. In the default version, the benchmarks report the latencies for up to 1MB message lengths.
- -i can be used to set the number of iterations to run for each message length.

6.2 OSU UPC Benchmarks

The OSU Micro-Benchmarks extensions include UPC benchmarks as well. The current version (OMB-4.4.1) has benchmarks for upc_memput and upc_memget. The complete benchmark suite is available along with the MVAPICH2-X binary package, in the folder <MVAPICH2-X_INSTALL>/libexec/osu-micro-benchmarks. A brief description of each of the benchmarks is provided below.

Put Latency (osu_upc_memput): This benchmark measures the latency of the upc_memput operation between multiple UPC threads. In this benchmark, UPC threads with ranks less than (THREADS/2) issue upc_memput operations to peer UPC threads. Peer threads are identified as (MYTHREAD+THREADS/2). This is repeated for a fixed number of iterations, for varying data sizes. The average latency per iteration is reported. A few warm-up iterations are run without timing to ignore any start-up overheads. All UPC threads call upc_barrier after the test for each message size.

Get Latency (osu_upc_memget): This benchmark is similar to the osu_upc_memput benchmark described above. The difference is that the shared string handling function is upc_memget. The average get operation latency per iteration is reported.

Collective Latency Tests: The latest OMB version includes the following benchmarks for various UPC collective operations (upc_all_barrier, upc_all_broadcast, upc_all_exchange, upc_all_gather, upc_all_gather_all, upc_all_reduce, and upc_all_scatter):

- osu_upc_all_barrier - UPC Barrier Latency Test
- osu_upc_all_broadcast - UPC Broadcast Latency Test
- osu_upc_all_exchange - UPC Exchange Latency Test
- osu_upc_all_gather_all - UPC GatherAll Latency Test
- osu_upc_all_gather - UPC Gather Latency Test
- osu_upc_all_reduce - UPC Reduce Latency Test
- osu_upc_all_scatter - UPC Scatter Latency Test

These benchmarks work in the following manner. Suppose users run osu_upc_all_broadcast with N processes; the benchmark measures the min, max and average latency of the upc_all_broadcast operation across N processes, for various message lengths, over a number of iterations. In the default version, these benchmarks report the average latency for each message length. Additionally, the benchmarks support the following options:

- -f can be used to report additional statistics of the benchmark, such as min and max latencies and the number of iterations.
- -m can be used to set the maximum message length to be used in a benchmark. In the default version, the benchmarks report the latencies for up to 1MB message lengths.
- -i can be used to set the number of iterations to run for each message length.

7 Runtime Parameters

MVAPICH2-X supports all the runtime parameters of MVAPICH2 (OFA-IB-CH3). A comprehensive list of all runtime parameters of MVAPICH2 2.1 can be found in the MVAPICH2 User Guide. Runtime parameters specific to MVAPICH2-X are listed below; the example at the end of this section shows how they can be set at job launch.

7.1 UPC Runtime Parameters

7.1.1 UPC_SHARED_HEAP_SIZE

- Class: Run time
- Default: 64M

Set the UPC shared heap size.

7.2 OpenSHMEM Runtime Parameters

7.2.1 OOSHM_USE_SHARED_MEM

- Class: Run time
- Default: 1

Enable/disable the shared memory scheme for intra-node communication.

7.2.2 OOSHM_SYMMETRIC_HEAP_SIZE

- Class: Run time
- Default: 512M

Set the OpenSHMEM symmetric heap size.

7.2.3 OSHM_USE_CMA

- Class: Run time
- Default: 1

Enable/disable the CMA-based intra-node communication design.
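These parameters are environment variables; as with other runtime parameters (see Section 4.2.1), they can be placed before the executable name on the mpirun_rsh command line or exported in the shell. An illustrative example (the values shown are only an illustration):

$ mpirun_rsh -np 4 -hostfile hosts OOSHM_SYMMETRIC_HEAP_SIZE=1024M OSHM_USE_CMA=0 ./test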

8 FAQ and Troubleshooting with MVAPICH2-X

Based on our experience and the feedback we have received from our users, here we include some of the problems a user may experience and the steps to resolve them. If you are experiencing any other problem, please feel free to contact us by sending an email to mvapich-help@cse.ohio-state.edu.

8.1 General Questions and Troubleshooting

8.1.1 Compilation Errors with upcc

The current version of upcc available with the MVAPICH2-X package gives a compilation error if the gcc version does not match the gcc used to build the MVAPICH2-X libraries (see the note in Section 4.1.3). Please install the matching gcc version to fix this.

8.1.2 Unresponsive upcc

By default, the upcc compiler driver will transparently use Berkeley's HTTP-based public UPC-to-C translator during compilation. If your system is behind a firewall or not connected to the Internet, upcc can become unresponsive. This can be solved by using a local installation of the UPC translator, which can be downloaded from the Berkeley UPC website. The translator can be compiled and installed using the following commands:

$ make
$ make install PREFIX=<translator-install-path>

After this, upcc can be instructed to use this translator.

$ upcc -translator=<translator-install-path> hello.upc -o hello

8.1.3 Shared memory limit for OpenSHMEM / MPI+OpenSHMEM programs

By default, the symmetric heap in OpenSHMEM is allocated in shared memory. The maximum amount of shared memory in a node is limited by the memory available in /dev/shm. Usually the default system configuration for /dev/shm is 50% of main memory. Thus, programs which specify a heap size larger than the total available memory in /dev/shm will give an error. For example, if the shared memory limit is 8 GB, the combined symmetric heap size of all intra-node processes shall not exceed 8 GB. Users can change the available shared memory by remounting /dev/shm with the desired limit. Alternatively, users can control the heap size using OOSHM_SYMMETRIC_HEAP_SIZE (Section 7.2.2) or disable shared memory by setting OOSHM_USE_SHARED_MEM=0 (Section 7.2.1). Please be aware that setting a very large shared memory limit or disabling shared memory will have a performance impact.
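A quick way to check how much shared memory is actually available on a node is standard Linux tooling, e.g.:

$ df -h /dev/shm

For example, with 8 processes per node and the default 512M symmetric heap per process, the combined requirement is 4 GB, which must fit within the size reported above.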

8.1.4 Install MVAPICH2-X to a specific location

MVAPICH2-X RPMs are relocatable. Please use the --prefix option during RPM installation to install MVAPICH2-X into a specific location. An example is shown below:

$ rpm -Uvh --prefix <specific-location> mvapich2-x_gnu-<version>.el6.x86_64.rpm \
    openshmem-osu_gnu-<version>.el6.x86_64.rpm berkeley_upc-osu_gnu-<version>.el6.x86_64.rpm \
    osu-micro-benchmarks_gnu-<version>.el6.x86_64.rpm


More information

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda Speaker: Sourav Chakraborty

More information

Message Passing Interface: Basic Course

Message Passing Interface: Basic Course Overview of DM- HPC2N, UmeåUniversity, 901 87, Sweden. April 23, 2015 Table of contents Overview of DM- 1 Overview of DM- Parallelism Importance Partitioning Data Distributed Memory Working on Abisko 2

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Smart Interconnect for Next Generation HPC Platforms Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting Mellanox Connects the World s Fastest Supercomputer

More information

Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand

Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand Jiuxing Liu and Dhabaleswar K. Panda Computer Science and Engineering The Ohio State University Presentation Outline Introduction

More information

Advantages to Using MVAPICH2 on TACC HPC Clusters

Advantages to Using MVAPICH2 on TACC HPC Clusters Advantages to Using MVAPICH2 on TACC HPC Clusters Jérôme VIENNE viennej@tacc.utexas.edu Texas Advanced Computing Center (TACC) University of Texas at Austin Wednesday 27 th August, 2014 1 / 20 Stampede

More information

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy Why serial is not enough Computing architectures Parallel paradigms Message Passing Interface How

More information

SINGLE-SIDED PGAS COMMUNICATIONS LIBRARIES. Basic usage of OpenSHMEM

SINGLE-SIDED PGAS COMMUNICATIONS LIBRARIES. Basic usage of OpenSHMEM SINGLE-SIDED PGAS COMMUNICATIONS LIBRARIES Basic usage of OpenSHMEM 2 Outline Concept and Motivation Remote Read and Write Synchronisation Implementations OpenSHMEM Summary 3 Philosophy of the talks In

More information

NovoalignMPI User Guide

NovoalignMPI User Guide MPI User Guide MPI is a messaging passing version of that allows the alignment process to be spread across multiple servers in a cluster or other network of computers 1. Multiple servers can be used to

More information

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Introduction to MPI May 20, 2013 Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Top500.org PERFORMANCE DEVELOPMENT 1 Eflop/s 162 Pflop/s PROJECTED 100 Pflop/s

More information

Beginner's Guide for UK IBM systems

Beginner's Guide for UK IBM systems Beginner's Guide for UK IBM systems This document is intended to provide some basic guidelines for those who already had certain programming knowledge with high level computer languages (e.g. Fortran,

More information

GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks

GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks Presented By : Esthela Gallardo Ammar Ahmad Awan, Khaled Hamidouche, Akshay Venkatesh, Jonathan Perkins, Hari Subramoni,

More information

MPI Mechanic. December Provided by ClusterWorld for Jeff Squyres cw.squyres.com.

MPI Mechanic. December Provided by ClusterWorld for Jeff Squyres cw.squyres.com. December 2003 Provided by ClusterWorld for Jeff Squyres cw.squyres.com www.clusterworld.com Copyright 2004 ClusterWorld, All Rights Reserved For individual private use only. Not to be reproduced or distributed

More information

OPENSHMEM AS AN EFFECTIVE COMMUNICATION LAYER FOR PGAS MODELS

OPENSHMEM AS AN EFFECTIVE COMMUNICATION LAYER FOR PGAS MODELS OPENSHMEM AS AN EFFECTIVE COMMUNICATION LAYER FOR PGAS MODELS A Thesis Presented to the Faculty of the Department of Computer Science University of Houston In Partial Fulfillment of the Requirements for

More information

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI Sourav Chakraborty, Hari Subramoni, Jonathan Perkins, Ammar A. Awan and Dhabaleswar K. Panda Department of Computer Science and Engineering

More information

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters Designing High Performance Heterogeneous Broadcast for Streaming Applications on Clusters 1 Ching-Hsiang Chu, 1 Khaled Hamidouche, 1 Hari Subramoni, 1 Akshay Venkatesh, 2 Bracy Elton and 1 Dhabaleswar

More information

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Sayantan Sur, Matt Koop, Lei Chai Dhabaleswar K. Panda Network Based Computing Lab, The Ohio State

More information

Solution of Exercise Sheet 2

Solution of Exercise Sheet 2 Solution of Exercise Sheet 2 Exercise 1 (Cluster Computing) 1. Give a short definition of Cluster Computing. Clustering is parallel computing on systems with distributed memory. 2. What is a Cluster of

More information

Advanced Job Launching. mapping applications to hardware

Advanced Job Launching. mapping applications to hardware Advanced Job Launching mapping applications to hardware A Quick Recap - Glossary of terms Hardware This terminology is used to cover hardware from multiple vendors Socket The hardware you can touch and

More information

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations S. Narravula, A. Mamidala, A. Vishnu, K. Vaidyanathan, and D. K. Panda Presented by Lei Chai Network Based

More information

MIC Lab Parallel Computing on Stampede

MIC Lab Parallel Computing on Stampede MIC Lab Parallel Computing on Stampede Aaron Birkland and Steve Lantz Cornell Center for Advanced Computing June 11 & 18, 2013 1 Interactive Launching This exercise will walk through interactively launching

More information

MPICH User s Guide Version Mathematics and Computer Science Division Argonne National Laboratory

MPICH User s Guide Version Mathematics and Computer Science Division Argonne National Laboratory MPICH User s Guide Version 3.1.4 Mathematics and Computer Science Division Argonne National Laboratory Pavan Balaji Wesley Bland William Gropp Rob Latham Huiwei Lu Antonio J. Peña Ken Raffenetti Sangmin

More information

Mellanox GPUDirect RDMA User Manual

Mellanox GPUDirect RDMA User Manual Mellanox GPUDirect RDMA User Manual Rev 1.2 www.mellanox.com NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ( PRODUCT(S) ) AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES AS-IS

More information

SHMEM/PGAS Developer Community Feedback for OFI Working Group

SHMEM/PGAS Developer Community Feedback for OFI Working Group SHMEM/PGAS Developer Community Feedback for OFI Working Group Howard Pritchard, Cray Inc. Outline Brief SHMEM and PGAS intro Feedback from developers concerning API requirements Endpoint considerations

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

Unified Parallel C, UPC

Unified Parallel C, UPC Unified Parallel C, UPC Jarmo Rantakokko Parallel Programming Models MPI Pthreads OpenMP UPC Different w.r.t. Performance/Portability/Productivity 1 Partitioned Global Address Space, PGAS Thread 0 Thread

More information

Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand

Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand Matthew Koop, Wei Huang, Ahbinav Vishnu, Dhabaleswar K. Panda Network-Based Computing Laboratory Department of

More information

Blue Waters Programming Environment

Blue Waters Programming Environment December 3, 2013 Blue Waters Programming Environment Blue Waters User Workshop December 3, 2013 Science and Engineering Applications Support Documentation on Portal 2 All of this information is Available

More information