Optimization and Tuning of Collectives in MVAPICH2

1 Optimization and Tuning of Collectives in MVAPICH2
MVAPICH2 User Group (MUG) Meeting
by Hari Subramoni
The Ohio State University
E-mail: subramon@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~subramon

2 Outline
Introduction
Collective Communication in MVAPICH2
Tuning Collective communication operations in MVAPICH2
- Benefits of using Hardware Multicast
- Effects of Shared-memory based communication
- Tuning the performance of MPI_Allgather
- Benefits of using kernel-assisted schemes for collectives
Improved Bcast in MVAPICH2-2.0a
Non-blocking collectives in MVAPICH2
Summary

3 MPI Collective Operations
- The MPI standard defines collective communication primitives to implement group communication operations
- Collectives such as MPI_Bcast and MPI_Allreduce are commonly used across parallel applications, owing to their ease of use and performance portability
- Processor and network architecture is constantly evolving: multi-core/many-core architectures, InfiniBand FDR, etc.
- Critical to design new algorithms to implement and optimize collective operations on emerging systems
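
As a concrete illustration (an example added for this transcript, not from the original slides), a minimal MPI program that uses two of the collectives named above could look like this:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, data = 0;
        double local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Root (rank 0) broadcasts a value to every process */
        if (rank == 0) data = 42;
        MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Every process contributes to a global sum; all ranks receive it */
        local = (double)rank;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("data = %d, sum of ranks = %.0f\n", data, global);
        MPI_Finalize();
        return 0;
    }

The same source runs unchanged on any MPI library; the algorithms discussed on the following slides determine how fast these calls complete.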

4 MPI Collective Operations
Performance of collective operations is sensitive to various factors:
- Message length
- Number of processes in the communicator
- Processor architecture: number of CPU sockets per node, number of cores per CPU socket
- Network technology: InfiniBand DDR/QDR/FDR, Single-Rail / Multi-Rail
Important to carefully tune various designs to implement collective operations efficiently

5 Conventional Collective Communication Algorithms
Basic collective communication algorithms (a sketch of the binomial tree follows after this slide):
- Binomial Tree (MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Gather)
- Recursive Doubling (MPI_Allreduce, MPI_Allgather)
- Ring Exchange (MPI_Allgather)
- Bruck Algorithm, Pairwise Exchange (MPI_Alltoall)
Combining basic algorithms to implement more complex schemes:
- Scatter-Allgather (MPI_Bcast)
- Reduce-Scatter-Allgather (MPI_Allreduce)
- Reduce-Scatter-Gather (MPI_Reduce)
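
To make the binomial-tree idea concrete, here is a minimal point-to-point sketch of a binomial-tree broadcast from rank 0 (an illustration added for this transcript, not MVAPICH2's actual implementation):

    #include <mpi.h>

    /* Binomial-tree broadcast, root fixed at rank 0 (illustrative sketch). */
    void binomial_bcast(void *buf, int count, MPI_Datatype dt, MPI_Comm comm)
    {
        int rank, size, mask, src, dst;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Phase 1: receive from the parent. A process whose lowest set bit
         * is at position k receives from the rank that differs in that bit. */
        mask = 1;
        while (mask < size) {
            if (rank & mask) {
                src = rank - mask;   /* parent in the tree */
                MPI_Recv(buf, count, dt, src, 0, comm, MPI_STATUS_IGNORE);
                break;
            }
            mask <<= 1;
        }

        /* Phase 2: forward to children at decreasing bit positions. */
        mask >>= 1;
        while (mask > 0) {
            dst = rank + mask;
            if (dst < size)
                MPI_Send(buf, count, dt, dst, 0, comm);
            mask >>= 1;
        }
    }

Each process receives once from its parent and then forwards down the tree, so the broadcast completes in ceil(log2(p)) communication steps.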

6 Collective Communication Algorithms for Multicore Systems
Basic multi-core aware designs: a hierarchical communicator system
- Intra-node communicator: includes all processes that share the same address space; the lowest-rank process is the node leader
- Inter-leader communicator
MPICH implements many collectives in a hierarchical manner; intra-node phases are implemented via simple point-to-point operations
Critical to design efficient schemes on emerging multi-/many-core systems (see the sketch below)
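
The hierarchical communicator system can be sketched with standard MPI-3 calls (illustrative only; MVAPICH2 builds the equivalent structures internally, and MPI_COMM_TYPE_SHARED groups processes by shared-memory domain, i.e., by node):

    #include <mpi.h>

    /* Build an intra-node communicator and an inter-leader communicator.
     * Illustrative sketch; MVAPICH2 constructs equivalents internally. */
    void build_hierarchy(MPI_Comm comm, MPI_Comm *intra, MPI_Comm *leaders)
    {
        int world_rank, node_rank;
        MPI_Comm_rank(comm, &world_rank);

        /* All processes that share the same node form one communicator */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                            MPI_INFO_NULL, intra);
        MPI_Comm_rank(*intra, &node_rank);

        /* The lowest rank on each node (the node leader) joins the
         * inter-leader communicator; everyone else gets MPI_COMM_NULL. */
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                       world_rank, leaders);
    }

A hierarchical MPI_Bcast would then broadcast across the leaders communicator first and within each intra-node communicator second.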

7 Outline (repeats slide 2; current section: Collective Communication in MVAPICH2)

8 Collective Communication in MVAPICH2
Collective algorithms in MV2:
- Conventional (Flat)
- Multi-Core Aware Designs
  - Inter-node communication: Inter-Node (Pt-to-Pt), Inter-Node (Hardware-Mcast)
  - Intra-node communication: Intra-Node (Pt-to-Pt), Intra-Node (Shared-Memory), Intra-Node (Kernel-Assisted)
Run-time flags (usage examples below):
- All shared-memory based collectives: MV2_USE_SHMEM_COLL (Default: ON)
- Hardware Mcast-based collectives: MV2_USE_MCAST (Default: OFF)
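
For example (following the mpirun_rsh syntax shown later in this talk; the process count, hostfile, and executable names are placeholders), these flags are set on the command line:

    # Enable hardware-multicast-based collectives (off by default)
    mpirun_rsh -np 128 -hostfile hosts MV2_USE_MCAST=1 ./a.out

    # Disable shared-memory-based collectives (on by default)
    mpirun_rsh -np 128 -hostfile hosts MV2_USE_SHMEM_COLL=0 ./a.out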

9 Collective Communication in MVAPICH2
- MVAPICH2 relies on highly optimized and tuned shared-memory based optimizations for several important collectives: MPI_Bcast, MPI_Reduce, MPI_Allreduce, MPI_Barrier
- MVAPICH2 also uses a combination of hierarchical designs to optimize MPI_Gather and MPI_Scatter
- An optimized variant of recursive doubling is used to improve the latency of the MPI_Allgather collective
- MVAPICH2 also relies on hardware-multicast-based designs to optimize MPI_Bcast and MPI_Scatter
- Kernel-assisted mechanisms are also used to optimize the performance of collectives, such as MPI_Allgather, in MVAPICH2

10 OSU MicroBenchmarks for Collectives
Executable          Latency Benchmark
osu_allgather       MPI_Allgather
osu_allgatherv      MPI_Allgatherv
osu_allreduce       MPI_Allreduce
osu_alltoall        MPI_Alltoall
osu_alltoallv       MPI_Alltoallv
osu_barrier         MPI_Barrier
osu_bcast           MPI_Bcast
osu_gather          MPI_Gather
osu_gatherv         MPI_Gatherv
osu_reduce          MPI_Reduce
osu_reduce_scatter  MPI_Reduce_scatter
osu_scatter         MPI_Scatter
osu_scatterv        MPI_Scatterv

11 OSU MicroBenchmarks for Collectives
Running OMB collective benchmarks:
mpirun_rsh -np # -hostfile hosts ./<benchmark name>
- Reports the average time taken to complete the collective operation
- Latency is averaged across all processes, across 1,000 iterations
Additional run-time options:
- Users may choose to view additional latency statistics, such as max and min latency, by enabling the -f option:
mpirun_rsh -np # -hostfile hosts ./<benchmark name> -f
- The -m option allows users to specify the maximum amount of memory and the message lengths used by the benchmark:
mpirun_rsh -np # -hostfile hosts ./<benchmark name> -m #
(Useful to prevent seg-faults with large messages on systems with limited memory)

12 Outline (repeats slide 2; current section: Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast)

13 Tuning Collective Operations in MVAPICH2
- MVAPICH2 relies on several shared-memory-based, hardware-multicast-based, and kernel-assisted designs to optimize collectives
- Critical to tune such designs to ensure low communication latency for collective operations across various message lengths, process counts, and systems
- Collective designs are broadly classified by:
  - Communication algorithm (Binomial, Recursive-Doubling, etc.)
  - Communication mechanism (Shared-memory, LiMIC, Mcast, etc.)

14 Hardware Multicast-aware MPI_Bcast on TACC Stampede
[Figures: MPI_Bcast latency (us) vs. message size (bytes) for small and large messages at 102,400 cores, and latency for fixed byte- and KByte-sized messages vs. number of nodes; Default vs. Multicast]
- MCAST-based designs improve the latency of MPI_Bcast by up to 85%
- Use MV2_USE_MCAST=1 to enable MCAST-based designs

15 Enabling Hardware Multicast-aware Designs
Multicast is applicable to MPI_Bcast, MPI_Scatter, and MPI_Allreduce

Parameter        Description                              Default
--enable-mcast   Configure flag to enable hardware        Enabled
                 multicast features
MV2_USE_MCAST=1  Enables hardware multicast features      Disabled
                 at run time

16 MPI_Scatter - Benefits of using Hardware Mcast
[Figures: MPI_Scatter latency (usec) vs. message length (bytes) at 512 and 1,024 processes, Scatter-Default vs. Scatter-Mcast]
- Enabling MCAST-based designs for MPI_Scatter improves small-message latency by up to 75%
- Use MV2_USE_MCAST=1 to enable MCAST-based designs

17 Outline (repeats slide 2; current section: Tuning Collective communication operations in MVAPICH2 - Effects of Shared-memory based communication)

18 Effects of Shared-memory based communication
- Shared-memory specific data structures are created to optimize collective communication
- Tunable parameter, applicable to all MPI collectives (except the vector variants)
- For optimal performance this feature should be enabled

Parameter             Description                             Default
MV2_USE_SHMEM_COLL=1  Enables shared-memory optimizations     Enabled
                      for collectives

19 Effects of Shared-memory based communication
[Figure: MPI_Bcast latency (us) vs. message size (bytes) at 4,096 processes, MV2_USE_SHMEM_COLL=0 vs. MV2_USE_SHMEM_COLL=1]
- MV2_USE_SHMEM_COLL=1 activates the use of shared-memory-based collective optimizations
- Latency of MPI_Bcast with 4,096 MPI processes improves by up to 75% through optimized, tuned shared-memory-based designs

20 Effects of Shared-memory based communication
[Figure: MPI_Allreduce latency (usec) vs. message length (bytes), MV2_USE_SHMEM_COLL=0 vs. MV2_USE_SHMEM_COLL=1]
- MV2_USE_SHMEM_COLL=1 activates the use of shared-memory-based collective optimizations
- Latency of MPI_Allreduce improves by up to 66% through optimized, tuned shared-memory-based designs

21 Effects of Shared-memory based communication
[Figure: MPI_Gather latency (usec) vs. message length (bytes), MV2_USE_SHMEM_COLL=0 vs. MV2_USE_SHMEM_COLL=1]
- MV2_USE_SHMEM_COLL=1 activates the use of shared-memory-based collective optimizations
- Latency of MPI_Gather improves by up to 85% through optimized, tuned shared-memory-based designs

22 Outline (repeats slide 2; current section: Tuning Collective communication operations in MVAPICH2 - Tuning the performance of MPI_Allgather)

23 Tuning the performance of MPI_Allgather
- MV2_ALLGATHER_REVERSE_RANKING=1 enables optimizations for MPI_Allgather with small and medium length messages
- Latency of MPI_Allgather with 2,048 MPI processes improves by up to 58% through optimized and tuned designs

Parameter                        Description
MV2_ALLGATHER_REVERSE_RANKING=1  Enables the allgather reverse-ranking optimization

24 Tuning the performance of MPI_Allgather
[Figure: MPI_Allgather latency (us) vs. message size (bytes) at 2,048 processes, MV2_ALLGATHER_REVERSE_RANKING=0 vs. MV2_ALLGATHER_REVERSE_RANKING=1; up to 58% improvement]
- MV2_ALLGATHER_REVERSE_RANKING=1 enables optimizations for MPI_Allgather with small and medium length messages
- Latency of MPI_Allgather with 2,048 MPI processes improves by up to 58% through optimized and tuned designs (an example invocation follows below)
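
For example (a sketch; the process count and hostfile are placeholders, and osu_allgather is the OMB benchmark listed on slide 10):

    mpirun_rsh -np 2048 -hostfile hosts MV2_ALLGATHER_REVERSE_RANKING=1 ./osu_allgather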

25 Outline (repeats slide 2; current section: Tuning Collective communication operations in MVAPICH2 - Benefits of using kernel-assisted schemes for collectives)

26 Collective Tuning Using LiMIC2
- The LiMIC2 module enables MVAPICH2 to use kernel-assisted data transfers
- MPI jobs often run in fully subscribed mode and can involve significant intra-node communication
- Such patterns are good indicators for enabling LiMIC2 to speed up intra-node transfers
- Collectives such as MPI_Allgather benefit from the use of LiMIC2 (usage sketch below)

Parameter           Description                                    Default
MV2_SMP_USE_LIMIC2  Enables LiMIC2-based kernel-assisted           Enabled
                    transfers at run time
--with-limic2       Configure flag to build with LiMIC2 support    Disabled
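
To make the two knobs concrete (a sketch under assumptions: paths, process count, and hostfile are placeholders, and the exact configure arguments depend on where LiMIC2 is installed; the flag itself is the one named in the table above):

    # Build time: compile MVAPICH2 with LiMIC2 support
    ./configure --with-limic2
    make && make install

    # Run time: kernel-assisted transfers are controlled by MV2_SMP_USE_LIMIC2
    mpirun_rsh -np 16 -hostfile hosts MV2_SMP_USE_LIMIC2=1 ./osu_allgather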

27 Collective Tuning Using LiMIC2 - Allgather Case Study
[Figure: MPI_Allgather latency (us) vs. message size (32K to 1M bytes), Shmem-Gather vs. Limic2-Gather; ~20% improvement]
- MPI_Allgather relies on the ring-exchange pattern for large messages; it is critical to optimize the intra-node phases of the ring exchange on multi-core systems
- Zero-copy LiMIC2 transfers improve performance by up to 20% for large-message MPI_Allgather operations

28 Outline (repeats slide 2; current section: Improved Bcast in MVAPICH2-2.0a / Non-blocking collectives in MVAPICH2)

29 Improved Bcast in MVAPICH2-2.0a
[Figures: MPI_Bcast latency (us) vs. message size (bytes, up to 16K) at 1,024 and 2,048 processes, MVAPICH2-1.9 vs. MVAPICH2-2.0a]
Average of 25% improvement for 1,024 processes and an average of 22% improvement for 2,048 processes

30 Non-Blocking Collectives in MVAPICH2
- MVAPICH2 supports the MPI-3 non-blocking collective communication and neighborhood collective communication primitives (MVAPICH2 1.9 and MVAPICH2 2.0a)
- MPI-3 collectives in MVAPICH2 can use either the Gen2 or the Nemesis interface over InfiniBand
- MVAPICH2 implements non-blocking collectives either in a multi-core-aware hierarchical manner or via a basic flat approach
- Application developers can use MPI-3 collectives to achieve computation/communication overlap (see the sketch below)
- Upcoming releases of MVAPICH2 will include support for non-blocking collectives based on the Mellanox CORE-Direct interface
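
To illustrate the overlap pattern (a minimal sketch added for this transcript; do_independent_work is a hypothetical placeholder for computation that does not touch the broadcast buffer):

    #include <mpi.h>

    extern void do_independent_work(void);  /* hypothetical compute phase */

    void overlapped_bcast(double *buf, int count, MPI_Comm comm)
    {
        MPI_Request req;

        /* Start the broadcast; the call returns immediately */
        MPI_Ibcast(buf, count, MPI_DOUBLE, 0, comm, &req);

        /* Overlap: compute while the collective progresses */
        do_independent_work();

        /* Complete the collective; buf is valid on all ranks afterwards */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }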

31 Outline (repeats slide 2; current section: Summary)

32 Summary
- Critical to optimize MPI collective operations on emerging multi-core systems and high-speed networks
- Necessary to tune the various collective designs to ensure low communication latency on modern commodity systems
- MVAPICH2 relies on several optimized and tuned designs to deliver low communication latency for collectives
- MVAPICH2 has been tuned across several systems, considering various processor and network architectures

33 Web Pointers
NOWLAB Web Page: http://nowlab.cse.ohio-state.edu
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu
