Optimization and Tuning Collectives in MVAPICH2
1 Optimization and Tuning Collectives in MVAPICH2. MVAPICH2 User Group (MUG) Meeting, by Hari Subramoni, The Ohio State University. E-mail: subramon@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~subramon
2 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
3 MPI Collective Operations: The MPI Standard defines collective communication primitives to implement group communication operations. Collectives such as MPI_Bcast and MPI_Allreduce are commonly used across parallel applications, owing to their ease of use and performance portability. Processor and network architecture is constantly evolving (multi-core/many-core architectures, InfiniBand FDR, etc.). Critical to design new algorithms to implement and optimize collective operations on emerging systems.
4 MPI Collective Operations: Performance of collective operations is sensitive to various factors: message length; number of processes in the communicator; processor architecture (number of CPU sockets per node, number of cores per CPU socket); network technology (InfiniBand DDR/QDR/FDR, Single-Rail, Multi-Rail). Important to carefully tune various designs to implement collective operations efficiently.
5 Conventional Collective Communication Algorithms: Basic collective communication algorithms: Binomial Tree (MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Gather); Recursive Doubling (MPI_Allreduce, MPI_Allgather); Ring Exchange (MPI_Allgather); Bruck Algorithm, Pairwise Exchange (MPI_Alltoall). Combining basic algorithms to implement more complex schemes: Scatter-Allgather (MPI_Bcast); Reduce-Scatter-Allgather (MPI_Allreduce); Reduce-Scatter-Gather (MPI_Reduce).
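To make the binomial-tree idea concrete, here is a minimal sketch of a root-0 broadcast built from point-to-point calls. It is illustrative only (the function name binomial_bcast is an assumption, and it handles a single contiguous message); it is not MVAPICH2's actual implementation.

#include <mpi.h>

/* Binomial-tree broadcast from rank 0: in round k, every rank that already
 * holds the data sends it to the rank 2^k positions away, so the set of
 * ranks holding the data doubles each round (ceil(log2(P)) rounds total). */
void binomial_bcast(void *buf, int count, MPI_Datatype dtype, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank < mask) {
            int dst = rank + mask;
            if (dst < size)
                MPI_Send(buf, count, dtype, dst, 0, comm);
        } else if (rank < (mask << 1)) {
            MPI_Recv(buf, count, dtype, rank - mask, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}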
6 Collective Communication Algorithms for Multicore Systems: Basic multi-core aware designs use a hierarchical communicator system: an intra-node communicator, which includes all processes that share the same address space (the lowest-ranked process is the node leader), and an inter-leader communicator. MPICH implements many collectives in a hierarchical manner; intra-node phases are implemented via simple point-to-point operations. Critical to design efficient schemes on emerging multi-/many-core systems.
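A minimal sketch of how such a two-level communicator hierarchy can be built with standard MPI-3 calls; the function and variable names are illustrative assumptions, and this is not MVAPICH2's internal code.

#include <mpi.h>

/* Build one communicator per node plus a communicator of node leaders. */
void build_hierarchy(MPI_Comm comm, MPI_Comm *node_comm, MPI_Comm *leader_comm)
{
    int world_rank, node_rank;
    MPI_Comm_rank(comm, &world_rank);

    /* Group together all processes that can share memory (same node). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, node_comm);
    MPI_Comm_rank(*node_comm, &node_rank);

    /* The lowest-ranked process on each node becomes the node leader;
     * only leaders obtain a valid inter-leader communicator. */
    MPI_Comm_split(comm, (node_rank == 0) ? 0 : MPI_UNDEFINED,
                   world_rank, leader_comm);
}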
7 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
8 Collective Communication in MVAPICH2: Collective algorithms in MV2 are either Conventional (Flat) or Multi-Core Aware Designs. Inter-node communication: Inter-Node (Pt-to-Pt) or Inter-Node (Hardware-Mcast). Intra-node communication: Intra-Node (Pt-to-Pt), Intra-Node (Shared-Memory), or Intra-Node (Kernel-Assisted). Run-time flags: all shared-memory based collectives: MV2_USE_SHMEM_COLL (Default: ON); hardware Mcast-based collectives: MV2_USE_MCAST (Default: OFF).
9 Collective Communication in MVAPICH2: MVAPICH2 relies on highly optimized and tuned shared-memory based optimizations for several important collectives: MPI_Bcast, MPI_Reduce, MPI_Allreduce, MPI_Barrier. MVAPICH2 also uses a combination of hierarchical designs to optimize MPI_Gather and MPI_Scatter. An optimized variant of recursive doubling is used to improve the latency of the MPI_Allgather collective. MVAPICH2 also relies on hardware multicast based designs to optimize MPI_Bcast and MPI_Scatter. Kernel-assisted mechanisms are also being used to optimize the performance of collectives, such as MPI_Allgather, in MVAPICH2.
10 OSU MicroBenchmarks for Collectives (executable / latency benchmark): osu_allgather (MPI_Allgather), osu_allgatherv (MPI_Allgatherv), osu_allreduce (MPI_Allreduce), osu_alltoall (MPI_Alltoall), osu_alltoallv (MPI_Alltoallv), osu_barrier (MPI_Barrier), osu_bcast (MPI_Bcast), osu_gather (MPI_Gather), osu_gatherv (MPI_Gatherv), osu_reduce (MPI_Reduce), osu_reduce_scatter (MPI_Reduce_scatter), osu_scatter (MPI_Scatter), osu_scatterv (MPI_Scatterv).
11 OSU MicroBenchmarks for Collectives: Running OMB collective benchmarks: mpirun_rsh -np # -hostfile hosts ./<benchmark name>. Reports the average time taken to complete the collective operation; latency is averaged across all processes, across 1,000 iterations. Additional run-time options: users may choose to view additional latency statistics, such as max and min latency, by enabling the -f option: mpirun_rsh -np # -hostfile hosts ./<benchmark name> -f. The -m option allows users to specify the maximum amount of memory and the message lengths used by the benchmark: mpirun_rsh -np # -hostfile hosts ./<benchmark name> -m # (useful to prevent seg-faults with large messages on systems with limited memory).
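As a concrete illustration of the command form above (the process count, hostfile name, and benchmark binary are placeholders, and exact option semantics may differ slightly across OMB versions), a full-statistics run might look like:

mpirun_rsh -np 1024 -hostfile hosts ./osu_allreduce -f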
12 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
13 Tuning Collective Operations in MVAPICH2: MVAPICH2 relies on several shared-memory-based, hardware-multicast-based, and kernel-assisted designs to optimize collectives. Critical to tune such designs to ensure low communication latency for collective operations across various message lengths, numbers of processes, and different systems. Collective designs are broadly classified by: - Communication algorithm (Binomial, Recursive-Doubling, etc.) - Communication mechanism (Shared-memory, LiMIC, Mcast, etc.)
14 Hardware Multicast-aware MPI_Bcast on TACC Stampede. [Charts: MPI_Bcast latency (us) vs. message size for small and large messages at 102,400 cores, and latency vs. number of nodes for fixed small message sizes, comparing the Default and Multicast designs.] MCAST-based designs improve the latency of MPI_Bcast by up to 85%. Use MV2_USE_MCAST=1 to enable the MCAST-based designs.
15 Enabling Hardware Multicast-aware Designs: Multicast is applicable to MPI_Bcast, MPI_Scatter, and MPI_Allreduce. Parameter: MV2_USE_MCAST - Description: Enables hardware Multicast features - Default: Disabled. The corresponding configure-time flag, --enable-mcast, is enabled by default.
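Assuming the library was configured with multicast support, a typical way to enable the MCAST-based designs for a single run follows the mpirun_rsh environment-variable convention used elsewhere in this document (the process count, hostfile, and benchmark are placeholders):

mpirun_rsh -np 1024 -hostfile hosts MV2_USE_MCAST=1 ./osu_bcast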
16 MPI_Scatter - Benefits of using Hardware-Mcast. [Charts: MPI_Scatter latency (usec) vs. message length for 512 and 1,024 processes, comparing Scatter-Default and Scatter-Mcast.] Enabling MCAST-based designs for MPI_Scatter improves small message latency by up to 75%. Use MV2_USE_MCAST=1 to enable the MCAST-based designs.
17 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
18 Effects of Shared-memory based communication: Shared-memory specific data structures are created to optimize collective communication. Tunable parameter, applicable to all MPI collectives (except the vector variants). For optimal performance this feature should be enabled. Parameter: MV2_USE_SHMEM_COLL=1 - Description: Enables shared memory optimizations for collectives - Default: Enabled.
19 Effects of Shared-memory based communication. [Chart: MPI_Bcast latency (us) vs. message size at 4,096 processes, comparing MV2_USE_SHMEM_COLL=0 and MV2_USE_SHMEM_COLL=1.] MV2_USE_SHMEM_COLL=1 activates the use of shared-memory-based collective optimizations. Latency of MPI_Bcast with 4,096 MPI processes improves by up to 75% through the optimized, tuned shared-memory-based designs.
20 Effects of Shared-memory based communication. [Chart: MPI_Allreduce latency (usec) vs. message length, comparing MV2_USE_SHMEM_COLL=0 and MV2_USE_SHMEM_COLL=1.] MV2_USE_SHMEM_COLL=1 activates the use of shared-memory-based collective optimizations. Latency of MPI_Allreduce improves by up to 66% through the optimized, tuned shared-memory-based designs.
21 Effects of Shared-memory based communication. [Chart: MPI_Gather latency (usec) vs. message length, comparing MV2_USE_SHMEM_COLL=0 and MV2_USE_SHMEM_COLL=1.] MV2_USE_SHMEM_COLL=1 activates the use of shared-memory-based collective optimizations. Latency of MPI_Gather improves by up to 85% through the optimized, tuned shared-memory-based designs.
22 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
23 Tuning the performance of MPI_Allgather: MV2_ALLGATHER_REVERSE_RANKING=1 enables optimizations for MPI_Allgather with small and medium length messages. Latency of MPI_Allgather with 2,048 MPI processes improves by up to 58% through the optimized and tuned designs. Parameter: MV2_ALLGATHER_REVERSE_RANKING=1 - Description: Enables the allgather reverse-ranking optimization.
24 Tuning the performance of MPI_Allgather. [Chart: MPI_Allgather latency (us) vs. message size at 2,048 processes, comparing MV2_ALLGATHER_REVERSE_RANKING=0 and MV2_ALLGATHER_REVERSE_RANKING=1.] MV2_ALLGATHER_REVERSE_RANKING=1 enables optimizations for MPI_Allgather with small and medium length messages. Latency of MPI_Allgather with 2,048 MPI processes improves by up to 58% through the optimized and tuned designs.
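A hedged example of enabling this optimization for an Allgather benchmark run, again using the mpirun_rsh environment-variable convention (the process count and hostfile are placeholders):

mpirun_rsh -np 2048 -hostfile hosts MV2_ALLGATHER_REVERSE_RANKING=1 ./osu_allgather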
25 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
26 Collective Tuning Using LiMIC2: The LiMIC2 module enables MVAPICH2 to use kernel-assisted data transfers. MPI jobs often run in fully subscribed mode and can involve significant intra-node communication; such patterns are good indicators for enabling LiMIC2 to speed up intra-node transfers. Collectives such as MPI_Allgather benefit from the use of LiMIC2. Parameter: MV2_SMP_USE_LIMIC2 - Description: Enables LiMIC2-based kernel-assisted transfers for collectives - Default: Enabled (when built with LiMIC2). Configure flag: --with-limic2 (disabled by default).
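A hedged sketch of building with LiMIC2 support and running a benchmark with it explicitly enabled; the install path, process count, and hostfile are placeholders:

./configure --with-limic2=/path/to/limic2-install && make && make install
mpirun_rsh -np 512 -hostfile hosts MV2_SMP_USE_LIMIC2=1 ./osu_allgather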
27 Collective Tuning Using LiMIC2 - Allgather Case Study. [Chart: MPI_Allgather latency (us) vs. message size from 64K to 1M bytes, comparing Shmem-Gather and LiMIC2-Gather.] MPI_Allgather relies on the ring-exchange pattern for large messages; critical to optimize the intra-node phases of the ring exchange on multi-core systems. Zero-copy LiMIC transfers improve performance by up to 20% for large-message MPI_Allgather operations.
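For reference, a minimal sketch of the ring-exchange Allgather pattern itself (the function name ring_allgather is an assumption; this is an illustration, not MVAPICH2's tuned implementation). Each rank forwards the block it most recently received to its right neighbor for P-1 steps.

#include <mpi.h>
#include <string.h>

/* Ring allgather over fixed-size byte blocks: rank r starts with its own
 * block and, after size-1 sendrecv steps, holds all blocks in recvbuf. */
void ring_allgather(const void *sendbuf, void *recvbuf, int block_bytes,
                    MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    char *rbuf = (char *)recvbuf;
    memcpy(rbuf + rank * block_bytes, sendbuf, block_bytes);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    for (int step = 0; step < size - 1; step++) {
        int send_block = (rank - step + size) % size;      /* block just obtained */
        int recv_block = (rank - step - 1 + size) % size;  /* block arriving now   */
        MPI_Sendrecv(rbuf + send_block * block_bytes, block_bytes, MPI_BYTE,
                     right, 0,
                     rbuf + recv_block * block_bytes, block_bytes, MPI_BYTE,
                     left, 0, comm, MPI_STATUS_IGNORE);
    }
}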
28 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
29 Improved Bcast in MVAPICH2-2.0a. [Charts: MPI_Bcast latency (us) vs. message size for 1,024-process and 2,048-process runs, comparing MVAPICH2-1.9 and MVAPICH2-2.0a.] Average of 25% improvement for 1,024 processes and an average of 22% improvement for 2,048 processes.
30 Non-Blocking Collectives in MVAPICH2: MVAPICH2 supports the MPI-3 non-blocking collective communication and neighborhood collective communication primitives (MVAPICH2 1.9 and MVAPICH2 2.0a). MPI-3 collectives in MVAPICH2 can use either the Gen2 or the Nemesis interface over InfiniBand. MVAPICH2 implements non-blocking collectives either in a multi-core-aware hierarchical manner or via a basic flat approach. Application developers can use MPI-3 collectives to achieve computation/communication overlap. Upcoming releases of MVAPICH2 will include support for non-blocking collectives based on the Mellanox CORE-Direct interface.
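A minimal sketch of the computation/communication overlap pattern with an MPI-3 non-blocking broadcast; do_independent_work is a placeholder for application computation that does not touch the broadcast buffer.

#include <mpi.h>

extern void do_independent_work(void);  /* placeholder for application compute */

void bcast_with_overlap(double *data, int count, MPI_Comm comm)
{
    MPI_Request req;

    /* Start the broadcast from rank 0 without waiting for completion. */
    MPI_Ibcast(data, count, MPI_DOUBLE, 0, comm, &req);

    /* Overlap: do useful work that does not read or write 'data'. */
    do_independent_work();

    /* Complete the collective before using the broadcast buffer. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}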
31 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
32 Summary: Critical to optimize and tune MPI collective operations on emerging multi-core systems and high-speed networks. Necessary to tune various collective designs to ensure low communication latency on modern commodity systems. MVAPICH2 relies on several optimized and tuned designs to deliver low communication latency for collectives. MVAPICH2 has been tuned across several systems, considering various processor and network architectures.
33 Web Pointers: NOWLAB Web Page: http://nowlab.cse.ohio-state.edu; MVAPICH Web Page: http://mvapich.cse.ohio-state.edu