High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT


High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT

Krishna Kandalla (1), Hari Subramoni (1), Karen Tomko (2), Dmitry Pekurovsky (3), Sayantan Sur (1) and Dhabaleswar K. Panda (1)
(1) Computer Science & Engineering Department, The Ohio State University
(2) The Ohio Supercomputer Center
(3) San Diego Supercomputer Center

Outline
- Introduction
- Problem Statement
- Designing MPI_Ialltoall with Collective Offload
- Re-designing P3DFFT for Overlap
- Experimental Evaluation
- Conclusions and Future Work

Introduction
- Parallel applications can scale beyond 100,000 cores
- InfiniBand is commonly used across commodity clusters
- Message Passing Interface (MPI) is the de facto programming model
- Example: the Tsubame supercomputer, with 73,278 cores

Collective Communication in MPI
- MPI-2.2 defines blocking collective operations, which limit the performance and scalability of dense operations such as Alltoall
- MPI-3 may support non-blocking collectives; Hoefler et al. proposed host-based approaches
- The latest ConnectX-2 adapters from Mellanox support network offload features
- Goal: study the benefits with a real scientific library, P3DFFT

Overview of InfiniBand Collective Offload
- Applications can offload task lists (e.g., Send, Send, Wait) to the NIC
- A CQE is created on the MCQ after the task list executes
- Problems:
  - Alltoall is extremely communication intensive
  - The size of a task list is limited, which directly affects Alltoall scalability
- [Figure: application task list mapped onto the InfiniBand HCA's send/receive queues and completion queues; Subramoni et al., Hot Interconnects 2010 (HotI '10)]

Outline
- Introduction
- Problem Statement
- Designing MPI_Ialltoall with Collective Offload
- Re-designing P3DFFT for Overlap
- Experimental Evaluation
- Conclusions and Future Work

Motivation
- [Figure: each rank (Rank 0, Rank 1, ..., Rank n) issues MPI_Ialltoall, computes, then calls MPI_Wait; can the computation be overlapped with the collective?]
- Challenge: progress the collective schedule asynchronously while providing:
  - Performance
  - Portability
  - Minimal host processor intervention
  - Acceptable communication latency
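The pattern in the figure can be written as a minimal MPI sketch. The buffer sizes and compute_independent_work() are placeholders, not taken from the slides; only the MPI_Ialltoall / compute / MPI_Wait structure comes from the slide.

/* Minimal sketch of the overlap pattern: start a non-blocking all-to-all,
 * do independent computation while it progresses, then wait for completion. */
#include <mpi.h>
#include <stdlib.h>

static void compute_independent_work(double *x, int n)
{
    /* Placeholder computation that does not touch the message buffers. */
    for (int i = 0; i < n; i++)
        x[i] = x[i] * 1.000001 + 1.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int size, count = 1024;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc((size_t)size * count * sizeof(int));
    int *recvbuf = malloc((size_t)size * count * sizeof(int));
    double *work = calloc(1 << 20, sizeof(double));
    for (int i = 0; i < size * count; i++)
        sendbuf[i] = i;

    MPI_Request req;
    MPI_Ialltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT,
                  MPI_COMM_WORLD, &req);
    compute_independent_work(work, 1 << 20);   /* overlapped with the collective */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}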

Design Space of Collective Algorithms
- [Figure: blocking collectives, host-based non-blocking collectives, and offload-based non-blocking collectives compared along three axes: overlap (higher is better), latency (lower is better), and portability (higher is better)]

Problem Statement
- Can we leverage network offload to design MPI_Ialltoall?
- Will network offload improve communication/computation overlap for collectives?
- Can network offload improve application throughput?
- Can we re-design scientific libraries (such as 3DFFT) to leverage the proposed MPI_Ialltoall?

Outline
- Introduction
- Problem Statement
- Designing MPI_Ialltoall with Collective Offload
- Re-designing P3DFFT for Overlap
- Experimental Evaluation
- Conclusions and Future Work

Creating Task Lists with the Trigger Operation
- A task list consists of multiple phases; each phase has send, wait, and trigger tasks
- The progress thread calls ibv_get_cq_event() and blocks
- The trigger task generates an interrupt, which signals the progress thread
- [Figure: application task list (send, wait, and trigger tasks) mapped onto the InfiniBand HCA's send/receive queues and completion queues]
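A rough sketch of the blocking progress-thread loop described above, using only standard libibverbs event calls. How each phase of the offloaded task list is actually posted depends on the vendor's offload (CORE-Direct) extensions, so post_next_phase() is left as a hypothetical placeholder.

/* Sketch (assumptions noted): block in ibv_get_cq_event() until the trigger
 * task on the HCA raises an interrupt, drain completions, then hand the next
 * phase of the task list to the HCA via a vendor-specific call. */
#include <infiniband/verbs.h>
#include <stdio.h>

typedef void (*post_next_phase_fn)(void);   /* hypothetical vendor offload call */

static void progress_phases(struct ibv_comp_channel *channel,
                            post_next_phase_fn post_next_phase,
                            int num_phases)
{
    for (int phase = 0; phase < num_phases; phase++) {
        struct ibv_cq *ev_cq;
        void *ev_ctx;

        /* Block until the trigger task generates an interrupt. */
        if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))
            return;
        ibv_ack_cq_events(ev_cq, 1);

        /* Re-arm completion notification for the next trigger. */
        ibv_req_notify_cq(ev_cq, 0);

        /* Drain completions generated by this phase. */
        struct ibv_wc wc;
        while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
            if (wc.status != IBV_WC_SUCCESS)
                fprintf(stderr, "work completion failed: %d\n", (int)wc.status);
        }

        /* Post the next phase of the offloaded task list (vendor-specific). */
        post_next_phase();
    }
}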

Designing Scalable Offload Alltoall
- Application thread: MPI_Init; MPI_Ialltoall creates the task list, offloads it, and returns immediately; compute; MPI_Wait
- Progress thread: posts the task list and blocks in ibv_get_cq_event(); on each trigger it wakes up and posts the next part of the list, until the end of the list marks the Alltoall as complete
- [Figure: timeline of the interaction between the application thread and the offload progress thread]

Outline
- Introduction
- Problem Statement
- Designing MPI_Ialltoall with Collective Offload
- Re-designing P3DFFT for Overlap
- Experimental Evaluation
- Conclusions and Future Work

Parallel 3DFFT Library
- Applications in areas such as turbulence simulation and astrophysics rely heavily on 3DFFT
- P3DFFT, from the San Diego Supercomputer Center (SDSC), is a portable, high-performance implementation of 3DFFT (http://code.google.com/p/p3dfft/)
- P3DFFT uses a 2D pencil decomposition to maximize parallelism
- P3DFFT relies on expensive large-message Alltoall operations to implement the transpose steps
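To illustrate why the transpose steps map onto Alltoall, here is a minimal sketch (not P3DFFT's actual code) of a distributed transpose: each rank packs one block per destination, exchanges the blocks with MPI_Alltoall, and unpacks the received blocks. The block-row layout and buffer sizes are assumptions made for the example.

/* Transpose an N x N matrix distributed by block rows (N = nloc * nprocs)
 * into a block-row distribution of its transpose, using MPI_Alltoall. */
#include <mpi.h>
#include <stdlib.h>

void transpose_alltoall(const double *a, double *at, int nloc, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);
    const int n = nloc * nprocs, blk = nloc * nloc;

    double *sendbuf = malloc((size_t)nprocs * blk * sizeof(double));
    double *recvbuf = malloc((size_t)nprocs * blk * sizeof(double));

    /* Pack: the nloc x nloc block destined for rank c becomes contiguous. */
    for (int c = 0; c < nprocs; c++)
        for (int i = 0; i < nloc; i++)
            for (int j = 0; j < nloc; j++)
                sendbuf[c * blk + i * nloc + j] = a[i * n + c * nloc + j];

    MPI_Alltoall(sendbuf, blk, MPI_DOUBLE, recvbuf, blk, MPI_DOUBLE, comm);

    /* Unpack: the block received from rank s is transposed into place. */
    for (int s = 0; s < nprocs; s++)
        for (int i = 0; i < nloc; i++)
            for (int j = 0; j < nloc; j++)
                at[j * n + s * nloc + i] = recvbuf[s * blk + i * nloc + j];

    free(sendbuf);
    free(recvbuf);
}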

Re-designing P3DFFT for Overlap
- Split the local data into blocks V1, V2, V3, ...
- Compute the 1D FFT along one dimension of a block, then start its A-B transpose
- While a block's transpose is in flight, compute the 1D FFT of the next block; the intra-node and inter-node exchanges form two parallel transpose operations
- [Figure: pipeline of 1D FFT and A-B transpose stages across blocks V1, V2, ... over the X, Y, and Z pencils]
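A hedged sketch of the block-wise pipeline implied by this slide (the structure and the helpers fft_1d_block, pack_block, and unpack_block are assumptions, not P3DFFT code): while block i's transpose is in flight as an MPI_Ialltoall, the 1D FFT of block i+1 is computed.

/* Pipelined transpose sketch: overlap each block's Ialltoall with the
 * FFT/packing of the next block. Helper functions are placeholders that
 * the application would provide. */
#include <mpi.h>

#define NBLOCKS 4

void fft_1d_block(double *block);                        /* placeholder */
void pack_block(const double *block, double *sendbuf);   /* placeholder */
void unpack_block(const double *recvbuf, double *block); /* placeholder */

void pipelined_transpose(double *blocks[NBLOCKS],
                         double *sendbufs[NBLOCKS], double *recvbufs[NBLOCKS],
                         int count, MPI_Comm comm)
{
    MPI_Request req[NBLOCKS];

    /* Prime the pipeline: FFT block 0 and start its transpose. */
    fft_1d_block(blocks[0]);
    pack_block(blocks[0], sendbufs[0]);
    MPI_Ialltoall(sendbufs[0], count, MPI_DOUBLE,
                  recvbufs[0], count, MPI_DOUBLE, comm, &req[0]);

    for (int i = 1; i < NBLOCKS; i++) {
        /* Compute on block i while block i-1 is being exchanged. */
        fft_1d_block(blocks[i]);
        pack_block(blocks[i], sendbufs[i]);
        MPI_Ialltoall(sendbufs[i], count, MPI_DOUBLE,
                      recvbufs[i], count, MPI_DOUBLE, comm, &req[i]);

        MPI_Wait(&req[i - 1], MPI_STATUS_IGNORE);
        unpack_block(recvbufs[i - 1], blocks[i - 1]);
    }
    MPI_Wait(&req[NBLOCKS - 1], MPI_STATUS_IGNORE);
    unpack_block(recvbufs[NBLOCKS - 1], blocks[NBLOCKS - 1]);
}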

Outline
- Introduction
- Problem Statement
- Designing MPI_Ialltoall with Collective Offload
- Re-designing P3DFFT for Overlap
- Experimental Evaluation
- Conclusions and Future Work

Experimental Setup
- Intel Xeon E5640 (2.53 GHz), 12 GB memory per node
- MT26428 QDR ConnectX-2 HCAs with PCI Express interfaces, 171-port Mellanox QDR switch, OFED 1.5.1
- RHEL 5.4, kernel 2.6.18-164.el5
- MVAPICH2 (v1.6): a high-performance MPI implementation over InfiniBand and other RDMA networks
  - http://mvapich.cse.ohio-state.edu/
  - Used by more than 1,580 organizations worldwide

Micro-Benchmark Evaluations
- Overlap benchmark: measure the average MPI_Ialltoall latency, then measure the overlap percentage:
      start_overlap_timer()
      MPI_Ialltoall(..)
      while (time < alltoall_latency)
          update timer
      MPI_Wait(..)
      end_overlap_timer()
- Throughput benchmark:
      start_throughput_timer()
      MPI_Ialltoall(..)
      CBLAS_DGEMM()
      MPI_Wait(..)
      end_throughput_timer()
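A possible rendering of the throughput benchmark in C, assuming MPI_Wtime for timing, a fixed DGEMM size, and the usual 2*N^3 flop count; these details are assumptions, and only the Ialltoall / DGEMM / Wait structure comes from the slide.

/* Overlap a CBLAS DGEMM with MPI_Ialltoall and report the achieved rate. */
#include <mpi.h>
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 2000;                                  /* assumed DGEMM size */
    const int count = 128 * 1024 / sizeof(double);       /* 128 KB per peer    */
    double *A = malloc((size_t)N * N * sizeof(double));
    double *B = malloc((size_t)N * N * sizeof(double));
    double *C = calloc((size_t)N * N, sizeof(double));
    double *sbuf = malloc((size_t)nprocs * count * sizeof(double));
    double *rbuf = malloc((size_t)nprocs * count * sizeof(double));
    for (size_t i = 0; i < (size_t)N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    MPI_Request req;
    MPI_Ialltoall(sbuf, count, MPI_DOUBLE, rbuf, count, MPI_DOUBLE,
                  MPI_COMM_WORLD, &req);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, C, N);    /* overlapped compute */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("DGEMM rate with overlapped Ialltoall: %.2f GFLOPS\n",
               2.0 * N * N * N / (t1 - t0) / 1e9);

    free(A); free(B); free(C); free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}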

Communication/Computation Overlap
- [Figure: overlap percentage (%) vs. message length (16K to 1M bytes) for Alltoall-Offload, Alltoall-Host-Based-Test-10, Alltoall-Host-Based-Test-1000, and Alltoall-Host-Based-Test-5000; Alltoall overlap comparison with 256 processes]
- Alltoall-Offload delivers near-perfect communication/computation overlap for all message sizes, in a portable manner

DGEMM Throughput Comparison
- [Figure: throughput (GFLOPS) vs. CBLAS-DGEMM problem size (N) for Serial, Alltoall-Offload, and Host-Based runs, with the theoretical peak shown for reference]
- CBLAS-DGEMM overlapped with Offload-Ialltoall delivers up to 110% higher throughput than with Host-Based Ialltoall at 512 processes

Latency Comparison
- [Figure: Alltoall latency (msec) vs. message length (16K to 1M bytes) for Alltoall-Default-Host, Alltoall-Offload, Alltoall-Host-Based, and Alltoall-Host-Based-Thread; comparison with 256 processes]
- Alltoall-Offload delivers good overlap without sacrificing communication latency

Parallel 3D FFT Performance
- [Figure: P3DFFT application-kernel run-time (s) for Blocking, Host-Based-Test (H-Test), and Offload designs at problem sizes 512, 600, 720, and 800, with 128 processes; lower is better; improvements of 23% and 13% annotated]
- P3DFFT with Offload-Ialltoall performs about 13.5% better than default P3DFFT and about 12% better than P3DFFT with Host-Based-Test

Outline
- Introduction
- Problem Statement
- Designing MPI_Ialltoall with Collective Offload
- Re-designing P3DFFT for Overlap
- Experimental Evaluation
- Conclusions and Future Work

Conclusions and Future Work
- The proposed MPI_Ialltoall shows near-perfect (99%) overlap
- Application throughput improves significantly with offload-based non-blocking collectives
- P3DFFT's run-time improved by up to 23%
- Future work: extend offload-based techniques to other MPI collectives and study their benefits with real applications
- Support for offload-based collectives will be available in future MVAPICH2 releases

Thank you!
http://mvapich.cse.ohio-state.edu
(1) {kandalla, subramon, surs, panda}@cse.ohio-state.edu
(2) ktomko@osc.edu
(3) dmitry@sdsc.edu
(1) Network-Based Computing Laboratory, The Ohio State University
(2) The Ohio Supercomputer Center
(3) San Diego Supercomputer Center