Towards Transparent and Efficient GPU Communication on InfiniBand Clusters. Sadaf Alam, Jeffrey Poznanovic, Kristopher Howard, Hussein Nasser El-Harake


MPI and I/O from GPU vs. CPU
- Traditional CPU point of view: the InfiniBand adapter is a PCIe device (typically x4 lanes); GPU accelerators are also PCIe devices (traditionally x16 lanes)
- Traditionally the CPU is the workhorse for the HPC FLOP workload; the GPU is not in the picture
- A host memory copy is needed for incoming and outgoing network (NW) messages
- Today, the top Top500 system delivers its Linpack performance from GPUs
(Diagram: GPU and GPU memory, host and host memory, PCIe slots, IB network)
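To make the traditional data path concrete, the following is a minimal sketch (not from the slides) of how an application stages GPU data through pageable host memory before an MPI send; the function name, buffer names, and sizes are illustrative.

    /* Traditional path: stage GPU data through pageable host memory before sending.
       Minimal sketch; error checking omitted, names illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    void send_gpu_buffer_traditional(const void *d_buf, size_t nbytes, int dest, MPI_Comm comm)
    {
        void *h_buf = malloc(nbytes);                              /* pageable host staging buffer */
        cudaMemcpy(h_buf, d_buf, nbytes, cudaMemcpyDeviceToHost);  /* GPU -> host copy */
        MPI_Send(h_buf, (int)nbytes, MPI_BYTE, dest, 0, comm);     /* host -> IB network via the MPI library */
        free(h_buf);
    }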

GPUDirect
- Goal: make data transfer from the GPU to the communication network efficient
- Eliminate internal copying and overhead by the host CPU
- Potential benefits when transferring data over the IB interconnect
- Application requirement: MPI communication of CUDA pinned memory
(Diagram: a single pinned host memory region shared by the GPU and the IB adapter replaces the separate host memory copy for incoming and outgoing network messages)
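As a concrete illustration of the requirement above, the sketch below (not from the slides; names are illustrative) allocates the communication buffer with cudaMallocHost so that a single pinned region can be shared by the CUDA driver and the IB stack, which is what GPUDirect 1.0 enables.

    /* GPUDirect 1.0 usage pattern: communicate out of CUDA pinned host memory,
       so one pinned region serves both the CUDA copy and the IB transfer.
       Minimal sketch; error checking omitted. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void send_gpu_buffer_pinned(const void *d_buf, size_t nbytes, int dest, MPI_Comm comm)
    {
        char *pinned;
        cudaMallocHost((void **)&pinned, nbytes);                   /* single pinned region for CUDA and IB */
        cudaMemcpy(pinned, d_buf, nbytes, cudaMemcpyDeviceToHost);  /* GPU -> pinned host */
        MPI_Send(pinned, (int)nbytes, MPI_BYTE, dest, 0, comm);     /* sent directly from the same pinned region */
        cudaFreeHost(pinned);
    }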

Outline
- Setting up GPUDirect: prerequisites, step-by-step guide
- Setting up benchmarks: CUDA example, OpenCL example
- Results and analysis
- Exploiting GPUDirect with full-scale GPU-accelerated apps
- Future directions

Prerequisites
Hardware:
- x86 platform
- NVIDIA Tesla cards (Fermi generation)
- QDR IB adapters
- QDR switch
Software:
- OS: RHEL 5 Update 4 or later
- IB software: OFED 1.5.x with GPUDirect support
- MPI: MVAPICH version 1.5+
- NVIDIA Development Driver for Linux version 256.35
- NVIDIA CUDA Toolkit 3.2+ (incl. OpenCL driver)
Reference: AMBER 11 (PMEMD) with GPUDirect white paper

Setup
Kernel installation:
1. Install the required kernel RPM
2. Modify the boot loader configuration
3. Reboot the machine with the new kernel
IB driver installation (with updates, if applicable):
1. Mount the ISO files
2. Run the installation script (with the "all" option for Mellanox OFED)
3. Restart the driver
Reference: AMBER 11 (PMEMD) with GPUDirect white paper

Setup (contd.)
NVIDIA driver and SDK installation:
1. Install the NVIDIA Development Driver for x86_64 platforms
2. Load the NVIDIA driver to confirm the installation
3. Ensure the driver version is correct
4. Install the CUDA Toolkit and SDK (version 3.2 and above)
5. Run sample codes from the SDK to ensure the GPU devices are set up and accessible

Monitoring GPUDirect Activity
- How to verify whether GPUDirect activity is actually taking place
- Small message sizes may not use RDMA => no GPUDirect
- ib_core parameters to observe:
  - /sys/module/ib_core/parameters/gpu_direct_enable (as root, set 0 to disable, 1 to enable)
  - /sys/module/ib_core/parameters/gpu_direct_pages (values should change after GPUDirect activity)
  - /sys/module/ib_core/parameters/gpu_direct_shares (values should change after GPUDirect activity)
- Observe the parameters while running an application:
  watch -d -n5 cat /sys/module/ib_core/parameters/*
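For scripted checks, the counters listed above can also be sampled from inside a benchmark; the helper below is a minimal C sketch (not from the slides) that reads one of the sysfs parameters before and after a transfer.

    /* Sketch: sample a gpu_direct counter to confirm GPUDirect activity.
       Paths are those listed on the slide; error handling is omitted. */
    #include <stdio.h>

    static long read_counter(const char *path)
    {
        long value = -1;
        FILE *f = fopen(path, "r");
        if (f) { fscanf(f, "%ld", &value); fclose(f); }
        return value;
    }

    /* Illustrative usage:
       long before = read_counter("/sys/module/ib_core/parameters/gpu_direct_pages");
       ... run the MPI transfer ...
       long after  = read_counter("/sys/module/ib_core/parameters/gpu_direct_pages");
       if (after > before) printf("GPUDirect activity observed\n");  */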

Enabling and Disabling GPUDirect
Method 1: environment variables
- Enable GPUDirect:
  export MV2_RNDV_PROTOCOL=RPUT
  export MV2_USE_RDMA_ONE_SIDED=1
- Disable GPUDirect:
  export MV2_RNDV_PROTOCOL=R3
  export MV2_USE_RDMA_ONE_SIDED=0
Method 2: ib_core parameters
- Enable/disable GPUDirect (on each node, as root; 1 = enable, 0 = disable):
  echo 1 > /sys/module/ib_core/parameters/gpu_direct_enable

MPI and CUDA Micro-benchmark
Modified version of the OSU bandwidth micro-benchmark with GPUDirect:

    #ifdef PINNED
        if (myid == 0) printf("Using PINNED host memory!\n");
        cudaMallocHost((void **) &s_buf1, MYBUFSIZE);
        cudaMallocHost((void **) &r_buf1, MYBUFSIZE);
    #else
        if (myid == 0) printf("Using PAGEABLE host memory!\n");
        s_buf1 = (char *) malloc(MYBUFSIZE);
        r_buf1 = (char *) malloc(MYBUFSIZE);
    #endif
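The allocated buffers are then used in the usual OSU bandwidth pattern. The function below is a minimal sketch of that pattern for context; the window size, tags, and ack message are illustrative choices, not taken from the slides.

    /* Minimal OSU-style bandwidth loop over the s_buf1/r_buf1 buffers allocated above.
       Sketch only; error checking omitted. */
    #include <mpi.h>

    #define WINDOW 64

    static double measure_bandwidth(char *s_buf, char *r_buf, int size, int myid)
    {
        MPI_Request req[WINDOW];
        double t0 = MPI_Wtime();
        if (myid == 0) {
            for (int i = 0; i < WINDOW; i++)
                MPI_Isend(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, &req[i]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(r_buf, 1, MPI_CHAR, 1, 101, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* ack */
        } else {
            for (int i = 0; i < WINDOW; i++)
                MPI_Irecv(r_buf, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD, &req[i]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(s_buf, 1, MPI_CHAR, 0, 101, MPI_COMM_WORLD);                     /* ack */
        }
        return WINDOW * (double) size / (MPI_Wtime() - t0) / 1.0e6;   /* MB/s */
    }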

MPI and OpenCL Micro-benchmark
Modified version of the OSU bandwidth micro-benchmark with GPUDirect:

    #ifdef PINNED
        // Get platform and device information
        cl_platform_id platform_id = NULL;
        cl_device_id device_id = NULL;
        cl_uint ret_num_devices;
        cl_uint ret_num_platforms;
        cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
        ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &ret_num_devices);

        // Create an OpenCL context
        cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);

        // Create a command queue
        cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &ret);

        // Create memory buffers on the device for each vector
        cl_mem s_mem = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, MYBUFSIZE, NULL, &ret);
        cl_mem r_mem = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, MYBUFSIZE, NULL, &ret);

        // Map the buffers to obtain pinned host pointers (blocking calls)
        s_buf1 = (char *) clEnqueueMapBuffer(command_queue, s_mem, CL_TRUE, CL_MAP_WRITE, 0, MYBUFSIZE, 0, NULL, NULL, &ret);
        r_buf1 = (char *) clEnqueueMapBuffer(command_queue, r_mem, CL_TRUE, CL_MAP_WRITE, 0, MYBUFSIZE, 0, NULL, NULL, &ret);
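The slide does not show the corresponding teardown; a minimal sketch of how the mapped pinned buffers would typically be released afterwards is given below (standard OpenCL calls, not part of the original benchmark listing).

    // Typical cleanup for the mapped pinned buffers above (sketch; not shown on the slide)
    clEnqueueUnmapMemObject(command_queue, s_mem, s_buf1, 0, NULL, NULL);
    clEnqueueUnmapMemObject(command_queue, r_mem, r_buf1, 0, NULL, NULL);
    clFinish(command_queue);              // make sure the unmap operations have completed
    clReleaseMemObject(s_mem);
    clReleaseMemObject(r_mem);
    clReleaseCommandQueue(command_queue);
    clReleaseContext(context);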

Target Platform Setup

Bandwidth Results
Note: PCIe transfers to/from the GPU are not included in these tests.

Latency Results
Note: PCIe transfers to/from the GPU are not included in these tests.

OpenCL with GPUDirect
- Results identical to the CUDA pinned version, both for performance and for the IB counters
- Experiments with different memory referencing schemes did not impact performance:
  CL_MEM_WRITE_ONLY, CL_MEM_READ_ONLY, CL_MEM_READ_WRITE, CL_MEM_ALLOC_HOST_PTR

GPUDirect for HPC Applications (I)
Benefits reported for HPC applications by vendor studies:
- NVIDIA-Mellanox, GPUDirect Technology Brief, 2010
- QLogic White Paper, Maximizing GPU Cluster Performance, 2011
We experimented with various multi-GPU applications, for example:
- Himeno benchmark
- NAMD
- AMBER 11

GPUDirect for HPC Applications (II)
Our experiments were far less successful, due to several possible issues:
- GPU-accelerated applications currently do not share communication and computation data (historical reasons)
- Message size constraints: recall the impact of GPUDirect for certain message sizes
- System setup: recall the impact of the environment variables
- Experimental setup: input configurations for the applications
(Diagram: timeline of GPU-accelerated compute, send/receive data staging, and network send/receive phases)

Solving Possible Pitfalls (I)
- Does the application communicate using CUDA pinned memory? This is a requirement.
- MPI library build configuration:
  - Must have RDMA communication enabled
  - Some custom MPI configurations may not work; for example, our Nemesis configuration of MVAPICH2 does not show GPUDirect traffic for larger applications (e.g. AMBER 11)

Solving Possible Pitfalls (II)
- Message sizes need to be above the RDMA threshold
- Find the message sizes in the application (e.g. using a performance tool)
- Use environment variables to control the RDMA threshold; for example:
  export MV2_VBUF_TOTAL_SIZE=10000
  export MV2_IBA_EAGER_THRESHOLD=10000
- Test to ensure performance does not drop

Future Directions
- Further reduce copying overhead and host involvement: can CUDA 4.0 GPUDirect 2.0 be exploited?
- Possible changes in benchmarking to showcase how GPUDirect can be exploited
- Possible benefits to applications: currently being investigated
(Diagram: GPU memory, host memory, PCIe slots, and IB network; memory copy NOT needed for incoming and outgoing network messages)
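For reference, CUDA 4.0 GPUDirect 2.0 adds peer-to-peer copies between GPUs in the same node, removing the host staging buffer for intra-node transfers. The sketch below (not from the slides; device indices and sizes are illustrative) shows the standard CUDA runtime calls involved.

    /* CUDA 4.0 GPUDirect 2.0 sketch: direct GPU-to-GPU copy without host staging.
       Device indices and buffer size are illustrative; error checking omitted. */
    #include <cuda_runtime.h>

    void peer_to_peer_copy(size_t nbytes)
    {
        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, 0, 1);   /* can device 0 access device 1's memory? */

        void *d0, *d1;
        cudaSetDevice(0); cudaMalloc(&d0, nbytes);
        cudaSetDevice(1); cudaMalloc(&d1, nbytes);

        if (can_access) {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);         /* enable access from device 0 to device 1 */
        }
        cudaMemcpyPeer(d0, 0, d1, 1, nbytes);         /* direct copy over PCIe when P2P is enabled */

        cudaSetDevice(1); cudaFree(d1);
        cudaSetDevice(0); cudaFree(d0);
    }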

Summary
- GPUDirect 1.0 is a very first step toward reducing the traditional CPU<->GPU overheads for MPI communication
- The application development process remains oblivious to the improved features
- Should the GPU always rely on the CPU for access to system resources?
  - Yes: not much needs to be done; incremental changes should suffice
  - No: a lot needs to be done; R&D in programming and runtime systems is needed to guide application developers

Acknowledgements
- Gilad Shainer, Mellanox
- Pak Lui, Mellanox
- DK Panda, Ohio State University
- Sayantan Sur, Ohio State University
- Timothy Lanfear, NVIDIA