Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters

2015 IEEE International Conference on Cluster Computing

Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan, Hari Subramoni, Ching-Hsiang Chu and Dhabaleswar K. (DK) Panda
Department of Computer Science and Engineering, The Ohio State University
{hamidouche.2, venkatesh.19, awan.10, subramoni.1, chu.368,

Abstract: GPUDirect RDMA (GDR) brings the high-performance communication capabilities of RDMA networks like InfiniBand (IB) to GPUs (referred to as "Device"). It enables IB network adapters to directly write/read data to/from GPU memory. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communication characteristics by providing shared memory address space abstractions, along with one-sided communication semantics. However, current approaches and designs of OpenSHMEM on GPU clusters do not take advantage of the GDR features, leading to inefficiencies and sub-optimal performance. In this paper, we analyze the performance of various OpenSHMEM operations with different inter-node and intra-node communication configurations (Host-to-Device, Device-to-Device, and Device-to-Host) on GPU-based systems. We propose novel designs that ensure truly one-sided communication for the different inter-/intra-node configurations identified above while working around the hardware limitations. To the best of our knowledge, this is the first work that investigates GDR-aware designs for OpenSHMEM communication operations. Experimental evaluations indicate 2.5X and 7X improvements in point-to-point communication for intra-node and inter-node transfers, respectively. The proposed framework achieves 2.2 µs for an intra-node 8-byte put operation from Host-to-Device, and 3.13 µs for an inter-node 8-byte put operation from GPU to remote GPU. With the Stencil2D application kernel from the SHOC benchmark suite, we observe a 19% reduction in execution time on 64 GPU nodes. Further, for the GPULBM application, we are able to improve the performance of the evolution phase by 53% and 45% on 32 and 64 GPU nodes, respectively.

Keywords: PGAS, OpenSHMEM, GPUDirect RDMA, CUDA

I. INTRODUCTION

The emergence of accelerators such as NVIDIA General-Purpose Graphics Processing Units (GPGPUs, or GPUs in short) is changing the landscape of supercomputing systems. This trend is evident in the TOP500 list released in July 2015, where 90 systems make use of accelerator/co-processor technology [1]. GPUs, being PCIe devices, have their own memory space and require data to be transferred to their memory through specific mechanisms. The Compute Unified Device Architecture (CUDA) [2] API is the most popular programming framework available for users to take advantage of GPUs. It provides mechanisms to compute on the GPU, synchronize threads on the GPU, and move data between the CPU and the GPU. In addition to the generic CUDA APIs, auxiliary features such as GPUDirect help expedite data transfers to/from GPU memory. GPUDirect is a set of features that enable efficient data movement among GPUs as well as between GPUs and peer PCI Express (PCIe) devices. CUDA 5.0 introduced the GPUDirect RDMA (GDR) feature, which allows InfiniBand network adapters to directly read from or write to GPU device memory while completely bypassing the host [3]. This has the potential to yield significant performance benefits, especially in the presence of the multiple communication configurations that GPU devices expose.
In these heterogeneous systems, data can be transferred Host-to-Host (H-H), Device-to-Device (D-D), Host-to-Device (H-D), and Device-to-Host (D-H). Further, each of these configurations can be either intra-node or inter-node. Scientific applications use CUDA in conjunction with high-level programming models like the Message Passing Interface (MPI) or Partitioned Global Address Space (PGAS) models. Usually, CUDA is used for the kernel computation and for data movement between the local CPU host and the GPU device, while MPI/PGAS is responsible for inter-process communication. Several MPI implementations [4, 5] now allow direct communication from GPU device memory and transparently improve the performance of GPU-GPU communication using techniques like CUDA IPC, GPUDirect RDMA, and pipelining, thus enabling applications to achieve better performance [6, 7]. However, several researchers have shown that the message passing paradigm may not be the best fit for all classes of applications. PGAS programming models, with their lightweight one-sided communication and low-overhead synchronization semantics, present an attractive alternative for developing data-intensive applications that may have an irregular communication pattern [8-10]. They have also been shown to benefit bandwidth-limited applications [11]. There are two categories of PGAS models: 1) language-based, such as Unified Parallel C (UPC) [12] and Co-Array Fortran [13], and 2) library-based, such as OpenSHMEM [14].

OpenSHMEM is an effort to bring together a variety of SHMEM and SHMEM-like implementations into an open standard. The OpenSHMEM memory model allows application developers to allocate and manage data objects within symmetric memory regions which are accessible to other processing elements (PEs) via standard OpenSHMEM library functions. It provides better programmability by allowing a process to access a data variable at a remote process by specifying the corresponding local symmetric variable. For current OpenSHMEM application programs that involve data movement between GPUs, the developer has to separately manage the data movement between GPU device memory and main memory at each process using CUDA, as well as the data movement between processes using OpenSHMEM. In other words, the current OpenSHMEM standard does not support symmetric allocation for heterogeneous memory systems like GPU-based clusters. These shortcomings severely limit the programmability of the OpenSHMEM model.
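To make this staging burden concrete, the following minimal sketch shows the pattern that the current model forces on the programmer, using only standard OpenSHMEM 1.x and CUDA runtime calls; the buffer names, sizes, and ring-neighbor exchange are illustrative rather than taken from any particular code.

```c
/*
 * Minimal sketch of the staging pattern described above, using only standard
 * OpenSHMEM 1.x and CUDA runtime calls. Buffer names, sizes, and the
 * ring-neighbor exchange are illustrative.
 */
#include <stdlib.h>
#include <shmem.h>
#include <cuda_runtime.h>

#define NELEMS 4096

int main(void)
{
    start_pes(0);
    int me   = shmem_my_pe();
    int peer = (me + 1) % shmem_n_pes();

    float *d_src, *d_dst;                       /* device buffers: not symmetric */
    cudaMalloc((void **)&d_src, NELEMS * sizeof(float));
    cudaMalloc((void **)&d_dst, NELEMS * sizeof(float));
    cudaMemset(d_src, 0, NELEMS * sizeof(float));

    float *h_send = (float *)malloc(NELEMS * sizeof(float));    /* local staging */
    float *h_recv = (float *)shmalloc(NELEMS * sizeof(float));  /* symmetric, host heap */

    /* Step 1: the user stages the GPU data on the host with CUDA ... */
    cudaMemcpy(h_send, d_src, NELEMS * sizeof(float), cudaMemcpyDeviceToHost);

    /* Step 2: ... moves it between processes with OpenSHMEM (host-to-host) ... */
    shmem_putmem(h_recv, h_send, NELEMS * sizeof(float), peer);
    shmem_barrier_all();                        /* target must learn the data arrived */

    /* Step 3: ... and the *target* must copy it into its own GPU memory,
     * which is what breaks the one-sided semantics discussed above. */
    cudaMemcpy(d_dst, h_recv, NELEMS * sizeof(float), cudaMemcpyHostToDevice);

    shfree(h_recv);
    free(h_send);
    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}
```

The explicit device-to-host and host-to-device copies, together with the synchronization that makes the second copy safe, are exactly the steps the designs proposed in this paper aim to remove from the application.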

TABLE I. FEATURES, DESIGNS, AND CONFIGURATION SUPPORT OF EXISTING AND PROPOSED SOLUTIONS

| | Naive: Intra-node | Naive: Inter-node | Host-based Pipeline [15]: Intra-node | Host-based Pipeline [15]: Inter-node | Proposed: Intra-node | Proposed: Inter-node |
| Configurations | D-D, H-D, D-H | D-D, H-D, D-H | D-D, H-D, D-H | D-D | D-D, H-D, D-H | D-D, H-D, D-H |
| Schemes | user cudaMemcpy | user cudaMemcpy | IPC | pipeline | IPC, GDR | GDR, pipeline, proxy |
| Performance | Poor | Poor | Medium | Poor | Good | Good |
| True one-sided | Poor | Poor | Good | Poor | Good | Good |
| Productivity | Poor | Poor | Good | Good | Good | Good |

It requires users to employ intricate pipelining designs and to take advantage of advanced features like CUDA IPC to achieve optimal GPU-GPU communication performance, which negates the programmability benefits of PGAS. Furthermore, the current model nullifies the benefits of asynchronous one-sided communication by requiring the target process to perform a CUDA memory copy from the host OpenSHMEM memory to the device CUDA memory in order to complete the transfer.

A. Motivation

Recently, researchers have proposed simple extensions to the OpenSHMEM [15] and UPC [16] memory models to allow symmetric allocation on GPU memories. The extension for OpenSHMEM mainly introduces the concept of a Domain, which is passed to shmalloc to specify where the symmetric allocation will be performed: on the Host or on the GPU. Using the CUDA Unified Virtual Addressing (UVA) feature, as MPI libraries do, the authors have also proposed a CUDA-Aware OpenSHMEM runtime that hides the complexity of GPU programming. It transparently uses a pipelined scheme involving cudaMemcpy D-H, InfiniBand H-H, and cudaMemcpy H-D transfers for efficient communication. Although this simple extension ensures productivity and reduces the burden on the programmer, the runtime design has undesirable aspects. From Figure 1, we can clearly see that this design is non-optimal for current GPU clusters with GPUDirect RDMA capability: 1) It requires the involvement of the target process in the last step of the pipeline to perform the cudaMemcpy. This requirement removes the true one-sided nature of the OpenSHMEM semantics and introduces an implicit synchronization between the source and target, reducing the potential for computation/communication overlap. 2) Although the pipeline design is efficient for large message sizes, it adds latency overhead in the small message range, which reduces its efficiency. Finally, state-of-the-art designs primarily consider communication within a single domain (H-H or D-D configurations) and do not optimize inter-domain communication, which involves the H-D and D-H configurations.

TABLE II. Latency of a 4-byte put operation at the IB and OpenSHMEM levels for inter-node data movement between hosts and GPUs, in µs (rows: IB Send/Recv, OpenSHMEM Put; columns: Host-Host, GPU-GPU).

On the other hand, as indicated in [17, 18], GDR has the potential to deliver very low latency, compared to transfers staged through the host, without involvement from the remote process. However, its bandwidth is severely limited when compared to the bandwidth that an InfiniBand HCA offers. Table II shows the inefficiency of the current OpenSHMEM runtimes for GPU systems with GDR; at the same time, it highlights the potential impact of GDR on data movement from/to GPUs.

Fig. 1. Internode Host-based Pipeline Design [15].
B. Challenges and Contributions

Existing solutions and runtimes for the OpenSHMEM model have not been designed with GPU/GDR capability awareness in mind. In other words, the current solutions are unsuitable for GDR-enabled systems and thus achieve sub-optimal performance. The limitations posed by the state-of-the-art techniques lead us to the following challenges:

- Can the OpenSHMEM memory model support communication with heterogeneous memories, such as H-D and D-H, on NVIDIA GPU clusters?
- Is it possible to design truly one-sided communication to/from GPUs?
- Can new designs be proposed to efficiently take advantage of the GPUDirect RDMA feature?
- What are the alternative designs for intra-node and inter-node communication with the different configurations: H-H, D-D, H-D, and D-H?
- Can the proposed OpenSHMEM runtime improve the performance of applications?

Building on top of the domain-based extension to the OpenSHMEM memory model, in this paper we tackle the above challenges and propose a novel framework to efficiently design an OpenSHMEM runtime for GPU-based systems using GDR. To the best of our knowledge, this is the first paper exploiting GDR features in designing an efficient OpenSHMEM runtime. This paper makes the following contributions:

- Propose GDR-based designs to efficiently support OpenSHMEM communication from/to GPUs for all configurations.
- Design a novel and efficient truly one-sided communication runtime for both intra-node and inter-node configurations.
- Propose hybrid and proxy-based designs to overcome current hardware limitations on GPU-based systems.
- Redesign the LBM application to use OpenSHMEM directly from/to GPU memories and show the benefits of such designs on an end application.

Table I highlights and compares the features and designs of existing OpenSHMEM solutions for GPU clusters with the proposed solution. Naive refers to the basic OpenSHMEM model, where users explicitly manage and copy data from/to GPUs while inter-node communication happens exclusively host to host. Host-based Pipeline is the CUDA-Aware OpenSHMEM design proposed in [15]. The proposed framework is designed on top of MVAPICH2-X [19]. The evaluation results show that it achieves 2.5X and 7X latency improvements in the small and medium message range for intra-node and inter-node communication, respectively. On 64 GPU nodes, we show a 19% improvement in the execution time of the Stencil2D application kernel from the SHOC suite. The LBM application shows 53% and 45% improvements in the execution time of the evolution phase on 32 and 64 GPU nodes, respectively.

II. BACKGROUND

A. GPU Node Architecture and GPUDirect Technology

Current generation GPUs from NVIDIA are connected as peripheral devices on the I/O bus (PCI Express). Communication between a GPU and the host, and between two GPUs, happens over the PCIe bus. NVIDIA's GPUDirect technology provides a set of features that enable efficient communication among GPUs used by different processes and between GPUs and other devices like network adapters. With CUDA 4.1, NVIDIA addressed the problem of inter-process GPU-to-GPU communication within a node through CUDA Inter-Process Communication (IPC). A process can map and directly access GPU memory owned by other processes on the same node, similar to shared memory on the host. The data movement between the processes' device memories can happen without involving main memory. In CUDA 5.0, GPUDirect was extended to allow third-party PCIe devices to directly read/write data from/to GPU device memory. This feature is called GDR and is currently supported with Mellanox InfiniBand network adapters. It provides a fast path for moving data from GPU device memory onto the network that completely bypasses the host. These transfers between the GPU and the IB adapter are implemented as peer-to-peer (P2P) PCIe transfers.

B. GPUDirect RDMA and PCIe Bottlenecks

Although GDR provides a low-latency path for inter-node GPU-GPU data movement, its performance for large data transfers is limited by the bandwidth supported for PCIe P2P exchanges on modern node architectures [17]. The performance of P2P transfers, in MB/s and as a percentage of the peak FDR IB bandwidth, is presented in Table III (for example, a P2P read of 3,421 MB/s corresponds to 3,421/6,397, or about 54%, of the 6,397 MB/s that the FDR adapter offers). We consider scenarios where the IB card and the GPU are connected to the same socket (intra-socket configuration) as well as the inter-socket configuration, where the IB card and the GPU are connected to different sockets on the same node. These issues severely limit the performance achieved by GPUDirect RDMA for large message transfers. The performance of both P2P write and read operations is severely limited when the devices are connected to different sockets. Note that these artifacts are specific to the node architecture and not to the GPU or the IB adapter; similar limitations can arise between any two PCIe devices involved in P2P transfers.

TABLE III. PEER-TO-PEER PERFORMANCE ON THE IVYBRIDGE ARCHITECTURE AND PERCENTAGE OF THE BANDWIDTH OFFERED BY AN FDR IB ADAPTER (6,397 MB/s)

| IvyBridge (IVB) | Intra-Socket | Inter-Socket |
| P2P Read | 3,421 MB/s (54%) | 247 MB/s (4%) |
| P2P Write | 6,396 MB/s (100%) | 1,179 MB/s (19%) |
C. PGAS and OpenSHMEM Programming Models

Partitioned Global Address Space (PGAS) models provide a logical shared memory abstraction on a physically distributed memory system, making it easier to program. SHMEM is one such popular PGAS model with several successful implementations. OpenSHMEM [14] is an effort to standardize SHMEM and make it more widely useful for the community. OpenSHMEM operates on a symmetric memory address space and allows processes, or processing elements (PEs), to see each other's variables with a common name, with each PE having its own local copy of the variables. These are called symmetric variables and are allocated collectively. As in C, symmetric objects can be global or static variables, and they can also be allocated dynamically from a symmetric heap using routines like shmalloc and shmemalign. OpenSHMEM defines one-sided point-to-point (put and get) and collective communication operations for data movement between symmetric variables. The put operations in OpenSHMEM return when the data has been copied out of the source buffer; they need not be complete at the target. Completion at the target is ensured using explicit synchronization. The get operations return only when the data is available for use in the local buffer and hence do not require additional synchronization. OpenSHMEM also provides atomics and lock routines that allow the implementation of critical regions.
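The following minimal, host-only sketch illustrates these semantics with the standard API; the ring-style exchange and variable names are illustrative.

```c
/*
 * Minimal host-only sketch of the put/get completion semantics described
 * above, using the standard OpenSHMEM API; the ring exchange is illustrative.
 */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    start_pes(0);
    int me   = shmem_my_pe();
    int peer = (me + 1) % shmem_n_pes();

    /* Symmetric allocation: every PE holds its own copy of 'counter'. */
    long *counter = (long *)shmalloc(sizeof(long));
    *counter = -1;
    shmem_barrier_all();

    /* put returns once the source may be reused; completion at the target
     * is only guaranteed after explicit synchronization. */
    long val = (long)me;
    shmem_long_put(counter, &val, 1, peer);
    shmem_quiet();           /* force remote completion of the outstanding put */
    shmem_barrier_all();     /* all PEs now see the delivered values */

    /* get blocks until the data is locally available: no extra sync needed. */
    long peer_counter;
    shmem_long_get(&peer_counter, counter, 1, peer);
    printf("PE %d: counter = %ld, peer's counter = %ld\n", me, *counter, peer_counter);

    shfree(counter);
    return 0;
}
```

The shmem_quiet/shmem_barrier_all pair after the put, and the absence of any synchronization after the get, mirror the completion rules stated above.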

D. MVAPICH2-X Runtime

MVAPICH2-X [19, 20] provides a unified high-performance runtime that supports both MPI and PGAS programming models on InfiniBand clusters. It enables developers to port parts of large MPI applications that are suited to the PGAS programming model. This minimizes the development overheads that have been a substantial deterrent to porting MPI applications to PGAS models. The unified runtime also delivers superior performance compared to using separate MPI and PGAS libraries by optimizing the use of network and memory resources [21, 22].

III. GDR-AWARE OPENSHMEM RUNTIME DESIGNS

The availability of a high-performance runtime is a key factor for wider acceptance of any programming model. The efficiency of designs for OpenSHMEM communication routines can vary widely based on the features and constraints of the underlying GPU programming platform and hardware configuration. In this section, we discuss and propose different alternatives to design efficient and truly one-sided GDR-aware OpenSHMEM communication for both intra-node and inter-node configurations.

A. Enhanced Initialization and Heap Allocation

The memory initialization of OpenSHMEM was extended to allow processes to create a symmetric heap on GPU memory in addition to the heap created on host memory. Like the symmetric memory on the host, the GPU heap size is controlled by a runtime parameter. As described in the following subsections, we propose hybrid designs that choose between GDR and CUDA IPC for the different configurations. This allows us to take advantage of the best path and avoid the P2P hardware bottleneck described in Section II-B. In order to enable RDMA support with the GDR path, each process registers both heaps with the IB HCA, which creates the memory descriptors (lkey and rkey). To reduce the cost of memory registration, which is an expensive operation, we utilize the registration cache in MVAPICH2-X. For intra-node configurations, each process creates a CUDA IPC handle for its GPU heap. In the last step of the initialization, the memory descriptors and the IPC handles are exchanged among all processes. Each process creates a local table with the RDMA descriptors for both the Host and GPU heaps as well as the IPC handles of the other processes. To perform symmetric allocations, the user calls shmalloc(size, domain). Explicitly specifying the domain is the only modification needed to make an OpenSHMEM program CUDA-Aware. Thanks to this CUDA-Aware OpenSHMEM concept, no other changes are needed, and the same API calls can be used for both Host and GPU communication.
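From the application's perspective, allocation and communication then look as in the following sketch; the two-argument shmalloc and the SHMEM_GPU_DOMAIN constant belong to the proposed extension [15] rather than the OpenSHMEM standard, and the exact names are illustrative.

```c
/*
 * Illustrative application view of the domain-based allocation described in
 * Section III-A. The two-argument shmalloc and the domain constant come from
 * the proposed extension [15]; the names are assumptions, not a standard API.
 */
#include <shmem.h>
#include <cuda_runtime.h>

#define NELEMS (1 << 20)

int main(void)
{
    start_pes(0);
    int me   = shmem_my_pe();
    int peer = (me + 1) % shmem_n_pes();

    /* Symmetric heap on the GPU: d_buf is a device pointer on every PE. */
    float *d_buf = (float *)shmalloc(NELEMS * sizeof(float), SHMEM_GPU_DOMAIN);
    cudaMemset(d_buf, 0, NELEMS * sizeof(float));
    shmem_barrier_all();

    /* The same put call used for host memory: via UVA the runtime detects
     * that both addresses are device memory and transparently selects the
     * GDR, IPC, or proxy path described in the rest of this section. */
    shmem_putmem(d_buf, d_buf, NELEMS * sizeof(float), peer);
    shmem_barrier_all();

    shfree(d_buf);
    return 0;
}
```

Compared with the staging sketch shown in the Introduction, the explicit cudaMemcpy calls and the target-side copy disappear; the designs below decide how the bytes actually move.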
B. Design Alternatives for Intranode Communication

In this section, we describe the different techniques and schemes that enable efficient intra-node communication. Although we use the put operation here to explain the concepts, our designs are equally applicable to the get operation.

Fig. 2. Intranode Host-to-Device (H-D) using GPUDirect RDMA.

Host-to-Device (H-D): Figure 2 illustrates the designs used for H-D put communication. During an H-D put operation, the source buffer is on the Host heap while the destination buffer is on the Device (GPU) heap. Using the UVA feature, the source process determines that the destination address is on the GPU. It then translates the local destination address to the remote address using the information exchanged during initialization and looks up the memory descriptors as well as the IPC handle of the remote process; this information is stored in a local table. Depending on the message size, the source process posts a GDR transfer or a CUDA IPC copy. Due to the P2P bottleneck, we define different thresholds below which the GDR capabilities are used for a read or a write operation. These thresholds are runtime parameters and can be tuned for different architectures. If the message size is less than the GDR threshold, the source process posts an RDMA write operation to the target process. This operation performs a loopback, as both source and target processes are on the same node. Further, as the remote address is on the GPU, the HCA directly writes the data to the GPU using the GDR capability, bypassing host memory. However, if the message size is larger than the GDR threshold for a put operation, the source process uses the IPC handle information for the destination buffer and performs a cudaMemcpy. The same logic and operations apply to a D-H get operation. Note that both designs enable truly one-sided operations in which the target process is not involved in the communication progress.

Fig. 3. Intranode Device-to-Host (D-H) Hybrid Design with GDR Loopback and cudaMemcpy.

Device-to-Host (D-H): In this configuration, the source buffer is on the GPU heap whereas the destination buffer is on the Host heap. Similar to the H-D configuration, we propose a hybrid design for the D-H configuration. The GDR code path uses the same logic, the only difference being the threshold, as this operation involves a P2P read from the GPU. However, this configuration faces a challenge on the IPC side of the design: IPC maps only device buffers, not host buffers, into another process's address space, while cudaMemcpy requires both addresses to belong to the same process. One possible solution is to involve the target process and ask it to perform the copy from the IPC-mapped source buffer to its local host memory. Although this solution might exhibit good performance, it is clearly not suitable, as it violates the one-sided requirement. A better alternative, which ensures both performance and the true one-sided property, is depicted in Figure 3. In this design, the source process first calls shmem_ptr to find the shared memory address of the destination buffer. Using this shared memory address, the source process directly performs a cudaMemcpy from its device buffer to the shared memory buffer. As this shared memory address corresponds to the destination buffer, no further operations are needed to complete the data transfer.
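The following sketch summarizes this hybrid D-H put path from the source process's point of view; the threshold value, the helper name, and the fallback behavior are illustrative and do not reflect the runtime's internal interfaces.

```c
/*
 * Sketch of the intra-node D-H hybrid path described above, from the source
 * process's point of view. dest_host is a host-domain symmetric address,
 * src_dev a local device buffer; the threshold and function name are
 * illustrative assumptions.
 */
#include <stddef.h>
#include <shmem.h>
#include <cuda_runtime.h>

#define GDR_READ_THRESHOLD (8 * 1024)   /* assumed tuning parameter */

void put_device_to_host(void *dest_host, const void *src_dev, size_t bytes, int pe)
{
    void *mapped = shmem_ptr(dest_host, pe);   /* target's host heap, if mapped */

    if (bytes <= GDR_READ_THRESHOLD || mapped == NULL) {
        /* Small message: let the runtime post the GDR loopback write, so the
         * HCA reads the GPU buffer and writes the target's host heap directly
         * (device-pointer sources are valid only with the CUDA-aware extension). */
        shmem_putmem(dest_host, src_dev, bytes, pe);
    } else {
        /* Large message: the destination is visible through shared memory, so
         * a single D2H cudaMemcpy lands the data in the target's heap without
         * any action from the target process. */
        cudaMemcpy(mapped, src_dev, bytes, cudaMemcpyDeviceToHost);
    }
}
```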

Device-to-Device (D-D): D-D communication uses the hybrid design proposed earlier for H-D communication, with both the source and destination buffers residing on the GPU heap. However, it uses the smaller of the GDR (read or write) thresholds.

C. Internode Communication Alternative Designs

For inter-node communication, we propose a hybrid design that uses three different protocols. Each protocol exhibits different behavior depending on the message size, the communication pattern (put/get; H-D, D-H, and D-D), and the configuration of the node, i.e., the placement of the GPU and the IB HCA (same socket or different sockets).

Direct GDR Protocol: This protocol is used for small and medium message ranges with the intra-socket node configuration. It is used for both put and get operations in the H-D, D-H, and D-D configurations. This design ensures very low latency, as it posts an RDMA operation directly from the source buffer to the destination buffer irrespective of their location (GPU or Host). Similar to the loopback design, Direct GDR uses different thresholds for put and get operations. Figure 4 illustrates, with a solid green line, the path used by this design for a D-D configuration (the same design applies to the other configurations). As we can see, this design preserves the true one-sided property of OpenSHMEM for GPU communication.

Fig. 4. Inter-Node Designs with GDR: Direct GDR and Pipeline GDR write.

Pipeline GDR write Protocol: As indicated in Table III, the bottleneck when the HCA writes to GPU memory is not severe for the intra-socket configuration. Based on this observation, we propose an improved pipeline design in which the data is copied onto pre-registered buffers on the host using CUDA IPC, to avoid the P2P read bottleneck, and is then written directly to the destination GPU memory using a GDR write, as shown with the dotted lines in Figure 4. This design avoids the P2P read bottleneck and thus targets only a subset of configurations: it is used for put D-D and D-H operations in the intra-socket node configuration. In our earlier work [17], we used a similar design in the context of MPI two-sided operations, where a rendezvous protocol synchronizes the sender and receiver processes. The current design, in contrast, is truly one-sided, as the target process is not involved in the communication. Further, since a put operation returns when the source buffer is ready for reuse, with the proposed design the source process returns from a put operation as soon as the last IPC cudaMemcpy is complete and the RDMA write operation is posted.
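The source-side logic of this protocol can be sketched as follows; the chunk size, the staging depth, and the post_gdr_write placeholder (which stands in for posting the verbs-level RDMA write into the remote GPU heap) are illustrative assumptions rather than the actual implementation.

```c
/*
 * Source-side sketch of the Pipeline-GDR-write idea described above.
 * post_gdr_write() is a placeholder standing in for an IB RDMA write whose
 * destination is remote GPU memory; chunk size, staging depth, and all names
 * are illustrative.
 */
#include <stddef.h>
#include <cuda_runtime.h>

#define CHUNK (512 * 1024)
#define DEPTH 2                                   /* pre-registered staging slots */

static char host_stage[DEPTH][CHUNK];             /* assumed registered with the HCA */

/* Placeholder for the verbs-level RDMA write into the remote GPU heap. */
static void post_gdr_write(const void *local_host, size_t len,
                           char *remote_gpu, int pe)
{
    (void)local_host; (void)len; (void)remote_gpu; (void)pe;
}

void pipelined_put_d2d(char *remote_gpu, const char *local_gpu, size_t bytes, int pe)
{
    cudaStream_t stream[DEPTH];
    for (int i = 0; i < DEPTH; i++)
        cudaStreamCreate(&stream[i]);

    size_t off = 0;
    int slot = 0;
    while (off < bytes) {
        size_t len = (bytes - off < CHUNK) ? bytes - off : CHUNK;

        /* Stage the chunk on the host to avoid the PCIe P2P read bottleneck. */
        cudaMemcpyAsync(host_stage[slot], local_gpu + off, len,
                        cudaMemcpyDeviceToHost, stream[slot]);
        cudaStreamSynchronize(stream[slot]);

        /* Write the staged chunk straight into the remote GPU heap (GDR write);
         * the target process never participates. In the real runtime a slot is
         * reused only after the matching RDMA completion is observed. */
        post_gdr_write(host_stage[slot], len, remote_gpu + off, pe);

        off += len;
        slot = (slot + 1) % DEPTH;
    }

    /* The put may return here: the source data has been copied out and the
     * last RDMA write has been posted. */
    for (int i = 0; i < DEPTH; i++)
        cudaStreamDestroy(stream[i]);
}
```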
Proxy-based Protocol: The two designs above are not sufficient to efficiently handle all combinations of node configurations and communication patterns. As indicated above, in earlier work [17] we addressed the P2P limitation for the different node configurations in the context of MPI two-sided communication, using hybrid designs involving GPUDirect RDMA and host-based pipelining. Key additional features desired in the context of the OpenSHMEM model are asynchronous progress and truly one-sided communication. One way to achieve this is to use the per-process service thread available with the reference implementation of OpenSHMEM to progress communication. However, this consumes additional CPU resources at each process and involves locking overheads, as the service thread and the main process share the communication channel.

In light of these requirements and challenges, we propose a proxy-based framework to support the OpenSHMEM model on GDR-enabled clusters. The proxy uses CUDA IPC memory copies and RDMA transfers to efficiently move data from/to GPUs while working around the bottlenecks. This extends our work addressing similar issues in the context of Intel's many-core co-processors [23]. Figure 5 depicts a high-level representation of the proxy-based framework. The key differences between the proposed framework and our earlier work are the way memory accesses are managed and how progress is made. When the symmetric heaps are created, the proxy process running on the node maps the memory of all the processes running on that node into its address space using the CUDA IPC API. When there are multiple GPU devices per node, the proxy maintains a context on each GPU and keeps a mapping between MPI processes and the GPU device each of them uses. We avoid the overheads of context switching, as the IPC mapping is performed only during heap creation. When an OpenSHMEM one-sided communication operation is issued, the source process passes a signal to the remote proxy with information about the source and target buffers. The proxy uses different communication paths depending on the location of the source buffer, the location of the destination buffer, the architecture of the node (location of the GPU and IB card), and the communication pattern. For instance, for a D-D get operation on an intra-socket node configuration, the source process signals the remote proxy to copy the data from the device to its own local memory using IPC and then write the data directly to the source GPU in a pipelined manner, as depicted in Figure 5; in other words, the remote proxy executes the Pipeline GDR write protocol in reverse. In an inter-socket node configuration, the remote proxy sends the data to the source's pre-registered host memory, from which it is staged to the final destination with a local IPC cudaMemcpy. Note that we chose to involve the source process in the progress because a shmem get operation is blocking and the source has to wait for its completion. The target process, however, is not involved in the progress, which leads to fully asynchronous and truly one-sided behavior. The proxy progresses communication operations on behalf of all processes on the node. As the proxy is used only for large-message communication, a single proxy is enough to saturate the PCIe and network bandwidths; small messages are handled directly by the processes themselves.
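As an illustration, the signal passed from a source PE to the remote proxy can be thought of as a small descriptor of the following form; the paper does not specify the exact format, so every field name here is an assumption.

```c
/*
 * Illustrative layout of the request a source PE might hand to the remote
 * proxy; the exact format is not given in the paper, so all field names are
 * assumptions. Heap offsets rather than raw pointers are shown because the
 * proxy resolves addresses within the IPC-mapped heaps it already holds.
 */
#include <stdint.h>

typedef enum { PROXY_OP_PUT, PROXY_OP_GET } proxy_op_t;

typedef struct {
    proxy_op_t op;           /* put or get */
    int        src_pe;       /* issuing PE */
    int        dst_pe;       /* PE owning the destination buffer */
    int        src_on_gpu;   /* 1 if the source address lies in the GPU heap */
    int        dst_on_gpu;   /* 1 if the destination address lies in the GPU heap */
    uint64_t   src_offset;   /* offset of the source buffer within its heap */
    uint64_t   dst_offset;   /* offset of the destination buffer within its heap */
    uint64_t   bytes;        /* message size; the proxy serves only large messages */
} proxy_request_t;
```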

Fig. 5. Proxy-based Framework to support efficient large message size communication with OpenSHMEM on GPU clusters.

D. Support for Atomic Operations

The OpenSHMEM model provides atomic operations, such as fetch-and-add and compare-and-swap, to implement locks and critical sections as well as synchronization methods. For 64-bit data, InfiniBand HCAs offer high-performance hardware support for these atomic operations. We take advantage of this feature, together with GDR, to perform these atomics when the data resides on GPUs. For data sizes of less than 64 bits, we borrow the host design, which uses masking techniques to take advantage of the RDMA atomic support. Because atomics are mostly used for data sizes of at most 64 bits, this paper does not handle atomics on larger data.

IV. APPLICATION REDESIGN WITH OPENSHMEM

In order to evaluate the benefits of our designs, we have redesigned the GPU version of the Lattice Boltzmann Method (LBM) application with OpenSHMEM. The existing version of GPULBM is a parallel, distributed CUDA implementation for multiphase flows with large density ratios [24]. This version is CUDA-Aware MPI code that uses MPI two-sided operations (send/recv) to exchange data between processes. It is an iterative application that operates on 3D data grids, with the data decomposition done along the Z axis. The iterative phase is called Evolution. Evolution time is the time spent in the main loop of the code, which dominates the runtime, as the application runs for a large number of iterations in a real-world run. The Evolution phase involves three exchanges in each timestep/iteration: an exchange of the Laplacian of the phase, phi (1 element); an exchange of the phase distribution function, f (1 element); and another exchange of the phase and momentum distribution functions, f and g (6 elements). The size of the messages involved in each exchange is the product of the X and Y dimensions of the grid, the corresponding number of elements involved in the exchange, and the size of the float datatype. To produce an OpenSHMEM or hybrid MPI+OpenSHMEM version of the application, we first need to handle the memory allocation. To do so, we track all the GPU memory allocations (cudaMalloc), compute their accumulated size, and replace them with a symmetric heap allocation on the GPU domain using the OpenSHMEM shmalloc call. We then replace the different MPI point-to-point operations with the appropriate shmem_putmem calls to ensure a lightweight and asynchronous exchange of the data from/to GPUs, as sketched below.
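A hedged before/after sketch of one halo exchange is shown below; the variable names and the neighbor pattern are illustrative and are not taken from the GPULBM source code.

```c
/*
 * Hedged before/after sketch of the halo-exchange conversion described above.
 * Variable names (d_f, halo_elems, up/down neighbors) are illustrative; d_f in
 * the OpenSHMEM version is assumed to come from the GPU-domain symmetric heap
 * (the extension of [15]).
 */
#include <mpi.h>
#include <shmem.h>

/* Before: CUDA-aware MPI two-sided exchange of a device-resident halo layer. */
void exchange_mpi(float *d_f, int halo_elems, int up, int down, MPI_Comm comm)
{
    MPI_Request req[2];
    MPI_Irecv(d_f,              halo_elems, MPI_FLOAT, up,   0, comm, &req[0]);
    MPI_Isend(d_f + halo_elems, halo_elems, MPI_FLOAT, down, 0, comm, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}

/* After: a one-sided put writes the boundary layer straight into the
 * neighbor's GPU memory; a barrier (or quiet plus point-to-point
 * synchronization) closes the exchange step of each iteration. */
void exchange_shmem(float *d_f, int halo_elems, int down)
{
    shmem_putmem(d_f,                          /* neighbor's halo slot (symmetric) */
                 d_f + halo_elems,             /* my boundary layer on the GPU */
                 (size_t)halo_elems * sizeof(float), down);
    shmem_barrier_all();
}
```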
V. PERFORMANCE EVALUATION

A. Experimental Setup

The Wilkes cluster has been used for the performance evaluation. Wilkes was deployed in November 2013 and is the fastest academic cluster in the United Kingdom. The cluster is partitioned into different configurations; for our purposes we use the Tesla partition, which has 128 nodes. Each node is a dual-socket machine with 6-core Intel IvyBridge processors, equipped with 2 NVIDIA Tesla K20 GPUs and 2 FDR IB HCAs. For the evaluation, we compare our designs, labeled Enhanced-GDR, with the earlier host-based pipeline (Host-Pipeline) scheme presented in [15]. The Enhanced-GDR framework is a hybrid scheme that includes all the designs presented in Section III. Depending on the message size and the configuration, the framework is tuned to select the best protocol.

B. Micro-benchmark Level Evaluation

We first present results using the OMB-GPU [25] micro-benchmark suite, which extends the OMB suite with GPU support. Performance is shown using point-to-point micro-benchmarks for both Put and Get operations in different configurations.

Intranode Evaluation: Figure 6 illustrates the performance of Put and Get operations in the H-D configuration. For the small and medium message ranges, for both Put and Get operations, the GDR-based design achieves over 2X improvement compared to the default design. As shown in Figures 6(a) and 6(c), with a 4-byte message size, the GDR-based design achieves a latency of 2.4 µs and 2.02 µs for Put and Get operations, respectively. In comparison, the default design, which uses copies based on CUDA IPC, results in a latency of 6.2 µs for the same message size. For larger message sizes, the performance of the H-D Put operation (Figure 6(b)) and the D-H Get operation (Figure 7(d)) is on par with the default design, as both use the CUDA IPC copy scheme. However, the situation is quite different for large message transfers with D-H Put and H-D Get, as shown in Figures 7(b) and 6(d), respectively. Taking advantage of the shared memory design presented in Section III, the proposed scheme reduces the latency of large transfers by 40%.

Internode Evaluation: Figure 8(a) shows that the GDR-based solution, with the Direct GDR protocol introduced in Section III, reduces the latency of the put operation for small message sizes by 7X. The latency of a shmem_putmem of 8 bytes from a GPU to a remote GPU is reduced from 20.9 µs to 3.13 µs. Further, a 2KB message transfer completes in under 4 µs. Similar benefits and trends are observed for the Get operation, as shown in Figure 8(c). When the message size crosses the GDR thresholds, the framework automatically selects the pipeline schemes to avoid the P2P bottlenecks. For the Put operation with large messages, the Pipeline GDR write protocol is selected; as the slowest path is the cudaMemcpy, which exists in both the existing and new designs, both designs lead to the same performance, as shown in Figure 8(b). For the Get operation, the remote proxy is involved in the progress, following the proxy-based protocol. From Figure 8(d), we can clearly see that our proposed design does not incur any overhead and efficiently avoids the P2P bottlenecks.

Fig. 6. Comparison of latency performance using the existing host-based Pipelining and the proposed GDR-based designs for intra-node H-D Put and Get operations: (a) Put, small messages; (b) Put, large messages; (c) Get, small messages; (d) Get, large messages.

Fig. 7. Comparison of latency performance using the existing host-based Pipelining and the proposed GDR-based designs for intra-node D-H Put and Get operations: (a) Put, small messages; (b) Put, large messages; (c) Get, small messages; (d) Get, large messages.

As indicated in earlier sections, the existing solution does not handle inter-domain configurations such as H-D and D-H for inter-node communication. Thus, Figure 9 shows only the performance of the proposed design. Similar to the D-D configuration, Put and Get operations achieve very good performance in the small and medium message ranges for the H-D and D-H configurations. The proposed design achieves 2.81 µs for an inter-node H-D Put operation of 8 bytes and 3.7 µs for 4KB transfers.

Overlap Evaluation: To demonstrate the true one-sidedness of the proposed design, we evaluate the overlap achieved during a Put operation. The benchmark uses 2 processes; the source process issues a put operation to the target process, which is busy computing. As shown in Figures 10(a) and 10(b), for both medium (8 KB) and large (1 MB) messages, respectively, the communication time of the existing solution increases with the computation time at the target, whereas the proposed design maintains the same communication time regardless of the target's behavior. This confirms that the proposed design achieves truly one-sided communication progress and leads to 100% overlap. Note that both designs are evaluated without enabling the service thread available with the OpenSHMEM implementation. The service thread might help the existing design achieve better overlap; however, as stated in Section III, it leads to a significant degradation in application efficiency, as the threads consume half of the CPU resources.

C. Application Level Evaluation

In addition to the redesigned LBM application presented in Section IV, we considered the Stencil2D application benchmark from the Scalable Heterogeneous Computing (SHOC) Benchmark Suite to evaluate the benefits of the proposed runtime on application performance. Figure 11 clearly shows the advantage of our designs on the execution time of the Stencil2D benchmark. The reported numbers are based on the median metric with double precision for 1,000 internal iterations, averaged over ten program runs. With a 1K x 1K input size, as shown in Figure 11(a), we are able to improve performance by 24%, 18%, and 14% on 16, 32, and 64 GPU nodes, respectively. With the large input size (2K x 2K), the proposed design reduces the execution time by 20% and 19% on 32 and 64 GPU nodes, respectively.

Using the redesigned version of the LBM application, we evaluate the performance of the evolution phase. We use both strong and weak scaling, as depicted in Figures 12(a) and 12(b), respectively. For the weak-scaling experiment, we keep the input size per GPU at 64 x 64 x 64 and keep the process grid as balanced as possible; for example, with 64 processes, we use a 4 x 4 x 4 grid. From the figures, we can see that the proposed design achieves 70%, 53%, and 45% improvement on 16, 32, and 64 GPUs, respectively, in the strong-scaling experiment.
Note that the performance degradation at 32 and 64 nodes, compared to 8 and 16 nodes, is due to the increase in communication time with scale: at larger scale and with a small input size, the communication time exceeds the computation time. In the weak-scaling experiment, we show 39% and 30% improvement on 32 and 64 GPU nodes, respectively. We are not able to include a larger input size in the evaluation due to the limit on the amount of memory that the GPU can register; this is a configuration limit on the Wilkes system.

VI. RELATED WORK

There have been several efforts to make programming models accelerator- and co-processor-aware. The work in [6] proposed efficient GPU-to-GPU communication by overlapping RDMA data transfers with CUDA memory copies inside the MPI library. The authors in [17, 18] extended this work with GPUDirect RDMA support for MPI libraries.

Fig. 8. Comparison of latency performance using the existing host-based Pipelining and the proposed GDR-based designs for inter-node D-D Put and Get operations: (a) Put, small messages; (b) Put, large messages; (c) Get, small messages; (d) Get, large messages.

Fig. 9. Latency performance of the proposed GDR-based designs for inter-node D-H and H-D configurations with Put and Get operations: (a) Put D-H; (b) Put H-D; (c) Get H-D; (d) Get D-H.

Fig. 10. One-sided achievement with an asynchronous progress comparison: (a) small message, 8 KB; (b) large message, 1 MB.

Fig. 11. Execution time of the Stencil2D SHOC application benchmark: (a) input size 1K x 1K; (b) input size 2K x 2K.

Fig. 12. Execution time of the evolution phase of the LBM application: (a) input size 128 x 128 x 128; (b) input size per GPU 64 x 64 x 64.

CUDA support for X10 was introduced as part of the Asynchronous PGAS (APGAS) model, which enables writing single-source efficient code for heterogeneous and multi-core architectures [26]. Several designs have been proposed to alleviate the burden of data movement and buffer management between different address spaces in existing programming models such as MPI and PGAS. MVAPICH2-GPU [6, 7, 27] has been proposed to allow GPU-to-GPU data communication using the standard MPI interfaces. The rCUDA framework offers GPU virtualization through custom calls replacing CUDA calls, which can also be used by machines across clusters [28]. Similarly, MIC-RO [29] enables sharing and using multiple Intel Many Integrated Core (MIC) cards across nodes. The work in [30] proposed concepts that allow an HCA to access GPU memory, similar to GDR; however, those concepts require specific hardware and cannot be applied to production-ready HPC systems. Using a proxy-based design to forward communication between two points has been widely explored: most frameworks for virtualizing remote GPUs [28, 31] are based on proxy designs on the remote hosts, and the authors in [23, 32] proposed a proxy-based framework to overcome hardware limitations on MIC-based clusters. We distinguish this work from related efforts as being the first to harness NVIDIA's GDR feature for the OpenSHMEM library. The proposed hybrid solutions benefit from the best of both GDR and host-assisted GPU communication. In addition, the proposed framework has a unique design compared to the existing state of the art, because it maintains the true one-sided property of the OpenSHMEM model while working around architectural bottlenecks.

VII. CONCLUSION AND FUTURE WORK

In this paper, we presented novel designs for the OpenSHMEM runtime that take advantage of GDR technology for intra-node and inter-node communication on NVIDIA GPU clusters. We also presented a framework with alternative and hybrid designs that ensures the true one-sided property of the OpenSHMEM programming model, enables asynchronous progress, and at the same time works around the hardware limitations. The experimental results show 7X improvement in the latency of inter-node D-D communication for small and medium message sizes. The proposed framework achieved an inter-node D-D communication latency of 3.13 µs for 8 bytes and under 4 µs for a 2 KB message. Likewise, up to 3.6X improvement is seen for intra-node configurations. With the Stencil2D application kernel from the SHOC suite, a 19% improvement in execution time is seen on 64 GPU nodes. As part of the performance evaluation, we redesigned the LBM application to use the OpenSHMEM model directly from/to GPU buffers, which showed 53% and 45% improvement in the execution time of the evolution phase on 32 and 64 GPU nodes, respectively. In the future, we plan to extend our designs to the UPC programming model as well as to redesign other applications for the proposed GDR-aware OpenSHMEM runtime.

VIII. ACKNOWLEDGMENT

This research is supported in part by National Science Foundation grants #OCI and #CCF. We would like to thank Filippo Spiga from the University of Cambridge for providing access to the Wilkes testbed.

REFERENCES

[1] TOP 500 Supercomputer Sites,
[2] NVIDIA, NVIDIA CUDA Compute Unified Device Architecture, home new.html.
[3] NVIDIA GPUDirect RDMA. [Online].
[4] MVAPICH2: MPI over InfiniBand, 10GigE/iWARP and RoCE,
[5] Open MPI: Open Source High Performance Computing,
[6] H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur, and D. K. Panda, "MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters," in Int'l Supercomputing Conference (ISC).
[7] H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur, and D. K. Panda, "Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2," in IEEE Cluster '11.
[8] G. Cong, G. Almasi, and V. Saraswat, "Fast PGAS Implementation of Distributed Graph Algorithms," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '10). Washington, DC, USA: IEEE Computer Society, 2010.
[9] S. Olivier and J. Prins, "Scalable Dynamic Load Balancing Using UPC," in Proceedings of the International Conference on Parallel Processing (ICPP '08). Washington, DC, USA: IEEE Computer Society, 2008.

[10] J. Zhang, B. Behzad, and M. Snir, "Optimizing the Barnes-Hut Algorithm in UPC," in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11). New York, NY, USA: ACM, 2011, pp. 75:1-75:11.
[11] C. Bell, D. Bonachea, R. Nishtala, and K. Yelick, "Optimizing Bandwidth Limited Problems Using One-sided Communication and Overlap," in Proceedings of the 20th International Conference on Parallel and Distributed Processing (IPDPS '06). Washington, DC, USA: IEEE Computer Society, 2006.
[12] UPC Consortium, "UPC Language Specifications, v1.2," Lawrence Berkeley National Lab, Tech. Report LBNL.
[13] Co-Array Fortran,
[14] OpenSHMEM, "OpenSHMEM Application Programming Interface."
[15] S. Potluri, D. Bureddy, H. Wang, H. Subramoni, and D. Panda, "Extending OpenSHMEM for GPU Computing," in Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, May 2013.
[16] M. Luo, H. Wang, and D. K. Panda, "Multi-Threaded UPC Runtime for GPU to GPU Communication over InfiniBand," in International Conference on Partitioned Global Address Space Programming Models (PGAS '12), October.
[17] S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy, and D. Panda, "Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs," in International Conference on Parallel Processing (ICPP), Oct 2013.
[18] R. Shi, S. Potluri, K. Hamidouche, M. Li, J. Perkins, D. Rossetti, and D. Panda, "Designing Efficient Small Message Transfer Mechanism for Inter-node MPI Communication on InfiniBand GPU Clusters," in IEEE International Conference on High Performance Computing (HiPC), December.
[19] MVAPICH2-X: Unified MPI+PGAS Communication Runtime over OpenFabrics/Gen2 for Exascale Systems,
[20] D. K. Panda, K. Tomko, K. Schulz, and A. Majumdar, "The MVAPICH Project: Evolution and Sustainability of an Open Source Production Quality MPI Library for HPC," in Workshop on Sustainable Software for Science: Practice and Experiences, held in conjunction with the Int'l Conference on Supercomputing (WSSPE).
[21] J. Jose, M. Luo, S. Sur, and D. K. Panda, "Unifying UPC and MPI Runtimes: Experience with MVAPICH," in The 4th Conference on Partitioned Global Address Space (PGAS).
[22] J. Jose, K. Kandalla, M. Luo, and D. Panda, "Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation," in 41st International Conference on Parallel Processing (ICPP).
[23] S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni, and D. K. Panda, "MVAPICH-PRISM: A Proxy-based Communication Framework Using InfiniBand and SCIF for Intel MIC Clusters," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '13), 2013, pp. 54:1-54:11.
[24] C. Rosales, "Multiphase LBM Distributed over Multiple GPUs," in Cluster Computing, IEEE International Conference on, pp. 1-7.
[25] D. Bureddy, H. Wang, A. Venkatesh, S. Potluri, and D. K. Panda, "OMB-GPU: A Micro-benchmark Suite for Evaluating MPI Libraries on GPU Clusters," in Proceedings of the 19th European Conference on Recent Advances in the Message Passing Interface (EuroMPI '12), 2012.
[26] D. Cunningham, R. Bordawekar, and V. Saraswat, "GPU Programming in a High Level Language: Compiling X10 to CUDA," in Proceedings of the 2011 ACM SIGPLAN X10 Workshop (X10 '11), 2011, pp. 8:1-8:10.
[27] S. Potluri, H. Wang, D. Bureddy, A. K. Singh, C. Rosales, and D. K. Panda, "Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication," in Proceedings of the International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), held in conjunction with the International Parallel and Distributed Processing Symposium (IPDPS '12).
[28] J. Duato, A. Pena, F. Silla, J. Fernandez, R. Mayo, and E. Quintana-Orti, "Enabling CUDA Acceleration Within Virtual Machines Using rCUDA," in High Performance Computing (HiPC), International Conference on, Dec 2011.
[29] K. Hamidouche, S. Potluri, H. Subramoni, K. Kandalla, and D. K. Panda, "MIC-RO: Enabling Efficient Remote Offload on Heterogeneous Many Integrated Core (MIC) Clusters with InfiniBand," in Proceedings of the 27th International ACM Conference on Supercomputing (ICS '13), 2013.
[30] L. Oden and H. Fröning, "GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters," in 2013 IEEE International Conference on Cluster Computing (CLUSTER 2013), Indianapolis, IN, USA, September 23-27, 2013.
[31] S. Xiao, P. Balaji, J. Dinan, Q. Zhu, R. Thakur, S. Coghlan, H. Lin, G. Wen, J. Hong, and W.-C. Feng, "Transparent Accelerator Migration in a Virtualized GPU Environment," in Proceedings of the IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), 2012.
[32] J. Jose, K. Hamidouche, X. Lu, S. Potluri, J. Zhang, K. Tomko, and D. Panda, "High Performance OpenSHMEM for Xeon Phi Clusters: Extensions, Runtime Designs and Application Co-design," in Cluster Computing (CLUSTER), 2014 IEEE International Conference on, Sept 2014.


More information

Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach

Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach Talk at OSC theater (SC 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Overview of the MVAPICH Project: Latest Status and Future Roadmap

Overview of the MVAPICH Project: Latest Status and Future Roadmap Overview of the MVAPICH Project: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) Meeting by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

MVAPICH2 and MVAPICH2-MIC: Latest Status

MVAPICH2 and MVAPICH2-MIC: Latest Status MVAPICH2 and MVAPICH2-MIC: Latest Status Presentation at IXPUG Meeting, July 214 by Dhabaleswar K. (DK) Panda and Khaled Hamidouche The Ohio State University E-mail: {panda, hamidouc}@cse.ohio-state.edu

More information

NVIDIA GPUDirect Technology. NVIDIA Corporation 2011

NVIDIA GPUDirect Technology. NVIDIA Corporation 2011 NVIDIA GPUDirect Technology NVIDIA GPUDirect : Eliminating CPU Overhead Accelerated Communication with Network and Storage Devices Peer-to-Peer Communication Between GPUs Direct access to CUDA memory for

More information

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters High Performance MPI Support in MVAPICH2 for InfiniBand Clusters A Talk at NVIDIA Booth (SC 11) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning Talk at Mellanox Theater (SC 16) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Unified Runtime for PGAS and MPI over OFED

Unified Runtime for PGAS and MPI over OFED Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K.

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K. Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K. Panda Department of Computer Science and Engineering The Ohio

More information

Performance Evaluation of InfiniBand with PCI Express

Performance Evaluation of InfiniBand with PCI Express Performance Evaluation of InfiniBand with PCI Express Jiuxing Liu Amith Mamidala Abhinav Vishnu Dhabaleswar K Panda Department of Computer and Science and Engineering The Ohio State University Columbus,

More information

Solutions for Scalable HPC

Solutions for Scalable HPC Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda Department of Computer Science and Engineering

More information

In the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K.

In the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K. In the multi-core age, How do larger, faster and cheaper and more responsive sub-systems affect data management? Panel at ADMS 211 Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory Department

More information

Accelerating HPL on Heterogeneous GPU Clusters

Accelerating HPL on Heterogeneous GPU Clusters Accelerating HPL on Heterogeneous GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline

More information

Designing Shared Address Space MPI libraries in the Many-core Era

Designing Shared Address Space MPI libraries in the Many-core Era Designing Shared Address Space MPI libraries in the Many-core Era Jahanzeb Hashmi hashmi.29@osu.edu (NBCL) The Ohio State University Outline Introduction and Motivation Background Shared-memory Communication

More information

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries How to Boost the Performance of Your MPI and PGAS s with MVAPICH2 Libraries A Tutorial at the MVAPICH User Group (MUG) Meeting 18 by The MVAPICH Team The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Per-call Energy Saving Strategies in All-to-all Communications

Per-call Energy Saving Strategies in All-to-all Communications Computer Science Technical Reports Computer Science 2011 Per-call Energy Saving Strategies in All-to-all Communications Vaibhav Sundriyal Iowa State University, vaibhavs@iastate.edu Masha Sosonkina Iowa

More information

High-Performance Heterogeneity/ Energy-Aware Communication for Multi-Petaflop HPC Systems

High-Performance Heterogeneity/ Energy-Aware Communication for Multi-Petaflop HPC Systems High-Performance Heterogeneity/ Energy-Aware Communication for Multi-Petaflop HPC Systems Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate

More information

High Performance Computing

High Performance Computing High Performance Computing Dror Goldenberg, HPCAC Switzerland Conference March 2015 End-to-End Interconnect Solutions for All Platforms Highest Performance and Scalability for X86, Power, GPU, ARM and

More information

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,

More information

Operational Robustness of Accelerator Aware MPI

Operational Robustness of Accelerator Aware MPI Operational Robustness of Accelerator Aware MPI Sadaf Alam Swiss National Supercomputing Centre (CSSC) Switzerland 2nd Annual MVAPICH User Group (MUG) Meeting, 2014 Computing Systems @ CSCS http://www.cscs.ch/computers

More information

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Hari Subramoni, Ping Lai, Sayantan Sur and Dhabhaleswar. K. Panda Department of

More information

The Future of Interconnect Technology

The Future of Interconnect Technology The Future of Interconnect Technology Michael Kagan, CTO HPC Advisory Council Stanford, 2014 Exponential Data Growth Best Interconnect Required 44X 0.8 Zetabyte 2009 35 Zetabyte 2020 2014 Mellanox Technologies

More information

MVAPICH2 Project Update and Big Data Acceleration

MVAPICH2 Project Update and Big Data Acceleration MVAPICH2 Project Update and Big Data Acceleration Presentation at HPC Advisory Council European Conference 212 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing High Performance DSM Systems using InfiniBand Features

Designing High Performance DSM Systems using InfiniBand Features Designing High Performance DSM Systems using InfiniBand Features Ranjit Noronha and Dhabaleswar K. Panda The Ohio State University NBC Outline Introduction Motivation Design and Implementation Results

More information

Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters *

Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters * Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters * Krishna Kandalla, Hari Subramoni, Gopal Santhanaraman, Matthew Koop and Dhabaleswar K. Panda Department of Computer Science and

More information

High-Performance Training for Deep Learning and Computer Vision HPC

High-Performance Training for Deep Learning and Computer Vision HPC High-Performance Training for Deep Learning and Computer Vision HPC Panel at CVPR-ECV 18 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Unifying UPC and MPI Runtimes: Experience with MVAPICH

Unifying UPC and MPI Runtimes: Experience with MVAPICH Unifying UPC and MPI Runtimes: Experience with MVAPICH Jithin Jose Miao Luo Sayantan Sur D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,

More information

Memcached Design on High Performance RDMA Capable Interconnects

Memcached Design on High Performance RDMA Capable Interconnects Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi- ur- Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan

More information

rcuda: an approach to provide remote access to GPU computational power

rcuda: an approach to provide remote access to GPU computational power rcuda: an approach to provide remote access to computational power Rafael Mayo Gual Universitat Jaume I Spain (1 of 60) HPC Advisory Council Workshop Outline computing Cost of a node rcuda goals rcuda

More information

10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G

10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G 10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G Mohammad J. Rashti and Ahmad Afsahi Queen s University Kingston, ON, Canada 2007 Workshop on Communication Architectures

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Gilad Shainer 2nd Annual MVAPICH User Group (MUG) Meeting, August 2014 Complete High-Performance Scalable Interconnect Infrastructure Comprehensive End-to-End Software Accelerators

More information

High Performance MPI-2 One-Sided Communication over InfiniBand

High Performance MPI-2 One-Sided Communication over InfiniBand High Performance MPI-2 One-Sided Communication over InfiniBand Weihang Jiang Jiuxing Liu Hyun-Wook Jin Dhabaleswar K. Panda William Gropp Rajeev Thakur Computer and Information Science The Ohio State University

More information

Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand

Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand Jiuxing Liu and Dhabaleswar K. Panda Computer Science and Engineering The Ohio State University Presentation Outline Introduction

More information

High Performance MPI-2 One-Sided Communication over InfiniBand

High Performance MPI-2 One-Sided Communication over InfiniBand High Performance MPI-2 One-Sided Communication over InfiniBand Weihang Jiang Jiuxing Liu Hyun-Wook Jin Dhabaleswar K. Panda William Gropp Rajeev Thakur Computer and Information Science The Ohio State University

More information

Performance Evaluation of InfiniBand with PCI Express

Performance Evaluation of InfiniBand with PCI Express Performance Evaluation of InfiniBand with PCI Express Jiuxing Liu Server Technology Group IBM T. J. Watson Research Center Yorktown Heights, NY 1598 jl@us.ibm.com Amith Mamidala, Abhinav Vishnu, and Dhabaleswar

More information

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems Carl Pearson 1, I-Hsin Chung 2, Zehra Sura 2, Wen-Mei Hwu 1, and Jinjun Xiong 2 1 University of Illinois Urbana-Champaign, Urbana

More information

RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits

RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits Sayantan Sur Hyun-Wook Jin Lei Chai D. K. Panda Network Based Computing Lab, The Ohio State University Presentation

More information

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda Network Based Computing Laboratory

More information

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Sayantan Sur, Matt Koop, Lei Chai Dhabaleswar K. Panda Network Based Computing Lab, The Ohio State

More information

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation

More information

Using MVAPICH2- X for Hybrid MPI + PGAS (OpenSHMEM and UPC) Programming

Using MVAPICH2- X for Hybrid MPI + PGAS (OpenSHMEM and UPC) Programming Using MVAPICH2- X for Hybrid MPI + PGAS (OpenSHMEM and UPC) Programming MVAPICH2 User Group (MUG) MeeFng by Jithin Jose The Ohio State University E- mail: jose@cse.ohio- state.edu h

More information

LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster

LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster H. W. Jin, S. Sur, L. Chai, and D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering

More information

Job Startup at Exascale: Challenges and Solutions

Job Startup at Exascale: Challenges and Solutions Job Startup at Exascale: Challenges and Solutions Sourav Chakraborty Advisor: Dhabaleswar K (DK) Panda The Ohio State University http://nowlab.cse.ohio-state.edu/ Current Trends in HPC Tremendous increase

More information

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 Leading Supplier of End-to-End Interconnect Solutions Analyze Enabling the Use of Data Store ICs Comprehensive End-to-End InfiniBand and Ethernet Portfolio

More information

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand Matthew Koop 1,2 Terry Jones 2 D. K. Panda 1 {koop, panda}@cse.ohio-state.edu trj@llnl.gov 1 Network-Based Computing Lab, The

More information

One-Sided Append: A New Communication Paradigm For PGAS Models

One-Sided Append: A New Communication Paradigm For PGAS Models One-Sided Append: A New Communication Paradigm For PGAS Models James Dinan and Mario Flajslik Intel Corporation {james.dinan, mario.flajslik}@intel.com ABSTRACT One-sided append represents a new class

More information

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X IXPUG 7 PresentaNon J. Hashmi, M. Li, H. Subramoni and DK Panda The Ohio State University E-mail: {hashmi.29,li.292,subramoni.,panda.2}@osu.edu

More information

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning GPU Technology Conference GTC 217 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big

More information

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011 The Road to ExaScale Advances in High-Performance Interconnect Infrastructure September 2011 diego@mellanox.com ExaScale Computing Ambitious Challenges Foster Progress Demand Research Institutes, Universities

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Smart Interconnect for Next Generation HPC Platforms Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting Mellanox Connects the World s Fastest Supercomputer

More information

Unified Communication X (UCX)

Unified Communication X (UCX) Unified Communication X (UCX) Pavel Shamis / Pasha ARM Research SC 18 UCF Consortium Mission: Collaboration between industry, laboratories, and academia to create production grade communication frameworks

More information

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT Krishna Kandalla (1), Hari Subramoni (1), Karen Tomko (2), Dmitry Pekurovsky

More information

Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach

Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach Talk at OpenFabrics Workshop (April 216) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

EVALUATING INFINIBAND PERFORMANCE WITH PCI EXPRESS

EVALUATING INFINIBAND PERFORMANCE WITH PCI EXPRESS EVALUATING INFINIBAND PERFORMANCE WITH PCI EXPRESS INFINIBAND HOST CHANNEL ADAPTERS (HCAS) WITH PCI EXPRESS ACHIEVE 2 TO 3 PERCENT LOWER LATENCY FOR SMALL MESSAGES COMPARED WITH HCAS USING 64-BIT, 133-MHZ

More information

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Sayantan Sur, Abhinav Vishnu, Hyun-Wook Jin, Wei Huang and D. K. Panda {surs, vishnu, jinhy, huanwei, panda}@cse.ohio-state.edu

More information

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations S. Narravula, A. Mamidala, A. Vishnu, K. Vaidyanathan, and D. K. Panda Presented by Lei Chai Network Based

More information

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning Talk at Mellanox booth (SC 218) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

UCX: An Open Source Framework for HPC Network APIs and Beyond

UCX: An Open Source Framework for HPC Network APIs and Beyond UCX: An Open Source Framework for HPC Network APIs and Beyond Presented by: Pavel Shamis / Pasha ORNL is managed by UT-Battelle for the US Department of Energy Co-Design Collaboration The Next Generation

More information

A Case for High Performance Computing with Virtual Machines

A Case for High Performance Computing with Virtual Machines A Case for High Performance Computing with Virtual Machines Wei Huang*, Jiuxing Liu +, Bulent Abali +, and Dhabaleswar K. Panda* *The Ohio State University +IBM T. J. Waston Research Center Presentation

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM

EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM Sreeram Potluri, Anshuman Goswami NVIDIA Manjunath Gorentla Venkata, Neena Imam - ORNL SCOPE OF THE WORK Reliance on CPU

More information

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012 Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA Outline Introduction to Multi-GPU Programming Communication for Single Host, Multiple GPUs Communication for Multiple Hosts, Multiple GPUs

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures

EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures Haiyang Shi, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda {shi.876, lu.932, panda.2}@osu.edu The Ohio State University

More information

ROCm: An open platform for GPU computing exploration

ROCm: An open platform for GPU computing exploration UCX-ROCm: ROCm Integration into UCX {Khaled Hamidouche, Brad Benton}@AMD Research ROCm: An open platform for GPU computing exploration 1 JUNE, 2018 ISC ROCm Software Platform An Open Source foundation

More information

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational

More information

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI Sourav Chakraborty, Hari Subramoni, Jonathan Perkins, Ammar A. Awan and Dhabaleswar K. Panda Department of Computer Science and Engineering

More information