Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters

2015 IEEE International Conference on Cluster Computing

Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan, Hari Subramoni, Ching-Hsiang Chu and Dhabaleswar K. (DK) Panda
Department of Computer Science and Engineering, The Ohio State University
{hamidouche.2, venkatesh.19, awan.10, subramoni.1, chu.368,

Abstract: GPUDirect RDMA (GDR) brings the high-performance communication capabilities of RDMA networks like InfiniBand (IB) to GPUs (referred to as "Device"). It enables IB network adapters to directly write/read data to/from GPU memory. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communication characteristics by providing shared memory address space abstractions, along with one-sided communication semantics. However, current approaches and designs of OpenSHMEM on GPU clusters do not take advantage of the GDR features, leading to inefficiencies and sub-optimal performance. In this paper, we analyze the performance of various OpenSHMEM operations with different inter-node and intra-node communication configurations (Host-to-Device, Device-to-Device, and Device-to-Host) on GPU-based systems. We propose novel designs that ensure truly one-sided communication for the different inter-/intra-node configurations identified above while working around the hardware limitations. To the best of our knowledge, this is the first work that investigates GDR-aware designs for OpenSHMEM communication operations. Experimental evaluations indicate 2.5X and 7X improvements in point-to-point communication for intra-node and inter-node transfers, respectively. The proposed framework achieves 2.2 µs for an intra-node 8-byte put operation from Host-to-Device, and 3.13 µs for an inter-node 8-byte put operation from GPU to remote GPU. With the Stencil2D application kernel from the SHOC benchmark suite, we observe a 19% reduction in execution time on 64 GPU nodes. Further, for the GPULBM application, we are able to improve the performance of the evolution phase by 53% and 45% on 32 and 64 GPU nodes, respectively.

Keywords: PGAS, OpenSHMEM, GPUDirect RDMA, CUDA

I. INTRODUCTION

The emergence of accelerators such as NVIDIA General-Purpose Graphics Processing Units (GPGPUs, or GPUs in short) is changing the landscape of supercomputing systems. This trend is evident in the TOP500 list released in July 2015, where 90 systems make use of accelerator/co-processor technology [1]. GPUs, being PCIe devices, have their own memory space and require data to be transferred to their memory through specific mechanisms. The Compute Unified Device Architecture (CUDA) [2] API is the most popular programming framework available for users to take advantage of GPUs. It provides mechanisms to compute on the GPU, synchronize threads on the GPU, and move data between the CPU and the GPU. In addition to the generic CUDA APIs, auxiliary features such as GPUDirect help expedite data transfers to/from GPU memory. GPUDirect is a set of features that enable efficient data movement among GPUs as well as between GPUs and peer PCI Express (PCIe) devices. CUDA 5.0 introduced the GPUDirect RDMA (GDR) feature, which allows InfiniBand network adapters to directly read from or write to GPU device memory while completely bypassing the host [3]. This has the potential to yield significant performance benefits, especially in the presence of the multiple communication configurations that GPU devices expose.
In these heterogeneous systems, data can be transferred Host-to-Host (H-H), Device-to-Device (D-D), Host-to-Device (H-D), and Device-to-Host (D-H). Further, each of these configurations can be either intra-node or inter-node. Scientific applications use CUDA in conjunction with high-level programming models like the Message Passing Interface (MPI) or Partitioned Global Address Space (PGAS) models. Usually, CUDA is used for the kernel computation and for data movement between the local CPU host and the GPU device, while MPI/PGAS is responsible for inter-process communication. Several MPI implementations [4, 5] now allow direct communication from GPU device memory and transparently improve the performance of GPU-GPU communication using techniques like CUDA IPC, GPUDirect RDMA, and pipelining, thus enabling applications to achieve better performance [6, 7]. However, several researchers have shown that the message passing paradigm may not be the best fit for all classes of applications. PGAS programming models, with their lightweight one-sided communication and low-overhead synchronization semantics, present an attractive alternative for developing data-intensive applications that may have an irregular communication pattern [8-10]. They have also been shown to benefit bandwidth-limited applications [11]. There are two categories of PGAS models: 1) language-based, such as Unified Parallel C (UPC) [12] and Co-Array Fortran [13], and 2) library-based, such as OpenSHMEM [14].

OpenSHMEM is an effort to bring together a variety of SHMEM and SHMEM-like implementations into an open standard. The OpenSHMEM memory model allows application developers to allocate and manage data objects within symmetric memory regions which are accessible to other processing elements (PEs) via standard OpenSHMEM library functions. It provides better programmability by allowing a process to access a data variable at a remote process by specifying the corresponding local symmetric variable. For current OpenSHMEM application programs that involve data movement between GPUs, the developer has to separately manage the data movement between GPU device memory and main memory at each process using CUDA, as well as the data movement between processes using OpenSHMEM. In other words, the current OpenSHMEM standard does not support symmetric allocation for heterogeneous memory systems like GPU-based clusters. These shortcomings severely limit the programmability of the OpenSHMEM model.
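To make this staging burden concrete, the following minimal sketch shows the pattern that the current model forces on the programmer, using only standard OpenSHMEM 1.x and CUDA runtime calls; the buffer names, sizes, and ring-neighbor exchange are illustrative rather than taken from any particular code.

```c
/*
 * Minimal sketch of the staging pattern described above, using only standard
 * OpenSHMEM 1.x and CUDA runtime calls. Buffer names, sizes, and the
 * ring-neighbor exchange are illustrative.
 */
#include <stdlib.h>
#include <shmem.h>
#include <cuda_runtime.h>

#define NELEMS 4096

int main(void)
{
    start_pes(0);
    int me   = shmem_my_pe();
    int peer = (me + 1) % shmem_n_pes();

    float *d_src, *d_dst;                       /* device buffers: not symmetric */
    cudaMalloc((void **)&d_src, NELEMS * sizeof(float));
    cudaMalloc((void **)&d_dst, NELEMS * sizeof(float));
    cudaMemset(d_src, 0, NELEMS * sizeof(float));

    float *h_send = (float *)malloc(NELEMS * sizeof(float));    /* local staging */
    float *h_recv = (float *)shmalloc(NELEMS * sizeof(float));  /* symmetric, host heap */

    /* Step 1: the user stages the GPU data on the host with CUDA ... */
    cudaMemcpy(h_send, d_src, NELEMS * sizeof(float), cudaMemcpyDeviceToHost);

    /* Step 2: ... moves it between processes with OpenSHMEM (host-to-host) ... */
    shmem_putmem(h_recv, h_send, NELEMS * sizeof(float), peer);
    shmem_barrier_all();                        /* target must learn the data arrived */

    /* Step 3: ... and the *target* must copy it into its own GPU memory,
     * which is what breaks the one-sided semantics discussed above. */
    cudaMemcpy(d_dst, h_recv, NELEMS * sizeof(float), cudaMemcpyHostToDevice);

    shfree(h_recv);
    free(h_send);
    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}
```

The explicit device-to-host and host-to-device copies, together with the synchronization that makes the second copy safe, are exactly the steps the designs proposed in this paper aim to remove from the application.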

TABLE I. FEATURES, DESIGNS, AND CONFIGURATION SUPPORT OF EXISTING AND PROPOSED SOLUTIONS

| | Naive: Intra-node | Naive: Inter-node | Host-based Pipeline [15]: Intra-node | Host-based Pipeline [15]: Inter-node | Proposed: Intra-node | Proposed: Inter-node |
| Configurations | D-D, H-D, D-H | D-D, H-D, D-H | D-D, H-D, D-H | D-D | D-D, H-D, D-H | D-D, H-D, D-H |
| Schemes | user cudaMemcpy | user cudaMemcpy | IPC | pipeline | IPC, GDR | GDR, pipeline, proxy |
| Performance | Poor | Poor | Medium | Poor | Good | Good |
| True one-sided | Poor | Poor | Good | Poor | Good | Good |
| Productivity | Poor | Poor | Good | Good | Good | Good |

It requires users to employ intricate pipelining designs and to take advantage of advanced features like CUDA IPC to achieve optimal GPU-GPU communication performance, which negates the programmability benefits of PGAS. Furthermore, the current model nullifies the benefits of asynchronous one-sided communication by requiring the target process to perform a CUDA memory copy from the host OpenSHMEM memory to the device CUDA memory in order to complete the transfer.

A. Motivation

Recently, researchers have proposed simple extensions to the OpenSHMEM [15] and UPC [16] memory models to allow symmetric allocation on GPU memories. The extension for OpenSHMEM mainly introduces the concept of a Domain, which is passed to shmalloc to specify where the symmetric allocation will be performed: on the Host or on the GPU. Using the CUDA Unified Virtual Addressing (UVA) feature, as MPI libraries do, the authors have also proposed a CUDA-Aware OpenSHMEM runtime that hides the complexity of GPU programming. It transparently uses a pipelined scheme involving cudaMemcpy D-H, InfiniBand H-H, and cudaMemcpy H-D transfers for efficient communication. Although this simple extension ensures productivity and reduces the burden on the programmer, the runtime design has undesirable aspects. From Figure 1, we can clearly see that this design is non-optimal for current GPU clusters with GPUDirect RDMA capability: 1) It requires the involvement of the target process in the last step of the pipeline to perform the cudaMemcpy. This requirement removes the true one-sided nature of the OpenSHMEM semantics and introduces an implicit synchronization between the source and target, reducing the potential for computation/communication overlap. 2) Although the pipeline design is efficient for large message sizes, it adds latency overhead in the small message range, which reduces its efficiency. Finally, state-of-the-art designs primarily consider communication within a single domain (H-H or D-D configurations) and do not optimize inter-domain communication, which involves the H-D and D-H configurations.

TABLE II. Latency of a 4-byte put operation at the IB and OpenSHMEM levels for inter-node data movement between hosts and GPUs, in µs (rows: IB Send/Recv, OpenSHMEM Put; columns: Host-Host, GPU-GPU).

On the other hand, as indicated in [17, 18], GDR has the potential to deliver very low latency, compared to transfers staged through the host, without involvement from the remote process. However, its bandwidth is severely limited when compared to the bandwidth that an InfiniBand HCA offers. Table II shows the inefficiency of the current OpenSHMEM runtimes for GPU systems with GDR; at the same time, it highlights the potential impact of GDR on data movement from/to GPUs.

Fig. 1. Internode Host-based Pipeline Design [15].
B. Challenges and Contributions

Existing solutions and runtimes for the OpenSHMEM model have not been designed with GPU/GDR capability awareness in mind. In other words, the current solutions are unsuitable for GDR-enabled systems and thus achieve sub-optimal performance. The limitations posed by the state-of-the-art techniques lead us to the following challenges:

- Can the OpenSHMEM memory model support communication with heterogeneous memories, such as H-D and D-H, on NVIDIA GPU clusters?
- Is it possible to design truly one-sided communication to/from GPUs?
- Can new designs be proposed to efficiently take advantage of the GPUDirect RDMA feature?
- What are the alternative designs for intra-node and inter-node communication with the different configurations: H-H, D-D, H-D, and D-H?
- Can the proposed OpenSHMEM runtime improve the performance of applications?

Building on top of the domain-based extension to the OpenSHMEM memory model, in this paper we tackle the above challenges and propose a novel framework to efficiently design an OpenSHMEM runtime for GPU-based systems using GDR. To the best of our knowledge, this is the first paper exploiting GDR features in designing an efficient OpenSHMEM runtime. This paper makes the following contributions:

- Propose GDR-based designs to efficiently support OpenSHMEM communication from/to GPUs for all configurations.
- Design a novel and efficient truly one-sided communication runtime for both intra-node and inter-node configurations.
- Propose hybrid and proxy-based designs to overcome current hardware limitations on GPU-based systems.
- Redesign the LBM application to use OpenSHMEM directly from/to GPU memories and show the benefits of such designs on an end application.

Table I highlights and compares the features and designs of existing OpenSHMEM solutions for GPU clusters with the proposed solution. Naive refers to the basic OpenSHMEM model, where users explicitly manage and copy data from/to GPUs while inter-node communication happens exclusively host to host. Host-based Pipeline is the CUDA-Aware OpenSHMEM design proposed in [15]. The proposed framework is designed on top of MVAPICH2-X [19]. The evaluation results show that it achieves 2.5X and 7X latency improvements in the small and medium message range for intra-node and inter-node communication, respectively. On 64 GPU nodes, we show a 19% improvement in the execution time of the Stencil2D application kernel from the SHOC suite. The LBM application shows 53% and 45% improvements in the execution time of the evolution phase on 32 and 64 GPU nodes, respectively.

II. BACKGROUND

A. GPU Node Architecture and GPUDirect Technology

Current generation GPUs from NVIDIA are connected as peripheral devices on the I/O bus (PCI Express). Communication between a GPU and the host, and between two GPUs, happens over the PCIe bus. NVIDIA's GPUDirect technology provides a set of features that enable efficient communication among GPUs used by different processes and between GPUs and other devices like network adapters. With CUDA 4.1, NVIDIA addressed the problem of inter-process GPU-to-GPU communication within a node through CUDA Inter-Process Communication (IPC). A process can map and directly access GPU memory owned by other processes on the same node, similar to shared memory on the host. The data movement between the processes' device memories can happen without involving main memory. In CUDA 5.0, GPUDirect was extended to allow third-party PCIe devices to directly read/write data from/to GPU device memory. This feature is called GDR and is currently supported with Mellanox InfiniBand network adapters. It provides a fast path for moving data from GPU device memory onto the network that completely bypasses the host. These transfers between the GPU and the IB adapter are implemented as peer-to-peer (P2P) PCIe transfers.

B. GPUDirect RDMA and PCIe Bottlenecks

Although GDR provides a low-latency path for inter-node GPU-GPU data movement, its performance for large data transfers is limited by the bandwidth supported for PCIe P2P exchanges on modern node architectures [17]. The performance of P2P transfers, in MB/s and as a percentage of the peak FDR IB bandwidth, is presented in Table III (for example, a P2P read of 3,421 MB/s corresponds to 3,421/6,397, or about 54%, of the 6,397 MB/s that the FDR adapter offers). We consider scenarios where the IB card and the GPU are connected to the same socket (intra-socket configuration) as well as the inter-socket configuration, where the IB card and the GPU are connected to different sockets on the same node. These issues severely limit the performance achieved by GPUDirect RDMA for large message transfers. The performance of both P2P write and read operations is severely limited when the devices are connected to different sockets. Note that these artifacts are specific to the node architecture and not to the GPU or the IB adapter; similar limitations can arise between any two PCIe devices involved in P2P transfers.

TABLE III. PEER-TO-PEER PERFORMANCE ON THE IVYBRIDGE ARCHITECTURE AND PERCENTAGE OF THE BANDWIDTH OFFERED BY AN FDR IB ADAPTER (6,397 MB/s)

| IvyBridge (IVB) | Intra-Socket | Inter-Socket |
| P2P Read | 3,421 MB/s (54%) | 247 MB/s (4%) |
| P2P Write | 6,396 MB/s (100%) | 1,179 MB/s (19%) |
C. PGAS and OpenSHMEM Programming Models

Partitioned Global Address Space (PGAS) models provide a logical shared memory abstraction on a physically distributed memory system, making it easier to program. SHMEM is one such popular PGAS model with several successful implementations. OpenSHMEM [14] is an effort to standardize SHMEM and make it more widely useful for the community. OpenSHMEM operates on a symmetric memory address space and allows processes, or processing elements (PEs), to see each other's variables with a common name, with each PE having its own local copy of the variables. These are called symmetric variables and are allocated collectively. As in C, symmetric objects can be global or static variables, and they can also be allocated dynamically from a symmetric heap using routines like shmalloc and shmemalign. OpenSHMEM defines one-sided point-to-point (put and get) and collective communication operations for data movement between symmetric variables. The put operations in OpenSHMEM return when the data has been copied out of the source buffer; they need not be complete at the target. Completion at the target is ensured using explicit synchronization. The get operations return only when the data is available for use in the local buffer and hence do not require additional synchronization. OpenSHMEM also provides atomics and lock routines that allow the implementation of critical regions.
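The following minimal, host-only sketch illustrates these semantics with the standard API; the ring-style exchange and variable names are illustrative.

```c
/*
 * Minimal host-only sketch of the put/get completion semantics described
 * above, using the standard OpenSHMEM API; the ring exchange is illustrative.
 */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    start_pes(0);
    int me   = shmem_my_pe();
    int peer = (me + 1) % shmem_n_pes();

    /* Symmetric allocation: every PE holds its own copy of 'counter'. */
    long *counter = (long *)shmalloc(sizeof(long));
    *counter = -1;
    shmem_barrier_all();

    /* put returns once the source may be reused; completion at the target
     * is only guaranteed after explicit synchronization. */
    long val = (long)me;
    shmem_long_put(counter, &val, 1, peer);
    shmem_quiet();           /* force remote completion of the outstanding put */
    shmem_barrier_all();     /* all PEs now see the delivered values */

    /* get blocks until the data is locally available: no extra sync needed. */
    long peer_counter;
    shmem_long_get(&peer_counter, counter, 1, peer);
    printf("PE %d: counter = %ld, peer's counter = %ld\n", me, *counter, peer_counter);

    shfree(counter);
    return 0;
}
```

The shmem_quiet/shmem_barrier_all pair after the put, and the absence of any synchronization after the get, mirror the completion rules stated above.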

D. MVAPICH2-X Runtime

MVAPICH2-X [19, 20] provides a unified high-performance runtime that supports both MPI and PGAS programming models on InfiniBand clusters. It enables developers to port parts of large MPI applications that are suited to the PGAS programming model. This minimizes the development overheads that have been a substantial deterrent to porting MPI applications to PGAS models. The unified runtime also delivers superior performance compared to using separate MPI and PGAS libraries by optimizing the use of network and memory resources [21, 22].

III. GDR-AWARE OPENSHMEM RUNTIME DESIGNS

The availability of a high-performance runtime is a key factor for wider acceptance of any programming model. The efficiency of designs for OpenSHMEM communication routines can vary widely based on the features and constraints of the underlying GPU programming platform and hardware configuration. In this section, we discuss and propose different alternatives to design efficient and truly one-sided GDR-aware OpenSHMEM communication for both intra-node and inter-node configurations.

A. Enhanced Initialization and Heap Allocation

The memory initialization of OpenSHMEM was extended to allow processes to create a symmetric heap on GPU memory in addition to the heap created on host memory. Like the symmetric memory on the host, the GPU heap size is controlled by a runtime parameter. As described in the following subsections, we propose hybrid designs that choose between GDR and CUDA IPC for the different configurations. This allows us to take advantage of the best path and avoid the P2P hardware bottleneck described in Section II-B. In order to enable RDMA support with the GDR path, each process registers both heaps with the IB HCA, which creates the memory descriptors (lkey and rkey). To reduce the cost of memory registration, which is an expensive operation, we utilize the registration cache in MVAPICH2-X. For intra-node configurations, each process creates a CUDA IPC handle for its GPU heap. In the last step of the initialization, the memory descriptors and the IPC handles are exchanged among all processes. Each process creates a local table with the RDMA descriptors for both the Host and GPU heaps as well as the IPC handles of the other processes. To perform symmetric allocations, the user calls shmalloc(size, domain). Explicitly specifying the domain is the only modification needed to make an OpenSHMEM program CUDA-Aware. Thanks to this CUDA-Aware OpenSHMEM concept, no other changes are needed, and the same API calls can be used for both Host and GPU communication.
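From the application's perspective, allocation and communication then look as in the following sketch; the two-argument shmalloc and the SHMEM_GPU_DOMAIN constant belong to the proposed extension [15] rather than the OpenSHMEM standard, and the exact names are illustrative.

```c
/*
 * Illustrative application view of the domain-based allocation described in
 * Section III-A. The two-argument shmalloc and the domain constant come from
 * the proposed extension [15]; the names are assumptions, not a standard API.
 */
#include <shmem.h>
#include <cuda_runtime.h>

#define NELEMS (1 << 20)

int main(void)
{
    start_pes(0);
    int me   = shmem_my_pe();
    int peer = (me + 1) % shmem_n_pes();

    /* Symmetric heap on the GPU: d_buf is a device pointer on every PE. */
    float *d_buf = (float *)shmalloc(NELEMS * sizeof(float), SHMEM_GPU_DOMAIN);
    cudaMemset(d_buf, 0, NELEMS * sizeof(float));
    shmem_barrier_all();

    /* The same put call used for host memory: via UVA the runtime detects
     * that both addresses are device memory and transparently selects the
     * GDR, IPC, or proxy path described in the rest of this section. */
    shmem_putmem(d_buf, d_buf, NELEMS * sizeof(float), peer);
    shmem_barrier_all();

    shfree(d_buf);
    return 0;
}
```

Compared with the staging sketch shown in the Introduction, the explicit cudaMemcpy calls and the target-side copy disappear; the designs below decide how the bytes actually move.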
B. Design Alternatives for Intranode Communication

In this section, we describe the different techniques and schemes that enable efficient intra-node communication. Although we use the put operation here to explain the concepts, our designs are equally applicable to the get operation.

Fig. 2. Intranode Host-to-Device (H-D) using GPUDirect RDMA.

Host-to-Device (H-D): Figure 2 illustrates the designs used for H-D put communication. During an H-D put operation, the source buffer is on the Host heap while the destination buffer is on the Device (GPU) heap. Using the UVA feature, the source process determines that the destination address is on the GPU. It then translates the local destination address to the remote address using the information exchanged during initialization and looks up the memory descriptors as well as the IPC handle of the remote process; this information is stored in a local table. Depending on the message size, the source process posts a GDR transfer or a CUDA IPC copy. Due to the P2P bottleneck, we define different thresholds below which the GDR capabilities are used for a read or a write operation. These thresholds are runtime parameters and can be tuned for different architectures. If the message size is less than the GDR threshold, the source process posts an RDMA write operation to the target process. This operation performs a loopback, as both source and target processes are on the same node. Further, as the remote address is on the GPU, the HCA directly writes the data to the GPU using the GDR capability, bypassing host memory. However, if the message size is larger than the GDR threshold for a put operation, the source process uses the IPC handle information for the destination buffer and performs a cudaMemcpy. The same logic and operations apply to a D-H get operation. Note that both designs enable truly one-sided operations in which the target process is not involved in the communication progress.

Fig. 3. Intranode Device-to-Host (D-H) Hybrid Design with GDR Loopback and cudaMemcpy.

Device-to-Host (D-H): In this configuration, the source buffer is on the GPU heap whereas the destination buffer is on the Host heap. Similar to the H-D configuration, we propose a hybrid design for the D-H configuration. The GDR code path uses the same logic, the only difference being the threshold, as this operation involves a P2P read from the GPU. However, this configuration faces a challenge on the IPC side of the design: IPC maps only device buffers, not host buffers, into another process's address space, while cudaMemcpy requires both addresses to belong to the same process. One possible solution is to involve the target process and ask it to perform the copy from the IPC-mapped source buffer to its local host memory. Although this solution might exhibit good performance, it is clearly not suitable, as it violates the one-sided requirement. A better alternative, which ensures both performance and the true one-sided property, is depicted in Figure 3. In this design, the source process first calls shmem_ptr to find the shared memory address of the destination buffer. Using this shared memory address, the source process directly performs a cudaMemcpy from its device buffer to the shared memory buffer. As this shared memory address corresponds to the destination buffer, no further operations are needed to complete the data transfer.
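The following sketch summarizes this hybrid D-H put path from the source process's point of view; the threshold value, the helper name, and the fallback behavior are illustrative and do not reflect the runtime's internal interfaces.

```c
/*
 * Sketch of the intra-node D-H hybrid path described above, from the source
 * process's point of view. dest_host is a host-domain symmetric address,
 * src_dev a local device buffer; the threshold and function name are
 * illustrative assumptions.
 */
#include <stddef.h>
#include <shmem.h>
#include <cuda_runtime.h>

#define GDR_READ_THRESHOLD (8 * 1024)   /* assumed tuning parameter */

void put_device_to_host(void *dest_host, const void *src_dev, size_t bytes, int pe)
{
    void *mapped = shmem_ptr(dest_host, pe);   /* target's host heap, if mapped */

    if (bytes <= GDR_READ_THRESHOLD || mapped == NULL) {
        /* Small message: let the runtime post the GDR loopback write, so the
         * HCA reads the GPU buffer and writes the target's host heap directly
         * (device-pointer sources are valid only with the CUDA-aware extension). */
        shmem_putmem(dest_host, src_dev, bytes, pe);
    } else {
        /* Large message: the destination is visible through shared memory, so
         * a single D2H cudaMemcpy lands the data in the target's heap without
         * any action from the target process. */
        cudaMemcpy(mapped, src_dev, bytes, cudaMemcpyDeviceToHost);
    }
}
```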

Device-to-Device (D-D): D-D communication uses the hybrid design proposed earlier for H-D communication, with both the source and destination buffers residing on the GPU heap. However, it uses the smaller of the GDR (read or write) thresholds.

C. Internode Communication Alternative Designs

For inter-node communication, we propose a hybrid design that uses three different protocols. Each protocol exhibits different behavior depending on the message size, the communication pattern (put/get; H-D, D-H, and D-D), and the configuration of the node, i.e., the placement of the GPU and the IB HCA (same socket or different sockets).

Direct GDR Protocol: This protocol is used for small and medium message ranges with the intra-socket node configuration. It is used for both put and get operations in the H-D, D-H, and D-D configurations. This design ensures very low latency, as it posts an RDMA operation directly from the source buffer to the destination buffer irrespective of their location (GPU or Host). Similar to the loopback design, Direct GDR uses different thresholds for put and get operations. Figure 4 illustrates, with a solid green line, the path used by this design for a D-D configuration (the same design applies to the other configurations). As we can see, this design preserves the true one-sided property of OpenSHMEM for GPU communication.

Fig. 4. Inter-Node Designs with GDR: Direct GDR and Pipeline GDR write.

Pipeline GDR write Protocol: As indicated in Table III, the bottleneck when the HCA writes to GPU memory is not severe for the intra-socket configuration. Based on this observation, we propose an improved pipeline design in which the data is copied onto pre-registered buffers on the host using CUDA IPC, to avoid the P2P read bottleneck, and is then written directly to the destination GPU memory using a GDR write, as shown with the dotted lines in Figure 4. This design avoids the P2P read bottleneck and thus targets only a subset of configurations: it is used for put D-D and D-H operations in the intra-socket node configuration. In our earlier work [17], we used a similar design in the context of MPI two-sided operations, where a rendezvous protocol synchronizes the sender and receiver processes. The current design, in contrast, is truly one-sided, as the target process is not involved in the communication. Further, since a put operation returns when the source buffer is ready for reuse, with the proposed design the source process returns from a put operation as soon as the last IPC cudaMemcpy is complete and the RDMA write operation is posted.
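The source-side logic of this protocol can be sketched as follows; the chunk size, the staging depth, and the post_gdr_write placeholder (which stands in for posting the verbs-level RDMA write into the remote GPU heap) are illustrative assumptions rather than the actual implementation.

```c
/*
 * Source-side sketch of the Pipeline-GDR-write idea described above.
 * post_gdr_write() is a placeholder standing in for an IB RDMA write whose
 * destination is remote GPU memory; chunk size, staging depth, and all names
 * are illustrative.
 */
#include <stddef.h>
#include <cuda_runtime.h>

#define CHUNK (512 * 1024)
#define DEPTH 2                                   /* pre-registered staging slots */

static char host_stage[DEPTH][CHUNK];             /* assumed registered with the HCA */

/* Placeholder for the verbs-level RDMA write into the remote GPU heap. */
static void post_gdr_write(const void *local_host, size_t len,
                           char *remote_gpu, int pe)
{
    (void)local_host; (void)len; (void)remote_gpu; (void)pe;
}

void pipelined_put_d2d(char *remote_gpu, const char *local_gpu, size_t bytes, int pe)
{
    cudaStream_t stream[DEPTH];
    for (int i = 0; i < DEPTH; i++)
        cudaStreamCreate(&stream[i]);

    size_t off = 0;
    int slot = 0;
    while (off < bytes) {
        size_t len = (bytes - off < CHUNK) ? bytes - off : CHUNK;

        /* Stage the chunk on the host to avoid the PCIe P2P read bottleneck. */
        cudaMemcpyAsync(host_stage[slot], local_gpu + off, len,
                        cudaMemcpyDeviceToHost, stream[slot]);
        cudaStreamSynchronize(stream[slot]);

        /* Write the staged chunk straight into the remote GPU heap (GDR write);
         * the target process never participates. In the real runtime a slot is
         * reused only after the matching RDMA completion is observed. */
        post_gdr_write(host_stage[slot], len, remote_gpu + off, pe);

        off += len;
        slot = (slot + 1) % DEPTH;
    }

    /* The put may return here: the source data has been copied out and the
     * last RDMA write has been posted. */
    for (int i = 0; i < DEPTH; i++)
        cudaStreamDestroy(stream[i]);
}
```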
Proxy-based Protocol: The two designs above are not sufficient to efficiently handle all combinations of node configurations and communication patterns. As indicated above, in earlier work [17] we addressed the P2P limitation for the different node configurations in the context of MPI two-sided communication, using hybrid designs involving GPUDirect RDMA and host-based pipelining. Key additional features desired in the context of the OpenSHMEM model are asynchronous progress and truly one-sided communication. One way to achieve this is to use the per-process service thread available with the reference implementation of OpenSHMEM to progress communication. However, this consumes additional CPU resources at each process and involves locking overheads, as the service thread and the main process share the communication channel.

In light of these requirements and challenges, we propose a proxy-based framework to support the OpenSHMEM model on GDR-enabled clusters. The proxy uses CUDA IPC memory copies and RDMA transfers to efficiently move data from/to GPUs while working around the bottlenecks. This extends our work addressing similar issues in the context of Intel's many-core co-processors [23]. Figure 5 depicts a high-level representation of the proxy-based framework. The key differences between the proposed framework and our earlier work are the way memory accesses are managed and how progress is made. When the symmetric heaps are created, the proxy process running on the node maps the memory of all the processes running on that node into its address space using the CUDA IPC API. When there are multiple GPU devices per node, the proxy maintains a context on each GPU and keeps a mapping between MPI processes and the GPU device each of them uses. We avoid the overheads of context switching, as the IPC mapping is performed only during heap creation. When an OpenSHMEM one-sided communication operation is issued, the source process passes a signal to the remote proxy with information about the source and target buffers. The proxy uses different communication paths depending on the location of the source buffer, the location of the destination buffer, the architecture of the node (location of the GPU and IB card), and the communication pattern. For instance, for a D-D get operation on an intra-socket node configuration, the source process signals the remote proxy to copy the data from the device to its own local memory using IPC and then write the data directly to the source GPU in a pipelined manner, as depicted in Figure 5; in other words, the remote proxy executes the Pipeline GDR write protocol in reverse. In an inter-socket node configuration, the remote proxy sends the data to the source's pre-registered host memory, from which it is staged to the final destination with a local IPC cudaMemcpy. Note that we chose to involve the source process in the progress because a shmem get operation is blocking and the source has to wait for its completion. The target process, however, is not involved in the progress, which leads to fully asynchronous and truly one-sided behavior. The proxy progresses communication operations on behalf of all processes on the node. As the proxy is used only for large-message communication, a single proxy is enough to saturate the PCIe and network bandwidths; small messages are handled directly by the processes themselves.
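As an illustration, the signal passed from a source PE to the remote proxy can be thought of as a small descriptor of the following form; the paper does not specify the exact format, so every field name here is an assumption.

```c
/*
 * Illustrative layout of the request a source PE might hand to the remote
 * proxy; the exact format is not given in the paper, so all field names are
 * assumptions. Heap offsets rather than raw pointers are shown because the
 * proxy resolves addresses within the IPC-mapped heaps it already holds.
 */
#include <stdint.h>

typedef enum { PROXY_OP_PUT, PROXY_OP_GET } proxy_op_t;

typedef struct {
    proxy_op_t op;           /* put or get */
    int        src_pe;       /* issuing PE */
    int        dst_pe;       /* PE owning the destination buffer */
    int        src_on_gpu;   /* 1 if the source address lies in the GPU heap */
    int        dst_on_gpu;   /* 1 if the destination address lies in the GPU heap */
    uint64_t   src_offset;   /* offset of the source buffer within its heap */
    uint64_t   dst_offset;   /* offset of the destination buffer within its heap */
    uint64_t   bytes;        /* message size; the proxy serves only large messages */
} proxy_request_t;
```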

Fig. 5. Proxy-based Framework to support efficient large message size communication with OpenSHMEM on GPU clusters.

D. Support for Atomic Operations

The OpenSHMEM model provides atomic operations, such as fetch-and-add and compare-and-swap, to implement locks and critical sections as well as synchronization methods. For 64-bit data, InfiniBand HCAs offer high-performance hardware support for these atomic operations. We take advantage of this feature, together with GDR, to perform these atomics when the data resides on GPUs. For data sizes of less than 64 bits, we borrow the host design, which uses masking techniques to take advantage of the RDMA atomic support. Because atomics are mostly used for data sizes of at most 64 bits, this paper does not handle atomics on larger data.

IV. APPLICATION REDESIGN WITH OPENSHMEM

In order to evaluate the benefits of our designs, we have redesigned the GPU version of the Lattice Boltzmann Method (LBM) application with OpenSHMEM. The existing version of GPULBM is a parallel, distributed CUDA implementation for multiphase flows with large density ratios [24]. This version is CUDA-Aware MPI code that uses MPI two-sided operations (send/recv) to exchange data between processes. It is an iterative application that operates on 3D data grids, with the data decomposition done along the Z axis. The iterative phase is called Evolution. Evolution time is the time spent in the main loop of the code, which dominates the runtime, as the application runs for a large number of iterations in a real-world run. The Evolution phase involves three exchanges in each timestep/iteration: an exchange of the Laplacian of the phase, phi (1 element); an exchange of the phase distribution function, f (1 element); and another exchange of the phase and momentum distribution functions, f and g (6 elements). The size of the messages involved in each exchange is the product of the X and Y dimensions of the grid, the corresponding number of elements involved in the exchange, and the size of the float datatype. To produce an OpenSHMEM or hybrid MPI+OpenSHMEM version of the application, we first need to handle the memory allocation. To do so, we track all the GPU memory allocations (cudaMalloc), compute their accumulated size, and replace them with a symmetric heap allocation on the GPU domain using the OpenSHMEM shmalloc call. We then replace the different MPI point-to-point operations with the appropriate shmem_putmem calls to ensure a lightweight and asynchronous exchange of the data from/to GPUs, as sketched below.
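A hedged before/after sketch of one halo exchange is shown below; the variable names and the neighbor pattern are illustrative and are not taken from the GPULBM source code.

```c
/*
 * Hedged before/after sketch of the halo-exchange conversion described above.
 * Variable names (d_f, halo_elems, up/down neighbors) are illustrative; d_f in
 * the OpenSHMEM version is assumed to come from the GPU-domain symmetric heap
 * (the extension of [15]).
 */
#include <mpi.h>
#include <shmem.h>

/* Before: CUDA-aware MPI two-sided exchange of a device-resident halo layer. */
void exchange_mpi(float *d_f, int halo_elems, int up, int down, MPI_Comm comm)
{
    MPI_Request req[2];
    MPI_Irecv(d_f,              halo_elems, MPI_FLOAT, up,   0, comm, &req[0]);
    MPI_Isend(d_f + halo_elems, halo_elems, MPI_FLOAT, down, 0, comm, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}

/* After: a one-sided put writes the boundary layer straight into the
 * neighbor's GPU memory; a barrier (or quiet plus point-to-point
 * synchronization) closes the exchange step of each iteration. */
void exchange_shmem(float *d_f, int halo_elems, int down)
{
    shmem_putmem(d_f,                          /* neighbor's halo slot (symmetric) */
                 d_f + halo_elems,             /* my boundary layer on the GPU */
                 (size_t)halo_elems * sizeof(float), down);
    shmem_barrier_all();
}
```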
V. PERFORMANCE EVALUATION

A. Experimental Setup

The Wilkes cluster has been used for the performance evaluation. Wilkes was deployed in November 2013 and is the fastest academic cluster in the United Kingdom. The cluster is partitioned into different configurations; for our purposes we use the Tesla partition, which has 128 nodes. Each node is a dual-socket machine with 6-core Intel IvyBridge processors, equipped with 2 NVIDIA Tesla K20 GPUs and 2 FDR IB HCAs. For the evaluation, we compare our designs, labeled Enhanced-GDR, with the earlier host-based pipeline (Host-Pipeline) scheme presented in [15]. The Enhanced-GDR framework is a hybrid scheme that includes all the designs presented in Section III. Depending on the message size and the configuration, the framework is tuned to select the best protocol.

B. Micro-benchmark Level Evaluation

We first present results using the OMB-GPU [25] micro-benchmark suite, which extends the OMB suite with GPU support. Performance is shown using point-to-point micro-benchmarks for both Put and Get operations in different configurations.

Intranode Evaluation: Figure 6 illustrates the performance of Put and Get operations in the H-D configuration. For the small and medium message ranges, for both Put and Get operations, the GDR-based design achieves over 2X improvement compared to the default design. As shown in Figures 6(a) and 6(c), with a 4-byte message size, the GDR-based design achieves a latency of 2.4 µs and 2.02 µs for Put and Get operations, respectively. In comparison, the default design, which uses copies based on CUDA IPC, results in a latency of 6.2 µs for the same message size. For larger message sizes, the performance of the H-D Put operation (Figure 6(b)) and the D-H Get operation (Figure 7(d)) is on par with the default design, as both use the CUDA IPC copy scheme. However, the situation is quite different for large message transfers with D-H Put and H-D Get, as shown in Figures 7(b) and 6(d), respectively. Taking advantage of the shared memory design presented in Section III, the proposed scheme reduces the latency of large transfers by 40%.

Internode Evaluation: Figure 8(a) shows that the GDR-based solution, with the Direct GDR protocol introduced in Section III, reduces the latency of the put operation for small message sizes by 7X. The latency of a shmem_putmem of 8 bytes from a GPU to a remote GPU is reduced from 20.9 µs to 3.13 µs. Further, a 2KB message transfer completes in under 4 µs. Similar benefits and trends are observed for the Get operation, as shown in Figure 8(c). When the message size crosses the GDR thresholds, the framework automatically selects the pipeline schemes to avoid the P2P bottlenecks. For the Put operation with large messages, the Pipeline GDR write protocol is selected; as the slowest path is the cudaMemcpy, which exists in both the existing and new designs, both designs lead to the same performance, as shown in Figure 8(b). For the Get operation, the remote proxy is involved in the progress, following the proxy-based protocol. From Figure 8(d), we can clearly see that our proposed design does not incur any overhead and efficiently avoids the P2P bottlenecks.

Fig. 6. Comparison of latency performance using the existing host-based Pipelining and the proposed GDR-based designs for intra-node H-D Put and Get operations: (a) Put, small messages; (b) Put, large messages; (c) Get, small messages; (d) Get, large messages.

Fig. 7. Comparison of latency performance using the existing host-based Pipelining and the proposed GDR-based designs for intra-node D-H Put and Get operations: (a) Put, small messages; (b) Put, large messages; (c) Get, small messages; (d) Get, large messages.

As indicated in earlier sections, the existing solution does not handle inter-domain configurations such as H-D and D-H for inter-node communication. Thus, Figure 9 shows only the performance of the proposed design. Similar to the D-D configuration, Put and Get operations achieve very good performance in the small and medium message ranges for the H-D and D-H configurations. The proposed design achieves 2.81 µs for an inter-node H-D Put operation of 8 bytes and 3.7 µs for 4KB transfers.

Overlap Evaluation: To demonstrate the true one-sidedness of the proposed design, we evaluate the overlap achieved during a Put operation. The benchmark uses 2 processes; the source process issues a put operation to the target process, which is busy computing. As shown in Figures 10(a) and 10(b), for both medium (8 KB) and large (1 MB) messages, respectively, the communication time of the existing solution increases with the computation time at the target, whereas the proposed design maintains the same communication time regardless of the target's behavior. This confirms that the proposed design achieves truly one-sided communication progress and leads to 100% overlap. Note that both designs are evaluated without enabling the service thread available with the OpenSHMEM implementation. The service thread might help the existing design achieve better overlap; however, as stated in Section III, it leads to a significant degradation in application efficiency, as the threads consume half of the CPU resources.

C. Application Level Evaluation

In addition to the redesigned LBM application presented in Section IV, we considered the Stencil2D application benchmark from the Scalable Heterogeneous Computing (SHOC) Benchmark Suite to evaluate the benefits of the proposed runtime on application performance. Figure 11 clearly shows the advantage of our designs on the execution time of the Stencil2D benchmark. The reported numbers are based on the median metric with double precision for 1,000 internal iterations, averaged over ten program runs. With a 1K x 1K input size, as shown in Figure 11(a), we are able to improve performance by 24%, 18%, and 14% on 16, 32, and 64 GPU nodes, respectively. With the large input size (2K x 2K), the proposed design reduces the execution time by 20% and 19% on 32 and 64 GPU nodes, respectively.

Using the redesigned version of the LBM application, we evaluate the performance of the evolution phase. We use both strong and weak scaling, as depicted in Figures 12(a) and 12(b), respectively. For the weak-scaling experiment, we keep the input size per GPU at 64 x 64 x 64 and keep the process grid as balanced as possible; for example, with 64 processes, we use a 4 x 4 x 4 grid. From the figures, we can see that the proposed design achieves 70%, 53%, and 45% improvement on 16, 32, and 64 GPUs, respectively, in the strong-scaling experiment.
Note that the performance degradation at 32 and 64 nodes, compared to 8 and 16 nodes, is due to the increase in communication time with scale: at larger scale and with a small input size, the communication time exceeds the computation time. In the weak-scaling experiment, we show 39% and 30% improvement on 32 and 64 GPU nodes, respectively. We are not able to include a larger input size in the evaluation due to the limit on the amount of memory that the GPU can register; this is a configuration limit on the Wilkes system.

VI. RELATED WORK

There have been several efforts to make programming models accelerator- and co-processor-aware. The work in [6] proposed efficient GPU-to-GPU communication by overlapping RDMA data transfers with CUDA memory copies inside the MPI library. The authors in [17, 18] extended this work with GPUDirect RDMA support for MPI libraries.

Fig. 8. Comparison of latency performance using the existing host-based Pipelining and the proposed GDR-based designs for inter-node D-D Put and Get operations: (a) Put, small messages; (b) Put, large messages; (c) Get, small messages; (d) Get, large messages.

Fig. 9. Latency performance of the proposed GDR-based designs for inter-node D-H and H-D configurations with Put and Get operations: (a) Put D-H; (b) Put H-D; (c) Get H-D; (d) Get D-H.

Fig. 10. One-sided achievement with an asynchronous progress comparison: (a) small message, 8 KB; (b) large message, 1 MB.

Fig. 11. Execution time of the Stencil2D SHOC application benchmark: (a) input size 1K x 1K; (b) input size 2K x 2K.

Fig. 12. Execution time of the evolution phase of the LBM application: (a) input size 128 x 128 x 128; (b) input size per GPU 64 x 64 x 64.

CUDA support for X10 was introduced as part of the Asynchronous PGAS (APGAS) model, which enables writing single-source efficient code for heterogeneous and multi-core architectures [26]. Several designs have been proposed to alleviate the burden of data movement and buffer management between different address spaces in existing programming models such as MPI and PGAS. MVAPICH2-GPU [6, 7, 27] has been proposed to allow GPU-to-GPU data communication using the standard MPI interfaces. The rCUDA framework offers GPU virtualization through custom calls replacing CUDA calls, which can also be used by machines across clusters [28]. Similarly, MIC-RO [29] enables sharing and using multiple Intel Many Integrated Core (MIC) cards across nodes. The work in [30] proposed concepts that allow an HCA to access GPU memory, similar to GDR; however, those concepts require specific hardware and cannot be applied to production-ready HPC systems. Using a proxy-based design to forward communication between two points has been widely explored: most frameworks for virtualizing remote GPUs [28, 31] are based on proxy designs on the remote hosts, and the authors in [23, 32] proposed a proxy-based framework to overcome hardware limitations on MIC-based clusters. We distinguish this work from related efforts as being the first to harness NVIDIA's GDR feature for the OpenSHMEM library. The proposed hybrid solutions benefit from the best of both GDR and host-assisted GPU communication. In addition, the proposed framework has a unique design compared to the existing state of the art, because it maintains the true one-sided property of the OpenSHMEM model while working around architectural bottlenecks.

VII. CONCLUSION AND FUTURE WORK

In this paper, we presented novel designs for the OpenSHMEM runtime that take advantage of GDR technology for intra-node and inter-node communication on NVIDIA GPU clusters. We also presented a framework with alternative and hybrid designs that ensures the true one-sided property of the OpenSHMEM programming model, enables asynchronous progress, and at the same time works around the hardware limitations. The experimental results show 7X improvement in the latency of inter-node D-D communication for small and medium message sizes. The proposed framework achieved an inter-node D-D communication latency of 3.13 µs for 8 bytes and under 4 µs for a 2 KB message. Likewise, up to 3.6X improvement is seen for intra-node configurations. With the Stencil2D application kernel from the SHOC suite, a 19% improvement in execution time is seen on 64 GPU nodes. As part of the performance evaluation, we redesigned the LBM application to use the OpenSHMEM model directly from/to GPU buffers, which showed 53% and 45% improvement in the execution time of the evolution phase on 32 and 64 GPU nodes, respectively. In the future, we plan to extend our designs to the UPC programming model as well as to redesign other applications for the proposed GDR-aware OpenSHMEM runtime.

VIII. ACKNOWLEDGMENT

This research is supported in part by National Science Foundation grants #OCI and #CCF. We would like to thank Filippo Spiga from the University of Cambridge for providing access to the Wilkes testbed.

REFERENCES

[1] TOP 500 Supercomputer Sites,
[2] NVIDIA, NVIDIA CUDA Compute Unified Device Architecture, home new.html.
[3] NVIDIA GPUDirect RDMA. [Online].
[4] MVAPICH2: MPI over InfiniBand, 10GigE/iWARP and RoCE,
[5] Open MPI: Open Source High Performance Computing,
[6] H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur, and D. K. Panda, "MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters," in Int'l Supercomputing Conference (ISC).
[7] H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur, and D. K. Panda, "Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2," in IEEE Cluster '11.
[8] G. Cong, G. Almasi, and V. Saraswat, "Fast PGAS Implementation of Distributed Graph Algorithms," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '10). Washington, DC, USA: IEEE Computer Society, 2010.
[9] S. Olivier and J. Prins, "Scalable Dynamic Load Balancing Using UPC," in Proceedings of the International Conference on Parallel Processing (ICPP '08). Washington, DC, USA: IEEE Computer Society, 2008.

[10] J. Zhang, B. Behzad, and M. Snir, "Optimizing the Barnes-Hut Algorithm in UPC," in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11). New York, NY, USA: ACM, 2011, pp. 75:1-75:11.
[11] C. Bell, D. Bonachea, R. Nishtala, and K. Yelick, "Optimizing Bandwidth Limited Problems Using One-sided Communication and Overlap," in Proceedings of the 20th International Conference on Parallel and Distributed Processing (IPDPS '06). Washington, DC, USA: IEEE Computer Society, 2006.
[12] UPC Consortium, "UPC Language Specifications, v1.2," Lawrence Berkeley National Lab, Tech. Report LBNL.
[13] Co-Array Fortran,
[14] OpenSHMEM, "OpenSHMEM Application Programming Interface."
[15] S. Potluri, D. Bureddy, H. Wang, H. Subramoni, and D. Panda, "Extending OpenSHMEM for GPU Computing," in Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, May 2013.
[16] M. Luo, H. Wang, and D. K. Panda, "Multi-Threaded UPC Runtime for GPU to GPU Communication over InfiniBand," in International Conference on Partitioned Global Address Space Programming Models (PGAS '12), October.
[17] S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy, and D. Panda, "Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs," in International Conference on Parallel Processing (ICPP), Oct 2013.
[18] R. Shi, S. Potluri, K. Hamidouche, M. Li, J. Perkins, D. Rossetti, and D. Panda, "Designing Efficient Small Message Transfer Mechanism for Inter-node MPI Communication on InfiniBand GPU Clusters," in IEEE International Conference on High Performance Computing (HiPC), December.
[19] MVAPICH2-X: Unified MPI+PGAS Communication Runtime over OpenFabrics/Gen2 for Exascale Systems,
[20] D. K. Panda, K. Tomko, K. Schulz, and A. Majumdar, "The MVAPICH Project: Evolution and Sustainability of an Open Source Production Quality MPI Library for HPC," in Workshop on Sustainable Software for Science: Practice and Experiences, held in conjunction with the Int'l Conference on Supercomputing (WSSPE).
[21] J. Jose, M. Luo, S. Sur, and D. K. Panda, "Unifying UPC and MPI Runtimes: Experience with MVAPICH," in The 4th Conference on Partitioned Global Address Space (PGAS).
[22] J. Jose, K. Kandalla, M. Luo, and D. Panda, "Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation," in 41st International Conference on Parallel Processing (ICPP).
[23] S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni, and D. K. Panda, "MVAPICH-PRISM: A Proxy-based Communication Framework Using InfiniBand and SCIF for Intel MIC Clusters," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '13), 2013, pp. 54:1-54:11.
[24] C. Rosales, "Multiphase LBM Distributed over Multiple GPUs," in Cluster Computing, IEEE International Conference on, pp. 1-7.
[25] D. Bureddy, H. Wang, A. Venkatesh, S. Potluri, and D. K. Panda, "OMB-GPU: A Micro-benchmark Suite for Evaluating MPI Libraries on GPU Clusters," in Proceedings of the 19th European Conference on Recent Advances in the Message Passing Interface (EuroMPI '12), 2012.
[26] D. Cunningham, R. Bordawekar, and V. Saraswat, "GPU Programming in a High Level Language: Compiling X10 to CUDA," in Proceedings of the 2011 ACM SIGPLAN X10 Workshop (X10 '11), 2011, pp. 8:1-8:10.
[27] S. Potluri, H. Wang, D. Bureddy, A. K. Singh, C. Rosales, and D. K. Panda, "Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication," in Proceedings of the International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), held in conjunction with the International Parallel and Distributed Processing Symposium (IPDPS '12).
[28] J. Duato, A. Pena, F. Silla, J. Fernandez, R. Mayo, and E. Quintana-Orti, "Enabling CUDA Acceleration Within Virtual Machines Using rCUDA," in High Performance Computing (HiPC), International Conference on, Dec 2011.
[29] K. Hamidouche, S. Potluri, H. Subramoni, K. Kandalla, and D. K. Panda, "MIC-RO: Enabling Efficient Remote Offload on Heterogeneous Many Integrated Core (MIC) Clusters with InfiniBand," in Proceedings of the 27th International ACM Conference on Supercomputing (ICS '13), 2013.
[30] L. Oden and H. Fröning, "GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters," in 2013 IEEE International Conference on Cluster Computing (CLUSTER 2013), Indianapolis, IN, USA, September 23-27, 2013.
[31] S. Xiao, P. Balaji, J. Dinan, Q. Zhu, R. Thakur, S. Coghlan, H. Lin, G. Wen, J. Hong, and W.-C. Feng, "Transparent Accelerator Migration in a Virtualized GPU Environment," in Proceedings of the IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), 2012.
[32] J. Jose, K. Hamidouche, X. Lu, S. Potluri, J. Zhang, K. Tomko, and D. Panda, "High Performance OpenSHMEM for Xeon Phi Clusters: Extensions, Runtime Designs and Application Co-design," in Cluster Computing (CLUSTER), 2014 IEEE International Conference on, Sept 2014.


More information

Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach

Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach Talk at OSC theater (SC 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Overview of the MVAPICH Project: Latest Status and Future Roadmap

Overview of the MVAPICH Project: Latest Status and Future Roadmap Overview of the MVAPICH Project: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) Meeting by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

MVAPICH2 and MVAPICH2-MIC: Latest Status

MVAPICH2 and MVAPICH2-MIC: Latest Status MVAPICH2 and MVAPICH2-MIC: Latest Status Presentation at IXPUG Meeting, July 214 by Dhabaleswar K. (DK) Panda and Khaled Hamidouche The Ohio State University E-mail: {panda, hamidouc}@cse.ohio-state.edu

More information

NVIDIA GPUDirect Technology. NVIDIA Corporation 2011

NVIDIA GPUDirect Technology. NVIDIA Corporation 2011 NVIDIA GPUDirect Technology NVIDIA GPUDirect : Eliminating CPU Overhead Accelerated Communication with Network and Storage Devices Peer-to-Peer Communication Between GPUs Direct access to CUDA memory for

More information

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters High Performance MPI Support in MVAPICH2 for InfiniBand Clusters A Talk at NVIDIA Booth (SC 11) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning Talk at Mellanox Theater (SC 16) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Unified Runtime for PGAS and MPI over OFED

Unified Runtime for PGAS and MPI over OFED Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K.

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K. Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K. Panda Department of Computer Science and Engineering The Ohio

More information

Performance Evaluation of InfiniBand with PCI Express

Performance Evaluation of InfiniBand with PCI Express Performance Evaluation of InfiniBand with PCI Express Jiuxing Liu Amith Mamidala Abhinav Vishnu Dhabaleswar K Panda Department of Computer and Science and Engineering The Ohio State University Columbus,

More information

Solutions for Scalable HPC

Solutions for Scalable HPC Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda Department of Computer Science and Engineering

More information

In the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K.

In the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K. In the multi-core age, How do larger, faster and cheaper and more responsive sub-systems affect data management? Panel at ADMS 211 Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory Department

More information

Accelerating HPL on Heterogeneous GPU Clusters

Accelerating HPL on Heterogeneous GPU Clusters Accelerating HPL on Heterogeneous GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline

More information

Designing Shared Address Space MPI libraries in the Many-core Era

Designing Shared Address Space MPI libraries in the Many-core Era Designing Shared Address Space MPI libraries in the Many-core Era Jahanzeb Hashmi hashmi.29@osu.edu (NBCL) The Ohio State University Outline Introduction and Motivation Background Shared-memory Communication

More information

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries How to Boost the Performance of Your MPI and PGAS s with MVAPICH2 Libraries A Tutorial at the MVAPICH User Group (MUG) Meeting 18 by The MVAPICH Team The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Per-call Energy Saving Strategies in All-to-all Communications

Per-call Energy Saving Strategies in All-to-all Communications Computer Science Technical Reports Computer Science 2011 Per-call Energy Saving Strategies in All-to-all Communications Vaibhav Sundriyal Iowa State University, vaibhavs@iastate.edu Masha Sosonkina Iowa

More information

High-Performance Heterogeneity/ Energy-Aware Communication for Multi-Petaflop HPC Systems

High-Performance Heterogeneity/ Energy-Aware Communication for Multi-Petaflop HPC Systems High-Performance Heterogeneity/ Energy-Aware Communication for Multi-Petaflop HPC Systems Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate

More information

High Performance Computing

High Performance Computing High Performance Computing Dror Goldenberg, HPCAC Switzerland Conference March 2015 End-to-End Interconnect Solutions for All Platforms Highest Performance and Scalability for X86, Power, GPU, ARM and

More information

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,

More information

Operational Robustness of Accelerator Aware MPI

Operational Robustness of Accelerator Aware MPI Operational Robustness of Accelerator Aware MPI Sadaf Alam Swiss National Supercomputing Centre (CSSC) Switzerland 2nd Annual MVAPICH User Group (MUG) Meeting, 2014 Computing Systems @ CSCS http://www.cscs.ch/computers

More information

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Hari Subramoni, Ping Lai, Sayantan Sur and Dhabhaleswar. K. Panda Department of

More information

The Future of Interconnect Technology

The Future of Interconnect Technology The Future of Interconnect Technology Michael Kagan, CTO HPC Advisory Council Stanford, 2014 Exponential Data Growth Best Interconnect Required 44X 0.8 Zetabyte 2009 35 Zetabyte 2020 2014 Mellanox Technologies

More information

MVAPICH2 Project Update and Big Data Acceleration

MVAPICH2 Project Update and Big Data Acceleration MVAPICH2 Project Update and Big Data Acceleration Presentation at HPC Advisory Council European Conference 212 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing High Performance DSM Systems using InfiniBand Features

Designing High Performance DSM Systems using InfiniBand Features Designing High Performance DSM Systems using InfiniBand Features Ranjit Noronha and Dhabaleswar K. Panda The Ohio State University NBC Outline Introduction Motivation Design and Implementation Results

More information

Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters *

Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters * Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters * Krishna Kandalla, Hari Subramoni, Gopal Santhanaraman, Matthew Koop and Dhabaleswar K. Panda Department of Computer Science and

More information

High-Performance Training for Deep Learning and Computer Vision HPC

High-Performance Training for Deep Learning and Computer Vision HPC High-Performance Training for Deep Learning and Computer Vision HPC Panel at CVPR-ECV 18 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Unifying UPC and MPI Runtimes: Experience with MVAPICH

Unifying UPC and MPI Runtimes: Experience with MVAPICH Unifying UPC and MPI Runtimes: Experience with MVAPICH Jithin Jose Miao Luo Sayantan Sur D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,

More information

Memcached Design on High Performance RDMA Capable Interconnects

Memcached Design on High Performance RDMA Capable Interconnects Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi- ur- Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan

More information

rcuda: an approach to provide remote access to GPU computational power

rcuda: an approach to provide remote access to GPU computational power rcuda: an approach to provide remote access to computational power Rafael Mayo Gual Universitat Jaume I Spain (1 of 60) HPC Advisory Council Workshop Outline computing Cost of a node rcuda goals rcuda

More information

10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G

10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G 10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G Mohammad J. Rashti and Ahmad Afsahi Queen s University Kingston, ON, Canada 2007 Workshop on Communication Architectures

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Gilad Shainer 2nd Annual MVAPICH User Group (MUG) Meeting, August 2014 Complete High-Performance Scalable Interconnect Infrastructure Comprehensive End-to-End Software Accelerators

More information

High Performance MPI-2 One-Sided Communication over InfiniBand

High Performance MPI-2 One-Sided Communication over InfiniBand High Performance MPI-2 One-Sided Communication over InfiniBand Weihang Jiang Jiuxing Liu Hyun-Wook Jin Dhabaleswar K. Panda William Gropp Rajeev Thakur Computer and Information Science The Ohio State University

More information

Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand

Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand Jiuxing Liu and Dhabaleswar K. Panda Computer Science and Engineering The Ohio State University Presentation Outline Introduction

More information

High Performance MPI-2 One-Sided Communication over InfiniBand

High Performance MPI-2 One-Sided Communication over InfiniBand High Performance MPI-2 One-Sided Communication over InfiniBand Weihang Jiang Jiuxing Liu Hyun-Wook Jin Dhabaleswar K. Panda William Gropp Rajeev Thakur Computer and Information Science The Ohio State University

More information

Performance Evaluation of InfiniBand with PCI Express

Performance Evaluation of InfiniBand with PCI Express Performance Evaluation of InfiniBand with PCI Express Jiuxing Liu Server Technology Group IBM T. J. Watson Research Center Yorktown Heights, NY 1598 jl@us.ibm.com Amith Mamidala, Abhinav Vishnu, and Dhabaleswar

More information

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems Carl Pearson 1, I-Hsin Chung 2, Zehra Sura 2, Wen-Mei Hwu 1, and Jinjun Xiong 2 1 University of Illinois Urbana-Champaign, Urbana

More information

RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits

RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits Sayantan Sur Hyun-Wook Jin Lei Chai D. K. Panda Network Based Computing Lab, The Ohio State University Presentation

More information

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda Network Based Computing Laboratory

More information

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Sayantan Sur, Matt Koop, Lei Chai Dhabaleswar K. Panda Network Based Computing Lab, The Ohio State

More information

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation

More information

Using MVAPICH2- X for Hybrid MPI + PGAS (OpenSHMEM and UPC) Programming

Using MVAPICH2- X for Hybrid MPI + PGAS (OpenSHMEM and UPC) Programming Using MVAPICH2- X for Hybrid MPI + PGAS (OpenSHMEM and UPC) Programming MVAPICH2 User Group (MUG) MeeFng by Jithin Jose The Ohio State University E- mail: jose@cse.ohio- state.edu h

More information

LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster

LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster H. W. Jin, S. Sur, L. Chai, and D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering

More information

Job Startup at Exascale: Challenges and Solutions

Job Startup at Exascale: Challenges and Solutions Job Startup at Exascale: Challenges and Solutions Sourav Chakraborty Advisor: Dhabaleswar K (DK) Panda The Ohio State University http://nowlab.cse.ohio-state.edu/ Current Trends in HPC Tremendous increase

More information

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 Leading Supplier of End-to-End Interconnect Solutions Analyze Enabling the Use of Data Store ICs Comprehensive End-to-End InfiniBand and Ethernet Portfolio

More information

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand Matthew Koop 1,2 Terry Jones 2 D. K. Panda 1 {koop, panda}@cse.ohio-state.edu trj@llnl.gov 1 Network-Based Computing Lab, The

More information

One-Sided Append: A New Communication Paradigm For PGAS Models

One-Sided Append: A New Communication Paradigm For PGAS Models One-Sided Append: A New Communication Paradigm For PGAS Models James Dinan and Mario Flajslik Intel Corporation {james.dinan, mario.flajslik}@intel.com ABSTRACT One-sided append represents a new class

More information

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X IXPUG 7 PresentaNon J. Hashmi, M. Li, H. Subramoni and DK Panda The Ohio State University E-mail: {hashmi.29,li.292,subramoni.,panda.2}@osu.edu

More information

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning GPU Technology Conference GTC 217 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big

More information

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011 The Road to ExaScale Advances in High-Performance Interconnect Infrastructure September 2011 diego@mellanox.com ExaScale Computing Ambitious Challenges Foster Progress Demand Research Institutes, Universities

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Smart Interconnect for Next Generation HPC Platforms Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting Mellanox Connects the World s Fastest Supercomputer

More information

Unified Communication X (UCX)

Unified Communication X (UCX) Unified Communication X (UCX) Pavel Shamis / Pasha ARM Research SC 18 UCF Consortium Mission: Collaboration between industry, laboratories, and academia to create production grade communication frameworks

More information

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT Krishna Kandalla (1), Hari Subramoni (1), Karen Tomko (2), Dmitry Pekurovsky

More information

Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach

Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach Talk at OpenFabrics Workshop (April 216) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

EVALUATING INFINIBAND PERFORMANCE WITH PCI EXPRESS

EVALUATING INFINIBAND PERFORMANCE WITH PCI EXPRESS EVALUATING INFINIBAND PERFORMANCE WITH PCI EXPRESS INFINIBAND HOST CHANNEL ADAPTERS (HCAS) WITH PCI EXPRESS ACHIEVE 2 TO 3 PERCENT LOWER LATENCY FOR SMALL MESSAGES COMPARED WITH HCAS USING 64-BIT, 133-MHZ

More information

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Sayantan Sur, Abhinav Vishnu, Hyun-Wook Jin, Wei Huang and D. K. Panda {surs, vishnu, jinhy, huanwei, panda}@cse.ohio-state.edu

More information

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations S. Narravula, A. Mamidala, A. Vishnu, K. Vaidyanathan, and D. K. Panda Presented by Lei Chai Network Based

More information

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning Talk at Mellanox booth (SC 218) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

UCX: An Open Source Framework for HPC Network APIs and Beyond

UCX: An Open Source Framework for HPC Network APIs and Beyond UCX: An Open Source Framework for HPC Network APIs and Beyond Presented by: Pavel Shamis / Pasha ORNL is managed by UT-Battelle for the US Department of Energy Co-Design Collaboration The Next Generation

More information

A Case for High Performance Computing with Virtual Machines

A Case for High Performance Computing with Virtual Machines A Case for High Performance Computing with Virtual Machines Wei Huang*, Jiuxing Liu +, Bulent Abali +, and Dhabaleswar K. Panda* *The Ohio State University +IBM T. J. Waston Research Center Presentation

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM

EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM Sreeram Potluri, Anshuman Goswami NVIDIA Manjunath Gorentla Venkata, Neena Imam - ORNL SCOPE OF THE WORK Reliance on CPU

More information

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012 Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA Outline Introduction to Multi-GPU Programming Communication for Single Host, Multiple GPUs Communication for Multiple Hosts, Multiple GPUs

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures

EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures Haiyang Shi, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda {shi.876, lu.932, panda.2}@osu.edu The Ohio State University

More information

ROCm: An open platform for GPU computing exploration

ROCm: An open platform for GPU computing exploration UCX-ROCm: ROCm Integration into UCX {Khaled Hamidouche, Brad Benton}@AMD Research ROCm: An open platform for GPU computing exploration 1 JUNE, 2018 ISC ROCm Software Platform An Open Source foundation

More information

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational

More information

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI Sourav Chakraborty, Hari Subramoni, Jonathan Perkins, Ammar A. Awan and Dhabaleswar K. Panda Department of Computer Science and Engineering

More information