Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase


2016 IEEE 8th International Conference on Cloud Computing Technology and Science (CloudCom 2016)

Xiaoyi Lu, Dipti Shankar, Shashank Gugnani, Hari Subramoni, and Dhabaleswar K. (DK) Panda
Department of Computer Science and Engineering, The Ohio State University
Email: {lu.9, shankar.5, gugnani.2, subramoni.1, ...}

Abstract: The performance of Hadoop components can be significantly improved by leveraging advanced features such as Remote Direct Memory Access (RDMA) on modern HPC clusters, where high-performance networks like InfiniBand (IB) and RoCE have been deployed widely. With the emergence of high-performance computing in the cloud (HPC Cloud), high-performance networks have paved their way into the cloud with the recently introduced Single Root I/O Virtualization (SR-IOV) technology. With these advancements in HPC Cloud networking technologies, it is high time to investigate the design opportunities and impact of networking architectures (different generations of IB, 40GigE, 40G-RoCE) and protocols (TCP/IP, IPoIB, RC, UD, Hybrid) in accelerating Hadoop components over high-performance networks. In this paper, we propose a network architecture and multi-protocol aware Hadoop RPC design that can take advantage of the RC and UD protocols for IB and RoCE. A hybrid transport design with RC and UD is proposed, which can deliver both memory scalability and performance for Hadoop RPC. We present a comprehensive performance analysis on five bare-metal IB/RoCE clusters and one SR-IOV enabled cluster in the Chameleon Cloud. Our performance evaluations reveal that our proposed designs can achieve up to 12.5x performance improvement for Hadoop RPC over IPoIB. Further, we integrate our RPC engine into Apache HBase and demonstrate that we can accelerate YCSB workloads by up to 3.6x. Other insightful observations on the performance characteristics of different HPC Cloud networking technologies are also shared in this paper.

I. INTRODUCTION

Recent worldwide studies by IDC [7] reported that 67% of HPC centers perform high-performance data analytics (HPDA). This necessitates high-performance designs of Big Data middleware like Apache Hadoop for modern HPC clusters. As a distributed processing framework, Hadoop involves a large amount of data movement within and between its components, including HDFS, MapReduce, YARN, and HBase. This makes the performance of Hadoop components sensitive to the underlying networking technologies. High-performance networking technologies such as InfiniBand (IB), 10/40GigE, and RDMA over Converged Ethernet (RoCE) have gained importance for designing high-performance computing (HPC) clusters. In addition to providing high bandwidth and low latency, IB and RoCE provide advanced features like Remote Direct Memory Access (RDMA) and different communication protocols, including Reliable Connection (RC) and Unreliable Datagram (UD), that have different performance and memory characteristics. With the emergence of high-performance computing in the cloud (HPC Cloud), approximately 35% of HPC today is being done in the cloud environment [19].

This research is supported in part by National Science Foundation grants #CNS, #IIS, and #CNS. It used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant #OCI.
Thus, advanced networking technologies such as IB, 10/40GigE, and RoCE have paved their way into the cloud with the recently introduced Single Root I/O Virtualization (SR-IOV) [17] technology, in OpenStack-based HPC clouds such as Chameleon [1] as well as in Amazon EC2. Prior research has shown that communication libraries such as MPI can deliver near bare-metal performance on SR-IOV enabled InfiniBand clusters [9, 23]. With these advancements in HPC Cloud networking technologies, it is high time to investigate the design opportunities and impact of networking architectures and protocols on accelerating Hadoop components over high-performance networks.

A. Motivation

As shown in Figure 1, there are three dimensions that need to be considered for designing high-performance communication libraries for Big Data middleware. The first dimension, network architecture, covers the various HPC cloud networking technologies, such as the different generations of IB, including IB QDR (40 Gbps), FDR (56 Gbps), and EDR (100 Gbps), and the 40 Gig Ethernet/RoCE interconnects. Based on the network interconnect available in the cluster, different high-performance network protocols, such as TCP/IP over Ethernet or IB (IPoIB) and RC/UD over IB/RoCE, constitute the second dimension. The third dimension spans the type of available HPC cloud environments: bare-metal nodes and virtualized nodes with SR-IOV support.

Fig. 1: HPC Cloud Networking Technologies

Recent studies [6, 8, 10, 13, 15, 18, 21] have shed light on the possible performance improvements for different Big Data computing middleware by taking advantage of RDMA over IB with the RC protocol. They do not take into account how one needs to adapt the design based on the cost-performance characteristics of the underlying high-performance interconnect protocols, nor how the performance and memory scalability of the proposed solutions can be enhanced further by taking advantage of a variety of modern transport protocols.

Another study [11] shows the trade-offs between RC and UD on IB for designing a high-performance key-value store, but it does not exploit the benefits of the latest networking technologies like 40G-RoCE. Moreover, these recent studies have not yet explored the impact of these networking technologies on accelerating communication in Hadoop components on modern virtualized SR-IOV-based HPC clouds. These issues lead us to the following broad challenge: Do existing designs of Hadoop components over InfiniBand need to be made aware of the underlying architectural trends and take advantage of the support for modern transport protocols that InfiniBand and RoCE provide?

B. Contributions

In this paper, we take up the broad challenge of designing a novel multi-protocol (RC, UD, and hybrid protocols over IB and RoCE) based, high-performance, and scalable Hadoop RPC engine, which is a fundamental communication mechanism in the Hadoop ecosystem. To illustrate its impact and benefits, we further integrate this design into Apache HBase and evaluate it with Yahoo! YCSB [4] application benchmarks. The major contributions of this paper are:
(1) Propose multi-protocol (RC, UD, and Hybrid) based and architecture-aware designs over InfiniBand and RoCE for Hadoop RPC based communication, which further accelerate the HBase middleware.
(2) Study the impact of different architectures of high-performance interconnects (IB QDR, FDR, and EDR; 40 Gig Ethernet; and RoCE) and protocols (TCP/IP, IPoIB, RC, UD, Hybrid) on the performance of Hadoop RPC and HBase.
(3) Conduct extensive evaluations with the proposed designs on five bare-metal IB/RoCE clusters and one SR-IOV enabled IB cluster in an HPC cloud.

Performance evaluation shows that our proposed design can improve the Hadoop RPC throughput by up to 12.5x compared to running default Hadoop RPC over the IPoIB UD transport protocol. Further, our scalability analysis reveals that our proposed design with the UD protocol achieves good memory scalability. Our RC and UD based hybrid communication scheme can deliver performance comparable to that of RC, while maintaining a steady memory footprint like that of UD. The performance comparison on the Chameleon Cloud shows that Hadoop RPC running over SR-IOV has some latency overhead compared to native IB FDR for small messages, while the gap becomes very small (around 6%) for large messages. The integrated design with HBase illustrates that our proposed designs can accelerate HBase set/get operations. We demonstrate that our RPC designs can achieve up to 3.6x improvement in HBase throughput for YCSB A, B, and C workloads, as compared to default HBase running with IPoIB over 36 nodes on SDSC Comet. To the best of our knowledge, this is the first paper to study the impact of HPC Cloud networking technologies (architectures and protocols) on the design of Big Data processing middleware such as Hadoop RPC and HBase.

The rest of the paper is organized as follows. Section II presents the necessary background and related work. Section III presents our proposed design, and Section IV describes network architecture-aware performance optimization for the protocols. Section V describes our detailed evaluation. Finally, we conclude in Section VI with a description of possible future work.

II. BACKGROUND AND RELATED WORK

In this section, we provide the necessary background information for this paper and related works in the literature.

InfiniBand and RoCE: The InfiniBand Architecture [2] defines a switched network fabric for interconnecting processing and I/O nodes, using a queue-based model.
It supports two communication semantics: Channel Semantics (Send-Receive communication) over RC and UD, and Memory Semantics (RDMA communication) over RC. Both semantics can perform zero-copy transfers from source to destination buffers without additional host-level memory copies. RC is connection-oriented and requires a dedicated queue pair (QP) for each destination process, while the connection-less UD transport uses a single QP for all. InfiniBand adapters can also be used to run applications that use TCP/IP over the InfiniBand network in the IP-over-InfiniBand (IPoIB) mode [3]. Leveraging these transport services has been demonstrated to be beneficial for several applications that are commonly run on HPC clusters. For example, Koop et al. [12] proposed designs in the MPI library that take advantage of the connection-less UD transport to reduce the memory usage for connections while maintaining the same performance as that of RC. RoCE allows RDMA over high-performance Ethernet networks, and can work with both the RC and UD protocols.

SR-IOV: SR-IOV is an industry standard which specifies native I/O virtualization capabilities in PCI Express (PCIe) adapters. It presents a PCIe device as multiple virtual devices, and each virtual device can be dedicated to a single VM. It has been demonstrated that SR-IOV performs significantly better than other software-based I/O virtualization solutions [9]. InfiniBand can support SR-IOV in HPC clouds.

Accelerating Big Data Middleware on HPC Clusters: With the increased need to design Big Data middleware that can deliver high performance on modern HPC clusters, several studies have been devoted to exploring the advantages of re-designing Hadoop middleware components to be RDMA-aware. For instance, Islam et al. proposed RDMA-enhanced designs [8] to improve HDFS performance on InfiniBand clusters. Similarly, [18, 21] proposed high-performance data shuffling using RDMA in MapReduce for InfiniBand clusters. Recent studies [6, 13, 20] have also shown that RPC and HBase can be improved significantly by leveraging the advanced features offered by high-performance networks. High-performance communication schemes that take advantage of MPI for improving the performance of RPC and Hadoop-like Big Data processing workloads were proposed in [14].

III. PROPOSED DESIGN

In this section, we describe the network architecture-aware and multi-protocol aware design for Hadoop RPC and HBase.

A. Hadoop RPC Architecture Overview

Figure 2 depicts a high-level overview of the two Hadoop RPC designs: (1) the default Hadoop RPC running the IP protocol over InfiniBand (IPoIB) or TCP/IP over 40 GigE, and (2) our proposed high-performance and hybrid Hadoop RPC engine. The Java-based Hadoop RPC system supports two serialization formats: (1) Writable serialization, which implements a simple and efficient serialization protocol based on the DataInput and DataOutput interfaces in Java, and (2) Protocol Buffers [5], which provide a fast and flexible mechanism for serializing user-defined structured data objects.
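To make the Writable path concrete, the following minimal sketch shows how a small RPC payload type can implement Hadoop's Writable interface (org.apache.hadoop.io.Writable); the class name and field layout are illustrative only and are not taken from the Hadoop code base.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Illustrative RPC payload serialized with Hadoop's Writable interface.
// The field layout is hypothetical; real Hadoop RPC requests carry the
// method name and call arguments in a comparable, length-prefixed form.
public class ExampleRpcPayload implements Writable {
  private String methodName;
  private byte[] args;

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(methodName);   // method to invoke on the server
    out.writeInt(args.length);  // length-prefixed argument blob
    out.write(args);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    methodName = in.readUTF();
    args = new byte[in.readInt()];
    in.readFully(args);
  }
}

Either this Writable path or the Protocol Buffers path produces a byte stream that the RPC engine described next moves across the network.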

From Figure 2, we can see that default Hadoop employs a Java NIO-based RPC engine that interfaces with the Java Sockets library running over IPoIB. In contrast, our proposed RPC engine employs a re-designed communication sub-system that enables Hadoop components to use native InfiniBand/RoCE primitives directly [13]. Based on the RPC design in [13], in this paper, we design a customized, scalable Hadoop RPC plugin module. Our design has the following main components.

SEDA-based Thread Management: The Staged Event-Driven Architecture (SEDA) [22] is widely used for designing high-throughput systems; its basic principle is to decompose a complex processing logic into a set of stages that are connected via event-based queues. We leverage our previous efforts on improving RPC performance using the SEDA approach [13, 15] to design a network architecture-aware RPC engine.

Offloading Data Transfers to the Native IB-/RoCE-based RPC Engine: As illustrated in Figure 2, our proposed native IB-/RoCE-based RPC engine can leverage RC, UD, and hybrid RC-UD designs to communicate RPC call messages in an efficient manner, over both native InfiniBand clusters (IB QDR/FDR/EDR) and RoCE over 40 GigE. All our designs can run with SR-IOV over IB/RoCE networks. The native IB-/RoCE-based RPC engine can be leveraged by both the Writable and Protobuf serialization/de-serialization mechanisms. We provide details of our design over RC in Section III-B, our design over UD in Section III-C, and how we can leverage these different transport protocols in a hybrid fashion in Section III-D.

Fig. 2: Overview of Proposed Hadoop RPC and HBase Design

B. Hadoop RPC Design over RC

When the client issues an RPC call, the sender object that comprises the RPC method information and the calling arguments is passed to the high-performance RPC plugin. The RPC client serializes the sender object and passes it on to the underlying native IB/RoCE RPC engine through the Java Native Interface (JNI), using a direct Java byte buffer (ByteBuffer). A message is created with the serialized data object, and a corresponding message header is created. Since RC is a connection-oriented service that guarantees reliable delivery and provides RDMA capabilities, RPC over the RC channel adopts different schemes based on the payload size:

(a) Eager protocol for small messages (Eager): For short message transfers, wherein the payload size is smaller than the eager threshold, the header and message are copied directly into the network buffer. Using IB send primitives with the eager protocol over the RC channel, both the header and the message are communicated to the server side in one network operation. On the server side, the native IB/RoCE RPC engine copies the received message into a JNI ByteBuffer allocated by the RPC server. This is illustrated in Figure 3(a).

(b) Rendezvous protocol for large messages (Rendezvous): For large messages, wherein the payload size is greater than the eager threshold, RDMA is leveraged for efficient message transfers. The header information and the client-side buffer location are sent in the form of a Request-To-Send (RTS) message to the server side. On receipt of the header, the native IB/RoCE RPC engine on the server obtains a direct byte buffer from the JNI layer. With this as the destination buffer, the RPC engine issues an RDMA read operation by specifying the remote client buffer as the source. Compared to the eager protocol for small messages, the rendezvous protocol eliminates memory copies between client buffers and pre-allocated network buffers at both the sender and the receiver side. This is illustrated in Figure 3(b).

For both short and large messages, the RPC server deserializes the message in the JNI ByteBuffer into the sender object.

Fig. 3: Hadoop RPC Design with RC: (a) Eager Transfer; (b) Zero-Copy RDMA
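The following is a minimal Java sketch of the payload-size-based dispatch described above for the RC channel. The names NativeRcChannel, sendEager, sendRendezvousRts, and registerForRdma are hypothetical stand-ins for the native engine reached through JNI, not the actual implementation.

import java.nio.ByteBuffer;

// Sketch of the RC-channel send path: copy small payloads into a
// pre-registered network buffer (eager), or publish an RTS so the server
// can RDMA-read large payloads directly from the client buffer (rendezvous).
final class RcSendPath {
  private final int eagerThreshold;   // tuned per network architecture (Section IV)
  private final NativeRcChannel channel;

  RcSendPath(NativeRcChannel channel, int eagerThreshold) {
    this.channel = channel;
    this.eagerThreshold = eagerThreshold;
  }

  void send(ByteBuffer header, ByteBuffer payload) {
    if (payload.remaining() <= eagerThreshold) {
      // Eager: header and payload travel in one send on the RC queue pair.
      channel.sendEager(header, payload);
    } else {
      // Rendezvous: advertise the registered client buffer; the server
      // issues an RDMA read into a JNI direct ByteBuffer, avoiding copies.
      channel.sendRendezvousRts(header, channel.registerForRdma(payload));
    }
  }
}

// Hypothetical interface standing in for the native IB/RoCE engine.
interface NativeRcChannel {
  void sendEager(ByteBuffer header, ByteBuffer payload);
  void sendRendezvousRts(ByteBuffer header, long remoteBufferHandle);
  long registerForRdma(ByteBuffer payload);
}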
C. Hadoop RPC Design over UD

The UD transport provides fast but unreliable service, making it mandatory to explicitly handle reliability and flow control within the native IB/RoCE RPC engine. To facilitate flow control, we leverage the traditional sliding window protocol as implemented in [11]. Using the sliding window to represent the upper bound on the number of unacknowledged messages, the native IB/RoCE RPC engine keeps track of ongoing message transfers. Explicit acknowledgment messages are sent by the receiver for each successful full message receipt. A timeout counter is also assigned to each end-point channel. All unacknowledged messages within an expired end-point channel are re-transmitted to enable reliable data transfer. Similar to the RC-based design, the RPC client passes the serialized Java sender object to the native IB-/RoCE-based RPC engine via JNI using direct byte buffers. A message is created with the serialized data object, and a corresponding message header is created. The message transfer mechanism is chosen based on the payload size (a sliding-window sketch follows this list):

(a) Direct transfer of small messages (under the Maximum Transmission Unit, MTU): For fast short message transfers, the header and message are copied directly into a pre-allocated network buffer. Using IB send primitives with the eager protocol over the UD channel, the header and the message are communicated to the server side. On the server side, the native IB/RoCE RPC engine copies the received message into a JNI ByteBuffer allocated by the RPC server. This is illustrated in Figure 4(a).

(b) Packetized transfer for medium messages (PKTZED): For medium-sized messages, we employ a packetized data transfer mechanism. The message is divided into smaller chunks (less than the MTU) and communicated to the RPC server using IB send over the UD channel. Each chunk header contains a sequence identifier to help order the message chunks on the server. At the server side, the chunked messages are received into pre-allocated buffers and re-assembled into the original message. The re-assembled message is copied into the direct byte buffer and passed to the JNI layer, as illustrated in Figure 4(b).

(c) Zero-copy transfer for large messages (ZCP): For large message sizes, the header information and the message size information are sent in the form of a Request-To-Send (RTS) message to the server side. On receipt of the header, the native IB/RoCE RPC engine on the server obtains a direct byte buffer of the appropriate size from the JNI layer. Through this negotiation, the receiver side can guarantee that the receive buffer is ready for receiving data. Then, similar to medium-sized messages, the packetized data send can directly transfer the message in smaller MTU-sized chunks without copying into internal intermediate buffers. On receipt, the message encapsulated in each chunk is directly saved into the allocated JNI byte buffer. Once the entire message has been received into the direct byte buffer, it is passed on to the RPC server plugin via JNI. Memory copies with large buffers are very expensive operations, so the zero-copy transfer scheme is designed to avoid the copy overhead for large messages. This is illustrated in Figure 4(c).

For all message sizes, the RPC server deserializes the message in the JNI direct byte buffer to obtain the sender object and invokes the RPC method with it.

Fig. 4: Hadoop RPC Design with UD: (a) Eager Transfer; (b) Packetized Transfer; (c) Zero-Copy Transfer
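A minimal sketch of the sliding-window reliability logic assumed above is given below. The window size, timeout handling, and all identifiers (UdReliableEndpoint, UdChannel, postSend) are illustrative assumptions rather than the engine's actual code.

import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of per-endpoint reliability over the connection-less UD channel:
// at most `windowSize` unacknowledged messages are in flight, the receiver
// ACKs each fully received message, and unacknowledged messages are
// retransmitted after a timeout.
final class UdReliableEndpoint {
  private final int windowSize;                 // sliding-window bound
  private final long timeoutNanos;              // retransmission timeout
  private final Map<Long, PendingSend> inFlight = new ConcurrentHashMap<>();
  private long nextSeq = 0;

  UdReliableEndpoint(int windowSize, long timeoutNanos) {
    this.windowSize = windowSize;
    this.timeoutNanos = timeoutNanos;
  }

  synchronized boolean trySend(ByteBuffer chunk, UdChannel channel) {
    if (inFlight.size() >= windowSize) {
      return false;                             // window full: caller retries later
    }
    long seq = nextSeq++;
    inFlight.put(seq, new PendingSend(chunk.duplicate(), System.nanoTime()));
    channel.postSend(seq, chunk);               // MTU-sized UD send
    return true;
  }

  void onAck(long seq) {
    inFlight.remove(seq);                       // slide the window forward
  }

  void retransmitExpired(UdChannel channel) {
    long now = System.nanoTime();
    for (Map.Entry<Long, PendingSend> e : inFlight.entrySet()) {
      PendingSend p = e.getValue();
      if (now - p.sentAt > timeoutNanos) {
        channel.postSend(e.getKey(), p.data.duplicate());
        e.setValue(new PendingSend(p.data, now)); // reset the timeout clock
      }
    }
  }

  private static final class PendingSend {
    final ByteBuffer data;
    final long sentAt;
    PendingSend(ByteBuffer data, long sentAt) { this.data = data; this.sentAt = sentAt; }
  }
}

// Hypothetical stand-in for the native UD send primitive.
interface UdChannel {
  void postSend(long sequenceNumber, ByteBuffer chunk);
}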

D. Hadoop RPC over Multi-Protocol Aware Hybrid Transport

In order to leverage both the RC and UD approaches discussed above, we introduce a hybrid connection management mode that can exploit the advantages of both RC and UD. To start with, we enable fast and reliable communication by using RC, but the maximum number of RC connections is capped at a configurable threshold. For better scalability and to reduce the memory footprint, all further connections are made via the UD transport channel. In this way, the native IB/RoCE RPC engine performs channel selection in a transparent manner, without affecting the serialization and deserialization process of the sender Java objects at the upper level. A sketch of this selection logic is shown below.
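The following minimal sketch illustrates RC/UD selection under a configurable RC-connection cap. The names HybridChannelSelector and TransportType are assumptions; the real engine performs the equivalent bookkeeping in native code.

import java.util.concurrent.atomic.AtomicInteger;

// Sketch of hybrid connection management: the first `maxRcConnections`
// endpoints get a dedicated RC queue pair for lowest latency; once that
// cap is reached, further endpoints share the single UD queue pair so the
// memory footprint stays flat.
final class HybridChannelSelector {
  private final int maxRcConnections;           // configurable threshold
  private final AtomicInteger rcInUse = new AtomicInteger();

  HybridChannelSelector(int maxRcConnections) {
    this.maxRcConnections = maxRcConnections;
  }

  TransportType selectForNewConnection() {
    // Reserve an RC slot if one is available, otherwise fall back to UD.
    while (true) {
      int current = rcInUse.get();
      if (current >= maxRcConnections) {
        return TransportType.UD;
      }
      if (rcInUse.compareAndSet(current, current + 1)) {
        return TransportType.RC;
      }
    }
  }

  void onConnectionClosed(TransportType type) {
    if (type == TransportType.RC) {
      rcInUse.decrementAndGet();                // free the RC slot
    }
  }

  enum TransportType { RC, UD }
}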
E. Leveraging the Multi-Protocol Aware RPC Engine in HBase

While HDFS and YARN utilize Hadoop RPC to communicate between nodes or tasks, Hadoop components such as HBase utilize it for data transfers between client and server. Apache HBase is a distributed NoSQL database that runs on top of HDFS and is designed to host large tables with billions of entries. Hence, to illustrate the impact of the proposed network architecture-aware approach on Hadoop RPC-based middleware, we demonstrate how our proposed native IB/RoCE-based high-performance RPC engine can benefit data communication in HBase. Since Apache HBase uses a modified version of the Hadoop RPC protocol as the core communication library for its set and get operations, we integrate our enhanced Hadoop RPC native engine into Apache HBase. In this way, we provide a means for HBase applications to leverage the high performance made available by native IB/RoCE transport protocols on the latest HPC clouds.

IV. NETWORK ARCHITECTURE-AWARE PERFORMANCE OPTIMIZATION FOR PROTOCOLS

Experimental Setup: To evaluate the performance trends of Hadoop RPC and HBase over different generations of high-speed interconnects, we employ different clusters that are equipped with InfiniBand QDR/FDR/EDR HCAs and RoCE-enabled 40 GigE interconnects. Table I presents the different clusters used for our performance evaluations.

TABLE I: Cluster Configurations
RI-IB-QDR: Intel Westmere (2.67 GHz, 8 cores), Mellanox IB QDR HCA, RHEL 6.1
SDSC-Comet-IB-FDR: Intel Xeon E5-2680 v3 (24 cores), Mellanox IB FDR HCA, CentOS
RI2-IB-EDR: Intel Broadwell E5-2680 v4 (28 cores), Mellanox IB EDR HCA, CentOS 7.2
RI2-40G-RoCE: Intel Broadwell E5-2680 v4 (28 cores), Mellanox ConnectX-4 40GbE HCA, CentOS 7.2
Chameleon-FDR-Bare-Metal: Intel Haswell E5-2670 v3 (48 cores), Mellanox IB FDR HCA, CentOS 7
Chameleon-FDR-SR-IOV: Intel Haswell E5-2670 v3 (48 cores), Mellanox IB FDR HCA with SR-IOV, CentOS 7 with KVM-VMM

A. Performance Optimization in the RPC Engine

1) MTU-based tuning for both RC and UD: To optimize RC and UD performance, we first look at the IB MTU. A larger MTU can bring good efficiency because each packet carries more user data, leading to fewer packets to process. However, when the MTU is too large, it can lead to larger error rates and longer latency for user data that is very small. Thus, we tune the MTU to find the optimal value for different HCA architectures. For both the UD and RC channels, we use a ping-pong latency benchmark and run it for MTU values from 1024 to 4096 bytes. The experiment results with UD and RC on the RI2-IB-EDR, RI2-40G-RoCE, and Chameleon-FDR-SR-IOV clusters are shown in Figure 5(a), Figure 6(a), and Figure 7(a), respectively. For the 4 MB message size shown in Figure 5(a), RC-MTU-4K reduces the latency by 6% compared with RC-MTU-1K (67.9 us), while UD-MTU-4K reduces the latency by 12% compared with UD-MTU-1K. While the same trend is observed for 40G-RoCE, the optimal MTU for RC over SR-IOV is, however, 2K.

2) Eager-threshold tuning for RC: The eager threshold specifies the switch point between the eager and rendezvous protocols. If the threshold is too small, it can add overhead from exchanging control messages for setting up communication buffers at both sender and receiver. If it is too large, copying data to and from user buffers and pre-registered buffers at both the sender and the receiver can incur large overheads. Therefore, we need to optimize this parameter to find the optimal switch point. Since the eager threshold varies with different CPU and HCA configurations, we tune this parameter for different network interconnects by running the ping-pong latency benchmark with eager-threshold settings starting from 8 KB. The experiment results are shown in Figure 5(b), Figure 6(b), and Figure 7(b). Through the zoom-in figures, we can see the different performance trends with varied thresholds.
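Both tuning steps rely on a simple ping-pong exchange. The sketch below illustrates the kind of measurement loop involved, assuming a hypothetical blocking RpcClient.echo call; it is not the OSU HiBD (OHB) benchmark code itself.

// Sketch of a ping-pong loop used to pick tuned parameter values:
// send a fixed-size request, wait for the echo, and report the average
// round-trip latency over many iterations.
final class PingPongLatency {
  static double measureMicros(RpcClient client, byte[] payload, int warmup, int iterations) {
    for (int i = 0; i < warmup; i++) {
      client.echo(payload);                     // warm up caches and connections
    }
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      client.echo(payload);                     // blocking request/response
    }
    long elapsed = System.nanoTime() - start;
    return (elapsed / 1000.0) / iterations;     // average latency in microseconds
  }
}

// Hypothetical client handle exposing a blocking echo RPC.
interface RpcClient {
  byte[] echo(byte[] payload);
}

Running such a loop for each candidate MTU or threshold value produces latency curves like those shown in Figures 5 to 7.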

3) PKTZED/ZCP-based tuning for UD: The UD packet threshold specifies the switch point between packetized send/recv and zero-copy based transfers on UD. If the switch point is too small, the overhead of exchanging control messages for setting up communication buffers is added for some medium messages. If it is too large, large copy overheads due to copying data between the user's and pre-registered buffers can be incurred. Therefore, we optimize this parameter to find the optimal switch point. We measure the round-trip latency by varying the packet threshold up to 512 KB. The experiment results are shown in Figure 5(c), Figure 6(c), and Figure 7(c). Different performance trends with varied thresholds can be observed through the zoom-in figures.

Fig. 5: Performance Optimization on RI2-IB-EDR Cluster: (a) RC/UD MTU Tuning; (b) RC Eager Threshold Tuning; (c) UD PKTZED/ZCP Tuning
Fig. 6: Performance Optimization on RI2-40G-RoCE Cluster: (a) RC/UD MTU Tuning; (b) RC Eager Threshold Tuning; (c) UD PKTZED/ZCP Tuning
Fig. 7: Performance Optimization on Chameleon-FDR-SR-IOV Cluster: (a) RC/UD MTU Tuning; (b) RC Eager Threshold Tuning; (c) UD PKTZED/ZCP Tuning

TABLE II: Performance Optimization for Different Network Architectures. Tuned MTU: 4K on RI-IB-QDR, RI2-40G-RoCE, SDSC-Comet-IB-FDR, and RI2-IB-EDR; 2K/4K (RC) and 4K (UD) on Chameleon-FDR-Bare-Metal; 2K (RC) and 4K (UD) on Chameleon-FDR-SR-IOV. The RC eager threshold and UD packetized threshold are likewise tuned per cluster (e.g., a UD packetized threshold of 512K/1M on RI2-40G-RoCE).

The tuned parameters for all the clusters shown in Table I are summarized in Table II. From Table II, we can make the following important observations: (1) Chameleon-FDR-Bare-Metal and SDSC-Comet-IB-FDR have similar optimized values for the tuned parameters; these can be used for optimal performance on other FDR clusters, and (2) SR-IOV and bare-metal on the Chameleon cloud have similar values for parameters like the eager and packetized thresholds and the MTUs for RC and UD. It can also be noted that our designs can run over both RoCE RC and RoCE UD. Based on these findings, we design a module inside the RPC engine to perform automatic network architecture detection and then set the best parameters for the associated communication channels.
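A minimal sketch of such an architecture-detection module is given below. The MTU entries follow the tuned values summarized in Table II, while the eager and packetized threshold constants are placeholders only, since the exact tuned thresholds are cluster-specific; all class and enum names are assumptions.

import java.util.EnumMap;
import java.util.Map;

// Sketch of the module inside the RPC engine that maps the detected
// interconnect to tuned communication parameters. MTU choices follow
// Table II; the eager/packetized threshold constants below are purely
// illustrative placeholders, not the paper's tuned values.
final class ArchAwareTuner {
  enum Arch { IB_QDR, IB_FDR, IB_EDR, ROCE_40G, IB_FDR_SRIOV }

  static final class TunedParams {
    final int rcMtu, udMtu, eagerThreshold, pktzedThreshold;
    TunedParams(int rcMtu, int udMtu, int eagerThreshold, int pktzedThreshold) {
      this.rcMtu = rcMtu; this.udMtu = udMtu;
      this.eagerThreshold = eagerThreshold; this.pktzedThreshold = pktzedThreshold;
    }
  }

  private static final Map<Arch, TunedParams> TABLE = new EnumMap<>(Arch.class);
  static {
    int kb = 1024;
    // MTU per Table II; threshold values here are placeholders only.
    TABLE.put(Arch.IB_QDR,       new TunedParams(4 * kb, 4 * kb, 16 * kb, 64 * kb));
    TABLE.put(Arch.IB_FDR,       new TunedParams(4 * kb, 4 * kb, 16 * kb, 64 * kb));
    TABLE.put(Arch.IB_EDR,       new TunedParams(4 * kb, 4 * kb, 16 * kb, 64 * kb));
    TABLE.put(Arch.ROCE_40G,     new TunedParams(4 * kb, 4 * kb, 16 * kb, 512 * kb));
    TABLE.put(Arch.IB_FDR_SRIOV, new TunedParams(2 * kb, 4 * kb, 16 * kb, 64 * kb));
  }

  static TunedParams paramsFor(Arch detected) {
    return TABLE.get(detected);
  }
}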
V. PERFORMANCE EVALUATION

In this section, we present detailed performance evaluations of our native IB/RoCE-based design of Hadoop RPC and HBase. We compare the performance of our design with that of the default architecture over various IB/RoCE interconnects and protocols. We have used the OSU HiBD Benchmarks (OHB) [16] and the Yahoo! YCSB benchmarks for our evaluations. In all our experiments, we have used Apache Hadoop, HBase 1.1.2, and JDK.

A. Hadoop RPC Latency

Figure 8 depicts the ping-pong latency of Hadoop RPC communications. Switching between the RC and UD protocols for IPoIB requires administrator privileges and driver support; among the evaluated clusters, we can only switch this on RI-IB-QDR, and we use the default IPoIB protocols on the other clusters. From these graphs, we can see that, compared to the native IB-RC or IB-UD transport channels, the IPoIB-RC and IPoIB-UD transport channels do not achieve optimal performance. This is because, when RPC messages go through the IPoIB transport channels, the bottlenecks of memory allocations, copies, and adjustments in the default Hadoop RPC design suppress the benefits of high-performance networks. As shown in Figure 8(a), for a representative small payload size, the latency of the IPoIB-RC transport channel is 74 us, whereas the latency of the native IB-RC transport channel is 35 us, about a 2.1x speedup. For the same payload size with UD, the IPoIB-UD and IB-UD latencies are 75.1 us and 35.1 us, respectively; the native IB-UD transport channel reduces the latency by 53%. The same trend can be seen from Figure 8(b) to Figure 8(f) as well. IB/RoCE with RC always gives the best performance in terms of latency, mainly because it takes advantage of the RDMA-based design. IB/RoCE with UD can perform faster than TCP/IP over 40 Gig Ethernet or IPoIB with RC/UD, mainly because of the proposed eager, packetized, and zero-copy message transfer designs for UD. IB/RoCE with the hybrid RC-UD design can achieve performance similar to IB/RoCE RC. For Hadoop RPC latency, as the networking technologies evolve, we see that IB EDR has the best performance for all the cases. 40G-RoCE and IB FDR have similar performance, while 40 Gig Ethernet and IB QDR perform similarly. SR-IOV has considerable overhead compared to the bare-metal environment for small message sizes, but the overhead is reduced to around 6% for large message sizes. Therefore, SR-IOV can be suitable for the large-size I/O operations that are common in many Big Data middleware components.

B. Hadoop RPC Throughput

In this section, we evaluate the performance of RPC communication in terms of throughput. For these tests, the RPC server runs on a separate node, and multiple clients are distributed uniformly over 8 compute nodes. Figure 9 shows the throughput results on all six clusters. From these figures, we can observe that, compared with the IPoIB and 40 Gig Ethernet transport channels, the native IB and RoCE based transport channels are able to achieve higher throughput. For example, as shown in Figure 9(c), the native IB-RC transport channel achieves substantially more calls per second than the IPoIB channel; overall, the performance improvement can be up to 2.15x. Similarly, there are up to 4.6x and 12.5x throughput improvements on the RI-IB-QDR and SDSC-Comet-IB-FDR clusters, respectively. Compared to 40 Gig Ethernet, the native design with 40G-RoCE is able to achieve up to 1.89x speedup. In addition, we can observe that the hybrid transport channel achieves throughput similar to RC, which we attribute to the ability of our hybrid design to dynamically switch between RC and UD. For the SR-IOV based virtualized environment, we see similar trends: the native design is able to achieve up to 2.6x speedup compared to IPoIB.

C. Hadoop RPC Memory Scalability Analysis

Figure 10 presents the memory footprint for a varying number of client connections for the RC, UD, and Hybrid transports. The RC transport needs to create one QP for each client connection. We can clearly see that, as the number of client connections increases, the memory footprint of RC keeps increasing; at the largest number of client connections evaluated, it consumes approximately 42 MB of memory. (Note that the JVM memory consumption is always present in these tests.) However, there is no need to create a new QP for each client connection with the UD transport. As a result, we observe that UD consumes a steady amount of memory; the slight increase is due to new objects allocated in the JVM for the increasing client connections. This steady memory consumption is favorable for large-scale applications. We also notice that, by applying the hybrid design, the memory footprint is clearly reduced after the configured threshold of client connections is reached, and then stays steady onwards. In the meantime, the Hybrid transport can still achieve performance similar to RC. With this, we believe the hybrid scheme we proposed for Hadoop RPC will have less memory contention with applications that have higher memory consumption.

D. YCSB over HBase

HBase uses Google Protocol Buffer-based RPC for communicating all operation requests, such as Put, Get, etc., and responses between the client and the HRegionServer. To evaluate our native RPC engine integrated into HBase (described in Section III-E), we use YCSB workloads A, B, and C.
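Each YCSB set/get operation ultimately becomes an HBase client Put or Get, which is carried to the HRegionServer by the HBase RPC layer that our engine re-designs underneath. The minimal HBase 1.x client sketch below illustrates this path; the table name "usertable" and the column names are assumptions in line with YCSB's common defaults.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal HBase 1.x client sketch: one Put (set) and one Get, each shipped
// to the HRegionServer through the HBase RPC layer.
public class HBaseSetGetExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("usertable"))) {

      Put put = new Put(Bytes.toBytes("user100"));
      put.addColumn(Bytes.toBytes("family"), Bytes.toBytes("field0"),
                    Bytes.toBytes("value0"));
      table.put(put);                    // protobuf-based RPC to the RegionServer

      Get get = new Get(Bytes.toBytes("user100"));
      Result result = table.get(get);    // RPC response carries the row back
      byte[] value = result.getValue(Bytes.toBytes("family"), Bytes.toBytes("field0"));
      System.out.println(Bytes.toString(value));
    }
  }
}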

Fig. 8: Hadoop RPC Latency Evaluation on Different Networking Technologies: (a) RI-IB-QDR; (b) SDSC-Comet-IB-FDR; (c) RI2-IB-EDR; (d) Chameleon-FDR-Bare-Metal; (e) Chameleon-FDR-SR-IOV; (f) RI2-40Gig Ethernet and 40G-RoCE
Fig. 9: Hadoop RPC Throughput Evaluation on Different Networking Technologies: (a) RI-IB-QDR; (b) SDSC-Comet-IB-FDR; (c) RI2-IB-EDR; (d) Chameleon-FDR-Bare-Metal; (e) Chameleon-FDR-SR-IOV; (f) RI2-40Gig Ethernet and 40G-RoCE
Fig. 10: Memory Footprint Evaluation
Fig. 11: YCSB Performance Evaluation on SDSC-Comet-IB-FDR Cluster: (a) YCSB Workload A; (b) YCSB Workload B; (c) YCSB Workload C

We run 25K operations per client with a varying number of clients on a 36-node cluster with 4 HBase RegionServers and multiple client nodes on SDSC-Comet-IB-FDR. We perform these experiments over the IPoIB and native IB transport channels. Figure 11 shows that the native IB based transport protocols (RC/UD/Hybrid) always perform better than the IPoIB transport. For the write-intensive YCSB workload A, with a read-to-update operation ratio of 50:50, we observe up to 2.4x improvement in throughput with our design over IPoIB. For the read-intensive YCSB workload B, with a read-to-update ratio of 95:5, we observe up to 2.65x improvement in the overall throughput of HBase with our design, as compared to the IPoIB transport. For the read-only YCSB workload C, with 100% reads, we observe up to 3.6x improvement in the overall throughput of HBase with our design, as compared to the IPoIB transport.

VI. CONCLUSIONS

In this paper, we proposed network architecture-aware and multi-protocol (RC, UD, and Hybrid) based transport schemes to accelerate Hadoop RPC and HBase communication. We performed extensive performance optimization on different low-level communication parameters to determine the optimal values for achieving the best performance. Based on these, we presented a comprehensive performance analysis of our proposed design using Hadoop RPC micro-benchmarks and application benchmarks on five bare-metal IB/RoCE clusters and one SR-IOV-enabled IB cluster.

Performance evaluations reveal that our native IB/RoCE-based design with RC gives the best performance in terms of latency and throughput for Hadoop RPC and HBase communication. IB/RoCE with RC can achieve up to 12.5x speedup compared to IPoIB for Hadoop RPC communication. For HBase, IB/RoCE with RC can improve the throughput by up to 3.6x over IPoIB. These results show the advantages of the eager and RDMA-based communication schemes with the RC protocol. IB/RoCE with UD can perform faster than TCP/IP over 40GigE or IPoIB with RC/UD, which shows the benefits of the proposed eager, packetized, and zero-copy message transfer designs for UD. IB/RoCE with the hybrid RC-UD design can achieve performance similar to IB/RoCE RC while maintaining a steady memory footprint like that of UD. We can also see that IB EDR provides the best Hadoop RPC latency and throughput. 40G-RoCE and IB FDR have similar performance, while 40GigE and IB QDR perform similarly. In the virtualized cloud environment, SR-IOV demonstrates considerable overhead compared to the bare-metal environment for small message sizes in terms of Hadoop RPC latency. However, the overhead is reduced to around 6% for large message sizes. This makes us believe that SR-IOV could be suitable for the large-size I/O operations that are common in many Big Data middleware components. For future work, we will study the impact of other transport protocols (like XRC and DC) on the performance of Hadoop and Spark components. We will also evaluate more real applications on different networks and protocols.

REFERENCES
[1] Chameleon Cloud.
[2] InfiniBand Trade Association, "InfiniBand Architecture Specification, Volume 1."
[3] IP over InfiniBand Working Group, org/html.charters/ipoib-charter.html.
[4] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, "Benchmarking Cloud Serving Systems with YCSB," in Proceedings of the ACM Symposium on Cloud Computing (SoCC), Indianapolis, Indiana, June 2010.
[5] Google Protocol Buffers.
[6] J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, C. Murthy, and D. K. Panda, "High-Performance Design of HBase with RDMA over InfiniBand," in Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium (IPDPS), Shanghai, China, May 2012.
[7] International Data Corporation (IDC), "New IDC Worldwide HPC End-User Study Identifies Latest Trends in High Performance Computing Usage and Spending."
[8] N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, "High Performance RDMA-based Design of HDFS over InfiniBand," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Salt Lake City, November 2012.
[9] J. Jose, M. Li, X. Lu, K. C. Kandalla, M. D. Arnold, and D. K. Panda, "SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience," in 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2013.
[10] J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. S. Islam, X. Ouyang, H. Wang, S. Sur, and D. K. Panda, "Memcached Design on High Performance RDMA Capable Interconnects," in International Conference on Parallel Processing (ICPP), September 2011.
[11] J. Jose, H. Subramoni, K. Kandalla, M. Wasi-ur-Rahman, H. Wang, S. Narravula, and D. K. Panda, "Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transports," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '12), Washington, DC, USA, 2012.
[12] M. J. Koop, S. Sur, Q. Gao, and D. K. Panda, "High Performance MPI Design using Unreliable Datagram for Ultra-scale InfiniBand Clusters," in ICS '07: Proceedings of the 21st Annual International Conference on Supercomputing, New York, NY, USA: ACM, 2007.
[13] X. Lu, N. S. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, "High-Performance Design of Hadoop RPC with RDMA over InfiniBand," in Proceedings of the IEEE 42nd International Conference on Parallel Processing (ICPP), France, October 2013.
[14] X. Lu, F. Liang, B. Wang, L. Zha, and Z. Xu, "DataMPI: Extending MPI to Hadoop-Like Big Data Computing," in Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS '14), Washington, DC, USA, 2014.
[15] X. Lu, M. W. U. Rahman, N. Islam, D. Shankar, and D. K. Panda, "Accelerating Spark with RDMA for Big Data Processing: Early Experiences," in Proceedings of the 2014 IEEE 22nd Annual Symposium on High-Performance Interconnects (HOTI '14), Washington, DC, USA, 2014.
[16] OSU NBC Lab, High-Performance Big Data (HiBD).
[17] PCI-SIG, "An Introduction to SR-IOV Technology," 2013.
[18] M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, "HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects," in International Conference on Supercomputing (ICS), Munich, Germany, 2014.
[19] I. Research, "Cloud Computing in HPC: Rationale for Adoption."
[20] P. Stuedi, A. Trivedi, B. Metzler, and J. Pfefferle, "DaRPC: Data Center RPC," in Proceedings of the ACM Symposium on Cloud Computing (SoCC '14), New York, NY, USA: ACM, 2014, pp. 15:1-15:13.
[21] Y. Wang, X. Que, W. Yu, D. Goldenberg, and D. Sehgal, "Hadoop Acceleration through Network Levitated Merge," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Seattle, WA, November 2011.
[22] M. Welsh, D. Culler, and E.
Brewer, "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services," in Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP), Banff, Alberta, Canada, 2001.
[23] J. Zhang, X. Lu, M. Arnold, and D. K. Panda, "MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds," in 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2015.


More information

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Talk at HPCAC Stanford Conference (Feb 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning 5th ANNUAL WORKSHOP 209 Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning Hari Subramoni Dhabaleswar K. (DK) Panda The Ohio State University The Ohio State University E-mail:

More information

Unified Runtime for PGAS and MPI over OFED

Unified Runtime for PGAS and MPI over OFED Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction

More information

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits Ashish Kumar Singh, Sreeram Potluri, Hao Wang, Krishna Kandalla, Sayantan Sur, and Dhabaleswar K. Panda Network-Based

More information

Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems

Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

More information

RDMA for Memcached User Guide

RDMA for Memcached User Guide 0.9.5 User Guide HIGH-PERFORMANCE BIG DATA TEAM http://hibd.cse.ohio-state.edu NETWORK-BASED COMPUTING LABORATORY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING THE OHIO STATE UNIVERSITY Copyright (c)

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda Department of Computer Science and Engineering

More information

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics 1 Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics Mingzhe Li Sreeram Potluri Khaled Hamidouche Jithin Jose Dhabaleswar K. Panda Network-Based Computing Laboratory Department

More information

Performance Evaluation of InfiniBand with PCI Express

Performance Evaluation of InfiniBand with PCI Express Performance Evaluation of InfiniBand with PCI Express Jiuxing Liu Amith Mamidala Abhinav Vishnu Dhabaleswar K Panda Department of Computer and Science and Engineering The Ohio State University Columbus,

More information

EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures

EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures Haiyang Shi, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda {shi.876, lu.932, panda.2}@osu.edu The Ohio State University

More information

Performance Benefits of DataMPI: A Case Study with BigDataBench

Performance Benefits of DataMPI: A Case Study with BigDataBench Benefits of DataMPI: A Case Study with BigDataBench Fan Liang 1,2 Chen Feng 1,2 Xiaoyi Lu 3 Zhiwei Xu 1 1 Institute of Computing Technology, Chinese Academy of Sciences 2 University of Chinese Academy

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

High Performance MPI on IBM 12x InfiniBand Architecture

High Performance MPI on IBM 12x InfiniBand Architecture High Performance MPI on IBM 12x InfiniBand Architecture Abhinav Vishnu, Brad Benton 1 and Dhabaleswar K. Panda {vishnu, panda} @ cse.ohio-state.edu {brad.benton}@us.ibm.com 1 1 Presentation Road-Map Introduction

More information

Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet

Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet WHITE PAPER Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet Contents Background... 2 The MapR Distribution... 2 Mellanox Ethernet Solution... 3 Test

More information

Mellanox Technologies Maximize Cluster Performance and Productivity. Gilad Shainer, October, 2007

Mellanox Technologies Maximize Cluster Performance and Productivity. Gilad Shainer, October, 2007 Mellanox Technologies Maximize Cluster Performance and Productivity Gilad Shainer, shainer@mellanox.com October, 27 Mellanox Technologies Hardware OEMs Servers And Blades Applications End-Users Enterprise

More information

Exploiting HPC Technologies for Accelerating Big Data Processing and Storage

Exploiting HPC Technologies for Accelerating Big Data Processing and Storage Exploiting HPC Technologies for Accelerating Big Data Processing and Storage Talk in the 5194 class by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi

More information

A Case for High Performance Computing with Virtual Machines

A Case for High Performance Computing with Virtual Machines A Case for High Performance Computing with Virtual Machines Wei Huang*, Jiuxing Liu +, Bulent Abali +, and Dhabaleswar K. Panda* *The Ohio State University +IBM T. J. Waston Research Center Presentation

More information

Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies

Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies Talk at Intel HPC Developer Conference 2017 (SC 17) by Dhabaleswar K. (DK) Panda The Ohio State University

More information

Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds

Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds 216 IEEE 8th International Conference on Cloud Computing Technology and Science Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds

More information

Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR

Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Presentation at Mellanox Theater () Dhabaleswar K. (DK) Panda - The Ohio State University panda@cse.ohio-state.edu Outline Communication

More information

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Talk at Bench 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi

More information

Memory Management Strategies for Data Serving with RDMA

Memory Management Strategies for Data Serving with RDMA Memory Management Strategies for Data Serving with RDMA Dennis Dalessandro and Pete Wyckoff (presenting) Ohio Supercomputer Center {dennis,pw}@osc.edu HotI'07 23 August 2007 Motivation Increasing demands

More information

Accelerating and Benchmarking Big Data Processing on Modern Clusters

Accelerating and Benchmarking Big Data Processing on Modern Clusters Accelerating and Benchmarking Big Data Processing on Modern Clusters Keynote Talk at BPOE-6 (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Hao Wang, Jian Huang, Dhabaleswar K. Panda Department of Computer

More information

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi

More information

Application Acceleration Beyond Flash Storage

Application Acceleration Beyond Flash Storage Application Acceleration Beyond Flash Storage Session 303C Mellanox Technologies Flash Memory Summit July 2014 Accelerating Applications, Step-by-Step First Steps Make compute fast Moore s Law Make storage

More information

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

The Exascale Architecture

The Exascale Architecture The Exascale Architecture Richard Graham HPC Advisory Council China 2013 Overview Programming-model challenges for Exascale Challenges for scaling MPI to Exascale InfiniBand enhancements Dynamically Connected

More information

Introduction to Infiniband

Introduction to Infiniband Introduction to Infiniband FRNOG 22, April 4 th 2014 Yael Shenhav, Sr. Director of EMEA, APAC FAE, Application Engineering The InfiniBand Architecture Industry standard defined by the InfiniBand Trade

More information

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications GPU Technology Conference GTC 2016 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G

10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G 10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G Mohammad J. Rashti and Ahmad Afsahi Queen s University Kingston, ON, Canada 2007 Workshop on Communication Architectures

More information

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda Network Based Computing Laboratory

More information

Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani

Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani The Ohio State University E-mail: gugnani.2@osu.edu http://web.cse.ohio-state.edu/~gugnani/ Network Based Computing Laboratory SC 17 2 Neuroscience:

More information

High Performance Computing

High Performance Computing High Performance Computing Dror Goldenberg, HPCAC Switzerland Conference March 2015 End-to-End Interconnect Solutions for All Platforms Highest Performance and Scalability for X86, Power, GPU, ARM and

More information

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Design and Evaluation of Benchmarks for Financial Applications using Advanced Message Queuing Protocol (AMQP) over InfiniBand

Design and Evaluation of Benchmarks for Financial Applications using Advanced Message Queuing Protocol (AMQP) over InfiniBand Design and Evaluation of Benchmarks for Financial Applications using Advanced Message Queuing Protocol (AMQP) over InfiniBand Hari Subramoni, Gregory Marsh, Sundeep Narravula, Ping Lai, and Dhabaleswar

More information

Performance Evaluation of InfiniBand with PCI Express

Performance Evaluation of InfiniBand with PCI Express Performance Evaluation of InfiniBand with PCI Express Jiuxing Liu Server Technology Group IBM T. J. Watson Research Center Yorktown Heights, NY 1598 jl@us.ibm.com Amith Mamidala, Abhinav Vishnu, and Dhabaleswar

More information

Accelerating and Benchmarking Big Data Processing on Modern Clusters

Accelerating and Benchmarking Big Data Processing on Modern Clusters Accelerating and Benchmarking Big Data Processing on Modern Clusters Open RG Big Data Webinar (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Comparing Ethernet & Soft RoCE over 1 Gigabit Ethernet

Comparing Ethernet & Soft RoCE over 1 Gigabit Ethernet Comparing Ethernet & Soft RoCE over 1 Gigabit Ethernet Gurkirat Kaur, Manoj Kumar 1, Manju Bala 2 1 Department of Computer Science & Engineering, CTIEMT Jalandhar, Punjab, India 2 Department of Electronics

More information

Performance Considerations of Network Functions Virtualization using Containers

Performance Considerations of Network Functions Virtualization using Containers Performance Considerations of Network Functions Virtualization using Containers Jason Anderson, et al. (Clemson University) 2016 International Conference on Computing, Networking and Communications, Internet

More information

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Cooperative VM Migration for a virtualized HPC Cluster with VMM-bypass I/O devices

Cooperative VM Migration for a virtualized HPC Cluster with VMM-bypass I/O devices Cooperative VM Migration for a virtualized HPC Cluster with VMM-bypass I/O devices Ryousei Takano, Hidemoto Nakada, Takahiro Hirofuchi, Yoshio Tanaka, and Tomohiro Kudoh Information Technology Research

More information

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao 1 Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao and Shigeru Chiba The University of Tokyo Thanh-Chung Dao 2 Supercomputers Expensive clusters Multi-core

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

LS-DYNA Performance Benchmark and Profiling. April 2015

LS-DYNA Performance Benchmark and Profiling. April 2015 LS-DYNA Performance Benchmark and Profiling April 2015 2 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell, Mellanox Compute resource

More information

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Future Routing Schemes in Petascale clusters

Future Routing Schemes in Petascale clusters Future Routing Schemes in Petascale clusters Gilad Shainer, Mellanox, USA Ola Torudbakken, Sun Microsystems, Norway Richard Graham, Oak Ridge National Laboratory, USA Birds of a Feather Presentation Abstract

More information

In-Network Computing. Sebastian Kalcher, Senior System Engineer HPC. May 2017

In-Network Computing. Sebastian Kalcher, Senior System Engineer HPC. May 2017 In-Network Computing Sebastian Kalcher, Senior System Engineer HPC May 2017 Exponential Data Growth The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait

More information

RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits

RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits Sayantan Sur Hyun-Wook Jin Lei Chai D. K. Panda Network Based Computing Lab, The Ohio State University Presentation

More information