Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase


2016 IEEE 8th International Conference on Cloud Computing Technology and Science (CloudCom 2016)

Xiaoyi Lu, Dipti Shankar, Shashank Gugnani, Hari Subramoni, and Dhabaleswar K. (DK) Panda
Department of Computer Science and Engineering, The Ohio State University
Email: {lu.9, shankar.5, gugnani.2, subramoni.1, ...}

Abstract: The performance of Hadoop components can be significantly improved by leveraging advanced features such as Remote Direct Memory Access (RDMA) on modern HPC clusters, where high-performance networks like InfiniBand (IB) and RoCE have been deployed widely. With the emergence of high-performance computing in the cloud (HPC Cloud), high-performance networks have paved their way into the cloud with the recently introduced Single Root I/O Virtualization (SR-IOV) technology. With these advancements in HPC Cloud networking technologies, it is high time to investigate the design opportunities and impact of networking architectures (different generations of IB, 40GigE, 40G-RoCE) and protocols (TCP/IP, IPoIB, RC, UD, Hybrid) in accelerating Hadoop components over high-performance networks. In this paper, we propose a network architecture and multi-protocol aware Hadoop RPC design that can take advantage of the RC and UD protocols for IB and RoCE. A hybrid transport design with RC and UD is proposed, which can deliver both memory scalability and performance for Hadoop RPC. We present a comprehensive performance analysis on five bare-metal IB/RoCE clusters and one SR-IOV enabled cluster in the Chameleon Cloud. Our performance evaluations reveal that our proposed designs can achieve up to 12.5x performance improvement for Hadoop RPC over IPoIB. Further, we integrate our RPC engine into Apache HBase and demonstrate that we can accelerate YCSB workloads by up to 3.6x. Other insightful observations on the performance characteristics of different HPC Cloud networking technologies are also shared in this paper.

I. INTRODUCTION

Recent worldwide studies by IDC [7] reported that 67% of HPC centers perform high-performance data analytics (HPDA). This necessitates high-performance designs of Big Data middleware like Apache Hadoop for modern HPC clusters. As a distributed processing framework, Hadoop involves a large amount of data movement within and between its components, including HDFS, MapReduce, YARN, and HBase. This makes the performance of Hadoop components sensitive to the underlying networking technologies. High-performance networking technologies such as InfiniBand (IB), 10/40GigE, and RDMA over Converged Ethernet (RoCE) have gained importance for designing high-performance computing (HPC) clusters. In addition to providing high bandwidth and low latency, IB and RoCE provide advanced features like Remote Direct Memory Access (RDMA) and different communication protocols, including Reliable Connection (RC) and Unreliable Datagram (UD), that have different performance and memory characteristics. With the emergence of high-performance computing in the cloud (HPC Cloud), approximately 35% of HPC today is being done in the cloud environment [19].

This research is supported in part by National Science Foundation grants #CNS, #IIS, and #CNS. It used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant #OCI.
Thus, advanced networking technologies such as IB, 10/40GigE, and RoCE have paved their way into the cloud with the recently introduced Single Root I/O Virtualization (SR-IOV) [17] technology, in OpenStack-based HPC clouds such as Chameleon [1] as well as in Amazon EC2. Prior research has shown that communication libraries such as MPI can deliver near bare-metal performance on SR-IOV enabled InfiniBand clusters [9, 23]. With these advancements in HPC Cloud networking technologies, it is high time to investigate the design opportunities and impact of networking architectures and protocols on accelerating Hadoop components over high-performance networks.

A. Motivation

As shown in Figure 1, there are three dimensions that need to be considered for designing high-performance communication libraries for Big Data middleware. The first dimension, network architecture, covers the various HPC cloud networking technologies, such as the different generations of IB, including IB QDR (40 Gbps), FDR (56 Gbps), and EDR (100 Gbps), and the 40 Gig Ethernet/RoCE interconnects. Based on the network interconnect available in the cluster, different high-performance network protocols, such as TCP/IP over Ethernet or IB (IPoIB) and RC/UD over IB/RoCE, constitute the second dimension. The third dimension spans the type of available HPC cloud environments: bare-metal nodes and virtualized nodes with SR-IOV support.

Fig. 1: HPC Cloud Networking Technologies

Recent studies [6, 8, 10, 13, 15, 18, 21] have shed light on the possible performance improvements for different Big Data computing middleware by taking advantage of RDMA over IB with the RC protocol. They do not take into account how one needs to adapt the design based on the cost-performance characteristics of the underlying high-performance interconnect protocols, nor how the performance and memory scalability of the proposed solutions can be enhanced further by taking advantage of a variety of modern transport protocols.

Another study [11] shows the trade-offs between RC and UD on IB for designing a high-performance key-value store, but it does not exploit the benefits of the latest networking technologies like 40G-RoCE. Moreover, these recent studies have not yet explored the impact of these networking technologies on accelerating communication in Hadoop components on modern virtualized SR-IOV-based HPC clouds. These issues lead us to the following broad challenge: Do existing designs of Hadoop components over InfiniBand need to be made aware of the underlying architectural trends and take advantage of the support for modern transport protocols that InfiniBand and RoCE provide?

B. Contributions

In this paper, we take up the broad challenge of designing a novel multi-protocol (RC, UD, and hybrid protocols over IB and RoCE) based, high-performance, and scalable Hadoop RPC engine, which is a fundamental communication mechanism in the Hadoop ecosystem. To illustrate its impact and benefits, we further integrate this design into Apache HBase and evaluate it with Yahoo! YCSB [4] application benchmarks. The major contributions of this paper are:
(1) Propose multi-protocol (RC, UD, and Hybrid) based and architecture-aware designs over InfiniBand and RoCE for Hadoop RPC based communication, which further accelerate the HBase middleware.
(2) Study the impact of different architectures of high-performance interconnects (IB QDR, FDR, and EDR; 40 Gig Ethernet; and RoCE) and protocols (TCP/IP, IPoIB, RC, UD, Hybrid) on the performance of Hadoop RPC and HBase.
(3) Conduct extensive evaluations with the proposed designs on five bare-metal IB/RoCE clusters and one SR-IOV enabled IB cluster in an HPC cloud.

Performance evaluation shows that our proposed design can improve the Hadoop RPC throughput by up to 12.5x compared to running default Hadoop RPC over the IPoIB UD transport protocol. Further, our scalability analysis reveals that our proposed design with the UD protocol achieves good memory scalability. Our RC and UD based hybrid communication scheme can deliver performance comparable to that of RC, while maintaining a steady memory footprint like that of UD. The performance comparison on the Chameleon Cloud shows that Hadoop RPC running over SR-IOV has some latency overhead compared to native IB FDR for small messages, while the gap becomes very small (around 6%) for large messages. The integrated design with HBase illustrates that our proposed designs can accelerate HBase set/get operations. We demonstrate that our RPC designs can achieve up to 3.6x improvement in HBase throughput for YCSB A, B, and C workloads, as compared to default HBase running with IPoIB over 36 nodes on SDSC Comet. To the best of our knowledge, this is the first paper to study the impact of HPC Cloud networking technologies (architectures and protocols) on the design of Big Data processing middleware such as Hadoop RPC and HBase.

The rest of the paper is organized as follows. Section II presents the necessary background and related work. Section III presents our proposed design, and Section IV describes network architecture-aware performance optimization for the protocols. Section V describes our detailed evaluation. Finally, we conclude in Section VI with a description of possible future work.

II. BACKGROUND AND RELATED WORK

In this section, we provide the necessary background information for this paper and related works in the literature.

InfiniBand and RoCE: The InfiniBand Architecture [2] defines a switched network fabric for interconnecting processing and I/O nodes, using a queue-based model.
It supports two communication semantics: Channel Semantics (Send-Receive communication) over RC and UD, and Memory Semantics (RDMA communication) over RC. Both semantics can perform zero-copy transfers from source to destination buffers without additional host-level memory copies. RC is connection-oriented and requires a dedicated queue pair (QP) for each destination process, while the connection-less UD transport uses a single QP for all. InfiniBand adapters can also be used to run applications that use TCP/IP over the InfiniBand network in the IP-over-InfiniBand (IPoIB) mode [3]. Leveraging these transport services has been demonstrated to be beneficial for several applications that are commonly run on HPC clusters. For example, Koop et al. [12] proposed designs in the MPI library that take advantage of the connection-less UD transport to reduce the memory usage for connections while maintaining the same performance as that of RC. RoCE allows RDMA over high-performance Ethernet networks, and can work with both the RC and UD protocols.

SR-IOV: SR-IOV is an industry standard which specifies native I/O virtualization capabilities in PCI Express (PCIe) adapters. It presents a PCIe device as multiple virtual devices, and each virtual device can be dedicated to a single VM. It has been demonstrated that SR-IOV performs significantly better than other software-based I/O virtualization solutions [9]. InfiniBand can support SR-IOV in HPC clouds.

Accelerating Big Data Middleware on HPC Clusters: With the increased need to design Big Data middleware that can deliver high performance on modern HPC clusters, several studies have been devoted to exploring the advantages of re-designing Hadoop middleware components to be RDMA-aware. For instance, Islam et al. proposed RDMA-enhanced designs [8] to improve HDFS performance on InfiniBand clusters. Similarly, [18, 21] proposed high-performance data shuffling using RDMA in MapReduce for InfiniBand clusters. Recent studies [6, 13, 20] have also shown that RPC and HBase can be improved significantly by leveraging the advanced features offered by high-performance networks. High-performance communication schemes that take advantage of MPI for improving the performance of RPC and Hadoop-like Big Data processing workloads were proposed in [14].

III. PROPOSED DESIGN

In this section, we describe the network architecture-aware and multi-protocol aware design for Hadoop RPC and HBase.

A. Hadoop RPC Architecture Overview

Figure 2 depicts a high-level overview of the two Hadoop RPC designs: (1) the default Hadoop RPC running the IP protocol over InfiniBand (IPoIB) or TCP/IP over 40 GigE, and (2) our proposed high-performance and hybrid Hadoop RPC engine. The Java-based Hadoop RPC system supports two serialization formats: (1) Writable serialization, which implements a simple and efficient serialization protocol based on the DataInput and DataOutput interfaces in Java, and (2) Protocol Buffers [5], which provide a fast and flexible mechanism for serializing user-defined structured data objects.
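To make the Writable path concrete, the following minimal sketch shows how a small RPC payload type can implement Hadoop's Writable interface (org.apache.hadoop.io.Writable); the class name and field layout are illustrative only and are not taken from the Hadoop code base.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Illustrative RPC payload serialized with Hadoop's Writable interface.
// The field layout is hypothetical; real Hadoop RPC requests carry the
// method name and call arguments in a comparable, length-prefixed form.
public class ExampleRpcPayload implements Writable {
  private String methodName;
  private byte[] args;

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(methodName);   // method to invoke on the server
    out.writeInt(args.length);  // length-prefixed argument blob
    out.write(args);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    methodName = in.readUTF();
    args = new byte[in.readInt()];
    in.readFully(args);
  }
}

Either this Writable path or the Protocol Buffers path produces a byte stream that the RPC engine described next moves across the network.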

From Figure 2, we can see that default Hadoop employs a Java NIO-based RPC engine that interfaces with the Java Sockets library running over IPoIB. In contrast, our proposed RPC engine employs a re-designed communication sub-system that enables Hadoop components to use native InfiniBand/RoCE primitives directly [13]. Based on the RPC design in [13], in this paper, we design a customized, scalable Hadoop RPC plugin module. Our design has the following main components.

SEDA-based Thread Management: The Staged Event-Driven Architecture (SEDA) [22] is widely used for designing high-throughput systems; its basic principle is to decompose a complex processing logic into a set of stages that are connected via event-based queues. We leverage our previous efforts on improving RPC performance using the SEDA approach [13, 15] to design a network architecture-aware RPC engine.

Offloading Data Transfers to the Native IB-/RoCE-based RPC Engine: As illustrated in Figure 2, our proposed native IB-/RoCE-based RPC engine can leverage RC, UD, and hybrid RC-UD designs to communicate RPC call messages in an efficient manner, over both native InfiniBand clusters (IB QDR/FDR/EDR) and RoCE over 40 GigE. All our designs can run with SR-IOV over IB/RoCE networks. The native IB-/RoCE-based RPC engine can be leveraged by both the Writable and Protobuf serialization/de-serialization mechanisms. We provide details of our design over RC in Section III-B, our design over UD in Section III-C, and how we can leverage these different transport protocols in a hybrid fashion in Section III-D.

Fig. 2: Overview of Proposed Hadoop RPC and HBase Design

B. Hadoop RPC Design over RC

When the client issues an RPC call, the sender object that comprises the RPC method information and the calling arguments is passed to the high-performance RPC plugin. The RPC client serializes the sender object and passes it on to the underlying native IB/RoCE RPC engine through the Java Native Interface (JNI), using a direct Java byte buffer (ByteBuffer). A message is created with the serialized data object, and a corresponding message header is created. Since RC is a connection-oriented service that guarantees reliable delivery and provides RDMA capabilities, RPC over the RC channel adopts different schemes based on the payload size:

(a) Eager protocol for small messages (Eager): For short message transfers, wherein the payload size is smaller than the eager threshold, the header and message are copied directly into the network buffer. Using IB send primitives with the eager protocol over the RC channel, both the header and the message are communicated to the server side in one network operation. On the server side, the native IB/RoCE RPC engine copies the received message into a JNI ByteBuffer allocated by the RPC server. This is illustrated in Figure 3(a).

(b) Rendezvous protocol for large messages (Rendezvous): For large messages, wherein the payload size is greater than the eager threshold, RDMA is leveraged for efficient message transfers. The header information and the client-side buffer location are sent in the form of a Request-To-Send (RTS) message to the server side. On receipt of the header, the native IB/RoCE RPC engine on the server obtains a direct byte buffer from the JNI layer. With this as the destination buffer, the RPC engine issues an RDMA read operation by specifying the remote client buffer as the source. Compared to the eager protocol for small messages, the rendezvous protocol eliminates memory copies between client buffers and pre-allocated network buffers at both the sender and the receiver side. This is illustrated in Figure 3(b).

For both short and large messages, the RPC server deserializes the message in the JNI ByteBuffer into the sender object.

Fig. 3: Hadoop RPC Design with RC: (a) Eager Transfer; (b) Zero-Copy RDMA
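The following is a minimal Java sketch of the payload-size-based dispatch described above for the RC channel. The names NativeRcChannel, sendEager, sendRendezvousRts, and registerForRdma are hypothetical stand-ins for the native engine reached through JNI, not the actual implementation.

import java.nio.ByteBuffer;

// Sketch of the RC-channel send path: copy small payloads into a
// pre-registered network buffer (eager), or publish an RTS so the server
// can RDMA-read large payloads directly from the client buffer (rendezvous).
final class RcSendPath {
  private final int eagerThreshold;   // tuned per network architecture (Section IV)
  private final NativeRcChannel channel;

  RcSendPath(NativeRcChannel channel, int eagerThreshold) {
    this.channel = channel;
    this.eagerThreshold = eagerThreshold;
  }

  void send(ByteBuffer header, ByteBuffer payload) {
    if (payload.remaining() <= eagerThreshold) {
      // Eager: header and payload travel in one send on the RC queue pair.
      channel.sendEager(header, payload);
    } else {
      // Rendezvous: advertise the registered client buffer; the server
      // issues an RDMA read into a JNI direct ByteBuffer, avoiding copies.
      channel.sendRendezvousRts(header, channel.registerForRdma(payload));
    }
  }
}

// Hypothetical interface standing in for the native IB/RoCE engine.
interface NativeRcChannel {
  void sendEager(ByteBuffer header, ByteBuffer payload);
  void sendRendezvousRts(ByteBuffer header, long remoteBufferHandle);
  long registerForRdma(ByteBuffer payload);
}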
C. Hadoop RPC Design over UD

The UD transport provides fast but unreliable service, making it mandatory to explicitly handle reliability and flow control within the native IB/RoCE RPC engine. To facilitate flow control, we leverage the traditional sliding window protocol as implemented in [11]. Using the sliding window to represent the upper bound on the number of unacknowledged messages, the native IB/RoCE RPC engine keeps track of ongoing message transfers. Explicit acknowledgment messages are sent by the receiver for each successful full message receipt. A timeout counter is also assigned to each end-point channel. All unacknowledged messages within an expired end-point channel are re-transmitted to enable reliable data transfer. Similar to the RC-based design, the RPC client passes the serialized Java sender object to the native IB-/RoCE-based RPC engine via JNI using direct byte buffers. A message is created with the serialized data object, and a corresponding message header is created. The message transfer mechanism is chosen based on the payload size (a sliding-window sketch follows this list):

(a) Direct transfer of small messages (under the Maximum Transmission Unit, MTU): For fast short message transfers, the header and message are copied directly into a pre-allocated network buffer. Using IB send primitives with the eager protocol over the UD channel, the header and the message are communicated to the server side. On the server side, the native IB/RoCE RPC engine copies the received message into a JNI ByteBuffer allocated by the RPC server. This is illustrated in Figure 4(a).

(b) Packetized transfer for medium messages (PKTZED): For medium-sized messages, we employ a packetized data transfer mechanism. The message is divided into smaller chunks (less than the MTU) and communicated to the RPC server using IB send over the UD channel. Each chunk header contains a sequence identifier to help order the message chunks on the server. At the server side, the chunked messages are received into pre-allocated buffers and re-assembled into the original message. The re-assembled message is copied into the direct byte buffer and passed to the JNI layer, as illustrated in Figure 4(b).

(c) Zero-copy transfer for large messages (ZCP): For large message sizes, the header information and the message size information are sent in the form of a Request-To-Send (RTS) message to the server side. On receipt of the header, the native IB/RoCE RPC engine on the server obtains a direct byte buffer of the appropriate size from the JNI layer. Through this negotiation, the receiver side can guarantee that the receive buffer is ready for receiving data. Then, similar to medium-sized messages, the packetized data send can directly transfer the message in smaller MTU-sized chunks without copying into internal intermediate buffers. On receipt, the message encapsulated in each chunk is directly saved into the allocated JNI byte buffer. Once the entire message has been received into the direct byte buffer, it is passed on to the RPC server plugin via JNI. Memory copies with large buffers are very expensive operations, so the zero-copy transfer scheme is designed to avoid the copy overhead for large messages. This is illustrated in Figure 4(c).

For all message sizes, the RPC server deserializes the message in the JNI direct byte buffer to obtain the sender object and invokes the RPC method with it.

Fig. 4: Hadoop RPC Design with UD: (a) Eager Transfer; (b) Packetized Transfer; (c) Zero-Copy Transfer
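A minimal sketch of the sliding-window reliability logic assumed above is given below. The window size, timeout handling, and all identifiers (UdReliableEndpoint, UdChannel, postSend) are illustrative assumptions rather than the engine's actual code.

import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of per-endpoint reliability over the connection-less UD channel:
// at most `windowSize` unacknowledged messages are in flight, the receiver
// ACKs each fully received message, and unacknowledged messages are
// retransmitted after a timeout.
final class UdReliableEndpoint {
  private final int windowSize;                 // sliding-window bound
  private final long timeoutNanos;              // retransmission timeout
  private final Map<Long, PendingSend> inFlight = new ConcurrentHashMap<>();
  private long nextSeq = 0;

  UdReliableEndpoint(int windowSize, long timeoutNanos) {
    this.windowSize = windowSize;
    this.timeoutNanos = timeoutNanos;
  }

  synchronized boolean trySend(ByteBuffer chunk, UdChannel channel) {
    if (inFlight.size() >= windowSize) {
      return false;                             // window full: caller retries later
    }
    long seq = nextSeq++;
    inFlight.put(seq, new PendingSend(chunk.duplicate(), System.nanoTime()));
    channel.postSend(seq, chunk);               // MTU-sized UD send
    return true;
  }

  void onAck(long seq) {
    inFlight.remove(seq);                       // slide the window forward
  }

  void retransmitExpired(UdChannel channel) {
    long now = System.nanoTime();
    for (Map.Entry<Long, PendingSend> e : inFlight.entrySet()) {
      PendingSend p = e.getValue();
      if (now - p.sentAt > timeoutNanos) {
        channel.postSend(e.getKey(), p.data.duplicate());
        e.setValue(new PendingSend(p.data, now)); // reset the timeout clock
      }
    }
  }

  private static final class PendingSend {
    final ByteBuffer data;
    final long sentAt;
    PendingSend(ByteBuffer data, long sentAt) { this.data = data; this.sentAt = sentAt; }
  }
}

// Hypothetical stand-in for the native UD send primitive.
interface UdChannel {
  void postSend(long sequenceNumber, ByteBuffer chunk);
}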

D. Hadoop RPC over Multi-Protocol Aware Hybrid Transport

In order to leverage both the RC and UD approaches discussed above, we introduce a hybrid connection management mode that can exploit the advantages of both RC and UD. To start with, we enable fast and reliable communication by using RC, but the maximum number of RC connections is capped at a configurable threshold. For better scalability and to reduce the memory footprint, all further connections are made via the UD transport channel. In this way, the native IB/RoCE RPC engine performs channel selection in a transparent manner, without affecting the serialization and deserialization process of the sender Java objects at the upper level. A sketch of this selection logic is shown below.
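The following minimal sketch illustrates RC/UD selection under a configurable RC-connection cap. The names HybridChannelSelector and TransportType are assumptions; the real engine performs the equivalent bookkeeping in native code.

import java.util.concurrent.atomic.AtomicInteger;

// Sketch of hybrid connection management: the first `maxRcConnections`
// endpoints get a dedicated RC queue pair for lowest latency; once that
// cap is reached, further endpoints share the single UD queue pair so the
// memory footprint stays flat.
final class HybridChannelSelector {
  private final int maxRcConnections;           // configurable threshold
  private final AtomicInteger rcInUse = new AtomicInteger();

  HybridChannelSelector(int maxRcConnections) {
    this.maxRcConnections = maxRcConnections;
  }

  TransportType selectForNewConnection() {
    // Reserve an RC slot if one is available, otherwise fall back to UD.
    while (true) {
      int current = rcInUse.get();
      if (current >= maxRcConnections) {
        return TransportType.UD;
      }
      if (rcInUse.compareAndSet(current, current + 1)) {
        return TransportType.RC;
      }
    }
  }

  void onConnectionClosed(TransportType type) {
    if (type == TransportType.RC) {
      rcInUse.decrementAndGet();                // free the RC slot
    }
  }

  enum TransportType { RC, UD }
}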
E. Leveraging the Multi-Protocol Aware RPC Engine in HBase

While HDFS and YARN utilize Hadoop RPC to communicate between nodes or tasks, Hadoop components such as HBase utilize it for data transfers between client and server. Apache HBase is a distributed NoSQL database that runs on top of HDFS and is designed to host large tables with billions of entries. Hence, to illustrate the impact of the proposed network architecture-aware approach on Hadoop RPC-based middleware, we demonstrate how our proposed native IB/RoCE-based high-performance RPC engine can benefit data communication in HBase. Since Apache HBase uses a modified version of the Hadoop RPC protocol as the core communication library for its set and get operations, we integrate our enhanced Hadoop RPC native engine into Apache HBase. In this way, we provide a means for HBase applications to leverage the high performance made available by native IB/RoCE transport protocols on the latest HPC clouds.

IV. NETWORK ARCHITECTURE-AWARE PERFORMANCE OPTIMIZATION FOR PROTOCOLS

Experimental Setup: To evaluate the performance trends of Hadoop RPC and HBase over different generations of high-speed interconnects, we employ different clusters that are equipped with InfiniBand QDR/FDR/EDR HCAs and RoCE-enabled 40 GigE interconnects. Table I presents the different clusters used for our performance evaluations.

TABLE I: Cluster Configurations
RI-IB-QDR: Intel Westmere (2.67 GHz, 8 cores), Mellanox IB QDR HCA, RHEL 6.1
SDSC-Comet-IB-FDR: Intel Xeon E5-2680 v3 (24 cores), Mellanox IB FDR HCA, CentOS
RI2-IB-EDR: Intel Broadwell E5-2680 v4 (28 cores), Mellanox IB EDR HCA, CentOS 7.2
RI2-40G-RoCE: Intel Broadwell E5-2680 v4 (28 cores), Mellanox ConnectX-4 40GbE HCA, CentOS 7.2
Chameleon-FDR-Bare-Metal: Intel Haswell E5-2670 v3 (48 cores), Mellanox IB FDR HCA, CentOS 7
Chameleon-FDR-SR-IOV: Intel Haswell E5-2670 v3 (48 cores), Mellanox IB FDR HCA with SR-IOV, CentOS 7 with KVM-VMM

A. Performance Optimization in the RPC Engine

1) MTU-based tuning for both RC and UD: To optimize RC and UD performance, we first look at the IB MTU. A larger MTU can bring good efficiency because each packet carries more user data, leading to fewer packets to process. However, when the MTU is too large, it can lead to larger error rates and longer latency for user data that is very small. Thus, we tune the MTU to find the optimal value for different HCA architectures. For both the UD and RC channels, we use a ping-pong latency benchmark and run it for MTU values from 1024 to 4096 bytes. The experiment results with UD and RC on the RI2-IB-EDR, RI2-40G-RoCE, and Chameleon-FDR-SR-IOV clusters are shown in Figure 5(a), Figure 6(a), and Figure 7(a), respectively. For the 4 MB message size shown in Figure 5(a), RC-MTU-4K reduces the latency by 6% compared with RC-MTU-1K (67.9 us), while UD-MTU-4K reduces the latency by 12% compared with UD-MTU-1K. While the same trend is observed for 40G-RoCE, the optimal MTU for RC over SR-IOV is, however, 2K.

2) Eager-threshold tuning for RC: The eager threshold specifies the switch point between the eager and rendezvous protocols. If the threshold is too small, it can add overhead from exchanging control messages for setting up communication buffers at both sender and receiver. If it is too large, copying data to and from user buffers and pre-registered buffers at both the sender and the receiver can incur large overheads. Therefore, we need to optimize this parameter to find the optimal switch point. Since the eager threshold varies with different CPU and HCA configurations, we tune this parameter for different network interconnects by running the ping-pong latency benchmark with eager-threshold settings starting from 8 KB. The experiment results are shown in Figure 5(b), Figure 6(b), and Figure 7(b). Through the zoom-in figures, we can see the different performance trends with varied thresholds.
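Both tuning steps rely on a simple ping-pong exchange. The sketch below illustrates the kind of measurement loop involved, assuming a hypothetical blocking RpcClient.echo call; it is not the OSU HiBD (OHB) benchmark code itself.

// Sketch of a ping-pong loop used to pick tuned parameter values:
// send a fixed-size request, wait for the echo, and report the average
// round-trip latency over many iterations.
final class PingPongLatency {
  static double measureMicros(RpcClient client, byte[] payload, int warmup, int iterations) {
    for (int i = 0; i < warmup; i++) {
      client.echo(payload);                     // warm up caches and connections
    }
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      client.echo(payload);                     // blocking request/response
    }
    long elapsed = System.nanoTime() - start;
    return (elapsed / 1000.0) / iterations;     // average latency in microseconds
  }
}

// Hypothetical client handle exposing a blocking echo RPC.
interface RpcClient {
  byte[] echo(byte[] payload);
}

Running such a loop for each candidate MTU or threshold value produces latency curves like those shown in Figures 5 to 7.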

3) PKTZED/ZCP-based tuning for UD: The UD packet threshold specifies the switch point between packetized send/recv and zero-copy based transfers on UD. If the switch point is too small, the overhead of exchanging control messages for setting up communication buffers is added for some medium messages. If it is too large, large copy overheads due to copying data between the user's and pre-registered buffers can be incurred. Therefore, we optimize this parameter to find the optimal switch point. We measure the round-trip latency by varying the packet threshold up to 512 KB. The experiment results are shown in Figure 5(c), Figure 6(c), and Figure 7(c). Different performance trends with varied thresholds can be observed through the zoom-in figures.

Fig. 5: Performance Optimization on RI2-IB-EDR Cluster: (a) RC/UD MTU Tuning; (b) RC Eager Threshold Tuning; (c) UD PKTZED/ZCP Tuning
Fig. 6: Performance Optimization on RI2-40G-RoCE Cluster: (a) RC/UD MTU Tuning; (b) RC Eager Threshold Tuning; (c) UD PKTZED/ZCP Tuning
Fig. 7: Performance Optimization on Chameleon-FDR-SR-IOV Cluster: (a) RC/UD MTU Tuning; (b) RC Eager Threshold Tuning; (c) UD PKTZED/ZCP Tuning

TABLE II: Performance Optimization for Different Network Architectures. Tuned MTU: 4K on RI-IB-QDR, RI2-40G-RoCE, SDSC-Comet-IB-FDR, and RI2-IB-EDR; 2K/4K (RC) and 4K (UD) on Chameleon-FDR-Bare-Metal; 2K (RC) and 4K (UD) on Chameleon-FDR-SR-IOV. The RC eager threshold and UD packetized threshold are likewise tuned per cluster (e.g., a UD packetized threshold of 512K/1M on RI2-40G-RoCE).

The tuned parameters for all the clusters shown in Table I are summarized in Table II. From Table II, we can make the following important observations: (1) Chameleon-FDR-Bare-Metal and SDSC-Comet-IB-FDR have similar optimized values for the tuned parameters; these can be used for optimal performance on other FDR clusters, and (2) SR-IOV and bare-metal on the Chameleon cloud have similar values for parameters like the eager and packetized thresholds and the MTUs for RC and UD. It can also be noted that our designs can run over both RoCE RC and RoCE UD. Based on these findings, we design a module inside the RPC engine to perform automatic network architecture detection and then set the best parameters for the associated communication channels.
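A minimal sketch of such an architecture-detection module is given below. The MTU entries follow the tuned values summarized in Table II, while the eager and packetized threshold constants are placeholders only, since the exact tuned thresholds are cluster-specific; all class and enum names are assumptions.

import java.util.EnumMap;
import java.util.Map;

// Sketch of the module inside the RPC engine that maps the detected
// interconnect to tuned communication parameters. MTU choices follow
// Table II; the eager/packetized threshold constants below are purely
// illustrative placeholders, not the paper's tuned values.
final class ArchAwareTuner {
  enum Arch { IB_QDR, IB_FDR, IB_EDR, ROCE_40G, IB_FDR_SRIOV }

  static final class TunedParams {
    final int rcMtu, udMtu, eagerThreshold, pktzedThreshold;
    TunedParams(int rcMtu, int udMtu, int eagerThreshold, int pktzedThreshold) {
      this.rcMtu = rcMtu; this.udMtu = udMtu;
      this.eagerThreshold = eagerThreshold; this.pktzedThreshold = pktzedThreshold;
    }
  }

  private static final Map<Arch, TunedParams> TABLE = new EnumMap<>(Arch.class);
  static {
    int kb = 1024;
    // MTU per Table II; threshold values here are placeholders only.
    TABLE.put(Arch.IB_QDR,       new TunedParams(4 * kb, 4 * kb, 16 * kb, 64 * kb));
    TABLE.put(Arch.IB_FDR,       new TunedParams(4 * kb, 4 * kb, 16 * kb, 64 * kb));
    TABLE.put(Arch.IB_EDR,       new TunedParams(4 * kb, 4 * kb, 16 * kb, 64 * kb));
    TABLE.put(Arch.ROCE_40G,     new TunedParams(4 * kb, 4 * kb, 16 * kb, 512 * kb));
    TABLE.put(Arch.IB_FDR_SRIOV, new TunedParams(2 * kb, 4 * kb, 16 * kb, 64 * kb));
  }

  static TunedParams paramsFor(Arch detected) {
    return TABLE.get(detected);
  }
}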
V. PERFORMANCE EVALUATION

In this section, we present detailed performance evaluations of our native IB/RoCE-based design of Hadoop RPC and HBase. We compare the performance of our design with that of the default architecture over various IB/RoCE interconnects and protocols. We have used the OSU HiBD Benchmarks (OHB) [16] and the Yahoo! YCSB benchmarks for our evaluations. In all our experiments, we have used Apache Hadoop, HBase 1.1.2, and JDK.

A. Hadoop RPC Latency

Figure 8 depicts the ping-pong latency of Hadoop RPC communications. Switching between the RC and UD protocols for IPoIB requires administrator privileges and driver support; among the evaluated clusters, we can only switch this on RI-IB-QDR, and we use the default IPoIB protocols on the other clusters. From these graphs, we can see that, compared to the native IB-RC or IB-UD transport channels, the IPoIB-RC and IPoIB-UD transport channels do not achieve optimal performance. This is because, when RPC messages go through the IPoIB transport channels, the bottlenecks of memory allocations, copies, and adjustments in the default Hadoop RPC design suppress the benefits of high-performance networks. As shown in Figure 8(a), for a representative small payload size, the latency of the IPoIB-RC transport channel is 74 us, whereas the latency of the native IB-RC transport channel is 35 us, about a 2.1x speedup. For the same payload size with UD, the IPoIB-UD and IB-UD latencies are 75.1 us and 35.1 us, respectively; the native IB-UD transport channel reduces the latency by 53%. The same trend can be seen from Figure 8(b) to Figure 8(f) as well. IB/RoCE with RC always gives the best performance in terms of latency, mainly because it takes advantage of the RDMA-based design. IB/RoCE with UD can perform faster than TCP/IP over 40 Gig Ethernet or IPoIB with RC/UD, mainly because of the proposed eager, packetized, and zero-copy message transfer designs for UD. IB/RoCE with the hybrid RC-UD design can achieve performance similar to IB/RoCE RC. For Hadoop RPC latency, as the networking technologies evolve, we see that IB EDR has the best performance for all the cases. 40G-RoCE and IB FDR have similar performance, while 40 Gig Ethernet and IB QDR perform similarly. SR-IOV has considerable overhead compared to the bare-metal environment for small message sizes, but the overhead is reduced to around 6% for large message sizes. Therefore, SR-IOV can be suitable for the large-size I/O operations that are common in many Big Data middleware components.

B. Hadoop RPC Throughput

In this section, we evaluate the performance of RPC communication in terms of throughput. For these tests, the RPC server runs on a separate node, and multiple clients are distributed uniformly over 8 compute nodes. Figure 9 shows the throughput results on all six clusters. From these figures, we can observe that, compared with the IPoIB and 40 Gig Ethernet transport channels, the native IB and RoCE based transport channels are able to achieve higher throughput. For example, as shown in Figure 9(c), the native IB-RC transport channel achieves substantially more calls per second than the IPoIB channel; overall, the performance improvement can be up to 2.15x. Similarly, there are up to 4.6x and 12.5x throughput improvements on the RI-IB-QDR and SDSC-Comet-IB-FDR clusters, respectively. Compared to 40 Gig Ethernet, the native design with 40G-RoCE is able to achieve up to 1.89x speedup. In addition, we can observe that the hybrid transport channel achieves throughput similar to RC, which we attribute to the ability of our hybrid design to dynamically switch between RC and UD. For the SR-IOV based virtualized environment, we see similar trends: the native design is able to achieve up to 2.6x speedup compared to IPoIB.

C. Hadoop RPC Memory Scalability Analysis

Figure 10 presents the memory footprint for a varying number of client connections for the RC, UD, and Hybrid transports. The RC transport needs to create one QP for each client connection. We can clearly see that, as the number of client connections increases, the memory footprint of RC keeps increasing; at the largest number of client connections evaluated, it consumes approximately 42 MB of memory. (Note that the JVM memory consumption is always present in these tests.) However, there is no need to create a new QP for each client connection with the UD transport. As a result, we observe that UD consumes a steady amount of memory; the slight increase is due to new objects allocated in the JVM for the increasing client connections. This steady memory consumption is favorable for large-scale applications. We also notice that, by applying the hybrid design, the memory footprint is clearly reduced after the configured threshold of client connections is reached, and then stays steady onwards. In the meantime, the Hybrid transport can still achieve performance similar to RC. With this, we believe the hybrid scheme we proposed for Hadoop RPC will have less memory contention with applications that have higher memory consumption.

D. YCSB over HBase

HBase uses Google Protocol Buffer-based RPC for communicating all operation requests, such as Put, Get, etc., and responses between the client and the HRegionServer. To evaluate our native RPC engine integrated into HBase (described in Section III-E), we use YCSB workloads A, B, and C.
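Each YCSB set/get operation ultimately becomes an HBase client Put or Get, which is carried to the HRegionServer by the HBase RPC layer that our engine re-designs underneath. The minimal HBase 1.x client sketch below illustrates this path; the table name "usertable" and the column names are assumptions in line with YCSB's common defaults.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal HBase 1.x client sketch: one Put (set) and one Get, each shipped
// to the HRegionServer through the HBase RPC layer.
public class HBaseSetGetExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("usertable"))) {

      Put put = new Put(Bytes.toBytes("user100"));
      put.addColumn(Bytes.toBytes("family"), Bytes.toBytes("field0"),
                    Bytes.toBytes("value0"));
      table.put(put);                    // protobuf-based RPC to the RegionServer

      Get get = new Get(Bytes.toBytes("user100"));
      Result result = table.get(get);    // RPC response carries the row back
      byte[] value = result.getValue(Bytes.toBytes("family"), Bytes.toBytes("field0"));
      System.out.println(Bytes.toString(value));
    }
  }
}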

Fig. 8: Hadoop RPC Latency Evaluation on Different Networking Technologies: (a) RI-IB-QDR; (b) SDSC-Comet-IB-FDR; (c) RI2-IB-EDR; (d) Chameleon-FDR-Bare-Metal; (e) Chameleon-FDR-SR-IOV; (f) RI2-40Gig Ethernet and 40G-RoCE
Fig. 9: Hadoop RPC Throughput Evaluation on Different Networking Technologies: (a) RI-IB-QDR; (b) SDSC-Comet-IB-FDR; (c) RI2-IB-EDR; (d) Chameleon-FDR-Bare-Metal; (e) Chameleon-FDR-SR-IOV; (f) RI2-40Gig Ethernet and 40G-RoCE
Fig. 10: Memory Footprint Evaluation
Fig. 11: YCSB Performance Evaluation on SDSC-Comet-IB-FDR Cluster: (a) YCSB Workload A; (b) YCSB Workload B; (c) YCSB Workload C

We run 25K operations per client with a varying number of clients on a 36-node cluster with 4 HBase RegionServers and multiple client nodes on SDSC-Comet-IB-FDR. We perform these experiments over the IPoIB and native IB transport channels. Figure 11 shows that the native IB based transport protocols (RC/UD/Hybrid) always perform better than the IPoIB transport. For the write-intensive YCSB workload A, with a read-to-update operation ratio of 50:50, we observe up to 2.4x improvement in throughput with our design over IPoIB. For the read-intensive YCSB workload B, with a read-to-update ratio of 95:5, we observe up to 2.65x improvement in the overall throughput of HBase with our design, as compared to the IPoIB transport. For the read-only YCSB workload C, with 100% reads, we observe up to 3.6x improvement in the overall throughput of HBase with our design, as compared to the IPoIB transport.

VI. CONCLUSIONS

In this paper, we proposed network architecture-aware and multi-protocol (RC, UD, and Hybrid) based transport schemes to accelerate Hadoop RPC and HBase communication. We performed extensive performance optimization on different low-level communication parameters to determine the optimal values for achieving the best performance. Based on these, we presented a comprehensive performance analysis of our proposed design using Hadoop RPC micro-benchmarks and application benchmarks on five bare-metal IB/RoCE clusters and one SR-IOV-enabled IB cluster.

Performance evaluations reveal that our native IB/RoCE-based design with RC gives the best performance in terms of latency and throughput for Hadoop RPC and HBase communication. IB/RoCE with RC can achieve up to 12.5x speedup compared to IPoIB for Hadoop RPC communication. For HBase, IB/RoCE with RC can improve the throughput by up to 3.6x over IPoIB. These results show the advantages of the eager and RDMA-based communication schemes with the RC protocol. IB/RoCE with UD can perform faster than TCP/IP over 40GigE or IPoIB with RC/UD, which shows the benefits of the proposed eager, packetized, and zero-copy message transfer designs for UD. IB/RoCE with the hybrid RC-UD design can achieve performance similar to IB/RoCE RC while maintaining a steady memory footprint like that of UD. We can also see that IB EDR provides the best Hadoop RPC latency and throughput. 40G-RoCE and IB FDR have similar performance, while 40GigE and IB QDR perform similarly. In the virtualized cloud environment, SR-IOV demonstrates considerable overhead compared to the bare-metal environment for small message sizes in terms of Hadoop RPC latency. However, the overhead is reduced to around 6% for large message sizes. This makes us believe that SR-IOV could be suitable for the large-size I/O operations that are common in many Big Data middleware components. For future work, we will study the impact of other transport protocols (like XRC and DC) on the performance of Hadoop and Spark components. We will also evaluate more real applications on different networks and protocols.

REFERENCES
[1] Chameleon Cloud.
[2] InfiniBand Trade Association, "InfiniBand Architecture Specification, Volume 1."
[3] IP over InfiniBand Working Group, org/html.charters/ipoib-charter.html.
[4] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, "Benchmarking Cloud Serving Systems with YCSB," in Proceedings of the ACM Symposium on Cloud Computing (SoCC), Indianapolis, Indiana, June 2010.
[5] Google Protocol Buffers.
[6] J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, C. Murthy, and D. K. Panda, "High-Performance Design of HBase with RDMA over InfiniBand," in Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium (IPDPS), Shanghai, China, May 2012.
[7] International Data Corporation (IDC), "New IDC Worldwide HPC End-User Study Identifies Latest Trends in High Performance Computing Usage and Spending."
[8] N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, "High Performance RDMA-based Design of HDFS over InfiniBand," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Salt Lake City, November 2012.
[9] J. Jose, M. Li, X. Lu, K. C. Kandalla, M. D. Arnold, and D. K. Panda, "SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience," in 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2013.
[10] J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. S. Islam, X. Ouyang, H. Wang, S. Sur, and D. K. Panda, "Memcached Design on High Performance RDMA Capable Interconnects," in International Conference on Parallel Processing (ICPP), September 2011.
[11] J. Jose, H. Subramoni, K. Kandalla, M. Wasi-ur-Rahman, H. Wang, S. Narravula, and D. K. Panda, "Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transports," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '12), Washington, DC, USA, 2012.
[12] M. J. Koop, S. Sur, Q. Gao, and D. K. Panda, "High Performance MPI Design using Unreliable Datagram for Ultra-scale InfiniBand Clusters," in ICS '07: Proceedings of the 21st Annual International Conference on Supercomputing, New York, NY, USA: ACM, 2007.
[13] X. Lu, N. S. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, "High-Performance Design of Hadoop RPC with RDMA over InfiniBand," in Proceedings of the IEEE 42nd International Conference on Parallel Processing (ICPP), France, October 2013.
[14] X. Lu, F. Liang, B. Wang, L. Zha, and Z. Xu, "DataMPI: Extending MPI to Hadoop-Like Big Data Computing," in Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS '14), Washington, DC, USA, 2014.
[15] X. Lu, M. W. U. Rahman, N. Islam, D. Shankar, and D. K. Panda, "Accelerating Spark with RDMA for Big Data Processing: Early Experiences," in Proceedings of the 2014 IEEE 22nd Annual Symposium on High-Performance Interconnects (HOTI '14), Washington, DC, USA, 2014.
[16] OSU NBC Lab, High-Performance Big Data (HiBD).
[17] PCI-SIG, "An Introduction to SR-IOV Technology," 2013.
[18] M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, "HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects," in International Conference on Supercomputing (ICS), Munich, Germany, 2014.
[19] I. Research, "Cloud Computing in HPC: Rationale for Adoption."
[20] P. Stuedi, A. Trivedi, B. Metzler, and J. Pfefferle, "DaRPC: Data Center RPC," in Proceedings of the ACM Symposium on Cloud Computing (SoCC '14), New York, NY, USA: ACM, 2014, pp. 15:1-15:13.
[21] Y. Wang, X. Que, W. Yu, D. Goldenberg, and D. Sehgal, "Hadoop Acceleration through Network Levitated Merge," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Seattle, WA, November 2011.
[22] M. Welsh, D. Culler, and E.
Brewer, "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services," in Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP), Banff, Alberta, Canada, 2001.
[23] J. Zhang, X. Lu, M. Arnold, and D. K. Panda, "MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds," in 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2015.


More information

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Talk at HPCAC Stanford Conference (Feb 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning 5th ANNUAL WORKSHOP 209 Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning Hari Subramoni Dhabaleswar K. (DK) Panda The Ohio State University The Ohio State University E-mail:

More information

Unified Runtime for PGAS and MPI over OFED

Unified Runtime for PGAS and MPI over OFED Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction

More information

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits Ashish Kumar Singh, Sreeram Potluri, Hao Wang, Krishna Kandalla, Sayantan Sur, and Dhabaleswar K. Panda Network-Based

More information

Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems

Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

More information

RDMA for Memcached User Guide

RDMA for Memcached User Guide 0.9.5 User Guide HIGH-PERFORMANCE BIG DATA TEAM http://hibd.cse.ohio-state.edu NETWORK-BASED COMPUTING LABORATORY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING THE OHIO STATE UNIVERSITY Copyright (c)

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda Department of Computer Science and Engineering

More information

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics 1 Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics Mingzhe Li Sreeram Potluri Khaled Hamidouche Jithin Jose Dhabaleswar K. Panda Network-Based Computing Laboratory Department

More information

Performance Evaluation of InfiniBand with PCI Express

Performance Evaluation of InfiniBand with PCI Express Performance Evaluation of InfiniBand with PCI Express Jiuxing Liu Amith Mamidala Abhinav Vishnu Dhabaleswar K Panda Department of Computer and Science and Engineering The Ohio State University Columbus,

More information

EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures

EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures Haiyang Shi, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda {shi.876, lu.932, panda.2}@osu.edu The Ohio State University

More information

Performance Benefits of DataMPI: A Case Study with BigDataBench

Performance Benefits of DataMPI: A Case Study with BigDataBench Benefits of DataMPI: A Case Study with BigDataBench Fan Liang 1,2 Chen Feng 1,2 Xiaoyi Lu 3 Zhiwei Xu 1 1 Institute of Computing Technology, Chinese Academy of Sciences 2 University of Chinese Academy

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

High Performance MPI on IBM 12x InfiniBand Architecture

High Performance MPI on IBM 12x InfiniBand Architecture High Performance MPI on IBM 12x InfiniBand Architecture Abhinav Vishnu, Brad Benton 1 and Dhabaleswar K. Panda {vishnu, panda} @ cse.ohio-state.edu {brad.benton}@us.ibm.com 1 1 Presentation Road-Map Introduction

More information

Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet

Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet WHITE PAPER Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet Contents Background... 2 The MapR Distribution... 2 Mellanox Ethernet Solution... 3 Test

More information

Mellanox Technologies Maximize Cluster Performance and Productivity. Gilad Shainer, October, 2007

Mellanox Technologies Maximize Cluster Performance and Productivity. Gilad Shainer, October, 2007 Mellanox Technologies Maximize Cluster Performance and Productivity Gilad Shainer, shainer@mellanox.com October, 27 Mellanox Technologies Hardware OEMs Servers And Blades Applications End-Users Enterprise

More information

Exploiting HPC Technologies for Accelerating Big Data Processing and Storage

Exploiting HPC Technologies for Accelerating Big Data Processing and Storage Exploiting HPC Technologies for Accelerating Big Data Processing and Storage Talk in the 5194 class by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi

More information

A Case for High Performance Computing with Virtual Machines

A Case for High Performance Computing with Virtual Machines A Case for High Performance Computing with Virtual Machines Wei Huang*, Jiuxing Liu +, Bulent Abali +, and Dhabaleswar K. Panda* *The Ohio State University +IBM T. J. Waston Research Center Presentation

More information

Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies

Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies Talk at Intel HPC Developer Conference 2017 (SC 17) by Dhabaleswar K. (DK) Panda The Ohio State University

More information

Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds

Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds 216 IEEE 8th International Conference on Cloud Computing Technology and Science Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds

More information

Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR

Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Presentation at Mellanox Theater () Dhabaleswar K. (DK) Panda - The Ohio State University panda@cse.ohio-state.edu Outline Communication

More information

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Talk at Bench 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi

More information

Memory Management Strategies for Data Serving with RDMA

Memory Management Strategies for Data Serving with RDMA Memory Management Strategies for Data Serving with RDMA Dennis Dalessandro and Pete Wyckoff (presenting) Ohio Supercomputer Center {dennis,pw}@osc.edu HotI'07 23 August 2007 Motivation Increasing demands

More information

Accelerating and Benchmarking Big Data Processing on Modern Clusters

Accelerating and Benchmarking Big Data Processing on Modern Clusters Accelerating and Benchmarking Big Data Processing on Modern Clusters Keynote Talk at BPOE-6 (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Hao Wang, Jian Huang, Dhabaleswar K. Panda Department of Computer

More information

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi

More information

Application Acceleration Beyond Flash Storage

Application Acceleration Beyond Flash Storage Application Acceleration Beyond Flash Storage Session 303C Mellanox Technologies Flash Memory Summit July 2014 Accelerating Applications, Step-by-Step First Steps Make compute fast Moore s Law Make storage

More information

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

The Exascale Architecture

The Exascale Architecture The Exascale Architecture Richard Graham HPC Advisory Council China 2013 Overview Programming-model challenges for Exascale Challenges for scaling MPI to Exascale InfiniBand enhancements Dynamically Connected

More information

Introduction to Infiniband

Introduction to Infiniband Introduction to Infiniband FRNOG 22, April 4 th 2014 Yael Shenhav, Sr. Director of EMEA, APAC FAE, Application Engineering The InfiniBand Architecture Industry standard defined by the InfiniBand Trade

More information

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications GPU Technology Conference GTC 2016 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G

10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G 10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G Mohammad J. Rashti and Ahmad Afsahi Queen s University Kingston, ON, Canada 2007 Workshop on Communication Architectures

More information

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda Network Based Computing Laboratory

More information

Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani

Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani The Ohio State University E-mail: gugnani.2@osu.edu http://web.cse.ohio-state.edu/~gugnani/ Network Based Computing Laboratory SC 17 2 Neuroscience:

More information

High Performance Computing

High Performance Computing High Performance Computing Dror Goldenberg, HPCAC Switzerland Conference March 2015 End-to-End Interconnect Solutions for All Platforms Highest Performance and Scalability for X86, Power, GPU, ARM and

More information

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Design and Evaluation of Benchmarks for Financial Applications using Advanced Message Queuing Protocol (AMQP) over InfiniBand

Design and Evaluation of Benchmarks for Financial Applications using Advanced Message Queuing Protocol (AMQP) over InfiniBand Design and Evaluation of Benchmarks for Financial Applications using Advanced Message Queuing Protocol (AMQP) over InfiniBand Hari Subramoni, Gregory Marsh, Sundeep Narravula, Ping Lai, and Dhabaleswar

More information

Performance Evaluation of InfiniBand with PCI Express

Performance Evaluation of InfiniBand with PCI Express Performance Evaluation of InfiniBand with PCI Express Jiuxing Liu Server Technology Group IBM T. J. Watson Research Center Yorktown Heights, NY 1598 jl@us.ibm.com Amith Mamidala, Abhinav Vishnu, and Dhabaleswar

More information

Accelerating and Benchmarking Big Data Processing on Modern Clusters

Accelerating and Benchmarking Big Data Processing on Modern Clusters Accelerating and Benchmarking Big Data Processing on Modern Clusters Open RG Big Data Webinar (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Comparing Ethernet & Soft RoCE over 1 Gigabit Ethernet

Comparing Ethernet & Soft RoCE over 1 Gigabit Ethernet Comparing Ethernet & Soft RoCE over 1 Gigabit Ethernet Gurkirat Kaur, Manoj Kumar 1, Manju Bala 2 1 Department of Computer Science & Engineering, CTIEMT Jalandhar, Punjab, India 2 Department of Electronics

More information

Performance Considerations of Network Functions Virtualization using Containers

Performance Considerations of Network Functions Virtualization using Containers Performance Considerations of Network Functions Virtualization using Containers Jason Anderson, et al. (Clemson University) 2016 International Conference on Computing, Networking and Communications, Internet

More information

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Cooperative VM Migration for a virtualized HPC Cluster with VMM-bypass I/O devices

Cooperative VM Migration for a virtualized HPC Cluster with VMM-bypass I/O devices Cooperative VM Migration for a virtualized HPC Cluster with VMM-bypass I/O devices Ryousei Takano, Hidemoto Nakada, Takahiro Hirofuchi, Yoshio Tanaka, and Tomohiro Kudoh Information Technology Research

More information

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao 1 Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao and Shigeru Chiba The University of Tokyo Thanh-Chung Dao 2 Supercomputers Expensive clusters Multi-core

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

LS-DYNA Performance Benchmark and Profiling. April 2015

LS-DYNA Performance Benchmark and Profiling. April 2015 LS-DYNA Performance Benchmark and Profiling April 2015 2 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell, Mellanox Compute resource

More information

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Future Routing Schemes in Petascale clusters

Future Routing Schemes in Petascale clusters Future Routing Schemes in Petascale clusters Gilad Shainer, Mellanox, USA Ola Torudbakken, Sun Microsystems, Norway Richard Graham, Oak Ridge National Laboratory, USA Birds of a Feather Presentation Abstract

More information

In-Network Computing. Sebastian Kalcher, Senior System Engineer HPC. May 2017

In-Network Computing. Sebastian Kalcher, Senior System Engineer HPC. May 2017 In-Network Computing Sebastian Kalcher, Senior System Engineer HPC May 2017 Exponential Data Growth The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait

More information

RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits

RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits Sayantan Sur Hyun-Wook Jin Lei Chai D. K. Panda Network Based Computing Lab, The Ohio State University Presentation

More information