A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS


A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS. Adithya Bhat, Nusrat Islam, Xiaoyi Lu, Md. Wasi-ur-Rahman, Dipti Shankar, and Dhabaleswar K. (DK) Panda. Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA

Outline: Introduction & Motivation, Problem Statement, RDMA-based Plugin Design, Performance Evaluation, Conclusion & Future Work

Big Data Technology and Hadoop. Apache Hadoop is a popular Big Data technology. The Hadoop Distributed File System (HDFS) (http://hadoop.apache.org/) is the underlying file system of Hadoop, HBase, and Spark, and has been adopted by many well-known organizations, e.g., Facebook and Yahoo!. Hadoop and HDFS are increasingly used on HPC clusters for scientific applications. (Slide figure: the Hadoop framework stack, with HBase, Spark, and MapReduce layered over HDFS.)

Hadoop Distributed File System. HDFS is the primary storage of Hadoop; it is highly reliable and fault-tolerant. The NameNode is the master node that stores metadata for the file system; DataNodes store data blocks, and blocks are replicated on multiple DataNodes to provide fault tolerance. Write and replication are communication-intensive processes. HDFS is developed in Java for platform independence and portability and uses sockets for communication. (Slide figure: HDFS clients connected over high performance networks to HDFS DataNodes, each backed by HDD/SSD storage.)
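To make the client-side write path concrete, here is a minimal sketch of writing a file through the standard Hadoop Java API; the path and payload are placeholders, and by default this traffic flows over the sockets mentioned above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // DistributedFileSystem when fs.defaultFS is hdfs://
            // The output stream splits data into packets and pipelines them to the replica DataNodes.
            try (FSDataOutputStream out = fs.create(new Path("/tmp/example.dat"))) {
                out.write(new byte[4 * 1024]);          // placeholder payload
            }
        }
    }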

HPC and Big Data. The HPC community is working towards wide adoption of high performance interconnects and protocols (e.g., RDMA) to accelerate Big Data technologies. Heterogeneous storage (RAMDisk, SSD, HDD) and parallel file systems such as Lustre are also being used as the underlying storage for the Hadoop and Spark frameworks on HPC clusters. Many companies (e.g., Intel, Mellanox, and Cray) are working along these directions. See http://opensfs.org/wp-content/uploads/2014/04/D3_S29_ProgressReportonEfficientIntegrationofLustreandHadroopYARN.pdf, http://www.mellanox.com/page/products_dyn?product_family=144, and https://community.mellanox.com/docs/doc-2080

All Interconnects and Protocols in OpenFabrics Stack. (Slide figure: the OpenFabrics stack, from applications and middleware through the sockets and verbs interfaces, kernel-space and user-space protocols, adapters, and switches, down to the supported interconnects and protocols: 1/10/40/100 GigE, IPoIB, 10/40 GigE-TOE, RSockets, SDP, iWARP, RoCE, and native IB.)

Prior Work. In [1], an RDMA (Remote Direct Memory Access)-based design of HDFS has been proposed. In [2], a parallel replication scheme has been proposed to take full advantage of high-performance network bandwidth. In [3], HDFS is redesigned to maximize overlapping on high-performance interconnects. In [4], a hybrid design to accelerate HDFS I/O performance with a heterogeneous storage (RAMDisk, SSD, HDD, and Lustre) architecture has been proposed.
[1] N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012.
[2] N. Islam, X. Lu, W. Rahman, and D. K. Panda, Can Parallel Replication Benefit HDFS for High-Performance Interconnects?, HotI '13, Aug 2013.
[3] N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014.
[4] N. S. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015.

Design Overview of HDFS with RDMA. (Slide figure: applications issue HDFS writes that go either through the Java socket interface over 1/10/40/100 GigE or IPoIB networks, or through the Java Native Interface (JNI) and the OSU verbs-based design over RDMA-capable networks such as IB, iWARP, and RoCE.) Design features: RDMA-based HDFS write [1]; RDMA-based HDFS replication [1]; parallel replication support [2]; SEDA-based design to provide maximum overlapping between the different stages of data transfer and I/O [3]; on-demand connection setup; InfiniBand/RoCE support. The design enables high performance RDMA communication while still supporting the traditional socket interface; the JNI layer bridges Java-based HDFS with the communication library written in native code.
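As a rough illustration of the JNI bridging described above, the sketch below shows how a Java-side wrapper around a native RDMA library might be declared; the library name and all method names here are hypothetical and are not the actual OSU implementation.

    // Hypothetical JNI bridge between the Java HDFS write path and a native verbs-based RDMA library.
    public class RdmaBridge {
        static {
            System.loadLibrary("rdmahdfs");   // assumed name; loads librdmahdfs.so implemented over ibverbs
        }

        // Declarations only; the bodies live in native code. Names are illustrative.
        public static native long connect(String dataNodeHost, int port);
        public static native int  sendPacket(long connection, byte[] packet, int offset, int length);
        public static native void close(long connection);
    }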

Triple-H: A Hybrid Approach to Accelerate HDFS with a Heterogeneous Storage Architecture [4]. (Slide figure: applications sit on top of Triple-H, which applies data placement policies, eviction/promotion, and hybrid replication over heterogeneous storage: RAM disk, SSD, HDD, and Lustre.) Design features: three modes, Default (HHH), In-Memory (HHH-M), and Lustre-Integrated (HHH-L); policies to efficiently utilize the heterogeneous storage devices (RAM, SSD, HDD, Lustre); eviction/promotion based on data usage pattern; hybrid replication; in the Lustre-Integrated mode, Lustre-based fault tolerance.
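The sketch below is purely illustrative of how mode-dependent data placement could look; it is not the actual Triple-H policy, and all names in it are invented for this example.

    // Illustrative only: simplified tier selection in the spirit of Triple-H's three modes.
    enum Tier { RAM_DISK, SSD, HDD, LUSTRE }
    enum Mode { HHH, HHH_M, HHH_L }

    class PlacementSketch {
        Tier placeNewBlock(Mode mode, boolean ramDiskHasSpace) {
            switch (mode) {
                case HHH_M: return Tier.RAM_DISK;                   // in-memory mode keeps data in RAM disk
                case HHH_L: return Tier.LUSTRE;                     // Lustre-integrated mode relies on Lustre
                default:    return ramDiskHasSpace ? Tier.RAM_DISK  // default HHH: fastest tier with space;
                                                   : Tier.SSD;      // eviction/promotion would move data later
            }
        }
    }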

Motivation. All the designs mentioned in [1-4] are implemented on the Apache Hadoop codebase. There are many Hadoop distributors that add their own features or optimizations to Hadoop, such as the Hortonworks Data Platform (HDP) and the Cloudera Hadoop Distribution (CDH). R4H (RDMA for HDFS) is an RDMA-based plugin from Mellanox to accelerate HDFS. How can the RDMA-enhanced HDFS designs [1-4] be utilized by different Hadoop distributions (e.g., Apache, HDP, CDH) and versions without making significant changes to existing HDFS deployments?

Outline: Introduction & Motivation, Problem Statement, RDMA-based Plugin Design, Performance Evaluation, Conclusion & Future Work

Problem Statement. Can we design an RDMA-based plugin for HDFS that brings the benefits of the efficient RDMA-enhanced HDFS designs to different Hadoop distributions? Can it ensure the same performance benefits (without overhead) for Apache Hadoop as the existing enhanced HDFS designs? Do different enterprise Hadoop distributions, such as HDP and CDH, also observe performance benefits for different benchmarks? What is the possible performance improvement over existing HDFS plugins such as Mellanox R4H?

Outline: Introduction & Motivation, Problem Statement, RDMA-based Plugin Design, Performance Evaluation, Conclusion & Future Work

Overview of RDMA-based Plugin Design. Server side: RdmaDataXceiverServer is loaded on every DataNode when the cluster is started or restarted; it implements the ServicePlugin interface. Client side: RdmaDistributedFileSystem is loaded at the HDFS client; it extends DistributedFileSystem to reuse common file system operations such as file open, rename, and close. RdmaDFSOutputStream is the main component; it reads file data and sends packets to the DataNodes using RDMA.
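A minimal sketch of the server-side hook, assuming the standard Hadoop ServicePlugin interface and the dfs.datanode.plugins property; the class body here is a placeholder, not the actual RdmaDataXceiverServer implementation. On the client side, RdmaDistributedFileSystem would be selected analogously, for example via the fs.hdfs.impl setting.

    import java.io.IOException;
    import org.apache.hadoop.hdfs.server.datanode.DataNode;
    import org.apache.hadoop.util.ServicePlugin;

    // Server side: listed in dfs.datanode.plugins so every DataNode instantiates it at startup.
    public class RdmaDataXceiverServer implements ServicePlugin {
        private DataNode datanode;

        @Override
        public void start(Object service) {
            this.datanode = (DataNode) service;   // the DataNode passes itself to its plugins
            // ... allocate RDMA resources and start accepting RDMA connections from clients ...
        }

        @Override
        public void stop() {
            // ... release RDMA resources ...
        }

        @Override
        public void close() throws IOException {
            stop();
        }
    }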

Design Alternatives for the RDMA-based Client-Side Plugin. Approach 1: an AbstractDFSOutputStream defines common HDFS methods, such as getting the list of DataNodes from the NameNode, reading data and converting it into packets, and notifying the NameNode of file write completion; this follows an object-oriented design but requires many code changes and code reorganization in HDFS. Approach 2: requires minimal code changes; change access specifiers in DFSOutputStream and reuse its common HDFS methods.
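To make the contrast concrete, here is a purely hypothetical Java sketch of the Approach 1 refactoring; every class and method name below is a placeholder and does not reproduce the real DFSOutputStream internals.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.List;

    // Hypothetical Approach 1: common write-path steps pulled into an abstract parent so that the
    // socket-based and RDMA-based output streams only differ in the transport-specific methods.
    abstract class AbstractDFSOutputStream extends OutputStream {
        abstract List<String> getPipelineFromNameNode() throws IOException;       // DataNode list for this block
        abstract void sendPacket(byte[] buf, int off, int len) throws IOException; // socket or RDMA transport
        abstract void completeFileAtNameNode() throws IOException;                 // report write completion

        @Override
        public void write(int b) throws IOException {
            sendPacket(new byte[] { (byte) b }, 0, 1);   // simplified; real HDFS buffers data into 64 KB packets
        }

        @Override
        public void close() throws IOException {
            completeFileAtNameNode();
        }
    }

Approach 2 avoids introducing such a parent class: the plugin's RdmaDFSOutputStream reuses the existing DFSOutputStream helpers once their access specifiers are relaxed.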

Plugin Features (Current Status). The RDMA plugin incorporates RDMA-based HDFS write [1], RDMA-based replication [1], RDMA-based parallel replication [2], the SEDA-based design [3], and the Triple-H [4] features. The RDMA plugin together with the Triple-H design has been applied to Apache Hadoop 2.6, HDP 2.2, and CDH 5.4.2. For Apache Hadoop 2.5, we apply the RDMA plugin without the Triple-H design, as the heterogeneous storage support feature is not available for this version.

Overview of the HiBD Project (http://hibd.cse.ohio-state.edu): RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x), with plugins for the Apache and HDP Hadoop distributions; RDMA for Apache Hadoop 1.x (RDMA-Hadoop); RDMA for Memcached (RDMA-Memcached); OSU HiBD-Benchmarks (OHB), HDFS and Memcached micro-benchmarks; RDMA for Apache HBase, Spark, and CDH. User base: 125 organizations from 20 countries; more than 13,050 downloads from the project site.

RDMA for Apache Hadoop 2.x Distribution: a high-performance design of Hadoop over RDMA-enabled interconnects. It provides a high performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components; enhanced HDFS with in-memory and heterogeneous storage; a high performance design of MapReduce over Lustre; and a plugin-based architecture supporting the RDMA-based designs for Apache Hadoop and HDP. It is easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB). Current release: 0.9.7, based on Apache Hadoop 2.6.0; compliant with Apache Hadoop 2.6.0 and HDP 2.2.0.0 APIs and applications; tested with Mellanox InfiniBand adapters (DDR, QDR, and FDR), RoCE support with Mellanox adapters, various multi-core platforms, and different file systems with disks, SSDs, and Lustre. http://hibd.cse.ohio-state.edu
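The sketch below illustrates, in a hedged way, how a deployment might wire in the plugin and pick a mode and protocol programmatically; dfs.datanode.plugins and fs.hdfs.impl are standard Hadoop keys, while the mode and InfiniBand properties shown are assumed names, so the RDMA for Apache Hadoop 2.x user guide should be consulted for the authoritative parameters.

    import org.apache.hadoop.conf.Configuration;

    public class RdmaHadoopConfigSketch {
        public static Configuration configure() {
            Configuration conf = new Configuration();
            // Standard plugin hooks (class names below are placeholders for the packaged plugin classes).
            conf.set("dfs.datanode.plugins", "RdmaDataXceiverServer");
            conf.set("fs.hdfs.impl", "RdmaDistributedFileSystem");
            // Assumed property names for mode and interconnect selection; see the user guide for the real ones.
            conf.set("hadoop.hdfs.mode", "HHH");          // HHH, HHH-M, or HHH-L
            conf.setBoolean("hadoop.ib.enabled", true);   // native InfiniBand instead of IPoIB
            return conf;
        }
    }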

Outline: Introduction & Motivation, Problem Statement, RDMA-based Plugin Design, Performance Evaluation, Conclusion & Future Work

Experimental Setup. Hardware: an Intel Westmere cluster with large memory and SSDs (Cluster A), where each node has dual quad-core Xeon processors operating at 2.67 GHz, 24 GB RAM, and two 1 TB HDDs; four of the nodes have a 300 GB OCZ VeloDrive PCIe SSD; each node has a Mellanox QDR HCA (32 Gbps data rate). A second Intel Westmere cluster (Cluster B), where each node has dual quad-core Xeon processors operating at 2.67 GHz with 12 GB RAM, a 6 GB RAM disk, and a 160 GB HDD, plus Mellanox QDR HCAs (32 Gbps data rate). Software: JDK 1.7.0; Apache Hadoop 2.6 and 2.5; HDP 2.2; CDH 5.4.2; R4H 1.3. In all our experiments, we use four DataNodes and one NameNode. The HDFS block size is 128 MB and the replication factor is three. All the following experiments are run on Cluster A, unless stated otherwise.
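For completeness, the cluster-wide HDFS settings above correspond to the stock Hadoop 2.x keys dfs.blocksize and dfs.replication; a minimal sketch of setting them through the Configuration API follows (values as in the experiments, class name hypothetical).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class BenchmarkSetupSketch {
        public static FileSystem connect() throws Exception {
            Configuration conf = new Configuration();
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB HDFS block size
            conf.setInt("dfs.replication", 3);                 // replication factor of three
            return FileSystem.get(conf);
        }
    }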

Evaluation with the Apache Hadoop 2.6.0 Distribution (TestDFSIO). For the TestDFSIO write test, Apache-2.6-TripleH-RDMAPlugin offers performance benefits similar to those of Apache-2.6-TripleH-RDMA. The plugin design offers a 48% reduction in latency and a 3x improvement in throughput over default HDFS running over IPoIB.

Evaluation with the Apache Hadoop 2.6.0 Distribution (TeraGen & RandomWriter). The plugin design shows performance benefits of 27% for TeraGen and 31% for RandomWriter over default HDFS running over IPoIB. The Triple-H design, along with the RDMA-enhanced designs incorporated in the plugin, improves the I/O and communication performance.

Evaluation with the Apache Hadoop 2.6.0 Distribution (TeraSort & Sort). The RDMA-based design and Triple-H included in the plugin yield performance gains of 39% and 40% over IPoIB for the TeraSort and Sort benchmarks, respectively.

Evaluation with the Apache Hadoop 2.5.0 Distribution (TestDFSIO). The RDMA-enhanced HDFS used as a plugin without the Triple-H design is denoted Apache-2.5-SORHDFS-RDMAPlugin. The plugin design shows 27% higher throughput and an 18% reduction in latency for a 40 GB data size for TestDFSIO write over IPoIB.

Evaluation with Enterprise Hadoop Distributions, HDP 2.2 and CDH 5.4.2 (TestDFSIO). Default HDFS running over IPoIB for HDP and CDH is denoted HDP-2.2-IPoIB and CDH-5.4.2-IPoIB; the RDMA-enhanced plugin and Triple-H applied to HDP and CDH are denoted HDP-2.2-TripleH-RDMAPlugin and CDH-5.4.2-TripleH-RDMAPlugin. The plugin design offers a 63% improvement in latency and a 3.7x improvement in throughput for the TestDFSIO write benchmark.

Evaluation with Enterprise Hadoop Distributions, HDP 2.2 and CDH 5.4.2 (TeraGen & RandomWriter). The plugin design shows a performance benefit of 37% for TeraGen and 23% for RandomWriter over HDP running over IPoIB. The plugin applied to CDH shows a performance benefit of 41% for TeraGen and 49% for RandomWriter.

Performance Comparison with R4H on Cluster B (TestDFSIO). Mellanox R4H (RDMA for HDFS) is shown as R4H and our plugin is shown as Triple-H; both plugins are applied to HDP 2.2. The RDMA-based plugin design offers a 4.6x improvement in throughput compared to R4H. As the data size increases, the throughput becomes disk-bound.

Outline: Introduction & Motivation, Problem Statement, RDMA-based Plugin Design, Performance Evaluation, Conclusion & Future Work

Conclusion and Future Work. We proposed an RDMA-based plugin for the Hadoop Distributed File System (HDFS) to leverage the benefits of RDMA across Apache and enterprise Hadoop distributions. Extensive experimental results demonstrate that our proposed RDMA-based HDFS plugin incurs no extra performance overhead for different benchmarks; achieves up to a 3.7x improvement in TestDFSIO write throughput and up to a 48% improvement in latency compared to different Hadoop distributions (Apache, Hortonworks HDP, Cloudera CDH) running over IPoIB; and achieves up to a 4.6x improvement in TestDFSIO write throughput and a 62% improvement in TestDFSIO write latency compared to Mellanox R4H. The plugin is available at http://hibd.cse.ohio-state.edu. Future work: undertake detailed studies to assess the benefits of using the proposed plugin for more Hadoop applications, and make the RDMA-based plugin available for CDH as part of the HiBD project.

Thank You! {bhata, islamn, luxi, rahmanmd, shankard, panda}@cse.ohio-state.edu. Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/. The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/