A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS


A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS. Adithya Bhat, Nusrat Islam, Xiaoyi Lu, Md. Wasi-ur-Rahman, Dipti Shankar, and Dhabaleswar K. (DK) Panda. Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA

Outline: Introduction & Motivation, Problem Statement, RDMA-based Plugin Design, Performance Evaluation, Conclusion & Future Work

Big Data Technology and Hadoop. Apache Hadoop is a popular Big Data technology. The Hadoop Distributed File System (HDFS) (http://hadoop.apache.org/) is the underlying file system of Hadoop, HBase, and Spark, and has been adopted by many well-known organizations, e.g., Facebook and Yahoo!. Hadoop and HDFS are increasingly used on HPC clusters for scientific applications. (Slide figure: the Hadoop framework stack, with HBase, Spark, and MapReduce layered over HDFS.)

Hadoop Distributed File System. HDFS is the primary storage of Hadoop; it is highly reliable and fault-tolerant. The NameNode is the master node that stores metadata for the file system; DataNodes store data blocks, and blocks are replicated on multiple DataNodes to provide fault tolerance. Write and replication are communication-intensive processes. HDFS is developed in Java for platform independence and portability and uses sockets for communication. (Slide figure: HDFS clients connected over high performance networks to HDFS DataNodes, each backed by HDD/SSD storage.)
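To make the client-side write path concrete, here is a minimal sketch of writing a file through the standard Hadoop Java API; the path and payload are placeholders, and by default this traffic flows over the sockets mentioned above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // DistributedFileSystem when fs.defaultFS is hdfs://
            // The output stream splits data into packets and pipelines them to the replica DataNodes.
            try (FSDataOutputStream out = fs.create(new Path("/tmp/example.dat"))) {
                out.write(new byte[4 * 1024]);          // placeholder payload
            }
        }
    }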

HPC and Big Data. The HPC community is working towards wide adoption of high performance interconnects and protocols (e.g., RDMA) to accelerate Big Data technologies. Heterogeneous storage (RAMDisk, SSD, HDD) and parallel file systems such as Lustre are also being used as the underlying storage for the Hadoop and Spark frameworks on HPC clusters. Many companies (e.g., Intel, Mellanox, and Cray) are working along these directions. See http://opensfs.org/wp-content/uploads/2014/04/D3_S29_ProgressReportonEfficientIntegrationofLustreandHadroopYARN.pdf, http://www.mellanox.com/page/products_dyn?product_family=144, and https://community.mellanox.com/docs/doc-2080

All Interconnects and Protocols in OpenFabrics Stack. (Slide figure: the OpenFabrics stack, from applications and middleware through the sockets and verbs interfaces, kernel-space and user-space protocols, adapters, and switches, down to the supported interconnects and protocols: 1/10/40/100 GigE, IPoIB, 10/40 GigE-TOE, RSockets, SDP, iWARP, RoCE, and native IB.)

Prior Work. In [1], an RDMA (Remote Direct Memory Access)-based design of HDFS has been proposed. In [2], a parallel replication scheme has been proposed to take full advantage of high-performance network bandwidth. In [3], HDFS is redesigned to maximize overlapping on high-performance interconnects. In [4], a hybrid design to accelerate HDFS I/O performance with a heterogeneous storage (RAMDisk, SSD, HDD, and Lustre) architecture has been proposed.
[1] N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012.
[2] N. Islam, X. Lu, W. Rahman, and D. K. Panda, Can Parallel Replication Benefit HDFS for High-Performance Interconnects?, HotI '13, Aug 2013.
[3] N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014.
[4] N. S. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015.

Design Overview of HDFS with RDMA. (Slide figure: applications issue HDFS writes that go either through the Java socket interface over 1/10/40/100 GigE or IPoIB networks, or through the Java Native Interface (JNI) and the OSU verbs-based design over RDMA-capable networks such as IB, iWARP, and RoCE.) Design features: RDMA-based HDFS write [1]; RDMA-based HDFS replication [1]; parallel replication support [2]; SEDA-based design to provide maximum overlapping between the different stages of data transfer and I/O [3]; on-demand connection setup; InfiniBand/RoCE support. The design enables high performance RDMA communication while still supporting the traditional socket interface; the JNI layer bridges Java-based HDFS with the communication library written in native code.
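As a rough illustration of the JNI bridging described above, the sketch below shows how a Java-side wrapper around a native RDMA library might be declared; the library name and all method names here are hypothetical and are not the actual OSU implementation.

    // Hypothetical JNI bridge between the Java HDFS write path and a native verbs-based RDMA library.
    public class RdmaBridge {
        static {
            System.loadLibrary("rdmahdfs");   // assumed name; loads librdmahdfs.so implemented over ibverbs
        }

        // Declarations only; the bodies live in native code. Names are illustrative.
        public static native long connect(String dataNodeHost, int port);
        public static native int  sendPacket(long connection, byte[] packet, int offset, int length);
        public static native void close(long connection);
    }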

Triple-H: A Hybrid Approach to Accelerate HDFS with a Heterogeneous Storage Architecture [4]. (Slide figure: applications sit on top of Triple-H, which applies data placement policies, eviction/promotion, and hybrid replication over heterogeneous storage: RAM disk, SSD, HDD, and Lustre.) Design features: three modes, Default (HHH), In-Memory (HHH-M), and Lustre-Integrated (HHH-L); policies to efficiently utilize the heterogeneous storage devices (RAM, SSD, HDD, Lustre); eviction/promotion based on data usage pattern; hybrid replication; in the Lustre-Integrated mode, Lustre-based fault tolerance.
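The sketch below is purely illustrative of how mode-dependent data placement could look; it is not the actual Triple-H policy, and all names in it are invented for this example.

    // Illustrative only: simplified tier selection in the spirit of Triple-H's three modes.
    enum Tier { RAM_DISK, SSD, HDD, LUSTRE }
    enum Mode { HHH, HHH_M, HHH_L }

    class PlacementSketch {
        Tier placeNewBlock(Mode mode, boolean ramDiskHasSpace) {
            switch (mode) {
                case HHH_M: return Tier.RAM_DISK;                   // in-memory mode keeps data in RAM disk
                case HHH_L: return Tier.LUSTRE;                     // Lustre-integrated mode relies on Lustre
                default:    return ramDiskHasSpace ? Tier.RAM_DISK  // default HHH: fastest tier with space;
                                                   : Tier.SSD;      // eviction/promotion would move data later
            }
        }
    }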

Motivation. All the designs mentioned in [1-4] are implemented on the Apache Hadoop codebase. There are many Hadoop distributors that add their own features or optimizations to Hadoop, such as the Hortonworks Data Platform (HDP) and the Cloudera Hadoop Distribution (CDH). R4H (RDMA for HDFS) is an RDMA-based plugin from Mellanox to accelerate HDFS. How can the RDMA-enhanced HDFS designs [1-4] be utilized by different Hadoop distributions (e.g., Apache, HDP, CDH) and versions without making significant changes to existing HDFS deployments?

Outline: Introduction & Motivation, Problem Statement, RDMA-based Plugin Design, Performance Evaluation, Conclusion & Future Work

Problem Statement. Can we design an RDMA-based plugin for HDFS that brings the benefits of the efficient RDMA-enhanced HDFS designs to different Hadoop distributions? Can it ensure the same performance benefits (without overhead) for Apache Hadoop as the existing enhanced HDFS designs? Do different enterprise Hadoop distributions, such as HDP and CDH, also observe performance benefits for different benchmarks? What is the possible performance improvement over existing HDFS plugins such as Mellanox R4H?

Outline: Introduction & Motivation, Problem Statement, RDMA-based Plugin Design, Performance Evaluation, Conclusion & Future Work

Overview of RDMA-based Plugin Design. Server side: RdmaDataXceiverServer is loaded on every DataNode when the cluster is started or restarted; it implements the ServicePlugin interface. Client side: RdmaDistributedFileSystem is loaded at the HDFS client; it extends DistributedFileSystem to reuse common file system operations such as file open, rename, and close. RdmaDFSOutputStream is the main component; it reads file data and sends packets to the DataNodes using RDMA.
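A minimal sketch of the server-side hook, assuming the standard Hadoop ServicePlugin interface and the dfs.datanode.plugins property; the class body here is a placeholder, not the actual RdmaDataXceiverServer implementation. On the client side, RdmaDistributedFileSystem would be selected analogously, for example via the fs.hdfs.impl setting.

    import java.io.IOException;
    import org.apache.hadoop.hdfs.server.datanode.DataNode;
    import org.apache.hadoop.util.ServicePlugin;

    // Server side: listed in dfs.datanode.plugins so every DataNode instantiates it at startup.
    public class RdmaDataXceiverServer implements ServicePlugin {
        private DataNode datanode;

        @Override
        public void start(Object service) {
            this.datanode = (DataNode) service;   // the DataNode passes itself to its plugins
            // ... allocate RDMA resources and start accepting RDMA connections from clients ...
        }

        @Override
        public void stop() {
            // ... release RDMA resources ...
        }

        @Override
        public void close() throws IOException {
            stop();
        }
    }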

Design Alternatives for the RDMA-based Client-Side Plugin. Approach 1: an AbstractDFSOutputStream defines common HDFS methods, such as getting the list of DataNodes from the NameNode, reading data and converting it into packets, and notifying the NameNode of file write completion; this follows an object-oriented design but requires many code changes and code reorganization in HDFS. Approach 2: requires minimal code changes; change access specifiers in DFSOutputStream and reuse its common HDFS methods.
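To make the contrast concrete, here is a purely hypothetical Java sketch of the Approach 1 refactoring; every class and method name below is a placeholder and does not reproduce the real DFSOutputStream internals.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.List;

    // Hypothetical Approach 1: common write-path steps pulled into an abstract parent so that the
    // socket-based and RDMA-based output streams only differ in the transport-specific methods.
    abstract class AbstractDFSOutputStream extends OutputStream {
        abstract List<String> getPipelineFromNameNode() throws IOException;       // DataNode list for this block
        abstract void sendPacket(byte[] buf, int off, int len) throws IOException; // socket or RDMA transport
        abstract void completeFileAtNameNode() throws IOException;                 // report write completion

        @Override
        public void write(int b) throws IOException {
            sendPacket(new byte[] { (byte) b }, 0, 1);   // simplified; real HDFS buffers data into 64 KB packets
        }

        @Override
        public void close() throws IOException {
            completeFileAtNameNode();
        }
    }

Approach 2 avoids introducing such a parent class: the plugin's RdmaDFSOutputStream reuses the existing DFSOutputStream helpers once their access specifiers are relaxed.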

Plugin Features (Current Status). The RDMA plugin incorporates RDMA-based HDFS write [1], RDMA-based replication [1], RDMA-based parallel replication [2], the SEDA-based design [3], and the Triple-H [4] features. The RDMA plugin together with the Triple-H design has been applied to Apache Hadoop 2.6, HDP 2.2, and CDH 5.4.2. For Apache Hadoop 2.5, we apply the RDMA plugin without the Triple-H design, as the heterogeneous storage support feature is not available for this version.

Overview of the HiBD Project (http://hibd.cse.ohio-state.edu): RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x), with plugins for the Apache and HDP Hadoop distributions; RDMA for Apache Hadoop 1.x (RDMA-Hadoop); RDMA for Memcached (RDMA-Memcached); OSU HiBD-Benchmarks (OHB), HDFS and Memcached micro-benchmarks; RDMA for Apache HBase, Spark, and CDH. User base: 125 organizations from 20 countries; more than 13,050 downloads from the project site.

RDMA for Apache Hadoop 2.x Distribution: a high-performance design of Hadoop over RDMA-enabled interconnects. It provides a high performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components; enhanced HDFS with in-memory and heterogeneous storage; a high performance design of MapReduce over Lustre; and a plugin-based architecture supporting the RDMA-based designs for Apache Hadoop and HDP. It is easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB). Current release: 0.9.7, based on Apache Hadoop 2.6.0; compliant with Apache Hadoop 2.6.0 and HDP 2.2.0.0 APIs and applications; tested with Mellanox InfiniBand adapters (DDR, QDR, and FDR), RoCE support with Mellanox adapters, various multi-core platforms, and different file systems with disks, SSDs, and Lustre. http://hibd.cse.ohio-state.edu
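The sketch below illustrates, in a hedged way, how a deployment might wire in the plugin and pick a mode and protocol programmatically; dfs.datanode.plugins and fs.hdfs.impl are standard Hadoop keys, while the mode and InfiniBand properties shown are assumed names, so the RDMA for Apache Hadoop 2.x user guide should be consulted for the authoritative parameters.

    import org.apache.hadoop.conf.Configuration;

    public class RdmaHadoopConfigSketch {
        public static Configuration configure() {
            Configuration conf = new Configuration();
            // Standard plugin hooks (class names below are placeholders for the packaged plugin classes).
            conf.set("dfs.datanode.plugins", "RdmaDataXceiverServer");
            conf.set("fs.hdfs.impl", "RdmaDistributedFileSystem");
            // Assumed property names for mode and interconnect selection; see the user guide for the real ones.
            conf.set("hadoop.hdfs.mode", "HHH");          // HHH, HHH-M, or HHH-L
            conf.setBoolean("hadoop.ib.enabled", true);   // native InfiniBand instead of IPoIB
            return conf;
        }
    }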

Outline: Introduction & Motivation, Problem Statement, RDMA-based Plugin Design, Performance Evaluation, Conclusion & Future Work

Experimental Setup. Hardware: an Intel Westmere cluster with large memory and SSDs (Cluster A), where each node has dual quad-core Xeon processors operating at 2.67 GHz, 24 GB RAM, and two 1 TB HDDs; four of the nodes have a 300 GB OCZ VeloDrive PCIe SSD; each node has a Mellanox QDR HCA (32 Gbps data rate). A second Intel Westmere cluster (Cluster B), where each node has dual quad-core Xeon processors operating at 2.67 GHz with 12 GB RAM, a 6 GB RAM disk, and a 160 GB HDD, plus Mellanox QDR HCAs (32 Gbps data rate). Software: JDK 1.7.0; Apache Hadoop 2.6 and 2.5; HDP 2.2; CDH 5.4.2; R4H 1.3. In all our experiments, we use four DataNodes and one NameNode. The HDFS block size is 128 MB and the replication factor is three. All the following experiments are run on Cluster A, unless stated otherwise.
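For completeness, the cluster-wide HDFS settings above correspond to the stock Hadoop 2.x keys dfs.blocksize and dfs.replication; a minimal sketch of setting them through the Configuration API follows (values as in the experiments, class name hypothetical).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class BenchmarkSetupSketch {
        public static FileSystem connect() throws Exception {
            Configuration conf = new Configuration();
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB HDFS block size
            conf.setInt("dfs.replication", 3);                 // replication factor of three
            return FileSystem.get(conf);
        }
    }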

Evaluation with the Apache Hadoop 2.6.0 Distribution (TestDFSIO). For the TestDFSIO write test, Apache-2.6-TripleH-RDMAPlugin offers performance benefits similar to those of Apache-2.6-TripleH-RDMA. The plugin design offers a 48% reduction in latency and a 3x improvement in throughput over default HDFS running over IPoIB.

Evaluation with the Apache Hadoop 2.6.0 Distribution (TeraGen & RandomWriter). The plugin design shows performance benefits of 27% for TeraGen and 31% for RandomWriter over default HDFS running over IPoIB. The Triple-H design, along with the RDMA-enhanced designs incorporated in the plugin, improves the I/O and communication performance.

Evaluation with the Apache Hadoop 2.6.0 Distribution (TeraSort & Sort). The RDMA-based design and Triple-H included in the plugin yield performance gains of 39% and 40% over IPoIB for the TeraSort and Sort benchmarks, respectively.

Evaluation with the Apache Hadoop 2.5.0 Distribution (TestDFSIO). The RDMA-enhanced HDFS used as a plugin without the Triple-H design is denoted Apache-2.5-SORHDFS-RDMAPlugin. The plugin design shows 27% higher throughput and an 18% reduction in latency for a 40 GB data size for TestDFSIO write over IPoIB.

Evaluation with Enterprise Hadoop Distributions, HDP 2.2 and CDH 5.4.2 (TestDFSIO). Default HDFS running over IPoIB for HDP and CDH is denoted HDP-2.2-IPoIB and CDH-5.4.2-IPoIB; the RDMA-enhanced plugin and Triple-H applied to HDP and CDH are denoted HDP-2.2-TripleH-RDMAPlugin and CDH-5.4.2-TripleH-RDMAPlugin. The plugin design offers a 63% improvement in latency and a 3.7x improvement in throughput for the TestDFSIO write benchmark.

Evaluation with Enterprise Hadoop Distributions, HDP 2.2 and CDH 5.4.2 (TeraGen & RandomWriter). The plugin design shows a performance benefit of 37% for TeraGen and 23% for RandomWriter over HDP running over IPoIB. The plugin applied to CDH shows a performance benefit of 41% for TeraGen and 49% for RandomWriter.

Performance Comparison with R4H on Cluster B (TestDFSIO). Mellanox R4H (RDMA for HDFS) is shown as R4H and our plugin is shown as Triple-H; both plugins are applied to HDP 2.2. The RDMA-based plugin design offers a 4.6x improvement in throughput compared to R4H. As the data size increases, the throughput becomes disk-bound.

Outline: Introduction & Motivation, Problem Statement, RDMA-based Plugin Design, Performance Evaluation, Conclusion & Future Work

Conclusion and Future Work. We proposed an RDMA-based plugin for the Hadoop Distributed File System (HDFS) to leverage the benefits of RDMA across Apache and enterprise Hadoop distributions. Extensive experimental results demonstrate that our proposed RDMA-based HDFS plugin incurs no extra performance overhead for different benchmarks; achieves up to a 3.7x improvement in TestDFSIO write throughput and up to a 48% improvement in latency compared to different Hadoop distributions (Apache, Hortonworks HDP, Cloudera CDH) running over IPoIB; and achieves up to a 4.6x improvement in TestDFSIO write throughput and a 62% improvement in TestDFSIO write latency compared to Mellanox R4H. The plugin is available at http://hibd.cse.ohio-state.edu. Future work: undertake detailed studies to assess the benefits of using the proposed plugin for more Hadoop applications, and make the RDMA-based plugin available for CDH as part of the HiBD project.

Thank You! {bhata, islamn, luxi, rahmanmd, shankard, panda}@cse.ohio-state.edu. Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/. The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/