Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda, Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA

Outline: Introduction and Motivation; Problem Statement; Design; Performance Evaluation; Conclusion & Future Work

Introduction. Big Data provides groundbreaking opportunities for enterprise information management and decision making. The rate of information growth appears to be exceeding Moore's Law. The amount of data is exploding; companies are capturing and digitizing more information than ever. 35 zettabytes of data will be generated and consumed by the end of this decade.

Big Data Technology. Apache Hadoop is a popular Big Data technology. It provides a framework for large-scale, distributed data storage and processing. Hadoop is an open-source implementation of the MapReduce programming model. The Hadoop Distributed File System (HDFS) (http://hadoop.apache.org/) is the underlying file system of Hadoop and of the Hadoop database, HBase. (Figure: the Hadoop framework, with MapReduce and HBase layered over HDFS.)

Hadoop Distributed File System (HDFS). Adopted by many reputed organizations, e.g., Facebook, Yahoo! Highly reliable and fault-tolerant through replication. The NameNode stores the file system namespace; DataNodes store data blocks. Developed in Java for platform independence and portability. Uses Java sockets for communication. (Figure: HDFS architecture.)

Modern High Performance Interconnects (Socket Interface). (Figure: the application/middleware stack over the sockets interface, with three protocol paths: kernel-space TCP/IP over an Ethernet driver with 1/10 GigE adapters and switches; TCP/IP over IPoIB with InfiniBand adapters and switches; and TCP/IP with hardware offload on 10 GigE-TOE adapters and switches.) Cloud computing systems are being widely used on High Performance Computing (HPC) clusters. Commodity high performance networks like InfiniBand can provide low-latency and high-throughput data transmission. For data-intensive applications, network performance becomes a key component for HDFS.

Modern High Performance Interconnects. (Figure: the same stack extended with an IB Verbs path: user-space RDMA over InfiniBand adapters and switches, alongside the 1/10 GigE, IPoIB, and 10 GigE-TOE paths.)

RDMA-based Design of HDFS. Enables high performance RDMA communication while supporting the traditional socket interface. A JNI layer bridges the Java-based HDFS with a communication library written in native code. Only the communication part of HDFS Write is modified; there is no change in the HDFS architecture. Available in the Hadoop RDMA 0.9.1 release: http://hadoop-rdma.cse.ohio-state.edu/ (Figure: applications issue HDFS Write either through the Java socket interface over 1/10 GigE or IPoIB, or through the Java Native Interface (JNI) and the OSU-IB communication library over IB Verbs on InfiniBand.) N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012
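To make the JNI bridging idea concrete, here is a minimal, hypothetical sketch of how the Java write path might declare its hook into a native RDMA communication library. The class name RdmaBridge, the library name rdmahdfs, and the writePacket signature are assumptions for illustration; the slides do not show the real OSU-IB interface.

// Hypothetical JNI bridge between the Java HDFS write path and a native
// RDMA communication library; names and signatures are illustrative only.
public class RdmaBridge {
    static {
        // Assumed name of the native shared library (librdmahdfs.so).
        System.loadLibrary("rdmahdfs");
    }

    // Implemented in native code; hands a serialized HDFS packet to the
    // verbs-level layer for transmission over the given end-point.
    public static native void writePacket(long endPointHandle, byte[] packet, int length);
}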

Hadoop-RDMA Release: a high-performance design of Hadoop over RDMA-enabled interconnects. High performance design with native InfiniBand support at the verbs level for the HDFS, MapReduce, and RPC components. Easily configurable for both native InfiniBand and the traditional sockets-based support (Ethernet, and InfiniBand with IPoIB). Current release: 0.9.1. Based on Apache Hadoop 0.20.2 and compliant with Apache Hadoop 0.20.2 APIs and applications. Tested with Mellanox InfiniBand adapters (DDR, QDR and FDR), various multi-core platforms, and different file systems with disks and SSDs. http://hadoop-rdma.cse.ohio-state.edu An updated release based on the Hadoop stable version (1.2.1) is coming soon.

HDFS Replication: Pipelined vs Parallel. The basic mechanism of HDFS fault tolerance is data replication: each block is replicated to multiple DataNodes. Pipelined replication is not good for latency-sensitive applications. In parallel replication, the DFSClient writes data to all the DataNodes directly. Can parallel replication benefit HDFS and its applications when HDFS runs on different kinds of high performance networks? (Figure: pipelined replication, where the client writes to DataNode 1, which forwards to DataNode 2 and then DataNode 3, vs. parallel replication, where the client writes to DataNodes 1, 2, and 3 directly.)
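The contrast between the two schemes can be summarized in a few lines of Java. This is a minimal sketch, not HDFS code: the output streams stand in for connections to DataNodes, and pipeline forwarding between DataNodes is not modeled.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

public class ReplicationSketch {

    // Pipelined: the client sends each packet only to the first DataNode,
    // which would forward it down the pipeline (forwarding is not modeled here).
    static void writePipelined(byte[] packet, List<OutputStream> replicas) throws IOException {
        replicas.get(0).write(packet);
    }

    // Parallel: the client sends the same packet to every replica itself.
    static void writeParallel(byte[] packet, List<OutputStream> replicas) throws IOException {
        for (OutputStream replica : replicas) {
            replica.write(packet);
        }
    }

    public static void main(String[] args) throws IOException {
        // Byte-array streams stand in for connections to three DataNodes.
        List<OutputStream> replicas = List.<OutputStream>of(
                new ByteArrayOutputStream(), new ByteArrayOutputStream(), new ByteArrayOutputStream());
        writeParallel("block-packet".getBytes(), replicas);
        System.out.println("packet written to " + replicas.size() + " replicas");
    }
}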

Outline: Introduction and Motivation; Problem Statement; Design; Performance Evaluation; Conclusion & Future Work

Problem Statement. What are the challenges of introducing the parallel replication scheme in both the socket-based and RDMA-based designs of HDFS? Can we re-design HDFS to take advantage of the parallel replication scheme over high performance networks and protocols? What will be the impact of parallel replication on Hadoop benchmarks over different interconnects and protocols? Can we observe performance improvement for other cloud computing middleware, such as HBase, with this replication technique?

Outline: Introduction and Motivation; Problem Statement; Design (Parallel Replication in Socket-based Design of HDFS, Parallel Replication in RDMA-based Design of HDFS); Performance Evaluation; Conclusion & Future Work

Parallel Replication in Socket-based Design of HDFS (Issues and Challenges). In pipelined replication, the DFSClient writes to the first DataNode in the pipeline and receives acknowledgements from only that first DataNode. In parallel replication, the DFSClient writes to all the DataNodes (three by default) that the block should be replicated to and receives acknowledgements from all of them. The fan-out is handled by new MultiDataOutputStream and MultiDataInputStream classes.
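A minimal Java sketch of the fan-out idea behind MultiDataOutputStream. The slides name the class but not its implementation, so the details below (one OutputStream per DataNode, every write duplicated to each) are assumptions for illustration rather than the actual HDFS code.

import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

// Sketch of a fan-out stream in the spirit of the MultiDataOutputStream
// mentioned on the slide; the real class details are not shown here.
public class MultiDataOutputStream extends OutputStream {
    private final List<OutputStream> targets;   // one stream per DataNode

    public MultiDataOutputStream(List<OutputStream> targets) {
        this.targets = targets;
    }

    @Override
    public void write(int b) throws IOException {
        for (OutputStream out : targets) out.write(b);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        for (OutputStream out : targets) out.write(buf, off, len);   // same packet to every replica
    }

    @Override
    public void flush() throws IOException {
        for (OutputStream out : targets) out.flush();
    }
}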

Parallel Replication in Socket-based Design of HDFS (Communication Flow). (Flow chart.) Pipelined replication: in the DFSClient, the DataStreamer waits for data packets and writes each packet to a DataOutputStream until the last packet is sent; the ResponseProcessor waits on a DataInputStream until it gets acknowledgements from the first DataNode. Parallel replication: the DataStreamer instead writes each packet to a MultiDataOutputStream, and the ResponseProcessor waits on a MultiDataInputStream until it gets acknowledgements from all the DataNodes.
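As a companion to the flow chart, here is a hedged sketch of the parallel-replication acknowledgement path: the response processor does not report a packet as complete until every replica has acknowledged it. Class and member names are illustrative, not the actual DFSClient internals.

import java.io.DataInputStream;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Illustrative ack path for parallel replication: wait for every replica,
// not just the first DataNode in a pipeline.
class ParallelResponseProcessor implements Runnable {
    private final List<DataInputStream> ackStreams;        // one ack stream per DataNode
    private final BlockingQueue<Long> ackedSequenceNumbers;

    ParallelResponseProcessor(List<DataInputStream> ackStreams,
                              BlockingQueue<Long> ackedSequenceNumbers) {
        this.ackStreams = ackStreams;
        this.ackedSequenceNumbers = ackedSequenceNumbers;
    }

    @Override
    public void run() {
        try {
            while (true) {
                long seqno = -1;
                for (DataInputStream ack : ackStreams) {
                    seqno = ack.readLong();                 // block until this replica acknowledges
                }
                ackedSequenceNumbers.put(seqno);            // packet is now on all replicas
            }
        } catch (IOException | InterruptedException e) {
            Thread.currentThread().interrupt();             // stop on error or shutdown
        }
    }
}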

Outline: Introduction and Motivation; Problem Statement; Design (Parallel Replication in Socket-based Design of HDFS, Parallel Replication in RDMA-based Design of HDFS); Performance Evaluation; Conclusion & Future Work

Parallel Replication in RDMA-based Design of HDFS (Issues and Challenges). The RDMA-based design of HDFS implements pipelined replication. The challenges in incorporating parallel replication are: reducing the connection creation overhead in the DFSClient, minimizing the polling overhead, and reducing the total wait time for acknowledgements on the DFSClient side.

Parallel Replication in RDMA-based Design of HDFS (Connection Management). RDMA connection creation is expensive, and with parallel replication the DFSClient would need three connections (to DN1, DN2, and DN3) instead of one. The design instead uses a single Connection object with different end-points (ep1, ep2, ep3) to connect to the different DataNodes. This also reduces the total wait time for acknowledgements.
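The following is an illustrative-only sketch of that connection-management idea: one Connection object owns a lightweight end-point per DataNode, so the expensive connection setup cost is paid once. The Connection and EndPoint classes here are placeholders and are not the OSU-IB library API.

import java.util.HashMap;
import java.util.Map;

// Placeholder end-point: one per DataNode, cheap to create.
class EndPoint {
    private final String dataNodeAddress;
    EndPoint(String dataNodeAddress) { this.dataNodeAddress = dataNodeAddress; }
    void send(byte[] packet) { /* hand the packet to the native RDMA layer */ }
}

// Placeholder connection: created once per DFSClient, shared by all end-points.
class Connection {
    private final Map<String, EndPoint> endPoints = new HashMap<>();

    // Reuse the single connection; only the lightweight end-point is per-DataNode.
    EndPoint endPointFor(String dataNodeAddress) {
        return endPoints.computeIfAbsent(dataNodeAddress, EndPoint::new);
    }
}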

Parallel Replication in RDMA-based Design of HDFS (Minimizing Polling Overhead). One receiver per client (or block) in the DataNode (Rcvr1 for C1, Rcvr2 for C2, ..., RcvrN for CN) increases the polling overhead. The design instead uses a single receiver in the DataNode; different clients connect to different end-points of the same Connection object, so one receiver polls for all of them.
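A sketch of the single-receiver idea, assuming the communication layer posts incoming packets into one shared queue that a lone DataNode thread drains. The queue stands in for the underlying completion channel; this is an illustration of the concept, not the actual receiver code.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// One receiver thread serves all clients by draining a single shared queue,
// instead of one polling thread per client connection.
class SingleReceiver implements Runnable {
    // Each entry records which end-point (client) the packet arrived on.
    record ReceivedPacket(int endPointId, byte[] payload) { }

    private final BlockingQueue<ReceivedPacket> completions = new LinkedBlockingQueue<>();

    // Called by the (assumed) communication layer when data arrives on any end-point.
    void onCompletion(ReceivedPacket packet) throws InterruptedException {
        completions.put(packet);
    }

    @Override
    public void run() {
        try {
            while (true) {
                ReceivedPacket packet = completions.take();   // one blocking wait for all clients
                handle(packet);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void handle(ReceivedPacket packet) { /* write block data to disk, send ack */ }
}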

Parallel Replication in RDMA-based Design of HDFS (Communication Flow). (Flow chart.) Pipelined replication: in the DFSClient, the RDMADataStreamer waits for data packets and writes each packet to the end-point connected to the first DataNode until the last packet is sent; the RDMAResponseProcessor waits on that end-point until it gets acknowledgements from the first DataNode. Parallel replication: the RDMADataStreamer writes each packet to the different end-points connecting to the different DataNodes, and the RDMAResponseProcessor waits on the single Connection object until it gets acknowledgements from all the DataNodes.

Outline: Introduction and Motivation; Problem Statement; Design; Performance Evaluation; Conclusion & Future Work

Experimental Setup. Hardware: Intel Westmere (Cluster A): each node has 8 processor cores on two Intel Xeon 2.67 GHz quad-core CPUs, 12 GB main memory, and a 160 GB hard disk; network: 1GigE, IPoIB, and IB-QDR (32Gbps). Intel Westmere with larger memory (Cluster B): nodes have the same configuration as Cluster A but with 24 GB RAM; 8 storage nodes with three 1 TB HDDs per node; network: 1GigE, 10GigE, IPoIB, and IB-QDR (32Gbps). Software: Hadoop 0.20.2, HBase 0.90.3, and JDK 1.7; Yahoo! Cloud Serving Benchmark (YCSB).

Outline: Introduction and Motivation; Problem Statement; Design; Performance Evaluation (Micro-benchmark level evaluations, Evaluations with TestDFSIO, Evaluations with TeraGen, Integration with HBase (TCP/IP)); Conclusion & Future Work

Evaluations using Micro-benchmark. (Charts: micro-benchmark latency in seconds and throughput in MBps for file sizes of 10, 15, and 20 GB, comparing pipelined and parallel replication over 1GigE, IPoIB (32Gbps), and OSU-IB (32Gbps), on Cluster A with 8 HDD DataNodes.) 15% improvement over IPoIB (32Gbps); 12.5% improvement over OSU-IB (32Gbps). For 1GigE, the NIC bandwidth is a bottleneck. The improvement grows for larger data sizes.

Outline: Introduction and Motivation; Problem Statement; Design; Performance Evaluation (Micro-benchmark level evaluations, Evaluations with TestDFSIO, Evaluations with TeraGen, Integration with HBase (TCP/IP)); Conclusion & Future Work

Evaluations using TestDFSIO. (Charts: total throughput in MBps for pipelined and parallel replication over 10GigE, IPoIB (32Gbps), and OSU-IB (32Gbps); file sizes of 20, 25, and 30 GB on Cluster B with 8 HDD DataNodes, and 40, 60, and 80 GB on Cluster A with 32 HDD DataNodes.) Cluster B with 8 HDD DataNodes: 11% improvement over 10GigE, 10% improvement over IPoIB (32Gbps), and 12% improvement over OSU-IB (32Gbps). Cluster A with 32 HDD DataNodes: 8% improvement over IPoIB (32Gbps) and 9% improvement over OSU-IB (32Gbps).

Outline: Introduction and Motivation; Problem Statement; Design; Performance Evaluation (Micro-benchmark level evaluations, Evaluations with TestDFSIO, Evaluations with TeraGen, Integration with HBase (TCP/IP)); Conclusion & Future Work

Evaluations using TeraGen. (Charts: execution time in seconds for pipelined and parallel replication over 10GigE, IPoIB (32Gbps), and OSU-IB (32Gbps); file sizes of 20, 25, and 30 GB on Cluster B with 8 HDD DataNodes, and 40, 60, and 80 GB on Cluster A with 32 HDD DataNodes.) Cluster B with 8 HDD DataNodes: 16% improvement over IPoIB (32Gbps), 10GigE, and OSU-IB (32Gbps). Cluster A with 32 HDD DataNodes: 11% improvement over IPoIB (32Gbps) and OSU-IB (32Gbps).

Outline: Introduction and Motivation; Problem Statement; Design; Performance Evaluation (Micro-benchmark level evaluations, Evaluations with TestDFSIO, Evaluations with TeraGen, Integration with HBase (TCP/IP)); Conclusion & Future Work

Evaluations using YCSB. (Charts: HBase Put average latency in microseconds and throughput in operations per second for record counts of 240K, 360K, and 480K, comparing pipelined and parallel replication over IPoIB (32Gbps) and OSU-IB (32Gbps).) HBase Put operation with 4 RegionServers in Cluster A: 17% improvement over IPoIB (32Gbps) and 10% improvement over OSU-IB (32Gbps).

Outline: Introduction and Motivation; Problem Statement; Design; Performance Evaluation; Conclusion & Future Work

Conclusion and Future Work. Introduced parallel replication in both the socket-based and RDMA-based designs of HDFS over InfiniBand. Performed a comprehensive evaluation of the impact of parallel replication on different Hadoop benchmarks. Integration with HBase leads to a performance improvement for the HBase Put operation. Future work: identify architectural bottlenecks in higher-level HDFS designs and propose enhancements to work with high performance communication schemes; integrate with other Hadoop components designed over InfiniBand.

Tutorial on August 23, 2013 (8:30-12:30): Accelerating Big Data Processing with Hadoop and Memcached Using High Performance Interconnects: Opportunities and Challenges, by D. K. Panda and Xiaoyi Lu, The Ohio State University

Thank You! {islamn, luxi, rahmanmd, panda}@cse.ohio-state.edu Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/ Hadoop Web Page: http://hadoop-rdma.cse.ohio-state.edu/