Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects


1 Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects Keynote Talk at ADMS 2014 by Dhabaleswar K. (DK) Panda The Ohio State University

2 Introduction to Big Data Applications and Analytics Big Data has become one of the most important elements of business analytics Provides groundbreaking opportunities for enterprise information management and decision making The amount of data is exploding; companies are capturing and digitizing more information than ever The rate of information growth appears to be exceeding Moore's Law ADMS '14 2

3 4V Characteristics of Big Data Commonly accepted 3V's of Big Data Volume, Velocity, Variety Michael Stonebraker: Big Data Means at Least Three Different Things 4/5V's of Big Data 3V + *Veracity, *Value ADMS '14 3

4 Velocity of Big Data How Much Data Is Generated Every Minute on the Internet? The global Internet population grew 6.59% from 2010 to 2011 and now represents 2.1 Billion People. ADMS '14 4

5 Data Management and Processing in Modern Datacenters Substantial impact on designing and utilizing modern data management and processing systems in multiple tiers Front-end data accessing and serving (Online) Memcached + DB (e.g. MySQL), HBase Back-end data analytics (Offline) HDFS, MapReduce, Spark ADMS '14 5

6 Overview of Web 2.0 Architecture and Memcached Three-layer architecture of Web 2.0 Web Servers, Memcached Servers, Database Servers Memcached is a core component of Web 2.0 architecture ADMS '14 6

7 Memcached Architecture Distributed Caching Layer Allows aggregation of spare memory from multiple nodes General purpose Typically used to cache database queries, results of API calls Scalable model, but typical usage very network intensive ADMS '14 7

8 HBase Overview Apache Hadoop Database - a semi-structured database, which is highly scalable Integral part of many datacenter applications e.g., Facebook Social Inbox Developed in Java for platform-independence and portability (HBase Architecture) Uses sockets for communication! ADMS '14 8

9 Hadoop Distributed File System (HDFS) Primary storage of Hadoop; highly reliable and fault-tolerant Adopted by many reputed organizations e.g., Facebook, Yahoo! NameNode: stores the file system namespace DataNode: stores data blocks Developed in Java for platform-independence and portability Uses sockets for communication! (HDFS Architecture) ADMS '14 9

10 Data Movement in Hadoop MapReduce Bulk Data Transfer Disk Operations Map and Reduce Tasks carry out the total job execution Map tasks read from HDFS, operate on the data, and write the intermediate data to local disk Reduce tasks get these data by shuffle from TaskTrackers, operate on them, and write to HDFS Communication in the shuffle phase uses HTTP over Java Sockets ADMS '14 10

11 Spark Architecture Overview An in-memory data-processing framework Iterative machine learning jobs Interactive data analytics Scala-based implementation Standalone, YARN, Mesos Scalable and communication intensive Wide dependencies between Resilient Distributed Datasets (RDDs) MapReduce-like shuffle operations to repartition RDDs Sockets-based communication ADMS '14 11

12 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based Memcached Case study with OLTP SSD-assisted hybrid Memcached RDMA-based HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 12

13 High-End Computing (HEC): PetaFlop to ExaFlop 1 PFlops in 2008; 1 EFlops in 2018? Expected to have an ExaFlop system in the near future! ADMS '14 13

14 Trends for Commodity Computing Clusters in the Top500 List (Chart: number and percentage of clusters in the Top500 over the timeline) ADMS '14 14

15 High End Computing (HEC) High End Computing (HEC) grows dramatically High Performance Computing Big Data Computing Technology Advancement Multi-core/many-core technologies and accelerators Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) Solid State Drives (SSDs) and Non-Volatile Random-Access Memory (NVRAM) Accelerators (NVIDIA GPGPUs and Intel Xeon Phi) Tianhe-2 (1) Titan (2) Stampede (6) Tianhe-1A (10) ADMS '14 15

16 Overview of High Performance Interconnects High-Performance Computing (HPC) has adopted advanced interconnects and protocols InfiniBand 10 Gigabit Ethernet/iWARP RDMA over Converged Enhanced Ethernet (RoCE) Very Good Performance Low latency (few microseconds) High Bandwidth (100 Gb/s with dual FDR InfiniBand) Low CPU overhead (5-10%) OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems Many such systems in the Top500 list ADMS '14 16

17 All interconnects and protocols in OpenFabrics Stack (Diagram: Application / Middleware over Sockets or Verbs interfaces, going through kernel-space TCP/IP, hardware-offloaded TCP/IP, IPoIB, user-space RSockets/SDP, or user-space RDMA, down to Ethernet and InfiniBand adapters and switches) Resulting options: 1/10/40 GigE, IPoIB, 10/40 GigE-TOE, RSockets, SDP, iWARP, RoCE, IB Native ADMS '14 17

18 Trends of Networking Technologies in TOP500 Systems Percentage share of InfiniBand is steadily increasing (Chart: interconnect family systems share over time) ADMS '14 18

19 Large-scale InfiniBand Installations 223 IB Clusters (44.3%) in the June 2014 Top500 list Installations in the Top 50 (25 systems): 519,640 cores (Stampede) at TACC (7th) 120,640 cores (Nebulae) at China/NSCS (28th) 62,640 cores (HPC2) in Italy (11th) 72,288 cores (Yellowstone) at NCAR (29th) 147,456 cores (SuperMUC) in Germany (12th) 70,560 cores (Helios) at Japan/IFERC (30th) 76,320 cores (Tsubame 2.5) at Japan/GSIC (13th) 138,368 cores (Tera-100) at France/CEA (35th) 194,616 cores (Cascade) at PNNL (15th) 222,072 cores (QUARTETTO) in Japan (37th) 110,400 cores (Pangea) at France/Total (16th) 53,540 cores (PRIMERGY) in Australia (38th) 96,192 cores (Pleiades) at NASA/Ames (21st) 77,520 cores (Conte) at Purdue University (39th) 73,584 cores (Spirit) at USA/Air Force (24th) 44,520 cores (Spruce A) at AWE in UK (40th) 77,184 cores (Curie thin nodes) at France/CEA (26th) 48,896 cores (MareNostrum) at Spain/BSC (41st) 65,320-core iDataPlex DX360M4 at Germany/Max-Planck (27th) and many more! ADMS '14 19

20 Open Standard InfiniBand Networking Technology Introduced in Oct 2000 High Performance Data Transfer Interprocessor communication and I/O Low latency (<1.0 microsec), High bandwidth (up to 12.5 GigaBytes/sec -> 100Gbps), and low CPU utilization (5-10%) Flexibility for LAN and WAN communication Multiple Transport Services Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram Provides flexibility to develop upper layers Multiple Operations Send/Recv RDMA Read/Write Atomic Operations (very unique) high performance and scalable implementations of distributed locks, semaphores, collective communication operations Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, etc. ADMS '14 20

21 Communication in the Channel Semantics (Send/Receive Model) (Diagram: memory segments, QPs with Send/Recv queues, and CQs on both processors, connected through InfiniBand devices with hardware ACK) Processor is involved only to: 1. Post receive WQE 2. Post send WQE 3. Pull out completed CQEs from the CQ Send WQE contains information about the send buffer (multiple non-contiguous segments) Receive WQE contains information on the receive buffer (multiple non-contiguous segments); incoming messages have to be matched to a receive WQE to know where to place them ADMS '14 21
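
The channel semantics above map directly onto the OpenFabrics verbs API. The fragment below is a minimal sketch, assuming a QP that has already been created and connected and a buffer already registered with ibv_reg_mr; it only illustrates how the receive WQE, the send WQE and the CQ polling from the three steps look in code, and is not taken from MVAPICH2 or the HiBD packages.

```c
/* Sketch of the Send/Recv (channel) semantics with libibverbs.
 * Assumes the QP is already connected and the buffer registered (mr);
 * error handling is reduced to return codes for brevity. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = (uint32_t)len,
                           .lkey = mr->lkey };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    return ibv_post_recv(qp, &wr, &bad);            /* step 1: post receive WQE */
}

int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = (uint32_t)len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);            /* step 2: post send WQE */
}

int wait_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    do {                                            /* step 3: pull a CQE from the CQ */
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```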

22 Communication in the Memory Semantics (RDMA Model) (Diagram: initiator and target memory segments, QPs and CQs, connected through InfiniBand devices with hardware ACK) Initiator processor is involved only to: 1. Post send WQE 2. Pull out completed CQE from the send CQ No involvement from the target processor Send WQE contains information about the send buffer (multiple segments) and the receive buffer (single segment) ADMS '14 22
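
The one-sided memory semantics look similar on the initiator side, except that the work request names the remote buffer through an address/rkey pair that must have been exchanged out of band (for example over a send/receive handshake). Below is a minimal sketch, again assuming an already connected QP and a registered local buffer; it is illustrative only and not code from the OSU designs.

```c
/* Sketch of the RDMA (memory) semantics: a one-sided RDMA Write.
 * remote_addr and rkey are assumed to have been exchanged beforehand;
 * the target processor is not involved in the transfer. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf, size_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)local_buf, .length = (uint32_t)len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 3, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_RDMA_WRITE,   /* IBV_WR_RDMA_READ for reads */
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;

    wr.wr.rdma.remote_addr = remote_addr;   /* where the data lands on the target */
    wr.wr.rdma.rkey        = rkey;          /* remote memory key */
    return ibv_post_send(qp, &wr, &bad);    /* only the initiator posts a WQE */
}
```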

23 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 23

24 Wide Adaptation of RDMA Technology Message Passing Interface (MPI) for HPC Parallel File Systems Lustre GPFS Delivering excellent performance (latency, bandwidth and CPU Utilization) Delivering excellent scalability ADMS '14 24

25 MVAPICH2/MVAPICH2-X Software High Performance open-source MPI Library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE) MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002 MVAPICH2-X (MPI + PGAS), Available since 2012 Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC) Used by more than 2,200 organizations in 73 countries More than 221,000 downloads from the OSU site directly Empowering many TOP500 clusters 7th ranked 519,640-core cluster (Stampede) at TACC 13th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology 23rd ranked 96,192-core cluster (Pleiades) at NASA many others... Available with software stacks of many IB, HSE, and server vendors including Linux Distros (RedHat and SuSE) Partner in the U.S. NSF-TACC Stampede System ADMS '14 25

26 One-way Latency: MPI over IB with MVAPICH2 (Plots: small- and large-message latency in us vs. message size in bytes for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR and Ivy-ConnectIB-DualFDR) DDR, QDR: Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR: Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR: Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR: Deca-core (IvyBridge) Intel PCI Gen3 with IB switch ADMS '14 26

27 Bandwidth: MPI over IB with MVAPICH2 (Plots: unidirectional and bidirectional bandwidth in MBytes/sec vs. message size in bytes for the same set of adapters) DDR, QDR: Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR: Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR: Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR: Deca-core (IvyBridge) Intel PCI Gen3 with IB switch ADMS '14 27

28 Can High-Performance Interconnects Benefit Data Management and Processing? Most of the current Big Data systems use Ethernet Infrastructure with Sockets Concerns for performance and scalability Usage of High-Performance Networks is beginning to draw interest Oracle, IBM, Google, Intel are working along these directions What are the challenges? Where do the bottlenecks lie? Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)? Can HPC Clusters with High-Performance networks be used for Big Data applications using Hadoop and Memcached? ADMS '14 28

29 Designing Communication and I/O Libraries for Big Data Systems: Challenges Applications Benchmarks Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Programming Models (Sockets) Other Protocols? Point-to-Point Communication Communication and I/O Library Threaded Models and Synchronization Virtualization I/O and File Systems QoS Fault-Tolerance Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD and SSD) ADMS '14 29

30 Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols? Current Design Enhanced Designs Our Approach Application Application Application Sockets Accelerated Sockets Verbs / Hardware Offload OSU Design Verbs Interface 1/10 GigE Network 10 GigE or InfiniBand 10 GigE or InfiniBand Sockets not designed for high performance Stream semantics often mismatch for upper layers (Memcached, HBase, Hadoop) Zero-copy not available for non-blocking sockets ADMS '14 30

31 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 31

32 The High-Performance Big Data (HiBD) Project RDMA for Memcached (RDMA-Memcached) RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) RDMA for Apache Hadoop 1.x (RDMA-Hadoop) OSU HiBD-Benchmarks (OHB) RDMA for Apache HBase and Spark ADMS '14 32

33 RDMA for Memcached Distribution High-Performance Design of Memcached over RDMA-enabled Interconnects High performance design with native InfiniBand and RoCE support at the verbs-level for Memcached and libmemcached components Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB) Current release: 0.9.1 Based on Memcached and libmemcached Compliant with Memcached APIs and applications Tested with Mellanox InfiniBand adapters (DDR, QDR and FDR) RoCE support with Mellanox adapters Various multi-core platforms ADMS '14 33

34 RDMA for Apache Hadoop 1.x/2.x Distributions High-Performance Design of Hadoop over RDMA-enabled Interconnects High performance design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB) Current release: 0.9.9/0.9.1 Based on Apache Hadoop 1.2.1/2.4.1 Compliant with Apache Hadoop 1.2.1/2.4.1 APIs and applications Tested with Mellanox InfiniBand adapters (DDR, QDR and FDR) RoCE support with Mellanox adapters Various multi-core platforms Different file systems with disks and SSDs ADMS '14 34

35 OSU HiBD Micro-Benchmark (OHB) Suite - Memcached Released in OHB 0.7.1 (ohb_memlat) Evaluates the performance of stand-alone Memcached Three different micro-benchmarks SET Micro-benchmark: Micro-benchmark for memcached set operations GET Micro-benchmark: Micro-benchmark for memcached get operations MIX Micro-benchmark: Micro-benchmark for a mix of memcached set/get operations (Read:Write ratio is 9:1) Calculates average latency of Memcached operations Can measure throughput in Transactions Per Second ADMS '14 35
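
The measurement loop behind a latency benchmark like ohb_memlat is simple: time a large number of operations against a stand-alone server through the regular libmemcached client API and report the average. The sketch below is not the OHB source; the server address, key and iteration count are placeholders, and it only illustrates the GET-latency measurement idea.

```c
/* Minimal GET-latency measurement loop in the spirit of ohb_memlat.
 * Not the OHB code; server, key and iteration count are placeholders. */
#include <libmemcached/memcached.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);

    const char *key = "ohb_key";
    char value[64] = "x";
    memcached_set(memc, key, strlen(key), value, sizeof(value), 0, 0);

    const int iters = 10000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        size_t len; uint32_t flags; memcached_return_t rc;
        char *val = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
        free(val);                         /* libmemcached returns a malloc'd copy */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("average GET latency: %.2f us\n", us / iters);

    memcached_free(memc);
    return 0;
}
```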

36 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based Memcached Case study with OLTP SSD-assisted hybrid Memcached RDMA-based HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 36

37 Design Challenges of RDMA-based Memcached Can Memcached be re-designed from the ground up to utilize RDMA capable networks? How efficiently can we utilize RDMA for Memcached operations? Can we leverage the best features of RC and UD to deliver both high performance and scalability to Memcached? Memcached applications need not be modified; the middleware uses verbs interface if available ADMS '14 37

38 Memcached-RDMA Design (Diagram: sockets and RDMA clients, a master thread, sockets and verbs worker threads, and shared data with memory slabs and items) Server and client perform a negotiation protocol Master thread assigns clients to appropriate worker thread Once a client is assigned a verbs worker thread, it can communicate directly and is bound to that thread All other Memcached data structures are shared among RDMA and Sockets worker threads Native IB-verbs-level design and evaluation with Server: Memcached, Client: libmemcached Different networks and protocols: 10GigE, IPoIB, native IB (RC, UD) ADMS '14 38

39 Memcached Get Latency (Small Message) (Plots: latency in us vs. message size up to 2K for IPoIB, 10GigE, OSU-RC-IB and OSU-UD-IB on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR)) Memcached Get latency 4 bytes RC/UD DDR: 6.82/7.55 us; QDR: 4.28/4.86 us 2K bytes RC/UD DDR: 12.31/12.78 us; QDR: 8.19/8.46 us Almost a factor of four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster ADMS '14 39

40 Memcached Get TPS (4 bytes) (Plots: thousands of transactions per second vs. number of clients for IPoIB (QDR), OSU-RC-IB (QDR) and OSU-UD-IB (QDR)) Memcached Get transactions per second for 4 bytes On IB QDR 1.4M/s (RC), 1.3M/s (UD) for 8 clients Significant improvement with native IB QDR compared to IPoIB ADMS '14 40

41 Memcached Performance (FDR Interconnect) (Plots: Memcached GET latency in us vs. message size up to 4K, and throughput in thousands of transactions per second vs. number of clients, for OSU-IB (FDR) and IPoIB (FDR); latency reduced by nearly 20X) Memcached Get latency OSU-IB: 2.84 us for small messages; 4.49 us for 2K bytes Memcached Throughput (4 bytes) Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR) 48 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/sec Nearly 2X improvement in throughput ADMS '14 41

42 Application Level Evaluation Olio Benchmark (Plots: time in ms vs. number of clients for IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR) and OSU-Hybrid-IB (QDR)) Olio Benchmark RC 1.6 sec, UD 1.9 sec, Hybrid 1.7 sec for 1024 clients 4X better than IPoIB for 8 clients Hybrid design achieves comparable performance to that of the pure RC design ADMS '14 42

43 Application Level Evaluation Real Application Workloads (Plots: time in ms vs. number of clients for IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR) and OSU-Hybrid-IB (QDR)) Real Application Workload RC 302 ms, UD 318 ms, Hybrid 314 ms for 1024 clients 12X better than IPoIB for 8 clients Hybrid design achieves comparable performance to that of the pure RC design J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP 11 J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid 12 ADMS '14 43

44 Memcached-RDMA Case Studies with OLTP Workloads Design of scalable architectures using Memcached and MySQL Case study to evaluate benefits of using RDMA-Memcached for traditional OLTP workloads ADMS '14 44

45 Evaluation with Traditional OLTP Workloads - mysqlslap (Plots: latency in sec and throughput in q/ms vs. number of clients for Memcached-IPoIB (32Gbps) and Memcached-RDMA (32Gbps)) Illustration with Read-Cache-Read access pattern using a modified mysqlslap load testing tool Up to 4 nodes, 1 concurrent threads per node, 2 queries per client Memcached-RDMA can improve query latency by up to 66% over IPoIB (32Gbps) and throughput by up to 69% over IPoIB (32Gbps) ADMS '14 45
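
The Read-Cache-Read access pattern used above first looks a query result up in Memcached and only goes to MySQL on a miss, refilling the cache afterwards. The C sketch below only illustrates that pattern with the stock libmemcached and MySQL client APIs; it is not the modified mysqlslap driver from the evaluation, connection setup is omitted, and the key and SQL text are placeholders.

```c
/* Sketch of the Read-Cache-Read pattern: check Memcached, fall back to
 * MySQL on a miss, then populate the cache. Handles are assumed to be
 * connected already; key and query are placeholders. */
#include <libmemcached/memcached.h>
#include <mysql/mysql.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

char *cached_query(memcached_st *memc, MYSQL *db, const char *key, const char *sql)
{
    size_t len; uint32_t flags; memcached_return_t rc;
    char *val = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (val != NULL)
        return val;                              /* cache hit */

    if (mysql_query(db, sql) != 0)               /* cache miss: ask the database */
        return NULL;
    MYSQL_RES *res = mysql_store_result(db);
    if (res == NULL)
        return NULL;
    MYSQL_ROW row = mysql_fetch_row(res);
    char *copy = row ? strdup(row[0]) : NULL;
    mysql_free_result(res);

    if (copy)                                    /* populate the cache for next time */
        memcached_set(memc, key, strlen(key), copy, strlen(copy), 0, 0);
    return copy;
}
```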

46 Traditional Memcached Deployments Existing memcached deployment: low hit ratio at the Memcached server due to limited main memory size; frequent access to backend DB mmap() SSD into virtual memory: higher hit ratio due to SSD-mapped virtual memory; significant overhead at virtual memory management ADMS '14 46
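
The "mmap() SSD into virtual memory" approach above simply maps a large SSD-resident file into the server's address space and lets the kernel page data in and out on demand. A minimal sketch of the idea follows, with a placeholder file path and region size: every miss becomes a page fault serviced by kernel virtual memory management, which is exactly the overhead the hybrid design on the next slides targets.

```c
/* Sketch of mapping an SSD-backed file into virtual memory to extend the
 * cache. Path and size are placeholders; the kernel VM handles all paging. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t map_size = 1UL * 1024 * 1024 * 1024;    /* 1 GB region on the SSD */
    int fd = open("/ssd/memcached.ext", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, map_size) != 0) {
        perror("open/ftruncate");
        return 1;
    }

    /* The region looks like ordinary memory to the cache code, but touching
     * an unmapped page triggers a fault and an SSD read under kernel control. */
    char *region = mmap(NULL, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    region[0] = 'x';              /* faults in the first page from the SSD file */

    munmap(region, map_size);
    close(fd);
    return 0;
}
```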

47 SSD-Assisted Hybrid Memory for Memcached (Table: Get latency in us over IB Verbs, IPoIB, 10GigE and 1GigE for MySQL, Memcached in RAM, and Memcached with SSD virtual memory) SSD Basic Performance (us) (Fusion-io ioDrive): Random Read 68, Random Write 7 SSD-Assisted Hybrid Memory to expand memory size A user-level library that bypasses the kernel-based virtual memory management overhead ADMS '14 47
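
As a contrast to kernel-managed paging, a user-level library can keep the kernel out of the data path by reading SSD blocks directly into buffers it manages itself. The sketch below is only a generic illustration of that idea using O_DIRECT; it is not the Hmem library from the ICPP '12 paper, and the file path, block size and offset are placeholders.

```c
/* Generic illustration of user-level SSD access: an aligned O_DIRECT read
 * into an application-managed buffer, avoiding kernel page-cache/VM paging.
 * Not the Hmem library; path, block size and offset are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t blk = 4096;                  /* O_DIRECT requires aligned I/O */
    void *buf;
    if (posix_memalign(&buf, blk, blk) != 0)
        return 1;

    int fd = open("/ssd/object.store", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* The application decides what to keep in RAM and what to fetch from the
     * SSD; there are no page faults on the cached data path. */
    ssize_t n = pread(fd, buf, blk, 0);
    printf("read %zd bytes directly from the SSD\n", n);

    free(buf);
    close(fd);
    return 0;
}
```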

48 Evaluation on SSD-Assisted Hybrid Memory for Memcached - Latency Memcached Get Latency: 347 us vs. 93 us (3.6X improvement) Memcached Put Latency: 48 us vs. 16 us (3X improvement) Memcached with InfiniBand DDR transport (16Gbps) 3GB data in SSD, 256 MB read/write buffer Get / Put a random object X. Ouyang, N. S. Islam, R. Rajachandrasekar, J. Jose, M. Luo, H. Wang and D. K. Panda, SSD-Assisted Hybrid Memory to Accelerate Memcached over High Performance Networks, ICPP 12 ADMS '14 48

49 Evaluation on SSD-Assisted Hybrid Memory for Memcached - Throughput (Plot: transactions/sec in thousands vs. number of processes for Native Hmem (upper bound), Memcached+IB+Hmem and Memcached+IB+VMS; up to 5.3X improvement) Memcached with InfiniBand DDR transport (16Gbps) 3GB data in SSD, object size = 1KB, 256 MB read/write buffer 1, 2, 4 and 8 client processes perform random get() ADMS '14 49

50 Motivation - Detailed Analysis on HBase Put/Get (Stacked-bar plots: time in us broken into communication, communication preparation, server processing, server serialization, client processing and client serialization for IPoIB (DDR), 10GigE and OSU-IB (DDR), for HBase 1KB Put and Get) HBase 1KB Put Communication Time 8.9 us, a factor of 6X improvement over 10GigE for communication time HBase 1KB Get Communication Time 8.9 us, a factor of 6X improvement over 10GigE for communication time M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, Poster Paper, ISPASS 12 ADMS '14 50

51 HBase-RDMA Design Overview Applications HBase Java Socket Interface Java Native Interface (JNI) OSU-IB Design IB Verbs 1/10 GigE, IPoIB Networks RDMA Capable Networks (IB, 10GE/iWARP, RoCE..) JNI Layer bridges Java-based HBase with communication library written in native code Enables high performance RDMA communication, while supporting traditional socket interface ADMS '14 51
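
The JNI layer is what lets the Java-level HBase code hand buffers to a native communication library. The fragment below is a hypothetical illustration of such a bridge; the Java class, method and the send_over_verbs() helper are invented for this sketch and do not come from the OSU-IB sources. Passing a direct ByteBuffer lets the native side obtain the raw buffer address without copying.

```c
/* Hypothetical JNI bridge between a Java client and a native verbs library.
 * Class/method names and send_over_verbs() are invented for this sketch. */
#include <jni.h>
#include <stddef.h>

/* Assumed to be provided by the native communication library. */
int send_over_verbs(void *buf, size_t len);

/* Matches a Java declaration such as:
 *   package edu.example.rdma;
 *   class RdmaBridge { native int rdmaSend(java.nio.ByteBuffer buf, int len); }
 */
JNIEXPORT jint JNICALL
Java_edu_example_rdma_RdmaBridge_rdmaSend(JNIEnv *env, jobject self,
                                          jobject byte_buffer, jint len)
{
    void *addr = (*env)->GetDirectBufferAddress(env, byte_buffer);
    if (addr == NULL)
        return -1;                   /* not a direct buffer */
    return (jint)send_over_verbs(addr, (size_t)len);
}
```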

52 HBase-RDMA Design - Communication Flow Application (HBase Client) (HBase Server) Response Queue Handler Threads IB Reader HBase Client Responder Reader Threads Call Queue Helper Threads Network Selector JNI Interface Sockets Sockets JNI Interface OSU-IB Design OSU-IB Design 1/10G Network 1/10G Network IB Verbs IB Verbs RDMA design components Network Selector, IB Reader, Helper threads JNI Adaptive Interface ADMS '14 52

53 HBase Micro-benchmark (Single-Server-Multi-Client) Results (Plots: Get latency in us and throughput in ops/sec vs. number of clients for IPoIB (DDR), OSU-IB (DDR) and 10GigE) HBase Get latency 4 clients: 104.5 us HBase Get throughput 4 clients: 37.1 Kops/sec; 16 clients: 53.4 Kops/sec 27% improvement in throughput for 16 clients over 10GigE J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS 12 ADMS '14 53

54 HBase YCSB Read-Write Workload (Plots: read and write latency vs. number of clients for IPoIB (QDR), 10GigE and OSU-IB (QDR)) HBase Get latency (Yahoo! Cloud Service Benchmark) 64 clients: 2.0 ms; 128 clients: 3.5 ms 42% improvement over IPoIB for 128 clients HBase Put latency 64 clients: 1.9 ms; 128 clients: 3.5 ms 40% improvement over IPoIB for 128 clients ADMS '14 54

55 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 55

56 Acceleration Case Studies and In-Depth Performance Evaluation RDMA-based Designs and Performance Evaluation HDFS MapReduce Spark ADMS '14 56

57 Design Overview of HDFS with RDMA Applications Design Features Others HDFS Write RDMA-based HDFS write RDMA-based HDFS replication Java Socket Interface Java Native Interface (JNI) Parallel replication support OSU Design On-demand connection setup 1/10 GigE, IPoIB Network Verbs RDMA Capable Networks (IB, 10GE/iWARP, RoCE..) InfiniBand/RoCE support Enables high performance RDMA communication, while supporting traditional socket interface JNI Layer bridges Java-based HDFS with communication library written in native code ADMS '14 57
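
Since the RDMA-based design sits underneath the standard HDFS client interfaces, an existing application writes files exactly as before. The sketch below uses the stock libhdfs C API to write a small file; the path and replication factor are placeholders, and nothing in it is specific to the OSU design, which is the point of keeping the upper-level APIs unchanged.

```c
/* Writing a file through the stock libhdfs C API; client code like this is
 * unchanged whether sockets or verbs carry the bytes underneath.
 * Path and replication factor are placeholders. */
#include <hdfs.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    hdfsFS fs = hdfsConnect("default", 0);    /* use fs.defaultFS from the config */
    if (fs == NULL) {
        fprintf(stderr, "cannot connect to HDFS\n");
        return 1;
    }

    const char *path = "/tmp/rdma_hdfs_demo.txt";
    hdfsFile out = hdfsOpenFile(fs, path, O_WRONLY | O_CREAT, 0, 3 /* replicas */, 0);
    if (out == NULL) {
        fprintf(stderr, "cannot open %s for writing\n", path);
        return 1;
    }

    const char *msg = "hello over HDFS\n";
    hdfsWrite(fs, out, msg, strlen(msg));
    hdfsFlush(fs, out);
    hdfsCloseFile(fs, out);
    hdfsDisconnect(fs);
    return 0;
}
```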

58 Communication Times in HDFS (Plot: communication time in sec vs. file size from 2GB to 10GB for 10GigE, IPoIB (QDR) and OSU-IB (QDR); reduced by 30%) Cluster with HDD DataNodes 30% improvement in communication time over IPoIB (QDR) 56% improvement in communication time over 10GigE Similar improvements are obtained for SSD DataNodes N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012 ADMS '14 58

59 Evaluations using OHB HDFS Micro-benchmark (Plots: write latency in sec, reduced by 25%, and total write throughput in MBps with 2 clients, increased by 5%, vs. file size in GB for 10GigE, IPoIB (QDR) and OSU-IB (QDR)) Cluster with 4 HDD DataNodes, single disk per node 25% improvement in latency over IPoIB (QDR) for 1GB file size 5% improvement in throughput over IPoIB (QDR) for 1GB file size N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012 ADMS '14 59

60 Evaluations using TestDFSIO (Plots: total throughput in MBps vs. file size in GB for 10GigE, IPoIB (QDR) and OSU-IB (QDR); 8 HDD nodes with TestDFSIO with 8 maps, increased by 24%, and 4 SSD nodes with TestDFSIO with 4 maps, increased by 61%) Cluster with 8 HDD DataNodes, single disk per node 24% improvement over IPoIB (QDR) for 2GB file size Cluster with 4 SSD DataNodes, single SSD per node 61% improvement over IPoIB (QDR) for 2GB file size ADMS '14 60

61 Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede Aggregated Throughput (MBps) IPoIB (FDR) OSU-IB (FDR) Increased by 64% File Size (GB) Execution Time (s) IPoIB (FDR) Reduced by 37% OSU-IB (FDR) File Size (GB) Cluster with 64 DataNodes, single HDD per node 64% improvement in throughput over IPoIB (FDR) for 256GB file size 37% improvement in latency over IPoIB (FDR) for 256GB file size ADMS '14 61

62 Evaluations using YCSB (32 Region Servers: 100% Update) (Plots: average Put latency in us and throughput in Kops/sec vs. number of records, record size 1KB, for IPoIB (QDR) and OSU-IB (QDR)) HBase using TCP/IP, running over HDFS-IB HBase Put latency for 48K records 201 us for OSU Design; 272 us for IPoIB (32Gbps) HBase Put throughput for 48K records 4.42 Kops/sec for OSU Design; 3.63 Kops/sec for IPoIB (32Gbps) 26% improvement in average latency; 24% improvement in throughput ADMS '14 62

63 HDFS and HBase Integration over IB (OSU-IB) (Plots: latency in us and throughput in Kops/sec vs. number of records, record size 1KB, for IPoIB (QDR), OSU-IB (HDFS) - IPoIB (HBase) (QDR), and OSU-IB (HDFS) - OSU-IB (HBase) (QDR)) YCSB Evaluation with 4 RegionServers (100% update) HBase Put Latency and Throughput for 36K records - 37% improvement over IPoIB (32Gbps) - 18% improvement over OSU-IB HDFS only ADMS '14 63

64 Acceleration Case Studies and In-Depth Performance Evaluation RDMA-based Designs and Performance Evaluation HDFS MapReduce Spark ADMS '14 64

65 Design Overview of MapReduce with RDMA Java Socket Interface MapReduce Job Tracker 1/10 GigE, IPoIB Network Applications Task Tracker Map Reduce Java Native Interface (JNI) OSU Design Verbs RDMA Capable Networks (IB, 10GE/iWARP, RoCE..) Design Features RDMA-based shuffle Prefetching and caching map output Efficient Shuffle Algorithms In-memory merge On-demand Shuffle Adjustment Advanced overlapping map, shuffle, and merge shuffle, merge, and reduce On-demand connection setup InfiniBand/RoCE support Enables high performance RDMA communication, while supporting traditional socket interface JNI Layer bridges Java-based MapReduce with communication library written in native code ADMS '14 65

66 Advanced Overlapping among different phases Default Architecture A hybrid approach to achieve maximum possible overlapping in MapReduce across all phases compared to other approaches Efficient Shuffle Algorithms Dynamic and Efficient Switching On-demand Shuffle Adjustment Enhanced Overlapping Advanced Overlapping M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014. ADMS '14 66

67 Evaluations using OHB MapReduce Micro-benchmark (Plots: 8 slave nodes and 16 slave nodes, IPoIB (FDR) vs. OSU-IB (FDR)) Stand-alone MapReduce micro-benchmark (MR-AVG) 1 KB key/value pair size For 8 slave nodes, RDMA has up to 30% improvement over IPoIB (56Gbps) For 16 slave nodes, RDMA has up to 28% improvement over IPoIB (56Gbps) D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014). ADMS '14 67

68 Evaluations using Sort (HDD, SSD) (Plots: job execution time in sec vs. data size in GB for 10GigE, IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR), with 8 HDD DataNodes and 8 SSD DataNodes) With 8 HDD DataNodes for 40GB sort 43% improvement over IPoIB (QDR) 44% improvement over UDA-IB (QDR) With 8 SSD DataNodes for 40GB sort 52% improvement over IPoIB (QDR) 45% improvement over UDA-IB (QDR) ADMS '14 68

69 Evaluations with TeraSort (Plot: job execution time in sec vs. data size in GB for 10GigE, IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR)) 100 GB TeraSort with 8 DataNodes with 2 HDD per node 49% benefit compared to UDA-IB (QDR) 54% benefit compared to IPoIB (QDR) 56% benefit compared to 10GigE ADMS '14 69

70 Performance Evaluation on Larger Clusters (Plots: job execution time in sec for Sort on the OSU cluster with IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR) at data sizes 60 GB, 120 GB and 240 GB on cluster sizes 16, 32 and 64, and for TeraSort on TACC Stampede with IPoIB (FDR), UDA-IB (FDR) and OSU-IB (FDR) at data sizes 80 GB, 160 GB and 320 GB on cluster sizes 16, 32 and 64) For 240GB Sort in 64 nodes 40% improvement over IPoIB (QDR) with HDD used for HDFS For 320GB TeraSort in 64 nodes 38% improvement over IPoIB (FDR) with HDD used for HDFS ADMS '14 70

71 Evaluations using PUMA Workload (Plot: normalized execution time for 10GigE, IPoIB (QDR) and OSU-IB (QDR) on the AdjList (30GB), SelfJoin (80GB), SeqCount (30GB), WordCount (30GB) and InvertIndex (30GB) benchmarks) 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size ADMS '14 71

72 Evaluations using SWIM (Plot: execution time in sec vs. job number for IPoIB (QDR) and OSU-IB (QDR)) 50 small MapReduce jobs executed in a cluster size of 4 Maximum performance benefit 24% over IPoIB (QDR) Average performance benefit 13% over IPoIB (QDR) ADMS '14 72

73 Acceleration Case Studies and In-Depth Performance Evaluation RDMA-based Designs and Performance Evaluation HDFS MapReduce Spark ADMS '14 73

74 Design Overview of Spark with RDMA Design Features RDMA-based shuffle SEDA-based plugins Dynamic connection management and sharing Non-blocking and out-of-order data transfer Off-JVM-heap buffer management InfiniBand/RoCE support Enables high performance RDMA communication, while supporting traditional socket interface JNI Layer bridges Scala-based Spark with communication library written in native code X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014 ADMS '14 74

75 Preliminary Results of Spark-RDMA Design - GroupBy (Plots: GroupBy time in sec vs. data size in GB for 10GigE, IPoIB and RDMA; 4 HDD nodes with 32 cores and 8 HDD nodes with 64 cores) Cluster with 4 HDD Nodes, single disk per node, 32 concurrent tasks 18% improvement over IPoIB (QDR) for 10GB data size Cluster with 8 HDD Nodes, single disk per node, 64 concurrent tasks 20% improvement over IPoIB (QDR) for 20GB data size ADMS '14 75

76 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 76

77 Optimize Apache Hadoop over Parallel File Systems Compute Nodes TaskTracker Map Reduce Lustre Client MetaData Servers Object Storage Servers Lustre Setup HPC Cluster Deployment Hybrid topological solution of Beowulf architecture with separate I/O nodes Lean compute nodes with light OS; more memory space; small local storage Sub-cluster of dedicated I/O nodes with parallel file systems, such as Lustre MapReduce over Lustre Local disk is used as the intermediate data directory Lustre is used as the intermediate data directory ADMS '14 77

78 Case Study - Performance Improvement of MapReduce over Lustre on TACC-Stampede Local disk is used as the intermediate data directory (Plots: job execution time in sec for IPoIB (FDR) and OSU-IB (FDR) across data sizes and cluster sizes 4 to 128) For 5GB Sort in 64 nodes 44% improvement over IPoIB (FDR) For 64GB Sort in 128 nodes 48% improvement over IPoIB (FDR) M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014. ADMS '14 78

79 Case Study - Performance Improvement of MapReduce over Lustre on TACC-Stampede Lustre is used as the intermediate data directory (Plots: job execution time in sec for IPoIB (FDR) and OSU-IB (FDR) at 8 GB, 16 GB and 32 GB on cluster sizes 8, 16 and 32) For 16GB Sort in 16 nodes 35% improvement over IPoIB (FDR) For 32GB Sort in 32 nodes 33% improvement over IPoIB (FDR) Can more optimizations be achieved by leveraging more features of Lustre? ADMS '14 79

80 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 80

81 Ongoing and Future Activities for Hadoop Accelerations Some other existing Hadoop solutions Hadoop-A (UDA) RDMA-based implementation of Hadoop MapReduce shuffle engine Uses plug-in based solution : UDA (Unstructured Data Accelerator) Cloudera Distributions of Hadoop CDH: Open Source distribution Cloudera Standard: CDH with automated cluster management Cloudera Enterprise: Cloudera Standard + enhanced management capabilities and support Ethernet/IPoIB Hortonworks Data Platform HDP: Open Source distribution Developed as projects through the Apache Software Foundation (ASF), NO proprietary extensions or add-ons Functional areas: Data Management, Data Access, Data Governance and Integration, Security, and Operations. Ethernet/IPoIB ADMS '14 81

82 Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges Applications Benchmarks Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Upper level Changes? Programming Models (Sockets) Other RDMA Protocols? Point-to-Point Communication Communication and I/O Library Threaded Models and Synchronization Virtualization I/O and File Systems QoS Fault-Tolerance Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD and SSD) ADMS '14 82

83 More Challenges Multi-threading and Synchronization Multi-threaded model exploration Fine-grained synchronization and lock-free design Unified helper threads for different components Multi-endpoint design to support multi-threading communications QoS and Virtualization Network virtualization and locality-aware communication for Big Data middleware Hardware-level virtualization support for End-to-End QoS I/O scheduling and storage virtualization Live migration ADMS '14 83

84 More Challenges (Cont'd) Support of Accelerators Efficient designs for Big Data middleware to take advantage of NVIDIA GPGPUs and Intel MICs Offload computation-intensive workload to accelerators Explore maximum overlapping between communication and offloaded computation Fault Tolerance Enhancements Exploration of light-weight fault tolerance mechanisms for Big Data Support of Parallel File Systems Optimize Big Data middleware over parallel file systems (e.g. Lustre) on modern HPC clusters Big Data Benchmarking ADMS '14 84

85 Are the Current Benchmarks Sufficient for Big Data Management and Processing? The current benchmarks provide some performance behavior However, do not provide any information to the designer/developer on: What is happening at the lower-layer? Where the benefits are coming from? Which design is leading to benefits or bottlenecks? Which component in the design needs to be changed and what will be its impact? Can performance gain/loss at the lower-layer be correlated to the performance gain/loss observed at the upper layer? ADMS '14 85

86 Challenges in Benchmarking of RDMA-based Designs Applications Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Programming Models (Sockets) Benchmarks Other RDMA Protocols? Current Benchmarks Correlation? Communication and I/O Library Point-to-Point Communication Threaded Models and Synchronization Virtualization No Benchmarks I/O and File Systems QoS Fault-Tolerance Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD and SSD) ADMS '14 86

87 OSU MPI Micro-Benchmarks (OMB) Suite A comprehensive suite of benchmarks to Compare performance of different MPI libraries on various networks and systems Validate low-level functionalities Provide insights into the underlying MPI-level designs Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth Extended later to MPI-2 one-sided Collectives GPU-aware data movement OpenSHMEM (point-to-point and collectives) UPC Has become an industry standard Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems Available from the OSU site and in an integrated manner with the MVAPICH2 stack ADMS '14 87
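
The OMB latency test follows the classic two-rank ping-pong pattern. A stripped-down sketch of that pattern is shown below; it is not the osu_latency source, and the message size and iteration count are arbitrary.

```c
/* Stripped-down ping-pong latency measurement in the style of osu_latency.
 * Not the OMB source; message size and iteration count are arbitrary. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int size = 4, iters = 10000;
    char buf[4] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)                  /* one-way latency = round trip / 2 */
        printf("%d-byte latency: %.2f us\n", size, (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```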

88 Iterative Process Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications Applications Benchmarks Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Applications-Level Benchmarks Programming Models (Sockets) Other RDMA Protocols? Communication and I/O Library Point-to-Point Communication I/O and File Systems Threaded Models and Synchronization QoS Virtualization Fault-Tolerance Micro-Benchmarks Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD and SSD) ADMS '14 88

89 Future Plans of OSU High Performance Big Data Project Upcoming Releases of RDMA-enhanced Packages will support Hadoop 2.x MapReduce & RPC Spark HBase Upcoming Releases of OSU HiBD Micro-Benchmarks (OHB) will support HDFS MapReduce RPC Exploration of other components (Threading models, QoS, Virtualization, Accelerators, etc.) Advanced designs with upper-level changes and optimizations ADMS '14 89

90 Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Data Management and Processing The High-Performance Big Data (HiBD) Project RDMA-based design for Memcached and HBase RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Ongoing and Future Activities Conclusion and Q&A ADMS '14 90

91 Concluding Remarks Presented an overview of data management and processing middleware in different tiers Provided an overview of modern networking technologies Discussed challenges in accelerating Big Data middleware Presented initial designs to take advantage of InfiniBand/RDMA for Memcached, HBase, Hadoop and Spark Results are promising Many other open issues need to be solved Will enable Big Data management and processing community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner ADMS '14 91

92 Thank You! Network-Based Computing Laboratory The High-Performance Big Data Project ADMS '14 92


More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning 5th ANNUAL WORKSHOP 209 Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning Hari Subramoni Dhabaleswar K. (DK) Panda The Ohio State University The Ohio State University E-mail:

More information

Enhancing Checkpoint Performance with Staging IO & SSD

Enhancing Checkpoint Performance with Staging IO & SSD Enhancing Checkpoint Performance with Staging IO & SSD Xiangyong Ouyang Sonya Marcarelli Dhabaleswar K. Panda Department of Computer Science & Engineering The Ohio State University Outline Motivation and

More information

The Future of Supercomputer Software Libraries

The Future of Supercomputer Software Libraries The Future of Supercomputer Software Libraries Talk at HPC Advisory Council Israel Supercomputing Conference by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda Network-Based

More information

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda Network Based Computing Laboratory

More information

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

High-Performance Broadcast for Streaming and Deep Learning

High-Performance Broadcast for Streaming and Deep Learning High-Performance Broadcast for Streaming and Deep Learning Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth - SC17 2 Outline Introduction

More information

Designing Shared Address Space MPI libraries in the Many-core Era

Designing Shared Address Space MPI libraries in the Many-core Era Designing Shared Address Space MPI libraries in the Many-core Era Jahanzeb Hashmi hashmi.29@osu.edu (NBCL) The Ohio State University Outline Introduction and Motivation Background Shared-memory Communication

More information

Exploiting HPC Technologies for Accelerating Big Data Processing and Storage

Exploiting HPC Technologies for Accelerating Big Data Processing and Storage Exploiting HPC Technologies for Accelerating Big Data Processing and Storage Talk in the 5194 class by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi

More information

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory

More information

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth by D.K. Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline Overview of MVAPICH2-GPU

More information

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda Speaker: Sourav Chakraborty

More information

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters Exploiting InfiniBand and Direct Technology for High Performance Collectives on Clusters Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth

More information

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Sayantan Sur, Abhinav Vishnu, Hyun-Wook Jin, Wei Huang and D. K. Panda {surs, vishnu, jinhy, huanwei, panda}@cse.ohio-state.edu

More information

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand Matthew Koop 1,2 Terry Jones 2 D. K. Panda 1 {koop, panda}@cse.ohio-state.edu trj@llnl.gov 1 Network-Based Computing Lab, The

More information

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Talk at HPCAC Stanford Conference (Feb 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

RDMA for Memcached User Guide

RDMA for Memcached User Guide 0.9.5 User Guide HIGH-PERFORMANCE BIG DATA TEAM http://hibd.cse.ohio-state.edu NETWORK-BASED COMPUTING LABORATORY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING THE OHIO STATE UNIVERSITY Copyright (c)

More information

InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment. TOP500 Supercomputers, June 2014

InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment. TOP500 Supercomputers, June 2014 InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment TOP500 Supercomputers, June 2014 TOP500 Performance Trends 38% CAGR 78% CAGR Explosive high-performance

More information

Advanced RDMA-based Admission Control for Modern Data-Centers

Advanced RDMA-based Admission Control for Modern Data-Centers Advanced RDMA-based Admission Control for Modern Data-Centers Ping Lai Sundeep Narravula Karthikeyan Vaidyanathan Dhabaleswar. K. Panda Computer Science & Engineering Department Ohio State University Outline

More information

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING Meeting Today s Datacenter Challenges Produced by Tabor Custom Publishing in conjunction with: 1 Introduction In this era of Big Data, today s HPC systems are faced with unprecedented growth in the complexity

More information

Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage*

Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage* 216 IEEE International Conference on Big Data (Big Data) Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage* Nusrat Sharmin Islam, Md. Wasi-ur-Rahman, Xiaoyi

More information

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters High Performance MPI Support in MVAPICH2 for InfiniBand Clusters A Talk at NVIDIA Booth (SC 11) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational

More information

Advanced Topics in InfiniBand and High-speed Ethernet for Designing HEC Systems

Advanced Topics in InfiniBand and High-speed Ethernet for Designing HEC Systems Advanced Topics in InfiniBand and High-speed Ethernet for Designing HEC Systems A Tutorial at SC 13 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet

Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet WHITE PAPER Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet Contents Background... 2 The MapR Distribution... 2 Mellanox Ethernet Solution... 3 Test

More information

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Sreeram Potluri* Hao Wang* Devendar Bureddy* Ashish Kumar Singh* Carlos Rosales + Dhabaleswar K. Panda* *Network-Based

More information

High Performance Migration Framework for MPI Applications on HPC Cloud

High Performance Migration Framework for MPI Applications on HPC Cloud High Performance Migration Framework for MPI Applications on HPC Cloud Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda {zhanjie, luxi, panda}@cse.ohio-state.edu Computer Science & Engineering Department,

More information

Solutions for Scalable HPC

Solutions for Scalable HPC Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End

More information

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Qi Gao, Weikuan Yu, Wei Huang, Dhabaleswar K. Panda Network-Based Computing Laboratory Department of Computer Science & Engineering

More information

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,

More information

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning

Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning Keynote Talk at Swiss Conference (April 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT Krishna Kandalla (1), Hari Subramoni (1), Karen Tomko (2), Dmitry Pekurovsky

More information

High Performance Design for HDFS with Byte-Addressability of NVM and RDMA

High Performance Design for HDFS with Byte-Addressability of NVM and RDMA High Performance Design for HDFS with Byte-Addressability of and RDMA Nusrat Sharmin Islam, Md Wasi-ur-Rahman, Xiaoyi Lu, and Dhabaleswar K (DK) Panda Department of Computer Science and Engineering The

More information

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics 1 Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics Mingzhe Li Sreeram Potluri Khaled Hamidouche Jithin Jose Dhabaleswar K. Panda Network-Based Computing Laboratory Department

More information

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to

More information

Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems

Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

More information

2008 International ANSYS Conference

2008 International ANSYS Conference 2008 International ANSYS Conference Maximizing Productivity With InfiniBand-Based Clusters Gilad Shainer Director of Technical Marketing Mellanox Technologies 2008 ANSYS, Inc. All rights reserved. 1 ANSYS,

More information

Communication Frameworks for HPC and Big Data

Communication Frameworks for HPC and Big Data Communication Frameworks for HPC and Big Data Talk at HPC Advisory Council Spain Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Talk at Bench 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi

More information

Memory Management Strategies for Data Serving with RDMA

Memory Management Strategies for Data Serving with RDMA Memory Management Strategies for Data Serving with RDMA Dennis Dalessandro and Pete Wyckoff (presenting) Ohio Supercomputer Center {dennis,pw}@osc.edu HotI'07 23 August 2007 Motivation Increasing demands

More information

NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications

NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications Outline RDMA Motivating trends iwarp NFS over RDMA Overview Chelsio T5 support Performance results 2 Adoption Rate of 40GbE Source: Crehan

More information

High-Performance Key-Value Store on OpenSHMEM

High-Performance Key-Value Store on OpenSHMEM High-Performance Key-Value Store on OpenSHMEM Huansong Fu*, Manjunath Gorentla Venkata, Ahana Roy Choudhury*, Neena Imam, Weikuan Yu* *Florida State University Oak Ridge National Laboratory Outline Background

More information

Accelerating HPL on Heterogeneous GPU Clusters

Accelerating HPL on Heterogeneous GPU Clusters Accelerating HPL on Heterogeneous GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline

More information

Mellanox Technologies Maximize Cluster Performance and Productivity. Gilad Shainer, October, 2007

Mellanox Technologies Maximize Cluster Performance and Productivity. Gilad Shainer, October, 2007 Mellanox Technologies Maximize Cluster Performance and Productivity Gilad Shainer, shainer@mellanox.com October, 27 Mellanox Technologies Hardware OEMs Servers And Blades Applications End-Users Enterprise

More information

Infiniband and RDMA Technology. Doug Ledford

Infiniband and RDMA Technology. Doug Ledford Infiniband and RDMA Technology Doug Ledford Top 500 Supercomputers Nov 2005 #5 Sandia National Labs, 4500 machines, 9000 CPUs, 38TFlops, 1 big headache Performance great...but... Adding new machines problematic

More information

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters Jie Zhang Dr. Dhabaleswar K. Panda (Advisor) Department of Computer Science & Engineering The

More information

Memcached Design on High Performance RDMA Capable Interconnects

Memcached Design on High Performance RDMA Capable Interconnects 211 International Conference on Parallel Processing Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi-ur-Rahman,

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K.

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K. Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K. Panda Department of Computer Science and Engineering The Ohio

More information

DB2 purescale: High Performance with High-Speed Fabrics. Author: Steve Rees Date: April 5, 2011

DB2 purescale: High Performance with High-Speed Fabrics. Author: Steve Rees Date: April 5, 2011 DB2 purescale: High Performance with High-Speed Fabrics Author: Steve Rees Date: April 5, 2011 www.openfabrics.org IBM 2011 Copyright 1 Agenda Quick DB2 purescale recap DB2 purescale comes to Linux DB2

More information

High Performance MPI on IBM 12x InfiniBand Architecture

High Performance MPI on IBM 12x InfiniBand Architecture High Performance MPI on IBM 12x InfiniBand Architecture Abhinav Vishnu, Brad Benton 1 and Dhabaleswar K. Panda {vishnu, panda} @ cse.ohio-state.edu {brad.benton}@us.ibm.com 1 1 Presentation Road-Map Introduction

More information

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Ammar Ahmad Awan, Khaled Hamidouche, Akshay Venkatesh, and Dhabaleswar K. Panda Network-Based Computing Laboratory Department

More information

The Exascale Architecture

The Exascale Architecture The Exascale Architecture Richard Graham HPC Advisory Council China 2013 Overview Programming-model challenges for Exascale Challenges for scaling MPI to Exascale InfiniBand enhancements Dynamically Connected

More information