Acceleration for Big Data, Hadoop and Memcached


1 Acceleration for Big Data, Hadoop and Memcached A Presentation at HPC Advisory Council Workshop, Lugano 2012 by Dhabaleswar K. (DK) Panda The Ohio State University panda@cse.ohio-state.edu

2 Recap of the Last Two Days' Presentations MPI is a dominant programming model for HPC systems. Introduced some of the MPI features and their usage. Introduced the MVAPICH2 stack. Illustrated many performance optimizations and tuning techniques for MVAPICH2. Provided an overview of MPI-3 features. Introduced challenges in designing MPI for Exascale systems. Presented approaches being taken by MVAPICH2 for Exascale systems.

3 High-Performance Networks in the Top500 The percentage share of InfiniBand is steadily increasing.

4 Use of High-Performance Networks for Scientific Computing The OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems: Message Passing Interface (MPI) and parallel file systems. Almost 11.5 years of research and development since InfiniBand was introduced in October 2000. Other programming models are emerging to take advantage of high-performance networks: UPC, SHMEM.

5 One-way Latency: MPI over IB Charts of small-message and large-message latency (us) vs. message size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR and MVAPICH-ConnectX3-PCIe3-FDR. DDR and QDR results: quad-core (Westmere) Intel platform with PCIe Gen2 and an IB switch; FDR results: octa-core (Sandy Bridge) Intel platform with PCIe Gen3, without an IB switch.

6 Bandwidth: MPI over IB Charts of unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size (bytes) for the same configurations (MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR). DDR and QDR results: quad-core (Westmere) Intel platform with PCIe Gen2 and an IB switch; FDR results: octa-core (Sandy Bridge) Intel platform with PCIe Gen3, without an IB switch.

7 Large-scale InfiniBand Installations 209 IB clusters (41.8%) in the November 2011 Top500 list. Installations in the Top 30 (13 systems): 120,640 cores (Nebulae) in China (4th); 73,278 cores (Tsubame-2.0) in Japan (5th); 111,104 cores (Pleiades) at NASA Ames (7th); 138,368 cores (Tera-100) in France (9th); 122,400 cores (Roadrunner) at LANL (10th); 137,200 cores (Sunway BlueLight) in China (14th); 46,208 cores (Zin) at LLNL (15th); 33,072 cores (Lomonosov) in Russia (18th); 29,440 cores (Mole-8.5) in China (21st); 42,440 cores (Red Sky) at Sandia (24th); 62,976 cores (Ranger) at TACC (25th); 20,480 cores (Bull Benchmarks) in France (27th); 20,480 cores (Helios) in Japan (28th). More are getting installed!

8 Enterprise/Commercial Computing Focuses on big data and data analytics Multiple environments and middleware are gaining momentum Hadoop (HDFS, HBase and MapReduce) Memcached 8

9 Can High-Performance Interconnects Benefit Enterprise Computing? Most of the current enterprise systems use 1GE Concerns for performance and scalability Usage of High-Performance Networks is beginning to draw interest Oracle, IBM, Google are working along these directions What are the challenges? Where do the bottlenecks lie? Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)? 9

10 Presentation Outline Overview of Hadoop, Memcached and HBase Challenges in Accelerating Enterprise Middleware Designs and Case Studies Memcached HBase HDFS Conclusion and Q&A 1

11 Memcached Architecture Diagram: web frontend servers (Memcached clients) connect over the Internet and high-performance networks to Memcached servers (main memory, CPUs, SSD, HDD), which sit in front of the database servers. Integral part of the Web 2.0 architecture. Distributed caching layer: allows spare memory from multiple nodes to be aggregated; general purpose. Typically used to cache database queries and results of API calls. Scalable model, but typical usage is very network intensive.
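
Purely to illustrate the cache-aside usage pattern described above (the experiments later in this talk use the C libmemcached client), here is a minimal Java sketch using the third-party spymemcached client; the server address, key name and loadFromDatabase helper are hypothetical.

```java
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

public class CacheAsideExample {
    public static void main(String[] args) throws Exception {
        // Connect to one Memcached server; host and port are placeholders.
        MemcachedClient cache =
                new MemcachedClient(new InetSocketAddress("memcached-host", 11211));

        String key = "user:42:profile";
        Object value = cache.get(key);                  // 1) try the cache first
        if (value == null) {
            value = loadFromDatabase(key);              // 2) miss: go to the database
            cache.set(key, 300, value);                 // 3) populate the cache (300 s TTL)
        }
        System.out.println(value);
        cache.shutdown();
    }

    // Stand-in for the expensive database query the cache is meant to avoid.
    private static String loadFromDatabase(String key) {
        return "row-for-" + key;
    }
}
```

Every get/set in this pattern is a small network round trip, which is why the typical usage is so network intensive.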

12 Hadoop Architecture Underlying Hadoop Distributed File System (HDFS); fault tolerance by replicating data blocks. NameNode: stores metadata about data blocks. DataNodes: store the blocks and host the MapReduce computation. JobTracker: tracks jobs and detects failures. The model scales, but there is a high amount of communication during the intermediate phases.
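
As a concrete illustration of the client-side view of this architecture, the following minimal Java sketch writes and reads a file through the standard org.apache.hadoop.fs API; the path is hypothetical and the cluster address is assumed to come from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // cluster address comes from core-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hdfs-demo.txt");     // hypothetical path

        // Write: the client obtains block locations from the NameNode and streams the
        // data to DataNodes, which replicate each block for fault tolerance.
        FSDataOutputStream out = fs.create(path, true);
        out.writeUTF("hello HDFS");
        out.close();

        // Read: blocks are fetched directly from the DataNodes that hold them.
        FSDataInputStream in = fs.open(path);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
    }
}
```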

13 Network-Level Interaction Between Clients and Data Nodes in HDFS Diagram: HDFS clients communicating over high-performance networks with HDFS DataNodes backed by HDD/SSD storage.

14 Overview of HBase Architecture An open-source database project based on the Hadoop framework for hosting very large tables. Major components: HBaseMaster, HRegionServer and HBaseClient. HBase and HDFS are deployed in the same cluster to get better data locality.
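
To make the roles of these components concrete, here is a minimal Java sketch of an HBase client issuing a Put and a Get through the client API of that era (HTable); the table name "usertable" and column family "cf" are assumptions, not part of the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        HTable table = new HTable(conf, "usertable");        // hypothetical table with family "cf"

        // Put: the HBaseClient locates the owning HRegionServer (via the catalog tables
        // coordinated by the HBaseMaster) and sends the mutation to it.
        Put put = new Put(Bytes.toBytes("row-0001"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("field0"), Bytes.toBytes("value-0001"));
        table.put(put);

        // Get: answered from the RegionServer's memstore/block cache, or from the
        // HDFS DataNodes that store the region's files.
        Result result = table.get(new Get(Bytes.toBytes("row-0001")));
        System.out.println(
                Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("field0"))));

        table.close();
    }
}
```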

15 Network-Level Interaction Between HBase Clients, Region Servers and Data Nodes Diagram: HBase clients communicate over high-performance networks with HRegionServers, which in turn communicate over high-performance networks with the DataNodes (HDD/SSD).

16 Presentation Outline Overview of Hadoop, Memcached and HBase Challenges in Accelerating Enterprise Middleware Designs and Case Studies Memcached HBase HDFS Conclusion and Q&A 16

17 Designing Communication and I/O Libraries for Enterprise Systems: Challenges Layered stack: Applications; Datacenter Middleware (HDFS, HBase, MapReduce, Memcached); Programming Models (Sockets); Communication and I/O Library (point-to-point communication, threading models and synchronization, I/O and file systems, QoS, fault tolerance); Networking Technologies (InfiniBand, 1/10/40 GigE, RNICs & intelligent NICs); Commodity Computing System Architectures (single, dual, quad, ..); Multi-/Many-core Architectures and Accelerators; Storage Technologies (HDD or SSD).

18 Common Protocols using OpenFabrics Diagram of the available protocol stacks from application to network: sockets over the kernel-space TCP/IP stack on Ethernet (1/10/40 GigE); sockets over IPoIB on InfiniBand; sockets over TCP/IP hardware offload (10/40 GigE TOE); SDP (user-space RDMA over InfiniBand); iWARP (user-space RDMA over iWARP-capable Ethernet adapters); RoCE (user-space RDMA over RoCE adapters and Ethernet switches); and native IB verbs over InfiniBand adapters and switches.

19 Can New Data Analysis and Management Systems be Designed with High-Performance Networks and Protocols? Current design: Application → Sockets → 1/10 GigE network. Enhanced designs: Application → accelerated sockets (verbs / hardware offload) → 10 GigE or InfiniBand. Our approach: Application → OSU design → verbs interface → 10 GigE or InfiniBand. Sockets were not designed for high performance: stream semantics often mismatch the upper layers (Memcached, HBase, Hadoop), and zero-copy is not available for non-blocking sockets.

20 Interplay between Storage and Interconnect/Protocols Most current-generation enterprise systems use traditional hard disks. Since hard disks are slower, high-performance communication protocols may not have much impact. SSDs and other storage technologies are emerging: does this change the landscape?

21 Presentation Outline Overview of Hadoop, Memcached and HBase Challenges in Accelerating Enterprise Middleware Designs and Case Studies Memcached HBase HDFS Conclusion and Q&A 21

22 Memcached Design Using Verbs Diagram: sockets clients and RDMA clients connect to a server with a master thread, sockets worker threads and verbs worker threads, all sharing the Memcached data structures (memory slabs, items). The server and client perform a negotiation protocol, and the master thread assigns each client to an appropriate worker thread. Once a client is assigned to a verbs worker thread it communicates directly with, and remains bound to, that thread; each verbs worker thread can support multiple clients. All other Memcached data structures are shared among RDMA and sockets worker threads. Memcached applications need not be modified; the verbs interface is used if available. The Memcached server can serve both sockets and verbs clients simultaneously.
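
The actual OSU design lives inside the Memcached server and is written in C; the following is only a schematic Java sketch, using hypothetical ClientEndpoint/WorkerThread/MasterDispatcher types, of the dispatch policy described above: the master thread hands each negotiated client to a verbs or sockets worker, and the client then stays bound to that worker.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Transport-specific details are hidden behind this interface in the sketch.
interface ClientEndpoint {
    boolean prefersRdma();       // outcome of the initial client/server negotiation
    void serveOneRequest();      // read one request, consult the shared slabs/items, reply
}

// A worker serves only the clients that were bound to it by the master.
class WorkerThread extends Thread {
    private final List<ClientEndpoint> clients = new CopyOnWriteArrayList<ClientEndpoint>();

    void bind(ClientEndpoint c) { clients.add(c); }      // client stays with this worker afterwards

    @Override public void run() {
        while (!isInterrupted()) {
            for (ClientEndpoint c : clients) {
                c.serveOneRequest();
            }
        }
    }
}

// Master thread: after negotiation, hand each client to a verbs or sockets worker.
class MasterDispatcher {
    private final WorkerThread[] verbsWorkers;
    private final WorkerThread[] socketWorkers;
    private int v = 0, s = 0;

    MasterDispatcher(int numVerbsWorkers, int numSocketWorkers) {
        verbsWorkers = startWorkers(numVerbsWorkers);
        socketWorkers = startWorkers(numSocketWorkers);
    }

    private static WorkerThread[] startWorkers(int n) {
        WorkerThread[] workers = new WorkerThread[n];
        for (int i = 0; i < n; i++) {
            workers[i] = new WorkerThread();
            workers[i].start();
        }
        return workers;
    }

    void accept(ClientEndpoint client) {
        if (client.prefersRdma()) {
            verbsWorkers[v++ % verbsWorkers.length].bind(client);   // round-robin over verbs workers
        } else {
            socketWorkers[s++ % socketWorkers.length].bind(client); // round-robin over sockets workers
        }
    }
}
```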

23 Experimental Setup Hardware: Intel Clovertown: each node has 8 processor cores on two Intel Xeon 2.33 GHz quad-core CPUs, 6 GB main memory, 250 GB hard disk; networks: 1GigE, IPoIB, 10GigE TOE and IB (DDR). Intel Westmere: each node has 8 processor cores on two Intel Xeon 2.67 GHz quad-core CPUs, 12 GB main memory, 160 GB hard disk; networks: 1GigE, IPoIB and IB (QDR). Software: Memcached server; Memcached client: libmemcached 0.52. In all experiments the data is contained in memory (no disk access involved).

24 Memcached Get Latency (Small Message) Charts of latency (us) vs. message size, comparing SDP, IPoIB, 1GigE, 10GigE, OSU-RC-IB and OSU-UD-IB on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR). Memcached Get latency: 4 bytes (RC/UD): DDR 6.82/7.55 us, QDR 4.28/4.86 us; 2K bytes (RC/UD): DDR 12.31/12.78 us, QDR 8.19/8.46 us. Almost a factor of four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster.

25 Memcached Get Latency (Large Message) Charts of latency (us) vs. message size (2K to 512K bytes), comparing SDP, IPoIB, 1GigE, 10GigE, OSU-RC-IB and OSU-UD-IB on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR). Memcached Get latency: 8K bytes (RC/UD): DDR 18.9/19.1 us, QDR 11.8/12.2 us; 512K bytes (RC/UD): DDR 369/403 us, QDR 173/203 us. Almost a factor of two improvement over 10GigE (TOE) for 512K bytes on the DDR cluster.

26 Memcached Get TPS (4 byte) Charts of thousands of transactions per second (TPS) vs. number of clients, comparing SDP, IPoIB, 1GigE, OSU-RC-IB and OSU-UD-IB. Memcached Get transactions per second for 4 bytes: on IB QDR, 1.4M/s (RC) and 1.3M/s (UD) for 8 clients. Significant improvement with native IB QDR compared to SDP and IPoIB.

27 Memcached - Memory Scalability Chart of memory footprint (MB) vs. number of clients (up to 4K), comparing SDP, IPoIB, 1GigE, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB. The UD design has a steady memory footprint, while the RC design's memory footprint increases with the number of clients (measured up to 4K clients).

28 Application Level Evaluation: Olio Benchmark Charts of execution time (ms) vs. number of clients, comparing SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB. Olio Benchmark: RC 1.6 sec, UD 1.9 sec, Hybrid 1.7 sec for 1024 clients; 4X better than IPoIB for 8 clients. The hybrid design achieves performance comparable to the pure RC design.

29 Application Level Evaluation: Real Application Workloads Charts of execution time (ms) vs. number of clients, comparing SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB. Real application workload: RC 302 ms, UD 318 ms, Hybrid 314 ms for 1024 clients; 12X better than IPoIB for 8 clients. The hybrid design achieves performance comparable to the pure RC design. J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP 11. J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, CCGrid 12.

30 Presentation Outline Overview of Hadoop, Memcached and HBase Challenges in Accelerating Enterprise Middleware Designs and Case Studies Memcached HBase HDFS Conclusion and Q&A 3

31 HBase Design Using Verbs Current design: HBase → Sockets → 1/10 GigE network. OSU design: HBase → JNI interface → OSU module → InfiniBand (verbs).
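
The slide only names a JNI interface between HBase and the OSU verbs module; as a rough sketch of that structure (not the actual OSU code), the Java side of such a bridge could look like the class below. The library name osu_rdma_module and the native method signatures are invented for illustration; the native side, implemented in C over InfiniBand verbs, is not shown.

```java
// Hypothetical Java-side JNI bridge between a Java middleware and a native verbs module.
public class RdmaChannel {
    static {
        System.loadLibrary("osu_rdma_module");   // assumed name of the native verbs module
    }

    // Implemented in C via JNI on top of ibverbs; signatures are illustrative only.
    private native long nativeConnect(String host, int port);
    private native int  nativeSend(long handle, byte[] buffer, int length);
    private native int  nativeRecv(long handle, byte[] buffer, int maxLength);
    private native void nativeClose(long handle);

    private long handle;

    public void connect(String host, int port) { handle = nativeConnect(host, port); }
    public int  send(byte[] buffer, int length) { return nativeSend(handle, buffer, length); }
    public int  recv(byte[] buffer)             { return nativeRecv(handle, buffer, buffer.length); }
    public void close()                         { nativeClose(handle); handle = 0; }
}
```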

32 Experimental Setup Hardware: Intel Clovertown: each node has 8 processor cores on two Intel Xeon 2.33 GHz quad-core CPUs, 6 GB main memory, 250 GB hard disk; networks: 1GigE, IPoIB, 10GigE TOE and IB (DDR). Intel Westmere: each node has 8 processor cores on two Intel Xeon 2.67 GHz quad-core CPUs, 12 GB main memory, 160 GB hard disk; networks: 1GigE, IPoIB and IB (QDR). Three nodes used: Node1 (NameNode & HBase Master), Node2 (DataNode & HBase RegionServer), Node3 (client). Software: Hadoop 0.20.0, HBase 0.90.3 and Sun Java SDK 1.7.0. In all experiments the memtable is contained in memory (no disk access involved).

33 Details on Experiments Key/value size: key size 20 bytes; value size 1KB/4KB. Get operation: one key/value pair is inserted so that it stays in memory; the Get operation is repeated 80,000 times, with the first 40,000 iterations skipped as warm-up. Put operation: Memstore_Flush_Size is set to 256 MB, so no memory flush operation is involved; the Put operation is repeated 40,000 times, with the first 10,000 iterations skipped as warm-up.
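
As a hypothetical rendering of this measurement methodology, the sketch below times repeated HBase Get operations on a single pre-inserted key and discards the warm-up iterations; the table name and the iteration counts (which are partially garbled in the transcription) are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class GetLatencyBench {
    public static void main(String[] args) throws Exception {
        final int iterations = 80000;    // total Get operations (assumed count)
        final int warmup = 40000;        // discarded warm-up iterations (assumed count)

        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");         // single pre-inserted key/value pair
        Get get = new Get(Bytes.toBytes("row-0001"));

        long totalNanos = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            table.get(get);                                    // served from memory, no disk access
            long elapsed = System.nanoTime() - start;
            if (i >= warmup) {
                totalNanos += elapsed;
            }
        }
        System.out.printf("average Get latency: %.2f us%n",
                totalNanos / 1000.0 / (iterations - warmup));
        table.close();
    }
}
```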

34 Get Operation (IB: DDR) Charts of latency (us) and throughput (operations/sec) for 1K and 4K message sizes, comparing 1GigE, 10GigE, IPoIB and the OSU design. HBase Get operation: 1K bytes: 65 us (15K TPS); 4K bytes: 11K TPS. Almost a factor of two improvement over 10GigE (TOE).

35 Get Operation (IB: QDR) Charts of latency (us) and throughput (operations/sec) for 1K and 4K message sizes, comparing 1GigE, IPoIB and the OSU design. HBase Get operation: 1K bytes: 47 us (22K TPS); 4K bytes: 16K TPS. Almost a factor of four improvement over IPoIB for 1KB.

36 Put Operation (IB: DDR) Charts of latency (us) and throughput (operations/sec) for 1K and 4K message sizes, comparing 1GigE, IPoIB, 10GigE and the OSU design. HBase Put operation: 1K bytes: 114 us (8.7K TPS); 4K bytes: 5.6K TPS. 34% improvement over 10GigE (TOE) for 1KB.

37 Put Operation (IB: QDR) Charts of latency (us) and throughput (operations/sec) for 1K and 4K message sizes, comparing 1GigE, IPoIB and the OSU design. HBase Put operation: 1K bytes: 78 us (13K TPS); 4K bytes: 8K TPS. A factor of two improvement over IPoIB for 1KB.

38 HBase Put/Get Detailed Analysis Charts of time (us) broken down into communication, communication preparation, server processing, server serialization, client processing and client serialization, for HBase 1KB Put and 1KB Get over 1GigE, IPoIB, 10GigE and OSU-IB. HBase 1KB Put communication time: 8.9 us, a factor of 6X improvement over 10GigE in communication time. HBase 1KB Get communication time: 8.9 us, a factor of 6X improvement over 10GigE in communication time. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS 12.

39 HBase Single Server-Multi-Client Results Charts of latency (us) and throughput (ops/sec) vs. number of clients, comparing 1GigE, 10GigE, IPoIB and OSU-IB. HBase Get latency: 104.5 us for 4 clients. HBase Get throughput: 37.1 Kops/sec for 4 clients; 53.4 Kops/sec for 16 clients. 27% improvement in throughput for 16 clients over 10GigE.

40 HBase YCSB Read-Write Workload Charts of read and write latency (us) vs. number of clients, comparing 1GigE, 10GigE, IPoIB and OSU-IB. HBase Get latency (Yahoo! Cloud Serving Benchmark): 64 clients: 2.0 ms; 128 clients: 3.5 ms; 42% improvement over IPoIB for 128 clients. HBase Put latency: 64 clients: 1.9 ms; 128 clients: 3.5 ms; 40% improvement over IPoIB for 128 clients. J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS 12.

41 Presentation Outline Overview of Hadoop, Memcached and HBase Challenges in Accelerating Enterprise Middleware Designs and Case Studies Memcached HBase HDFS Conclusion and Q&A 41

42 Studies and Experimental Setup Two kinds of designs and studies: (1) studying the impact of HDD vs. SSD for HDFS, using unmodified Hadoop; (2) a preliminary design of HDFS over verbs. Hadoop experiments: Intel Clovertown 2.33 GHz, 6 GB RAM, InfiniBand DDR, Chelsio T320; Intel X-25E 64 GB SSD and 250 GB HDD; Hadoop version 0.20.2, Sun/Oracle Java 1.6.0; dedicated NameNode and JobTracker; number of DataNodes used: 2, 4 and 8.

43 Hadoop: DFS IO Write Performance Chart of average write throughput (MB/sec) vs. file size (GB) on four DataNodes, comparing 1GigE, IPoIB, SDP and 10GigE-TOE, each with HDD and with SSD. DFS IO, included in Hadoop, measures sequential access throughput; two map tasks each write to a file of increasing size (1-10 GB). Significant improvement with IPoIB, SDP and 10GigE. With SSD, the performance improvement is almost seven or eight fold. SSD benefits are not seen without a high-performance interconnect.
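
The throughput numbers on this slide come from the DFS IO benchmark shipped with Hadoop; the sketch below is not that benchmark, only a simplified single-writer sketch of the same sequential-write pattern, showing where the streaming writes to the DataNodes happen. The file size, buffer size and output path are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SequentialWriteThroughput {
    public static void main(String[] args) throws Exception {
        final long fileSizeBytes = 1L << 30;             // 1 GB; scale toward 10 GB as on the slide
        final byte[] buffer = new byte[4 * 1024 * 1024]; // 4 MB write buffer

        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/benchmarks/seq-write.dat"); // hypothetical output path

        long start = System.nanoTime();
        FSDataOutputStream out = fs.create(path, true);
        for (long written = 0; written < fileSizeBytes; written += buffer.length) {
            out.write(buffer);                           // sequential streaming write to the DataNodes
        }
        out.close();                                     // finalize the last block
        double seconds = (System.nanoTime() - start) / 1e9;

        long mb = fileSizeBytes >> 20;
        System.out.printf("wrote %d MB in %.1f s (%.1f MB/s)%n", mb, seconds, mb / seconds);
        fs.close();
    }
}
```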

44 Hadoop: RandomWriter Performance Chart of execution time (sec) vs. number of DataNodes (2 and 4), comparing 1GigE, IPoIB, SDP and 10GigE-TOE, each with HDD and with SSD. Each map generates 1 GB of random binary data and writes it to HDFS. SSD improves execution time by 50% with 1GigE for two DataNodes. For four DataNodes, benefits are observed only with HPC interconnects; IPoIB, SDP and 10GigE can improve performance by 59% on four DataNodes.

45 Hadoop Sort Benchmark Chart of execution time (sec) vs. number of DataNodes (2 and 4), comparing 1GigE, IPoIB, SDP and 10GigE-TOE, each with HDD and with SSD. Sort is the baseline benchmark for Hadoop; the sort phase is I/O bound and the reduce phase is communication bound. SSD improves performance by 28% using 1GigE with two DataNodes; benefit of 50% on four DataNodes using SDP, IPoIB or 10GigE. S. Sur, H. Wang, J. Huang, X. Ouyang and D. K. Panda, Can High-Performance Interconnects Benefit Hadoop Distributed File System?, MASVDC 10, in conjunction with MICRO 2010, Atlanta, GA.

46 HDFS Design Using Verbs Current design: HDFS → Sockets → 1/10 GigE network. OSU design: HDFS → JNI interface → OSU module → InfiniBand (verbs).

47 RDMA-based Design for Native HDFS: Preliminary Results Chart of HDFS file write time (ms) vs. file size (GB), comparing 1GigE, IPoIB, 10GigE and the OSU design; experiment using four DataNodes on the IB-DDR cluster. HDFS file write time: 86 s for a 5 GB file. For a 5 GB file size: 20% improvement over IPoIB, 14% improvement over 10GigE.

48 Presentation Outline Overview of Hadoop, Memcached and HBase Challenges in Accelerating Enterprise Middleware Designs and Case Studies Memcached HBase HDFS Conclusion and Q&A 48

49 Concluding Remarks InfiniBand with its RDMA feature is gaining momentum in HPC systems, with the best performance and growing usage. It is possible to use the RDMA feature in enterprise environments for accelerating big data processing. Presented some initial designs and performance numbers. Many open research challenges remain to be solved so that middleware for enterprise environments can take advantage of modern high-performance networks, multi-core technologies and emerging storage technologies.

50 Designing Communication and I/O Libraries for Enterprise Systems: Solved a Few Initial Challenges Layered stack: Applications; Datacenter Middleware (HDFS, HBase, MapReduce, Memcached); Programming Models (Sockets); Communication and I/O Library (point-to-point communication, threading models and synchronization, I/O and file systems, QoS, fault tolerance); Networking Technologies (InfiniBand, 1/10/40 GigE, RNICs & intelligent NICs); Commodity Computing System Architectures (single, dual, quad, ..); Multi-/Many-core Architectures and Accelerators; Storage Technologies (HDD or SSD).

51 Web Pointers MVAPICH Web Page 51
