Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached

1 Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached. Talk at HPC Advisory Council Lugano Conference (2013) by Dhabaleswar K. (DK) Panda, The Ohio State University

2 Big Data. Provides groundbreaking opportunities for enterprise information management and decision making. The rate of information growth appears to be exceeding Moore's Law. The amount of data is exploding; companies are capturing and digitizing more information than ever. 35 zettabytes of data will be generated and consumed by the end of this decade. Courtesy: John Gantz and David Reinsel, "The Digital Universe Decade - Are You Ready?", May 2010

3 Frequency of "Big Data" - An Example. Graph of 126 Twitter users whose tweets contain "big data". Tweets made over a period of 8 hours and 24 minutes.

4 Trends of Networking Technologies in TOP500 Systems. (Charts: Interconnect Family - Systems Share; Interconnect Family - Performance Share)

5 HPC and Advanced Interconnects. High-Performance Computing (HPC) has adopted advanced interconnects and protocols: InfiniBand, 10 Gigabit Ethernet/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE). Very good performance: low latency (a few microseconds), high bandwidth (100 Gb/s with dual FDR InfiniBand), low CPU overhead (5-10%). The OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems; many such systems are in the Top500 list.

6 Use of High-Performance Networks for Scientific Computing: Message Passing Interface (MPI), Parallel File Systems, Storage Systems. Almost 12 years of research and development since InfiniBand was introduced in October 2000. Other programming models are emerging to take advantage of high-performance networks: UPC, OpenSHMEM, and Hybrid MPI+PGAS (OpenSHMEM and UPC).

7 Big Data/Enterprise/Commercial Computing. Focuses on large data and data analysis. The Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of momentum. Memcached is also used for Web 2.0.

8 Can High-Performance Interconnects Benefit Enterprise Computing? Most of the current enterprise systems use 1GigE, with concerns for performance and scalability. Usage of high-performance networks is beginning to draw interest; Oracle, IBM and Google are working along these directions. What are the challenges? Where do the bottlenecks lie? Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?

9 Presentation Outline. Overview of Hadoop (HDFS, MapReduce and HBase) and Memcached; Challenges in Accelerating Enterprise Middleware; Designs and Case Studies: Hadoop (HDFS, MapReduce, HBase), Combination (HDFS + HBase), Memcached; Conclusion and Q&A

10 Big Data Processing with Hadoop. The open-source implementation of the MapReduce programming model for Big Data analytics. The framework includes MapReduce, HDFS, and HBase; the underlying Hadoop Distributed File System (HDFS) is used by both MapReduce and HBase. The model scales, but there is a high amount of communication during the intermediate (shuffle) phases.
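To make the programming model concrete, below is a minimal WordCount sketch against the classic org.apache.hadoop.mapreduce API; the class names are illustrative, not from the talk. The pairs the mapper emits are exactly the intermediate data that is shuffled across the network between the map and reduce phases.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every token in its input split.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            ctx.write(word, ONE);   // intermediate output, shuffled over the network
        }
    }
}

// Reducer: sums the counts for each word after the shuffle phase.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
    }
}
```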

11 Hadoop Distributed File System (HDFS). Primary storage of Hadoop; highly reliable and fault-tolerant. Adopted by many reputed organizations, e.g., Facebook and Yahoo!. NameNode: stores the file system namespace. DataNode: stores data blocks. Developed in Java for platform-independence and portability. Uses sockets for communication! (HDFS Architecture)
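As a concrete illustration of the client-side write path, here is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem API; the NameNode address and file path are made-up placeholders. Each block written this way is pipelined over sockets to (by default) three DataNodes, which is why HDFS Write is the network-intensive operation targeted later in this talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode address is illustrative only.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/tmp/demo.txt"))) {
            // Each block is replicated through a pipeline of DataNodes
            // over sockets, making Write the network-heavy operation.
            out.writeUTF("hello HDFS");
        }
        fs.close();
    }
}
```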

12 Network-Level Interaction Between Clients and DataNodes in HDFS. (Figure: HDFS Clients connected to HDFS DataNodes (HDD/SSD) through high-performance networks)

13 HBase. The Apache Hadoop database: a semi-structured database which is highly scalable. Integral part of many datacenter applications, e.g., Facebook Social Inbox. Developed in Java for platform-independence and portability. Uses sockets for communication! (HBase Architecture: HBase Client, HMaster, HRegion Servers, ZooKeeper, HDFS)
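For reference, here is a minimal client-side Put/Get sketch using the 0.9x-era HTable client API (the generation of HBase these experiments target); the table, column family, and ZooKeeper host names are invented for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        // ZooKeeper locates the HMaster and RegionServers; host is illustrative.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zkhost");

        HTable table = new HTable(conf, "usertable");
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
        table.put(put);   // travels client -> RegionServer over sockets (TCP/IP)

        Result r = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(
            Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));
        table.close();
    }
}
```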

14 Network-Level Interaction Between HBase Clients, Region Servers and Data Nodes. (Figure: HBase Clients, HRegion Servers and DataNodes (HDD/SSD) connected through high-performance networks)

15 Memcached Architecture. (Figure: Web frontend servers (Memcached clients), Memcached servers and database servers, each with CPUs, main memory, SSDs and HDDs, connected through high-performance networks) Integral part of the Web 2.0 architecture. A distributed caching layer that allows spare memory from multiple nodes to be aggregated. General purpose; typically used to cache database queries and results of API calls. Scalable model, but typical usage is very network intensive.
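The talk evaluates the C libmemcached client; purely for consistency with the other sketches here, the typical cache-aside usage is shown below with the spymemcached Java client instead (an assumed substitution, not what the slides used), with an invented server host name.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class MemcachedCacheAside {
    public static void main(String[] args) throws Exception {
        // Server address is illustrative; 11211 is the default Memcached port.
        MemcachedClient mc = new MemcachedClient(new InetSocketAddress("mchost", 11211));

        String key = "user:42";
        Object value = mc.get(key);          // network round-trip to a Memcached server
        if (value == null) {
            value = "row fetched from DB";   // stand-in for the real database query
            mc.set(key, 3600, value);        // cache the result for one hour
        }
        System.out.println(value);
        mc.shutdown();
    }
}
```

Every get/set is a small-message network round-trip, which is why this usage pattern is so sensitive to interconnect latency.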

16 Presentation Outline. Overview of Hadoop (HDFS, MapReduce and HBase) and Memcached; Challenges in Accelerating Enterprise Middleware; Designs and Case Studies: Hadoop (HDFS, MapReduce, HBase), Combination (HDFS + HBase), Memcached; Conclusion and Q&A

17 Designing Communication and I/O Libraries for Enterprise Systems: Challenges. (Stack diagram, top to bottom:) Applications; Datacenter Middleware (HDFS, HBase, MapReduce, Memcached); Programming Models (Sockets); Communication and I/O Library: Point-to-point Communication, Threading Models and Synchronization, I/O and File Systems, QoS, Fault Tolerance; Networking Technologies (InfiniBand, 1/10/40 GigE, RNICs & Intelligent NICs); Commodity Computing System Architectures (single, dual, quad, ..), Multi-/Many-core Architectures and Accelerators; Storage Technologies (HDD or SSD)

18 Common Protocols using OpenFabrics. (Figure: protocol stacks from application to switch for each option, showing the application interface (sockets or verbs), kernel-space vs. user-space protocol, and the adapter/switch used) Options: 1/10/40 GigE, IPoIB, 10/40 GigE-TOE, RSockets, SDP, iWARP, RoCE, IB Verbs. The kernel-space paths run TCP/IP over an Ethernet driver or IPoIB; the user-space RDMA paths use verbs with hardware offload on iWARP, RoCE or InfiniBand adapters.

19 Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols? Current design: Application, Sockets, 1/10 GigE network. Enhanced designs: Application, Accelerated Sockets (Verbs/hardware offload), 10 GigE or InfiniBand. Our approach: Application, OSU Design, Verbs interface, 10 GigE or InfiniBand. Sockets were not designed for high performance: stream semantics often mismatch the upper layers (Memcached, HBase, Hadoop), and zero-copy is not available for non-blocking sockets.

20 Interplay between Storage and Interconnect/Protocols. Most of the current-generation enterprise systems use traditional hard disks. Since hard disks are slower, high-performance communication protocols may not have an impact. SSDs and other storage technologies are emerging: does this change the landscape?

21 Presentation Outline. Overview of Hadoop (HDFS, MapReduce and HBase) and Memcached; Challenges in Accelerating Enterprise Middleware; Designs and Case Studies: Hadoop (HDFS, MapReduce, HBase), Combination (HDFS + HBase), Memcached; Conclusion and Q&A

22 HDFS-RDMA Design Overview. Enables high-performance RDMA communication while supporting the traditional socket interface. (Design: Applications, Java Socket Interface / Java Native Interface (JNI), OSU Design, IB Verbs over InfiniBand or 1/10 GigE/IPoIB network) HDFS Write involves replication and is more network intensive; HDFS Read is mostly node-local. The JNI layer bridges the Java-based HDFS with a communication library written in native code. Only the communication part of HDFS Write is modified; there is no change in the HDFS architecture.
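The OSU communication code itself is not shown in the talk, so the following is only a hypothetical sketch of what such a JNI bridge can look like; the class, method, and library names (RdmaBridge, librdmahdfs) are invented for illustration.

```java
// Hypothetical JNI bridge: the Java side of an RDMA write path for HDFS.
// All names here are invented for illustration only.
public class RdmaBridge {
    static {
        System.loadLibrary("rdmahdfs");   // loads librdmahdfs.so (native verbs code)
    }

    // Implemented in C over IB verbs: registers memory and posts RDMA operations.
    public static native int connect(String host, int port);
    public static native long sendBlock(int conn, java.nio.ByteBuffer directBuf, int len);

    public static void writePacket(int conn, byte[] packet) {
        // Direct buffers let the native side access the memory without an extra copy.
        java.nio.ByteBuffer buf = java.nio.ByteBuffer.allocateDirect(packet.length);
        buf.put(packet).flip();
        sendBlock(conn, buf, packet.length);
    }
}
```

The key design point from the slide is that only this communication layer changes; the HDFS write pipeline above it, and applications using HDFS, stay untouched.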

23 Communication Times in HDFS. (Chart: communication time (s) for 1GigE, IPoIB (QDR) and OSU-IB (QDR) across file sizes 2GB-10GB; Cluster B with 32 HDD DataNodes; reduced by 30%) 30% improvement in communication time over IPoIB (32Gbps); 87% improvement in communication time over 1GigE. Similar improvements are obtained for SSD DataNodes. N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012

24 Evaluations using TestDFSIO. (Charts: throughput (MegaBytes/s) for 1GigE, IPoIB (QDR) and OSU-IB (QDR) vs. file size (GB); left: cluster with 32 HDD DataNodes, increased by 13.5%; right: cluster with 4 SSD DataNodes, increased by 15%) Cluster with 32 HDD DataNodes: 13.5% improvement over IPoIB (32Gbps) for 8GB file size. Cluster with 4 SSD DataNodes: 15% improvement over IPoIB (32Gbps) for 8GB file size.

25 Evaluations using TestDFSIO. (Chart: average throughput (MegaBytes/s) for 1GigE, IPoIB and OSU-IB with 1 HDD and 2 HDDs per node vs. file size (GB); increased by 16.2%) Cluster with 4 DataNodes, 1 HDD per node: 10% improvement over IPoIB (32Gbps) for 10GB file size. Cluster with 4 DataNodes, 2 HDDs per node: 16.2% improvement over IPoIB (32Gbps) for 10GB file size. 2 HDDs vs. 1 HDD: 2.1x improvement for OSU-IB (32Gbps); 1.8x improvement for IPoIB (32Gbps).

26 Evaluations using YCSB (Single Region Server: 100% Update). (Charts: average Put latency (us) and Put throughput for 1GigE, IPoIB (QDR) and OSU-IB (QDR) with 120K, 240K, 360K and 480K records; record size = 1KB) HBase using TCP/IP, running over HDFS-IB. HBase Put latency for 360K records: 204 us for the OSU design; 252 us for IPoIB (32Gbps). HBase Put throughput for 360K records: 4.42 Kops/sec for the OSU design; 3.63 Kops/sec for IPoIB (32Gbps). 20% improvement in both average latency and throughput.

27 Evaluations using YCSB (32 Region Servers: 100% Update). (Charts: average Put latency (us) and Put throughput (Kops/sec) for 1GigE, IPoIB (QDR) and OSU-IB (QDR) with 120K, 240K, 360K and 480K records; record size = 1KB) HBase using TCP/IP, running over HDFS-IB. HBase Put latency for 480K records: 201 us for the OSU design; 272 us for IPoIB (32Gbps). HBase Put throughput for 480K records: 4.42 Kops/sec for the OSU design; 3.63 Kops/sec for IPoIB (32Gbps). 26% improvement in average latency; 24% improvement in throughput.

28 Presentation Outline. Overview of Hadoop (HDFS, MapReduce and HBase) and Memcached; Challenges in Accelerating Enterprise Middleware; Designs and Case Studies: Hadoop (HDFS, MapReduce, HBase), Combination (HDFS + HBase), Memcached; Conclusion and Q&A

29 MapReduce-RDMA Design Overview. (Design: Applications, RDMA-enabled MapReduce (JobTracker, TaskTracker, Map, Reduce), Java Socket Interface / Java Native Interface (JNI), OSU-IB Design, IB Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ..) or 1/10 GigE network) Enables high-performance RDMA communication while supporting the traditional socket interface.

30 Evaluations using Sort. (Charts: execution time (sec) for 1GigE, IPoIB (QDR) and OSU-IB (QDR) vs. sort size (GB)) For 5-20 GB Sort on a 4-node cluster with a single SSD per node: 46% improvement over IPoIB (32 Gbps) for 15 GB Sort. For 25-40 GB Sort on an 8-node cluster with a single HDD per node: 27% improvement over IPoIB (32 Gbps) for 40 GB Sort.

31 Evaluations using TeraSort. (Chart: execution time (sec) for 1GigE, IPoIB (QDR) and OSU-IB (QDR) at data sizes including 6 GB, 10 GB and 20 GB on cluster sizes 4, 8, 12 and 24) Cluster sizes 4 and 8 have 24 GB RAM in each node; cluster sizes 12 and 24 have 12 GB RAM in each node; all nodes have a single HDD. 41% improvement over IPoIB (32Gbps) for 10 GB TeraSort.

32 Evaluations using PUMA Workloads. (Chart: execution time (sec) for IPoIB (QDR) and OSU-IB (QDR) on Adjacency List (3 GB), Self Join (3 GB), Word Count (5 GB), Sequence Count (5 GB) and Inverted Index (5 GB)) The datasets for these workloads are taken from PUMA (Purdue MapReduce Benchmark Suite). 46% improvement in Adjacency List over IPoIB (32Gbps) for 3 GB data size; 33% improvement in Sequence Count over IPoIB (32 Gbps) for 5 GB data size.

33 Presentation Outline. Overview of Hadoop (HDFS, MapReduce and HBase) and Memcached; Challenges in Accelerating Enterprise Middleware; Designs and Case Studies: Hadoop (HDFS, MapReduce, HBase), Combination (HDFS + HBase), Memcached; Conclusion and Q&A

34 HBase Put/Get: Detailed Analysis. (Charts: time (us) broken down into communication, communication preparation, server processing, server serialization, client processing and client serialization for 1GigE, IPoIB, 10GigE and OSU-IB; left: HBase 1KB Put, right: HBase 1KB Get) HBase 1KB Put: communication time of 8.9 us, a factor of 6X improvement over 1GE in communication time. HBase 1KB Get: communication time of 8.9 us, a factor of 6X improvement over 1GE in communication time. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, C. Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS 12

35 HBase-RDMA Design Overview. (Design: Applications, HBase, Java Socket Interface / Java Native Interface (JNI), OSU-IB Design, IB Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ..) or 1/10 GigE network) The JNI layer bridges the Java-based HBase with a communication library written in native code. Enables high-performance RDMA communication while supporting the traditional socket interface.

36 HBase Single Server-Multi-Client Results. (Charts: latency (us) and throughput (Ops/sec) for 1GigE, IPoIB and OSU-IB vs. number of clients) HBase Get latency: 4 clients: 14.5 us; 16 clients: … us. HBase Get throughput: 4 clients: 37.1 Kops/sec; 16 clients: 53.4 Kops/sec. 27% improvement in throughput for 16 clients over 1GE.

37 HBase YCSB Read-Write Workload. (Charts: read latency and write latency (us) for 1GigE, 10GigE, IPoIB and OSU-IB vs. number of clients) HBase Get latency (Yahoo! Cloud Serving Benchmark): 64 clients: 2.0 ms; 128 clients: 3.5 ms; 42% improvement over IPoIB for 128 clients. HBase Put latency: 64 clients: 1.9 ms; 128 clients: 3.5 ms; 40% improvement over IPoIB for 128 clients. J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, C. Murthy and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS 12

38 Presentation Outline. Overview of Hadoop (HDFS, MapReduce and HBase) and Memcached; Challenges in Accelerating Enterprise Middleware; Designs and Case Studies: Hadoop (HDFS, MapReduce, HBase), Combination (HDFS + HBase), Memcached; Conclusion and Q&A

39 HDFS and HBase Integration over IB (OSU-IB). (Charts: Put latency (us) and throughput (Kops/sec) for 1GigE, IPoIB (QDR), OSU-IB (HDFS) - IPoIB (HBase) (QDR), and OSU-IB (HDFS) - OSU-IB (HBase) (QDR) with 120K, 240K, 360K and 480K records; record size = 1KB) YCSB evaluation with 4 RegionServers (100% update). HBase Put latency and throughput for 360K records: 37% improvement over IPoIB (32Gbps); 18% improvement over OSU-IB for HDFS only.

40 Presentation Outline. Overview of Hadoop (HDFS, MapReduce and HBase) and Memcached; Challenges in Accelerating Enterprise Middleware; Designs and Case Studies: Hadoop (HDFS, MapReduce, HBase), Combination (HDFS + HBase), Memcached; Conclusion and Q&A

41 Memcached-RDMA Design. (Design: sockets clients and RDMA clients, master thread, sockets worker threads and verbs worker threads, shared data (memory slabs, items)) Memcached applications need not be modified; the verbs interface is used if available. The Memcached server can serve both socket and verbs clients simultaneously. Native IB-verbs-level design, evaluated with the Memcached server and the libmemcached 0.52 client.

42 Memcached Get Latency (Small Messages). (Charts: time (us) vs. message size up to 2K for SDP, IPoIB, 1GigE, 10GigE, OSU-RC-IB and OSU-UD-IB; left: Intel Clovertown cluster (IB: DDR), right: Intel Westmere cluster (IB: QDR)) Memcached Get latency: 4 bytes, RC/UD: DDR 6.82/7.55 us, QDR 4.28/4.86 us; 2K bytes, RC/UD: DDR 12.31/12.78 us, QDR 8.19/8.46 us. Almost a factor of four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster.

43 Memcached Get TPS (4-byte). (Charts: thousands of transactions per second (TPS) for SDP, IPoIB, 1GigE, OSU-RC-IB and OSU-UD-IB vs. number of clients) Memcached Get transactions per second for 4 bytes: on IB QDR, 1.4M/s (RC) and 1.3M/s (UD) for 8 clients. Significant improvement with native IB QDR compared to SDP and IPoIB.

44 Application-Level Evaluation: Olio Benchmark. (Charts: time (ms) for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB vs. number of clients) Olio Benchmark: RC 1.6 sec, UD 1.9 sec, Hybrid 1.7 sec for 1024 clients. 4X better than IPoIB for 8 clients. The hybrid design achieves performance comparable to that of the pure RC design.

45 Application-Level Evaluation: Real Application Workloads. (Charts: time (ms) for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB vs. number of clients) Real application workload: RC 302 ms, UD 318 ms, Hybrid 314 ms for 1024 clients. 12X better than IPoIB for 8 clients. The hybrid design achieves performance comparable to that of the pure RC design. J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP 11. J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula and D. K. Panda, Scalable Memcached Design for InfiniBand Clusters using Hybrid Transport, CCGrid 12

46 Designing Communication and I/O Libraries for Enterprise Systems: Solved a Few Initial Challenges. (Stack diagram, as before, with "Upper-level changes?" highlighted at the Datacenter Middleware layer:) Applications; Datacenter Middleware (HDFS, HBase, MapReduce, Memcached); Programming Models (Sockets); Communication and I/O Library: Point-to-point Communication, Threading Models and Synchronization, I/O and File Systems, QoS, Fault Tolerance; Networking Technologies (InfiniBand, 1/10/40 GigE, RNICs & Intelligent NICs); Commodity Computing System Architectures (single, dual, quad, ..), Multi-/Many-core Architectures and Accelerators; Storage Technologies (HDD or SSD)

47 Concluding Remarks. Presented initial designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, HBase and Memcached; the results are promising. Working on integrated designs of all components. Many other open issues need to be solved, including design changes at the upper layers. This will enable the Big Data community to take advantage of modern clusters and Hadoop middleware to carry out analytics in a fast and scalable manner.

48 Web Pointers: MVAPICH Web Page
