Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

1 Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
Presented by: Dishant Mittal
Authors: Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald and Fatma Ozcan

2 Authors
- Juwei Shi: Director of Product Development, K2D Data Technology. Bachelor's and Master's degrees from Renmin University of China (big data management and cluster computing system performance).
- Researchers, IBM Research, China:
  - Yunjie Qiu: PhD, Beijing University of Posts and Telecommunications
  - Limei Jiao: PhD, Beijing University of Posts and Telecommunications
  - Chen Wang: PhD, Tsinghua University
- Researchers, IBM Almaden Research Center, USA:
  - Umar Farooq Minhas: PhD, University of Waterloo
  - Berthold Reinwald: PhD, University of Erlangen-Nuernberg, Germany
  - Fatma Ozcan: PhD, University of Maryland

3 Outline
- Introduction
- Cluster Computing Architectures
- Key Architectural Components
- Five Workloads
- Workload Characteristics
- Experimental Setup
- Five Experiments
- Summary
- Discussion

4 What Is the Paper About?
- An evaluation of the performance of the major architectural components of MapReduce (Hadoop) and Spark.

9 Introduction
- Example: quick sort vs. merge sort? Easy to compare: quick sort loses on the partitioning step when the keys are already sorted.
- MapReduce vs. Spark? Not easy to compare:
  - Both provide simple APIs but hide the complexity of parallel task execution and fault tolerance from the user.
  - This makes it hard for users to figure out which one is better.
- The paper compares the two because of their widespread adoption in big data analytics.
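The quick sort remark can be made concrete with a small sketch. It assumes the textbook variant that always picks the first key as the pivot (an assumption; the slide does not say which variant is meant) and counts key comparisons on already-sorted input, next to a top-down merge sort. All function names are illustrative.

```python
def quicksort_comparisons(keys):
    """Count key comparisons made by quick sort with a first-element pivot."""
    if len(keys) <= 1:
        return 0
    pivot = keys[0]
    smaller, larger = [], []
    comparisons = 0
    for key in keys[1:]:
        comparisons += 1  # one comparison against the pivot per key
        if key < pivot:
            smaller.append(key)
        else:
            larger.append(key)
    return (comparisons
            + quicksort_comparisons(smaller)
            + quicksort_comparisons(larger))


def mergesort_comparisons(keys):
    """Return (sorted keys, number of key comparisons) for top-down merge sort."""
    if len(keys) <= 1:
        return list(keys), 0
    mid = len(keys) // 2
    left, left_cmp = mergesort_comparisons(keys[:mid])
    right, right_cmp = mergesort_comparisons(keys[mid:])
    merged, comparisons = [], left_cmp + right_cmp
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


n = 256
sorted_keys = list(range(n))
quick = quicksort_comparisons(sorted_keys)
_, merge = mergesort_comparisons(sorted_keys)
print(quick, merge)
```

On sorted input every partitioning step of this quick sort variant splits off a single key, so the comparison count grows quadratically, while merge sort stays within n log2(n).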

12 Cluster Computing Architectures
- MapReduce
  - Machine learning and graph processing need iteration.
  - Every iteration repeats the READ -> PROCESS -> WRITE cycle, which adds overhead.
- Apache Spark
  - Resilient Distributed Datasets (RDDs)
  - In-memory computation, which makes iteration efficient

13 Cluster Computing Architectures
Spark

14 Key Architectural Components
- Shuffle
- Execution model
- Caching

15 Workloads
Focus on batch and iterative jobs:
- Word Count
- Sort
- K-Means
- Linear Regression
- PageRank

16 Characteristics of Workloads
- Shuffle selectivity: map output size / job input size
- Job selectivity: reduce output size / job input size
- Iteration selectivity: output size / input size, for each iteration
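The three ratios can be written down directly. The sketch below is only an illustration: the `JobStats` container, its field names, and the byte counts are made up, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class JobStats:
    """Byte counts for one job (hypothetical field names)."""
    input_bytes: int
    map_output_bytes: int
    reduce_output_bytes: int


def shuffle_selectivity(stats):
    """Map output size divided by job input size."""
    return stats.map_output_bytes / stats.input_bytes


def job_selectivity(stats):
    """Reduce output size divided by job input size."""
    return stats.reduce_output_bytes / stats.input_bytes


def iteration_selectivity(iteration_input_bytes, iteration_output_bytes):
    """Output size divided by input size for a single iteration."""
    return iteration_output_bytes / iteration_input_bytes


# Made-up sizes: an aggregation-style job shrinks its data in the map stage,
# while a sort-style job shuffles and writes everything it reads.
aggregation_like = JobStats(input_bytes=1000, map_output_bytes=50, reduce_output_bytes=10)
sort_like = JobStats(input_bytes=1000, map_output_bytes=1000, reduce_output_bytes=1000)

print(shuffle_selectivity(aggregation_like), shuffle_selectivity(sort_like))
```

A low shuffle selectivity means little data crosses the network, so the shuffle is unlikely to be the bottleneck; a value near 1 means the whole input is shuffled.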

17 Experimental Setup
Hardware configuration:
- A total of 4 servers, each with:
  - [...] GHz CPU cores
  - 9 disks at 7.2k RPM with 1 TB each
  - 190 GB RAM
  - 64-bit RHEL
- Connected by a 1 Gbps Ethernet switch
- Equivalent to a cluster of 100 VMs (m3.medium on AWS)
Software configuration:
- Hadoop with YARN
  - Settings: input format, block size, replication factor, JVM heap size, containers
  - Overlap between the map and reduce stages disabled (except for Sort)
- Spark on HDFS
  - Settings: workers, JVM heap size

18 Experimental Setup
Profiling tools:
- Logging with RRDTool and Ganglia
- Resource utilization is correlated with the task execution plan in a timeline view

19 Experimental Setup
[Figure: execution plan visualization example]

20 Experimental Setup
Fine-grained time breakdown:
- Hadoop: extract the details from log data.
- Spark: insert timers in the source code after each sub-stage and aggregate the timings at the end.

21 Experiment 1: Word Count
- Implementation: Word Count program
- Overall result: Spark is 2.1x, 2.6x and 2.7x faster than MapReduce as the data size increases.

22 Experiment 1: Word Count
- Map stage: Spark is 3x faster.
  - Spark is about 6.2x faster than MapReduce in the combine stage.
  - Hash-based combine performs better than sort-based combine.
- Reduce stage: similar results for both frameworks (network-bound).

24 Experiment 1: Word Count
Inference: For workloads with low shuffle selectivity, hash-based aggregation in Spark is more efficient than sort-based aggregation in MapReduce, because of the complexity differences between their in-memory collection and combine components.
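The complexity difference behind this inference can be sketched in a few lines. This is a simplified, single-machine illustration of the two combine strategies, not the actual MapReduce or Spark implementation; the function names are hypothetical.

```python
from itertools import groupby


def hash_combine(words):
    """Hash-based combine: one dictionary update per record, O(n) expected."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def sort_combine(words):
    """Sort-based combine: sort the records, then sum each run of equal keys.

    The sort dominates the cost at O(n log n).
    """
    return {word: sum(1 for _ in run) for word, run in groupby(sorted(words))}


records = "to be or not to be".split()
print(hash_combine(records))
print(sort_combine(records))
```

Both functions return the same counts; the sort-based version simply pays for an ordering that a word count does not need.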

25 Experiment 2: Sort
- Implementation: TeraSort for Hadoop, sortByKey() for Spark
- Overall result: MapReduce is 1.1x, 1.5x and 1.8x faster than Spark as the data size increases.

26 Experiment 2: Sort
- Data-read stage: lightweight in Hadoop; disk-bound in Spark, which scans the complete file.
- Map stage: Spark is 2.5x faster; CPU-bound for both.
- Reduce stage: Hadoop is 3.3x faster thanks to overlapping; network-bound for both.

28 Experiment 2: Sort
- Spark reads from the OS buffer cache.
- For MapReduce, the OS buffer cache is not as effective.
- Spark's hash-based shuffle writer writes directly to disk, which reduces latency.

29 Experiment 2: Sort
Inferences:
- In MapReduce, the reduce stage is faster than in Spark because MapReduce can overlap the shuffle stage with the map stage, which effectively hides the network overhead.
- In Spark, the execution time of the map stage increases as the number of reduce tasks increases. This overhead is caused by, and is proportional to, the number of files opened simultaneously.
- MapReduce's advantage over Spark grows as the data size increases.
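A rough way to see why the number of reduce tasks matters for Spark's map stage: the sketch below assumes a hash-based shuffle writer that keeps one output file per reduce partition for every running map task, with no file consolidation. That layout is an assumption made for illustration, and the function names, parameter names and numbers are made up.

```python
def open_shuffle_files(concurrent_map_tasks, num_reduce_tasks):
    """Files held open at once: one per reduce partition per running map task."""
    return concurrent_map_tasks * num_reduce_tasks


def total_shuffle_files(num_map_tasks, num_reduce_tasks):
    """Files written over the whole map stage under the same assumption."""
    return num_map_tasks * num_reduce_tasks


# Illustrative numbers only: 32 map tasks running at a time.
for reducers in (100, 200, 400):
    print(reducers, open_shuffle_files(32, reducers))
```

Under this assumption, doubling the number of reduce tasks doubles the number of simultaneously open files, which matches the proportionality stated in the inference.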

30 Experiment 3: Iterative Algorithms
- Implementation: Mahout's K-Means for Hadoop, example program for Spark
- In each iteration, the job reads the training data to compute the updated weights, an access pattern also commonly seen in SVM and linear/logistic regression.

31 Experiment 3: Iterative Algorithms
- Map stage: CPU-bound
- Reduce stage: low shuffle selectivity, so the shuffle is not a bottleneck
- Spark wins because of RDD caching

32 Experiment 3: Iterative Algorithms

33 Experiment 3: Iterative Algorithms
- The storage level does not affect the execution time of subsequent iterations.
- For DISK_ONLY, there are almost no disk reads in subsequent iterations, since the bandwidth of 8 disks for caching RDDs is more than enough to sustain the I/O rate of k-means, with or without the OS buffer cache.

34 Experiment 3: Iterative Algorithms
- Without RDD caching, disk reads decrease from one iteration to the next, but execution time does not improve. Why?
- Micro-benchmark experiment: for k-means without RDD caching, the reduction in disk I/O due to the OS buffer cache does not shorten subsequent iterations, because the CPU overhead of transforming each text line into a Point object is the bottleneck.

35 Experiment 3: Iterative Algorithms
Inferences:
- If an iteration is CPU-bound, caching the raw file (to reduce disk I/O) may not reduce the execution time, since the disk I/O is hidden by the CPU overhead.
- Conversely, if an iteration is disk-bound, caching the raw file can significantly reduce the execution time.
- RDD caching can reduce not only disk I/O but also CPU overhead, since it can cache any intermediate data in the analytic pipeline. For k-means, for example, the main contribution of RDD caching is to cache the Point objects, which saves the cost of transforming each text line, the bottleneck of every iteration.
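The point about caching parsed Point objects rather than raw text can be imitated on a single machine. The sketch below is plain Python, not Spark: the `PointSource` class, its `cache_parsed` flag and the toy data are hypothetical stand-ins for an RDD with and without caching, and the parse counter stands in for the CPU cost of the text-to-Point transformation.

```python
class PointSource:
    """Serves training points from text lines, optionally caching parsed points."""

    def __init__(self, lines, cache_parsed):
        self.lines = lines
        self.cache_parsed = cache_parsed
        self.parse_calls = 0
        self._cached = None

    def _parse(self, line):
        self.parse_calls += 1  # stands in for the text-to-Point CPU cost
        return tuple(float(field) for field in line.split())

    def points(self):
        if self.cache_parsed:
            if self._cached is None:
                self._cached = [self._parse(line) for line in self.lines]
            return self._cached
        # No caching: every iteration re-parses every line.
        return [self._parse(line) for line in self.lines]


def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))


def kmeans(source, centroids, iterations):
    """Plain k-means; an empty cluster keeps its previous centroid."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in source.points():
            clusters[closest(point, centroids)].append(point)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids


lines = ["0 0", "0 1", "1 0", "9 9", "9 10", "10 9"]
start = [(0.0, 0.0), (10.0, 10.0)]
uncached = PointSource(lines, cache_parsed=False)
cached = PointSource(lines, cache_parsed=True)
result_uncached = kmeans(uncached, start, iterations=5)
result_cached = kmeans(cached, start, iterations=5)
print(uncached.parse_calls, cached.parse_calls)
```

Both runs produce the same centroids, but the uncached source parses every line in every iteration, while the cached one parses each line once.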

36 Experiment 4: PageRank
- Implementation: X-RIME PageRank for Hadoop; the example program and the GraphX version for Spark; run over Facebook and Twitter datasets.
- Pattern: in each MapReduce iteration, during the map stage each vertex loads its graph data structure (i.e., its adjacency list) from HDFS and sends its rank to its neighbors through a shuffle. During the reduce stage, each vertex receives the ranks of its neighbors to update its own rank, and stores both the adjacency list and the ranks on HDFS for the next iteration.
- PageRank has much higher shuffle selectivity (to exchange ranks) and iteration selectivity (to exchange graph structures) than k-means.

37 Experiment 4: PageRank
- Spark-GraphX is 4x faster than Spark-Naive.
- Spark-Naive is faster than Hadoop.

38 Experiment 4: PageRank
- CPU-bound for both frameworks.
- Hadoop has heavier shuffle and disk I/O overheads.

39 Experiment 4: PageRank
Inference: For graph analytic algorithms such as BFS and community detection, which read the graph structure and iteratively exchange states through a shuffle, Spark (unlike MapReduce) can avoid materializing graph data structures on HDFS across iterations. This reduces the overheads of serialization/deserialization, disk I/O, and network I/O.
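The iteration pattern can be sketched on a single machine. This is a simplified PageRank in plain Python, not the X-RIME, Spark example or GraphX code: it assumes every vertex has at least one out-link and every neighbor appears as a key, and the toy graph is made up. The adjacency lists stay in memory for all iterations and only the ranks are recomputed, which is the saving the inference describes.

```python
def pagerank(adjacency, iterations=10, damping=0.85):
    """Simplified PageRank; adjacency maps each vertex to its out-neighbors."""
    ranks = {vertex: 1.0 for vertex in adjacency}
    for _ in range(iterations):
        # "Map" side: every vertex sends rank / out-degree to each neighbor.
        contributions = {vertex: 0.0 for vertex in adjacency}
        for vertex, neighbors in adjacency.items():
            for neighbor in neighbors:
                contributions[neighbor] += ranks[vertex] / len(neighbors)
        # "Reduce" side: every vertex combines the contributions it received.
        ranks = {vertex: (1 - damping) + damping * received
                 for vertex, received in contributions.items()}
    return ranks


# Toy graph: a links to b and c, b links to c, c links back to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```

In the MapReduce pattern described for this experiment, the `adjacency` structure would be written to and re-read from HDFS on every pass through the loop.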

40 Impact of Other Factors
- Varying the number of disks: no substantial effect for typical clusters.
- Memory limits: no substantial effect as long as the JVM heap size is above a certain level.
- Built-in fault tolerance: Hadoop copes better with re-executing a failed reducer.

41 Summary
- Developers, system administrators, users and researchers can use the paper's observations to improve both the architecture and the implementation of the frameworks.
- Once tuned properly, the majority of workloads are CPU-bound for both MapReduce and Spark, and hence scale with the number of CPU cores.
- For MapReduce, the network overhead during a shuffle can be hidden by overlapping the map and reduce stages.
- For iterative algorithms in Spark, counter-intuitively, the DISK_ONLY configuration might be better than MEMORY_ONLY.

43 Discussion
Strengths and Weaknesses
Strengths:
- Extensive experimental study with in-depth profiling.
- Micro-benchmarks that further study the impact of RDD caching and other factors.
- Covers major categories of computational workloads, from sorting to graph analytics.
- Provides guidance on big data stacks and tools.
Weaknesses:
- Misses some major computational workloads, e.g., streaming.
- Spark runs in standalone mode, so there is no comparison of Spark and Hadoop both running on HDFS and YARN.
- No scalability tests, given the limited number of nodes in the cluster.
- Some technical details could be omitted or moved to an appendix for clarity and coherence.

44 Discussion
Related Papers
- M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In HotCloud, 2010.
- O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Pérez. Spark versus Flink: understanding performance in big data analytics frameworks. In IEEE International Conference on Cluster Computing, 2016.
- G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD.
- H. Li et al. Tachyon: reliable, memory speed storage for cluster computing frameworks. In Proceedings of the ACM Symposium on Cloud Computing.

45 Discussion
Future Work
- Stream processing experiments: Spark Streaming vs. Hadoop-based streaming.
- A comparison of Spark and Hadoop both running on HDFS and YARN (in fully distributed mode).
- Scalability tests, which the limited number of nodes in the cluster did not allow.
- A more in-depth experimental study of different categories of workloads, e.g., graph algorithms on Hadoop vs. Spark.

46 Discussion
Questions
- What is one possible reason for Spark to insist on stage barriers instead of overlapping map and reduce?
- If we replace the disks with SSDs, would the experimental results change? If so, which parts?
- If we replace the 1 Gbps Ethernet with InfiniBand, would the experimental results change? If so, which parts?
- Would Hadoop and Spark exhibit substantially different scalability curves?

47 Thank You