Evolution From Shark To Spark SQL: Preliminary Analysis and Qualitative Evaluation


1 Evolution From Shark To Spark SQL: Preliminary Analysis and Qualitative Evaluation
Xinhui Tian and Xiexuan Zhou
Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences
BPOE-6, Sep 4, 2015

2 Outline
- Background
- Evolution from Shark to Spark SQL
- Evaluation
- Conclusion

3 Background
- Explosive growth of the market for big data analytics systems
- A wide variety of big data analytics systems
- Fast evolution of each system

4 Hadoop's versions
[Figure: timeline of Hadoop release branches; trunk development is the source of new features, with several alpha releases]

5 Spark's versions
[Figure: timeline of released Spark versions, showing main versions and libraries: Streaming (alpha, then stable), MLlib, GraphX (alpha, then stable), and three Shark releases followed by Spark SQL]

6 Background
The fast evolution poses new challenges:
- For users: difficult to tune the configurations
- For developers: difficult to program
- For researchers: difficult to evaluate and analyze
What are we supposed to do?

7 Motivation
To investigate the version updates of big data systems
- Users: to understand how additional configurations or features affect the execution
- Developers:
  - To study how to achieve better performance and reliability
  - To develop new systems

8 Outline
- Background
- Evolution from Shark to Spark SQL
- Evaluation
- Conclusion

9 Our Target System: Spark
- A general-purpose engine based on the abstraction of resilient distributed datasets (RDDs)
- Developed by the AMPLab at UC Berkeley
- Apache top-level project since 2014
- RDD: data objects reside in memory
  - In-memory data sharing
  - Lineage-based fault tolerance
Matei Zaharia, et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI.

10 Programming Model of Spark
- Two types of RDD operations:
  - Transformations: each transformation creates a new RDD (map, mapPartitions, groupByKey, sortByKey)
  - Actions: reduce, collect, count
- Lazy scheduling: a job does not start until an action is invoked
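The lazy scheduling described on this slide can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's API: `LazyDataset`, `traced_double`, and the sample data are hypothetical names introduced here, and the class only models the idea that transformations record work while an action triggers it.

```python
class LazyDataset:
    """Toy model of an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, ops=()):
        self._source = source    # underlying Python iterable
        self._ops = tuple(ops)   # recorded transformations (the lineage)

    def map(self, fn):
        # Transformation: returns a new dataset and computes nothing yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only extends the recorded lineage.
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _run(self):
        data = iter(self._source)
        for kind, fn in self._ops:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    def collect(self):
        # Action: the recorded lineage is executed only now.
        return list(self._run())

    def count(self):
        # Action: re-executes the lineage and counts the results.
        return sum(1 for _ in self._run())


calls = []

def traced_double(x):
    calls.append(x)  # record every evaluation so laziness is observable
    return 2 * x

ds = LazyDataset(range(5)).map(traced_double).filter(lambda x: x > 2)
evaluated_before_action = len(calls)  # still 0: no action has been called
result = ds.collect()                 # the action triggers the computation
```

Because each transformation returns a new object and leaves its parent untouched, the sketch also mirrors the point that every transformation creates a new RDD.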

11 Spark Scheduling
- A job consists of many stages
- Stage boundaries depend on the dependencies between RDDs
[Figure: example job over RDDs A to G split into three stages: Stage 1 (groupBy from A to B), Stage 2 (map from C to D, union of D and E into F), Stage 3 (join into G); filled boxes mark cached data partitions]
Matei Zaharia, et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI.
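How stages follow from RDD dependencies can be sketched in a few lines of plain Python. This is a simplified model, not the DAGScheduler code: the function name, the lineage dictionary, and the RDD names are hypothetical. It assumes each dependency is already labelled "narrow" (pipelined inside a stage) or "wide" (requires a shuffle, so it starts a new stage).

```python
def split_into_stages(parents, final_rdd):
    """Group RDDs into stages; a wide (shuffle) dependency starts a new stage.

    parents maps an RDD name to a list of (parent_name, kind) pairs,
    where kind is "narrow" or "wide".
    """
    stages = []            # list of sets of RDD names
    visited = set()
    pending = [final_rdd]  # RDDs at which some stage ends
    while pending:
        root = pending.pop()
        if root in visited:
            continue
        stage, stack = set(), [root]
        while stack:
            rdd = stack.pop()
            if rdd in visited:
                continue
            visited.add(rdd)
            stage.add(rdd)
            for parent, kind in parents.get(rdd, []):
                if kind == "narrow":
                    stack.append(parent)    # pipelined in the same stage
                else:
                    pending.append(parent)  # shuffle boundary: new stage
        stages.append(stage)
    return stages


# Hypothetical lineage: the join reads one already-partitioned input
# (narrow) and one input that must be shuffled (wide).
lineage = {
    "joined": [("grouped", "narrow"), ("unioned", "wide")],
    "grouped": [("visits", "wide")],
    "unioned": [("mapped", "narrow"), ("extra", "narrow")],
    "mapped": [("rankings", "narrow")],
}
stages = split_into_stages(lineage, "joined")
```

For this lineage the sketch yields three stages: one ending at `joined`, one containing only `visits`, and one that pipelines `rankings`, `mapped`, `extra`, and `unioned`.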


12 Example: PageRank
1. Start each page with a rank of 1
2. On each iteration, update each page's rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

    links = // RDD of (url, neighbors) pairs
    ranks = // RDD of (url, rank) pairs
    for (i <- 1 to ITERATIONS) {
      ranks = links.join(ranks).flatMap {
        case (url, (links, rank)) =>
          links.map(dest => (dest, rank / links.size))
      }.reduceByKey(_ + _)
    }

[Figure: dataflow in which Links (url, neighbors) is joined with Ranks 0 (url, rank) to produce Contribs 0, which is reduced into Ranks 1; the join and reduce steps repeat to produce Ranks 2, and so on]
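The same update rule can be traced on a tiny graph without a cluster. The sketch below is plain Python rather than Spark code; the function name and the three-page graph are assumptions made for illustration. Like the slide's version it uses no damping factor, and a page that receives no contribution in an iteration drops out of `ranks`, as it would after the join.

```python
from collections import defaultdict


def pagerank(links, iterations):
    """links maps each url to the list of urls it points to."""
    ranks = {url: 1.0 for url in links}  # step 1: every page starts at 1
    for _ in range(iterations):
        contribs = defaultdict(float)
        for url, neighbors in links.items():
            if url not in ranks:
                continue  # inner-join semantics: no rank, no contribution
            for dest in neighbors:
                # each page splits its rank evenly among its neighbors;
                # summing per destination plays the role of reduceByKey
                contribs[dest] += ranks[url] / len(neighbors)
        ranks = dict(contribs)
    return ranks


# Hypothetical three-page graph: a -> b, c;  b -> c;  c -> a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
after_one = pagerank(links, 1)
after_ten = pagerank(links, 10)
```

After one iteration page `c` holds the contributions of both `a` and `b`. Since every page in this graph has outgoing links, the total rank stays at 3 across iterations.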

13 Spark's versions
[Figure: timeline of released Spark versions, showing main versions and libraries: Streaming (alpha, then stable), MLlib, GraphX (alpha, then stable), and three Shark releases followed by Spark SQL]

14 The Fast Evolution of Spark
- Rapid growth of the code base
  - 600+ developers
  - LOC: from 16,000+ in Spark 0.5.0 to 80,000+ in 1.4
- Increasing support for various workloads: machine learning, graph, streaming, database

15 Database Support on Spark
- The Spark community cares about it!
- Two layers:
  - SQL parser: translates a query into a Spark job
  - Spark engine: processes the job submitted by the parser

16 What did we do?
Analyzing the evolution of query processing on Spark at two layers:
- Query parser: from Hive-based Shark to Spark SQL
  - The differences in optimization rules
- Spark core components: the evolution of the Spark engine
  - How it impacts the execution of query processing

17 Shark vs. Spark SQL
Evolution of the query parser:
- Shark: a SQL parser based on Hive
- Spark SQL: Spark's query processing component, based on a new SQL parser called Catalyst

Query processing procedure:

Step output          | Shark                | Spark SQL
Query                | HQL                  | SQL, HQL
Abstract Syntax Tree | Hive Parser          | CatalystSqlParser, HiveQlParser
Logical Plan         | Hive Logical Planner | Catalyst Logical Planner
Optimized Plan       | Hive Optimizer       | Catalyst Optimizer
Physical Plan        | Spark Work Generator | SparkPlan Generator
RDD                  | RDD Generator        | RDD Generator

18 An example: a Join query
A Join query statement:

    SELECT sourceip, sum(adrevenue) AS totalrevenue, avg(pagerank) AS pagerank
    FROM rankings R JOIN
      (SELECT sourceip, desturl, adrevenue
       FROM uservisits UV
       WHERE UV.visitDate > "X" AND UV.visitDate < "Y") NUV
      ON (R.pageURL = NUV.destURL)
    GROUP BY sourceip
    ORDER BY totalrevenue DESC
    LIMIT 1;

Explanation of each step:
1. Select the required columns from table uservisits and filter the data based on the WHERE condition
2. Select the required columns from table rankings
3. Join the data from the two tables
4. Apply the GROUP BY, sum, and avg operations
5. Get the row with the largest totalrevenue
Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.
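The five steps can be followed on a handful of rows in plain Python. This is only an illustration of what the query computes, not how Shark or Spark SQL executes it; the function name, the sample rows, and the date bounds standing in for X and Y are all made up for the example, and it assumes each pageURL appears once in rankings.

```python
from collections import defaultdict

# Hypothetical sample data.
rankings = [  # (pageURL, pageRank)
    ("u1", 10), ("u2", 20), ("u3", 30),
]
uservisits = [  # (sourceIP, destURL, visitDate, adRevenue)
    ("1.1.1.1", "u1", "2000-01-05", 1.0),
    ("1.1.1.1", "u2", "2000-01-10", 2.0),
    ("2.2.2.2", "u3", "2000-01-07", 5.0),
    ("2.2.2.2", "u1", "1999-12-31", 9.0),  # outside the date range
]


def top_revenue(rankings, uservisits, date_from, date_to):
    # Step 1: filter uservisits on the date range, keep the needed columns.
    nuv = [(ip, url, rev) for ip, url, date, rev in uservisits
           if date_from < date < date_to]
    # Step 2: keep the needed columns of rankings as a lookup table.
    rank_of = dict(rankings)
    # Step 3: join on pageURL = destURL.
    joined = [(ip, rev, rank_of[url]) for ip, url, rev in nuv if url in rank_of]
    # Step 4: group by sourceIP, then compute sum(adRevenue) and avg(pageRank).
    groups = defaultdict(list)
    for ip, rev, rank in joined:
        groups[ip].append((rev, rank))
    rows = [(ip,
             sum(rev for rev, _ in g),
             sum(rank for _, rank in g) / len(g))
            for ip, g in groups.items()]
    # Step 5: ORDER BY totalrevenue DESC LIMIT 1.
    return max(rows, key=lambda row: row[1]) if rows else None


best = top_revenue(rankings, uservisits, "2000-01-01", "2000-01-31")
```

Dates are compared as ISO-formatted strings, which keeps the example free of date parsing.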

19 From SQL to Spark plans
[Figure: operator plans generated for the join query, listed here per stage in execution order]
Shark:
- Stage 0: TableScanOperator -> FilterOperator -> SelectOperator -> ReduceSinkOperator
- Stage 1: TableScanOperator -> ReduceSinkOperator
- Stage 2: JoinOperator -> SelectOperator -> GroupByPreShuffleOperator -> ReduceSinkOperator
- Stage 3: GroupByPostShuffleOperator -> SelectOperator -> ReduceSinkOperator
- Stage 4: ExtractOperator -> LimitOperator -> FileSinkOperator
Spark SQL:
- Stage 0: HiveTableScan -> Filter -> Project -> Exchange
- Stage 1: HiveTableScan -> Exchange
- Stage 2: ShuffleHashJoin -> Project -> Partial Aggregate -> Exchange
- Stage 3: Aggregate -> TakeOrdered & Limit

20 From SQL to Spark plans
[Figure: the same Shark and Spark SQL operator plans as on slide 19, with the stage-boundary operators highlighted]
- Shark: ReduceSink operator
  - Originally designed for MapReduce; cannot be avoided for aggregation and join
- Spark SQL: Exchange operator
  - Not added if two operations use the same partition type

21 [Figure: the same Shark and Spark SQL operator plans as on slide 19, with the terminal operators highlighted]
- Shark: uses FileSinkOperator
  - Has to write the result into HDFS
- Spark SQL: uses actions
  - Can return results to the driver or save them as an RDD

22 Major differences (SQL -> Spark plans)
- Stage separation
  - Shark: ReduceSink operator. Originally designed for MapReduce; cannot be avoided for aggregation and join
  - Spark SQL: Exchange operator. Not added if two operations use the same partition type
- Terminal operator
  - Shark: uses FileSinkOperator; has to write the result into HDFS
  - Spark SQL: uses actions; can return the result to the driver or save it as an RDD
- Additional optimization rules
  - Spark SQL: aggregation on a small data size can be executed on the driver
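The rule that an Exchange is not added when two operations use the same partitioning can be sketched as a small planning pass. This is a deliberately simplified model and not Catalyst code: the function name and the linear operator list are assumptions, and a real planner works on trees with two-sided joins.

```python
def plan_with_exchanges(operators):
    """Insert an Exchange only where the required partitioning changes.

    operators is a list of (name, required_key) pairs in execution order;
    required_key is None when the operator has no partitioning requirement.
    """
    plan = []
    current_key = None  # how the data is currently partitioned (None = unknown)
    for name, required_key in operators:
        if required_key is not None and required_key != current_key:
            plan.append(f"Exchange({required_key})")  # shuffle: new stage
            current_key = required_key
        plan.append(name)
    return plan


# Join and aggregation on the same key: one Exchange is enough.
same_key = plan_with_exchanges(
    [("HiveTableScan", None), ("Join", "k"), ("Aggregate", "k")])

# Different keys, as in the join query above (join on the URL,
# group by the source IP): a second Exchange is needed.
different_keys = plan_with_exchanges(
    [("HiveTableScan", None), ("Join", "url"), ("Aggregate", "ip")])
```

In this model a ReduceSink-style plan would correspond to inserting the shuffle before every aggregation and join regardless of `current_key`.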


23 From plan to RDD (Join)
[Figure: RDD lineage generated for the join by each system, listed here from the output back to the inputs]
Shark lineage:
- MapPartitionsRDD (ComputeJoin)
- CoGroupedRDD (RDDs)
- ShuffledRDD (Repartition), one per input
- RDD {ReduceSinkOp.processPartition}, one per input
- RDD {FilterOp.processPartition}, on the filtered input
- HadoopRDD1, HadoopRDD2
Spark SQL lineage:
- ZippedPartitionsRDD (ShuffleHashJoin)
- ShuffledRDD (Exchange), one per input
- MapPartitionsRDD (Exchange), one per input
- MapPartitionsRDD (ProjectFunc)
- MapPartitionsRDD (FilterFunc)
- HadoopRDD1, HadoopRDD2
Shark:
- Shuffle operation in ReduceSinkOp
  - Serializes the value to reduce its size
  - A (key, value) pair is serialized as bytes
- Creates a new RDD called CoGroupedRDD for JoinOp
  - Specialized shuffle fetch
Spark SQL:
- All operations are fitted into existing RDDs
  - Just like a normal Spark job
  - But may lack low-level execution optimizations for query processing
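The two-step shape of the Shark join, a co-group followed by a per-key join computation, can be mimicked with ordinary dictionaries. The sketch below is plain Python with hypothetical function names and sample pairs; it shows the data flow only and says nothing about Shark's serialization or its specialized shuffle fetch.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right):
    """Group two lists of (key, value) pairs by key into (left_vals, right_vals)."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)


def join_from_cogroup(grouped):
    """Emit one output row per matching (left, right) value pair of each key."""
    return [(key, (lv, rv))
            for key, (left_vals, right_vals) in grouped.items()
            for lv, rv in product(left_vals, right_vals)]


# Hypothetical inputs keyed by URL.
visits = [("u1", "a"), ("u2", "b"), ("u1", "c")]
ranks = [("u1", 10), ("u3", 30)]

grouped = cogroup(visits, ranks)
joined = join_from_cogroup(grouped)
```

Keys that appear on only one side (`u2` and `u3` here) survive the co-group with an empty list on the other side and produce no rows in the inner join.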

24 The Real Execution on Spark
Query -> Abstract Syntax Tree -> Logical Plan -> Optimized Logical Plan -> Physical Plan -> Optimized Physical Plan -> RDDs -> Spark Job

25 Evolution of the Spark Core: Scheduler, Communication and GC
[Figure: scheduling flow: RDD operators (Operator1 ... OperatorN) -> DAGScheduler -> job DAG of stages (Stage1 ... StageN) -> submitMissingTasks -> one TaskSet of tasks per stage -> TaskScheduler.submitTasks -> one TaskManager per TaskSet -> SchedulerBackend.reviveOffers -> launched tasks (Seq[Seq[Task]]) -> Executor.launchTask -> running tasks on each executor, each in a TaskRunner]

Component      | From                                              | To                                        | Purpose
DAGScheduler   | Actor-based                                       | Thread-based                              | Better scalability
Task Submit    | No serialization                                  | Serialize before sending                  | Optimize communication
Block Transfer | Java NIO                                          | Netty-based                               | Better network performance
Memory GC      | MetadataCleaner to periodically clean up states   | A daemon thread, ContextCleaner, weak refs | Faster memory garbage collection


29 Evolution of the Spark Core: Cache, Shuffle and Storage
[Figure: Stage 0 and Stage 1 exchanging data through the Cache, Block Manager, Memory and Block Store]
From/To comparison:
- Components: Default Shuffle Manager, Shuffle Block Manager, Cache Manager, Disk Storage, Tachyon Support
- Default Shuffle Manager: Hash -> Sort
- Changes listed: map-side sort and spill data; file block; first put data into memory; using memory map for reading; adding index block manager; large data into disk; small file direct reading
- Purposes listed: sort-based shuffle support; better memory management; better memory usage
- Tachyon Support: No -> Yes (separate computation and storage)

30 Outline
- Background
- Evolution from Shark to Spark SQL
- Evaluation
- Conclusion

31 Benchmarks
- Versions we investigated:
  - Spark with Shark
  - Spark with Spark SQL
- Three queries from Pavlo's paper:
  - Table Scan
  - Aggregation
  - Complex Join
- Each query is run with different conditions for data filtering
Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.

32 Spark Configuration

Configuration                | Value          | Description
spark.shuffle.manager        | hash           | No sort operation
spark.shuffle.compress       | true           | Compress shuffle data to decrease its size
spark.shuffle.memoryFraction | 0.2            | Memory for shuffle aggregation
spark.storage.memoryFraction | 0.2            | Memory for cache; not needed here
spark.default.parallelism    | 300            | Default partition number for shuffle; increase it to decrease the memory usage of shuffle
spark.serializer             | JavaSerializer | Java serialization (the Spark default)
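As a small illustration, the settings in this table can be kept in one place and turned into `--conf key=value` arguments for `spark-submit`. The dictionary and helper below are a hypothetical convenience, not part of Spark; the fully qualified serializer class name is an assumption about what the slide's "JavaSerializer" stands for.

```python
# Settings from the configuration table (values as strings).
settings = {
    "spark.shuffle.manager": "hash",
    "spark.shuffle.compress": "true",
    "spark.shuffle.memoryFraction": "0.2",
    "spark.storage.memoryFraction": "0.2",
    "spark.default.parallelism": "300",
    # Assumed full class name for the slide's "JavaSerializer".
    "spark.serializer": "org.apache.spark.serializer.JavaSerializer",
}


def to_submit_args(settings):
    """Render settings as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(settings)
```

Keeping the values in a single dictionary makes it easier to rerun both systems with identical settings.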

33 Performance comparison
- 9 nodes, each with 16 GB of memory
- Data size:
  - rankings: 90 million rows, 5 GB
  - uservisits: 700 million rows, 100 GB
- Each query runs with 3 different WHERE conditions, giving different sizes of filtered data (in ascending order)
[Figure: three bar charts of run time (s) for Select, Aggregation, and Join, comparing Spark SQL and Shark]

34 Observations
- Spark SQL performs worse than Shark when the size of the filtered data becomes large
  - A huge increase in GC time
  - A huge amount of data fetched during shuffle
[Figure: two bar charts comparing Shark and Spark SQL on Aggregation3 and Join3: GC time (s) and total shuffle read data size (GB)]

35 Conclusions (1)
- Spark SQL achieves better performance when memory is sufficient
  - Users should design queries carefully, depending on the size of the data
- Spark SQL needs more optimization of its data filtering and join mechanisms
  - The low-level execution should have mechanisms specific to query processing
  - One-size-fits-all is difficult

36 Conclusions (2)
Garbage collection remains a main factor affecting overall performance
- Lack of resource awareness and effective memory management
  - Spark relies mainly on the JVM for memory management and thread scheduling
- Lots of intermediate data are produced when executing complex computations
  - A consequence of the read-only design of RDDs

37 Thank you!
