Shark: Hive on Spark



1 Optional Reading (additional material)
Shark: Hive on Spark
Prajakta Kalmegh, Duke University

2 What is Shark?
- Port of Apache Hive to run on Spark
- Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
- Similar speedups of up to 40x

3 Motivation
- Hive is great, but Hadoop's execution engine makes even the smallest queries take minutes
- Scala is good for programmers, but many data users only know SQL
- Can we extend Hive to run on Spark?

4 Hive Architecture
[Diagram: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution); Metastore; MapReduce; HDFS]

5 Shark Architecture
[Diagram: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution, Cache Mgr.); Metastore; Spark; HDFS]

6 Shark Engine: Extensions to Hive
- PDE (Partial DAG Execution)
  - Supports dynamic query optimization
  - Allows dynamic alteration of query plans based on data statistics collected at run-time
  - Optimizes the global structure of the plan at stage boundaries
- Skew Handling and Degree of Parallelism (DoP)
  - Importance of DoP for mappers vs. reducers (too few reducers can overload them)
  - Skew mitigation: fine-grained partitions are assigned to coalesced partitions using a greedy bin-packing heuristic
- Distributed Data Loading
  - Loading tasks use the data schema to extract individual fields from rows
  - Marshal a partition of data into its columnar representation
  - Store those columns in memory
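The skew-mitigation bullet can be illustrated with a small sketch. It assumes one common form of greedy bin-packing (largest partition first, always into the currently lightest bin); the slide does not say which variant Shark uses, and the function name `coalesce_partitions` and the sample sizes are made up for illustration.

```python
import heapq


def coalesce_partitions(sizes, num_reducers):
    """Assign fine-grained partitions to num_reducers coalesced partitions.

    Greedy heuristic (an assumption, not necessarily Shark's exact rule):
    visit partitions from largest to smallest and place each one in the
    coalesced partition that currently holds the least data.
    """
    if num_reducers <= 0:
        raise ValueError("num_reducers must be positive")

    # Heap entries are (total_size, bin_index); the lightest bin is on top.
    heap = [(0, b) for b in range(num_reducers)]
    heapq.heapify(heap)
    bins = [[] for _ in range(num_reducers)]

    # Indices of the fine-grained partitions, largest first.
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    for i in order:
        load, b = heapq.heappop(heap)
        bins[b].append(i)
        heapq.heappush(heap, (load + sizes[i], b))

    loads = [sum(sizes[i] for i in members) for members in bins]
    return bins, loads


# Hypothetical partition sizes observed at run time; partition 0 is skewed.
sizes = [90, 10, 20, 30, 5, 5, 40, 10]
bins, loads = coalesce_partitions(sizes, 3)
```

With these sample sizes the skewed partition ends up alone in its coalesced partition, while the smaller ones are spread across the other two.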

7 Shark Engine: Extensions to Hive
- Join Optimizations
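The slide gives only the heading. One join optimization described in the Shark paper is choosing at run time between a shuffle join and a map (broadcast) join, using the statistics that PDE collects. The sketch below shows that decision in its simplest form; the threshold value, the function name `choose_join_strategy`, and the returned labels are assumptions made for illustration, not Shark's actual API.

```python
# Hypothetical cutoff: broadcast a table only if it is at most this large.
BROADCAST_THRESHOLD_BYTES = 32 * 1024 * 1024


def choose_join_strategy(left_bytes, right_bytes,
                         threshold=BROADCAST_THRESHOLD_BYTES):
    """Pick a join strategy from input sizes observed at run time.

    Returns (strategy, broadcast_side). If the smaller input fits under the
    threshold, it is broadcast to every task (map join); otherwise both
    inputs are repartitioned by key (shuffle join).
    """
    smaller_side = "left" if left_bytes <= right_bytes else "right"
    if min(left_bytes, right_bytes) <= threshold:
        return ("map_join", smaller_side)
    return ("shuffle_join", None)


# A 10 GiB table joined with a 5 MiB table: broadcast the small one.
small_case = choose_join_strategy(10 * 2**30, 5 * 2**20)
# Two multi-GiB tables: fall back to a shuffle join.
large_case = choose_join_strategy(10 * 2**30, 8 * 2**30)
```

The point of deciding at a stage boundary is that the sizes are measured after upstream filters have run, so they need not be estimated in advance.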

8 Efficient In-Memory Storage
- Simply caching Hive records as Java objects is inefficient due to high per-object overhead
- Instead, Shark employs column-oriented storage using arrays of primitive types
- Benefit: similarly compact size to serialized data, but >5x faster to access
[Figure: Row Storage vs. Column Storage, illustrated with three records (1 john, 2 mike 3.5, 3 sally)]
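The row-versus-column layout can be sketched outside the JVM as well. The example below is a Python analogue, not Shark's implementation: it marshals a small partition of rows into one array per field, using the standard `array` module for the primitive-typed columns. The field names and the scores for john and sally are made up for illustration.

```python
from array import array


def rows_to_columns(rows):
    """Marshal (id, name, score) rows into one column per field."""
    ids = array("q")     # packed 64-bit signed integers
    scores = array("d")  # packed 64-bit floats
    names = []           # strings stay in a plain list in this sketch
    for row_id, name, score in rows:
        ids.append(row_id)
        names.append(name)
        scores.append(score)
    return {"id": ids, "name": names, "score": scores}


# Row storage: one tuple (one object) per record.
rows = [(1, "john", 4.0), (2, "mike", 3.5), (3, "sally", 6.0)]

# Column storage: one array per field, values stored contiguously.
columns = rows_to_columns(rows)

# A scan over a single field touches only that field's array.
total_score = sum(columns["score"])
```

The numeric columns hold raw values rather than one boxed object per value, which is the per-object overhead the slide refers to.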

10 Shark vs Spark SQL


12 Spark SQL


