Shark: Hive (SQL) on Spark

Size: px

Start display at page:

Download "Shark: Hive (SQL) on Spark"

Alison Gibbs
6 years ago
Views:

1 Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 29, 2013 UC BERKELEY

2 Stage 0:M ap-shuffle-reduce M apper(row ) { fields = row.split("\t") em it(fields[0],fields[1]); } Reducer(key,values) { sum = 0; for (value in values) { sum + = value; } em it(key,sum ); } Stage 1:M ap-shuffle M apper(row ) {... em it(page_view s,page_nam e); }...shuffle Stage 2:Local data = open("stage1.out") for (iin 0 to 10) { print(data.getn ext()) }

3 SELECT page_nam e,su M (page_view s) view s FRO M w ikistats G RO U P BY page_nam e O RD ER BY view s D ESC LIM IT 10;

4 Stage 0:M ap-shuffle-reduce M apper(row ) { fields = row.split("\t") em it(fields[0],fields[1]); } Reducer(key,values) { page_view s = 0; for (page_view s in values) { sum + = value; } em it(key,sum ); } Stage 1:M ap-shuffle M apper(row ) {... em it(page_view s,page_nam e); }...shuffle sorts the data Stage 2:Local data = open("stage1.out") for (iin 0 to 10) { print(data.getn ext()) }

6 Outline Hive and Shark Usage Under the hood

7 Apache Hive Puts structure/schema onto HDFS data Compiles HiveQL queries into MapReduce jobs Very popular: 90+% of Facebook Hadoop jobs generated by Hive Initially developed by Facebook

8 Scalability Massive scale out and fault tolerance capabilities on commodity hardware Can handle petabytes of data Easy to provision (because of scaleout)

9 Extensibility Data types: primitive types and complex types User-defined functions Scripts Serializer/Deserializer: text, binary, JSON Storage: HDFS, Hbase, S3

10 But slow Takes 20+ seconds even for simple queries "A good day is when I can run 6 Hive queries

11 Shark Analytic query engine compatible with Hive Supports Hive QL, UDFs, SerDes, scripts, types A few esoteric features not yet supported Makes Hive queries run much faster Builds on top of Spark, a fast compute engine Allows (optionally) caching data in a cluster s memory

13 Use cases Interactive query & BI (e.g. Tableau) Reduce reporting turn-around time Integration of SQL and machine learning pipeline

14 Much faster? 100X faster with in-memory data 2-10X faster with on-disk data

15 Performance (1.7TB on 100 EC2 nodes) Shark Shark(disk) Hive 100 Runtime(seconds) Q1 Q2 Q3 Q4

16 Outline Hive and Shark Usage Under the hood

17 Data Model Tables: unit of data with the same schema Partitions: e.g. range-partition tables by date Buckets: hash-partitions within partitions (not yet supported in Shark)

18 Data Types Primitive types TINYINT, SMALLINT, INT, BIGINT BOOLEAN FLOAT, DOUBLE STRING TIMESTAMP Complex types Structs: STRUCT {a INT; b INT} Arrays: ['a', 'b', 'c ] Maps (key-value pairs): M['key ]

19 Hive QL Subset of SQL Projection, selection Group-by and aggregations Sort by and order by Joins Sub-queries, unions Hive-specific Supports custom map/reduce scripts (TRANSFORM) Hints for performance optimizations

20 Analyzing Data CREATE EXTERN AL TABLE w iki(id BIG IN T,title STRIN G,last_m odified STRIN G,xm lstrin G,text STRIN G )RO W FO RM AT D ELIM ITED FIELD S TERM IN ATED BY '\t'lo CATIO N 's3n://spark-data/w ikipedia-sam ple/'; SELECT CO U N T(*) FRO M w ikiw H ERE TEXT LIKE '% Berkeley% ';

21 Caching Data in Shark CREATE TABLE m ytable_cached AS SELECT * FRO M m ytable W H ERE count > 10; Creates a table cached in a cluster s memory using RDD.cache ()

22 Spark Integration Unified system for SQL, graph processing, machine learning All share the same set of workers and caches def logregress(points: RDD[Point]): Vector { var w = Vector(D, _ => 2 * rand.nextdouble - 1) f or (i <- 1 to ITERATIONS) { val gradient = points.map { p => val denom = 1 + exp(-p.y * (w dot p.x)) (1 / denom - 1) * p.y * p.x }.reduce(_ + _) w -= gradient } w } val users = sql2rdd("select * FROM user u JOIN comment c ON c.uid=u.uid") val features = users.maprows { row => new Vector(extractFeature1(row.getInt("age")), extractfeature2(row.getstr("country")),...)} val trainedvector = logregress(features.cache())

23 Tuning Degree of Parallelism SET m apred.reduce.tasks= 50; Shark relies on Spark to infer the number of map tasks (automatically based on input size) Number of reduce tasks needs to be specified Out of memory error on slaves if num too small

24 Outline Hive and Shark Data Model Under the hood

25 How? A better execution engine Hadoop MR is ill-suited for SQL Optimized storage format Columnar memory store Various other optimizations Fully distributed sort, data copartitioning, partition pruning, etc

26 Hive Architecture

27 Shark Architecture

28 Why is Spark a better engine? Extremely fast scheduling ms in Spark vs secs in Hadoop MR Support for general DAGs Each query is a job rather than stages of jobs Many more useful primitives Higher level APIs Broadcast variables

29 select page_nam e, sum (page_view s) hits from w ikistats_cached w here page_nam e like "% berkeley% group by page_nam e order by hits;

30 select page_nam e,sum (page_view s) hits from w ikistats_cached w here page_nam e like "% berkeley% group by page_nam e order by hits; filter (map)groupby sort

31 Columnar Memory Store Column-oriented storage for inmemory tables Yahoo! contributed CPU-efficient compression (e.g. dictionary encoding, run-length encoding) RowStorage ColumnStorage 3 20X reduction in data size 1 john mike sally 6.4 john mike sally

32 Ongoing Work Code generation for query plan (Intel) BlinkDB integration (UCB) Bloom-filter based pruning (Yahoo!) More intelligent optimizer (UCB & Yahoo! & ClearStory & OSU)

33 Getting Started ~5 mins to install Shark locally Spark EC2 AMI comes with Shark installed (in /root/shark) Also supports Amazon Elastic MapReduce Use Mesos or Spark standalone cluster for private cloud

34 AMPCamp Each on-site audience gets a 4-node EC2 cluster preloaded with Wikipedia traffic statistics data Live streaming audiences get an AMI preloaded with all software (Mesos, Spark, Shark) Use Spark and Shark to analyze the data

35 More Information Hive resources: y/hive/ GettingStarted Shark resources:

Shark: Hive (SQL) on Spark

Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 21, 2012 UC BERKELEY SELECT page_name, SUM(page_views) views FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT 10; Stage 0: Map-Shuffle-Reduce