Shark: Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker


Agenda
- Intro to Spark
- Apache Hive
- Shark
- Shark's improvements over Hive
- Demo
- Alpha status
- Future directions

What Spark Is
- Not a wrapper around Hadoop: a separate, fast, MapReduce-like engine
- In-memory data storage for very fast iterative queries
- Powerful optimization and scheduling
- Interactive use from the Scala interpreter
- Compatible with Hadoop's storage APIs: can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

Project History
- Started in summer 2009
- Open sourced in early 2010
- In use at UC Berkeley, UCSF, Princeton, Klout, Conviva, Quantifind, Yahoo! Research

Spark Programming Model
- Resilient distributed datasets (RDDs)
- Distributed collections of Scala objects
- Can be cached in memory across cluster nodes
- Manipulated like local Scala collections
- Automatically rebuilt on failure

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split("\t")(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count    // action
cachedMsgs.filter(_.contains("bar")).count
...

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
(The slide's diagram shows the driver shipping tasks to workers, each worker caching its block of the data.)

Apache Hive
- Data warehouse solution developed at Facebook
- SQL-like language called HiveQL to query structured data stored in HDFS
- Queries compile to Hadoop MapReduce jobs

Hive Principles
- SQL provides a familiar interface for users
- Extensible types, functions, and storage formats
- Horizontally scalable with high performance on large datasets

Hive Applications
- Reporting
- Ad hoc analysis
- ETL for machine learning

Hive Downsides
- Not interactive: Hadoop startup latency is ~20 seconds, even for small jobs
- No query locality: if queries operate on the same subset of data, they still run from scratch
- Reading data from disk is often the bottleneck
- Requires a separate machine learning dataflow

Shark Motivation
- Exploit temporal locality: the working set of data can often fit in memory to be reused between queries
- Provide low latency on small queries
- Integrate distributed UDFs into SQL

Introducing Shark
- Shark = Spark + Hive
- Run HiveQL queries through Spark with Hive UDFs, UDAFs, and SerDes
- Utilize Spark's in-memory RDD caching and flexible language capabilities

Shark in the AMP Stack
(Stack diagram: Shark, Bagel (Pregel on Spark), Spark Streaming, and debug tools run on Spark; Spark and Hadoop/MPI run on Mesos; Mesos runs on a private cluster or Amazon EC2.)

Shark
- Physical operators using Spark
- Rely on Spark's fast execution, fault tolerance, and in-memory RDDs
- Reuse as much Hive code as possible: convert the logical query plan generated by Hive into a Spark execution graph
- Compatible with Hive: run HiveQL queries on existing HDFS data using Hive metadata, without modifications

0.1 alpha: 84% of Hive tests passing (575 out of 683). http://github.com/amplab/shark

0.1 alpha
- Experimental SQL/RDD integration
- User-selected caching with CTAS
- Columnar RDD cache
- Some performance improvements

Caching
User-selected caching with CTAS:

CREATE TABLE mytable_cached AS
SELECT * FROM mytable WHERE count > 10;

mytable_cached becomes an in-memory materialized view that can be used to answer queries.

SQL/Spark Integration
- Allow users to implement sophisticated algorithms as UDFs in Spark
- Query processing UDFs are streamlined

val rdd = sc.sql2rdd("select foo, count(*) c from pokes group by foo")
println(rdd.count)
println(rdd.mapRows(_.getInt("c")).reduce(_ + _))

Performance Optimizations
- Hash-based shuffle (speeds up group-bys): Hadoop does a sort-based shuffle, which can be expensive
- Limit push-down in ORDER BY: SELECT * FROM pokes ORDER BY foo LIMIT 10
- Columnar cache
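The group-by speedup above comes from aggregating records into a hash table keyed by group instead of sorting them first. A minimal sketch of that idea in plain Scala (names are illustrative, not Shark's actual operators):

```scala
import scala.collection.mutable

object HashAggSketch {
  // Hash-based aggregation: one O(1) map update per record, no sort step.
  def countByKey(pairs: Seq[(String, Int)]): Map[String, Int] = {
    val agg = mutable.Map.empty[String, Int].withDefaultValue(0)
    for ((k, v) <- pairs) agg(k) += v // accumulate into the group's slot
    agg.toMap
  }

  def main(args: Array[String]): Unit = {
    val out = countByKey(Seq(("a", 1), ("b", 2), ("a", 3)))
    assert(out == Map("a" -> 4, "b" -> 2))
    println(out)
  }
}
```

A sort-based shuffle would instead order all records by key before grouping, paying O(n log n) even when the number of distinct groups is small.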

Large Caches in the JVM
- Java objects have large overhead (2-4x the actual data size)
- Boxing/unboxing is slow
- GC is a bottleneck if we have many long-lived objects

Example: Caching Rows
A row of (Int, Double, Boolean), e.g. (25, 10.7, true), takes 24 + 24 + 16 = 64 bytes* as boxed Java objects, but the actual data is only ~13 bytes (4 + 8 + 1).
*Estimated on a 64-bit VM; implementations may vary.

Possible Solutions
1. Store deserialized Java objects
2. Serialize objects to an in-memory byte array
3. Use a memory slab external to the JVM
4. Store data in primitive arrays

Deserialized Java Objects
+ Fast, requires little CPU (up to 5x speedup)
- Poor memory usage (3x worse)
- Lots of garbage collection overhead

Serialize to Bytes
+ Best memory efficiency
+ Efficient GC
- CPU usage to serialize/deserialize each object

Store in External Slab
+ Efficient memory
+ No GC
- CPU usage to serialize/deserialize each object
- More difficult to implement

Primitive Column Arrays
+ Efficient memory (similar to byte arrays)
+ Efficient GC
+ No extra CPU overhead
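To make the fourth option concrete, here is a hedged sketch (not Shark's actual code) of storing the slide's (Int, Double, Boolean) rows as one primitive array per column: no boxing, and only three long-lived objects regardless of row count.

```scala
object ColumnarSketch {
  // Row-oriented: each row is a tuple of boxed fields on the heap.
  val rows: Seq[(Int, Double, Boolean)] =
    Seq((25, 10.7, true), (24, 3.5, false), (16, 6.4, true))

  // Column-oriented: one primitive array per column, no per-row objects.
  val intCol: Array[Int]      = rows.map(_._1).toArray
  val dblCol: Array[Double]   = rows.map(_._2).toArray
  val boolCol: Array[Boolean] = rows.map(_._3).toArray

  // A query touching only the Int column never materializes the others.
  def sumInts: Int = intCol.sum

  def main(args: Array[String]): Unit = {
    assert(sumInts == 65) // 25 + 24 + 16
    println(sumInts)
  }
}
```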

Column vs Row Storage

Row storage:       Column storage:
(1, john, 4.1)     1, 2, 3
(2, mike, 3.5)     john, mike, sally
(3, sally, 6.4)    4.1, 3.5, 6.4

Columnar Benefits
- Better compression
- Fewer objects to track
- Only materialize necessary columns => less CPU

Columnar in Shark
- Store all primitive columns in primitive arrays on each node
- Less deserialization overhead
- Space efficient and easier to garbage collect

Result: achieved efficient storage space without sacrificing performance.

Limit Pushdown in Sort
SELECT * FROM table ORDER BY key LIMIT 10;

Main Idea
The limit can be applied before sorting all the data, to increase performance: each mapper computes its top k, and then a single reducer merges them.
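A minimal sketch of that two-phase scheme in plain Scala (the names and the use of local Seqs are illustrative; in Shark the partitions would be RDD partitions):

```scala
import scala.collection.mutable.PriorityQueue

object TopKSketch {
  // Map side: keep a running top-k per partition in a bounded min-heap,
  // O(n log k) per partition.
  def topK(part: Seq[Int], k: Int): Seq[Int] = {
    val heap = PriorityQueue.empty[Int](Ordering[Int].reverse) // min on top
    for (x <- part) {
      heap.enqueue(x)
      if (heap.size > k) heap.dequeue() // evict the current minimum
    }
    heap.toSeq
  }

  // Reduce side: a single reducer merges the partial top-k lists.
  def merge(parts: Seq[Seq[Int]], k: Int): Seq[Int] =
    parts.flatten.sorted(Ordering[Int].reverse).take(k)

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(5, 1, 9), Seq(7, 3, 8), Seq(2, 6, 4))
    val result = merge(partitions.map(topK(_, 2)), 2)
    assert(result == Seq(9, 8))
    println(result)
  }
}
```

The reducer only ever sees (number of mappers) x k values, instead of the full sorted output of every mapper.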

Possible Implementations
1. PriorityQueue (heap) to keep a running top k in one traversal on each mapper: O(n log k)

2. QuickSelect algorithm: essentially quicksort, but prunes the elements on one side of the pivot when possible: O(n) expected
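An illustrative quickselect in Scala (a simple functional sketch, not Guava's in-place implementation): partition around a random pivot and recurse only into the side that can still contain part of the answer, giving O(n) expected time versus O(n log n) for a full sort.

```scala
import scala.util.Random

object QuickSelectSketch {
  // Returns the k smallest elements of xs (k plays the role of LIMIT).
  def select(xs: Vector[Int], k: Int): Vector[Int] =
    if (k <= 0) Vector.empty
    else if (xs.length <= k) xs
    else {
      val pivot = xs(Random.nextInt(xs.length))
      val lo = xs.filter(_ < pivot)  // strictly below the pivot
      val eq = xs.filter(_ == pivot)
      val hi = xs.filter(_ > pivot)  // pruned entirely when lo+eq cover k
      if (k <= lo.length) select(lo, k)
      else if (k <= lo.length + eq.length) lo ++ eq.take(k - lo.length)
      else lo ++ eq ++ select(hi, k - lo.length - eq.length)
    }

  def main(args: Array[String]): Unit = {
    val ans = select(Vector(9, 1, 7, 3, 5, 8, 2), 3).sorted
    assert(ans == Vector(1, 2, 3))
    println(ans)
  }
}
```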

Result
We chose Google Guava's implementation of QuickSelect. We observed much better performance than Hive, which applies the limit after sorting on each reducer.

Demo
3-node EC2 large cluster (2 virtual cores, 7 GB of RAM per node); synthetic TPC-H data (2x scale factor)

Some numbers while we wait for Hive to finish
- Running the Brown/Stonebraker benchmark, 70 GB [1], also used on the Hive mailing list [2]
- 10 Amazon EC2 High-Memory nodes (30 GB of RAM/node)
- Naively cache input tables
- Compare Shark to Hive 0.7

[1] http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
[2] https://issues.apache.org/jira/browse/HIVE-396

Benchmark: Query 1
30 GB input table
SELECT * FROM grep WHERE field LIKE '%XYZ%';

Benchmark: Query 2
5 GB input table
SELECT pagerank, pageurl FROM rankings WHERE pagerank > 10;

Status Passing 575 Hive tests out of 683. You can help us decide what to implement next.

Unsupported Hive Features
- ADD FILE (use the Hadoop distributed cache to distribute user-defined scripts)
- STREAMTABLE hint
- Outer join with non-equi join filter
- Table statistics (piggyback file sink)
- Tables with buckets
- ORDER BY with multiple reducers has a bug (will be fixed this week!)
- MapJoin has a serialization bug (will be fixed this week!)

Unsupported Hive Features (continued)
- Automatically converting joins to map joins
- Sort-merge map joins
- Partitions with different input formats
- Unique joins
- Writing to two or more tables in a single SELECT INSERT
- Archiving data
- Virtual columns (INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE)
- Merging small files from output

What's coming up (in a month)?
- Performance improvements (group-bys, joins)
- Demo of Shark on a 100-node cluster at SIGMOD 2012 (May 22, Scottsdale, AZ), mining Wikipedia visit log data

What's coming up (in a year)?
- Shark-specific query optimizer
- Automatic tuning of parameters (e.g. number of reducers)
- Off-heap caching (e.g. EhCache)
- Automatic caching based on query analysis
- Shared caching infrastructure

Conclusion
- Shark is a warehouse solution compatible with Apache Hive
- Can be orders of magnitude faster
- We are looking for people to try it, and we are happy to provide support

Thank you! Questions? http://shark.cs.berkeley.edu (will be updated tonight)