Shark: SQL and Rich Analytics at Scale. Reynold Xin UC Berkeley


Challenges in Modern Data Analysis Data volumes are expanding. Faults and stragglers complicate parallel database design. Analysis is growing more complex: machine learning, graph algorithms, etc. Demand for low latency and interactivity.

MapReduce (Apache Hive, Google Tenzing, Turn Cheetah...) Enables fine-grained fault tolerance, resource sharing, and scalability. Expressive enough for machine learning algorithms. But high-latency; dismissed for interactive workloads. MPP Databases (Vertica, SAP HANA, Teradata, Google Dremel, Google PowerDrill, Cloudera Impala...) Fast! But generally not fault-tolerant; long-running queries become challenging as clusters scale up. Lack rich analytics such as machine learning and graph algorithms.

Apache Hive A data warehouse - initially developed by Facebook - puts structure/schema onto HDFS data (schema-on-read) - compiles HiveQL queries into MapReduce jobs - flexible and extensible: supports UDFs, scripts, custom serializers, and storage formats. Popular: 90+% of Facebook Hadoop jobs are generated by Hive. But slow: 30+ seconds even for simple queries.

What is Shark? A data analysis (warehouse) system that - builds on Spark (MapReduce-style deterministic, idempotent tasks) - scales out and is fault-tolerant - supports low-latency, interactive queries through in-memory computation - supports both SQL and complex analytics such as machine learning - is compatible with Apache Hive (storage, SerDes, UDFs, types, metadata). HOW DO I FIT PB OF DATA IN MEMORY???

Hive Architecture (diagram): clients (BI software such as Tableau, a command-line shell, Thrift / JDBC) talk to the driver, which uses the metastore and comprises the SQL parser, query optimizer, physical plan, and SerDes/UDFs; execution runs on MapReduce over Hadoop storage (e.g. HDFS, HBase).

Shark Architecture (diagram): the same stack as Hive, but with Spark replacing MapReduce as the execution engine, over Hadoop storage (e.g. HDFS, HBase).

Analyzing Data
CREATE EXTERNAL TABLE wiki (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://spark-data/wikipedia-sample/';
SELECT COUNT(*) FROM wiki_small WHERE text LIKE '%Berkeley%';

Caching Data in Shark
CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;
CREATE TABLE wiki_cached AS SELECT * FROM wiki;
Each creates a table stored in the cluster's memory using RDD.cache() (the second via the "_cached" table-name suffix convention).

Tuning the Degree of Parallelism Shark relies on Spark to infer the number of map tasks (automatically, based on input size). The number of reduce tasks must be specified by the user: SET mapred.reduce.tasks=499; Too small a value causes out-of-memory errors on slaves. It is usually safe to set a higher value, since task launch overhead is low in Spark.
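The out-of-memory risk above comes down to simple arithmetic: each reduce task must hold its share of the shuffled data. A minimal Python sketch (the function, threshold, and data sizes are illustrative assumptions, not part of Shark):

```python
def reducer_fits_in_memory(shuffle_bytes, num_reduce_tasks, slave_heap_bytes):
    """Rough check: each reduce task must hold roughly its partition of the
    shuffled data, so too few tasks means partitions larger than the heap."""
    per_task = shuffle_bytes / num_reduce_tasks
    return per_task <= slave_heap_bytes

shuffle = 500 * 10**9   # assume 500 GB of shuffled data
heap = 4 * 10**9        # assume ~4 GB of usable heap per slave task

print(reducer_fits_in_memory(shuffle, 50, heap))   # 10 GB per task: too big
print(reducer_fits_in_memory(shuffle, 499, heap))  # ~1 GB per task: fits
```

This is why erring on the high side (e.g. 499 tasks) is the safer default when task launch is cheap.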

Demo: 18 months of Wikipedia traffic statistics.

Engine Extensions and Features Partial DAG Execution (coming soon) Columnar Memory Store Machine Learning Integration Hash-based Shuffle vs Sort-based Shuffle Data Co-partitioning (coming soon) Partition Pruning based on Range Statistics Distributed Data Loading Distributed sorting Better push-down of limits...

Partial DAG Execution (PDE) How to optimize the following query? SELECT * FROM table1 a JOIN table2 b ON a.key=b.key WHERE my_crazy_udf(b.field1, b.field2) = true; Hard to estimate cardinality! Without cardinality estimation, a cost-based optimizer breaks down.

Partial DAG Execution (PDE) PDE allows dynamic alteration of query plans based on statistics collected at run time. Shark can gather customizable statistics at global and per-partition granularities while materializing map output: - partition sizes and record counts (skew detection) - heavy hitters - approximate histograms. It then alters the query plan based on these statistics: - map join vs. shuffle join - symmetric vs. non-symmetric hash join. (Diagram: the same two-stage join of Table 1 and Table 2, executed either as a map join or as a shuffle join.)
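The map-join-vs-shuffle-join decision can be sketched as a simple rule over the run-time sizes gathered while materializing map output. A minimal Python illustration (the function name and threshold are assumptions, not Shark's actual code):

```python
def choose_join_strategy(table_sizes_bytes, broadcast_threshold=10 * 1024**2):
    """Pick a join strategy from run-time statistics, in the spirit of PDE:
    if one side's materialized output turns out to be small, broadcast it to
    every node (map join); otherwise shuffle both sides on the join key."""
    smaller = min(table_sizes_bytes)
    if smaller <= broadcast_threshold:
        return "map join"
    return "shuffle join"

# Sizes observed at run time, not estimated from static metadata:
print(choose_join_strategy([64 * 1024**3, 2 * 1024**2]))  # small side -> map join
print(choose_join_strategy([64 * 1024**3, 8 * 1024**3]))  # both large -> shuffle join
```

The point of PDE is that this choice is made after the map output exists, so it does not depend on cardinality estimates that a UDF predicate would break.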

Columnar Memory Store Simply caching Hive records as JVM objects is inefficient. Shark employs column-oriented storage using arrays of primitive objects. (Diagram: row storage lays each record out contiguously, e.g. (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4); column storage lays each column out contiguously, e.g. [1, 2, 3], [john, mike, sally], [4.1, 3.5, 6.4].) Compact storage (as much as 5x smaller footprint). JVM garbage-collection friendly. CPU-efficient compression (e.g. dictionary encoding, run-length encoding, bit packing).
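Two of the compression schemes named above are easy to see on a single column. A small Python sketch of dictionary encoding followed by run-length encoding (illustrative only; Shark's actual columnar store is JVM primitive arrays):

```python
def dictionary_encode(column):
    """Replace repeated string values with small integer codes."""
    codes, dictionary = [], {}
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        codes.append(dictionary[value])
    return codes, dictionary

def run_length_encode(codes):
    """Collapse runs of identical codes into (code, run_length) pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1] = (code, runs[-1][1] + 1)
        else:
            runs.append((code, 1))
    return runs

col = ["US", "US", "US", "FR", "FR", "US"]
codes, d = dictionary_encode(col)
print(codes)                     # [0, 0, 0, 1, 1, 0]
print(run_length_encode(codes))  # [(0, 3), (1, 2), (0, 1)]
```

Both schemes only work well when values of one column sit together, which is exactly what column-oriented layout provides.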

Machine Learning Integration Unified system for query processing and machine learning. Write machine learning algorithms in Spark, optimized for iterative computations. Query processing and ML share the same set of workers and caches.
def logRegress(points: RDD[Point]): Vector = {
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}
val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")
val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")), extractFeature2(row.getStr("country")), ...) }
val trainedVector = logRegress(features.cache())
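The gradient loop on the slide can be mirrored in plain Python to make the math concrete (a stand-in for the RDD version, with a hypothetical toy dataset; labels are in {-1, +1}):

```python
import math
import random

def log_regress(points, dims, iterations=10, seed=0):
    """Batch-gradient logistic regression, mirroring the Scala sketch:
    start from a random weight vector, then repeatedly subtract the summed
    gradient (1/(1+exp(-y*(w.x))) - 1) * y * x over all points."""
    rng = random.Random(seed)
    w = [2 * rng.random() - 1 for _ in range(dims)]
    for _ in range(iterations):
        gradient = [0.0] * dims
        for x, y in points:
            dot = sum(wi * xi for wi, xi in zip(w, x))
            scale = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
            for j in range(dims):
                gradient[j] += scale * x[j]
        w = [wi - gi for wi, gi in zip(w, gradient)]
    return w

# Toy data, separable on the first coordinate:
pts = [([1.0, 1.0], 1), ([2.0, 1.0], 1), ([-1.0, 1.0], -1), ([-2.0, 1.0], -1)]
w = log_regress(pts, 2)
print(w[0] > 0)  # the weight on the separating coordinate turns positive
```

In Shark the same loop runs over a cached RDD of features produced by a SQL query, so each iteration rereads memory rather than disk.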

Conviva Warehouse Queries (1.7 TB). (Chart: runtime in seconds for Shark, Shark (disk), and Hive on queries Q1-Q4, on an axis up to 100 seconds; Shark answers each query in about a second: 1.1, 0.8, 0.7, and 1.0 s.)

Machine Learning (1B records, 10 features/record). (Charts: runtime in seconds for Shark/Spark vs. Hadoop; k-means: Shark/Spark 4.1 s on an axis up to 150 s; logistic regression: Shark/Spark 0.96 s on an axis up to 120 s.)

Getting Started ~5 minutes to install Shark locally - https://github.com/amplab/shark/wiki The Spark EC2 AMI comes with Shark installed (in /root) - spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name> Also supports Amazon Elastic MapReduce (EMR) - http://tinyurl.com/spark-emr Use Apache Mesos or Spark standalone cluster mode for private clouds.

Open Source Development Spark/Shark is a very small code base. - Spark: 20K LOC - Shark: 7K LOC Easy to adapt and tailor to specific use cases. Has already accepted major contributions from Yahoo!, ClearStory Data, and Intel. Mailing list: shark-users @ googlegroups

Summary By using Spark as the execution engine and employing novel and traditional database techniques, Shark bridges the gap between MapReduce and MPP databases. It can answer queries up to 100x faster than Hive and run machine learning 100x faster than Hadoop MapReduce. Try it out on EC2 (takes 10 minutes to spin up a cluster): http://shark.cs.berkeley.edu

backup slides

Shark vs. Impala:
- Focus: Shark integrates SQL with complex analytics; Impala targets data warehousing / OLAP.
- Execution: Shark uses Spark (MapReduce-like); Impala uses parallel-database techniques.
- In-memory: Shark supports in-memory tables; Impala does not (buffer cache only).
- Fault tolerance: Shark tolerates slave failures; Impala does not.
- Large (out-of-core) joins: Shark yes; Impala no.
- UDFs: Shark yes; Impala no.

Why are previous MR-based systems slow? Disk-based intermediate outputs. Inferior data format and layout (no control over data co-partitioning). Execution strategies (lack of optimization based on data statistics). Task scheduling and launch overhead!

Task Scheduling and Launch Overhead Hadoop uses heartbeats to communicate scheduling decisions; its task launch delay is 5-10 seconds. Spark uses an event-driven architecture and can launch tasks in 5 ms, which enables - better parallelism - easier straggler mitigation - elasticity - multi-tenant resource sharing.

Task Scheduling and Launch Overhead. (Charts: time in seconds versus number of tasks, 0 to 5000, for Hadoop and for Spark; the Hadoop axis runs to 6000 seconds, the Spark axis to 200.)