Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han, Ronny Hajoon Ko

What Are the Problems, and Why Are They Hard?
Data volumes are expanding dramatically. Systems need to scale out, and managing hundreds of machines is hard.
Faults and stragglers are increasingly common. Higher scale raises the likelihood of both, which complicates parallel design.
Data analysis is becoming more complex. Database systems need to handle statistical methods efficiently, beyond simple roll-up and drill-down.
Users expect to query at interactive speeds. The increase in scale and complexity makes this hard to deliver.

Existing Solutions, Class A: MapReduce-based Systems (Hive, Dryad, etc.)

Why don't they work? They offer scalability, fine-grained fault tolerance, and support for complex statistical analysis (e.g. ML), but their SQL query processing latency is high.

Existing Solutions, Class B: Massively Parallel Processing (MPP) Databases (Vertica, Teradata, etc.)

Why don't they work? They offer fast parallel SQL query processing, but they are not scalable, complex analysis (e.g. UDFs, ML) is expensive or impossible, and their fault tolerance is coarse-grained.

Shark: A Hybrid Approach (Class A + B). MapReduce-style parallel query processing combined with RDBMS-style data processing.

Features of Shark: built on top of Spark using RDDs; dynamic query optimization through Partial DAG Execution (PDE); low-latency, interactive SQL queries; efficient complex analytics such as ML; compatibility with Apache Hive (metastore, HiveQL, etc.).

Shark Architecture

Example: SQL on RDD. Step 1: The user submits a query. Step 2: The Hive query compiler parses the query into an abstract syntax tree. Query: SELECT Customer.id, Transactions.price FROM Customer JOIN Transactions ON Customer.id = Transactions.buyer_id WHERE Transactions.price > 100

Example: SQL on RDD. Step 3: The abstract syntax tree is turned into a logical execution plan (basic logical optimization).

Example: SQL on RDD. Step 4: The logical execution plan is turned into a physical execution plan, i.e. a set of RDD transformations (additional optimization).

RDD: An RDD (Resilient Distributed Dataset) is a partitioned collection of records. Each RDD is either created from an external file system (e.g. HDFS) or derived by applying operators to other RDDs. Each RDD is immutable and deterministic. The lineage of an RDD is used for fault recovery.
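The lineage idea can be illustrated with a small sketch in plain Python. It does not use Spark: the `ToyRDD` class, its methods, and the sample numbers are all hypothetical and only mimic the behaviour described above. Each dataset remembers its parent and the deterministic operator that produced it, so a lost in-memory copy can be rebuilt by replaying that operator.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, and aware of its lineage."""

    def __init__(self,
                 source: Optional[Callable[[], List[int]]] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: Optional[Callable[[List[int]], List[int]]] = None):
        self._source = source   # loader for a base dataset (stands in for HDFS)
        self._parent = parent   # dataset this one was derived from, if any
        self._op = op           # deterministic operator applied to the parent
        self._cache: Optional[List[int]] = None

    def map(self, f: Callable[[int], int]) -> "ToyRDD":
        # Deriving a dataset records lineage; nothing is computed yet.
        return ToyRDD(parent=self, op=lambda rows: [f(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        return ToyRDD(parent=self, op=lambda rows: [r for r in rows if pred(r)])

    def collect(self) -> List[int]:
        # Compute on demand, replaying lineage if no cached copy exists.
        if self._cache is None:
            if self._parent is None:
                self._cache = self._source()
            else:
                self._cache = self._op(self._parent.collect())
        return list(self._cache)

    def lose_cache(self) -> None:
        """Simulate a node failure that drops the in-memory copy."""
        self._cache = None


base = ToyRDD(source=lambda: [1, 2, 3, 4, 5, 6])
result = base.map(lambda x: x * 10).filter(lambda x: x > 20)

first = result.collect()
result.lose_cache()            # pretend the node holding the data failed
recovered = result.collect()   # rebuilt from lineage, not from a replica
```

Because every operator is deterministic and every dataset is immutable, the recovered data is identical to the lost copy, and no replication of the data itself is needed.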

What are the benefits of RDDs?

What Does Hive Do Instead? Step 4: The logical execution plan is turned into a physical execution plan made of MapReduce stages.

Diff(MapReduce, Spark RDD)

Partial DAG Execution
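As described in the Shark paper, partial DAG execution gathers statistics when a stage's output is materialized and uses them to revise the rest of the plan, for example when choosing a join strategy. The following is a minimal sketch of that decision in plain Python. The function names and the size threshold are assumptions made for illustration, not Shark's actual values or API.

```python
from typing import List

# Assumed cutoff for illustration only; not the value Shark uses.
BROADCAST_THRESHOLD_BYTES = 32 * 1024 * 1024


def observed_size(partition_sizes: List[int]) -> int:
    """Total output size, summed from per-partition statistics."""
    return sum(partition_sizes)


def choose_join_strategy(left_bytes: int, right_bytes: int) -> str:
    """Pick a join strategy from sizes observed at run time."""
    if min(left_bytes, right_bytes) <= BROADCAST_THRESHOLD_BYTES:
        # One side is small: send it to every node and skip shuffling the other.
        return "map_join"
    # Both sides are large: repartition both on the join key.
    return "shuffle_join"


# Made-up statistics reported by completed map tasks.
small_side = observed_size([4_000_000, 3_000_000])
large_side = observed_size([900_000_000, 800_000_000])

plan_mixed = choose_join_strategy(small_side, large_side)
plan_large = choose_join_strategy(large_side, large_side)
```

The point of deciding at run time is that the sizes of intermediate results, such as the output of a selective filter or a UDF, are often unknown before the query starts.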

Columnar Memory Store What are the benefits?
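One commonly cited benefit of a columnar layout is that each column becomes a compact array of primitive values, so a query that reads one column does not touch the others. The sketch below shows the idea in plain Python with the standard `array` module. The records and column names are made up, and this is not how Shark itself is implemented (Shark runs on the JVM).

```python
from array import array

# Made-up (id, price) records in row-oriented form: one tuple object per row.
rows = [(1, 150.0), (2, 99.5), (3, 200.0)]

# Column-oriented form: one typed, contiguous array per column.
columns = {
    "id": array("q", (r[0] for r in rows)),      # signed 64-bit integers
    "price": array("d", (r[1] for r in rows)),   # 64-bit floats
}

# An aggregate over one column scans only that column's array.
total_price = sum(columns["price"])

# A row can still be reassembled by position when needed.
second_row = (columns["id"][1], columns["price"][1])
```

Typed arrays also make per-column compression straightforward, since all values in a column share one type.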

Distributed Data Loading: tracks metadata to customize the data compression of each RDD.

Data Co-partitioning: each partition of the parent RDD is used by at most one partition of the child RDD. Schema-aware, for faster join operations.
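A minimal sketch of why co-partitioning speeds up joins, in plain Python. The helper names, the partition count, and the sample tables are assumptions. When both tables are partitioned on the join key with the same function, matching rows always land in partitions with the same index, so each pair of partitions can be joined locally without a shuffle.

```python
from collections import defaultdict
from typing import List, Tuple

NUM_PARTITIONS = 4  # assumed partition count


def partition_by_key(rows: List[Tuple], key_index: int,
                     num_partitions: int = NUM_PARTITIONS) -> List[List[Tuple]]:
    """Hash-partition rows on the column at key_index."""
    parts: List[List[Tuple]] = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key_index]) % num_partitions].append(row)
    return parts


def copartitioned_join(left_parts, right_parts, left_key: int, right_key: int):
    """Join partition i of the left table only with partition i of the right."""
    out = []
    for lpart, rpart in zip(left_parts, right_parts):
        index = defaultdict(list)          # local hash table for this partition
        for lrow in lpart:
            index[lrow[left_key]].append(lrow)
        for rrow in rpart:
            for lrow in index.get(rrow[right_key], []):
                out.append(lrow + rrow)
    return out


# Made-up tables: (customer_id, name) and (buyer_id, price).
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
transactions = [(1, 150), (3, 200), (3, 50), (5, 999)]

joined = sorted(copartitioned_join(partition_by_key(customers, 0),
                                   partition_by_key(transactions, 0), 0, 0))
```

The join never compares rows from partitions with different indexes, which is the property the slide describes: each parent partition feeds at most one child partition.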

Partition Statistics & Map Pruning: avoid scanning irrelevant parts of the data.
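Map pruning can be sketched as follows, in plain Python. The `Partition` class, the helper names, and the sample values are assumptions. Each partition keeps the minimum and maximum of a column, and a scan with a range predicate skips any partition whose statistics show that no row can match.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Partition:
    rows: List[int]
    lo: int   # minimum value in this partition
    hi: int   # maximum value in this partition


def make_partition(rows: List[int]) -> Partition:
    """Record min/max statistics while the data is loaded."""
    return Partition(rows=rows, lo=min(rows), hi=max(rows))


def scan_greater_than(partitions: List[Partition],
                      threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of partitions actually scanned."""
    matched: List[int] = []
    scanned = 0
    for p in partitions:
        if p.hi <= threshold:
            continue  # no value here can exceed the threshold, so skip the scan
        scanned += 1
        matched.extend(v for v in p.rows if v > threshold)
    return matched, scanned


# Made-up price values, already grouped into three partitions.
partitions = [make_partition([5, 20, 60]),
              make_partition([80, 95, 100]),
              make_partition([120, 300])]

matched, scanned = scan_greater_than(partitions, 100)
```

Pruning pays off most when the data is naturally clustered on the filtered column, for example logs stored in time order and queried by date.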

Experiments: Pavlo et al. benchmark (2.1TB data); TPC-H dataset (100GB/1TB data); real Hive warehouse (1.7TB data); machine learning dataset (100GB data).

Experiment Setup: 100 m2.4xlarge nodes, each with 8 virtual cores, 68GB of memory, and 1.6TB of local storage.

Result Summary: up to 100X faster than Hive for SQL; more than 100X faster than Hadoop for ML; comparable performance to MPP databases, and even faster than MPP in some cases.

Aggregation Queries: SELECT sourceip, SUM(adrevenue) FROM uservisits GROUP BY sourceip

Join Query: effects of data co-partitioning; effects of dynamic query optimization.

Fault Tolerance: measuring Shark performance in the presence of node failures.

Real Hive Warehouse

Machine Learning: logistic regression, per-iteration runtime (seconds); k-means clustering, per-iteration runtime (seconds).

Next Steps? Implement the optimizations promised in the paper (e.g. bytecode compilation of expression evaluators). Exploit specialized data structures for more compact data representations and better cache behavior. Conduct more benchmark experiments after these optimizations, for comparison.