Shark: SQL and Rich Analytics at Scale. Yash Thakkar ( ) Deeksha Singh ( )

Similar documents
Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

Shark: SQL and Rich Analytics at Scale

Shark: SQL and Rich Analytics at Scale

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Fault Tolerance in K3. Ben Glickman, Amit Mehta, Josh Wheeler

Shark: SQL and Rich Analytics at Scale. Reynold Xin UC Berkeley

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Actian Vector Benchmarks. Cloud Benchmarking Summary Report

Evolution From Shark To Spark SQL:

Today s Papers. Architectural Differences. EECS 262a Advanced Topics in Computer Systems Lecture 17

Resilient Distributed Datasets

Shark: Hive on Spark

HadoopDB: An open source hybrid of MapReduce

Lecture 11 Hadoop & Spark

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

A comparative analysis of state-of-the-art SQL-on-Hadoop systems for interactive analytics

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

OLTP vs. OLAP Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Shark: Hive (SQL) on Spark

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia Final Exam. Administrivia Final Exam

CompSci 516: Database Systems

Cloud Computing & Visualization

Spark Overview. Professor Sasu Tarkoma.

Shark: SQL and Analytics with Cost-Based Query Optimization on Coarse-Grained Distributed Memory

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

CSE 444: Database Internals. Lecture 23 Spark

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

April Copyright 2013 Cloudera Inc. All rights reserved.

Database in the Cloud Benchmark

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida)

Massive Online Analysis - Storm,Spark

Spark, Shark and Spark Streaming Introduction

Big Data Infrastructures & Technologies

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

Cloud Analytics Database Performance Report

The Stratosphere Platform for Big Data Analytics

MapReduce, Hadoop and Spark. Bompotas Agorakis

DATA SCIENCE USING SPARK: AN INTRODUCTION

Spark: A Brief History.

Apache Hive for Oracle DBAs. Luís Marques

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

Practical Big Data Processing An Overview of Apache Flink

7. Query Processing and Optimization

Accelerate Big Data Insights

Shark: Fast Data Analysis Using Coarse-grained Distributed Memory

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced

Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Fast, Interactive, Language-Integrated Cluster Computing

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Distributed Systems. 22. Spark. Paul Krzyzanowski. Rutgers University. Fall 2016

Spark & Spark SQL. High- Speed In- Memory Analytics over Hadoop and Hive Data. Instructor: Duen Horng (Polo) Chau

Sparrow. Distributed Low-Latency Spark Scheduling. Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica

Simba: Towards Building Interactive Big Data Analytics Systems. Feifei Li

Achieving Horizontal Scalability. Alain Houf Sales Engineer

Distributed Computation Models

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Shark: Hive (SQL) on Spark

Shark: Fast Data Analysis Using Coarse-grained Distributed Memory

Unifying Big Data Workloads in Apache Spark

Batch Processing Basic architecture

Parallel Processing Spark and Spark SQL

Spark-GPU: An Accelerated In-Memory Data Processing Engine on Clusters

Apache Spark Internals

Technical Report. Recomputation-based data reliability for MapReduce using lineage. Sherif Akoush, Ripduman Sohan, Andy Hopper. Number 888.

Oracle Big Data Connectors

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

Workload Characterization and Optimization of TPC-H Queries on Apache Spark

Chapter 4: Apache Spark

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters

Principal Software Engineer Red Hat Emerging Technology June 24, 2015

Accelerating Spark Workloads using GPUs

microsoft

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

An Introduction to Apache Spark

Spark and Spark SQL. Amir H. Payberah. SICS Swedish ICT. Amir H. Payberah (SICS) Spark and Spark SQL June 29, / 71

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

Delft University of Technology Parallel and Distributed Systems Report Series

Apache Flink Big Data Stream Processing

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters

A Fast and High Throughput SQL Query System for Big Data

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F.

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig

Spark. Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

Global Analytics in the Face of Bandwidth and Regulatory Constraints

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

Transcription:

Shark: SQL and Rich Analytics at Scale Yash Thakkar (2642764) Deeksha Singh (2641679)

RDDs as foundation for relational processing in Shark: Resilient Distributed Datasets (RDDs): RDDs can be written at the speed of DRAM instead of the speed of network. Spark can keep just one copy of each RDD partition in memory. Lost RDD partitions can be rebuilt in parallel across the other nodes. Partitions can be recomputed on other nodes if any of the node is slow.

Fault Tolerance: Shark can tolerate the loss of any set of worker nodes. Even true within a query. Recovery is parallelized across the cluster. The deterministic nature of RDDs also enables straggler mitigation. Recovery works even for queries that combine SQL and machine learning UDFs.

Engine Extensions: Partial DAG Execution (PDE): is a technique that allows dynamic alteration of query plans based on data statistics collected at run-time. Mechanism: 1. Gathers customizable statistics at global and per-partition granularities while materializing map outputs. 2. It allows the DAG to be altered based on the statistics collected. Join Optimization: Shark uses PDE to select the join strategy at runtime based on its inputs exact sizes.

Skew-handling & Degree of Parallelism: Using PDE, Shark can use individual partitions sizes to determine the number of reducers at runtime by coalescing many small, fine-grained partitions into fewer coarse partitions that are used by reduce tasks. Columnar Memory Store: Shark implements a columnar memory store on top of Spark s native memory store. 1. It stores all columns of primitive types as JVM primitive arrays. 2. Map and array are serialized and concatenated into a single byte array. 3. Each column creates only one JVM object, leading to fast GCs and a compact data representation. 4. Shark implements CPU-efficient compression schemes such as dictionary encoding, run-length encoding and bit packing.

Distributed Data Loading: Shark can load data into memory at the aggregated throughput of the CPU s processing incoming data. Data Co-partitioning: Shark allows co-partitioning two tables on a common key for faster joins in subsequent queries. This is accomplished with DISTRIBUTE clause: CREATE TABLE l_mem TBLPROPERTIES ("shark.cache"=true) AS SELECT * FROM lineitem DISTRIBUTE BY L_ORDERKEY;

Partitioning Statistics and Map Pruning: 1. Shark s memory store on each worker piggybacks the data loading process to collect statistics. 2. The information collected for each partition includes the range of each column and distinct values if the number of distinct values is small. 3. The collected statistics are sent back to master program and kept in memory for pruning during query execution. 4. When a query is issued, Shark evaluates the query s predicates against all partition statistics; partitions that do not satisfy the predicate are pruned and Shark does not launch tasks to scan them.

Experiments: Aggregate Performance: Ran two aggregation queries- 1. SELECT sourceip, SUM(adRevenue) FROM uservisits GROUP BY sourceip; 2. SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 7);

Experiments: Join Query: They joined 2 TB uservisits table with 100 GB rankings table. SELECT INTO Temp sourceip, AVG(pageRank),SUM(adRevenue) as totalrevenue FROM rankings AS R, uservisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date( 2000-01-15 ) AND Date( 2000-01-22 ) GROUP BY UV.sourceIP;

Conclusion: Shark can perform more than 100 times faster than Hive and Hadoop, even though some performance optimizations are still to be implemented. Shark exceeds the performance reported for MPP databases. Future Analytics Works: Shark can leverage the underlying engine to perform machine learning functions on the same data.

References: https://github.com/cloudera/impala. http://hadoop.apache.org/. A. Abouzeid et al. Hadoopdb: an architectural hybrid of mapreduce and dbms technologies for analytical workloads. VLDB, 2009. S. Agarwal et al. Re-optimizing data-parallel computing. In NSDI 12. G. Ananthanarayanan et al. Pacman: Coordinated memory caching for parallel jobs. In NSDI, 2012.

THANK YOU