Parallel Processing: Spark and Spark SQL


Parallel Processing: Spark and Spark SQL
Amir H. Payberah, amir@sics.se
KTH Royal Institute of Technology
2016/09/16

Motivation (1/4)
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
MapReduce greatly simplified big data analysis on large unreliable clusters.

Motivation (2/4)
The MapReduce programming model was not designed for complex operations, e.g., data mining.

Motivation (3/4)
MapReduce is very expensive (slow), i.e., it always goes to disk and HDFS.

Motivation (4/4)
Spark extends MapReduce with more operators.
Support for advanced data flow graphs.
In-memory and out-of-core processing.

Spark vs. MapReduce (1/2)

Spark vs. MapReduce (2/2)

Challenge
How to design a distributed memory abstraction that is both fault tolerant and efficient?

Solution
Resilient Distributed Datasets (RDD)

Resilient Distributed Datasets (RDD) (1/2)
A distributed memory abstraction.
Immutable collections of objects spread across a cluster.
Like a LinkedList<MyObjects>.

Resilient Distributed Datasets (RDD) (2/2)
An RDD is divided into a number of partitions, which are atomic pieces of information.
Partitions of an RDD can be stored on different nodes of a cluster.
Built through coarse-grained transformations, e.g., map, filter, join.
Fault tolerance via automatic rebuild (no replication).
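
A minimal sketch of how partitioning can be observed from the Spark shell (assuming the usual sc variable; the data and partition count are illustrative, not from the slides):

// Create an RDD with an explicit number of partitions (4 is arbitrary here).
val nums = sc.parallelize(1 to 100, 4)

// Each partition is an atomic piece of the dataset, possibly stored on a different node.
println(nums.partitions.length)  // 4

// Count the elements held by each partition.
println(nums.mapPartitions(it => Iterator(it.size)).collect().mkString(", "))  // e.g., 25, 25, 25, 25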

RDD Applications
Applications suitable for RDDs: batch applications that apply the same operation to all elements of a dataset.
Applications not suitable for RDDs: applications that make asynchronous fine-grained updates to shared state, e.g., the storage system for a web application.

Programming Model

Spark Programming Model (1/2)
The Spark programming model is based on parallelizable operators.
Parallelizable operators are higher-order functions that execute user-defined functions in parallel.

Spark Programming Model (2/2)
A data flow is composed of any number of data sources, operators, and data sinks by connecting their inputs and outputs.
Job description based on directed acyclic graphs (DAG).

Higher-Order Functions (1/3)
Higher-order functions: RDD operators.
There are two types of RDD operators: transformations and actions.

Higher-Order Functions (2/3)
Transformations: lazy operators that create new RDDs.
Actions: launch a computation and return a value to the program or write data to external storage.
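
A minimal sketch of this laziness (assuming a Spark shell with sc; the input path is a placeholder): the transformations below are only recorded, and nothing executes until the action is called.

val lines = sc.textFile("hdfs://...")          // nothing is read yet
val lens  = lines.map(line => line.length)     // transformation: lazily recorded
val total = lens.reduce(_ + _)                 // action: triggers the actual computation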

Higher-Order Functions (3/3)

RDD Transformations - Map
All pairs are independently processed.

// Passing each element through a function.
val nums = sc.parallelize(Array(1, 2, 3))
val squares = nums.map(x => x * x)  // {1, 4, 9}

// Selecting those elements for which func returns true.
val even = squares.filter(x => x % 2 == 0)  // {4}

// Mapping each element to zero or more others.
nums.flatMap(x => Range(0, x, 1))  // {0, 0, 1, 0, 1, 2}

RDD Transformations - Reduce
Pairs with identical keys are grouped. Groups are independently processed.

val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
pets.groupByKey()                  // {(cat, (1, 2)), (dog, (1))}
pets.reduceByKey((x, y) => x + y)  // {(cat, 3), (dog, 1)}

RDD Transformations - Join
Performs an equi-join on the key. Join candidates are independently processed.

val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
                                ("about.html", "3.4.5.6"),
                                ("index.html", "1.3.3.1")))
val pageNames = sc.parallelize(Seq(("index.html", "Home"), ("about.html", "About")))
visits.join(pageNames)
// ("index.html", ("1.2.3.4", "Home"))
// ("index.html", ("1.3.3.1", "Home"))
// ("about.html", ("3.4.5.6", "About"))

Basic RDD Actions (1/2)
Return all the elements of the RDD as an array.
val nums = sc.parallelize(Array(1, 2, 3))
nums.collect()  // Array(1, 2, 3)

Return an array with the first n elements of the RDD.
nums.take(2)  // Array(1, 2)

Return the number of elements in the RDD.
nums.count()  // 3

Basic RDD Actions (2/2)
Aggregate the elements of the RDD using the given function.
nums.reduce((x, y) => x + y)  // or nums.reduce(_ + _)  // 6

Write the elements of the RDD as a text file.
nums.saveAsTextFile("hdfs://file.txt")

SparkContext
Main entry point to Spark functionality.
Available in the shell as the variable sc.
Only one SparkContext may be active per JVM.

// master: the master URL to connect to, e.g.,
// "local", "local[4]", "spark://master:7077"
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)

Creating RDDs
Turn a collection into an RDD.
val a = sc.parallelize(Array(1, 2, 3))

Load a text file from the local FS, HDFS, or S3.
val a = sc.textFile("file.txt")
val b = sc.textFile("directory/*.txt")
val c = sc.textFile("hdfs://namenode:9000/path/file")

Example 1
val textFile = sc.textFile("hdfs://...")
val words = textFile.flatMap(line => line.split(" "))
val ones = words.map(word => (word, 1))
val counts = ones.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Example 2
val textFile = sc.textFile("hdfs://...")
val sics = textFile.filter(_.contains("sics"))
val cachedSics = sics.cache()
val ones = cachedSics.map(_ => 1)
val count = ones.reduce(_ + _)

Or, equivalently:
val textFile = sc.textFile("hdfs://...")
val count = textFile.filter(_.contains("sics")).count()

Execution Engine

Spark Programming Interface
A Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster.
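
A minimal sketch of such a driver program (the application name, master URL, and input/output paths are placeholders, not from the slides):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // The driver creates the SparkContext and coordinates the parallel operations.
    val conf = new SparkConf().setAppName("WordCountApp").setMaster("local[4]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs://...")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs://...")
    sc.stop()
  }
}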

Lineage
Lineage: the transformations used to build an RDD.
RDDs are stored as a chain of objects capturing the lineage of each RDD.

val file = sc.textFile("hdfs://...")
val sics = file.filter(_.contains("sics"))
val cachedSics = sics.cache()
val ones = cachedSics.map(_ => 1)
val count = ones.reduce(_ + _)
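
The lineage chain can be inspected from the shell with the standard toDebugString method; a small sketch (the printed output is indicative only, the exact RDD names and ids vary):

// Prints the chain of RDDs behind 'ones', from the map back to the input file.
println(ones.toDebugString)
// (2) MapPartitionsRDD[3] at map ...
//  |  MapPartitionsRDD[2] at filter ...
//  |  hdfs://... HadoopRDD[0] at textFile ...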

RDD Dependencies (1/3)
Two types of dependencies between RDDs: narrow and wide.

RDD Dependencies: Narrow (2/3)
Narrow: each partition of a parent RDD is used by at most one partition of the child RDD.
Narrow dependencies allow pipelined execution on one cluster node, e.g., a map followed by a filter.

RDD Dependencies: Wide (3/3)
Wide: each partition of a parent RDD is used by multiple partitions of the child RDDs.
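
A small sketch contrasting the two dependency types (assuming a Spark shell with sc; the dependencies field is standard RDD API, the data is illustrative):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow dependency: map consumes each parent partition in place, no shuffle.
val mapped = pairs.map { case (k, v) => (k, v * 2) }
println(mapped.dependencies)   // e.g., List(org.apache.spark.OneToOneDependency@...)

// Wide dependency: groupByKey needs data from many parent partitions, so it shuffles.
val grouped = pairs.groupByKey()
println(grouped.dependencies)  // e.g., List(org.apache.spark.ShuffleDependency@...)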

Job Scheduling (1/3)
Similar to Dryad, but it takes into account which partitions of persistent RDDs are available in memory.

Job Scheduling (2/3)
When a user runs an action on an RDD, the scheduler builds a DAG of stages from the RDD lineage graph.
A stage contains as many pipelined transformations with narrow dependencies as possible.
The boundary of a stage:
Shuffles for wide dependencies.
Already computed partitions.

Job Scheduling (3/3)
The scheduler launches tasks to compute missing partitions from each stage until it computes the target RDD.
Tasks are assigned to machines based on data locality.
If a task needs a partition that is available in the memory of a node, the task is sent to that node.

RDD Fault Tolerance (1/2)
RDDs maintain lineage information that can be used to reconstruct lost partitions.
Logging the lineage rather than the actual data: no replication.
Recompute only the lost partitions of an RDD.

RDD Fault Tolerance (2/2)
The intermediate records of wide dependencies are materialized on the nodes holding the parent partitions, to simplify fault recovery.
If a task fails, it is re-run on another node, as long as its stage's parents are available.
If some stages become unavailable, tasks are submitted to compute the missing partitions in parallel.

Memory Management (1/2)
If there is not enough space in memory for a newly computed RDD partition, a partition from the least recently used RDD is evicted.
Spark provides three options for the storage of persistent RDDs:
1. In-memory storage as deserialized Java objects.
2. In-memory storage as serialized Java objects.
3. On-disk storage.

Memory Management (2/2)
When an RDD is persisted, each node stores any partitions of the RDD that it computes in memory. This allows future actions to be much faster.
An RDD is persisted using the persist() or cache() methods.
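
A small sketch of the three storage options via persist() (StorageLevel is the standard Spark API; the RDD itself is illustrative, and only one level can be set per RDD, hence the comments):

import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 1000000)

nums.persist(StorageLevel.MEMORY_ONLY)        // 1. in memory, deserialized Java objects (same as cache())
// nums.persist(StorageLevel.MEMORY_ONLY_SER) // 2. in memory, serialized Java objects
// nums.persist(StorageLevel.DISK_ONLY)       // 3. on disk

nums.count()  // the first action computes and stores the partitions
nums.count()  // later actions reuse the persisted partitions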

Structured Data Processing

Motivation
Users often prefer writing declarative queries.
RDDs lack a schema.

Hive
A system for managing and querying structured data, built on top of MapReduce.
Converts a query to a series of MapReduce phases.
Initially developed by Facebook.

Hive Data Model
Re-used from RDBMSs:
Database: a set of tables.
Table: a set of rows that have the same schema (the same columns).
Row: a single record; a set of columns.
Column: provides the value and type of a single field.

Hive API (1/2)
HiveQL: an SQL-like query language.
DDL operations (Data Definition Language): Create, Alter, Drop.
DML operations (Data Manipulation Language): Load and Insert (overwrite); does not support updating and deleting.
Query operations: Select, Filter, Join, GroupBy.

Hive API (2/2)
-- DDL: creating a table with three columns
CREATE TABLE customer (id INT, name STRING, address STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- DML: loading data from a flat file
LOAD DATA LOCAL INPATH 'data.txt' OVERWRITE INTO TABLE customer;

-- Query: joining two tables
SELECT * FROM customer c JOIN order o ON (c.id = o.cus_id);

Executing SQL Queries
Hive processes HiveQL statements and generates the execution plan through a three-phase process:
1. Query parsing: transforms a query string into a parse tree representation.
2. Logical plan generation: converts the internal query representation into a logical plan, and optimizes it.
3. Physical plan generation: splits the optimized logical plan into multiple map/reduce and HDFS tasks.

Hive Components (1/8)

Hive Components (2/8)
External interfaces:
User interfaces, e.g., CLI and web UI.
Application programming interfaces, e.g., JDBC and ODBC.
Thrift, a framework for cross-language services.

Hive Components (3/8)
Driver: manages the life cycle of a HiveQL statement during compilation, optimization, and execution.

Hive Components (4/8)
Compiler (Parser/Query Optimizer): translates the HiveQL statement into a logical plan, and optimizes it.

Hive Components (5/8)
Physical plan: transforms the logical plan into a DAG of Map/Reduce jobs.

Hive Components (6/8)
Execution engine: the driver submits the individual MapReduce jobs from the DAG to the execution engine in topological order.

Hive Components (7/8)
SerDe: the Serializer/Deserializer allows Hive to read and write table rows in any custom format.

Hive Components (8/8)
Metastore: the system catalog. Contains metadata about the tables.
Metadata is specified during table creation and reused every time the table is referenced in HiveQL.
Metadata is stored in either a traditional relational database (e.g., MySQL) or the file system, but not in HDFS.

Spark SQL

Shark
Shark modified the Hive backend to run over Spark.

In-Memory Column Store
Simply caching Hive records as JVM objects is inefficient: 12 to 16 bytes of overhead per object in the JVM implementation; e.g., storing a 270 MB table as JVM objects uses approximately 971 MB of memory.
Shark employs column-oriented storage using arrays of primitive objects.

Shark Limitations
Limited integration with Spark programs.
The Hive optimizer was not designed for Spark.

From Shark to Spark SQL
Borrowed from Shark:
Hive data loading.
In-memory column store.
Added by Spark SQL:
RDD-aware optimizer (Catalyst optimizer).
Schema added to RDDs (DataFrame).
Rich language interfaces.

Spark and Spark SQL

DataFrame
A DataFrame is a distributed collection of rows with a homogeneous schema.
It is equivalent to a table in a relational database.
It can also be manipulated in similar ways to RDDs.
DataFrames are lazy.

Adding Schema to RDDs
Spark + RDD: functional transformations on partitioned collections of opaque objects.
SQL + DataFrame: declarative transformations on partitioned collections of tuples.

Creating DataFrames
The entry point into all functionality in Spark SQL is the SQLContext.
With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json(...)

DataFrame Operations (1/2)
Domain-specific language for structured data manipulation.

// Show the content of the DataFrame
df.show()
// age  name
// null Michael
// 30   Andy
// 19   Justin

// Print the schema in a tree format
df.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// name
// Michael
// Andy
// Justin

DataFrame Operations (2/2)
Domain-specific language for structured data manipulation.

// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

// Select people older than 21
df.filter(df("age") > 21).show()
// age name
// 30  Andy

// Count people by age
df.groupBy("age").count().show()
// age  count
// null 1
// 19   1
// 30   1

Running SQL Queries Programmatically
SQL queries can be run programmatically, returning the result as a DataFrame.
Use the sql function on a SQLContext.

val sqlContext = ... // An existing SQLContext
val df = sqlContext.sql("SELECT * FROM table")

Converting RDDs into DataFrames
// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile(...).map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)

Catalyst Optimizer

Using Catalyst in Spark SQL
Catalyst is used in four phases:
1. Analyzing a logical plan to resolve references.
2. Logical plan optimization.
3. Physical planning.
4. Code generation to compile parts of the query to Java bytecode.
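
These phases can be observed from an application with the standard explain method; a small sketch (reusing the df DataFrame from the earlier examples; the exact plan output depends on the query and Spark version):

// explain(true) prints the parsed, analyzed, and optimized logical plans and the physical plan.
val adults = df.filter(df("age") > 21).select("name")
adults.explain(true)
// == Parsed Logical Plan ==
// ...
// == Optimized Logical Plan ==
// ...
// == Physical Plan ==
// ...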

Query Planning in Spark SQL - Analysis
A relation may contain unresolved attribute references or relations.
Example: in the SQL query SELECT col FROM sales, the attribute col is unresolved until we look up the table sales.
Spark SQL uses Catalyst rules and a Catalog object that tracks the tables in all data sources to resolve these attributes.

Query Planning in Spark SQL - Logical Optimization (1/5)
Applies standard rule-based optimizations to the logical plan.

val users = sqlContext.read.parquet("...")
val events = sqlContext.read.parquet("...")
val joined = events.join(users, ...)
val result = joined.select(...)

Query Planning in Spark SQL - Logical Optimization (2/5)
Null propagation and constant folding: replace expressions that can be evaluated statically with their literal value.
1 + null → null
1 + 2 → 3

Boolean simplification: simplifies boolean expressions that can be determined statically.
false AND x → false
true AND x → x
true OR x → true
false OR x → x

Query Planning in Spark SQL - Logical Optimization (3/5)
Simplify filters: removes filters that can be evaluated trivially.
Filter(true, child) → child
Filter(false, child) → empty

Combine filters: merges two filters.
Filter($fc, Filter($nc, child)) → Filter(AND($fc, $nc), child)

Query Planning in Spark SQL - Logical Optimization (4/5)
Push predicate through project: pushes filter operators through the project operator.
Filter(i == 1, Project(i, j, child)) → Project(i, j, Filter(i == 1, child))

Push predicate through join: pushes filter operators through the join operator.
Filter("left.i".attr == 1, Join(left, right)) → Join(Filter(i == 1, left), right)

Query Planning in Spark SQL - Logical Optimization (5/5)
Column pruning: eliminates the reading of unused columns.
Join(left, right, LeftSemi, "left.id".attr == "right.id".attr) → Join(left, Project(id, right), LeftSemi)

Query Planning in Spark SQL - Physical Optimization
Generates one or more physical plans using physical operators that match the Spark execution engine.
Selects a plan using a cost model, based on join algorithms: broadcast join for small relations.
Performs rule-based physical optimizations:
Pipelining projections or filters into one map operation.
Pushing operations from the logical plan into data sources that support predicate or projection pushdown.
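
As an illustration of the broadcast-join choice, Spark SQL also exposes an explicit broadcast hint; a minimal sketch (reusing the users and events DataFrames from the logical-optimization example; the join key "userId" is an assumption):

import org.apache.spark.sql.functions.broadcast

// Hint that 'users' is small enough to be shipped to every executor,
// so the join can avoid shuffling the large 'events' side.
val joinedBroadcast = events.join(broadcast(users), "userId")
joinedBroadcast.explain()  // the physical plan should show a broadcast join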

Project Tungsten

Project Tungsten
Spark workloads are increasingly bottlenecked by CPU and memory use rather than IO and network communication.
Goals of Project Tungsten: improve the memory and CPU efficiency of Spark backend execution and push performance closer to the limits of modern hardware.

Project Tungsten Initiatives
Perform manual memory management instead of relying on Java objects: reduce memory footprint, eliminate garbage collection overheads, use java.unsafe and off-heap memory.
Code generation for expression evaluation: reduce virtual function calls and interpretation overhead.
Cache-conscious sorting: reduce bad memory access patterns.

Summary

Summary
RDD: a distributed memory abstraction.
Two types of operations: transformations and actions.
Lineage graph.
DataFrame: structured processing.
Logical and physical plans.
Catalyst optimizer.
Tungsten project.

Questions?