Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig


Outline 1. A brief history of Big Data and Spark 2. Technical summary of Spark 3. Unified analytics 4. What does unification bring? 5. Structured programming with DataFrames 6. Machine Learning with MLlib

2009: State-of-the-art in Big Data. Apache Hadoop, open source: HDFS, HBase, MapReduce, Hive. A large-scale, flexible data processing engine for batch computation (e.g., tens of minutes to hours). New requirements were emerging: iterative computations, e.g., machine learning, and interactive computations, e.g., ad-hoc analytics.

Prior MapReduce Model. The MapReduce model for clusters transforms data flowing from stable storage to stable storage, e.g., Hadoop: Input → Map → Reduce → Output.

How to Support ML Apps Better? Iterative and interactive applications, built on the Spark core (RDD API).

How to Support Iterative Apps. Iterative and interactive applications share data between stages via memory, on the Spark core (RDD API).

How to Support Interactive Apps. Iterative and interactive applications cache data in memory, on the Spark core (RDD API).

Motivation for Spark. Acyclic data flow (MapReduce) is a powerful abstraction, but not efficient for apps that repeatedly reuse a working set of data: iterative algorithms (many in machine learning) and interactive data mining tools (R, Excel, Python). Spark makes working sets a first-class concept to efficiently support these apps.

2. Technical Summary of Spark

What is Apache Spark? 1) A parallel execution engine for big data that implements the BSP (Bulk Synchronous Parallel) model. 2) Data abstraction: Resilient Distributed Datasets (RDDs), sets of objects partitioned and distributed across a cluster, stored in RAM or on disk. 3) Automatic recovery based on the lineage of bulk transformations.

1) The Bulk Synchronous Parallel (BSP) Model. In this classic model for designing parallel algorithms, computation proceeds in a series of supersteps. Concurrent computation: parallel processes perform local computation. Communication: processes exchange data. Barrier synchronization: when a process reaches the barrier, it waits until all other processes have reached the same barrier.
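The supersteps above can be sketched in plain Python (no cluster needed). This is a toy, single-process simulation under stated assumptions: the helper names superstep and exchange are hypothetical, not part of any Spark or BSP library; the "barrier" is simply the point where all partitions' results have been gathered before the next phase begins.

```python
# Toy BSP simulation: word counting in two supersteps over 2 partitions.
# superstep/exchange are made-up names for illustration only.
from collections import defaultdict

def superstep(partitions, local_compute):
    """Concurrent-computation phase: each partition is processed
    independently; returning from this function acts as the barrier,
    since no result is used until all partitions are done."""
    return [local_compute(p) for p in partitions]

def exchange(partitions, num_out):
    """Communication phase (a shuffle): route each (key, value) pair
    to a target partition by hash of the key."""
    out = [[] for _ in range(num_out)]
    for part in partitions:
        for k, v in part:
            out[hash(k) % num_out].append((k, v))
    return out

data = [["a", "b", "a"], ["b", "c"]]           # two input partitions

# Superstep 1: local computation (emit (word, 1) pairs).
mapped = superstep(data, lambda p: [(w, 1) for w in p])

# Communication + barrier, then superstep 2: local aggregation per key.
shuffled = exchange(mapped, num_out=2)

def count(part):
    acc = defaultdict(int)
    for k, v in part:
        acc[k] += v
    return dict(acc)

counts = superstep(shuffled, count)
merged = {k: v for part in counts for k, v in part.items()}
print(merged)  # {'a': 2, 'b': 2, 'c': 1}
```

Because the shuffle sends each key to exactly one target partition, the second superstep can aggregate each key completely on its own, which is the same reason a Spark stage boundary sits at operations like groupByKey.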

Spark, as a BSP System. Each stage (super-step) runs a set of tasks (processors) over an RDD; stages are separated by a shuffle.

Spark, as a BSP System. All tasks in the same stage run the same operation, with single-threaded, deterministic execution over an immutable dataset. The barrier is implicit in a data dependency, such as grouping data by key, between one stage (super-step) and the next.

2) Programming Model. Resilient distributed datasets (RDDs): immutable collections, partitioned across the cluster, that can be rebuilt if a partition is lost. Partitioning can be based on a key in each record (using hash or range partitioning). RDDs are created by transforming data in stable storage using data flow operators (map, filter, group-by, ...) and can be cached across parallel operations. Restricted shared variables: accumulators, broadcast variables.
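The essence of this model, immutable collections plus lazy transformations that only run when an action is called, can be sketched in a few lines of plain Python. ToyRDD and its methods are hypothetical stand-ins, not Spark APIs; real RDDs additionally distribute partitions across machines.

```python
class ToyRDD:
    """Toy stand-in for an RDD: an immutable, partitioned collection.
    Transformations return a NEW ToyRDD that only records its parent and
    the per-partition function to apply (lazy); actions force computation."""
    def __init__(self, partitions=None, parent=None, transform=None):
        self._partitions = partitions      # set only for base datasets
        self._parent = parent
        self._transform = transform

    # --- transformations (nothing is computed yet) ---
    def map(self, f):
        return ToyRDD(parent=self, transform=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda part: [x for x in part if pred(x)])

    # --- action (forces evaluation of the whole chain) ---
    def collect(self):
        return [x for part in self._compute() for x in part]

    def _compute(self):
        if self._partitions is not None:   # base data from "stable storage"
            return self._partitions
        return [self._transform(p) for p in self._parent._compute()]

base = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
result = base.filter(lambda x: x % 2 == 0).map(lambda x: x * 10).collect()
print(result)  # [20, 40, 60]
```

Note that each transformation never mutates its parent; it only adds a link to the chain, which is exactly the structure the lineage-based fault tolerance below exploits.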

Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split("\t")(2))
cachedMsgs = messages.cache()
cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
...

Example: Log Mining (execution). The driver builds the chain of RDDs (base RDD from HDFS → transformed RDD → cached RDD) and ships tasks to workers; each worker reads its block (Block 1-3), caches the resulting partition in memory (Cache 1-3), and returns results to the driver for each parallel operation. Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).

RDD Operations. Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache. Actions (return a result to the driver): reduce, collect, count, save, lookup. See http://spark.apache.org/docs/latest/programming-guide.html

Transformations (define a new RDD):
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed): Sample a fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset): Return a new RDD that is the intersection of elements in the source dataset and the argument.
distinct(): Return a new dataset that contains the distinct elements of the source dataset.
groupByKey(): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.
reduceByKey(func): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
sortByKey([ascending]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order.
join(otherDataset): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherDataset): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples.
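The note above, that reduceByKey beats groupByKey for aggregations, comes down to map-side combining. A minimal plain-Python sketch of the two semantics (the function names and the two hard-coded partitions are made up for illustration; this is not Spark code):

```python
from collections import defaultdict

pairs = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]  # 2 partitions

def group_by_key(partitions):
    """groupByKey semantics: every (k, v) record crosses the shuffle,
    and values are merely collected into a list per key."""
    groups = defaultdict(list)
    for part in partitions:
        for k, v in part:              # all 5 records are shuffled
            groups[k].append(v)
    return dict(groups)

def reduce_by_key(partitions, func):
    """reduceByKey semantics: combine within each partition first, so the
    shuffle carries at most one record per key per partition."""
    combined = []
    for part in partitions:            # map-side combine
        local = {}
        for k, v in part:
            local[k] = func(local[k], v) if k in local else v
        combined.append(list(local.items()))
    result = {}
    for part in combined:              # only 4 records cross the shuffle here
        for k, v in part:
            result[k] = func(result[k], v) if k in result else v
    return result

print(group_by_key(pairs))                        # {'a': [1, 3, 4], 'b': [2, 5]}
print(reduce_by_key(pairs, lambda x, y: x + y))   # {'a': 8, 'b': 7}
```

With many repeated keys per partition, the pre-combined version shuffles far less data, which is why the slide recommends reduceByKey or aggregateByKey over groupByKey for sums and averages.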

Actions (return a result to the driver):
count(): Return the number of elements in the dataset.
collect(): Return all the elements of the dataset as an array at the driver program.
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
take(n): Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed]): Return an array with a random sample of num elements of the dataset.
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.
saveAsObjectFile(path): Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
lookup(key): When called on a dataset of (K, V) pairs, return the list of values for the given key.

3) RDD Fault Tolerance. RDDs maintain lineage (like logical logging in ARIES) that can be used to reconstruct lost partitions. Ex:

cachedMsgs = textFile(...).filter(_.contains("error")).map(_.split("\t")(2)).cache()

Lineage: HdfsRDD (path: hdfs://...) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(...)) → CachedRDD
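The recovery idea can be illustrated with a small plain-Python sketch: each derived dataset stores only the recipe (parent plus bulk transformation) needed to rebuild a lost partition, rather than a replica of the data. All class and method names here (Lineage, CachedData, recover) are hypothetical, not Spark APIs, and the log lines are made up.

```python
# Toy illustration of lineage-based recovery for a lost cached partition.

class Lineage:
    def __init__(self, parent, per_partition_fn):
        self.parent = parent          # upstream dataset
        self.fn = per_partition_fn    # bulk transformation to replay

class CachedData:
    def __init__(self, base_partitions=None, lineage=None):
        self.lineage = lineage
        if base_partitions is not None:   # stable storage: always available
            self.partitions = list(base_partitions)
        else:                             # materialize from the parent once
            self.partitions = [lineage.fn(p) for p in lineage.parent.partitions]

    def recover(self, i):
        """Rebuild partition i by replaying the lineage on the parent's
        partition i -- no replica of partition i is ever kept."""
        self.partitions[i] = self.lineage.fn(self.lineage.parent.partitions[i])

hdfs = CachedData(base_partitions=[["ERROR\tdb\tfoo", "INFO\tok\tx"],
                                   ["ERROR\tnet\tbar"]])
errors = CachedData(lineage=Lineage(hdfs, lambda p: [l for l in p if l.startswith("ERROR")]))
fields = CachedData(lineage=Lineage(errors, lambda p: [l.split("\t")[2] for l in p]))

print(fields.partitions)        # [['foo'], ['bar']]
fields.partitions[0] = None     # simulate losing one cached partition
fields.recover(0)               # recompute only that partition from lineage
print(fields.partitions)        # [['foo'], ['bar']] again
```

Because transformations are bulk and deterministic, replaying them on the surviving parent partition reproduces exactly the lost data, which is why logging lineage is so much cheaper than replicating or checkpointing the datasets themselves.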

Benefits of RDD Model. Consistency is easy due to immutability (no updates). Fault tolerance is inexpensive (log lineage rather than replicating/checkpointing data). Locality-aware scheduling of tasks on partitions. Despite being restricted, the model seems applicable to a broad variety of applications.

3. Apache Spark s Path to Unification

Apache Spark's Path to Unification. A unified engine across data workloads (batch, streaming, SQL, ML, graph) and data sources.

DataFrame API. A DataFrame is logically equivalent to a relational table. Operators are mostly relational, with additional ones for statistical analysis, e.g., quantile, std, skew. Popularized by R and Python/pandas, the languages of choice for data scientists.

Example: average age per department, in the RDD API vs. the DataFrame API:

pdata.map(lambda x: (x.dept, [x.age, 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()

data.groupBy("dept").avg("age")
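The RDD-style pipeline can be traced in plain Python (no Spark needed) to see what the one-line DataFrame version expresses declaratively. The Employee records below are made-up sample data:

```python
from collections import namedtuple

Employee = namedtuple("Employee", ["dept", "age"])
pdata = [Employee("eng", 30), Employee("eng", 40), Employee("sales", 50)]

# The RDD pipeline by hand: build (dept -> [sum_of_ages, count]) as the
# map + reduceByKey steps would, then divide as the final map does.
sums = {}
for x in pdata:
    s = sums.get(x.dept, [0, 0])
    sums[x.dept] = [s[0] + x.age, s[1] + 1]
rdd_style = {dept: s / n for dept, (s, n) in sums.items()}

print(rdd_style)  # {'eng': 35.0, 'sales': 50.0}
```

The point of the slide: the RDD version spells out how to compute the average (carry a sum and a count, then divide), while data.groupBy("dept").avg("age") states what is wanted and leaves the how to the optimizer.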

DataFrames, a Unifying Abstraction. Make the DataFrame declarative, and unify DataFrame and SQL: DataFrame and SQL share the same query optimizer and execution engine. Tightly integrated with the rest of Spark: the ML library takes DataFrames as input and output, and RDDs convert easily to and from DataFrames. Python, Java/Scala, and R DataFrames all compile to the same logical plan, so every optimization automatically applies to SQL and to Scala, Python, and R DataFrames.

Today's Apache Spark. Iterative, interactive, batch, and streaming applications: Spark SQL, Structured Streaming, MLlib, GraphFrames, and SparkR, all built on the Spark core (DataFrames API, Catalyst, RDD API).