New Developments in Spark

New Developments in Spark, and Rethinking APIs for Big Data. Matei Zaharia and many others

What is Spark? Unified computing engine for big data apps > Batch, streaming and interactive. Collection of high-level APIs > One of the first widely used systems with a functional API > Libraries for SQL, ML, graph, and streaming (Spark SQL, Spark Streaming, MLlib, GraphX, all on a common Spark core)

Project Growth, June 2013 -> January 2016:
Lines of code: 70,000 -> 450,000
Total contributors: 80 -> 1,000
Monthly contributors: 20 -> 140
Largest cluster: 400 nodes -> 8,000 nodes
Most active open source project in big data

This Talk: Original Spark vision / How did the vision hold up? / New APIs: DataFrames + Spark SQL / New capabilities under these APIs / Ongoing research

Original Spark Vision 1) Unified engine for big data processing > Combines batch, interactive, streaming 2) Concise, language-integrated API > Functional programming in Scala/Java/Python

Motivation: Unification. General batch processing (MapReduce) has been joined by specialized systems for new workloads (Pregel, Giraph, Dremel, Drill, Impala, Presto, Storm, S4, ...). These systems are hard to manage, tune, and deploy, and hard to compose into pipelines.

Motivation: Unification. Instead of general batch processing plus a zoo of specialized systems (MapReduce, Pregel, Giraph, Dremel, Drill, Impala, Presto, Storm, S4, ...): a single unified engine.

Motivation: Concise API. Much of data analysis is exploratory / interactive. Answer: Resilient Distributed Datasets (RDDs) > Distributed collections with a simple functional API
lines = spark.textFile("hdfs://...")
points = lines.map(line => parsePoint(line))
points.filter(p => p.x > 100).count()
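
As a concrete sketch of this style (not from the original slides): the same pipeline written out in Scala against a SparkContext named sc, with a hypothetical parsePoint helper; cache() keeps the parsed points in memory so repeated interactive queries skip the HDFS read.

case class Point(x: Double, y: Double)
def parsePoint(line: String): Point = {
  val f = line.split(",")
  Point(f(0).toDouble, f(1).toDouble)   // assumes "x,y" input lines; illustration only
}

val lines  = sc.textFile("hdfs://...")
val points = lines.map(parsePoint).cache()   // materialize once for exploration
points.filter(p => p.x > 100).count()
points.filter(p => p.x <= 100).count()       // reuses the cached RDD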

This Talk: Original Spark vision / How did the vision hold up? / New APIs: DataFrames + Spark SQL / New capabilities under these APIs / Ongoing research

How Did the Vision Hold Up? Mostly well: users really appreciate unification. The functional API causes some challenges, which we are now tackling.

Libraries Built on Spark: Spark SQL (relational), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph), all on top of Spark Core. Largest integrated library for big data.

Which Libraries do People Use? Spark SQL 69%, Streaming 58%, MLlib 54%, GraphX 18%. 80% of users use more than one component; 60% use three or more.

Which Languages do People Use? (Bar charts of languages used in 2014 vs. 2015; usage percentages ranged from 18% to 84%.)

Main Challenge: Functional API Looks high-level, but hides many semantics of computation from the engine > Functions passed in are arbitrary blocks of code > Data stored is arbitrary Java/Python objects Users can mix APIs in suboptimal ways

Which API Call Causes the Most Tickets? map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...

What People Do
pairs = data.map(word => (word, 1))
groups = pairs.groupByKey()
groups.map { case (k, vs) => (k, vs.sum) }
Materializes all groups as lists of integers:
("the", [1, 1, 1, 1, 1, 1])  ("quick", [1, 1])  ("fox", [1, 1])
Then sums each list:
("the", 6)  ("quick", 2)  ("fox", 2)
Better code: pairs.reduceByKey(_ + _)
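
A minimal runnable sketch of the same contrast in Scala (the input path and the SparkContext sc are assumptions): reduceByKey combines counts on the map side before the shuffle, while groupByKey ships every (word, 1) pair and materializes a full list per key.

val words = sc.textFile("hdfs://...").flatMap(_.split("\\s+"))
val pairs = words.map(w => (w, 1))

// Inefficient: builds an Iterable of 1s for every word, then sums it.
val viaGroup  = pairs.groupByKey().mapValues(_.sum)

// Preferred: partial sums are computed before shuffling.
val viaReduce = pairs.reduceByKey(_ + _)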

Challenge: Data Representation. Object graphs are much larger than the underlying data.
class User(name: String, friends: Array[Int])
(Diagram: a User("Bobby", ...) instance expands into a graph of JVM objects: the User object holds pointers to a String, which in turn points to a char[] containing "Bobby", and to an int[] of friend IDs, each with its own object header and length field.)
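
To see the overhead the diagram describes, a small sketch using Spark's SizeEstimator utility (the exact number depends on the JVM and its settings; the comparison with the raw payload size is the point):

import org.apache.spark.util.SizeEstimator

class User(val name: String, val friends: Array[Int])

val u = new User("Bobby", Array(1, 2))
// Raw payload: 5 two-byte chars + 2 four-byte ints = roughly 18 bytes of actual data.
// The JVM object graph (headers, pointers, String and array wrappers) is several
// times larger; SizeEstimator reports the in-memory size Spark would observe.
println(SizeEstimator.estimate(u))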

This Talk: Original Spark vision / How did the vision hold up? / New APIs: DataFrames + Spark SQL / New capabilities under these APIs / Ongoing research

DataFrames and Spark SQL. Efficient library for working with structured data > Two interfaces: SQL for data analysts + external apps, DataFrames for programmers. Optimized computation and storage. (SIGMOD 2015)

Spark SQL Architecture: SQL and DataFrame queries are turned into a logical plan, optimized (consulting the catalog), translated into a physical plan, and compiled by the code generator into RDD operations that read data through the Data Source API.

DataFrame API. DataFrames hold rows with a known schema and offer relational operations on them through a DSL.
users = sql("select * from users")
ma_users = users[users.state == "MA"]     # the predicate is captured as an expression AST
ma_users.count()
ma_users.groupBy("name").avg("age")
ma_users.map(lambda u: u.name.upper())
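
The same operations written in Scala, as a sketch (assumes a SparkSession named spark, or SQLContext on the Spark 1.x releases this talk targets, with a registered users table; column names are illustrative):

import spark.implicits._

val users   = spark.sql("select * from users")
val maUsers = users.filter($"state" === "MA")        // predicate becomes an expression AST

maUsers.count()
maUsers.groupBy("name").avg("age").show()
maUsers.map(u => u.getAs[String]("name").toUpperCase).show()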

API Details. Based on the data frame concept in R and Python > Spark is the first system to make this API declarative. Integrated with the rest of Spark > MLlib takes DataFrames as input/output > Easily convert RDDs <-> DataFrames. (Chart: Google Trends for "data frame".)
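
Converting in both directions is a one-liner; a sketch assuming a SparkSession named spark and an illustrative Person case class:

import spark.implicits._
case class Person(name: String, age: Int)

val rdd  = spark.sparkContext.parallelize(Seq(Person("Ann", 31), Person("Bob", 25)))
val df   = rdd.toDF()          // RDD -> DataFrame, schema inferred from the case class
val back = df.as[Person].rdd   // DataFrame -> typed Dataset -> RDD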

What DataFrames Enable
1. Compact binary representation: columnar format outside the Java heap
2. Optimization across operators (join reordering, predicate pushdown, etc.)
3. Runtime code generation
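
A quick way to observe points 2 and 3 in practice (a sketch; the Parquet paths and column names are made up for illustration): explain() prints the optimized plan, where pushed-down filters appear on the scan and, in recent Spark versions, whole-stage code generation shows up as fused stages.

import spark.implicits._

val events = spark.read.parquet("/data/events")
val users  = spark.read.parquet("/data/users")

events.filter($"country" === "US")          // candidate for predicate pushdown
      .join(users, "userId")                // candidate for join reordering
      .groupBy("userId").count()
      .explain(true)                        // prints the logical and physical plans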

Performance: time for an aggregation benchmark (seconds, 0-10 scale) across DataFrame SQL, DataFrame R, DataFrame Python, DataFrame Scala, RDD Python, and RDD Scala.

DataFrames vs SQL: easier to compose into large programs (organize code into functions, classes, etc.). "[DataFrames are] concise and declarative like SQL, but I can name intermediate values." Spark 1.6 adds static typing over DataFrames (Datasets: tinyurl.com/spark-datasets).
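
A sketch of what that static typing looks like in Scala (the User case class and the JSON path are assumptions for illustration):

import org.apache.spark.sql.Dataset
import spark.implicits._

case class User(name: String, age: Long, state: String)

val users: Dataset[User] = spark.read.json("/data/users.json").as[User]
val maUsers = users.filter(_.state == "MA")        // field names checked at compile time
val names   = maUsers.map(_.name.toUpperCase)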

This Talk: Original Spark vision / How did the vision hold up? / New APIs: DataFrames + Spark SQL / New capabilities under these APIs / Ongoing research

New Capabilities under Spark SQL Uniform and efficient access to data sources Rich optimization across libraries

Data Sources. Having a uniform API for structured data lets apps migrate across data sources > Hive, MySQL, Cassandra, JSON, ... The API's semantics allow query pushdown into the sources (not possible with the old RDD API). (Diagram: a DataFrame expression such as users[users.age > 20] goes through Spark SQL, which issues a query like "select id from users" against the underlying source.)
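
A sketch of what that migration looks like in code (the paths, table names, and SparkSession named spark are assumptions): the downstream DataFrame logic is identical no matter where the rows come from, and filters like age > 20 can be pushed into sources that support it.

import org.apache.spark.sql.DataFrame

def adults(users: DataFrame): DataFrame =
  users.filter(users("age") > 20).select("id")

adults(spark.read.json("/data/users.json"))       // JSON files
adults(spark.read.parquet("/data/users.parquet")) // Parquet files
adults(spark.table("users"))                      // Hive table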

Examples
JSON source (tweets.json): {"text": "hi", "user": {"name": "bob", "id": 15}}, queried as: select user.id, text from tweets
JDBC source, queried as: select age from users where lang = 'en'
Together: select t.text, u.age from tweets t, users u where t.user.id = u.id and u.lang = 'en' (Spark SQL pushes select id, age from users where lang = 'en' down to the JDBC source)
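
One way to wire up the combined example, sketched in Scala (the JDBC URL and the Spark 2.x-style view registration are assumptions; on 1.x the call was registerTempTable):

val tweets = spark.read.json("tweets.json")
val users  = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://host/db")
  .option("dbtable", "users")
  .load()

tweets.createOrReplaceTempView("tweets")
users.createOrReplaceTempView("users")

// Spark SQL plans the cross-source join and can push the lang filter to JDBC.
spark.sql("""
  select t.text, u.age
  from tweets t, users u
  where t.user.id = u.id and u.lang = 'en'
""").show()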

Library Composition. One of our goals was to unify processing types. Problem: optimizing across libraries > Big data is expensive to copy & scan > Libraries are written in isolation (not a problem for small data). Spark SQL gives us more semantics to do this. (Diagram: SQL, DataFrames, ML, and Graph all compile down to a common logical plan.)

Example: ML Pipelines. A new API in MLlib that lets users express and optimize end-to-end workflows > Feature preparation, training, evaluation > Similar to scikit-learn, but declarative
tokenizer = Tokenizer()
tf = HashingTF(features=1000)
lr = LogisticRegression(r=0.1)
p = Pipeline(tokenizer, tf, lr)
p.fit(df)
CrossValidator.fit(p, df, args)
Because the whole pipeline is declared up front (tokenizer -> TF -> LR over a DataFrame), filters can be pushed into the data source, stages can be fused into one pass over the data, and repeated queries (e.g. during cross-validation) can be optimized before producing the model.
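
For reference, a closer-to-real sketch of the same pipeline with the spark.ml API (column names, parameter values, and the training DataFrame df are assumptions):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf  = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
val lr  = new LogisticRegression().setRegParam(0.1)

val pipeline = new Pipeline().setStages(Array(tokenizer, tf, lr))
val model    = pipeline.fit(df)   // df: DataFrame with "text" and "label" columns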

This Talk: Original Spark vision / How did the vision hold up? / New APIs: DataFrames + Spark SQL / New capabilities under these APIs / Ongoing research

The Problem: hardware has changed a lot since big data systems were first designed.
            2010              2015
Storage:    50+ MB/s (HDD)    500+ MB/s (SSD)   (10x)
Network:    1 Gbps            10 Gbps           (10x)
CPU:        ~3 GHz            ~3 GHz            (unchanged)
CPU is the new bottleneck in Spark, Hadoop, etc.

To Make Matters Worse: in response to the slowdown of Moore's Law, hardware is becoming more diverse, and each app has to be optimized separately for each platform (CPU, GPU, FPGA).

Observation Many common algorithms can be written with embarrassingly data-parallel operations > See how many run on MapReduce / Spark Focus on optimizing these as opposed to general programs (e.g. C++)

The Goal: compile SQL, machine learning, and graph algorithms into a common intermediate language, apply transformations there, and then target diverse hardware (CPUs, GPUs, ...).

Nested Vector Language (NVL) Functional-like parallel language > Captures SQL, machine learning, and graphs, but very easy to analyze Closed under composition (nested calls) and common transformations (e.g. loop fusion) > Unlike relational algebra, OpenCL, NESL

Example Transformations

def query(products: vec[{dept: int, price: int}]):
  sum = 0
  for p in products:
    if p.dept == 20:
      sum += p.price

After row-to-column conversion:

def query(dept: vec[int], price: vec[int]):
  sum = 0
  for i in 0..len(dept):
    if dept[i] == 20:
      sum += price[i]

After vectorization:

for i in 0..len(dept) by 4:
  sum += price[i..i+4] * (dept[i..i+4] == [20, 20, 20, 20])

Results: TPC-H Q6 runtime (seconds): Python 0.53, Java 0.14, C 0.08, HyPer (database) 0.11, NVL 0.03.

Effect of Transformations: runtime (seconds) drops from 0.23 for the row-oriented program to 0.08 after row-to-column conversion and 0.03 after vectorization. These transformations are usable on any NVL program.

Library Composition API. Disjoint libraries can take and return NVL objects to build up a combined program. Example: optimize across Spark and NumPy
data = sql("select features from users where age > 20")
scores = data.map(lambda vec: scoreMatrix * vec)
mean = scores.mean()

Conclusion: Large data volumes and changing hardware pose a formidable challenge for next-generation apps. Spark shows that a unified API for data apps is useful. NVL targets a new range of optimizations and environments.