
CS 5450: Spark. Slides from Matei Zaharia and Databricks

Goals
- Extend the MapReduce model to better support two common classes of analytics applications: iterative algorithms (machine learning, graphs) and interactive data mining.
- Enhance programmability: integrate into the Scala programming language, allow interactive use from the Scala interpreter, and also support Java and Python.

Cluster Programming Models
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage (e.g., MapReduce: input, map tasks, reduce tasks, output).
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.

Acyclic Data Flow Is Inefficient...
... for applications that repeatedly reuse a working set of data:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining (R, Excel, Python)
These applications have to reload data from stable storage on every query.

Resilient Distributed Datasets
- Resilient distributed datasets (RDDs): immutable, partitioned collections of objects spread across a cluster, stored in RAM or on disk; created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage.
- Allow applications to cache working sets in memory for efficient reuse.
- Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability.
- Actions on RDDs support many applications: count, reduce, collect, save.

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split("\t")(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count    // action
cachedMsgs.filter(_.contains("bar")).count

[Figure: the driver sends tasks to workers; each worker builds and caches its partition of cachedMsgs (Cache 1-3) from HDFS blocks (Block 1-3).]
Result: full-text search of 60 GB of Wikipedia on 20 EC2 machines in 0.5 sec, vs. 20 s for on-disk data.
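
The same pipeline can be written in PySpark. A minimal sketch, assuming an existing SparkContext sc; the HDFS path is a placeholder, not part of the original example:

# Minimal PySpark sketch of the log-mining pipeline (path is a placeholder).
lines = sc.textFile("hdfs://namenode:9000/logs/app.log")
errors = lines.filter(lambda line: line.startswith("ERROR"))
messages = errors.map(lambda line: line.split("\t")[2])
cached_msgs = messages.cache()          # keep the working set in memory

# Interactive queries reuse the cached RDD instead of re-reading from HDFS.
cached_msgs.filter(lambda m: "foo" in m).count()
cached_msgs.filter(lambda m: "bar" in m).count()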

Spark Operations
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program): collect, reduce, count, save, lookupKey

Creating RDDs

# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])

# Load a text file from the local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")

# Use an existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)

Basic Transformations

> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
> squares = nums.map(lambda x: x * x)          # => {1, 4, 9}

# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)  # => {4}

# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))             # => {0, 0, 1, 0, 1, 2}
# (range(x) is a sequence of the numbers 0, 1, ..., x-1)

Basic Actions

> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
> nums.collect()    # => [1, 2, 3]

# Return the first K elements
> nums.take(2)      # => [1, 2]

# Count the number of elements
> nums.count()      # => 3

# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)   # => 6

# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")

Working with Key-Value Pairs
Spark's distributed reduce transformations operate on RDDs of key-value pairs.

Python: pair = (a, b)
        pair[0]  # => a
        pair[1]  # => b

Scala:  val pair = (a, b)
        pair._1  // => a
        pair._2  // => b

Java:   Tuple2 pair = new Tuple2(a, b);
        pair._1  // => a
        pair._2  // => b

Some Key-Value Operations

> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
> pets.reduceByKey(lambda x, y: x + y)  # => {(cat, 3), (dog, 1)}
> pets.groupByKey()                     # => {(cat, [1, 2]), (dog, [1])}
> pets.sortByKey()                      # => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side.
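
To make the combiner point concrete, here is a small sketch (assuming a SparkContext sc) contrasting reduceByKey with groupByKey for the same per-key sum:

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

# reduceByKey combines values locally within each partition before the
# shuffle, so only partial sums per key cross the network.
sums = pets.reduceByKey(lambda x, y: x + y)        # [("cat", 3), ("dog", 1)]

# groupByKey ships every (key, value) pair across the network first and
# only then lets us fold the grouped values; for plain aggregations this
# is usually slower.
sums_via_group = pets.groupByKey().mapValues(sum)  # [("cat", 3), ("dog", 1)]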

Example: Word Count

> lines = sc.textFile("hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" "))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda x, y: x + y)

[Figure: the lines "to be or" and "not to be" are flatMapped into words, mapped to (word, 1) pairs, and reduced by key to (be, 2), (not, 1), (or, 1), (to, 2).]

Other Key-Value Operations

> visits = sc.parallelize([("index.html", "1.2.3.4"),
                           ("about.html", "3.4.5.6"),
                           ("index.html", "1.3.3.1")])
> pageNames = sc.parallelize([("index.html", "Home"),
                              ("about.html", "About")])

> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))

Example: Logistic Regression
Goal: find the best line separating two sets of points.
[Figure: scatter of two classes of points, with a random initial line converging to the target separating line.]

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
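
A rough PySpark equivalent of the same gradient loop, as a sketch only: the parse_point helper, file path, D, and ITERATIONS are assumptions for illustration, not part of the original slides.

import numpy as np

D = 10            # number of features (assumed for this sketch)
ITERATIONS = 10

def parse_point(line):
    # Hypothetical parser: a label followed by D features, whitespace-separated.
    values = [float(v) for v in line.split()]
    return (values[0], np.array(values[1:]))

# RDD of (y, x) pairs, cached so every iteration reuses the in-memory data.
points = sc.textFile("hdfs://namenode:9000/data/points.txt") \
           .map(parse_point).cache()

w = np.random.rand(D)
for _ in range(ITERATIONS):
    gradient = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1]))) - 1.0) * p[0] * p[1]
    ).reduce(lambda a, b: a + b)
    w -= gradient

print("Final w:", w)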

Logistic Regression Performance
[Figure: running time per iteration. Hadoop: 127 s per iteration. Spark: 174 s for the first iteration (loading data), 6 s for further iterations.]

Setting the Level of Parallelism
All pair RDD operations take an optional second parameter for the number of tasks:

> words.reduceByKey(lambda x, y: x + y, 5)
> words.groupByKey(5)
> visits.join(pageViews, 5)

Using Local Variables
Any external variables used in a closure are automatically shipped to the cluster:

> query = sys.stdin.readline()
> pages.filter(lambda x: query in x).count()

Some caveats:
- Each task gets a new copy (updates aren't sent back)
- The variable must be serializable / pickle-able
- Don't use fields of an outer object (ships all of it!); see the sketch below
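
The last caveat is easiest to see in code. A hedged sketch (the LogSearcher class is hypothetical) showing how referencing a field of an outer object ships the whole object, while copying it into a local variable ships only the value:

class LogSearcher(object):
    def __init__(self, query):
        self.query = query

    def count_matches_bad(self, rdd):
        # Referencing self.query inside the closure serializes the whole
        # LogSearcher object (and anything it holds) to every task.
        return rdd.filter(lambda line: self.query in line).count()

    def count_matches_good(self, rdd):
        # Copy the field into a local variable first: only the string
        # is serialized and shipped with the closure.
        query = self.query
        return rdd.filter(lambda line: query in line).count()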

RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions:

messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split("\t")(2))

[Figure: lineage graph HDFS file -> filtered RDD via filter(func = _.contains(...)) -> mapped RDD via map(func = _.split(...)).]
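
In PySpark, an RDD's lineage can be inspected directly with toDebugString. A small sketch under the same log-mining assumptions as above (the path is a placeholder; depending on the PySpark version, toDebugString may return bytes rather than str):

# Rebuild the messages RDD from the log-mining example (placeholder path).
messages = sc.textFile("hdfs://namenode:9000/logs/app.log") \
             .filter(lambda line: line.startswith("ERROR")) \
             .map(lambda line: line.split("\t")[2])

# The lineage (textFile -> filter -> map) is what Spark replays to
# recompute a lost partition.
print(messages.toDebugString())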

Spark Applications
- In-memory data mining on Hive data (Conviva)
- Predictive analytics (Quantifind)
- City traffic prediction (Mobile Millennium)
- Twitter spam classification (Monarch)
- ... many others

Conviva GeoReport
[Figure: query time in hours.]
- Aggregations on many keys with the same WHERE clause
- The 40x gain comes from:
  - Not re-reading unused columns or filtered records
  - Avoiding repeated decompression
  - In-memory storage of de-serialized objects

Frameworks Built on Spark
- Pregel on Spark (Bagel): Google's message-passing model for graph computation, in 200 lines of code
- Hive on Spark (Shark): 3000 lines of code, compatible with Apache Hive, ML operators in Scala

Implementation
- Runs on Apache Mesos to share resources with Hadoop and other applications
- Can read from any Hadoop input source (e.g., HDFS)
[Figure: Spark, Hadoop, and MPI frameworks running on Mesos across cluster nodes.]

Spark Scheduler
- Dryad-like DAGs
- Pipelines functions within a stage
- Cache-aware work reuse and locality
- Partitioning-aware to avoid shuffles
[Figure: example DAG of RDDs A-G split into three stages by groupBy, map, join, and union operations, with cached data partitions marked.]

Example: PageRank
- Basic idea: give pages ranks (scores) based on the links to them. Links from many pages => high rank; a link from a high-rank page => high rank.
- Good example of a more complex algorithm: multiple stages of map & reduce.
- Benefits from Spark's in-memory caching: multiple iterations over the same data.

Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 x contribs
[Figure: example four-page graph over several iterations; the ranks converge to a final state of 1.44, 1.37, 0.73, and 0.46.]

Spark Implementation (in Scala)

val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile(...)
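
A PySpark sketch of the same loop; the tiny in-memory link graph and the ITERATIONS value are placeholders for illustration, not part of the original example:

ITERATIONS = 10

# links: RDD of (url, [neighbor urls]); ranks: RDD of (url, rank), starting at 1.0.
links = sc.parallelize([("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]).cache()
ranks = links.mapValues(lambda _: 1.0)

for _ in range(ITERATIONS):
    # Each page sends rank / (number of neighbors) to every neighbor.
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]]
    )
    # New rank: 0.15 + 0.85 * (sum of received contributions).
    ranks = contribs.reduceByKey(lambda x, y: x + y) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.collect())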

PageRank Performance
[Figure: iteration time (s) vs. number of machines. 30 machines: Hadoop 171 s, Spark 23 s; 60 machines: Hadoop 80 s, Spark 14 s.]