Numerical Computing with Spark. Hossein Falaki


Challenges of numerical computation over big data

When applying any algorithm to big data, watch for:
1. Correctness
2. Performance
3. The trade-off between accuracy and performance

Three Practical Examples
- Point estimation (variance)
- Approximate estimation (cardinality)
- Matrix operations (PageRank)

We use these examples to demonstrate Spark internals, data flow, and the challenges of implementing algorithms for big data.

1. Big Data Variance

The plain variance formula requires two passes over the data:

Var(X) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2

The first pass computes the mean \mu; the second pass accumulates the squared deviations.
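As a reference point, here is a minimal local (non-Spark) sketch of that two-pass formula:

// First pass computes the mean; second pass sums squared deviations.
def twoPassVariance(xs: Array[Double]): Double = {
  val mu = xs.sum / xs.length
  xs.map(x => (x - mu) * (x - mu)).sum / xs.length
}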

Fast but inaccurate solution

Var(X) = E[X^2] - E[X]^2 = \frac{\sum x^2}{N} - \left( \frac{\sum x}{N} \right)^2

This can be performed in a single pass, but it subtracts two very close and large numbers!
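A quick way to see the problem (a hypothetical illustration): for data with a large mean and tiny variance, E[X^2] and E[X]^2 agree in almost all of their leading digits, so their difference is mostly round-off error.

// One million samples from Uniform(0, 1) shifted by 1e9: the true
// variance is still ~1/12, but the single-pass formula subtracts two
// numbers of magnitude ~1e18 and loses nearly all precision.
val xs = Array.fill(1000000)(1e9 + scala.util.Random.nextDouble())
val n = xs.length
val naive = xs.map(x => x * x).sum / n - math.pow(xs.sum / n, 2)
// naive may even come out negative, which a variance never can.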

Accumulator Pattern

An object that incrementally tracks the variance:

class RunningVar {
  var variance: Double = 0.0

  // Compute initial variance for numbers
  def this(numbers: Iterator[Double]) = {
    this()
    numbers.foreach(this.add(_))
  }

  // Update variance for a single value
  def add(value: Double) { ... }
}
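The slides elide the add() body; one reasonable way to fill it in (an assumption, not the original code) is Welford's online update, which tracks the count, the running mean, and the sum of squared deviations (M2) rather than the variance directly:

// A sketch of the accumulator using Welford's algorithm.
class RunningVar(var count: Long = 0L,
                 var mean: Double = 0.0,
                 var m2: Double = 0.0) extends Serializable {

  def this(numbers: Iterator[Double]) = { this(); numbers.foreach(add) }

  // Update the running statistics for a single value.
  def add(value: Double): Unit = {
    count += 1
    val delta = value - mean
    mean += delta / count
    m2 += delta * (value - mean) // uses the already-updated mean
  }

  def variance: Double = if (count > 0) m2 / count else 0.0
}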

Parallelize for performance

Distribute adding values in the map phase; merge partial results in the reduce phase.

class RunningVar {
  ...
  // Merge another RunningVar object
  // and update variance
  def merge(other: RunningVar) = { ... }
}
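Again the body is elided; continuing the sketch above, a standard way to combine two sets of (count, mean, M2) statistics is Chan et al.'s parallel variance formula (a sketch, not the original code):

// Merge partial statistics computed on another partition.
def merge(other: RunningVar): RunningVar = {
  if (other.count == 0) this
  else if (count == 0) other
  else {
    val n = count + other.count
    val delta = other.mean - mean
    val mergedMean = (mean * count + other.mean * other.count) / n
    val mergedM2 = m2 + other.m2 + delta * delta * count * other.count / n
    new RunningVar(n, mergedMean, mergedM2)
  }
}

Returning a RunningVar (rather than Unit) is what lets the reduce((a, b) => a.merge(b)) call on the next slide fold partial results together.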

Computing Variance in Spark

Use the RunningVar in Spark:

doubleRDD
  .mapPartitions(v => Iterator(new RunningVar(v)))
  .reduce((a, b) => a.merge(b))

Or simply use the Spark API:

doubleRDD.variance()
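Putting the two routes side by side (a hypothetical driver, assuming a SparkContext sc and the RunningVar sketched above):

// Both routes should agree, up to floating-point error.
val doubleRDD = sc.parallelize(Seq.fill(100000)(scala.util.Random.nextGaussian()))
val viaAccumulator = doubleRDD
  .mapPartitions(v => Iterator(new RunningVar(v)))
  .reduce((a, b) => a.merge(b))
  .variance
val viaBuiltin = doubleRDD.variance() // Spark's DoubleRDDFunctions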

2. Approximate Estimations

Often an approximate estimate is good enough, especially if it can be computed faster or cheaper:
1. Trade accuracy for memory
2. Trade accuracy for running time

We really like the cases where there is a bound on the error that can be controlled.

Cardinality Problem

Example: count the number of unique words in Shakespeare's works.

Using a HashSet requires ~10GB of memory. This can be much worse in many real-world applications involving large strings, such as counting web visitors.

Linear Probabilistic Counting

1. Allocate a bitmap of size m and initialize it to zero.
   A. Hash each value to a position in the bitmap
   B. Set the corresponding bit to 1
2. Count the number of empty bit entries, v. The cardinality estimate is

count = -m \ln(v / m)
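A compact sketch of such a counter (the LPCounter name matches the usage on the next slide, but the body here is an assumption, not the original implementation):

import java.util.BitSet

class LPCounter(val m: Int) extends Serializable {
  private val bitmap = new BitSet(m)

  def this(values: Iterator[Any]) = {
    this(1 << 24) // bitmap size m: an arbitrary choice for this sketch
    values.foreach(add)
  }

  // Hash each value to one of m positions and set that bit.
  def add(value: Any): Unit =
    bitmap.set(((value.hashCode % m) + m) % m)

  // Since the bitmaps are position-aligned, merging is a bitwise OR.
  def merge(other: LPCounter): LPCounter = {
    bitmap.or(other.bitmap)
    this
  }

  // Estimate: count = -m * ln(v / m), with v the number of empty bits.
  // m must be large enough that some bits stay empty (v > 0).
  def getCardinality: Long = {
    val v = m - bitmap.cardinality()
    math.round(-m * math.log(v.toDouble / m))
  }
}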

The Spark API

Use the LPCounter in Spark:

rdd
  .mapPartitions(v => Iterator(new LPCounter(v)))
  .reduce((a, b) => a.merge(b))
  .getCardinality

Or simply use the Spark API:

myRDD.countApproxDistinct(0.01)
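For example, on the Shakespeare word-count problem above (the file path is hypothetical):

// Approximate distinct-word count with ~1% relative error, using
// bounded memory instead of a ~10GB HashSet.
val words = sc.textFile("shakespeare.txt").flatMap(_.split("\\s+"))
val uniqueWords = words.countApproxDistinct(0.01)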

3. Google PageRank

Popular algorithm originally introduced by Google.

PageRank Algorithm

Start each page with a rank of 1. On each iteration:
  A. Each page sends contrib = \frac{curRank}{|neighbors|} to each of its neighbors
  B. curRank = 0.15 + 0.85 \sum_{i \in neighbors} contrib_i

PageRank Example (figure): a four-node link graph with every node starting at rank 1.0; contributions flow along the edges each iteration (e.g. ranks 1.85, 1.0, 0.58, 0.58 after one step) and eventually converge to approximately 1.44, 1.37, 0.73, and 0.46.

PageRank as Matrix Multiplication

The rank of each page is the probability of landing on that page for a random surfer on the web. The probability distribution over pages after k steps is

V_k = A^k V

V: the initial rank vector
A: the link structure (a sparse matrix)
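A toy illustration of this view (a hypothetical three-page web; A is column-stochastic, and the damping matches the 0.15/0.85 update above):

// Repeatedly apply A to the rank vector; with damping folded in,
// the vector converges to the PageRank scores.
val A = Array(
  Array(0.0, 0.5, 1.0),
  Array(0.5, 0.0, 0.0),
  Array(0.5, 0.5, 0.0)
)
var v = Array.fill(3)(1.0)
for (_ <- 1 to 20) {
  val av = A.map(row => row.zip(v).map { case (a, x) => a * x }.sum)
  v = av.map(s => 0.15 + 0.85 * s)
}
// v now approximates the ranks of the three pages.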

Data Representation in Spark

Each page is identified by its unique URL rather than an index.

Ranks vector (V): RDD[(URL, Double)]
Links matrix (A): RDD[(URL, List[URL])]

Spark Implementation

val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs
    .reduceByKey(_ + _)
    .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)

Matrix Multiplication

Repeatedly multiply the sparse matrix and the vector. The same file is read over and over!

(diagram: Links (url, neighbors) and Ranks (url, rank) flowing through iteration 1, iteration 2, iteration 3, with Links re-read from disk each time)

Spark can do much better

Using cache(), keep the neighbors in memory, and do not write intermediate results to disk. Still, the same RDD is grouped over and over!

(diagram: Links (url, neighbors) joined with Ranks (url, rank) at each iteration)

Spark can do much better

Do not partition the neighbors every time: with partitionBy, Links and the co-partitioned Ranks stay on the same node across joins.

(diagram: Links (url, neighbors), after partitionBy, joined with Ranks (url, rank) on the same node at each iteration)

Spark Implementation

// Note: partitionBy returns a new RDD, so the partitioned, cached RDD
// must be the one used in the loop below.
val links = /* load RDD of (url, neighbors) pairs */
  .partitionBy(hashFunction)
  .cache()
var ranks = // load RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs
    .reduceByKey(_ + _)
    .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
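Here hashFunction stands for a Partitioner; a concrete choice might be (the partition count is arbitrary):

import org.apache.spark.HashPartitioner
val hashFunction = new HashPartitioner(100) // 100 partitions

With links pre-partitioned and cached, the join in each iteration no longer needs to re-shuffle the links table.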

Conclusions

When applying any algorithm to big data, watch for:
1. Correctness
2. Performance
   - Cache RDDs to avoid I/O
   - Avoid unnecessary computation
3. The trade-off between accuracy and performance
