An Introduction to Apache Spark
History Developed in 2009 at UC Berkeley AMPLab; open sourced in 2010. Spark has become one of the largest big-data projects, with more than 400 contributors from 50+ organizations such as Databricks, Yahoo!, Intel, Cloudera, and IBM.
What is Spark? A fast and general cluster computing system, interoperable with Hadoop datasets.
What does Spark improve? Efficiency, through in-memory computing primitives and general computation graphs. Usability, through rich APIs in Scala, Java, and Python, and an interactive shell (Scala/Python).
MapReduce is a DAG in General
MapReduce MapReduce is great for single-pass batch jobs, but many use cases need to run MapReduce in a multi-pass manner...
What improvements does Spark make over MapReduce? It improves the performance of MapReduce-style computation for multi-pass analytics, interactive, real-time, and distributed workloads on top of Hadoop. Note: Spark is a Hadoop successor.
How does Spark do it? Wise data sharing!
Data Sharing in Hadoop MapReduce
Data Sharing in Spark
Data Sharing in Spark In-memory data sharing is 10-100x faster than sharing via network and disk!
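A minimal sketch of in-memory data sharing via cache(): one load from disk, then repeated queries served from memory. The log-file path is hypothetical.

```scala
// Load a (hypothetical) log file once, then keep the filtered RDD in memory.
val errors = sc.textFile("hdfs://namenode:9000/logs/app.log")
  .filter(_.contains("ERROR"))
  .cache() // the first action materializes the RDD; later actions reuse the in-memory copy

val total = errors.count()                                   // reads from disk once
val timeouts = errors.filter(_.contains("timeout")).count()  // served from memory
```

Without cache(), each action would re-read and re-filter the file from disk.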
Spark Programming Model At a high level, every Spark application consists of a driver program that runs the user's main function. Spark encourages you to write programs in terms of transformations on distributed datasets.
Spark Programming Model The main abstraction Spark provides is a resilient distributed dataset (RDD). A collection of elements partitioned across the cluster (in memory or on disk) Can be accessed and operated on in parallel (map, filter, ...) Automatically rebuilt on failure
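A small sketch of creating an RDD from a local collection and seeing its partitions; in the interactive shell `sc` is predefined, in a standalone app you would construct the SparkContext yourself.

```scala
// In spark-shell, `sc` (the SparkContext) already exists.
val nums = sc.parallelize(1 to 100, 4) // an RDD split into 4 partitions

println(nums.partitions.size)            // 4: elements are partitioned across the cluster
println(nums.filter(_ % 2 == 0).count()) // 50: each partition is processed in parallel
```

If a partition is lost, Spark rebuilds it automatically from the lineage of operations that produced it.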
Spark Programming Model RDD Operations Transformations: create a new dataset from an existing one. Example: map() Actions: return a value to the driver program after running a computation on the dataset. Example: reduce()
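The transformation/action split above can be sketched in a few lines; transformations are lazy and only an action triggers execution.

```scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformation: map() lazily describes a new RDD; nothing executes yet.
val doubled = data.map(_ * 2)

// Action: reduce() triggers the computation and returns a value to the driver.
val sum = doubled.reduce(_ + _) // 30
```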
Spark Programming Model Another abstraction is shared variables: Broadcast variables, which can be used to cache a value in memory on all nodes. Accumulators, which workers can only add to, e.g. for counters and sums.
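Both kinds of shared variable can be sketched as follows, using the classic Spark 1.x accumulator API:

```scala
// Broadcast variable: a read-only value shipped to and cached on every node.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val mapped = sc.parallelize(Seq("a", "b", "a")).map(k => lookup.value(k))

// Accumulator: workers only add to it; the driver reads the final value.
val errorCount = sc.accumulator(0)
sc.parallelize(Seq("ok", "ERROR", "ok")).foreach { line =>
  if (line.contains("ERROR")) errorCount += 1
}
println(errorCount.value) // 1
```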
Ease of Use Spark offers over 80 high-level operators that make it easy to build parallel apps, plus Scala and Python shells for using it interactively.
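As a taste of those high-level operators, a word count fits in one pipeline in the interactive shell; the input file name is just an example.

```scala
// In spark-shell, `sc` is predefined. "README.md" stands in for any text file.
val counts = sc.textFile("README.md")
  .flatMap(_.split("\\s+"))     // split lines into words
  .map(word => (word, 1))       // pair each word with a count of 1
  .reduceByKey(_ + _)           // sum counts per word

counts.take(5).foreach(println)
```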
A General Stack
Apache Spark Core Spark Core is the general engine for the Spark platform. In-memory computing capabilities deliver speed. A general execution model supports a wide variety of use cases. Ease of development: native APIs in Java, Scala, and Python (+ SQL, Clojure, R).
Spark SQL Spark's module for working with structured data, letting you mix SQL queries with regular Spark programs.
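A minimal Spark SQL sketch using the Spark 1.x SQLContext API; the JSON file name and its fields (name, age) are hypothetical.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Hypothetical JSON file with {"name": ..., "age": ...} records.
val people = sqlContext.jsonFile("people.json")
people.registerTempTable("people")

// Mix SQL queries with regular Spark code.
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)
```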
Spark Streaming Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.
Spark Streaming val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data val hashTags = tweets.flatMap(status => getTags(status)) Transformation: modify the data in one DStream to create another DStream val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue() Sliding window operation over a window length, advancing by a sliding interval
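The slide fragments above can be assembled into one runnable sketch. This assumes the spark-streaming-twitter package, whose TwitterUtils.createStream helper corresponds to the slides' ssc.twitterStream shorthand; the hashtag extraction stands in for the slides' hypothetical getTags helper, and credentials are supplied via system properties.

```scala
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// One-second batch interval; `sc` is an existing SparkContext.
val ssc = new StreamingContext(sc, Seconds(1))

val tweets = TwitterUtils.createStream(ssc, None) // auth read from system properties
val hashTags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))

// Count tags over the last minute, recomputed every second.
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
tagCounts.print()

ssc.start()
ssc.awaitTermination()
```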
MLlib MLlib is Spark's scalable machine learning library. MLlib works on any Hadoop data source, such as HDFS, HBase, and local files.
MLlib Algorithms: linear SVM and logistic regression; classification and regression trees; k-means clustering; recommendation via alternating least squares; singular value decomposition; linear regression with L1 and L2 regularization; multinomial naive Bayes; basic statistics; feature transformations.
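As one example from the list, k-means clustering can be sketched with MLlib's RDD-based API; the toy 2-D points below are made up for illustration.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Toy 2-D points; real data would come from HDFS, HBase, etc.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

// Cluster into k = 2 groups with at most 20 iterations.
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)
```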
GraphX GraphX is Spark's API for graphs and graph-parallel computation. It works with both graphs and collections.
GraphX Comparable performance to the fastest specialized graph processing systems.
GraphX Algorithms: PageRank, connected components, label propagation, SVD++, strongly connected components, triangle count.
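Taking PageRank from the list above, a minimal sketch loads a graph from an edge list and runs it to convergence; the file name "followers.txt" is hypothetical.

```scala
import org.apache.spark.graphx.GraphLoader

// Hypothetical edge-list file: one "srcId dstId" pair per line.
val graph = GraphLoader.edgeListFile(sc, "followers.txt")

// Run PageRank until ranks converge within the given tolerance.
val ranks = graph.pageRank(0.0001).vertices

// Show the three highest-ranked vertices.
ranks.collect().sortBy(-_._2).take(3).foreach(println)
```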
Spark Runs Everywhere Spark runs on Hadoop, Mesos, standalone, or in the cloud, and accesses diverse data sources including HDFS, Cassandra, HBase, and S3.
Resources http://spark.apache.org "Intro to Apache Spark" by Paco Nathan "Building a Unified Data Pipeline in Spark" by Aaron Davidson