An Introduction to Apache Spark
History Developed in 2009 at UC Berkeley AMPLab; open sourced in 2010. Spark has become one of the largest big-data projects, with more than 400 contributors from 50+ organizations such as Databricks, Yahoo!, Intel, Cloudera, and IBM.
What is Spark? A fast and general cluster computing system, interoperable with Hadoop datasets.
What does Spark improve? Efficiency, through in-memory computing primitives and general computation graphs. Usability, through rich APIs in Scala, Java, and Python, and an interactive shell (Scala/Python).
MapReduce is a DAG in General
MapReduce MapReduce is great for single-pass batch jobs, but many use cases need to run MapReduce in a multi-pass manner...
What improvements does Spark make over MapReduce? It improves the performance of MapReduce-style computation for multi-pass analytics, interactive, real-time, and distributed workloads on top of Hadoop. Note: Spark is a Hadoop successor.
How does Spark do it? Wise data sharing!
Data Sharing in Hadoop MapReduce
Data Sharing in Spark
Data Sharing in Spark In-memory data sharing is 10-100x faster than sharing via network and disk!
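A minimal sketch of in-memory data sharing via cache(): one load from disk, then repeated queries served from memory. The log-file path is hypothetical.

```scala
// Load a (hypothetical) log file once, then keep the filtered RDD in memory.
val errors = sc.textFile("hdfs://namenode:9000/logs/app.log")
  .filter(_.contains("ERROR"))
  .cache() // the first action materializes the RDD; later actions reuse the in-memory copy

val total = errors.count()                                   // reads from disk once
val timeouts = errors.filter(_.contains("timeout")).count()  // served from memory
```

Without cache(), each action would re-read and re-filter the file from disk.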
Spark Programming Model At a high level, every Spark application consists of a driver program that runs the user's main function. Spark encourages you to write programs in terms of transformations on distributed datasets.
Spark Programming Model The main abstraction Spark provides is a resilient distributed dataset (RDD). A collection of elements partitioned across the cluster (in memory or on disk) Can be accessed and operated on in parallel (map, filter, ...) Automatically rebuilt on failure
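A small sketch of creating an RDD from a local collection and seeing its partitions; in the interactive shell `sc` is predefined, in a standalone app you would construct the SparkContext yourself.

```scala
// In spark-shell, `sc` (the SparkContext) already exists.
val nums = sc.parallelize(1 to 100, 4) // an RDD split into 4 partitions

println(nums.partitions.size)            // 4: elements are partitioned across the cluster
println(nums.filter(_ % 2 == 0).count()) // 50: each partition is processed in parallel
```

If a partition is lost, Spark rebuilds it automatically from the lineage of operations that produced it.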
Spark Programming Model RDD Operations Transformations: create a new dataset from an existing one. Example: map() Actions: return a value to the driver program after running a computation on the dataset. Example: reduce()
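The transformation/action split above can be sketched in a few lines; transformations are lazy and only an action triggers execution.

```scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformation: map() lazily describes a new RDD; nothing executes yet.
val doubled = data.map(_ * 2)

// Action: reduce() triggers the computation and returns a value to the driver.
val sum = doubled.reduce(_ + _) // 30
```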
Spark Programming Model Another abstraction is shared variables: Broadcast variables, which can be used to cache a value in memory on all nodes. Accumulators, which workers can only add to, e.g. for counters and sums.
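Both kinds of shared variable can be sketched as follows, using the classic Spark 1.x accumulator API:

```scala
// Broadcast variable: a read-only value shipped to and cached on every node.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val mapped = sc.parallelize(Seq("a", "b", "a")).map(k => lookup.value(k))

// Accumulator: workers only add to it; the driver reads the final value.
val errorCount = sc.accumulator(0)
sc.parallelize(Seq("ok", "ERROR", "ok")).foreach { line =>
  if (line.contains("ERROR")) errorCount += 1
}
println(errorCount.value) // 1
```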
Ease of Use Spark offers over 80 high-level operators that make it easy to build parallel apps, plus Scala and Python shells for using it interactively.
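As a taste of those high-level operators, a word count fits in one pipeline in the interactive shell; the input file name is just an example.

```scala
// In spark-shell, `sc` is predefined. "README.md" stands in for any text file.
val counts = sc.textFile("README.md")
  .flatMap(_.split("\\s+"))     // split lines into words
  .map(word => (word, 1))       // pair each word with a count of 1
  .reduceByKey(_ + _)           // sum counts per word

counts.take(5).foreach(println)
```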
A General Stack
Apache Spark Core Spark Core is the general engine for the Spark platform. In-memory computing capabilities deliver speed. A general execution model supports a wide variety of use cases. Ease of development: native APIs in Java, Scala, and Python (+ SQL, Clojure, R).
Spark SQL Spark's module for working with structured data, letting you mix SQL queries with regular Spark programs.
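A minimal Spark SQL sketch using the Spark 1.x SQLContext API; the JSON file name and its fields (name, age) are hypothetical.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Hypothetical JSON file with {"name": ..., "age": ...} records.
val people = sqlContext.jsonFile("people.json")
people.registerTempTable("people")

// Mix SQL queries with regular Spark code.
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)
```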
Spark Streaming Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.
Spark Streaming val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data val hashTags = tweets.flatMap(status => getTags(status)) Transformation: modify the data in one DStream to create another DStream val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue() Sliding window operation over a window length, advancing by a sliding interval
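The slide fragments above can be assembled into one runnable sketch. This assumes the spark-streaming-twitter package, whose TwitterUtils.createStream helper corresponds to the slides' ssc.twitterStream shorthand; the hashtag extraction stands in for the slides' hypothetical getTags helper, and credentials are supplied via system properties.

```scala
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// One-second batch interval; `sc` is an existing SparkContext.
val ssc = new StreamingContext(sc, Seconds(1))

val tweets = TwitterUtils.createStream(ssc, None) // auth read from system properties
val hashTags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))

// Count tags over the last minute, recomputed every second.
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
tagCounts.print()

ssc.start()
ssc.awaitTermination()
```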
MLlib MLlib is Spark's scalable machine learning library. MLlib works on any Hadoop data source, such as HDFS, HBase, and local files.
MLlib Algorithms: linear SVM and logistic regression; classification and regression trees; k-means clustering; recommendation via alternating least squares; singular value decomposition; linear regression with L1 and L2 regularization; multinomial naive Bayes; basic statistics; feature transformations.
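As one example from the list, k-means clustering can be sketched with MLlib's RDD-based API; the toy 2-D points below are made up for illustration.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Toy 2-D points; real data would come from HDFS, HBase, etc.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

// Cluster into k = 2 groups with at most 20 iterations.
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)
```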
GraphX GraphX is Spark's API for graphs and graph-parallel computation. It works with both graphs and collections.
GraphX Comparable performance to the fastest specialized graph processing systems.
GraphX Algorithms: PageRank, connected components, label propagation, SVD++, strongly connected components, triangle count.
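Taking PageRank from the list above, a minimal sketch loads a graph from an edge list and runs it to convergence; the file name "followers.txt" is hypothetical.

```scala
import org.apache.spark.graphx.GraphLoader

// Hypothetical edge-list file: one "srcId dstId" pair per line.
val graph = GraphLoader.edgeListFile(sc, "followers.txt")

// Run PageRank until ranks converge within the given tolerance.
val ranks = graph.pageRank(0.0001).vertices

// Show the three highest-ranked vertices.
ranks.collect().sortBy(-_._2).take(3).foreach(println)
```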
Spark Runs Everywhere Spark runs on Hadoop, Mesos, standalone, or in the cloud, and accesses diverse data sources including HDFS, Cassandra, HBase, and S3.
Resources http://spark.apache.org "Intro to Apache Spark" by Paco Nathan "Building a Unified Data Pipeline in Spark" by Aaron Davidson