An Introduction to Apache Spark

Preston Long
6 years ago

2 History
Developed in 2009 at UC Berkeley AMPLab and later open sourced. Spark has become one of the largest big-data projects, with more than 400 contributors in 50+ organizations such as Databricks, Yahoo!, Intel, Cloudera, and IBM.

3 What is Spark?
A fast and general cluster computing system, interoperable with Hadoop datasets.

4 What are Spark's improvements?
Improves efficiency through in-memory computing primitives and general computation graphs.
Improves usability through rich APIs in Scala, Java, and Python and an interactive shell (Scala/Python).

5 MapReduce is a DAG in general

6 MapReduce
MapReduce is great for single-pass batch jobs, but many use cases need it to run in a multi-pass manner...

7 What improvements did Spark make over MapReduce?
Spark improves on MapReduce's performance for multi-pass analytics and for interactive and real-time workloads, as a distributed computation model on top of Hadoop.
Note: Spark is a successor to Hadoop MapReduce rather than to Hadoop as a whole; it can still run on Hadoop and read Hadoop datasets.

8 How did Spark do it?
Smarter data sharing!

9 Data Sharing in Hadoop MapReduce

10 Data Sharing in Spark

11 Data Sharing in Spark
Sharing data in memory is faster than going through the network and disk!
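
As a rough illustration of in-memory data sharing across passes, the sketch below caches one dataset and then runs two separate computations over it. It assumes a local Spark installation; the object name `MultiPassSketch`, the helper `countAndSum`, and the `local[*]` master are choices made for this example and do not come from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MultiPassSketch {
  // Two passes over the same dataset: a count, then a sum.
  def countAndSum(sc: SparkContext, values: Seq[Int]): (Long, Long) = {
    // cache() asks Spark to keep the partitions in memory once computed.
    val numbers = sc.parallelize(values).cache()
    val count = numbers.count()                    // pass 1: computes and caches
    val sum = numbers.map(_.toLong).reduce(_ + _)  // pass 2: reuses cached partitions
    numbers.unpersist()
    (count, sum)
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; an assumption for the sketch.
    val conf = new SparkConf().setAppName("multi-pass-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(countAndSum(sc, 1 to 100)) // prints (100,5050)
    finally sc.stop()
  }
}
```

Without the `cache()` call the second pass would recompute the dataset from its source, which is the pattern a chain of MapReduce jobs pays for with intermediate writes to disk.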

12 Spark Programming Model
At a high level, every Spark application consists of a driver program that runs the user's main function.
It encourages you to write programs in terms of transformations on distributed datasets.

13 Spark Programming Model
The main abstraction Spark provides is the resilient distributed dataset (RDD):
A collection of elements partitioned across the cluster (in memory or on disk).
Can be accessed and operated on in parallel (map, filter, ...).
Automatically rebuilt on failure.

14 Spark Programming Model
RDD operations:
Transformations create a new dataset from an existing one. Example: map()
Actions return a value to the driver program after running a computation on the dataset. Example: reduce()
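
The distinction between transformations and actions can be sketched as follows. This is a minimal example that assumes a local Spark installation; `TransformActionSketch` and `sumOfEvenSquares` are hypothetical names introduced only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformActionSketch {
  def sumOfEvenSquares(sc: SparkContext, values: Seq[Int]): Int = {
    val rdd = sc.parallelize(values)        // distribute a local collection
    val squares = rdd.map(x => x * x)       // transformation: lazy, nothing runs yet
    val evens = squares.filter(_ % 2 == 0)  // transformation: still lazy
    evens.reduce(_ + _)                     // action: runs the job, returns to the driver
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for this sketch, not a recommendation.
    val conf = new SparkConf().setAppName("transform-action-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(sumOfEvenSquares(sc, 1 to 10)) // prints 220
    finally sc.stop()
  }
}
```

Only the call to `reduce` triggers work on the cluster; `map` and `filter` merely record how the new datasets are derived, which is also what lets Spark rebuild lost partitions after a failure.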

16 Spark Programming Model
Another abstraction is shared variables:
Broadcast variables, which can be used to cache a value in memory on all nodes.
Accumulators, which are variables that are only added to, such as counters and sums.
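
A small sketch of both kinds of shared variable, under stated assumptions: it uses `longAccumulator`, which is available from Spark 2.0 onward (older releases used a different accumulator call), and the names `SharedVariablesSketch`, `resolve`, and the sample lookup table are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  // Returns the resolved names and the number of codes with no match.
  def resolve(sc: SparkContext, codes: Seq[String],
              lookup: Map[String, String]): (Seq[String], Long) = {
    val lookupBc = sc.broadcast(lookup)        // read-only copy shipped once per executor
    val misses = sc.longAccumulator("misses")  // tasks add to it; the driver reads it
    val names = sc.parallelize(codes).flatMap { code =>
      lookupBc.value.get(code) match {
        case Some(name) => Iterator(name)
        case None =>
          // Note: updates made inside transformations can be re-applied
          // if a task is retried, so treat this count as approximate.
          misses.add(1)
          Iterator.empty
      }
    }.collect().toSeq
    (names, misses.value)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("shared-variables-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val table = Map("us" -> "United States", "fr" -> "France")
      println(resolve(sc, Seq("us", "fr", "xx"), table))
    } finally sc.stop()
  }
}
```

The accumulator is read only after `collect()` has run, since its value is not meaningful on the driver until an action has executed the tasks that update it.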

20 Ease of Use
Spark offers over 80 high-level operators that make it easy to build parallel apps.
Scala and Python shells let you use it interactively.

21 A General Stack

22 Apache Spark Core

23 Apache Spark Core
Spark Core is the general engine for the Spark platform.
In-memory computing capabilities deliver speed.
A general execution model supports a wide variety of use cases.
Ease of development: native APIs in Java, Scala, and Python (+ SQL, Clojure, R).

24 Spark SQL

29 Spark Streaming

30 Spark Streaming
Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.

35 Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data.

36 Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
Transformation: modify data in one DStream to create another, new DStream.

37 Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
Sliding window operation: Minutes(1) is the window length and Seconds(1) is the sliding interval.

38 Spark Streaming
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
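
The windowed count from the slides can be sketched as a complete program. Because the Twitter source needs credentials and an extra connector, this version swaps in a plain text socket as the input, which is an assumption of the sketch rather than the deck's setup. The host `localhost`, port `9999`, and the helper `extractTags` (a stand-in for the slides' undefined `getTags`) are likewise hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object HashtagWindowSketch {
  // Hypothetical stand-in for getTags: whitespace-separated words starting with '#'.
  def extractTags(text: String): Seq[String] =
    text.split("\\s+").filter(w => w.startsWith("#") && w.length > 1).toSeq

  def main(args: Array[String]): Unit = {
    // At least two local threads: one receives data, the others process it.
    val conf = new SparkConf().setAppName("hashtag-window-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // batch interval of 1 second

    // Assumed input: lines of text sent to localhost:9999 (e.g. with `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)
    val hashTags = lines.flatMap(line => extractTags(line))

    // Count each tag over the last minute, recomputed every second.
    val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
    tagCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The window length and the sliding interval both need to be multiples of the batch interval, which is why the batch interval here is set to one second.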

39 MLlib

40 MLlib
MLlib is Spark's scalable machine learning engine.
MLlib works on any Hadoop data source, such as HDFS, HBase, and local files.

41 MLlib
Algorithms:
linear SVM and logistic regression
classification and regression tree
k-means clustering
recommendation via alternating least squares
singular value decomposition
linear regression with L1- and L2-regularization
multinomial naive Bayes
basic statistics
feature transformations
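
To make one entry from the algorithm list concrete, here is a minimal k-means sketch using the RDD-based `org.apache.spark.mllib` API. The six sample points, the choice of `k = 2`, the iteration limit, and the names `KMeansSketch` and `train` are all made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  // Trains on two well-separated groups of 2-D points (toy data).
  def train(sc: SparkContext): KMeansModel = {
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.0),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2), Vectors.dense(8.9, 9.0)
    )).cache() // k-means is iterative, so caching the input avoids recomputation
    KMeans.train(points, 2, 20) // k = 2 clusters, at most 20 iterations
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kmeans-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val model = train(sc)
      model.clusterCenters.foreach(println)
      // Cluster ids are arbitrary labels; only their grouping is meaningful.
      println(model.predict(Vectors.dense(0.05, 0.05)))
    } finally sc.stop()
  }
}
```

In practice the input would be loaded from a data source such as HDFS rather than built in the driver; the in-line points only keep the sketch self-contained.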

42 GraphX

43 GraphX
GraphX is Spark's API for graphs and graph-parallel computation.
Works with both graphs and collections.

44 GraphX
Comparable performance to the fastest specialized graph processing systems.

45 GraphX
Algorithms:
PageRank
Connected components
Label propagation
SVD++
Strongly connected components
Triangle count
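
Two of the listed algorithms, connected components and PageRank, can be tried on a toy graph as sketched below. The edge list, the PageRank tolerance of `0.001`, and the names `GraphXSketch` and `componentCount` are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.Graph

object GraphXSketch {
  // Number of connected components in a graph given as (source, target) pairs.
  def componentCount(sc: SparkContext, edges: Seq[(Long, Long)]): Long = {
    // Build a graph from a plain collection; every vertex gets the attribute 1.
    val graph = Graph.fromEdgeTuples(sc.parallelize(edges), defaultValue = 1)
    // Each vertex is labelled with the smallest vertex id in its component.
    graph.connectedComponents().vertices.map(_._2).distinct().count()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graphx-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val edges = Seq((1L, 2L), (2L, 3L), (4L, 5L))
      println(componentCount(sc, edges)) // prints 2

      // PageRank until ranks change by less than the given tolerance.
      val graph = Graph.fromEdgeTuples(sc.parallelize(edges), defaultValue = 1)
      graph.pageRank(0.001).vertices.collect().foreach(println)
    } finally sc.stop()
  }
}
```

The results come back as ordinary RDDs of (vertex id, value) pairs, so they can be processed further with the same collection-style operations used elsewhere in Spark.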

46 Spark Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud.
Spark accesses diverse data sources including HDFS, Cassandra, HBase, and S3.

47 Resources
Intro to Apache Spark, by Paco Nathan.
Building a Unified Data Pipeline in Spark, by Aaron Davidson.
