An Introduction to Big Data Analysis using Spark

Mohamad Jaber
American University of Beirut (AUB) - Faculty of Arts & Sciences - Department of Computer Science
May 17, 2017

Outline
1. Big Data
2. Apache Spark
3. Distributed File System

Big Data
We live in the data age.

Big Data - Some Numbers
Storage and processing:
- Facebook hosts more than 240 billion photos, growing at 7 petabytes per month.
- The New York Stock Exchange generates about 4-5 terabytes of data per day.
- Google processes 20 petabytes of information per day.
Estimation:
- The size of the digital universe was 4.4 zettabytes in 2013 (1 zettabyte = 10^21 bytes).
- Projected: 44 zettabytes.

Simple Java Program to Analyze Data

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Analyzer is the slide's own type; it is assumed to expose analyze(String)
    // returning a numeric score.
    public static long analyze(String filename, Analyzer analyzer) throws IOException {
        // Read input (try-with-resources closes the reader when done)
        try (BufferedReader reader = new BufferedReader(new FileReader(filename))) {
            long score = 0;
            String line;
            // Processing: score the file line by line
            while ((line = reader.readLine()) != null) {
                score += analyzer.analyze(line);
            }
            return score;
        }
    }

Throughput: 1 GB per hour, so a 10 GB data set takes 10 hours.

How can we Improve the Performance?
- Faster CPU: scale up (vertically)
- More/faster memory: scale up (vertically)
- Increase the number of cores
- Increase the number of threads
- Increase the number of threads and cores:
  - Shared memory (pthreads)
  - Message passing (MPI)
Multi-threaded throughput: 10 GB per hour.
But what about 1 PB? And what about faults?

What do we Need?
We need a framework that abstracts away / hides:
- Scale out (horizontally)
- Parallelization
- Data distribution
- Fault tolerance
- Load balancing

Outline
1. Big Data
2. Apache Spark (this section)
3. Distributed File System

Why Spark?
- Normally, data science and analytics are done "in the small", in R/Python/MATLAB, etc.
- If your dataset ever gets too large to fit into memory, these languages/frameworks won't allow you to scale.
- You have to re-implement everything in some other language or system.
- Moreover, there is a massive shift in industry toward data-oriented decision making too: data science "in the large".
- According to the popular IT job portal Dice.com, a keyword search for the term "Spark Developer" showed listings as of 16th December.

Why Spark?
Spark is:
- More expressive. Its APIs are modeled after Scala collections and look like functional lists! Richer, more composable operations are possible than in MapReduce.
- Efficient. Not only performant in terms of running time, but also in terms of developer productivity! Interactive!
- Good for data science. Not just because of performance, but because it enables iteration, which is required by most algorithms in a data scientist's toolbox (e.g., machine learning, graph analytics).

Scala Quick Tour
- Scala is a high-level language for the Java VM (object-oriented + functional programming).
- It supports an interactive shell.

    // Declare variables
    var x: Int = 7     // or: var x = 7 (type inferred)
    val y = "hi"       // read-only

    // Functions
    def square(x: Int): Int = x * x        // or with a block body: { x * x }
    def announce(text: String): Unit = { println(text) }

    // Generic types
    var arr = new Array[Int](8)
    var lst = List(1, 2, 3)
    arr(5) = 7

    // Processing collections
    val list = List(1, 2, 3)
    list.foreach(x => println(x))
    list.foreach(println)                  // shortcut
    val incMap = list.map(x => x + 2)      // with placeholder notation: list.map(_ + 2)
    val f = list.filter(x => x % 2 == 1)   // or: list.filter(_ % 2 == 1)
    val n = list.reduce((x, y) => x + y)   // or: list.reduce(_ + _)
    // List is immutable

Visualizing Shared Memory Data Parallelism

    val res = jar.map(jellybean => doSomething(jellybean))

Shared memory data parallelism:
- Split the data.
- Workers/threads independently operate on the data shards in parallel.
- Combine when done (if necessary).
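The split / operate / combine steps can be sketched on a single machine with plain Scala futures. This is only an illustration under stated assumptions: `doSomething` here is a hypothetical stand-in (it squares its input), and `parallelMap` and the shard count are names invented for this example, not part of any library.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object SharedMemorySketch {
  // Hypothetical stand-in for the per-element work (doSomething on the slide).
  def doSomething(jellybean: Int): Int = jellybean * jellybean

  // Split the data into shards, process each shard in its own task, then combine.
  def parallelMap(jar: Vector[Int], shards: Int): Vector[Int] = {
    val shardSize = math.max(1, math.ceil(jar.size.toDouble / shards).toInt)
    val futures: List[Future[Vector[Int]]] =
      jar.grouped(shardSize).toList.map(shard => Future(shard.map(doSomething)))
    // Future.sequence keeps the shard order, so the combined result stays ordered.
    Await.result(Future.sequence(futures), 10.seconds).flatten.toVector
  }

  def main(args: Array[String]): Unit =
    println(parallelMap(Vector(1, 2, 3, 4, 5), shards = 2))
}
```

All shards live in the same process memory here, which is why combining them is cheap.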

Visualizing Distributed Data Parallelism

    val res = jar.map(jellybean => doSomething(jellybean))

Distributed data parallelism:
- Split the data over several nodes (machines).
- Workers/threads independently operate on the data shards in parallel.
- Combine when done (if necessary).
New concern: now we need to worry about network latency (when combining)!

Apache Spark
- Apache Spark is a framework for distributed data processing!
- Spark implements a distributed data parallel model called Resilient Distributed Datasets (RDDs).

Distributed Data Parallel: High-Level
Given some large dataset that cannot fit into memory on a single node:
- Chunk up (partition) the data.
- Distribute it over a cluster of machines.
- From there, think of your distributed data like a single collection!
Example (transform the text of all wiki articles to lowercase):

    val wiki: RDD[WikiArticle] = ...
    val lowerWiki = wiki.map(article => article.text.toLowerCase)

Distribution
Distribution introduces important concerns beyond parallelism in the shared memory case (single node/machine):
- Partial failure: crash failure of a subset of machines.
- Latency: certain operations (combining) have a much higher latency than other operations due to network communication.

Important latency numbers:

    Operation                              Latency
    Main memory reference                  100 ns
    Send 2K bytes over 1 Gbps network      20,000 ns
    SSD random read                        150,000 ns
    Read 1 MB sequentially from memory     250,000 ns
    Read 1 MB sequentially from SSD        1,000,000 ns
    Read 1 MB sequentially from disk       20,000,000 ns
    Send packet US -> Europe -> US         150,000,000 ns
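To make the latency numbers easier to compare, the following small sketch computes a few ratios between them. The constants are copied from the latency figures on this slide; the object and function names are made up for this illustration.

```scala
object LatencySketch {
  // Latencies in nanoseconds, copied from the slide's latency numbers.
  val mainMemoryRef: Long   = 100L
  val read1MbMemory: Long   = 250000L
  val read1MbSsd: Long      = 1000000L
  val read1MbDisk: Long     = 20000000L
  val roundTripUsEuUs: Long = 150000000L

  // How many times slower one operation is than another.
  def slowdown(slowNs: Long, fastNs: Long): Double = slowNs.toDouble / fastNs

  def main(args: Array[String]): Unit = {
    println(slowdown(read1MbSsd, read1MbMemory))      // SSD vs memory: 4.0
    println(slowdown(read1MbDisk, read1MbMemory))     // disk vs memory: 80.0
    println(slowdown(roundTripUsEuUs, mainMemoryRef)) // network round trip vs memory reference: 1500000.0
  }
}
```

Reading from disk is roughly two orders of magnitude slower than reading from memory, and a transatlantic round trip is slower again, which is why a framework that avoids disk and network traffic can be much faster.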

Big Data Processing and Latency
- Network communication and disk operations can be very expensive!
- How do these latency numbers relate to big data processing?
- To answer this question, let us discuss Spark's predecessor, Hadoop.
- Hadoop is a widely used, large-scale batch data processing framework.
- It is an open source implementation of Google's MapReduce (2004).
- Groundbreaking because of: (1) simplicity (map and reduce); and (2) fault tolerance.
- Fault tolerance (recovering from node failure) is what made it possible for Hadoop MapReduce to scale to 1000 nodes.

Hadoop MapReduce
- MapReduce works by breaking the processing into two phases.
- Each phase has key-value pairs as input and output.
- Map: grab the relevant data from the source and output intermediate (key, value) pairs (written to the local file system, i.e., disk).
- Reduce: aggregate the results for each unique key of the generated intermediate (key, value) pairs (written to HDFS).
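The two phases can be mimicked with plain Scala collections. This is a single-machine sketch of the idea, not Hadoop's API: the function names `mapPhase`, `shuffle`, and `reducePhase` are chosen for this example, and word counting is used as an assumed workload.

```scala
object MapReduceSketch {
  // Map phase: emit intermediate (key, value) pairs from one input record.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle: group the intermediate pairs by key (the framework does this in Hadoop).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2)) }

  // Reduce phase: aggregate the values of each unique key.
  def reducePhase(key: String, values: Seq[Int]): (String, Int) = (key, values.sum)

  def run(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapPhase)).map { case (key, values) => reducePhase(key, values) }

  def main(args: Array[String]): Unit =
    println(run(Seq("a b a", "b c")))
}
```

In Hadoop the intermediate pairs produced by the map phase go to disk before the reduce phase reads them; here they simply stay in memory.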

Why Spark?
- Fault tolerance in Hadoop MapReduce comes at a cost.
- Between each map and reduce step, in order to recover from potential failures, Hadoop MapReduce shuffles its data and writes intermediate data to disk.
Cons of Hadoop MapReduce:
- Not efficient when the same data is used multiple times (iterative and interactive workloads).
- Intermediate results are written to stable storage.
- Output of reducers is written to HDFS.
- Disk I/O, network I/O, and [de]serialization.

Why Spark?
- Retains fault tolerance.
- Uses a different strategy for handling latency.
- Achieves this using ideas from functional programming:
  - Keep all data immutable and in memory.
  - All operations on data are just functional transformations.
  - Fault tolerance is achieved by replaying functional transformations over the original dataset.
- Spark has been reported to be up to 100x faster than Hadoop MapReduce, while adding even more expressive APIs!

Spark vs Hadoop Performance
[Figure: performance comparison of Spark and Hadoop]

Spark vs Hadoop Popularity
According to Google Trends, Spark has surpassed Hadoop in popularity!

Spark - RDD
- Spark extends the MapReduce model to better support two common classes of analytics applications:
  - Iterative algorithms (e.g., machine learning, graph processing).
  - Interactive: efficiently analyze data sets interactively.
- Spark implements a distributed data parallel model called Resilient Distributed Datasets (RDDs).
- RDDs look just like immutable sequential or parallel Scala collections.
- An RDD is a big parallel collection that is distributed (in memory or on disk) across the cluster.
- Spark provides high-level APIs in Java, Scala, Python and R.

RDD
- An RDD can be created either from stable storage (e.g., local, HDFS) or through a parallel transformation of another RDD (e.g., map, filter).
- It is also possible to execute actions on an RDD.
- An action returns a value to the driver program rather than another RDD (e.g., reduce, count, first).
- An RDD can be cached for efficient (later) use.

Programming with RDDs - Spark Context
Spark Context:
- Main entry point to Spark functionality.
- Available in the shell as the variable sc:

    scala> val rdd = sc.textFile("input.txt")

Standalone application:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)

Create RDDs
An RDD can be created either from stable storage (e.g., local, HDFS):

    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val logData = sc.textFile("hdfs://hadoop-master/a.txt", 32) // at least 32 partitions

Or through a parallel transformation of another RDD:

    // Remove lines containing the word "error"
    val logDataFilter = logData.filter(x => !x.contains("error"))
    // Count the number of words per line
    val countData = logDataFilter.map(x => x.split("\\s+").length)
    // Both steps combined in one expression
    val countDataCombined = logData.filter(x => !x.contains("error")).map(x => x.split("\\s+").length)

Or you can turn a Scala collection into an RDD:

    val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))

Actions on RDDs

    val nums = sc.parallelize(List(1, 2, 3))
    // Retrieve RDD contents as a local collection
    nums.collect()        // => Array(1, 2, 3)
    // Return the first K elements
    nums.take(2)          // => Array(1, 2)
    // Count the number of elements
    nums.count()          // => 3
    // Merge elements with an associative function
    nums.reduce(_ + _)    // => 6, equivalent to nums.reduce((x, y) => x + y)
    // Write elements to a text file
    nums.saveAsTextFile("hdfs://file.txt")
    // Loop over all elements
    nums.foreach(println)

Lazy Operations and Caching
- All transformations in Spark are lazy: they do not compute their results right away.
- They just remember the transformations applied to some base dataset.
- The transformations are only computed when an action requires a result to be returned to the driver program.
- When an action executes, a result is computed and intermediate RDDs are kept in RAM (if possible).
- Executing another action repeats the reconstruction from the beginning.
- However, you can cache some RDDs!

    val logData = sc.textFile("hdfs://hadoop-master/a.txt")
    val logDataFilter = logData.filter(x => !x.contains("error"))
    logDataFilter.cache()
    var countData = logDataFilter.map(x => x.split("\\s+").length)
    println(countData.count())
    countData = logDataFilter.map(x => x.split("\\s+").count(x => x != "mohamad"))
    // Without cache(), this second action would repeat from the beginning;
    // with cache(), it reuses the stored logDataFilter.
    println(countData.count())
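The effect of laziness and caching can be imitated without Spark. The sketch below is a toy model, not Spark's implementation: `LazyData`, `source`, and the `sourceReads` counter are hypothetical names introduced only to show when the base data is actually read.

```scala
object LazySketch {
  // A toy lazy dataset: `compute` only runs when an action is called.
  final class LazyData[A](compute: () => Vector[A]) {
    private var cached: Option[Vector[A]] = None
    private var cacheRequested = false

    // Transformations only record what to do; nothing is computed here.
    def map[B](f: A => B): LazyData[B] = new LazyData[B](() => materialize().map(f))
    def filter(p: A => Boolean): LazyData[A] = new LazyData[A](() => materialize().filter(p))

    // Ask for this dataset to be kept in memory after its first computation.
    def cache(): this.type = { cacheRequested = true; this }

    // Action: forces evaluation of the whole chain of transformations.
    def count(): Int = materialize().size

    private def materialize(): Vector[A] = cached.getOrElse {
      val result = compute()
      if (cacheRequested) cached = Some(result)
      result
    }
  }

  var sourceReads = 0 // counts how often the base data is read

  def source(lines: Vector[String]): LazyData[String] =
    new LazyData[String](() => { sourceReads += 1; lines })
}
```

With `cache()` on the filtered dataset, two separate `count()` actions read the source once; without it, each action would replay the chain from the source.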

Pair RDDs
Spark's distributed reduce transformations operate on RDDs of key-value pairs.

    // Scala pair
    val pair = (a, b)
    pair._1 // => a
    pair._2 // => b

    // pets is a pair RDD
    val pets = sc.parallelize(Array(("cat", 1), ("dog", 1), ("cat", 2)))

Some transformations: reduceByKey, join, sortByKey, mapValues

    val data = sc.textFile("input.txt")
    val pairData = data.map(v => {
      val split = v.split("\\s+")
      (split(0), split(1).toInt)
    }).cache()
    // reduceByKey automatically implements combiners
    val rdd1 = pairData.reduceByKey((x, y) => x + y) // ("cat", 3), ("dog", 1)
    val rdd2 = pairData.groupByKey()                 // ("cat", [1, 2]), ("dog", [1])
    val rdd3 = pairData.sortByKey()                  // ("cat", 1), ("cat", 2), ("dog", 1)
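The remark about combiners can be illustrated with plain Scala collections: values are first reduced inside each partition, and only the small per-partition results are merged. This is a sketch of the idea only; `combineLocally` and this `reduceByKey` are functions written for the example and the partitions are ordinary sequences, not Spark's implementation.

```scala
object CombinerSketch {
  type Pairs = Seq[(String, Int)]

  // Local combine: sum the values per key inside one partition.
  def combineLocally(partition: Pairs): Map[String, Int] =
    partition.groupBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2).sum) }

  // Merge the per-partition results; only these small maps would cross the network.
  def reduceByKey(partitions: Seq[Pairs]): Map[String, Int] =
    partitions
      .flatMap(partition => combineLocally(partition).toSeq)
      .groupBy(_._1)
      .map { case (key, kvs) => (key, kvs.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    println(reduceByKey(Seq(Seq(("cat", 1), ("dog", 1)), Seq(("cat", 2)))))
}
```

A plain group-by-key would instead move every individual pair before aggregating, which is why a reduce with local combining tends to shuffle less data.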

Example: Word Count

    val lines = sc.textFile("input.txt")
    val counts = lines
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

Example: Simple Linear Regression

    val rddData = sc.textFile("input")
    var teta = Math.random()
    val learningRate: Double = ??? // step size; the value is missing from the slide
    val iterations = 100
    // Parse each line into an (x, y) pair and keep the parsed data in memory
    val rddDataXY = rddData.map(item => {
      val itemSplit = item.split(" ")
      (itemSplit(0).toDouble, itemSplit(1).toDouble)
    }).cache()
    for (i <- 1 to iterations) {
      // Per-point gradient of the squared error (teta * x - y)^2 with respect to teta
      val rddInnerGradient = rddDataXY.map(item => 2 * (teta * item._1 - item._2) * item._1)
      val gradient = rddInnerGradient.reduce((v1, v2) => v1 + v2)
      teta = teta - learningRate * gradient
    }
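The same gradient descent loop can be tried locally on a plain Scala sequence. This is a sketch under assumptions: the learning rate of 0.01, the fixed starting value of 0.0, and the sample data lying on y = 2x are all chosen for this illustration and do not come from the slide.

```scala
object GradientDescentSketch {
  // Fit y = theta * x by minimising the sum of squared errors.
  def fit(data: Seq[(Double, Double)], learningRate: Double, iterations: Int): Double = {
    var theta = 0.0 // fixed start instead of Math.random(), for reproducibility
    for (_ <- 1 to iterations) {
      // The derivative of sum (theta*x - y)^2 with respect to theta
      // is sum 2 * (theta*x - y) * x.
      val gradient = data.map { case (x, y) => 2 * (theta * x - y) * x }.sum
      theta -= learningRate * gradient
    }
    theta
  }

  def main(args: Array[String]): Unit =
    // Assumed sample data on the line y = 2x, so theta should approach 2.
    println(fit(Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)), learningRate = 0.01, iterations = 100))
}
```

In the Spark version the `map` and the sum run across the cluster on the cached RDD, while the update of the parameter happens on the driver in each iteration.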

Fault Tolerance
- One option for fault tolerance is to replicate the data (onto multiple nodes).
- However, this may drastically affect performance (disk and network I/O).
- Spark uses a method called lineage:
  - Remember how an RDD was built from a given source.
  - Automatically rebuild it on failure.
  - Recompute only the lost partitions on failure; that is, no cost if nothing fails!
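Lineage can be pictured as storing the recipe for a dataset rather than copies of its results. The following toy model is an assumption-laden sketch, not Spark's internals: `Lineage`, `recompute`, and the made-up partition contents exist only for this example.

```scala
object LineageSketch {
  // A dataset remembers where its partitions come from and which transformation
  // was applied, instead of keeping replicated copies of its results.
  final case class Lineage[A, B](source: Int => Vector[A], transform: Vector[A] => Vector[B]) {
    // Rebuild one partition by replaying the transformation over the original source.
    def recompute(partition: Int): Vector[B] = transform(source(partition))
  }

  def demo(): Vector[Int] = {
    // Hypothetical source: partition i holds the numbers 3*i, 3*i + 1, 3*i + 2.
    val lineage = Lineage[Int, Int](i => Vector(i * 3, i * 3 + 1, i * 3 + 2), _.map(_ * 10))
    val computed = scala.collection.mutable.Map(
      0 -> lineage.recompute(0),
      1 -> lineage.recompute(1)
    )
    computed -= 1                      // simulate losing partition 1 in a crash
    computed(1) = lineage.recompute(1) // rebuild only the lost partition
    computed(1)
  }
}
```

Partition 0 is never touched during recovery, which mirrors the point that only lost partitions are recomputed and nothing extra is paid when no failure occurs.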

Spark Execution Engine - Stages and Tasks

    sc.textFile("hdfs://master-node/input/data")
      .map(x => (x(0), x))
      .groupByKey()
      .mapValues(f => f.count(x => true))

Input lines: Mohamad, Jad, Mary
Stage 1: (M, Mohamad), (J, Jad), (M, Mary)
Stage 2: M, (Mohamad, Mary) and J, (Jad), then (M, 2) and (J, 1)

Spark Execution Engine - Stages and Tasks
- DAGScheduler is the scheduling layer of Spark that implements stage-oriented scheduling.
- It transforms a logical execution plan into a physical execution plan (using stages).
- Stages are submitted as tasks.
- When a generated result is independent of any other data, the operations can be pipelined!
[Figure: a DAG of RDD1 to RDD7 linked by map, filter, and join operations, grouped into stages 1, 2, and 3]

Spark Execution Flow
- Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (the driver).
- SparkContext can connect to several types of cluster managers: either Spark's own standalone cluster manager, Mesos, or YARN.

    spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster \
      [options] <app jar> [app options]

- Once connected, Spark acquires executors on nodes in the cluster.
- Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors.
- Finally, SparkContext sends tasks to the executors to run.

More about Spark
Modules on top of Spark: Spark also supports a rich set of higher-level tools.
- GraphX for graph processing
- MLlib for machine learning
- Spark SQL for structured data processing
- Spark Streaming
- Geo and Spatial Spark: geographical and spatial data

Outline
1. Big Data
2. Apache Spark
3. Distributed File System (this section)

Distributed File Systems
Client/server-based distributed file systems:
- The actual file service is offered/stored by a single machine.
- Examples: Network File System (NFS), Andrew File System (AFS).
- [Figure: Clients 1 to 3 connected to a single server with one HDD]
Cluster-based distributed file systems:
- Divide files among tens, hundreds, thousands or tens of thousands of machines.
- Examples: Google File System (GFS, appeared in SOSP 2003), Hadoop Distributed File System (HDFS).
- [Figure: Clients 1 to 3 connected to several servers, each with its own HDD]

Hadoop Distributed File System (HDFS)
HDFS (open source) is inspired by GFS.
[Figure: a file stored across several DataNodes; metadata kept as FsImage and EditLog]

HDFS Components in Cluster
[Figure: HDFS components in a cluster]

HDFS Commands
HDFS provides a shell-like interface, and a list of commands is available to interact with it.

    # Format the file system
    hdfs namenode -format
    # Start the NameNode and DataNode daemons
    start-dfs.sh
    # Operations on HDFS
    hdfs dfs <args>
    # Create a directory
    hdfs dfs -mkdir /input
    # List files
    hdfs dfs -ls /
    # Transfer and store a data file from the local system to HDFS
    hdfs dfs -put /home/jaber/file.txt /input
    # View the data from HDFS using the cat command
    hdfs dfs -cat /input/file.txt
    # Get the file from HDFS to the local file system
    hdfs dfs -get /input/file.txt /home/jaber/Desktop

Hadoop - YARN Cluster
[Figure: Hadoop YARN cluster]
