Utah Distributed Systems Meetup and Reading Group - Map Reduce and Spark

1 Utah Distributed Systems Meetup and Reading Group - Map Reduce and Spark JT Olds Space Monkey Vivint R&D January

2 Outline 1 Map Reduce 2 Spark 3 Conclusion?

3 Map Reduce Outline 1 Map Reduce 2 Spark 3 Conclusion?

4 Map Reduce Map Reduce 1 Map Reduce Context Overall idea Examples Architecture Challenges

5 Map Reduce MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, Google, Inc. Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day. 1 Introduction: [...] given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.
As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance. The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and [...]

6 Map Reduce Context Map Reduce 1 Map Reduce Context Overall idea Examples Architecture Challenges

7 Map Reduce Context Google Map Reduce context Lots of conceptually simple tasks On an internet s worth of data Spread across thousands of commodity servers That are constantly failing

11 Map Reduce Context Google Map Reduce context: abstraction? Parallelize the computation Distribute the data Handle failures With simple code

15 Map Reduce Overall idea Map Reduce 1 Map Reduce Context Overall idea Examples Architecture Challenges

16 Map Reduce Overall idea Google Map Reduce: Map
map(n1, d1) -> [(k1, v1), (k2, v2), ...]

17 Map Reduce Overall idea Google Map Reduce: Map
map(n1, d1) -> [(k1, v1), (k2, v2), ...]
map(n2, d2) -> [(k3, v3), (k1, v4), ...]

18 Map Reduce Overall idea Google Map Reduce: Reduce
reduce(k1, [v1, v4, ...]) -> r1

19 Map Reduce Overall idea Google Map Reduce: Reduce
reduce(k1, [v1, v4, ...]) -> r1
reduce(k2, [v2, ...]) -> r2

20 Map Reduce Examples Map Reduce 1 Map Reduce Context Overall idea Examples Architecture Challenges

21 Map Reduce Examples Example: Word Count
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

22 Map Reduce Examples Example: Word Count
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
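
As a sanity check, the two functions above can be run through a toy single-process driver; this is a Python stand-in for the real framework, with the shuffle step that groups values by key simulated by a dictionary:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield word, "1"

def reduce_fn(key, values):
    # key: a word, values: a list of counts (as strings)
    return str(sum(int(v) for v in values))

# Toy driver: map every document, shuffle by key, then reduce.
docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
groups = defaultdict(list)
for name, contents in docs:
    for k, v in map_fn(name, contents):
        groups[k].append(v)
counts = {k: reduce_fn(k, vs) for k, vs in groups.items()}
# counts["the"] == "2"; every other word appears once
```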

23 Map Reduce Examples Aside: Combiner
combine(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  EmitIntermediate(key, AsString(result));
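
The point of the combiner is that it runs the reduce logic on each mapper's local output before the shuffle, so far fewer intermediate pairs cross the network. A small illustration in Python (hypothetical mapper output):

```python
from collections import defaultdict

def combine(pairs):
    # Pre-aggregate one mapper's local output: at most one
    # (word, partial_count) pair per word is sent over the network.
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += int(count)
    return [(w, str(c)) for w, c in local.items()]

mapper_output = [("the", "1"), ("the", "1"), ("dog", "1"), ("the", "1")]
combined = combine(mapper_output)
# 4 intermediate pairs shrink to 2: ("the", "3") and ("dog", "1")
```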

24 Map Reduce Examples Example: Reverse Web Link Graph
map(String key, String value):
  // key: source url
  // value: document contents
  for each link target in value:
    EmitIntermediate(target, key);

25 Map Reduce Examples Example: Reverse Web Link Graph
reduce(String key, Iterator values):
  // key: target url
  // values: a list of source urls
  Emit(values.serialize());
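
Concretely, the job turns "page with outgoing links" into "page with incoming links". A local Python sketch; the pages here are assumed to be pre-parsed into link lists, since the HTML parsing is elided on the slide:

```python
from collections import defaultdict

def map_fn(source_url, link_targets):
    # Emit (target, source) for every outgoing link on the page.
    for target in link_targets:
        yield target, source_url

pages = [("a.com", ["b.com", "c.com"]), ("b.com", ["c.com"])]
incoming = defaultdict(list)
for url, links in pages:
    for target, source in map_fn(url, links):
        incoming[target].append(source)
# Reduce is just serialization: incoming["c.com"] == ["a.com", "b.com"]
```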

26 Map Reduce Examples Example: Inverted index
map(String key, String value):
  // key: document id
  // value: document contents
  for each unique word in value:
    EmitIntermediate(word, key);

27 Map Reduce Examples Example: Inverted index
reduce(String key, Iterator values):
  // key: word
  // values: list of document ids
  Emit(values.serialize());
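
The inverted index is the same shape as the reverse link graph with documents and words swapped; the only wrinkle is emitting each word once per document. A local Python sketch with hypothetical documents:

```python
from collections import defaultdict

def map_fn(doc_id, contents):
    # Emit each *unique* word once per document.
    for word in set(contents.split()):
        yield word, doc_id

index = defaultdict(list)
for doc_id, text in [("d1", "spark spark rdd"), ("d2", "rdd lineage")]:
    for word, d in map_fn(doc_id, text):
        index[word].append(d)
for word in index:
    index[word].sort()  # reduce: serialize the sorted posting list
```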

28 Map Reduce Examples Example: Grep
map(String key, String value):
  // key: document name
  // value: document contents
  for each line lineno, line in value:
    if PATTERN matches line:
      EmitIntermediate("%s:%d" % (key, lineno), line);

29 Map Reduce Examples Example: Grep
reduce(String key, Iterator values):
  // key: document/lineno pair
  // values: line contents
  Emit(values.first);
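
Since each document/lineno key is unique, the reduce here is the identity; all the work is in the map. A Python sketch with a hypothetical log file and pattern:

```python
import re

PATTERN = re.compile(r"ERROR")  # the grep pattern; an assumption here

def map_fn(doc_name, contents):
    # Emit "name:lineno" -> line for every line matching PATTERN.
    for lineno, line in enumerate(contents.splitlines(), 1):
        if PATTERN.search(line):
            yield f"{doc_name}:{lineno}", line

# Reduce is the identity: every key carries exactly one line.
hits = dict(map_fn("app.log", "ok\nERROR disk full\nok\nERROR timeout"))
# hits == {"app.log:2": "ERROR disk full", "app.log:4": "ERROR timeout"}
```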

30 Map Reduce Examples Example: Distributed sort
map(String key, String value):
  // key: key
  // value: record
  EmitIntermediate(key, value);

31 Map Reduce Examples Example: Distributed sort
reduce(String key, Iterator values):
  // key: key
  // values: list of records
  for each record in values:
    Emit(record);
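
The sort example looks like it does nothing, and that is the trick: the framework delivers keys to each reducer in sorted order, and with a range partitioner across reducers, concatenating the partitions yields globally sorted output. Locally that collapses to a plain sort (Python sketch, hypothetical records):

```python
def map_fn(key, record):
    yield key, record  # identity map: the framework does the sorting

records = [(3, "c"), (1, "a"), (2, "b")]
pairs = [kv for k, r in records for kv in map_fn(k, r)]
# Stand-in for "reducers see keys in sorted order, partitions are ranged":
output = [record for key, record in sorted(pairs)]
# output == ["a", "b", "c"]
```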

32 Map Reduce Architecture Map Reduce 1 Map Reduce Context Overall idea Examples Architecture Challenges

33 Map Reduce Architecture Architecture Master with workers Master pings all workers to track liveness Master keeps track of map and reduce task state, restarting failed ones Map task output gets written to local disk. So, even completed tasks on failed workers need to be restarted.

37 Map Reduce Architecture Architecture Input comes from GFS (stored in triplicate); the master tries to schedule each map worker on or near a server holding an input replica. The user configures the reduce partitioning; within a partition, all keys are processed in sorted order. Output often goes back to GFS (output is often the input to another job). Stragglers are a real problem: some tasks are started multiple times as backup tasks.

41 Map Reduce Architecture Stragglers. Figure 3: Data transfer rates over time for different executions of the sort program: (a) normal execution, (b) no backup tasks, (c) 200 tasks killed. [Figure omitted; each panel plots input, shuffle, and output rates (MB/s) against seconds.]

42 Map Reduce Challenges Map Reduce 1 Map Reduce Context Overall idea Examples Architecture Challenges

43 Map Reduce Challenges Challenges Everything must fit into the map/reduce framework. Everything is written to disk, often to GFS, which means three copies. Lots of disk I/O and network I/O. The I/O cost is exacerbated when one job's output is the next job's input.

46 Spark Outline 1 Map Reduce 2 Spark 3 Conclusion?

47 Spark Spark 2 Spark Context Overall idea Simple example Scheduling More Examples

48 Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, University of California, Berkeley. Abstract: We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks. 1 Introduction: Cluster computing frameworks like MapReduce [10] and Dryad [19] have been widely adopted for large-scale data analytics. These systems let users write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance. Although current frameworks provide numerous abstractions for accessing a cluster's computational resources, they lack abstractions for leveraging distributed memory. This makes them inefficient for an important class of emerging applications: those that reuse intermediate results across multiple computations. Data reuse is [...], which can dominate application execution times.
Recognizing this problem, researchers have developed specialized frameworks for some applications that require data reuse. For example, Pregel [22] is a system for iterative graph computations that keeps intermediate data in memory, while HaLoop [7] offers an iterative MapReduce interface. However, these frameworks only support specific computation patterns (e.g., looping a series of MapReduce steps), and perform data sharing implicitly for these patterns. They do not provide abstractions for more general reuse, e.g., to let a user load several datasets into memory and run ad-hoc queries across them. In this paper, we propose a new abstraction called resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications. RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators. The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently. Existing abstractions for in-memory storage on clusters, such as distributed shared memory [24], key-value stores [25], databases, and Piccolo [27], offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). With this interface, the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead.

49 Spark Context Spark 2 Spark Context Overall idea Simple example Scheduling More Examples

50 Spark Context Resilient Distributed Datasets Existing frameworks are a poor fit for: Iterative algorithms Interactive data mining

52 Spark Context Resilient Distributed Datasets Both things can be sped up orders of magnitude by keeping stuff in memory!

53 Spark Overall idea Spark 2 Spark Context Overall idea Simple example Scheduling More Examples

54 Spark Overall idea Spark: Overall idea RDDs are implemented as lightweight objects inside of a Scala shell, where each object represents a sequence of deterministic transformations on some data. RDDs can only be constructed in the Scala shell by referencing some files on disk (usually HDFS), or by operations on other RDDs.

56 Spark Overall idea Spark: Overall idea (Example)
val lines = sc.textFile("hdfs://path/to/log")
val errors = lines.filter(_.startsWith("ERROR"))
val timestamps = errors.filter(_.contains("HDFS"))
                       .map(_.split("\t")(3))

57 Spark Overall idea Spark: Overall idea An RDD is a read-only, partitioned collection of records.

58 Spark Overall idea Spark: Overall idea (Map) Narrow dependencies: map, filter. (MapReduce's Map corresponds to flatMap in Spark.)

59 Spark Overall idea Spark: Overall idea (Reduce) Wide dependencies: groupByKey. (MapReduce's Reduce corresponds to reduceByKey in Spark.)

60 Spark Overall idea Spark: Overall idea (Transformations) RDD operations are coarse-grained and high level, like map, sample, filter, reduceByKey, etc. RDDs are evaluated lazily: data is processed as late as possible. The RDD simply records the high-level operation order, dependencies, and data partitions.
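
The lazy-evaluation idea can be sketched in a few lines: transformations only append to a lineage chain, and an action walks that chain to compute. This is an illustrative toy (a hypothetical ToyRDD class, not Spark's API):

```python
class ToyRDD:
    """Records transformations lazily; computes only when an action runs."""

    def __init__(self, data=None, parent=None, op=None):
        self.data, self.parent, self.op = data, parent, op

    def map(self, f):            # transformation: just records lineage
        return ToyRDD(parent=self, op=("map", f))

    def filter(self, f):         # transformation: just records lineage
        return ToyRDD(parent=self, op=("filter", f))

    def _compute(self):
        if self.parent is None:
            return list(self.data)
        rows = self.parent._compute()   # recompute from lineage
        kind, f = self.op
        if kind == "map":
            return [f(x) for x in rows]
        return [x for x in rows if f(x)]

    def collect(self):           # action: materializes the dataset
        return self._compute()

evens = ToyRDD(data=range(6)).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
# No data has been touched yet; collect() forces the whole chain.
# evens.collect() == [0, 20, 40]
```

Because computing always means replaying the lineage from stable input, a lost partition can be rebuilt the same way, which is exactly Spark's fault-tolerance story.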

62 Spark Overall idea Spark: Overall idea (Transformations) Figure 1: Lineage graph for the third query in our example. Boxes represent RDDs and arrows represent transformations: lines -> filter(_.startsWith("ERROR")) -> errors -> filter(_.contains("HDFS")) -> HDFS errors -> map(_.split("\t")(3)) -> time fields.

63 Spark Overall idea Spark: Overall idea Table 2: Transformations and actions available on RDDs in Spark. Seq[T] denotes a sequence of elements of type T.
Transformations:
  map(f: T => U): RDD[T] => RDD[U]
  filter(f: T => Bool): RDD[T] => RDD[T]
  flatMap(f: T => Seq[U]): RDD[T] => RDD[U]
  sample(fraction: Float): RDD[T] => RDD[T] (deterministic sampling)
  groupByKey(): RDD[(K, V)] => RDD[(K, Seq[V])]
  reduceByKey(f: (V, V) => V): RDD[(K, V)] => RDD[(K, V)]
  union(): (RDD[T], RDD[T]) => RDD[T]
  join(): (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (V, W))]
  cogroup(): (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (Seq[V], Seq[W]))]
  crossProduct(): (RDD[T], RDD[U]) => RDD[(T, U)]
  mapValues(f: V => W): RDD[(K, V)] => RDD[(K, W)] (preserves partitioning)
  sort(c: Comparator[K]): RDD[(K, V)] => RDD[(K, V)]
  partitionBy(p: Partitioner[K]): RDD[(K, V)] => RDD[(K, V)]
Actions:
  count(): RDD[T] => Long
  collect(): RDD[T] => Seq[T]
  reduce(f: (T, T) => T): RDD[T] => T
  lookup(k: K): RDD[(K, V)] => Seq[V] (on hash/range partitioned RDDs)
  save(path: String): outputs RDD to a storage system, e.g., HDFS

64 Spark Overall idea Spark: Overall idea (Tuning) A user can indicate which RDDs may get reused and should be persisted (usually in-memory, but can be spilled to disk via priority) (persist or cache). A user can also configure the partitioning of the data, and the algorithm through which nodes are chosen by key (helps join efficiency, etc).

66 Spark Overall idea Spark: Overall idea (Actions) Once an RDD has been constructed with appropriate transformations, the user can perform an action. Actions materialize an RDD, fire up the job scheduler, and cause work to be performed. Actions include count, collect, reduce, save.

70 Spark Overall idea Spark: Overall idea RDDs are immutable and deterministic. It is easy to launch backup workers for stragglers (a clear win over Map Reduce) and easy to relaunch computations from failed nodes. Computations are performed lazily, which allows rich locality and partitioning optimizations by the job scheduler and query planner.

74 Spark Simple example Spark 2 Spark Context Overall idea Simple example Scheduling More Examples

75 Spark Simple example Example: Log querying
val lines = sc.textFile("hdfs://path/to/log")
val errors = lines.filter(_.startsWith("ERROR"))
errors.cache()

76 Spark Simple example Example: Log querying errors.count()

77 Spark Simple example Example: Log querying errors.filter(_.contains("MySQL")).count()

78 Spark Simple example Example: Log querying
errors.filter(_.contains("HDFS"))
      .map(_.split("\t")(3))
      .collect()
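
The three queries above have direct counterparts over a plain in-memory list; this Python sketch (hypothetical log lines with tab-separated fields) shows what each Spark expression computes:

```python
log = [
    "ERROR HDFS\tns\tnode1\t12:01",   # field 3 is a timestamp
    "INFO all good",
    "ERROR MySQL down",
]

errors = [line for line in log if line.startswith("ERROR")]      # filter + cache
n_errors = len(errors)                                           # errors.count()
n_mysql = len([l for l in errors if "MySQL" in l])               # filter + count
hdfs_fields = [l.split("\t")[3] for l in errors if "HDFS" in l]  # filter + map + collect
# n_errors == 2, n_mysql == 1, hdfs_fields == ["12:01"]
```

The difference in Spark is that `errors` is cached once across the cluster, so the second and third queries reuse it without rereading HDFS.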

79 Spark Simple example Example: Log querying Figure 1: Lineage graph for the third query in our example. Boxes represent RDDs and arrows represent transformations: lines -> filter(_.startsWith("ERROR")) -> errors -> filter(_.contains("HDFS")) -> HDFS errors -> map(_.split("\t")(3)) -> time fields.

80 Spark Scheduling Spark 2 Spark Context Overall idea Simple example Scheduling More Examples

81 Spark Scheduling RDD Representation Table 3: Interface used to represent RDDs in Spark.
  partitions(): return a list of Partition objects
  preferredLocations(p): list nodes where partition p can be accessed faster due to data locality
  dependencies(): return a list of dependencies
  iterator(p, parentIters): compute the elements of partition p given iterators for its parent partitions
  partitioner(): return metadata specifying whether the RDD is hash/range partitioned

82 Spark Scheduling Narrow vs Wide Dependencies Figure 4: Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles. Narrow dependencies: map, filter, union, join with inputs co-partitioned. Wide dependencies: groupByKey, join with inputs not co-partitioned.

83 Spark Scheduling Job Scheduling Figure 5: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. Stage boundaries fall at wide dependencies: Stage 1 runs A -> groupByKey -> B; Stage 2 pipelines C -> map -> D and D, E -> union -> F; Stage 3 joins B and F into G.
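
The stage-construction rule behind Figure 5 is simple to state: walk the lineage backwards, pipelining across narrow dependencies and cutting a stage boundary at every wide (shuffle) dependency. A toy sketch of that rule (hypothetical lineage encoding, not Spark's scheduler):

```python
def stage_of(rdd, deps):
    """RDDs pipelined into the same stage as `rdd`: follow narrow
    dependencies; stop at wide ones, which mark shuffle boundaries."""
    members = {rdd}
    for parent, kind in deps.get(rdd, []):
        if kind == "narrow":
            members |= stage_of(parent, deps)
    return members

# Roughly Figure 5's DAG: B = A.groupByKey() (wide); D = C.map() (narrow);
# F = D.union(E) (narrow); G = B.join(F), narrow on B (co-partitioned),
# wide on F.
deps = {
    "B": [("A", "wide")],
    "D": [("C", "narrow")],
    "F": [("D", "narrow"), ("E", "narrow")],
    "G": [("B", "narrow"), ("F", "wide")],
}
# Stage with G also pulls in B; the stage with F pipelines C, D, E, F.
```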

84 Spark More Examples Spark 2 Spark Context Overall idea Simple example Scheduling More Examples

85 Spark More Examples Example: Word Count
val m = documents.flatMap(
  _._2.split("\\s+").map(word => (word, "1")))
val r = m.reduceByKey(
  (a, b) => (a.toInt + b.toInt).toString)
// do we need a combine step?

86 Spark More Examples Example: Reverse Index
val m = documents.flatMap(
  doc => doc._2.split("\\s+").distinct.map(word => (word, doc._1)))
val r = m.reduceByKey((a, b) => a + "\n" + b)

87 Spark More Examples Example: Grep
def search(keyword: String) =
  documents.filter(_._2.contains(keyword))
           .map(_._1)
           .reduce((a, b) => a + "\n" + b)

88 Spark More Examples Example: Page Rank
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    (url, (links, rank)) => links.map(dest => (dest, rank / links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x, y) => x + y)
                  .mapValues(sum => a / N + (1 - a) * sum)
}
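
To check the dataflow, here is the same loop run locally in Python on a hypothetical three-page graph, with a = 0.15 in the slide's a/N + (1 - a) * sum update:

```python
N = 3
a = 0.15
links = {"x": ["y", "z"], "y": ["z"], "z": ["x"]}   # url -> outgoing links
ranks = {url: 1.0 / N for url in links}

for _ in range(20):
    # join(links, ranks) + flatMap: each page sends rank/|outlinks| to each target
    contribs = {}
    for url, outs in links.items():
        for dest in outs:
            contribs[dest] = contribs.get(dest, 0.0) + ranks[url] / len(outs)
    # reduceByKey(_ + _) happened in the dict above; now mapValues:
    ranks = {url: a / N + (1 - a) * contribs.get(url, 0.0) for url in links}
# Ranks stay a probability distribution, and "z" (two in-links) outranks "y".
```

In Spark, `links` is persisted once and rejoined against a new `ranks` RDD every iteration; this reuse is exactly the pattern MapReduce handles badly.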

89 Spark More Examples Example: Page Rank Figure 3: Lineage graph for datasets in PageRank: input file -> map -> links; links and ranks0 -> join -> contribs0 -> reduce + map -> ranks1 -> contribs1 -> ranks2 -> ...

90 Conclusion? Outline 1 Map Reduce 2 Spark 3 Conclusion?

91 Conclusion? Conclusion? 3 Conclusion?

92 Conclusion? Conclusion? Can anything really be dead? Is Map Reduce dead? Is Spark the last word? Questions?

96 Meetup Wrap-up Shameless plug

97 Meetup Wrap-up Shameless plug Space Monkey! Large scale storage Distributed system design and implementation Security and cryptography engineering Erasure codes Monitoring and sooo much data

98 Meetup Wrap-up Shameless plug Space Monkey! Come work with us!


More information

Data-intensive computing systems

Data-intensive computing systems Data-intensive computing systems University of Verona Computer Science Department Damiano Carra Acknowledgements q Credits Part of the course material is based on slides provided by the following authors

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 26: Spark CSE 414 - Spring 2017 1 HW8 due next Fri Announcements Extra office hours today: Rajiv @ 6pm in CSE 220 No lecture Monday (holiday) Guest lecture Wednesday Kris

More information

2/4/2019 Week 3- A Sangmi Lee Pallickara

2/4/2019 Week 3- A Sangmi Lee Pallickara Week 3-A-0 2/4/2019 Colorado State University, Spring 2019 Week 3-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING SECTION 1: MAPREDUCE PA1

More information

Fault Tolerant Distributed Main Memory Systems

Fault Tolerant Distributed Main Memory Systems Fault Tolerant Distributed Main Memory Systems CompSci 590.04 Instructor: Ashwin Machanavajjhala Lecture 16 : 590.04 Fall 15 1 Recap: Map Reduce! map!!,!! list!!!!reduce!!, list(!! )!! Map Phase (per record

More information

Part II: Software Infrastructure in Data Centers: Distributed Execution Engines

Part II: Software Infrastructure in Data Centers: Distributed Execution Engines CSE 6350 File and Storage System Infrastructure in Data centers Supporting Internet-wide Services Part II: Software Infrastructure in Data Centers: Distributed Execution Engines 1 MapReduce: Simplified

More information

Spark. Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

Spark. Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica. Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica UC Berkeley Background MapReduce and Dryad raised level of abstraction in cluster

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

Parallel Computing: MapReduce Jin, Hai

Parallel Computing: MapReduce Jin, Hai Parallel Computing: MapReduce Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! MapReduce is a distributed/parallel computing framework introduced by Google

More information

Survey on Incremental MapReduce for Data Mining

Survey on Incremental MapReduce for Data Mining Survey on Incremental MapReduce for Data Mining Trupti M. Shinde 1, Prof.S.V.Chobe 2 1 Research Scholar, Computer Engineering Dept., Dr. D. Y. Patil Institute of Engineering &Technology, 2 Associate Professor,

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

L22: SC Report, Map Reduce

L22: SC Report, Map Reduce L22: SC Report, Map Reduce November 23, 2010 Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance Google version = Map Reduce; Hadoop = Open source

More information

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414 Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s

More information

Dell In-Memory Appliance for Cloudera Enterprise

Dell In-Memory Appliance for Cloudera Enterprise Dell In-Memory Appliance for Cloudera Enterprise Spark Technology Overview and Streaming Workload Use Cases Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/

More information

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Distributed Computations MapReduce. adapted from Jeff Dean s slides Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)

More information

CS Spark. Slides from Matei Zaharia and Databricks

CS Spark. Slides from Matei Zaharia and Databricks CS 5450 Spark Slides from Matei Zaharia and Databricks Goals uextend the MapReduce model to better support two common classes of analytics apps Iterative algorithms (machine learning, graphs) Interactive

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece

More information

Parallel Programming Concepts

Parallel Programming Concepts Parallel Programming Concepts MapReduce Frank Feinbube Source: MapReduce: Simplied Data Processing on Large Clusters; Dean et. Al. Examples for Parallel Programming Support 2 MapReduce 3 Programming model

More information

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences L3: Spark & RDD Department of Computational and Data Science, IISc, 2016 This

More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

a Spark in the cloud iterative and interactive cluster computing

a Spark in the cloud iterative and interactive cluster computing a Spark in the cloud iterative and interactive cluster computing Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica UC Berkeley Background MapReduce and Dryad raised level of

More information

/ Cloud Computing. Recitation 13 April 14 th 2015

/ Cloud Computing. Recitation 13 April 14 th 2015 15-319 / 15-619 Cloud Computing Recitation 13 April 14 th 2015 Overview Last week s reflection Project 4.1 Budget issues Tagging, 15619Project This week s schedule Unit 5 - Modules 18 Project 4.2 Demo

More information

15.1 Data flow vs. traditional network programming

15.1 Data flow vs. traditional network programming CME 323: Distributed Algorithms and Optimization, Spring 2017 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 15, 5/22/2017. Scribed by D. Penner, A. Shoemaker, and

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

Massive Online Analysis - Storm,Spark

Massive Online Analysis - Storm,Spark Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R

More information

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters

More information

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create

More information

Map Reduce Group Meeting

Map Reduce Group Meeting Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for

More information

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later

More information

Introduction to MapReduce Algorithms and Analysis

Introduction to MapReduce Algorithms and Analysis Introduction to MapReduce Algorithms and Analysis Jeff M. Phillips October 25, 2013 Trade-Offs Massive parallelism that is very easy to program. Cheaper than HPC style (uses top of the line everything)

More information

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica UC BERKELEY Motivation Many important

More information

Concurrency for data-intensive applications

Concurrency for data-intensive applications Concurrency for data-intensive applications Dennis Kafura CS5204 Operating Systems 1 Jeff Dean Sanjay Ghemawat Dennis Kafura CS5204 Operating Systems 2 Motivation Application characteristics Large/massive

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce

More information

6.830 Lecture Spark 11/15/2017

6.830 Lecture Spark 11/15/2017 6.830 Lecture 19 -- Spark 11/15/2017 Recap / finish dynamo Sloppy Quorum (healthy N) Dynamo authors don't think quorums are sufficient, for 2 reasons: - Decreased durability (want to write all data at

More information

/ Cloud Computing. Recitation 13 April 12 th 2016

/ Cloud Computing. Recitation 13 April 12 th 2016 15-319 / 15-619 Cloud Computing Recitation 13 April 12 th 2016 Overview Last week s reflection Project 4.1 Quiz 11 Budget issues Tagging, 15619Project This week s schedule Unit 5 - Modules 21 Project 4.2

More information

Apache Spark Internals

Apache Spark Internals Apache Spark Internals Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Apache Spark Internals 1 / 80 Acknowledgments & Sources Sources Research papers: https://spark.apache.org/research.html Presentations:

More information

Deep Learning Comes of Age. Presenter: 齐琦 海南大学

Deep Learning Comes of Age. Presenter: 齐琦 海南大学 Deep Learning Comes of Age Presenter: 齐琦 海南大学 原文引用 Anthes, G. (2013). "Deep learning comes of age." Commun. ACM 56(6): 13-15. AI/Machine learning s new technology deep learning --- multilayer artificial

More information

An Introduction to Apache Spark Big Data Madison: 29 July William Red Hat, Inc.

An Introduction to Apache Spark Big Data Madison: 29 July William Red Hat, Inc. An Introduction to Apache Spark Big Data Madison: 29 July 2014 William Benton @willb Red Hat, Inc. About me At Red Hat for almost 6 years, working on distributed computing Currently contributing to Spark,

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

MapReduce: A Programming Model for Large-Scale Distributed Computation

MapReduce: A Programming Model for Large-Scale Distributed Computation CSC 258/458 MapReduce: A Programming Model for Large-Scale Distributed Computation University of Rochester Department of Computer Science Shantonu Hossain April 18, 2011 Outline Motivation MapReduce Overview

More information

Batch Processing Basic architecture

Batch Processing Basic architecture Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3

More information

Evolution From Shark To Spark SQL:

Evolution From Shark To Spark SQL: Evolution From Shark To Spark SQL: Preliminary Analysis and Qualitative Evaluation Xinhui Tian and Xiexuan Zhou Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese

More information

/ Cloud Computing. Recitation 13 April 17th 2018

/ Cloud Computing. Recitation 13 April 17th 2018 15-319 / 15-619 Cloud Computing Recitation 13 April 17th 2018 Overview Last week s reflection Team Project Phase 2 Quiz 11 OLI Unit 5: Modules 21 & 22 This week s schedule Project 4.2 No more OLI modules

More information

Why compute in parallel?

Why compute in parallel? HW 6 releases tonight Announcements Due Nov. 20th Waiting for AWS credit can take up to two days Sign up early: https://aws.amazon.com/education/awseducate/apply/ https://piazza.com/class/jmftm54e88t2kk?cid=452

More information

Chapter 4: Apache Spark

Chapter 4: Apache Spark Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,

More information

Motivation. Map in Lisp (Scheme) Map/Reduce. MapReduce: Simplified Data Processing on Large Clusters

Motivation. Map in Lisp (Scheme) Map/Reduce. MapReduce: Simplified Data Processing on Large Clusters Motivation MapReduce: Simplified Data Processing on Large Clusters These are slides from Dan Weld s class at U. Washington (who in turn made his slides based on those by Jeff Dean, Sanjay Ghemawat, Google,

More information

Shark: Hive on Spark

Shark: Hive on Spark Optional Reading (additional material) Shark: Hive on Spark Prajakta Kalmegh Duke University 1 What is Shark? Port of Apache Hive to run on Spark Compatible with existing Hive data, metastores, and queries

More information

Lecture 22: Spark. (leveraging bulk-granularity program structure) Parallel Computer Architecture and Programming CMU /15-618, Spring 2017

Lecture 22: Spark. (leveraging bulk-granularity program structure) Parallel Computer Architecture and Programming CMU /15-618, Spring 2017 Lecture 22: Spark (leveraging bulk-granularity program structure) Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Tunes Yeah Yeah Yeahs Sacrilege (Mosquito) In-memory performance

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

MapReduce. Kiril Valev LMU Kiril Valev (LMU) MapReduce / 35

MapReduce. Kiril Valev LMU Kiril Valev (LMU) MapReduce / 35 MapReduce Kiril Valev LMU valevk@cip.ifi.lmu.de 23.11.2013 Kiril Valev (LMU) MapReduce 23.11.2013 1 / 35 Agenda 1 MapReduce Motivation Definition Example Why MapReduce? Distributed Environment Fault Tolerance

More information

Introduction to Map Reduce

Introduction to Map Reduce Introduction to Map Reduce 1 Map Reduce: Motivation We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Analytics in Spark Yanlei Diao Tim Hunter Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Outline 1. A brief history of Big Data and Spark 2. Technical summary of Spark 3. Unified analytics

More information

Distributed Computing with Spark and MapReduce

Distributed Computing with Spark and MapReduce Distributed Computing with Spark and MapReduce Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How

More information

Machine learning library for Apache Flink

Machine learning library for Apache Flink Machine learning library for Apache Flink MTP Mid Term Report submitted to Indian Institute of Technology Mandi for partial fulfillment of the degree of B. Tech. by Devang Bacharwar (B2059) under the guidance

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

Integration of Machine Learning Library in Apache Apex

Integration of Machine Learning Library in Apache Apex Integration of Machine Learning Library in Apache Apex Anurag Wagh, Krushika Tapedia, Harsh Pathak Vishwakarma Institute of Information Technology, Pune, India Abstract- Machine Learning is a type of artificial

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

CS535 Big Data Fall 2017 Colorado State University 9/19/2017 Sangmi Lee Pallickara Week 5- A.

CS535 Big Data Fall 2017 Colorado State University   9/19/2017 Sangmi Lee Pallickara Week 5- A. http://wwwcscolostateedu/~cs535 9/19/2016 CS535 Big Data - Fall 2017 Week 5-A-1 CS535 BIG DATA FAQs PART 1 BATCH COMPUTING MODEL FOR BIG DATA ANALYTICS 3 IN-MEMORY CLUSTER COMPUTING Computer Science, Colorado

More information

2

2 2 3 4 5 Prepare data for the next iteration 6 7 >>>input_rdd = sc.textfile("text.file") >>>transform_rdd = input_rdd.filter(lambda x: "abcd" in x) >>>print "Number of abcd :" + transform_rdd.count() 8

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Fall 2016 1 HW8 is out Last assignment! Get Amazon credits now (see instructions) Spark with Hadoop Due next wed CSE 344 - Fall 2016

More information

Parallel Nested Loops

Parallel Nested Loops Parallel Nested Loops For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on (S 1,T 1 ), (S 1,T 2 ),

More information

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011 Parallel Nested Loops Parallel Partition-Based For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on

More information

1. Introduction to MapReduce

1. Introduction to MapReduce Processing of massive data: MapReduce 1. Introduction to MapReduce 1 Origins: the Problem Google faced the problem of analyzing huge sets of data (order of petabytes) E.g. pagerank, web access logs, etc.

More information

Spark.jl: Resilient Distributed Datasets in Julia

Spark.jl: Resilient Distributed Datasets in Julia Spark.jl: Resilient Distributed Datasets in Julia Dennis Wilson, Martín Martínez Rivera, Nicole Power, Tim Mickel [dennisw, martinmr, npower, tmickel]@mit.edu INTRODUCTION 1 Julia is a new high level,

More information

Introduction to MapReduce

Introduction to MapReduce 732A54 Big Data Analytics Introduction to MapReduce Christoph Kessler IDA, Linköping University Towards Parallel Processing of Big-Data Big Data too large to be read+processed in reasonable time by 1 server

More information

Spark & Spark SQL. High- Speed In- Memory Analytics over Hadoop and Hive Data. Instructor: Duen Horng (Polo) Chau

Spark & Spark SQL. High- Speed In- Memory Analytics over Hadoop and Hive Data. Instructor: Duen Horng (Polo) Chau CSE 6242 / CX 4242 Data and Visual Analytics Georgia Tech Spark & Spark SQL High- Speed In- Memory Analytics over Hadoop and Hive Data Instructor: Duen Horng (Polo) Chau Slides adopted from Matei Zaharia

More information

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation Analytic Cloud with Shelly Garion IBM Research -- Haifa 2014 IBM Corporation Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable

More information

CDS. André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 de Données astronomiques de Strasbourg, 2SSC-XMM-Newton

CDS. André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 de Données astronomiques de Strasbourg, 2SSC-XMM-Newton Docker @ CDS André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 1Centre de Données astronomiques de Strasbourg, 2SSC-XMM-Newton Paul Trehiou Université de technologie de Belfort-Montbéliard

More information

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin Presented by: Suhua Wei Yong Yu Papers: MapReduce: Simplified Data Processing on Large Clusters 1 --Jeffrey Dean

More information

Google: A Computer Scientist s Playground

Google: A Computer Scientist s Playground Google: A Computer Scientist s Playground Jochen Hollmann Google Zürich und Trondheim joho@google.com Outline Mission, data, and scaling Systems infrastructure Parallel programming model: MapReduce Googles

More information

Google: A Computer Scientist s Playground

Google: A Computer Scientist s Playground Outline Mission, data, and scaling Google: A Computer Scientist s Playground Jochen Hollmann Google Zürich und Trondheim joho@google.com Systems infrastructure Parallel programming model: MapReduce Googles

More information

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Page 1. Goals for Today Background of Cloud Computing Sources Driving Big Data CS162 Operating Systems and Systems Programming Lecture 24 Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony

More information

Distributed Systems. 18. MapReduce. Paul Krzyzanowski. Rutgers University. Fall 2015

Distributed Systems. 18. MapReduce. Paul Krzyzanowski. Rutgers University. Fall 2015 Distributed Systems 18. MapReduce Paul Krzyzanowski Rutgers University Fall 2015 November 21, 2016 2014-2016 Paul Krzyzanowski 1 Credit Much of this information is from Google: Google Code University [no

More information

CS 345A Data Mining. MapReduce

CS 345A Data Mining. MapReduce CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes

More information

Spark and Spark SQL. Amir H. Payberah. SICS Swedish ICT. Amir H. Payberah (SICS) Spark and Spark SQL June 29, / 71

Spark and Spark SQL. Amir H. Payberah. SICS Swedish ICT. Amir H. Payberah (SICS) Spark and Spark SQL June 29, / 71 Spark and Spark SQL Amir H. Payberah amir@sics.se SICS Swedish ICT Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 1 / 71 What is Big Data? Amir H. Payberah (SICS) Spark and Spark SQL June 29,

More information

Cloud, Big Data & Linear Algebra

Cloud, Big Data & Linear Algebra Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation What is Big Data? 2 Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3 Global Data Volume in Exabytes

More information

MapReduce-style data processing

MapReduce-style data processing MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic

More information