Utah Distributed Systems Meetup and Reading Group - Map Reduce and Spark

1 Utah Distributed Systems Meetup and Reading Group - Map Reduce and Spark JT Olds Space Monkey Vivint R&D January

2 Outline 1 Map Reduce 2 Spark 3 Conclusion?

3 Map Reduce Outline 1 Map Reduce 2 Spark 3 Conclusion?

4 Map Reduce Map Reduce 1 Map Reduce Context Overall idea Examples Architecture Challenges

5 Map Reduce MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, Google, Inc. Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day. 1 Introduction: [...] given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.
As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance. The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and [...]

6 Map Reduce Context Map Reduce 1 Map Reduce Context Overall idea Examples Architecture Challenges

7 Map Reduce Context Google Map Reduce context Lots of conceptually simple tasks On an internet s worth of data Spread across thousands of commodity servers That are constantly failing

11 Map Reduce Context Google Map Reduce context: abstraction? Parallelize the computation Distribute the data Handle failures With simple code

15 Map Reduce Overall idea Map Reduce 1 Map Reduce Context Overall idea Examples Architecture Challenges

16 Map Reduce Overall idea Google Map Reduce: Map
map(n1, d1) -> [(k1, v1), (k2, v2), ...]

17 Map Reduce Overall idea Google Map Reduce: Map
map(n1, d1) -> [(k1, v1), (k2, v2), ...]
map(n2, d2) -> [(k3, v3), (k1, v4), ...]

18 Map Reduce Overall idea Google Map Reduce: Reduce
reduce(k1, [v1, v4, ...]) -> r1

19 Map Reduce Overall idea Google Map Reduce: Reduce
reduce(k1, [v1, v4, ...]) -> r1
reduce(k2, [v2, ...]) -> r2

20 Map Reduce Examples Map Reduce 1 Map Reduce Context Overall idea Examples Architecture Challenges

21 Map Reduce Examples Example: Word Count
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

22 Map Reduce Examples Example: Word Count
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
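
As a sanity check, the two functions above can be run through a toy single-process driver; this is a Python stand-in for the real framework, with the shuffle step that groups values by key simulated by a dictionary:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield word, "1"

def reduce_fn(key, values):
    # key: a word, values: a list of counts (as strings)
    return str(sum(int(v) for v in values))

# Toy driver: map every document, shuffle by key, then reduce.
docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
groups = defaultdict(list)
for name, contents in docs:
    for k, v in map_fn(name, contents):
        groups[k].append(v)
counts = {k: reduce_fn(k, vs) for k, vs in groups.items()}
# counts["the"] == "2"; every other word appears once
```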

23 Map Reduce Examples Aside: Combiner
combine(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  EmitIntermediate(key, AsString(result));
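
The point of the combiner is that it runs the reduce logic on each mapper's local output before the shuffle, so far fewer intermediate pairs cross the network. A small illustration in Python (hypothetical mapper output):

```python
from collections import defaultdict

def combine(pairs):
    # Pre-aggregate one mapper's local output: at most one
    # (word, partial_count) pair per word is sent over the network.
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += int(count)
    return [(w, str(c)) for w, c in local.items()]

mapper_output = [("the", "1"), ("the", "1"), ("dog", "1"), ("the", "1")]
combined = combine(mapper_output)
# 4 intermediate pairs shrink to 2: ("the", "3") and ("dog", "1")
```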

24 Map Reduce Examples Example: Reverse Web Link Graph
map(String key, String value):
  // key: source url
  // value: document contents
  for each link target in value:
    EmitIntermediate(target, key);

25 Map Reduce Examples Example: Reverse Web Link Graph
reduce(String key, Iterator values):
  // key: target url
  // values: a list of source urls
  Emit(values.serialize());
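
Concretely, the job turns "page with outgoing links" into "page with incoming links". A local Python sketch; the pages here are assumed to be pre-parsed into link lists, since the HTML parsing is elided on the slide:

```python
from collections import defaultdict

def map_fn(source_url, link_targets):
    # Emit (target, source) for every outgoing link on the page.
    for target in link_targets:
        yield target, source_url

pages = [("a.com", ["b.com", "c.com"]), ("b.com", ["c.com"])]
incoming = defaultdict(list)
for url, links in pages:
    for target, source in map_fn(url, links):
        incoming[target].append(source)
# Reduce is just serialization: incoming["c.com"] == ["a.com", "b.com"]
```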

26 Map Reduce Examples Example: Inverted index
map(String key, String value):
  // key: document id
  // value: document contents
  for each unique word in value:
    EmitIntermediate(word, key);

27 Map Reduce Examples Example: Inverted index
reduce(String key, Iterator values):
  // key: word
  // values: list of document ids
  Emit(values.serialize());
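
The inverted index is the same shape as the reverse link graph with documents and words swapped; the only wrinkle is emitting each word once per document. A local Python sketch with hypothetical documents:

```python
from collections import defaultdict

def map_fn(doc_id, contents):
    # Emit each *unique* word once per document.
    for word in set(contents.split()):
        yield word, doc_id

index = defaultdict(list)
for doc_id, text in [("d1", "spark spark rdd"), ("d2", "rdd lineage")]:
    for word, d in map_fn(doc_id, text):
        index[word].append(d)
for word in index:
    index[word].sort()  # reduce: serialize the sorted posting list
```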

28 Map Reduce Examples Example: Grep
map(String key, String value):
  // key: document name
  // value: document contents
  for each line lineno, line in value:
    if PATTERN matches line:
      EmitIntermediate("%s:%d" % (key, lineno), line);

29 Map Reduce Examples Example: Grep
reduce(String key, Iterator values):
  // key: document/lineno pair
  // values: line contents
  Emit(values.first);
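
Since each document/lineno key is unique, the reduce here is the identity; all the work is in the map. A Python sketch with a hypothetical log file and pattern:

```python
import re

PATTERN = re.compile(r"ERROR")  # the grep pattern; an assumption here

def map_fn(doc_name, contents):
    # Emit "name:lineno" -> line for every line matching PATTERN.
    for lineno, line in enumerate(contents.splitlines(), 1):
        if PATTERN.search(line):
            yield f"{doc_name}:{lineno}", line

# Reduce is the identity: every key carries exactly one line.
hits = dict(map_fn("app.log", "ok\nERROR disk full\nok\nERROR timeout"))
# hits == {"app.log:2": "ERROR disk full", "app.log:4": "ERROR timeout"}
```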

30 Map Reduce Examples Example: Distributed sort
map(String key, String value):
  // key: key
  // value: record
  EmitIntermediate(key, value);

31 Map Reduce Examples Example: Distributed sort
reduce(String key, Iterator values):
  // key: key
  // values: list of records
  for each record in values:
    Emit(record);
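
The sort example looks like it does nothing, and that is the trick: the framework delivers keys to each reducer in sorted order, and with a range partitioner across reducers, concatenating the partitions yields globally sorted output. Locally that collapses to a plain sort (Python sketch, hypothetical records):

```python
def map_fn(key, record):
    yield key, record  # identity map: the framework does the sorting

records = [(3, "c"), (1, "a"), (2, "b")]
pairs = [kv for k, r in records for kv in map_fn(k, r)]
# Stand-in for "reducers see keys in sorted order, partitions are ranged":
output = [record for key, record in sorted(pairs)]
# output == ["a", "b", "c"]
```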

32 Map Reduce Architecture Map Reduce 1 Map Reduce Context Overall idea Examples Architecture Challenges

33 Map Reduce Architecture Architecture Master with workers Master pings all workers to track liveness Master keeps track of map and reduce task state, restarting failed ones Map task output gets written to local disk. So, even completed tasks on failed workers need to be restarted.

37 Map Reduce Architecture Architecture Input comes from GFS (stored in triplicate); the master tries to schedule each map worker on or near a server holding an input replica. The user configures the reduce partitioning; within a partition, all keys are processed in sorted order. Output often goes back to GFS (output is often the input to another job). Stragglers are a real problem: some tasks are started multiple times as backup tasks.

41 Map Reduce Architecture Stragglers. Figure 3: Data transfer rates over time for different executions of the sort program: (a) normal execution, (b) no backup tasks, (c) 200 tasks killed. [Figure omitted; each panel plots input, shuffle, and output rates (MB/s) against seconds.]

42 Map Reduce Challenges Map Reduce 1 Map Reduce Context Overall idea Examples Architecture Challenges

43 Map Reduce Challenges Challenges Everything must fit into the map/reduce framework. Everything is written to disk, often to GFS, which means three copies. Lots of disk I/O and network I/O. The I/O cost is exacerbated when one job's output is the next job's input.

46 Spark Outline 1 Map Reduce 2 Spark 3 Conclusion?

47 Spark Spark 2 Spark Context Overall idea Simple example Scheduling More Examples

48 Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, University of California, Berkeley. Abstract: We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks. 1 Introduction: Cluster computing frameworks like MapReduce [10] and Dryad [19] have been widely adopted for large-scale data analytics. These systems let users write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance. Although current frameworks provide numerous abstractions for accessing a cluster's computational resources, they lack abstractions for leveraging distributed memory. This makes them inefficient for an important class of emerging applications: those that reuse intermediate results across multiple computations. Data reuse is [...], which can dominate application execution times.
Recognizing this problem, researchers have developed specialized frameworks for some applications that require data reuse. For example, Pregel [22] is a system for iterative graph computations that keeps intermediate data in memory, while HaLoop [7] offers an iterative MapReduce interface. However, these frameworks only support specific computation patterns (e.g., looping a series of MapReduce steps), and perform data sharing implicitly for these patterns. They do not provide abstractions for more general reuse, e.g., to let a user load several datasets into memory and run ad-hoc queries across them. In this paper, we propose a new abstraction called resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications. RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators. The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently. Existing abstractions for in-memory storage on clusters, such as distributed shared memory [24], key-value stores [25], databases, and Piccolo [27], offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). With this interface, the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead.

49 Spark Context Spark 2 Spark Context Overall idea Simple example Scheduling More Examples

50 Spark Context Resilient Distributed Datasets Existing frameworks are a poor fit for: Iterative algorithms Interactive data mining

52 Spark Context Resilient Distributed Datasets Both things can be sped up orders of magnitude by keeping stuff in memory!

53 Spark Overall idea Spark 2 Spark Context Overall idea Simple example Scheduling More Examples

54 Spark Overall idea Spark: Overall idea RDDs are implemented as lightweight objects inside of a Scala shell, where each object represents a sequence of deterministic transformations on some data. RDDs can only be constructed in the Scala shell by referencing some files on disk (usually HDFS), or by operations on other RDDs.

56 Spark Overall idea Spark: Overall idea (Example)
val lines = sc.textFile("hdfs://path/to/log")
val errors = lines.filter(_.startsWith("ERROR"))
val timestamps = errors.filter(_.contains("HDFS"))
                       .map(_.split("\t")(3))

57 Spark Overall idea Spark: Overall idea An RDD is a read-only, partitioned collection of records.

58 Spark Overall idea Spark: Overall idea (Map) Narrow dependencies: map, filter. (MapReduce's Map corresponds to flatMap in Spark.)

59 Spark Overall idea Spark: Overall idea (Reduce) Wide dependencies: groupByKey. (MapReduce's Reduce corresponds to reduceByKey in Spark.)

60 Spark Overall idea Spark: Overall idea (Transformations) RDD operations are coarse-grained and high level, like map, sample, filter, reduceByKey, etc. RDDs are evaluated lazily: data is processed as late as possible. The RDD simply records the high-level operation order, dependencies, and data partitions.
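
The lazy-evaluation idea can be sketched in a few lines: transformations only append to a lineage chain, and an action walks that chain to compute. This is an illustrative toy (a hypothetical ToyRDD class, not Spark's API):

```python
class ToyRDD:
    """Records transformations lazily; computes only when an action runs."""

    def __init__(self, data=None, parent=None, op=None):
        self.data, self.parent, self.op = data, parent, op

    def map(self, f):            # transformation: just records lineage
        return ToyRDD(parent=self, op=("map", f))

    def filter(self, f):         # transformation: just records lineage
        return ToyRDD(parent=self, op=("filter", f))

    def _compute(self):
        if self.parent is None:
            return list(self.data)
        rows = self.parent._compute()   # recompute from lineage
        kind, f = self.op
        if kind == "map":
            return [f(x) for x in rows]
        return [x for x in rows if f(x)]

    def collect(self):           # action: materializes the dataset
        return self._compute()

evens = ToyRDD(data=range(6)).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
# No data has been touched yet; collect() forces the whole chain.
# evens.collect() == [0, 20, 40]
```

Because computing always means replaying the lineage from stable input, a lost partition can be rebuilt the same way, which is exactly Spark's fault-tolerance story.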

62 Spark Overall idea Spark: Overall idea (Transformations) Figure 1: Lineage graph for the third query in our example. Boxes represent RDDs and arrows represent transformations: lines -> filter(_.startsWith("ERROR")) -> errors -> filter(_.contains("HDFS")) -> HDFS errors -> map(_.split("\t")(3)) -> time fields.

63 Spark Overall idea Spark: Overall idea Table 2: Transformations and actions available on RDDs in Spark. Seq[T] denotes a sequence of elements of type T.
Transformations:
  map(f: T => U): RDD[T] => RDD[U]
  filter(f: T => Bool): RDD[T] => RDD[T]
  flatMap(f: T => Seq[U]): RDD[T] => RDD[U]
  sample(fraction: Float): RDD[T] => RDD[T] (deterministic sampling)
  groupByKey(): RDD[(K, V)] => RDD[(K, Seq[V])]
  reduceByKey(f: (V, V) => V): RDD[(K, V)] => RDD[(K, V)]
  union(): (RDD[T], RDD[T]) => RDD[T]
  join(): (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (V, W))]
  cogroup(): (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (Seq[V], Seq[W]))]
  crossProduct(): (RDD[T], RDD[U]) => RDD[(T, U)]
  mapValues(f: V => W): RDD[(K, V)] => RDD[(K, W)] (preserves partitioning)
  sort(c: Comparator[K]): RDD[(K, V)] => RDD[(K, V)]
  partitionBy(p: Partitioner[K]): RDD[(K, V)] => RDD[(K, V)]
Actions:
  count(): RDD[T] => Long
  collect(): RDD[T] => Seq[T]
  reduce(f: (T, T) => T): RDD[T] => T
  lookup(k: K): RDD[(K, V)] => Seq[V] (on hash/range partitioned RDDs)
  save(path: String): outputs RDD to a storage system, e.g., HDFS

64 Spark Overall idea Spark: Overall idea (Tuning) A user can indicate which RDDs may get reused and should be persisted (usually in-memory, but can be spilled to disk via priority) (persist or cache). A user can also configure the partitioning of the data, and the algorithm through which nodes are chosen by key (helps join efficiency, etc).

66 Spark Overall idea Spark: Overall idea (Actions) Once an RDD has been constructed with appropriate transformations, the user can perform an action. Actions materialize an RDD, fire up the job scheduler, and cause work to be performed. Actions include count, collect, reduce, save.

70 Spark Overall idea Spark: Overall idea RDDs are immutable and deterministic. It is easy to launch backup workers for stragglers (a clear win over Map Reduce) and easy to relaunch computations from failed nodes. Computations are performed lazily, which allows rich locality and partitioning optimizations by the job scheduler and query planner.

74 Spark Simple example Spark 2 Spark Context Overall idea Simple example Scheduling More Examples

75 Spark Simple example Example: Log querying
val lines = sc.textFile("hdfs://path/to/log")
val errors = lines.filter(_.startsWith("ERROR"))
errors.cache()

76 Spark Simple example Example: Log querying errors.count()

77 Spark Simple example Example: Log querying errors.filter(_.contains("MySQL")).count()

78 Spark Simple example Example: Log querying
errors.filter(_.contains("HDFS"))
      .map(_.split("\t")(3))
      .collect()
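
The three queries above have direct counterparts over a plain in-memory list; this Python sketch (hypothetical log lines with tab-separated fields) shows what each Spark expression computes:

```python
log = [
    "ERROR HDFS\tns\tnode1\t12:01",   # field 3 is a timestamp
    "INFO all good",
    "ERROR MySQL down",
]

errors = [line for line in log if line.startswith("ERROR")]      # filter + cache
n_errors = len(errors)                                           # errors.count()
n_mysql = len([l for l in errors if "MySQL" in l])               # filter + count
hdfs_fields = [l.split("\t")[3] for l in errors if "HDFS" in l]  # filter + map + collect
# n_errors == 2, n_mysql == 1, hdfs_fields == ["12:01"]
```

The difference in Spark is that `errors` is cached once across the cluster, so the second and third queries reuse it without rereading HDFS.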

79 Spark Simple example Example: Log querying Figure 1: Lineage graph for the third query in our example. Boxes represent RDDs and arrows represent transformations: lines -> filter(_.startsWith("ERROR")) -> errors -> filter(_.contains("HDFS")) -> HDFS errors -> map(_.split("\t")(3)) -> time fields.

80 Spark Scheduling Spark 2 Spark Context Overall idea Simple example Scheduling More Examples

81 Spark Scheduling RDD Representation Table 3: Interface used to represent RDDs in Spark.
  partitions(): return a list of Partition objects
  preferredLocations(p): list nodes where partition p can be accessed faster due to data locality
  dependencies(): return a list of dependencies
  iterator(p, parentIters): compute the elements of partition p given iterators for its parent partitions
  partitioner(): return metadata specifying whether the RDD is hash/range partitioned

82 Spark Scheduling Narrow vs Wide Dependencies Figure 4: Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles. Narrow dependencies: map, filter, union, join with inputs co-partitioned. Wide dependencies: groupByKey, join with inputs not co-partitioned.

83 Spark Scheduling Job Scheduling Figure 5: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. Stage boundaries fall at wide dependencies: Stage 1 runs A -> groupByKey -> B; Stage 2 pipelines C -> map -> D and D, E -> union -> F; Stage 3 joins B and F into G.
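
The stage-construction rule behind Figure 5 is simple to state: walk the lineage backwards, pipelining across narrow dependencies and cutting a stage boundary at every wide (shuffle) dependency. A toy sketch of that rule (hypothetical lineage encoding, not Spark's scheduler):

```python
def stage_of(rdd, deps):
    """RDDs pipelined into the same stage as `rdd`: follow narrow
    dependencies; stop at wide ones, which mark shuffle boundaries."""
    members = {rdd}
    for parent, kind in deps.get(rdd, []):
        if kind == "narrow":
            members |= stage_of(parent, deps)
    return members

# Roughly Figure 5's DAG: B = A.groupByKey() (wide); D = C.map() (narrow);
# F = D.union(E) (narrow); G = B.join(F), narrow on B (co-partitioned),
# wide on F.
deps = {
    "B": [("A", "wide")],
    "D": [("C", "narrow")],
    "F": [("D", "narrow"), ("E", "narrow")],
    "G": [("B", "narrow"), ("F", "wide")],
}
# Stage with G also pulls in B; the stage with F pipelines C, D, E, F.
```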

84 Spark More Examples Spark 2 Spark Context Overall idea Simple example Scheduling More Examples

85 Spark More Examples Example: Word Count
val m = documents.flatMap(
  _._2.split("\\s+").map(word => (word, "1")))
val r = m.reduceByKey(
  (a, b) => (a.toInt + b.toInt).toString)
// do we need a combine step?

86 Spark More Examples Example: Reverse Index
val m = documents.flatMap(
  doc => doc._2.split("\\s+").distinct.map(word => (word, doc._1)))
val r = m.reduceByKey((a, b) => a + "\n" + b)

87 Spark More Examples Example: Grep
def search(keyword: String) =
  documents.filter(_._2.contains(keyword))
           .map(_._1)
           .reduce((a, b) => a + "\n" + b)

88 Spark More Examples Example: Page Rank
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    (url, (links, rank)) => links.map(dest => (dest, rank / links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x, y) => x + y)
                  .mapValues(sum => a / N + (1 - a) * sum)
}
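
To check the dataflow, here is the same loop run locally in Python on a hypothetical three-page graph, with a = 0.15 in the slide's a/N + (1 - a) * sum update:

```python
N = 3
a = 0.15
links = {"x": ["y", "z"], "y": ["z"], "z": ["x"]}   # url -> outgoing links
ranks = {url: 1.0 / N for url in links}

for _ in range(20):
    # join(links, ranks) + flatMap: each page sends rank/|outlinks| to each target
    contribs = {}
    for url, outs in links.items():
        for dest in outs:
            contribs[dest] = contribs.get(dest, 0.0) + ranks[url] / len(outs)
    # reduceByKey(_ + _) happened in the dict above; now mapValues:
    ranks = {url: a / N + (1 - a) * contribs.get(url, 0.0) for url in links}
# Ranks stay a probability distribution, and "z" (two in-links) outranks "y".
```

In Spark, `links` is persisted once and rejoined against a new `ranks` RDD every iteration; this reuse is exactly the pattern MapReduce handles badly.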

89 Spark More Examples Example: Page Rank Figure 3: Lineage graph for datasets in PageRank: input file -> map -> links; links and ranks0 -> join -> contribs0 -> reduce + map -> ranks1 -> contribs1 -> ranks2 -> ...

90 Conclusion? Outline 1 Map Reduce 2 Spark 3 Conclusion?

91 Conclusion? Conclusion? 3 Conclusion?

92 Conclusion? Conclusion? Can anything really be dead? Is Map Reduce dead? Is Spark the last word? Questions?

96 Meetup Wrap-up Shameless plug

97 Meetup Wrap-up Shameless plug Space Monkey! Large scale storage Distributed system design and implementation Security and cryptography engineering Erasure codes Monitoring and sooo much data

98 Meetup Wrap-up Shameless plug Space Monkey! Come work with us!


More information

Data-intensive computing systems

Data-intensive computing systems Data-intensive computing systems University of Verona Computer Science Department Damiano Carra Acknowledgements q Credits Part of the course material is based on slides provided by the following authors

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 26: Spark CSE 414 - Spring 2017 1 HW8 due next Fri Announcements Extra office hours today: Rajiv @ 6pm in CSE 220 No lecture Monday (holiday) Guest lecture Wednesday Kris

More information

2/4/2019 Week 3- A Sangmi Lee Pallickara

2/4/2019 Week 3- A Sangmi Lee Pallickara Week 3-A-0 2/4/2019 Colorado State University, Spring 2019 Week 3-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING SECTION 1: MAPREDUCE PA1

More information

Fault Tolerant Distributed Main Memory Systems

Fault Tolerant Distributed Main Memory Systems Fault Tolerant Distributed Main Memory Systems CompSci 590.04 Instructor: Ashwin Machanavajjhala Lecture 16 : 590.04 Fall 15 1 Recap: Map Reduce! map!!,!! list!!!!reduce!!, list(!! )!! Map Phase (per record

More information

Part II: Software Infrastructure in Data Centers: Distributed Execution Engines

Part II: Software Infrastructure in Data Centers: Distributed Execution Engines CSE 6350 File and Storage System Infrastructure in Data centers Supporting Internet-wide Services Part II: Software Infrastructure in Data Centers: Distributed Execution Engines 1 MapReduce: Simplified

More information

Spark. Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

Spark. Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica. Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica UC Berkeley Background MapReduce and Dryad raised level of abstraction in cluster

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

Parallel Computing: MapReduce Jin, Hai

Parallel Computing: MapReduce Jin, Hai Parallel Computing: MapReduce Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! MapReduce is a distributed/parallel computing framework introduced by Google

More information

Survey on Incremental MapReduce for Data Mining

Survey on Incremental MapReduce for Data Mining Survey on Incremental MapReduce for Data Mining Trupti M. Shinde 1, Prof.S.V.Chobe 2 1 Research Scholar, Computer Engineering Dept., Dr. D. Y. Patil Institute of Engineering &Technology, 2 Associate Professor,

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

L22: SC Report, Map Reduce

L22: SC Report, Map Reduce L22: SC Report, Map Reduce November 23, 2010 Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance Google version = Map Reduce; Hadoop = Open source

More information

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414 Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s

More information

Dell In-Memory Appliance for Cloudera Enterprise

Dell In-Memory Appliance for Cloudera Enterprise Dell In-Memory Appliance for Cloudera Enterprise Spark Technology Overview and Streaming Workload Use Cases Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/

More information

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Distributed Computations MapReduce. adapted from Jeff Dean s slides Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)

More information

CS Spark. Slides from Matei Zaharia and Databricks

CS Spark. Slides from Matei Zaharia and Databricks CS 5450 Spark Slides from Matei Zaharia and Databricks Goals uextend the MapReduce model to better support two common classes of analytics apps Iterative algorithms (machine learning, graphs) Interactive

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece

More information

Parallel Programming Concepts

Parallel Programming Concepts Parallel Programming Concepts MapReduce Frank Feinbube Source: MapReduce: Simplied Data Processing on Large Clusters; Dean et. Al. Examples for Parallel Programming Support 2 MapReduce 3 Programming model

More information

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences L3: Spark & RDD Department of Computational and Data Science, IISc, 2016 This

More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

a Spark in the cloud iterative and interactive cluster computing

a Spark in the cloud iterative and interactive cluster computing a Spark in the cloud iterative and interactive cluster computing Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica UC Berkeley Background MapReduce and Dryad raised level of

More information

/ Cloud Computing. Recitation 13 April 14 th 2015

/ Cloud Computing. Recitation 13 April 14 th 2015 15-319 / 15-619 Cloud Computing Recitation 13 April 14 th 2015 Overview Last week s reflection Project 4.1 Budget issues Tagging, 15619Project This week s schedule Unit 5 - Modules 18 Project 4.2 Demo

More information

15.1 Data flow vs. traditional network programming

15.1 Data flow vs. traditional network programming CME 323: Distributed Algorithms and Optimization, Spring 2017 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 15, 5/22/2017. Scribed by D. Penner, A. Shoemaker, and

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

Massive Online Analysis - Storm,Spark

Massive Online Analysis - Storm,Spark Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R

More information

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters

More information

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create

More information

Map Reduce Group Meeting

Map Reduce Group Meeting Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for

More information

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later

More information

Introduction to MapReduce Algorithms and Analysis

Introduction to MapReduce Algorithms and Analysis Introduction to MapReduce Algorithms and Analysis Jeff M. Phillips October 25, 2013 Trade-Offs Massive parallelism that is very easy to program. Cheaper than HPC style (uses top of the line everything)

More information

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica UC BERKELEY Motivation Many important

More information

Concurrency for data-intensive applications

Concurrency for data-intensive applications Concurrency for data-intensive applications Dennis Kafura CS5204 Operating Systems 1 Jeff Dean Sanjay Ghemawat Dennis Kafura CS5204 Operating Systems 2 Motivation Application characteristics Large/massive

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce

More information

6.830 Lecture Spark 11/15/2017

6.830 Lecture Spark 11/15/2017 6.830 Lecture 19 -- Spark 11/15/2017 Recap / finish dynamo Sloppy Quorum (healthy N) Dynamo authors don't think quorums are sufficient, for 2 reasons: - Decreased durability (want to write all data at

More information

/ Cloud Computing. Recitation 13 April 12 th 2016

/ Cloud Computing. Recitation 13 April 12 th 2016 15-319 / 15-619 Cloud Computing Recitation 13 April 12 th 2016 Overview Last week s reflection Project 4.1 Quiz 11 Budget issues Tagging, 15619Project This week s schedule Unit 5 - Modules 21 Project 4.2

More information

Apache Spark Internals

Apache Spark Internals Apache Spark Internals Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Apache Spark Internals 1 / 80 Acknowledgments & Sources Sources Research papers: https://spark.apache.org/research.html Presentations:

More information

Deep Learning Comes of Age. Presenter: 齐琦 海南大学

Deep Learning Comes of Age. Presenter: 齐琦 海南大学 Deep Learning Comes of Age Presenter: 齐琦 海南大学 原文引用 Anthes, G. (2013). "Deep learning comes of age." Commun. ACM 56(6): 13-15. AI/Machine learning s new technology deep learning --- multilayer artificial

More information

An Introduction to Apache Spark Big Data Madison: 29 July William Red Hat, Inc.

An Introduction to Apache Spark Big Data Madison: 29 July William Red Hat, Inc. An Introduction to Apache Spark Big Data Madison: 29 July 2014 William Benton @willb Red Hat, Inc. About me At Red Hat for almost 6 years, working on distributed computing Currently contributing to Spark,

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

MapReduce: A Programming Model for Large-Scale Distributed Computation

MapReduce: A Programming Model for Large-Scale Distributed Computation CSC 258/458 MapReduce: A Programming Model for Large-Scale Distributed Computation University of Rochester Department of Computer Science Shantonu Hossain April 18, 2011 Outline Motivation MapReduce Overview

More information

Batch Processing Basic architecture

Batch Processing Basic architecture Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3

More information

Evolution From Shark To Spark SQL:

Evolution From Shark To Spark SQL: Evolution From Shark To Spark SQL: Preliminary Analysis and Qualitative Evaluation Xinhui Tian and Xiexuan Zhou Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese

More information

/ Cloud Computing. Recitation 13 April 17th 2018

/ Cloud Computing. Recitation 13 April 17th 2018 15-319 / 15-619 Cloud Computing Recitation 13 April 17th 2018 Overview Last week s reflection Team Project Phase 2 Quiz 11 OLI Unit 5: Modules 21 & 22 This week s schedule Project 4.2 No more OLI modules

More information

Why compute in parallel?

Why compute in parallel? HW 6 releases tonight Announcements Due Nov. 20th Waiting for AWS credit can take up to two days Sign up early: https://aws.amazon.com/education/awseducate/apply/ https://piazza.com/class/jmftm54e88t2kk?cid=452

More information

Chapter 4: Apache Spark

Chapter 4: Apache Spark Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,

More information

Motivation. Map in Lisp (Scheme) Map/Reduce. MapReduce: Simplified Data Processing on Large Clusters

Motivation. Map in Lisp (Scheme) Map/Reduce. MapReduce: Simplified Data Processing on Large Clusters Motivation MapReduce: Simplified Data Processing on Large Clusters These are slides from Dan Weld s class at U. Washington (who in turn made his slides based on those by Jeff Dean, Sanjay Ghemawat, Google,

More information

Shark: Hive on Spark

Shark: Hive on Spark Optional Reading (additional material) Shark: Hive on Spark Prajakta Kalmegh Duke University 1 What is Shark? Port of Apache Hive to run on Spark Compatible with existing Hive data, metastores, and queries

More information

Lecture 22: Spark. (leveraging bulk-granularity program structure) Parallel Computer Architecture and Programming CMU /15-618, Spring 2017

Lecture 22: Spark. (leveraging bulk-granularity program structure) Parallel Computer Architecture and Programming CMU /15-618, Spring 2017 Lecture 22: Spark (leveraging bulk-granularity program structure) Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Tunes Yeah Yeah Yeahs Sacrilege (Mosquito) In-memory performance

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

MapReduce. Kiril Valev LMU Kiril Valev (LMU) MapReduce / 35

MapReduce. Kiril Valev LMU Kiril Valev (LMU) MapReduce / 35 MapReduce Kiril Valev LMU valevk@cip.ifi.lmu.de 23.11.2013 Kiril Valev (LMU) MapReduce 23.11.2013 1 / 35 Agenda 1 MapReduce Motivation Definition Example Why MapReduce? Distributed Environment Fault Tolerance

More information

Introduction to Map Reduce

Introduction to Map Reduce Introduction to Map Reduce 1 Map Reduce: Motivation We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Analytics in Spark Yanlei Diao Tim Hunter Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Outline 1. A brief history of Big Data and Spark 2. Technical summary of Spark 3. Unified analytics

More information

Distributed Computing with Spark and MapReduce

Distributed Computing with Spark and MapReduce Distributed Computing with Spark and MapReduce Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How

More information

Machine learning library for Apache Flink

Machine learning library for Apache Flink Machine learning library for Apache Flink MTP Mid Term Report submitted to Indian Institute of Technology Mandi for partial fulfillment of the degree of B. Tech. by Devang Bacharwar (B2059) under the guidance

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

Integration of Machine Learning Library in Apache Apex

Integration of Machine Learning Library in Apache Apex Integration of Machine Learning Library in Apache Apex Anurag Wagh, Krushika Tapedia, Harsh Pathak Vishwakarma Institute of Information Technology, Pune, India Abstract- Machine Learning is a type of artificial

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

CS535 Big Data Fall 2017 Colorado State University 9/19/2017 Sangmi Lee Pallickara Week 5- A.

CS535 Big Data Fall 2017 Colorado State University   9/19/2017 Sangmi Lee Pallickara Week 5- A. http://wwwcscolostateedu/~cs535 9/19/2016 CS535 Big Data - Fall 2017 Week 5-A-1 CS535 BIG DATA FAQs PART 1 BATCH COMPUTING MODEL FOR BIG DATA ANALYTICS 3 IN-MEMORY CLUSTER COMPUTING Computer Science, Colorado

More information

2

2 2 3 4 5 Prepare data for the next iteration 6 7 >>>input_rdd = sc.textfile("text.file") >>>transform_rdd = input_rdd.filter(lambda x: "abcd" in x) >>>print "Number of abcd :" + transform_rdd.count() 8

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Fall 2016 1 HW8 is out Last assignment! Get Amazon credits now (see instructions) Spark with Hadoop Due next wed CSE 344 - Fall 2016

More information

Parallel Nested Loops

Parallel Nested Loops Parallel Nested Loops For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on (S 1,T 1 ), (S 1,T 2 ),

More information

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011 Parallel Nested Loops Parallel Partition-Based For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on

More information

1. Introduction to MapReduce

1. Introduction to MapReduce Processing of massive data: MapReduce 1. Introduction to MapReduce 1 Origins: the Problem Google faced the problem of analyzing huge sets of data (order of petabytes) E.g. pagerank, web access logs, etc.

More information

Spark.jl: Resilient Distributed Datasets in Julia

Spark.jl: Resilient Distributed Datasets in Julia Spark.jl: Resilient Distributed Datasets in Julia Dennis Wilson, Martín Martínez Rivera, Nicole Power, Tim Mickel [dennisw, martinmr, npower, tmickel]@mit.edu INTRODUCTION 1 Julia is a new high level,

More information

Introduction to MapReduce

Introduction to MapReduce 732A54 Big Data Analytics Introduction to MapReduce Christoph Kessler IDA, Linköping University Towards Parallel Processing of Big-Data Big Data too large to be read+processed in reasonable time by 1 server

More information

Spark & Spark SQL. High- Speed In- Memory Analytics over Hadoop and Hive Data. Instructor: Duen Horng (Polo) Chau

Spark & Spark SQL. High- Speed In- Memory Analytics over Hadoop and Hive Data. Instructor: Duen Horng (Polo) Chau CSE 6242 / CX 4242 Data and Visual Analytics Georgia Tech Spark & Spark SQL High- Speed In- Memory Analytics over Hadoop and Hive Data Instructor: Duen Horng (Polo) Chau Slides adopted from Matei Zaharia

More information

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation Analytic Cloud with Shelly Garion IBM Research -- Haifa 2014 IBM Corporation Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable

More information

CDS. André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 de Données astronomiques de Strasbourg, 2SSC-XMM-Newton

CDS. André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 de Données astronomiques de Strasbourg, 2SSC-XMM-Newton Docker @ CDS André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 1Centre de Données astronomiques de Strasbourg, 2SSC-XMM-Newton Paul Trehiou Université de technologie de Belfort-Montbéliard

More information

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin Presented by: Suhua Wei Yong Yu Papers: MapReduce: Simplified Data Processing on Large Clusters 1 --Jeffrey Dean

More information

Google: A Computer Scientist s Playground

Google: A Computer Scientist s Playground Google: A Computer Scientist s Playground Jochen Hollmann Google Zürich und Trondheim joho@google.com Outline Mission, data, and scaling Systems infrastructure Parallel programming model: MapReduce Googles

More information

Google: A Computer Scientist s Playground

Google: A Computer Scientist s Playground Outline Mission, data, and scaling Google: A Computer Scientist s Playground Jochen Hollmann Google Zürich und Trondheim joho@google.com Systems infrastructure Parallel programming model: MapReduce Googles

More information

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Page 1. Goals for Today Background of Cloud Computing Sources Driving Big Data CS162 Operating Systems and Systems Programming Lecture 24 Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony

More information

Distributed Systems. 18. MapReduce. Paul Krzyzanowski. Rutgers University. Fall 2015

Distributed Systems. 18. MapReduce. Paul Krzyzanowski. Rutgers University. Fall 2015 Distributed Systems 18. MapReduce Paul Krzyzanowski Rutgers University Fall 2015 November 21, 2016 2014-2016 Paul Krzyzanowski 1 Credit Much of this information is from Google: Google Code University [no

More information

CS 345A Data Mining. MapReduce

CS 345A Data Mining. MapReduce CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes

More information

Spark and Spark SQL. Amir H. Payberah. SICS Swedish ICT. Amir H. Payberah (SICS) Spark and Spark SQL June 29, / 71

Spark and Spark SQL. Amir H. Payberah. SICS Swedish ICT. Amir H. Payberah (SICS) Spark and Spark SQL June 29, / 71 Spark and Spark SQL Amir H. Payberah amir@sics.se SICS Swedish ICT Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 1 / 71 What is Big Data? Amir H. Payberah (SICS) Spark and Spark SQL June 29,

More information

Cloud, Big Data & Linear Algebra

Cloud, Big Data & Linear Algebra Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation What is Big Data? 2 Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3 Global Data Volume in Exabytes

More information

MapReduce-style data processing

MapReduce-style data processing MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic

More information