Databases 2 (VU)


1 Databases 2 (VU) Map-Reduce Denis Helic, KMI, TU Graz, Nov 4, 2013

2 Outline 1 Motivation 2 Large Scale Computation 3 Map-Reduce 4 Environment 5 Map-Reduce Skew 6 Map-Reduce Examples. Slides are partially based on the Mining Massive Datasets course from Stanford University by Jure Leskovec.

3 Motivation Today's data is huge. Challenges: How to distribute computation? Distributed/parallel programming is hard. Map-reduce addresses both of these points: it is Google's computational/data manipulation model, and an elegant way to work with huge data.

4 Motivation Single node architecture: one machine with CPU, memory, and disk. Data fits in main memory: machine learning, statistics, classical data mining.

5 Motivation Motivation: Google example 20+ billion Web pages, approx. 20 KB per page. Approx. 400 TB for the whole Web. Approx. 1,000 hard drives to store the Web. A single computer reads 30-35 MB/s from disk, so it takes approx. 4 months to read the Web with a single computer.

6 Motivation Motivation: Google example It takes even more time to do something with the data, e.g. to calculate the PageRank. If m is the number of links on the Web: the average degree on the Web is approx. 10, thus m is approx. 2 x 10^11. To calculate PageRank we need m multiplications per iteration step, and we need approx. 50-100 iteration steps.

7 Motivation Motivation: Google example Today a standard architecture for such problems is emerging: a cluster of commodity Linux nodes with a commodity network (Ethernet) to connect them. 2-10 Gbps between racks, 1 Gbps within racks. Each rack contains 16-64 nodes.

8 Motivation Cluster architecture: each rack holds commodity nodes (CPU, memory, disk) connected by a rack switch; the rack switches are connected by a backbone switch.

9 Motivation Motivation: Google example 2011 estimation: Google had approx. 1 million machines. Other examples: Facebook, Twitter, Amazon, etc. But also smaller examples, e.g. Wikipedia. Single source shortest path: O(m + n) time complexity.

10 Large Scale Computation Large scale computation for data mining on commodity hardware. Challenges: How to distribute computation? How can we make it easy to write distributed programs? How do we cope with machine failures?

11 Large Scale Computation Large scale computation: machine failures One server may stay up 3 years (1000 days). The failure rate per day is then p = 10^-3. How many failures per day if we have n machines?

12 Large Scale Computation Large scale computation: machine failures One server may stay up 3 years (1000 days). The failure rate per day is then p = 10^-3. How many failures per day if we have n machines? The number of failures is a Binomial r.v.

13 Large Scale Computation Large scale computation: machine failures PMF of a Binomial r.v.: p(k) = C(n, k) p^k (1 - p)^(n-k). Expectation of a Binomial r.v.: E[X] = np

14 Large Scale Computation Large scale computation: machine failures n = 1000: E[X] = 1, i.e. if we have 1000 machines we lose one per day. n = 10^6: E[X] = 1000, i.e. if we have 1 million machines (Google) we lose 1 thousand per day.
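The expected-failure numbers above follow directly from E[X] = np and can be checked with a short Python sketch (the failure rate p = 10^-3 is the slide's assumption):

```python
# Expected number of machine failures per day, modeling the number of
# failures as a Binomial(n, p) random variable with E[X] = n * p.
P_FAIL = 1e-3  # per-machine daily failure rate: one failure per ~1000 days

def expected_failures(n, p=P_FAIL):
    # Expectation of a Binomial(n, p) random variable
    return n * p

print(expected_failures(1_000))      # ~1 failure/day with 1000 machines
print(expected_failures(1_000_000))  # ~1000 failures/day with a million machines
```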

15 Large Scale Computation Large scale computation: data copying Copying data over network takes time Bring data closer to computation I.e. process data locally at each node Replicate data to increase reliability

16 Map-Reduce Solution: Map-reduce Storage infrastructure: a distributed file system (Google: GFS, Hadoop: HDFS). Programming model: Map-reduce

17 Map-Reduce Storage infrastructure Problem: if a node fails, how do we store a file persistently? Distributed file system: provides a global file namespace. Typical usage pattern: huge files (several 100s of GB to 1 TB); data is rarely updated in place; reads and appends are common.

18 Map-Reduce Distributed file system Chunk servers: a file is split into contiguous chunks, typically 16-64 MB each. Each chunk is replicated (usually 2x or 3x). Try to keep replicas in different racks.

19 Map-Reduce Distributed file system Master node: stores metadata about where files are stored; might be replicated. Client library for file access: talks to the master node to find chunk servers, then connects directly to chunk servers to access data.

20 Map-Reduce Distributed file system Reliable distributed file system: data kept in chunks spread across machines, each chunk replicated on different machines, seamless recovery from disk or machine failure. Bring computation directly to the data: chunk servers also serve as compute servers. Figure: Figure from slides by Jure Leskovec

21 Map-Reduce Programming model: Map-reduce Running example: we want to count the number of occurrences of each word in a collection of documents. In this example, the input file is a repository of documents, and each document is an element. This is by now a standard Map-reduce example.

22 Map-Reduce Programming model: Map-reduce words input_file | sort | uniq -c Three step process: 1 Split the file into words, each word on a separate line 2 Group and sort all words 3 Count the occurrences
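The Unix pipeline above can be mimicked in a few lines of Python (a sketch; splitting on whitespace stands in for the `words` step):

```python
from collections import Counter

def word_count(text):
    # 1. "words": split the input into words, one per "line"
    words = text.lower().split()
    # 2 + 3. "sort | uniq -c": group equal words and count them;
    # Counter does the grouping, sorted() mirrors the sorted output
    return sorted(Counter(words).items())

print(word_count("the quick fox jumps over the lazy fox"))
# [('fox', 2), ('jumps', 1), ('lazy', 1), ('over', 1), ('quick', 1), ('the', 2)]
```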

23 Map-Reduce Programming model: Map-reduce This captures the essence of Map-reduce Split Group Count Naturally parallelizable E.g. split and count

24 Map-Reduce Programming model: Map-reduce Sequentially read a lot of data. Map: extract something that you care about (key, value). Group by key: sort and shuffle. Reduce: aggregate, summarize, filter or transform. Write the result. The outline is always the same: Map and Reduce change to fit the problem.

25 Map-Reduce Programming model: Map-reduce All key-value pairs with the same key wind up at the same Reduce task. The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way; the manner of combination is determined by the code written by the user for the Reduce function. Schematic of a map-reduce computation: input chunks -> Map tasks -> key-value pairs (k, v) -> group by keys -> keys with all their values (k, [v, w, ...]) -> Reduce tasks -> combined output. Figure: Figure from the book Mining Massive Datasets. We view input files for a Map task as consisting of elements, which can be any type: a tuple or a document, for example. A chunk is a collection of elements, and no element is stored across two chunks.

26 Map-Reduce Programming model: Map-reduce Input: a set of (key, value) pairs (e.g. key is the filename, value is a single line in the file). Map(k, v) -> list of (k', v'): takes a (k, v) pair and outputs a set of (k', v') pairs; there is one Map call for each (k, v) pair. Reduce(k', [v', ...]) -> (k', v''): all values v' with the same key k' are reduced together and processed in v' order; there is one Reduce call for each unique key k'.

27 Map-Reduce Programming model: Map-reduce Big Document Map Group by key Reduce Star Wars is an American epic space opera franchise centered on a film series created by George Lucas. The film series has spawned an extensive media franchise called the Expanded Universe including books, television series, computer and video games, and comic books. These supplements to the two film trilogies... (Star, 1) (Wars, 1) (is, 1) (an, 1) (American, 1) (epic, 1) (space, 1) (opera, 1) (franchise, 1) (centered, 1) (on, 1) (a, 1) (film, 1) (series, 1) (created, 1) (by, 1) (Star, 1) (Star, 1) (Wars, 1) (Wars, 1) (a, 1) (a, 1) (a, 1) (a, 1) (a, 1) (a, 1) (film, 1) (film, 1) (film, 1) (franchise, 1) (series, 1) (series, 1) (Star, 2) (Wars, 2) (a, 6) (film, 3) (franchise, 1) (series, 2)

28 Map-Reduce Programming model: Map-reduce
map(key, value):
    // key: document name
    // value: a single line from a document
    foreach word w in value:
        emit(w, 1)

29 Map-Reduce Programming model: Map-reduce
reduce(key, values):
    // key: a word
    // values: an iterator over counts
    result = 0
    foreach count c in values:
        result += c
    emit(key, result)
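The two pseudocode functions can be wired together in a minimal single-process sketch of the map / group-by-key / reduce pipeline (plain Python, no framework):

```python
from collections import defaultdict

def map_fn(doc_name, line):
    # emit (word, 1) for every word on the line
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # sum the per-word counts
    return (key, sum(values))

def map_reduce(inputs):
    # group-by-key ("shuffle") between the map and reduce phases
    groups = defaultdict(list)
    for name, line in inputs:
        for k, v in map_fn(name, line):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = map_reduce([("doc1", "a film a series"), ("doc2", "a film")])
print(counts)  # {'a': 3, 'film': 2, 'series': 1}
```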

30 Environment Map-reduce computation The map-reduce environment takes care of: 1 Partitioning the input data 2 Scheduling the program's execution across a set of machines 3 Performing the group-by-key step 4 Handling machine failures 5 Managing required inter-machine communication

31 Environment Map-reduce computation Map: reads the input and produces a set of key-value pairs. Group by key: collects all pairs with the same key (hash merge, shuffle, sort, partition). Reduce: collects all values belonging to the key and outputs the result. Figure: Figure from the course by Jure Leskovec (Stanford University)

32 Environment Map-reduce computation All phases are distributed, with many tasks doing the work. Figure: Figure from the course by Jure Leskovec (Stanford University)

33 Environment Data flow Input and final output are stored in distributed file system Scheduler tries to schedule map tasks close to physical storage location of input data Intermediate results are stored on local file systems of Map and Reduce workers Output is often input to another Map-reduce computation

34 Environment Coordination: Master The master node takes care of coordination. Task status: e.g. idle, in-progress, completed. Idle tasks get scheduled as workers become available. When a Map task completes, it notifies the master about the size and location of its intermediate files, and the master pushes this info to the reducers. The master pings workers periodically to detect failures.

35 Environment Map-reduce execution details Overview of the execution of a map-reduce program: the user program forks a Master and Worker processes; the Master assigns Map tasks and Reduce tasks to Workers; input data flows through the Map workers into intermediate files, and from there through the Reduce workers into the output files. Figure: Figure from the book Mining Massive Datasets. A Worker process reports to the Master when it finishes a task, and a new task is scheduled by the Master for that Worker process. Each Map task is assigned one or more chunks of the input file(s) and executes on them the code written by the user. The Map task creates a file for each Reduce task on the local disk of the Worker that executes the Map task.

36 Map-Reduce Skew Maximizing parallelism If we want maximum parallelism, then: use one Reduce task for each reducer (i.e. a single key and its associated value list); execute each Reduce task at a different compute node. This plan is typically not the best one.

37 Map-Reduce Skew Maximizing parallelism There is overhead associated with each task we create We might want to keep the number of Reduce tasks lower than the number of different keys We do not want to create a task for a key with a short list There are often far more keys than there are compute nodes E.g. count words from Wikipedia or from the Web

38 Map-Reduce Skew Input data skew: Exercise Exercise Suppose we execute the word-count map-reduce program on a large repository such as a copy of the Web. We shall use 100 Map tasks and some number of Reduce tasks. 1 Do you expect there to be significant skew in the times taken by the various reducers to process their value list? Why or why not? 2 If we combine the reducers into a small number of Reduce tasks, say 10 tasks, at random, do you expect the skew to be significant? What if we instead combine the reducers into 10,000 Reduce tasks? The exercise is based on an example from Mining Massive Datasets.

39 Map-Reduce Skew Maximizing parallelism There is often significant variation in the lengths of the value lists for different keys, so different reducers take different amounts of time to finish. If we make each reducer a separate Reduce task, then the task execution times will exhibit significant variance.

40 Map-Reduce Skew Input data skew Input data skew describes an uneven distribution of the number of values per key Examples include power-law graphs, e.g. the Web or Wikipedia Other data with Zipfian distribution E.g. the number of word occurrences

41 Map-Reduce Skew Power-law (Zipf) random variable PMF: p(k) = k^-alpha / zeta(alpha), for k in N, k >= 1, alpha > 1. zeta(alpha) is the Riemann zeta function: zeta(alpha) = sum_{k=1}^infinity k^-alpha

42 Map-Reduce Skew Power-law (Zipf) random variable Figure: probability mass function of a Zipf random variable for differing values of alpha (probability of k vs. k).

43 Map-Reduce Skew Power-law (Zipf) input data Figure: distribution of key sizes; sum = 189681, mean = 1.897, large variance.

44 Map-Reduce Skew Tackling input data skew We need to distribute skewed (power-law) input data onto a number of reducers/Reduce tasks/compute nodes. The distribution of the key lengths inside the reducers/Reduce tasks/compute nodes should be approximately normal, and the variance of these distributions should be smaller than the original variance. If the variance is small, efficient load balancing is possible.

45 Map-Reduce Skew Tackling input data skew Each Reduce task receives a number of keys. The total number of values it processes is the sum of the number of values over all its keys, and the average number of values is the average over all its keys. Equivalently, each compute node receives a number of Reduce tasks; the sum and average for a compute node are the sum and average over all Reduce tasks at that node.

46 Map-Reduce Skew Tackling input data skew How should we distribute keys to Reduce tasks?

47 Map-Reduce Skew Tackling input data skew How should we distribute keys to Reduce tasks? Uniformly at random Other possibilities?

48 Map-Reduce Skew Tackling input data skew How should we distribute keys to Reduce tasks? Uniformly at random Other possibilities? Calculate the capacity of a single Reduce task Add keys until capacity is reached, etc.

49 Map-Reduce Skew Tackling input data skew We are averaging over a skewed distribution. Are there laws that describe how the averages of sufficiently large samples drawn from a probability distribution behave? In other words, how are the averages of samples of a r.v. distributed?

50 Map-Reduce Skew Tackling input data skew We are averaging over a skewed distribution. Are there laws that describe how the averages of sufficiently large samples drawn from a probability distribution behave? In other words, how are the averages of samples of a r.v. distributed? The Central-limit Theorem

51 Map-Reduce Skew Central-Limit Theorem The central-limit theorem describes the distribution of the arithmetic mean of sufficiently large samples of independent and identically distributed random variables. The means are normally distributed. The mean of the new distribution equals the mean of the original distribution, and the variance of the new distribution equals sigma^2 / n, where sigma^2 is the variance of the original distribution. Thus, we keep the mean and reduce the variance.

52 Map-Reduce Skew Central-Limit Theorem Theorem Suppose X_1, ..., X_n are independent and identically distributed r.v. with expectation mu and variance sigma^2. Let Y_n be the r.v. defined as: Y_n = (1/n) sum_{i=1}^n X_i. As n grows, the CDF F_n(y) of Y_n tends to the CDF of a normal r.v. with mean mu and variance sigma^2 / n, i.e. to the integral from -infinity to y of the density (1 / sqrt(2 pi sigma^2 / n)) e^(-(x - mu)^2 / (2 sigma^2 / n)).

53 Map-Reduce Skew Central-Limit Theorem Practically, it is possible to replace F_n(y) with a normal distribution for n > 30, so we should always average over at least 30 values. Example: approximating a uniform r.v. with a normal r.v. by sampling and averaging.
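The uniform-averaging example can be sketched with only the standard library (the sample size 30 and the number of trials are illustrative choices, no plotting):

```python
import random
import statistics

random.seed(42)

# Average n = 30 uniform(0, 1) samples, many times: by the central-limit
# theorem the averages are approximately normal with mean 0.5 and
# variance (1/12)/30, since a uniform(0, 1) r.v. has sigma^2 = 1/12.
n, trials = 30, 2000
averages = [statistics.fmean(random.random() for _ in range(n))
            for _ in range(trials)]

print(round(statistics.fmean(averages), 2))    # close to 0.5
print(statistics.variance(averages) < 1 / 12)  # True: far below sigma^2
```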

54 Map-Reduce Skew Central-Limit Theorem Figure: distribution of the sample averages; mean = 0.5.

55 Map-Reduce Skew Central-Limit Theorem Figure: distribution of the sample averages; mean = 0.499.

56 Map-Reduce Skew Central-Limit Theorem IPython Notebook examples: http://kti.tugraz.at/staff/denis/courses/kddm1/clt.ipynb Command line: ipython notebook --pylab=inline clt.ipynb

57 Map-Reduce Skew Input data skew We can reduce the impact of the skew by using fewer Reduce tasks than there are reducers. If keys are sent randomly to Reduce tasks, we average over value-list lengths, and thus over the total time for each Reduce task (central-limit theorem). We should make sure that the sample size is large enough (n > 30).

58 Map-Reduce Skew Input data skew We can further reduce the skew by using more Reduce tasks than there are compute nodes. Long Reduce tasks might occupy a compute node fully, while several shorter Reduce tasks are executed sequentially at a single compute node. Thus, we average over the total time for each compute node (central-limit theorem). We should make sure that the sample size is large enough (n > 30).
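The averaging effect of assigning skewed keys to tasks at random can be simulated (a sketch; the Zipf-like value-list lengths and the task counts are illustrative assumptions):

```python
import random
import statistics

random.seed(0)

# Skewed (Zipf-like) value-list lengths for 10,000 keys.
lengths = [int(1000 / (k + 1)) + 1 for k in range(10_000)]

def task_loads(num_tasks):
    # Assign each key to a Reduce task uniformly at random and
    # return the total number of values per task.
    loads = [0] * num_tasks
    for length in lengths:
        loads[random.randrange(num_tasks)] += length
    return loads

# The relative spread (coefficient of variation) of the per-task load
# shrinks as each task averages over more keys: fewer tasks, more keys each.
for num_tasks in (1000, 100, 10):
    loads = task_loads(num_tasks)
    cv = statistics.stdev(loads) / statistics.fmean(loads)
    print(num_tasks, round(cv, 2))
```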

59 Map-Reduce Skew Input data skew Figure: distribution of key sizes; sum = 196524, mean = 1.965.

60 Map-Reduce Skew Input data skew Figure: total key size per Reduce task; sum = 196524.

61 Map-Reduce Skew Input data skew Figure: per-task key averages; mean = 1.958.

62 Map-Reduce Skew Input data skew Figure: total key size per compute node (sum = 196524) and per-node key averages (mean = 1.976).

63 Map-Reduce Skew Input data skew IPython Notebook examples: http://kti.tugraz.at/staff/denis/courses/kddm1/mrskew.ipynb Command line: ipython notebook --pylab=inline mrskew.ipynb

64 Map-Reduce Skew Combiners Sometimes a Reduce function is associative and commutative. Commutative: x op y = y op x. Associative: (x op y) op z = x op (y op z). The values can then be combined in any order, with the same result. The addition in the reducer of the word-count example is such an operation.

65 Map-Reduce Skew Combiners When the Reduce function is associative and commutative, we can push some of the reducers' work to the Map tasks. E.g. instead of emitting (w, 1), (w, 1), ... we can apply the Reduce function within the Map task. In that way the output of the Map task is combined before grouping and sorting.
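For the word-count example, a combiner amounts to a local reduce inside each Map task (a plain-Python sketch):

```python
from collections import Counter

def map_with_combiner(doc):
    # Instead of emitting (w, 1) per occurrence, pre-aggregate the counts
    # locally: addition is associative and commutative, so combining
    # inside the Map task cannot change the final result.
    return list(Counter(doc.split()).items())

pairs = map_with_combiner("a film a series a film")
print(sorted(pairs))  # [('a', 3), ('film', 2), ('series', 1)]
```

The downstream group-by-key and Reduce stages stay unchanged; they simply receive far fewer pairs per frequent word.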

66 Map-Reduce Skew Combiners Map Combiner Group by key Reduce (Star, 1) (Wars, 1) (is, 1)... (American, 1) (epic, 1) (space, 1)... (franchise, 1) (centered, 1) (on, 1)... (film, 1) (series, 1) (created, 1) (Star, 2)... (Wars, 2) (a, 6)... (a,3)... (film, 3) (franchise, 1) (series, 2) (Star, 2) (Wars, 2) (a, 6) (a, 3) (a, 4)... (film, 3) (franchise, 1) (series, 2) (Star, 2) (Wars, 2) (a, 13) (film, 3) (franchise, 1) (series, 2)

67 Map-Reduce Skew Input data skew: Exercise Exercise Suppose we execute the word-count map-reduce program on a large repository such as a copy of the Web. We shall use 100 Map tasks and some number of Reduce tasks. 1 Do you expect there to be significant skew in the times taken by the various reducers to process their value list? Why or why not? 2 If we combine the reducers into a small number of Reduce tasks, say 10 tasks, at random, do you expect the skew to be significant? What if we instead combine the reducers into 10,000 Reduce tasks? 3 Suppose we do use a combiner at the 100 Map tasks. Do you expect skew to be significant? Why or why not?

68 Map-Reduce Skew Input data skew Figure: distribution of key sizes; sum = 195279, mean = 1.953.

69 Map-Reduce Skew Input data skew Figure: per-task key averages; mean = 1.793.

70 Map-Reduce Skew Input data skew IPython Notebook examples: combiner.ipynb Command line: ipython notebook --pylab=inline combiner.ipynb

71 Map-Reduce Examples Map-Reduce: Applications Map-reduce computation makes sense when files are large and rarely updated in place. E.g. we will not see map-reduce computation when managing online sales, and we will not see map-reduce for handling Web requests (even if we have millions of users). However, you would want to use map-reduce for analytic queries on the data generated by, e.g., a Web application. E.g. find users with similar buying patterns.

72 Map-Reduce Examples Map-Reduce: Applications Computations such as analytic queries typically involve matrix operations. The original purpose of the map-reduce implementation was to execute large matrix-vector multiplications to calculate the PageRank. Matrix operations such as matrix-matrix and matrix-vector multiplications fit nicely into the map-reduce programming model. Another important class of operations that can use map-reduce effectively are the relational-algebra operations.

73 Map-Reduce Examples Map-Reduce: Exercise Exercise Suppose we have an n x n matrix M, whose element in row i and column j is denoted m_ij. Suppose we also have a vector v of length n, whose jth element is v_j. Then the matrix-vector product is the vector x of length n, whose ith element is given by x_i = sum_{j=1}^n m_ij v_j. Outline a Map-Reduce program that calculates the vector x.

74 Map-Reduce Examples Matrix-vector multiplication Let us first assume that the vector v is large but can still fit into memory. The matrix M and the vector v will each be stored in a file of the DFS. We assume that the row-column coordinates (indices) of a matrix element can be discovered, e.g. each value is stored as a triple (i, j, m_ij). Similarly, the position of v_j can be discovered in an analogous way.

75 Map-Reduce Examples Matrix-vector multiplication The Map Function: The map function applies to one element of the matrix M. The vector v is first read in its entirety and is subsequently available for all Map tasks at that computation node. From each matrix element m_ij the map function produces the key-value pair (i, m_ij v_j). All terms of the sum that make up the component x_i of the matrix-vector product will get the same key i. The Reduce Function: The reduce function sums all the values associated with a given key i. The result is a pair (i, x_i).
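The Map and Reduce functions just described can be sketched in plain Python (a sketch; the matrix is given as (i, j, m_ij) triples as on the previous slide, and the whole vector is assumed to fit in memory):

```python
from collections import defaultdict

def matvec_mapreduce(triples, v):
    # Map: each element (i, j, m_ij) -> key-value pair (i, m_ij * v[j]).
    groups = defaultdict(list)
    for i, j, m_ij in triples:
        groups[i].append(m_ij * v[j])
    # Reduce: sum all values for key i -> (i, x_i).
    return {i: sum(vals) for i, vals in groups.items()}

# M = [[1, 2], [3, 4]], v = [1, 1]  ->  x = [3, 7]
triples = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
print(matvec_mapreduce(triples, [1, 1]))  # {0: 3, 1: 7}
```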

76 Map-Reduce Examples Matrix-vector multiplication However, it is possible that the vector v cannot fit into main memory. It is not required that the vector v fits into the memory at a compute node, but if it does not, there will be a very large number of disk accesses. As an alternative we can divide the matrix M into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes of the same height. The goal is to use enough stripes so that the portion of the vector in one stripe can fit into main memory at a compute node.

77 Map-Reduce Examples Matrix-vector multiplication Figure: Division of a matrix and vector into five stripes (Figure 2.4 from the book Mining Massive Datasets).

78 Map-Reduce Examples Matrix-vector multiplication The ith stripe of the matrix multiplies only components from the ith stripe of the vector. We can divide the matrix into one file for each stripe, and do the same for the vector. Each Map task is assigned a chunk from one of the stripes of the matrix and gets the entire corresponding stripe of the vector. The Map and Reduce tasks can then act exactly as before. We need to sum up the results of the stripe multiplications once more.

79 Map-Reduce Examples Relational-Algebra operations There are many operations on data that can be described easily in terms of the common database-query primitives. The queries themselves need not be executed within a DBMS; e.g. standard operations on relations. A relation is a table with column headers called attributes. The set of attributes of a relation is called its schema: R(A_1, A_2, ..., A_n)

80 Map-Reduce Examples Example of a relation
From  To
url1  url2
url1  url3
url2  url3
url2  url
Table: Relation Links

81 Map-Reduce Examples Example of a relation A tuple is a pair of URLs such that there is at least one link from the first URL to the second. The first row (url1, url2) states that the Web page at url1 points to the Web page at url2. You would typically have a similar relation stored by a search engine, with billions of tuples.

82 Map-Reduce Examples Relational-Algebra Standard operations on relations are: 1 Selection: apply a condition C to each tuple and output only the tuples that satisfy C 2 Projection: produce from each tuple only a subset S of the attributes 3 Union, Intersection, Difference: set operations on tuples 4 Natural Join: given two relations, compare each pair of tuples and output those that agree on all common attributes 5 Grouping and Aggregation: partition the tuples in a relation according to their values in a set of attributes; for each group perform one of the operations SUM, COUNT, AVG, MIN or MAX

83 Map-Reduce Examples Example of relational algebra Find the paths of length two in the Web using the Links relation. In other words, find triples of URLs (u, v, w) such that there is a link from u to v and a link from v to w. We want to take the natural join of Links with itself. Let us describe this with two copies of Links: L1(U1, U2) and L2(U2, U3).

84 Map-Reduce Examples Example of relational algebra Now we compute the join L1 JOIN L2. For each tuple t1 of L1 and each tuple t2 of L2 we see if their U2 components are the same. If these two components agree, we produce (U1, U2, U3) as a result. If we only want to check for the existence of a path of length two, we might project onto U1 and U3: pi_{U1,U3}(L1 JOIN L2)

85 Map-Reduce Examples Example of relational algebra Imagine that a social-networking site has a relation Friends(User, Friend). Suppose we want to calculate statistics about the number of friends of each user. In terms of relational algebra we would perform grouping and aggregation: gamma_{User, COUNT(Friend)}(Friends). This operation groups all tuples by the value of the first component and then counts the number of friends.

86 Map-Reduce Examples Selection by Map-Reduce Selection Given is a relation R. We want to compute sigma_C(R). The Map Function: For each tuple t in R test if it satisfies C. If so, produce the key-value pair (t, t). The Reduce Function: The reduce function is the identity; it simply passes each key-value pair to the output.

87 Map-Reduce Examples Projection by Map-Reduce Projection Given is a relation R. We want to compute pi_S(R). The Map Function: For each tuple t in R construct a tuple t' by eliminating from t those components that are not in S. Output the key-value pair (t', t'). The Reduce Function: For each key t' there will be one or more key-value pairs (t', t'). The reduce function turns (t', [t', t', ..., t']) into (t', t'), so each output tuple appears only once.

88 Map-Reduce Examples Natural-Join by Map-Reduce Natural Join Given are relations R(A, B) and S(B, C). We want to compute R JOIN S. The Map Function: For each tuple (a, b) of R produce the key-value pair (b, (R, a)). For each tuple (b, c) of S produce the key-value pair (b, (S, c)). The Reduce Function: Each key b will be associated with a list of pairs that are either of the form (R, a) or (S, c). Construct all triples (a, b, c) from the pairs (R, a) and (S, c) on the list.
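The join can be sketched in Python over small in-memory relations (a sketch; tagging tuples with their relation of origin, exactly as the Map function above describes):

```python
from collections import defaultdict

def natural_join(R, S):
    # Map: key every tuple by its B component and tag it with its origin.
    groups = defaultdict(lambda: {"R": [], "S": []})
    for a, b in R:
        groups[b]["R"].append(a)
    for b, c in S:
        groups[b]["S"].append(c)
    # Reduce: for each key b, pair every a from R with every c from S.
    return [(a, b, c)
            for b, g in groups.items()
            for a in g["R"] for c in g["S"]]

R = [(1, "x"), (2, "x"), (3, "y")]
S = [("x", 10), ("y", 20), ("z", 30)]
print(sorted(natural_join(R, S)))
# [(1, 'x', 10), (2, 'x', 10), (3, 'y', 20)]
```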

89 Map-Reduce Examples Grouping and Aggregation by Map-Reduce Grouping and Aggregation Given is a relation R(A, B, C). We want to compute gamma_{A, theta(B)}(R). The Map Function: The map produces the grouping. For each tuple (a, b, c) produce the key-value pair (a, b). The Reduce Function: The reduce function produces the aggregation. Each key a represents a group. Apply the aggregation operator theta to the list [b_1, b_2, ..., b_n] of the values associated with a. The output is a pair (a, x), where x is the result of theta applied to the list.
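For a concrete aggregation operator such as theta = SUM, the two functions can be sketched as (the sample relation is an illustrative assumption):

```python
from collections import defaultdict

def group_aggregate(R, agg=sum):
    # Map: (a, b, c) -> (a, b); the C component is dropped.
    groups = defaultdict(list)
    for a, b, c in R:
        groups[a].append(b)
    # Reduce: apply the aggregation operator to each group's list of b's.
    return {a: agg(bs) for a, bs in groups.items()}

R = [("u1", 2, "x"), ("u1", 3, "y"), ("u2", 5, "z")]
print(group_aggregate(R))       # {'u1': 5, 'u2': 5}
print(group_aggregate(R, max))  # {'u1': 3, 'u2': 5}
```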

90 Map-Reduce Examples
Matrix-matrix multiplication
Exercise
Suppose we have an n × n matrix M, whose element in row i and column j is denoted m_ij. Suppose we also have another n × n matrix N with elements n_jk. Then the matrix-matrix product is the matrix P of dimensions n × n, whose elements are given by
p_ik = Σ_{j=1}^{n} m_ij · n_jk
Outline a Map-Reduce program that calculates the matrix P.

91 Map-Reduce Examples
Matrix-matrix multiplication
We can think of a matrix as a relation with three attributes: the row number, the column number, and the value in that row and column
Thus, we can represent M as a relation M(I, J, V) with tuples (i, j, m_ij)
Similarly, we can represent N as a relation N(J, K, W) with tuples (j, k, n_jk)
The matrix-matrix product is then almost a natural join followed by grouping and aggregation

92 Map-Reduce Examples
Matrix-matrix multiplication
The common attribute of the two relations is J
The natural join of M(I, J, V) and N(J, K, W) would produce tuples (i, j, k, v, w)
Each such five-component tuple represents a pair of matrix elements (m_ij, n_jk)
What we actually need is (i, j, k, v · w)
Once we have this relation we can perform grouping and aggregation

93 Map-Reduce Examples
Matrix-matrix multiplication
The Map Function: For each matrix element m_ij produce the key-value pair (j, (M, i, m_ij))
Likewise, for each matrix element n_jk produce the key-value pair (j, (N, k, n_jk))
M and N are names indicating the matrix from which the element originates
The Reduce Function: For each key j examine the list of associated values
For each value (M, i, m_ij) that comes from M and each value (N, k, n_jk) that comes from N, produce a key-value pair with key (i, k) and value m_ij · n_jk

94 Map-Reduce Examples
Matrix-matrix multiplication
A second Map-Reduce job completes the computation:
The Map Function: The map function is the identity
The Reduce Function: For each key (i, k) produce the sum of the list of values associated with this key
The result is a pair ((i, k), p_ik)
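Both passes of the matrix-multiplication exercise can be simulated together. This is a sketch under the usual assumptions (in-process driver, matrices stored as sparse {(row, col): value} dicts, string tags for the relation names); the second pass's identity map is folded into the driver:

```python
from collections import defaultdict

def map_join(elem):
    # First-pass map: key every matrix element by the shared index j
    if elem[0] == "M":
        _, i, j, v = elem
        yield (j, ("M", i, v))
    else:
        _, j, k, w = elem
        yield (j, ("N", k, w))

def reduce_join(j, values):
    # First-pass reduce: emit ((i, k), m_ij * n_jk) for every M/N pair
    ms = [(i, v) for tag, i, v in values if tag == "M"]
    ns = [(k, w) for tag, k, w in values if tag == "N"]
    return [((i, k), v * w) for i, v in ms for k, w in ns]

def run(M, N):
    # Pass 1: join on j
    groups = defaultdict(list)
    elems = ([("M", i, j, v) for (i, j), v in M.items()] +
             [("N", j, k, w) for (j, k), w in N.items()])
    for elem in elems:
        for key, val in map_join(elem):
            groups[key].append(val)
    # Pass 2: identity map, then sum the products per (i, k)
    sums = defaultdict(int)
    for j, vals in groups.items():
        for (i, k), prod in reduce_join(j, vals):
            sums[(i, k)] += prod
    return dict(sums)

M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
N = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
print(run(M, N))
# → {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
```

The dict representation mirrors the relational view of slide 91 exactly: each entry of M is a tuple of M(I, J, V), each entry of N a tuple of N(J, K, W), and the result is the relation P(I, K, P) holding p_ik.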


More information

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material

More information

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Page 1. Goals for Today Background of Cloud Computing Sources Driving Big Data CS162 Operating Systems and Systems Programming Lecture 24 Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony

More information

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large Chapter 20: Parallel Databases Introduction! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems!

More information

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

Chapter 20: Parallel Databases. Introduction

Chapter 20: Parallel Databases. Introduction Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information