Databases 2 (VU) ( / )
MapReduce (Part 3)

Mark Kröll, ISDS, TU Graz
Nov. 27, 2017

Outline

1. Problems Suited for Map-Reduce
2. MapReduce: Applications
   - Matrix-Vector Multiplication
   - Information Retrieval
3. Hadoop Ecosystem
   - Big Data Storage Technologies

Slides are partially based on:
- Slides "Mining Massive Datasets" by Jure Leskovec
- Slides "MapReduce Runtime Environments" by Gilles Fedak
- Slides "Tutorial: MapReduce Theory and Practice of Data-intensive Applications" by Pietro Michiardi
- "MapReduce: Simplified Data Processing on Large Clusters" by Dean et al.
- "The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce" by J. Lin
- "Limitations and Challenges of HDFS and MapReduce" by Weets et al.

Problems Suited for Map-Reduce: When MapReduce is a suitable choice

- analyzing textual data; from our examples: computing statistics on words, ...
- in general, when files are large and are rarely updated in place
- not suitable for managing online sales
  - for instance, the principal operations on Amazon data involve responding to searches for products, recording sales, and so on
  - these processes involve relatively little calculation and change the database
- instead, use MapReduce for analytical queries on data generated, e.g., by a Web application
  - to find users with similar buying patterns or to rank search results (PageRank), ...
  - not suitable for handling Web requests (even if we have millions of users)

Problems Suited for Map-Reduce: When MapReduce is not a suitable choice

- it is not always easy to formulate and implement everything as an MR program (= parallelizable according to the divide & conquer idea)
- when your processing requires a lot of data to be shuffled over the network
  - remember: the idea is to bring the algorithm to the data
- real-time processing, i.e. when fast responses are needed for complex algorithms
  - for example, machine learning algorithms such as SVMs
- processing graphs
  - use frameworks such as Giraph

Problems Suited for Map-Reduce: When MapReduce is not a suitable choice (continued)

- when a lot of data needs to be sorted / shuffled, e.g. when the map phase generates too many keys
  - shuffling is very time-consuming and often overloads the network
  - the reason is that the reducers fetch all intermediate data at once as soon as the last mapper finishes
  - this led to strategies such as virtual shuffling or predictive scheduling
- when iterations are needed
  - for example, clustering algorithms such as K-Means
  - use frameworks such as Spark
- handling streaming data
  - MR is best suited to batch-processing huge amounts of data
  - use frameworks such as Storm

MapReduce: Typical Applications

- computations such as analytic queries typically involve matrix operations
  - the original purpose of the MapReduce implementation was to execute large matrix-vector multiplications to calculate the PageRank
  - matrix operations such as matrix-matrix and matrix-vector multiplication fit nicely into the MapReduce programming model
- another important class of operations that can use MapReduce effectively are relational-algebra operations
  - many operations on data can be described easily in terms of common database-query primitives such as selection, union, intersection and joins
  - these can be used for social network analysis, e.g. finding paths of different lengths, counting friends, etc.

Application 1: Matrix-Vector Multiplication

What do we need to do?

Suppose we have an $n \times n$ matrix $M$, whose element in row $i$ and column $j$ is denoted $m_{ij}$. Suppose we also have a vector $v$ of length $n$, whose $j$th element is $v_j$. Then the matrix-vector product is the vector $x$ of length $n$, whose $i$th element is given by

$$x_i = \sum_{j=1}^{n} m_{ij} v_j$$

Outline a Map-Reduce program that calculates the vector $x$.

Matrix-Vector Multiplication

- let us first assume that the vector $v$ is large, but still fits into memory
- the matrix $M$ and the vector $v$ are each stored in a file of the DFS
- assume that the row-column coordinates (indices) of a matrix element can be discovered
  - for example, each value is stored as a triple $(i, j, m_{ij})$
  - the position of $v_j$ can be discovered analogously

Matrix-Vector Multiplication

So what does the Map function look like?

- a map function applies to one element of the matrix $M$
- the vector $v$ is first read in its entirety and is available to all Map tasks at that compute node
- from each matrix element $m_{ij}$ the map function produces the key-value pair $(i, m_{ij} v_j)$
- all terms of the sum that make up the component $x_i$ of the matrix-vector product get the same key $i$

Matrix-Vector Multiplication

What about the Reduce function?

- the reduce function sums all the values associated with a given key $i$
- the result is a pair $(i, x_i)$
- thereby calculating one entry of the output vector $x$
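The Map and Reduce functions for the matrix-vector product can be sketched in plain Python. This is only a single-process illustration, not Hadoop code: the function names `map_element`, `reduce_row` and `matrix_vector` are made up for this sketch, indices are assumed to be zero-based, and a dictionary stands in for the shuffle phase that groups values by key.

```python
from collections import defaultdict

def map_element(i, j, m_ij, v):
    """Map: turn one matrix triple (i, j, m_ij) into the pair (i, m_ij * v_j).

    Assumes the whole vector v is available in memory at the compute node.
    """
    return i, m_ij * v[j]

def reduce_row(i, values):
    """Reduce: sum all partial products that share the key i, giving (i, x_i)."""
    return i, sum(values)

def matrix_vector(triples, v):
    """Simulate map, shuffle (group by key) and reduce in one process."""
    grouped = defaultdict(list)          # stands in for the shuffle phase
    for i, j, m_ij in triples:
        key, value = map_element(i, j, m_ij, v)
        grouped[key].append(value)
    return dict(reduce_row(i, values) for i, values in grouped.items())

# M = [[1, 2], [3, 4]] stored as (i, j, m_ij) triples, v = [5, 6]
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [5.0, 6.0]
x = matrix_vector(M, v)
```

Here `x` maps each row index `i` to the entry $x_i$ of the product.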

Matrix-Vector Multiplication: Divide & Conquer

- however, the vector $v$ might not fit into main memory
  - it is not required that $v$ fits into memory at a compute node, but if it does not, there will be a very large number of disk accesses as we move pieces of the vector into main memory
- alternatively, we can divide the matrix $M$ into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes of the same height
  - use enough stripes so that the portion of the vector in one stripe fits into main memory at a compute node

Figure: Divide matrix M and vector v into stripes

- the $i$th stripe of matrix $M$ multiplies only components from the $i$th stripe of the vector
- we can divide matrix $M$ into one file per stripe, and do the same for the vector $v$
- each Map task is assigned a chunk from one of the stripes of the matrix and gets the entire corresponding stripe of the vector
- Map and Reduce tasks can then act exactly as before
- the results of the stripe multiplications need to be summed up once more
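The striped variant can be sketched in the same single-process style. The stripe width `width`, the helper names and the zero-based indexing are assumptions of this sketch. Each simulated Map task only sees the slice of the vector that belongs to its stripe, and the partial results from all stripes are summed per row at the end.

```python
from collections import defaultdict

def map_stripe(elements, v_stripe, start):
    """Map over the matrix elements of one vertical stripe.

    v_stripe holds only the vector entries for columns start .. start+width-1,
    so the column index j is shifted by start before the lookup.
    """
    for i, j, m_ij in elements:
        yield i, m_ij * v_stripe[j - start]

def striped_matrix_vector(triples, v, width):
    """Split M by column into stripes, multiply stripe by stripe, then sum."""
    stripes = defaultdict(list)
    for i, j, m_ij in triples:
        stripes[j // width].append((i, j, m_ij))   # stripe number of column j

    sums = defaultdict(float)                       # final per-row summation
    for s, elements in stripes.items():
        start = s * width
        v_stripe = v[start:start + width]           # only this piece is loaded
        for i, value in map_stripe(elements, v_stripe, start):
            sums[i] += value
    return dict(sums)

# M = [[1, 2, 3], [4, 5, 6], [7, 8, 9]], v = [1, 0, 2], stripes of width 2
M = [(i, j, float(3 * i + j + 1)) for i in range(3) for j in range(3)]
v = [1.0, 0.0, 2.0]
x = striped_matrix_vector(M, v, width=2)
```

With a width of 2 the matrix falls into two stripes (columns 0 to 1 and column 2), and the result matches the unstriped product.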

Application 2: Information Retrieval

Information Retrieval (IR) in a Nutshell

- IR is the activity of obtaining information resources that are relevant to an information need (for instance, a search query you submit to Google) from a collection of information resources (for instance, databases of texts, images or sounds)
- Information Retrieval is, so to speak, the science of searching for information

Information Retrieval Process

Figure: taken from Eissa Alshari, "Semantic Arabic Information Retrieval Framework"

Search Process - Step 1: Indexing

Figure: taken from

Search Process - Step 2: Retrieval

Are Indexing or Retrieval MapReducable?

The Indexing Problem
- scalability is critical
- must be relatively fast, but need not be real time
- fundamentally a batch operation
- sounds like a Map-Reduce problem

The Retrieval Problem
- must have sub-second response time
- for the web, only relatively few results are needed
- rather not a Map-Reduce problem

Index Construction with MapReduce

- Map over all documents
  - emit term as key, (docnr., term frequency) as value
  - emit other information as necessary (e.g., term position)
- Sort/shuffle: group postings by term
- Reduce
  - gather and sort the postings (e.g., by docnr. or tf)
  - write postings to disk
- MapReduce does all the heavy lifting
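The indexing steps above can be sketched as a small single-process simulation. The tokenizer (lower-casing and splitting on whitespace), the function names and the sample documents are assumptions made for illustration. A dictionary plays the role of the sort/shuffle phase, and the reducer sorts the postings by document number.

```python
from collections import Counter, defaultdict

def map_document(doc_id, text):
    """Map: emit (term, (doc_id, term frequency)) for every term in a document."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)

def reduce_term(term, postings):
    """Reduce: gather all postings of one term and sort them by doc_id."""
    return term, sorted(postings)

def build_index(docs):
    """Simulate map, shuffle (group postings by term) and reduce."""
    grouped = defaultdict(list)          # stands in for the sort/shuffle phase
    for doc_id, text in docs.items():
        for term, posting in map_document(doc_id, text):
            grouped[term].append(posting)
    return dict(reduce_term(term, postings) for term, postings in grouped.items())

docs = {1: "one fish two fish", 2: "red fish blue fish", 3: "cat in the hat"}
index = build_index(docs)
```

Note that `reduce_term` holds the complete postings list of a term in memory before sorting it.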

Index Construction with MapReduce: An Overview

including information on term position:

Index Construction with MapReduce: Scalability

- however, this initial implementation can run into scalability problems
  - if terms (e.g. "fish") are stored as keys, the reducers must buffer all postings associated with the key "fish"
  - what if we run out of memory to buffer the postings?
- so instead:
  - let the framework do the sorting
  - the term frequency is implicitly stored
  - postings are written directly to disk
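The slide does not spell out how the key changes. One common way to realize "let the framework do the sorting" is to move the document number into the key, so the sketch below assumes a composite key (term, docnr.) with the list of term positions as value. Under that assumption the term frequency is simply the length of the position list, and each posting can be written out as soon as it arrives. All names and sample data are hypothetical, and a plain `sort` stands in for the framework's sort phase.

```python
from collections import defaultdict

def map_document(doc_id, text):
    """Map: emit ((term, doc_id), positions) instead of (term, posting)."""
    positions = defaultdict(list)
    for pos, term in enumerate(text.lower().split()):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        yield (term, doc_id), pos_list

def write_postings(docs):
    """Stream postings in (term, doc_id) order without buffering per term."""
    pairs = [kv for doc_id, text in docs.items()
             for kv in map_document(doc_id, text)]
    pairs.sort(key=lambda kv: kv[0])     # stands in for the framework's sort
    for (term, doc_id), pos_list in pairs:
        # tf is implicit: it equals len(pos_list)
        yield term, doc_id, pos_list

docs = {2: "red fish blue fish", 1: "one fish two fish"}
lines = list(write_postings(docs))
```

In an actual MapReduce job, a composite key like this would also need a partitioner that routes all keys with the same term to the same reducer. That detail is omitted here.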

Hadoop Eco System (v1)

- HBase
  - open source, non-relational, distributed database modeled after Google's BigTable and written in Java
  - provides a fault-tolerant way of storing large quantities of sparse data
- Hive
  - data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
  - a data warehouse is a system used for reporting and data analysis
- Pig
  - high-level platform for creating programs that run on Apache Hadoop (the language is called Pig Latin)
  - abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level

Hadoop Eco System (v2)

- Yarn (Yet Another Resource Negotiator)
  - is a cluster management system to run Big Data applications on a cluster
  - not a data processing platform itself, but enables the platforms to run their code in a cluster environment
- Spark, Flink, Storm
  - are frameworks for cluster computing specializing in either batch processing, stream processing or both (see next slide)
- Giraph
  - utilizes Apache Hadoop's MapReduce implementation to process graphs
- Impala
  - is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop

Data Processing: From Batch to Streaming...

Java(-ish) is the Hadoop Language

The Good, the Bad and the Ugly wrt. the Ecosystem

The Good
- easy to write parallel, highly scalable applications
- stream and batch processing
- seamless integration with other systems (e.g. RDBMS)

The Bad
- rapid development, hard to keep an overview
- 150 projects in (or near) the Hadoop Eco System, see

The Ugly
- Maven dependency hell if integrated with other systems
- Spark depends on > 50 libraries with a specific version!

History of Hadoop

System Design

- usually boils down to picking the right components wrt. Input, Processing + Output

Big Data Storage Technologies

- File-based: HDFS
  - distributed, permanent file storage
  - tuned for large files
  - no indexing
- Key-Value based: HBase
  - distributed Key/Value store
  - fast look-ups
  - based on HDFS
- Message-based: Kafka
  - distributed Producer/Consumer messaging system
  - data partitioned in topics
  - producer groups / consumer groups

Map Reduce Lectures Recap

- Part 1: Handling Big Data
  - Key Elements: MapReduce Framework & Distributed File System
- Part 2: Optimization: Maximizing Parallelism
  - Stragglers Problem & Input Data Skew
- Part 3: Suitability & Applications
  - Hadoop Ecosystem & Storage Technologies

The End

Next: Beyond Map-Reduce
