MongoDB DI Dr. Angelika Kusel

Motivation

Problem
  Data is partitioned over large-scale clusters
  Clusters change the rules for processing
Good news
  Lots of machines to spread the computation over
Bad news
  The programmer must deal with
    parallelization
    distribution of data
    failure tolerance
  Complex programming!

Motivation

MapReduce: a framework for processing huge datasets distributed over a large group of computers (nodes)
  Idea: "Push the program near the data"
  Brought up by Google in 2004 (patent!)
  Adopted by programming languages (e.g., Python), frameworks (e.g., Apache Hadoop), and NoSQL databases (e.g., CouchDB)
Advantages
  The framework hides the complexity of parallelization, distribution, and failures
  The developer has to implement only two operations: map and reduce
  Map and reduce steps may be performed in parallel
Application scenarios
  Distributed search
  Counting of URL hits
  Creation of an index of words
  ...

Basic Principle

Inspired by functional programming languages such as Lisp, Scheme, and Haskell (map and fold functions)
Processing happens in two distinct steps: a map step and a reduce step
Map step
  Generates a collection of intermediate results from the input data (similar to GROUP BY in SQL)
  Input is a single aggregate
  Output is a bunch of key-value pairs
  Each application of the map function is independent of all others
    Safely parallelizable
    The framework can freely allocate aggregates to map tasks
Reduce step
  Aggregates the intermediate results (similar to aggregation operations in SQL)
  Input is multiple map outputs with the same key
  Output is the combined value for a single key
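The two steps can be sketched in plain JavaScript, without any framework. This is a minimal illustration only: the names mapStep and reduceStep are made up for this sketch, and numbers are used as values for brevity.

```javascript
// Map step: one input aggregate (here a line of text) yields key-value pairs.
// Each call is independent of all others, so calls could run in parallel.
function mapStep(line) {
  return line.split(/\s+/).filter(Boolean).map((word) => [word, 1]);
}

// Reduce step: all values emitted for one key are combined into one value.
function reduceStep(key, values) {
  return values.reduce((sum, value) => sum + value, 0);
}

const pairs = mapStep("a a b c");
const valuesForA = pairs.filter(([key]) => key === "a").map(([, value]) => value);
const countA = reduceStep("a", valuesForA);
console.log(pairs, countA);
```

Collecting the values per key by hand, as done for "a" here, is exactly the work that the framework takes over between the two steps.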

Basic Principle

1. Map: iteration over the input and computation of key/value pairs
2. Shuffle and Sort: grouping of all intermediate values by key
3. Reduce: iteration over the resulting groups and reduction of each group
In SQL terms, steps 1 and 2 correspond to GROUP BY and step 3 corresponds to aggregation.

Hello World Example: Word Counting

SQL
    SELECT word, count(word) FROM words GROUP BY word;

Input
  Node1: Words a a b c
  Node2: Words c d
  Node3: Words a a c

Map
    map():
      for each word w in Words:
        EmitIntermediate(w, "1");

The emitted pairs are passed on to Shuffle and Sort.
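The Shuffle and Sort step for the word-counting input can be sketched as follows. This is an illustration in plain JavaScript, not the implementation of any particular framework; shuffleAndSort is a hypothetical helper name.

```javascript
// Input words per node, as in the word-counting example.
const nodes = {
  Node1: ["a", "a", "b", "c"],
  Node2: ["c", "d"],
  Node3: ["a", "a", "c"],
};

// Map: every word becomes the pair (word, "1").
const intermediate = Object.values(nodes).flatMap((words) =>
  words.map((w) => [w, "1"])
);

// Shuffle and Sort: group all intermediate values by key, keys in sorted order.
function shuffleAndSort(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  const sorted = [...groups.entries()].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return new Map(sorted);
}

const grouped = shuffleAndSort(intermediate);
console.log(grouped);
```

Each entry of grouped is what a single reduce call receives: one key and all values emitted for it across the three nodes.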

Hello World Example: Word Counting

Reduce
    reduce(string key, Iterator values):
      // key: word
      // values: number of occurrences
      int result = 0;
      for each i in values:
        result += ParseInt(i);
      Emit(key, AsString(result));

values is passed as an iterator, since huge data sets may not fit into main memory.

Result (excerpt)
  key: a, value: "4"
  key: c, value: "3"

Combinable Reducers

Problem of basic MapReduce
  A huge amount of data must be moved from node to node (network traffic!)
Idea
  Apply the reduce function locally first, in addition to the global reduce
  Less network traffic!
Processing order: Map, Local Reduce, Shuffle and Sort, Global Reduce
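The effect of reducing locally first can be sketched on the word-counting input. The sketch below is plain JavaScript under simplifying assumptions: values are numbers instead of strings, and "network traffic" is approximated by the number of pairs that leave the nodes. The helper names groupByKey and reduceGroups are made up for this illustration.

```javascript
const nodes = [["a", "a", "b", "c"], ["c", "d"], ["a", "a", "c"]];

// The reduce function: sums the counts for one key.
function reduce(key, values) {
  return values.reduce((sum, value) => sum + value, 0);
}

// Groups pairs by key (stand-in for Shuffle and Sort).
function groupByKey(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Applies reduce to every group and returns (key, reducedValue) pairs.
function reduceGroups(pairs) {
  return [...groupByKey(pairs)].map(([key, values]) => [key, reduce(key, values)]);
}

// Map on every node.
const mapped = nodes.map((words) => words.map((w) => [w, 1]));

// Basic variant: all mapped pairs are moved to the global reduce.
const shippedWithout = mapped.flat();
const baseline = new Map(reduceGroups(shippedWithout));

// Combinable variant: reduce on each node first, then move the smaller output.
const shippedWith = mapped.map(reduceGroups).flat();
const result = new Map(reduceGroups(shippedWith));

console.log(shippedWithout.length, shippedWith.length, result);
```

For this input the local reduce cuts the pairs to be moved from 9 to 7 while the final counts stay the same; the saving grows with the number of repeated keys per node.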

Hello World Example: Word Counting (Local Reduce)

Input
  Node1: Contents a a b c
  Node2: Contents c d
  Node3: Contents a a c

Map
    map():
      for each word w in Contents:
        EmitIntermediate(w, "1");

Reduce (applied locally on each node)
    reduce(string key, Iterator values):
      int result = 0;
      for each i in values:
        result += ParseInt(i);
      Emit(key, AsString(result));

Local results (excerpt)
  Node1: key: a, value: "2"
  Node3: key: a, value: "2"

Hello World Example: Word Counting (Global Reduce)

After Shuffle and Sort, the same reduce function is applied globally to the local results.

Input (excerpt)
  key: a, value: "2" (from Node1)
  key: a, value: "2" (from Node3)
Result (excerpt)
  key: a, value: "4"
  key: c, value: "3"

Properties of Combinable Reducers

Composability
  The type of the value emitted by the reduce function must be identical to the type of the value emitted by the map function
    reduce(key, [C, reduce(key, [A, B])]) == reduce(key, [C, A, B])
Confluence
  Idempotency: the reduce function can be applied multiple times without changing the result beyond the initial application
    reduce(key, [reduce(key, valuesarray)]) == reduce(key, valuesarray)
  Order-agnosticism: the order of the elements in valuesarray should not affect the output of the reduce function
    reduce(key, [A, B]) == reduce(key, [B, A])

MapReduce in MongoDB: Map Function

SQL
    SELECT word, count(word) FROM words GROUP BY word;

Collection words
    {"_id": "1", "word": "a"}
    {"_id": "2", "word": "b"}
    {"_id": "3", "word": "c"}
    {"_id": "4", "word": "a"}

Map function
    function () {
      emit(this.word, {"word": this.word, "count": 1});
    }
  Responsible for projection (cannot be performed later) and selection (useful to keep network traffic low)!
  Must choose an appropriate key for grouping!

Emitted key-value pairs
    a  {"word": "a", "count": 1}
    b  {"word": "b", "count": 1}
    c  {"word": "c", "count": 1}
    a  {"word": "a", "count": 1}
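The map function above can be tried outside MongoDB with a small stand-in for emit. In this sketch, the emit function and the emitted array are assumptions made for illustration; inside MongoDB, emit is provided by the server and the function is invoked once per document with this bound to that document.

```javascript
// The example collection as a plain array.
const words = [
  { _id: "1", word: "a" },
  { _id: "2", word: "b" },
  { _id: "3", word: "c" },
  { _id: "4", word: "a" },
];

// Stand-in for MongoDB's emit(key, value): records every emitted pair.
const emitted = [];
function emit(key, value) {
  emitted.push({ key, value });
}

// The map function from the example, unchanged in structure.
const mapFunction = function () {
  emit(this.word, { word: this.word, count: 1 });
};

// Invoke the map function once per document, with `this` set to the document.
for (const doc of words) {
  mapFunction.call(doc);
}

console.log(emitted);
```

The emitted value drops _id and keeps only word and count, which is the projection mentioned above; the choice of this.word as the key decides how the values are grouped later.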

MapReduce in MongoDB: Shuffle and Sort

Input (output of the map function)
    a  {"word": "a", "count": 1}
    b  {"word": "b", "count": 1}
    c  {"word": "c", "count": 1}
    a  {"word": "a", "count": 1}

Shuffle and Sort
  Responsible for grouping values according to key!

Output
    a  [{"word": "a", "count": 1}, {"word": "a", "count": 1}]
    b  [{"word": "b", "count": 1}]
    c  [{"word": "c", "count": 1}]
  The value is now an array with all values of the same key!

MapReduce in MongoDB: Reduce Function

Input
    a  [{"word": "a", "count": 1}, {"word": "a", "count": 1}]
    b  [{"word": "b", "count": 1}]
    c  [{"word": "c", "count": 1}]

Reduce function
    function (key, values) {
      var reduced = {"word": values[0].word, "count": 0};
      values.forEach(function (val) {
        reduced.count += val.count;
      });
      return reduced;
    }
  Responsible for aggregation!
  Must return documents of the same structure as the emitted value; the key is added automatically!

Result in MongoDB
    a  {"word": "a", "count": 2}
    b  {"word": "b", "count": 1}
    c  {"word": "c", "count": 1}
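MongoDB may invoke the reduce function more than once for the same key, feeding earlier outputs back in as inputs, which is why the returned document must have the same structure as the emitted value. The following sketch runs the reduce function in plain JavaScript and probes this on three sample values; the values A, B, C and the helper same are chosen only for illustration.

```javascript
// The reduce function from the example.
const reduceFunction = function (key, values) {
  const reduced = { word: values[0].word, count: 0 };
  values.forEach(function (val) {
    reduced.count += val.count;
  });
  return reduced;
};

// Sample values for the key "a"; counts above 1 stand for earlier partial results.
const A = { word: "a", count: 1 };
const B = { word: "a", count: 2 };
const C = { word: "a", count: 3 };

// Structural comparison, sufficient here because the field order is fixed.
const same = (x, y) => JSON.stringify(x) === JSON.stringify(y);

const direct = reduceFunction("a", [C, A, B]);
// Re-reducing a partial result together with another value.
const nested = reduceFunction("a", [C, reduceFunction("a", [A, B])]);
// Reducing an already reduced result once more.
const again = reduceFunction("a", [direct]);
// Same values in a different order.
const swapped = reduceFunction("a", [B, A, C]);

console.log(same(direct, nested), same(direct, again), same(direct, swapped));
```

A reduce function that returned a bare number instead of a document would fail the nested case, because val.count would be undefined on the second pass.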

Combinable Reducer?

    function (key, values) {
      var reduced = {word: values[0].word, count: 0};
      values.forEach(function (val) {
        reduced.count += val.count;
      });
      return reduced;
    }

Composability?
  The type of the value emitted by the reduce function must be identical to the type of the value emitted by the map function
Confluence?
  Idempotency: the reduce function can be applied multiple times without changing the result beyond the initial application
  Order-agnosticism: the order of the elements in valuesarray should not affect the output of the reduce function

MapReduce in MongoDB: Syntax

    db.collection.mapReduce(
      mapfunction,
      reducefunction,
      {
        out: <collection> | {<action>: <collection>},
            // action = replace | merge | reduce
            // replace = replaces the existing collection
            // merge = merges the document sets; overwrites existing documents
            // reduce = merges the document sets; applies the reduce function to documents with equal keys
        query: <document>,     // filter = WHERE clause
        sort: <document>,      // sorts the input documents
        limit: <number>,       // maximum number of input documents passed to the map function
        finalize: <function>,  // optional final function
        jsMode: <boolean>      // specifies whether intermediate results are converted to BSON or not
      }
    )
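To see what the query, limit, and finalize options in the syntax above do, the call can be imitated in plain JavaScript. The helper emulateMapReduce below is hypothetical and not part of MongoDB: it supports only equality matches in query, ignores sort, out, and jsMode, and returns the result as an array of documents with _id and value fields.

```javascript
let groups; // filled by emit() during one emulated run

// Stand-in for MongoDB's emit(key, value).
function emit(key, value) {
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key).push(value);
}

// Hypothetical emulation of a map-reduce run with a few options.
function emulateMapReduce(docs, map, reduce, options = {}) {
  const { query = {}, limit = Infinity, finalize } = options;
  // query: keep documents whose fields equal those of the query document.
  // limit: cap the number of documents passed to the map function.
  const input = docs
    .filter((doc) => Object.entries(query).every(([field, v]) => doc[field] === v))
    .slice(0, limit);
  groups = new Map();
  for (const doc of input) map.call(doc);
  return [...groups].map(([key, values]) => {
    // MongoDB does not call reduce for a key that has only a single value.
    const reduced = values.length > 1 ? reduce(key, values) : values[0];
    // finalize: optional last step applied to every reduced value.
    return { _id: key, value: finalize ? finalize(key, reduced) : reduced };
  });
}

const words = [
  { _id: "1", word: "a" },
  { _id: "2", word: "b" },
  { _id: "3", word: "c" },
  { _id: "4", word: "a" },
];
const map = function () {
  emit(this.word, { word: this.word, count: 1 });
};
const reduce = function (key, values) {
  const reduced = { word: values[0].word, count: 0 };
  values.forEach((val) => { reduced.count += val.count; });
  return reduced;
};

const all = emulateMapReduce(words, map, reduce);
const onlyA = emulateMapReduce(words, map, reduce, { query: { word: "a" } });
const firstTwo = emulateMapReduce(words, map, reduce, { limit: 2 });
const finalized = emulateMapReduce(words, map, reduce, { finalize: (key, v) => v.count });
console.log(all, onlyA, firstTwo, finalized);
```

Filtering with query before the map step means fewer documents are mapped at all, which is cheaper than discarding groups after the reduce step.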

MapReduce Function of MongoDB: Example

    var mapfunction1 = function () {
      emit(this.word, {"word": this.word, "count": 1});
    };

    var reducefunction1 = function (key, values) {
      var reduced = {"word": values[0].word, "count": 0};
      values.forEach(function (val) {
        reduced.count += val.count;
      });
      return reduced;
    };

    db.words.mapReduce(
      mapfunction1,
      reducefunction1,
      { out: "map_reduce_example" }
    )
