Big Data and Scripting mongodb map reduce


Andrea Harris
6 years ago

last lecture
- replication
- distribution
- full distributed storage network

today
- use storage network for distributed computations
- introduce map/reduce framework

map reduce - introduction

aim
- massive parallelization (arbitrary scaling)
- heterogeneous environment (different types of nodes)
- failure tolerance (nodes can fail)

why not the PRAM model?
- in PRAM, many cores access the same memory in parallel
- arbitrary scaling would require one large machine with many CPUs and much RAM
- very expensive, if realizable at all
- using many cheap nodes connected via a network scales better

map reduce - a new model of computation

simulating a PRAM in a computing network
- simulation is possible in principle
- handling of locking, parallel writes and synchronization is complicated
- PRAM algorithms are not aware of data locations
- transport of data between nodes reduces simulation speed
- many parallel processes working on one big memory

creating a new paradigm for distributed computations
- split data into many small parts
- handle each part independently of all the others
- combine results when parts are finished

the simplest example

the task: word frequencies
- document with words, many repeated
- find the bag-of-words vector b (natural-number entries) such that b_w = number of repetitions of w in the document

splitting and recombination
- split the document into tuples ⟨w, 1⟩ (key w, value 1)
- collect all values for each key: ⟨w, (1, 1, ..., 1)⟩
- combine by adding values: ⟨w, k_w⟩
- splitting can be parallelized using chunks of the document
- combination is independent for each word
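The three phases above (split, collect, combine) can be sketched in plain JavaScript. This is a minimal single-process illustration, not a distributed implementation; the sample text and the variable names are made up for the example.

```javascript
// Word frequencies in three phases: split into <w, 1>, collect per key, add up.
const text = "to be or not to be"; // hypothetical sample document

// split: one <w, 1> tuple per word occurrence
const tuples = text.split(/\s+/).filter(w => w.length > 0).map(w => [w, 1]);

// collect: gather all values that belong to the same key
const collected = new Map();
for (const [w, one] of tuples) {
  if (!collected.has(w)) collected.set(w, []);
  collected.get(w).push(one);
}

// combine: add the values of each key, independently of all other keys
const frequencies = {};
for (const [w, ones] of collected) {
  frequencies[w] = ones.reduce((a, b) => a + b, 0);
}

console.log(frequencies); // { to: 2, be: 2, or: 1, not: 1 }
```

In a real setting the split phase would run on chunks of the document in parallel, and each word's sum could be computed on a different node.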

generalizing key/value computations

input: (input) key/value pairs ⟨k_1, v_1⟩, ..., ⟨k_n, v_n⟩
output: (output) key/value pairs ⟨k_i, r_i⟩

map: ⟨k_i, v_i⟩ → { ⟨k_i, v_i⟩ : i ∈ 0, ..., n }
- map each input pair to zero or more output pairs
- modify or even split/filter input

shuffle
- for each k_i collect all v_i and create a new tuple
- distribute to reduce instances, e.g. different nodes

reduce: ⟨k_i, (v_1, ..., v_k)⟩ → ⟨k_i, r_i⟩
- combine all values for one key to a (partial) result
- do not change the key
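As a rough sketch of the map, shuffle and reduce stages described above, the following function runs all three in memory. The name runMapReduce, the pair representation as [key, value] arrays and the sample input are assumptions made for this example; they are not part of any particular framework.

```javascript
// Generic in-memory map/shuffle/reduce over [key, value] pairs (illustrative only).
function runMapReduce(pairs, mapFn, reduceFn) {
  // map: every input pair yields zero or more output pairs
  const mapped = pairs.flatMap(([k, v]) => mapFn(k, v));

  // shuffle: collect all values that share a key
  const groups = new Map();
  for (const [k, v] of mapped) {
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(v);
  }

  // reduce: combine the values of one key; the key itself is not changed
  return [...groups].map(([k, vs]) => [k, reduceFn(k, vs)]);
}

// Hypothetical usage: total word length per first letter, filtering one-letter words.
const input = [[0, "map"], [1, "reduce"], [2, "mongo"], [3, "a"]];
const result = runMapReduce(
  input,
  (k, word) => (word.length > 1 ? [[word[0], word.length]] : []),
  (k, lengths) => lengths.reduce((a, b) => a + b, 0)
);

console.log(result); // [ [ 'm', 8 ], [ 'r', 6 ] ]
```

The map function here both filters (the pair for "a" produces no output) and re-keys its input, which are the two freedoms the map stage has.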

generalizing key/value computations

mapping
- is independent for each input key
- can be applied completely in parallel
- can be executed on nodes that store the corresponding data
- must be implemented without side effects

shuffling
- is independent of the algorithm: implemented and optimized once
- maps computations to nodes where the data is stored (when possible)
- can be executed partially in parallel to mapping
- is only finished when mapping is finished

reduce step
- again parallel for each output key
- can be started only when mapping is finished

more complex algorithms can be realized by combining several map/reduce steps

a general interface for map/reduce

(many) different implementations of map/reduce exist; all share the common interface:
- input: set of key/value pairs
- input: map function
- input: reduce function

input flexibility
- can be a single element
- can consist of elements with the same key (e.g. NULL)
- analogous for output: all input elements can be mapped to one key, giving a single output value

example approach for a complex parallel algorithm

- a first map/reduce step partitions the input into many sub-problems
- intermediate steps solve the sub-problems in parallel and possibly combine them
- a final step maps all elements to a single key and reduces to the final result
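The partition-then-recombine scheme can be illustrated with a deliberately simple problem, finding the maximum of a list. The helper step, the choice of three partitions and the sample numbers are assumptions for this sketch.

```javascript
// Compact in-memory map/reduce runner (illustrative only).
function step(pairs, mapFn, reduceFn) {
  const groups = new Map();
  for (const [k, v] of pairs) {
    for (const [k2, v2] of mapFn(k, v)) {
      if (!groups.has(k2)) groups.set(k2, []);
      groups.get(k2).push(v2);
    }
  }
  return [...groups].map(([k, vs]) => [k, reduceFn(k, vs)]);
}

// input pairs: key = position, value = number (hypothetical data)
const numbers = [7, 3, 19, 4, 11, 8].map((x, i) => [i, x]);
const maxOf = (k, vs) => Math.max(...vs);

// first step: partition into 3 sub-problems and solve each (local maximum)
const partial = step(numbers, (i, x) => [[i % 3, x]], maxOf);

// final step: map everything to a single key and reduce to the final result
const finalResult = step(partial, (k, x) => [["all", x]], maxOf);

console.log(partial);     // [ [ 0, 7 ], [ 1, 11 ], [ 2, 19 ] ]
console.log(finalResult); // [ [ 'all', 19 ] ]
```

The first step could run on three nodes at once; only the three local maxima have to be moved for the final step.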

example: recursive k-means clustering

the STREAM k-means clustering can be adapted to the map/reduce framework

1. distribution/partitioning
- map: partition the input set randomly, using a hash function as key
- reduce: values for one key are clustered into 4k clusters

2. repeated recursive reduction
- map: redistribute weighted cluster representatives, use a new hash function with smaller range
- reduce: cluster weighted representatives for each key

3. recombination to a single clustering
- map: map all weighted representatives to one key
- reduce: combine representatives to a final clustering of the input set

in contrast to the stream implementation, the data points can be assigned to the resulting cluster centers

map/reduce in mongodb

import an example table with cities and population size [1]:
    mongoimport --db scratch --collection zips --file zips.json

the imported collection:
    > use scratch
    switched to db scratch
    > db.zips.findOne()
    { "city" : "ACMAR", "loc" : [ , ], "pop" : 6055, "state" : "AL", "_id" : "35004" }

[1] media.mongodb.org/zips.json

map/reduce in mongodb

task: create a histogram of city sizes
- for appropriate k: number of cities with population in [1000k, 1000(k + 1)) (include lower, exclude upper bound)

expression in map/reduce terms
- map values to their bucket:
    > map = function(){ var value = Math.round(this.pop/1000); emit(value, 1); }
- reduce by summing up values in bucket:
    > reduce = function(value, counts){ return Array.sum(counts); }

note: Math.round assigns each city to the nearest multiple of 1000; Math.floor would match the interval [1000k, 1000(k + 1)) exactly
note: map and reduce are variables holding functions; these will be used as parameters in the following

defining map and reduce functions

    > map = function(){ var value = Math.round(this.pop/1000); emit(value, 1); }

- map does not need a parameter
- access the input document as this
- arbitrarily many calls to emit(key, value) create the input for the shuffle stage
- key and value can be simple values or documents

    > reduce = function(value, counts){ return Array.sum(counts); }

- reduce has two parameters, named arbitrarily
- first: the key (here: value), corresponding to the key used in emit
- second: an array of values, each array element has the type of the second argument of emit
- the implementation in mongodb expects additional properties (later)
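To see how this, emit and the reduce parameters fit together without a running MongoDB instance, the following sketch imitates the map and shuffle stages in plain JavaScript. The emit function and the sample documents are stand-ins written for this example; Array.sum is provided by the mongo shell and is replaced here by a plain reduce over the array.

```javascript
// Stand-in for MongoDB's map phase: call map with the document bound to `this`
// and record every emit(key, value). The documents below are made up.
const docs = [{ pop: 6055 }, { pop: 420 }, { pop: 5800 }, { pop: 1200 }];

const emitted = [];
function emit(key, value) { emitted.push([key, value]); }

const map = function () { var value = Math.round(this.pop / 1000); emit(value, 1); };
const reduce = function (value, counts) { return counts.reduce((a, b) => a + b, 0); };

// map stage: one call per document
docs.forEach(doc => map.call(doc));

// shuffle stage: group emitted values by key
const buckets = {};
for (const [k, v] of emitted) (buckets[k] = buckets[k] || []).push(v);

// reduce stage: one result per key
const hist = {};
for (const k of Object.keys(buckets)) hist[k] = reduce(Number(k), buckets[k]);

console.log(hist); // { '0': 1, '1': 1, '6': 2 }
```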

execution and refinement

execute and put the result in the new collection hist_1000:
    > db.zips.mapReduce(map, reduce, {out: "hist_1000"})
    { "result" : "hist_1000", "timeMillis" : 531, "counts" : { "input" : 29467, "emit" : 29467, "reduce" : 2135, "output" : 97 }, "ok" : 1 }

execution delivers a result document with execution parameters
- result collection
- execution time
- input pairs, pairs emitted from map()
- input pairs for reduce() (after shuffling)
- output pairs

output of map/reduce in mongodb

    > show collections
    hist_1000
    system.indexes
    zips
    > db.hist_1000.findOne()
    { "_id" : 0, "value" : 4415 }

- the result is stored in a collection
- one document for each key/value pair in the result
- the key is used as _id
- the value is stored as field value

additional requirements for reduce()

MongoDB's implementation differs from the original definition
- original definition: reduce() is executed once for each key
- in mongodb: multiple times for each key, each time with part of the values

additional requirements for reduce() functions:
- type of return value equals type of input (value only, the key is not changed)
- idempotent: reduce(key,[reduce(key,values)]) == reduce(key,values)
- associative: reduce(key,[a,reduce(key,[b,c])]) == reduce(key,[a,b,c])
- commutative: reduce(key,[a,b]) == reduce(key,[b,a])

note: this allows application of the recursive parallelization scheme for associative functions on many arguments
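The three requirements can be spot-checked for a candidate reduce function. The sketch below tests them on a single triple of sample values only, so a passing check is evidence rather than proof; the helper name satisfiesRequirements and the two candidate functions are assumptions for this example.

```javascript
// Two candidate reduce functions: a sum (suitable) and a mean (not suitable).
const sum = (key, values) => values.reduce((a, b) => a + b, 0);
const mean = (key, values) => values.reduce((a, b) => a + b, 0) / values.length;

// Spot-check the three properties for one triple of sample values.
function satisfiesRequirements(reduce, a, b, c) {
  const k = "k";
  const idempotent = reduce(k, [reduce(k, [a, b, c])]) === reduce(k, [a, b, c]);
  const associative = reduce(k, [a, reduce(k, [b, c])]) === reduce(k, [a, b, c]);
  const commutative = reduce(k, [a, b]) === reduce(k, [b, a]);
  return { idempotent, associative, commutative };
}

const sumCheck = satisfiesRequirements(sum, 1, 2, 6);
const meanCheck = satisfiesRequirements(mean, 1, 2, 6);

console.log(sumCheck);  // { idempotent: true, associative: true, commutative: true }
console.log(meanCheck); // { idempotent: true, associative: false, commutative: true }
```

The mean fails associativity because the mean of 1 and the mean of (2, 6) is 2.5, while the mean of (1, 2, 6) is 3. A mean therefore has to be carried through reduce as a sum and a count and divided only at the very end.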

additional options to mapReduce()

mandatory arguments: map, reduce, out

other parameters are optional
- query: limit input to documents matching the query
- limit: limit the number of input documents
- sort: sort input documents (may be useful for optimization)
- others are: finalize, scope

finalization

- the limitations of reduce() may result in an unwanted output format
- output can be modified in a final step using finalize:
    db.collection.mapReduce(map, reduce, {out: "..", finalize: finfunction})
- function arguments are the same as for reduce(): the key and the reduced value
- the return value ends up as the value field in the result set
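A typical use of finalize is an average: reduce() must return the same type it receives, so it carries a sum and a count, and finalize turns them into the wanted number. The sketch below imitates this in plain JavaScript; the emitted values and the key "AL" are made-up sample data, and the functions are called directly instead of through mapReduce().

```javascript
// Values as a hypothetical map() would emit them: {sum: pop, count: 1}.
const reduce = (key, values) => values.reduce(
  (acc, v) => ({ sum: acc.sum + v.sum, count: acc.count + v.count }),
  { sum: 0, count: 0 }
);

// finalize receives the key and the fully reduced value.
const finalize = (key, reduced) => reduced.sum / reduced.count;

const emittedForAL = [
  { sum: 6000, count: 1 },
  { sum: 2000, count: 1 },
  { sum: 1000, count: 1 }
];

// reduce may first run on a part of the values; the end result is the same
const partial = reduce("AL", emittedForAL.slice(0, 2));
const reduced = reduce("AL", [partial, emittedForAL[2]]);
const average = finalize("AL", reduced);

console.log(reduced); // { sum: 9000, count: 3 }
console.log(average); // 3000
```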

using global variables with scope

- allows blending global variables into map, reduce and finalize
- specify the variables and their initial values as a document: scope: {var1: value1, var2: value2}
- these can be read and modified in map/reduce functions

example:
    > db.zips.mapReduce(map,
    ... function(v,c){counter++;
    ... return({sum:Array.sum(c), count:counter});},
    ... {out:"hist", limit:100, scope:{counter:0}})
    { "result" : "hist",...
    > db.hist.find()
    { "_id" : 0, "value" : 1 }
    { "_id" : 1, "value" : { "sum" : 12, "count" : 56 } }
    { "_id" : 2, "value" : { "sum" : 9, "count" : 57 } }

only one value was emitted with key 0, so it did not enter reduce

output options

out: "output_collection"
- default behaviour: store output key/value pairs in the collection
- if the collection exists, overwrite it

alternative: output action
    out: {<action> : <collectionname>, <optionalparameters>}
- replace: (the default behaviour)
- merge: overwrite only documents with a key existing in the output
- reduce: merge documents with equal keys using reduce
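The difference between the three output actions can be illustrated on plain objects that map keys to values. This is a sketch of the described semantics, not MongoDB code; applyOutput and the sample data are assumptions for this example.

```javascript
// existing and fresh are plain objects mapping _id -> value (hypothetical data).
function applyOutput(action, existing, fresh, reduce) {
  if (!["replace", "merge", "reduce"].includes(action)) {
    throw new Error("unknown output action: " + action);
  }
  // replace: the old content is discarded completely
  if (action === "replace") return { ...fresh };

  const out = { ...existing };
  for (const key of Object.keys(fresh)) {
    if (action === "reduce" && key in existing) {
      // reduce: combine old and new value of an equal key with reduce()
      out[key] = reduce(key, [existing[key], fresh[key]]);
    } else {
      // merge (or a new key): the new value overwrites or is added
      out[key] = fresh[key];
    }
  }
  return out;
}

const sum = (key, values) => values.reduce((a, b) => a + b, 0);
const existing = { a: 1, b: 2 };
const fresh = { b: 10, c: 20 };

const replaced = applyOutput("replace", existing, fresh, sum);
const merged = applyOutput("merge", existing, fresh, sum);
const reducedOut = applyOutput("reduce", existing, fresh, sum);

console.log(replaced);   // { b: 10, c: 20 }
console.log(merged);     // { a: 1, b: 10, c: 20 }
console.log(reducedOut); // { a: 1, b: 12, c: 20 }
```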

more output options

optional parameters:
- db: store the result collection in another database (default is the db of the input collection)
- sharded: distribute (shard) the output using _id as key
- nonAtomic: when merging or reducing the output into an existing collection, prevent locking of the database with nonAtomic: true (default is false, i.e. the database is locked for output writing)

output mode without storage in a collection:
    out: {inline : 1}
- intermediate documents are not stored in the database
- pure in-memory computation
- the result document has an additional array field results with the output

using functions via api calls

- functions of the form db.collection.find() are executed in the JavaScript context on the mongo console
- access via APIs provides execution of mongo queries and commands, but not JavaScript

a minimal interface: minimal context provided by (almost) all APIs
- a connection object (or variable)
- a database object
- methods to send commands to a mongodb instance

accessing mongo via API

- access to a subset of mongodb functions in the language context
- execution of JavaScript commands is often not directly supported
- sending JavaScript code is error-prone and should be avoided

generic command execution: db.runCommand()
- executes database commands [2] provided by mongodb
- parameter availability can differ from the JavaScript counterpart
- provides access to functions not directly supported by the api

example:
    db.runCommand({aggregate: "<collection>", pipeline: [<pipeline>]})

[2] list: docs.mongodb.org/manual/reference/command/
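Map/reduce itself is also available as a database command, so the histogram could in principle be started through the generic command interface. The sketch below only builds the command document; buildMapReduceCommand is a hypothetical helper, and how the functions are transmitted (for example as code strings) depends on the API used, which is an assumption left open here.

```javascript
// Build a command document for the mapReduce database command.
// Optional fields such as limit, query or finalize are passed in options.
function buildMapReduceCommand(collection, map, reduce, out, options = {}) {
  return Object.assign({ mapReduce: collection, map, reduce, out }, options);
}

const map = function () { emit(Math.round(this.pop / 1000), 1); };
const reduce = function (key, counts) { return Array.sum(counts); };

const command = buildMapReduceCommand("zips", map, reduce, "hist_1000", { limit: 100 });

// In the mongo shell the document would be sent with: db.runCommand(command)
console.log(Object.keys(command)); // [ 'mapReduce', 'map', 'reduce', 'out', 'limit' ]
```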

execution of JavaScript

eval allows execution of JavaScript functions:
    { eval: <function>, args: [ <arg1>, <arg2>,...], nolock: <boolean> }

example:
    > db.runCommand({eval: function(x){db.tmp.insert({value:x});},
    ... args: ["2"]})
    { "retval" : null, "ok" : 1 }
    > db.tmp.find()
    { "_id" : ObjectId("51c1ee258ab237572ba6ea88"), "value" : "2" }
    > db.runCommand({eval: "db.tmp.insert({value:3})"})
    > db.tmp.find()
    { "_id" : ObjectId("51c1ee258ab237572ba6ea88"), "value" : "2" }
    { "_id" : ObjectId("51c29fc08ab237572ba6ea8c"), "value" : 3 }

- by default, execution blocks the complete database
- prevent this with nolock: true

storing pre-defined functions

mongodb provides a special collection for storing functions: db.system.js
- these functions are database specific
- they can access the database as db
- _id is the function name, value holds the function

    > db.system.js.insert({_id:"tmpinsert", value: function(x)
    ... {db.tmp.insert({value:x});}})

load the stored functions:
    > db.loadServerScripts()
    > tmpinsert(3)
    > db.runCommand({eval : tmpinsert, args: [24]})
    { "retval" : null, "ok" : 1 }
    > db.tmp.find()
    { "_id" : ObjectId("51c2a27be806ddd460c2ae90"), "value" : 3 }
    { "_id" : ObjectId("51c2a3698ab237572ba6ea8d"), "value" : 24 }

formalizing map/reduce for automated execution

aim:
- create a formalized storage for map/reduce algorithms
- allow automated execution of stored algorithms

a single map/reduce operation has 3 functions, one of them optional
- these could be stored in an mr-step document:
    {map: function()..., reduce: function()..., finalize: function()...}

a complete algorithm consists of (possibly) multiple steps
- can be formalized as an mr-algorithm document:
    {_id : "algoname", functions : [ <mr1>, <mr2>,...]}
- where <mr1> is a document describing a single map/reduce step

automatic execution of map/reduce algorithms

assume a collection of mr-algorithm documents; execution of these can be scripted in a function mr-execution
- input: algorithm name, input collection, output collection
- get the algorithm document from the collection
- for the first element in functions: execute mapReduce, input collection → output collection
- for each remaining element in functions: execute mapReduce, output collection → output collection
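The mr-execution idea can be tried out with in-memory stand-ins for the collections. Everything in the following sketch is an assumption made for illustration: the db object holding arrays of documents, the helper mapReduceStep, the algorithm "maxBucket", and the fact that emit is passed to map as an argument (in MongoDB, emit is available globally inside map).

```javascript
// One map/reduce step on in-memory "collections" (arrays of documents).
function mapReduceStep(db, step, input, output) {
  const groups = new Map();
  const emit = (k, v) => {
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(v);
  };
  for (const doc of db[input]) step.map.call(doc, emit);
  db[output] = [...groups].map(([k, vs]) => {
    // as in MongoDB, a key with a single value does not enter reduce
    const reduced = vs.length > 1 ? step.reduce(k, vs) : vs[0];
    return { _id: k, value: step.finalize ? step.finalize(k, reduced) : reduced };
  });
}

// Execute all steps of a stored algorithm.
function mrExecution(db, algorithms, name, input, output) {
  const algo = algorithms.find(a => a._id === name);
  if (!algo) throw new Error("unknown algorithm: " + name);
  algo.functions.forEach((step, i) => {
    // the first step reads the input collection, later steps the previous output
    mapReduceStep(db, step, i === 0 ? input : output, output);
  });
  return db[output];
}

const algorithms = [{
  _id: "maxBucket",
  functions: [
    { // step 1: histogram of population in thousands
      map: function (emit) { emit(Math.floor(this.pop / 1000), 1); },
      reduce: (k, counts) => counts.reduce((a, b) => a + b, 0)
    },
    { // step 2: map all buckets to one key and keep the largest count
      map: function (emit) { emit("max", this.value); },
      reduce: (k, counts) => Math.max(...counts)
    }
  ]
}];

const db = { zips: [{ pop: 6055 }, { pop: 6900 }, { pop: 420 }, { pop: 6100 }] };
const out = mrExecution(db, algorithms, "maxBucket", "zips", "result");

console.log(out); // [ { _id: 'max', value: 3 } ]
```

Against a real database, mapReduceStep would be replaced by a call to mapReduce() with the step's functions and the output collection name.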
