Oral Exams Dates. Distributed Data Management Summer Semester 2013 TU Kaiserslautern. Recap: Map and Reduce. (Equi) Join of 3 Relations


Distributed Data Management
Summer Semester 2013, TU Kaiserslautern
Dr.-Ing. Sebastian Michel, smichel@mmci.uni-saarland.de

Oral Exams Dates

Note: Last week of teaching at the university, SS 13: July 15 - July 20, 2013.
How about: all (around 5?) slots in July (after the end of teaching, or in the last week) or early August. Your preferences?

Distributed Data Management, SoSe 2013, S. Michel

Lecture 4
MAP REDUCE: APPLICATIONS (CONT'D)

Recap: Map and Reduce

  Map(k, v) -> list(k2, v2)
  Reduce(k2, list(v2)) -> list(k3, v3)

Keys allow grouping data to machines/tasks.
For instance:
  k  = document identifier, v  = document content
  k2 = term,                v2 = count
  k3 = term,                v3 = final count

(Equi) Join of 3 Relations

R(A,B) Join S(B,C) Join T(C,D)
Can be implemented as:
- Two 2-way joins, e.g., R(A,B) Join S(B,C), and then the result joined with T(C,D)
- Or directly. How?

Join of 3 Relations: Considerations

R(A,B) Join S(B,C) Join T(C,D)
Send each tuple of S to the single reducer with key (b,c). A tuple of R only fixes b and a tuple of T only fixes c, so these tuples have to be sent to many key combinations: an R tuple to all keys (b,*), a T tuple to all keys (*,c).

Note: Theta joins (with arbitrary join predicate) are much more complicated.

Foto N. Afrati, Jeffrey D. Ullman: Optimizing joins in a map-reduce environment. EDBT 2010: 99-110
Alper Okcan, Mirek Riedewald: Processing theta-joins using MapReduce. SIGMOD Conference 2011
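The direct (single-round) three-way join considered above can be simulated in plain Python. This is only a sketch under assumptions that are not on the slides: reducers form an `nb` x `nc` grid, `hash(value) % n` serves as the bucket function, and the names `three_way_join`, `nb`, and `nc` are hypothetical.

```python
from collections import defaultdict


def three_way_join(R, S, T, nb=2, nc=2):
    """Simulate one MapReduce round for R(A,B) JOIN S(B,C) JOIN T(C,D).

    Reducers are addressed by a grid key (bucket of b, bucket of c).
    The grid size nb x nc and the hash function are assumptions.
    """
    def bucket(value, n):
        return hash(value) % n

    reducers = defaultdict(lambda: {"R": [], "S": [], "T": []})

    # Map phase: S tuples go to exactly one reducer.
    for b, c in S:
        reducers[(bucket(b, nb), bucket(c, nc))]["S"].append((b, c))
    # R tuples know only b, so they are replicated over all c-buckets.
    for a, b in R:
        for j in range(nc):
            reducers[(bucket(b, nb), j)]["R"].append((a, b))
    # T tuples know only c, so they are replicated over all b-buckets.
    for c, d in T:
        for i in range(nb):
            reducers[(i, bucket(c, nc))]["T"].append((c, d))

    # Reduce phase: each reducer joins its local fragments.
    out = []
    for cell in reducers.values():
        r_by_b = defaultdict(list)
        for a, b in cell["R"]:
            r_by_b[b].append(a)
        t_by_c = defaultdict(list)
        for c, d in cell["T"]:
            t_by_c[c].append(d)
        for b, c in cell["S"]:
            for a in r_by_b.get(b, []):
                for d in t_by_c.get(c, []):
                    out.append((a, b, c, d))
    return sorted(out)
```

Because every S tuple lands on exactly one reducer, each result tuple is produced once. The price is the replication of R and T, which grows with `nc` and `nb` respectively.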

n-Grams

Statistics about variable-length word sequences (e.g., "lord of the rings", "at the end of", ...) have many applications in fields including:
- Information Retrieval
- Natural Language Processing
- Digital Humanities

Example: Google Books Ngrams

E.g., http://books.google.com/ngrams/ (example queries: "thou shalt not", "don't ya")
An n-gram dataset is also available from there.

(n-gram slides based on a talk by Klaus Berberich)

n-grams Example

Document: a x b b a y
Possible n-grams:
  (a), (x), (b), (y)
  (ax), (xb), (bb), ...
  (axb), (xbb), ...
  (axbb), (xbba), (bbay)
  (axbba), (xbbay)
  (axbbay)

Task: Computing n-grams in MR

How can we efficiently compute n-grams that occur at least τ times and consist of at most σ words using MapReduce?

Klaus Berberich, Srikanta J. Bedathur: Computing n-gram statistics in MapReduce. EDBT 2013: 101-112

Naïve Solution: Simple Counting

  map(did, content):
    for k in <1 ... σ>:
      for all k-grams in content:
        emit(k-gram, did)

  reduce(n-gram, list<did>):
    if length(list<did>) >= τ:
      emit(n-gram, length(list<did>))

A Priori Based

A priori principle*: a k-gram can occur at least τ times only if its constituent (k-1)-grams occur at least τ times.
(a,b,c) qualifies only if (b,c), (a,b) and (a), (b), (c) qualify.
How to implement?

*) Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993
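As a baseline for the refinements that follow, the naive counting solution above can be simulated locally. This is a minimal sketch, assuming whitespace tokenization and an in-memory dictionary standing in for the shuffle phase; `run_naive` is a hypothetical driver, not part of any MapReduce framework.

```python
from collections import defaultdict


def map_ngrams(did, content, sigma):
    """Emit (k-gram, did) for every k-gram with 1 <= k <= sigma."""
    words = content.split()
    for k in range(1, sigma + 1):
        for i in range(len(words) - k + 1):
            yield tuple(words[i:i + k]), did


def reduce_ngrams(ngram, dids, tau):
    """Keep an n-gram only if it was emitted at least tau times."""
    if len(dids) >= tau:
        yield ngram, len(dids)


def run_naive(docs, sigma, tau):
    """Hypothetical local driver: group map output by key, then reduce."""
    groups = defaultdict(list)
    for did, content in docs.items():
        for ngram, value in map_ngrams(did, content, sigma):
            groups[ngram].append(value)
    result = {}
    for ngram, dids in groups.items():
        for key, count in reduce_ngrams(ngram, dids, tau):
            result[key] = count
    return result
```

For the example document `a x b b a y` with σ = 3 and τ = 2, only the 1-grams `a` and `b` reach the threshold. Note that the map phase emits one record per n-gram occurrence, which is what makes this variant expensive in network traffic.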

A Priori Based (Cont'd)

Iterative implementation:
- First, 1-grams that occur at least τ times
- Then, 2-grams that occur at least τ times
- ...
Needs multiple MapReduce rounds (of full data scans).
Already determined k-grams are kept.

Suffix Based

Emit only suffixes in the map phase.
Each of them represents multiple n-grams, corresponding to its prefixes.
For instance, axbbay represents a, ax, axb, axbb, axbba, and axbbay.

  map(did, content):
    for all suffixes in content:
      emit(suffix, did)

Suffix Based: Partitioning

Partition the suffixes by first word to ensure that all n-grams end up properly for counting, that is: all occurrences of ax have to end up at the same reducer.
Suffix property: ax is only generated from suffixes that start with ax.

  partition(suffix, did):
    return suffix[0] % m

Suffix Based: Sorting

The reducer has to generate n-grams based on suffixes:
- read prefixes
- count for each observed prefix its frequency
Optimization: sort suffixes in reverse lexicographic order; then simple counting using a stack.

  compare(suffix0, suffix1):
    return -strcmp(suffix0, suffix1)

Example order: aacd, aaca, aabx, aaba, aab, ...

Discussion

Assess the aforementioned algorithms with respect to properties like:
- multiple MapReduce jobs vs. single job
- amount of network traffic
- ease of implementation
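The suffix-based method (map, partitioning by first word, reverse sort, stack-based counting) can be sketched in plain Python as follows. Several details are assumptions rather than content of the slides: suffixes are truncated to σ words, `hash(word) % m` replaces the slide's `suffix[0] % m`, sorting happens in memory, and `run_suffix` and `reduce_partition` are hypothetical names.

```python
from collections import defaultdict


def map_suffixes(did, content, sigma):
    """Emit each suffix, truncated to at most sigma words (assumption)."""
    words = content.split()
    for i in range(len(words)):
        yield tuple(words[i:i + sigma]), did


def partition(suffix, m):
    """All suffixes sharing a first word go to the same reducer."""
    return hash(suffix[0]) % m


def reduce_partition(suffixes, tau):
    """Count all prefixes of the suffixes with a stack, in one pass."""
    counts = {}
    terms, cnts = [], []  # stack of words and their running counts

    def pop():
        ngram = tuple(terms)
        c = cnts.pop()
        terms.pop()
        if c >= tau:
            counts[ngram] = c
        if cnts:
            cnts[-1] += c  # occurrences also count for the shorter prefix

    # Reverse lexicographic order: longer suffixes precede their prefixes.
    for s in sorted(suffixes, reverse=True):
        lcp = 0
        while lcp < min(len(s), len(terms)) and s[lcp] == terms[lcp]:
            lcp += 1
        while len(terms) > lcp:
            pop()
        for w in s[lcp:]:
            terms.append(w)
            cnts.append(0)
        cnts[-1] += 1  # the stack now spells exactly s
    while terms:
        pop()
    return counts


def run_suffix(docs, sigma, tau, m=2):
    """Hypothetical local driver simulating m reducers."""
    parts = defaultdict(list)
    for did, content in docs.items():
        for suffix, _ in map_suffixes(did, content, sigma):
            parts[partition(suffix, m)].append(suffix)
    result = {}
    for suffixes in parts.values():
        result.update(reduce_partition(suffixes, tau))
    return result
```

Compared with naive counting, the map phase here emits one record per word position instead of one per n-gram occurrence, and the reducer needs memory proportional only to the stack depth (at most σ entries) once its input is sorted.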


GRAPH PROCESSING IN MAPREDUCE

Graph Processing in MapReduce

General: graph representation, usually an adjacency list:
  v1 -> v2, v4, v5
  v2 -> v4
  v3 -> v5

Refresher: Breadth First Search (BFS)

  Q = FIFO queue
  enqueue start node
  while not found:
    n := Q.dequeue
    if n == target then break
    foreach c in n.childlist:
      Q.enqueue(c)

(Figure: example visiting order on the graph over v1 ... v5.)

Graph Processing in MapReduce

- No global state in MapReduce
- Need to pass on results AND graph structure

  map(id, node) {
    emit(id, node)
    partial_result = local_compute()
    for each neighbor in node.adjacencylist {
      emit(neighbor.id, partial_result)
    }
  }

Graph Processing in MapReduce (2)

  reduce(id, list) {
    foreach msg in list {
      if instanceof(msg) == Node
        node = msg                          // re-construct outgoing edges for next round
      else
        result = aggregate(result, msg)     // make use of incoming results
      end
    }
    node.value = result
    emit(id, node)
  }

BFS in MapReduce

How to implement Breadth First Search in MapReduce?
Hint: Need to pass on the structure (as seen before). Augment nodes with additional information: visited, distance.

Application: Computing PageRank

- Link analysis model proposed by Brin & Page
- Compute authority scores
- In terms of: incoming links (weights) from other pages
- Random surfer model

S. Brin & L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf.
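Coming back to the BFS question above, one possible answer follows the map/reduce pattern for graph processing: every round passes the node structure along and proposes `distance + 1` to each neighbor of an already reached node. The sketch below simulates this locally; the node layout (`adj`, `dist`), the tagged messages, and the stopping rule (stop when a round changes nothing) are assumptions, not the lecture's reference solution.

```python
from collections import defaultdict

INF = float("inf")


def bfs_map(nid, node):
    """Pass on the graph structure, plus distance proposals to neighbors."""
    yield nid, ("node", node)
    if node["dist"] != INF:  # only reached nodes can extend the frontier
        for neighbor in node["adj"]:
            yield neighbor, ("dist", node["dist"] + 1)


def bfs_reduce(nid, msgs):
    """Recover the node and keep the smallest distance seen so far."""
    node, best = None, INF
    for kind, payload in msgs:
        if kind == "node":
            node = payload
        else:
            best = min(best, payload)
    if node is None:  # target without its own adjacency entry
        node = {"adj": [], "dist": INF}
    return nid, {"adj": node["adj"], "dist": min(node["dist"], best)}


def bfs(graph, source):
    """Hypothetical driver: one loop iteration equals one MapReduce job."""
    nodes = {
        n: {"adj": adj, "dist": 0 if n == source else INF}
        for n, adj in graph.items()
    }
    while True:
        groups = defaultdict(list)
        for nid, node in nodes.items():
            for key, msg in bfs_map(nid, node):
                groups[key].append(msg)
        new_nodes = dict(bfs_reduce(nid, msgs) for nid, msgs in groups.items())
        if new_nodes == nodes:  # no distance changed: done
            return {n: v["dist"] for n, v in new_nodes.items()}
        nodes = new_nodes
```

On the adjacency list from the slides, starting at v1, the nodes v2, v4, and v5 end up at distance 1, while v3 stays unreachable. The number of rounds is bounded by the longest shortest path from the source plus one confirming round.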


PageRank: Formal Definition

PageRank of a page q:

  \[ PR(q) = \varepsilon \sum_{p \,:\, p \to q} \frac{PR(p)}{out(p)} + (1 - \varepsilon) \frac{1}{N} \]

  N: total number of pages
  PR(p): PageRank of page p
  out(p): outdegree of p
  ε: probability of following a link (a random jump happens with probability 1 - ε)

Iterative computation until convergence.
Dangling nodes: sinks. Solution: add a (uniform) random jump to any other node.

Formal Model of Web Graph

Matrix representation of graphs: given a graph G, its adjacency matrix A is n x n and
  a_ij = 1, if there is a link from node i to node j
  a_ij = 0, otherwise

(Figure: example graph over nodes v1 ... v5 and its 5 x 5 adjacency matrix.)

PageRank: Matrix Notation

A: matrix containing the transition probabilities

  \[ A = \varepsilon P^{T} + (1 - \varepsilon) E \]

where P_ij = 1/out(i) if there is a link from i to j, 0 otherwise; E is the random jumps matrix.

Probability distribution vector at time k:

  \[ x^{(k)} = A^{k} x^{(0)} \]

where x^{(0)} is the starting vector.

PageRank: stationary distribution of the Markov chain described by A, i.e., the principal eigenvector of A:

  \[ \mathit{PageRank} = \lim_{k \to \infty} x^{(k)} \]

Reconsider: PageRank in MapReduce

  \[ PR(q) = \varepsilon \sum_{p \,:\, p \to q} \frac{PR(p)}{out(p)} + (1 - \varepsilon) \frac{1}{N} \]

=> To compute PR(q) we need only information about the PR scores and outdegrees of the nodes that link to q.
We have the information (page q, PR(q)) together with the pages p1, p2, ... that q links to.
=> Need to invert that pattern.

PR in MR: Map Phase

  map(nid m, node M)                        // node has pagerank attribute and list of outgoing edges
    p = M.pageRank / |M.adjacencyList|
    emit(nid m, M)                          // send info about outgoing edges
    for all nid x in M.adjacencyList do
      emit(nid x, p)                        // send score mass to nodes M links to

PR in MR: After Map Phase

We have now, for page K (grouped by the id of K):
  [page IN1, PR(IN1)/out(IN1)], [page IN2, PR(IN2)/out(IN2)], ...   PR information from incoming links
  [page O1, page O2, page O3, ...]                                   information about outgoing links
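One full PageRank iteration in the map/reduce style can be simulated locally as below. The map step follows the pseudocode above, and the reduce step applies the formula PR(q) = ε · Σ PR(p)/out(p) + (1 - ε)/N to the grouped messages. Assumptions in this sketch: dangling nodes spread their score uniformly over all nodes, ε defaults to 0.85, a fixed number of rounds replaces a convergence test, and the names `pr_map`, `pr_reduce`, and `pagerank` are hypothetical.

```python
from collections import defaultdict


def pr_map(nid, node, all_ids):
    """Pass on the node structure and send score mass along out-links."""
    yield nid, ("node", node)
    # Assumption: a dangling node jumps uniformly to every node.
    targets = node["adj"] or all_ids
    share = node["pr"] / len(targets)
    for target in targets:
        yield target, ("mass", share)


def pr_reduce(nid, msgs, n, eps):
    """Recover the node and combine incoming mass with the random jump."""
    node, s = None, 0.0
    for kind, payload in msgs:
        if kind == "node":
            node = payload      # recover outgoing edges
        else:
            s += payload        # sum up incoming PR scores
    return nid, {"adj": node["adj"], "pr": (1 - eps) / n + eps * s}


def pagerank(graph, eps=0.85, rounds=50):
    """Hypothetical driver: each loop iteration equals one MapReduce job."""
    n = len(graph)
    ids = list(graph)
    nodes = {v: {"adj": adj, "pr": 1.0 / n} for v, adj in graph.items()}
    for _ in range(rounds):
        groups = defaultdict(list)
        for nid, node in nodes.items():
            for key, msg in pr_map(nid, node, ids):
                groups[key].append(msg)
        nodes = dict(pr_reduce(nid, msgs, n, eps) for nid, msgs in groups.items())
    return {v: node["pr"] for v, node in nodes.items()}
```

With the dangling-node handling above, the scores sum to 1 after every round, since the total mass sent in the map phase is 1 and the reduce phase rescales it by ε before adding (1 - ε) in total.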

PR in MR: Reduce Phase

  reduce(nid m, [p1, p2, ...])
    s = 0; M = null
    for all p in [p1, p2, ...] do
      if p instanceof node then
        M = p                               // recover outgoing edges
      else
        s += p                              // sum up incoming PR scores
    M.pageRank = (1 - ε)/n + ε*s
    emit(nid m, node M)

Literature

- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google Labs.
- http://craig-henderson.blogspot.de/2009/11/dewitt-and-stonebrakers-mapreduce-major.html
- Foto N. Afrati, Jeffrey D. Ullman: Optimizing joins in a map-reduce environment. EDBT 2010: 99-110
- Alper Okcan, Mirek Riedewald: Processing theta-joins using MapReduce. SIGMOD Conference 2011
- Klaus Berberich, Srikanta J. Bedathur: Computing n-gram statistics in MapReduce. EDBT 2013: 101-112
- Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993
- S. Brin & L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf.
- Hadoop Book: Tom White. Hadoop: The Definitive Guide. O'Reilly, 3rd edition.

Literature (2)

- Bloom, Burton H. (1970), "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM 13 (7)
- Broder, Andrei; Mitzenmacher, Michael (2005), "Network Applications of Bloom Filters: A Survey", Internet Mathematics 1 (4)
- Publicly available book: http://lintool.github.io/mapreducealgorithms/mapreduce-book-final.pdf
