Oral Exams Dates. Distributed Data Management Summer Semester 2013 TU Kaiserslautern. Recap: Map and Reduce. (Equi) Join of 3 Relations


Transcription

Distributed Data Management
Summer Semester 2013
TU Kaiserslautern
Dr.-Ing. Sebastian Michel
smichel@mmci.uni-saarland.de
Distributed Data Management, SoSe 2013, S. Michel

Oral Exams Dates
Note: Last week of teaching at University, SS 13: July 15 - July 20, 2013.
How about: all (around 5?) slots in July (after end of teaching, or in last week) or early August.
Your preferences?

Lecture 4
MAP REDUCE: APPLICATIONS (CONT'D)

Recap: Map and Reduce
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(k3, v3)
Keys allow grouping data to machines/tasks.
For instance:
  k1 = document identifier, v1 = document content
  k2 = term, v2 = count
  k3 = term, v3 = final count

(Equi) Join of 3 Relations
R(A,B) Join S(B,C) Join T(C,D)
Can be implemented as:
  Two 2-way joins, e.g., R(A,B) Join S(B,C), and then the result joined with T(C,D)
  Or directly, how?
Note: Theta joins (with arbitrary join predicate) are much more complicated.

Join of 3 Relations: Considerations
R(A,B) Join S(B,C) Join T(C,D)
Send tuples of S by key (b,c), but tuples in R and T for many combinations (*,b) and (c,*).

Foto N. Afrati, Jeffrey D. Ullman: Optimizing joins in a map-reduce environment. EDBT 2010: 99-110
Alper Okcan, Mirek Riedewald: Processing theta-joins using MapReduce. SIGMOD Conference 2011
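The "directly, how?" question for the 3-way join can be made concrete with a small single-process simulation. This is only a sketch under assumptions that are not on the slides: reducers are arranged in a hypothetical `nb` x `nc` grid, the bucket of a value is Python's `hash` modulo the grid size, and the function name `three_way_join` is made up for illustration. Each S tuple knows both join keys and goes to exactly one cell; each R tuple is replicated across all c-buckets and each T tuple across all b-buckets.

```python
from collections import defaultdict
from itertools import product


def three_way_join(R, S, T, nb=2, nc=2):
    """Simulate a one-round join R(A,B) JOIN S(B,C) JOIN T(C,D) on an nb x nc reducer grid."""
    reducers = defaultdict(lambda: {"R": [], "S": [], "T": []})

    # "Map" phase: route every tuple to the reducer cell(s) that may need it.
    for a, b in R:  # R knows b but not c: replicate over all c-buckets
        for j in range(nc):
            reducers[(hash(b) % nb, j)]["R"].append((a, b))
    for b, c in S:  # S knows both join keys: exactly one cell
        reducers[(hash(b) % nb, hash(c) % nc)]["S"].append((b, c))
    for c, d in T:  # T knows c but not b: replicate over all b-buckets
        for i in range(nb):
            reducers[(i, hash(c) % nc)]["T"].append((c, d))

    # "Reduce" phase: every cell joins its local fragments.
    result = []
    for cell in reducers.values():
        r_by_b = defaultdict(list)
        for a, b in cell["R"]:
            r_by_b[b].append(a)
        t_by_c = defaultdict(list)
        for c, d in cell["T"]:
            t_by_c[c].append(d)
        for b, c in cell["S"]:
            for a, d in product(r_by_b[b], t_by_c[c]):
                result.append((a, b, c, d))
    return result


R = [(1, "x"), (2, "y")]
S = [("x", 10), ("y", 20), ("z", 30)]
T = [(10, "p"), (10, "q"), (20, "r")]
print(sorted(three_way_join(R, S, T)))
```

Because every S tuple lives in exactly one cell, each result tuple is produced once; the price is the replication of R and T, which is the trade-off the "Considerations" slide points at.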

n-Grams
Statistics about variable-length word sequences (e.g., "lord of the rings", "at the end of", ...) have many applications in fields including:
  Information Retrieval
  Natural Language Processing
  Digital Humanities

Example: Google Books Ngrams
"thou shalt not", "don't ya"
E.g., http://books.google.com/ngrams/
An n-gram dataset is also available from there.
n-gram slides based on a talk by Klaus Berberich.

n-grams Example
Document: a x b b a y
Possible n-grams include:
  (a), (x), (b), (y)
  (ax), (xb), (bb)
  (axb), (xbb)
  (axbb), (xbba), (bbay)
  (axbba), (xbbay)
  (axbbay)

Task: Computing n-grams in MR
How can we efficiently compute n-grams that occur at least τ times and consist of at most σ words using MapReduce?
Klaus Berberich, Srikanta J. Bedathur: Computing n-gram statistics in MapReduce. EDBT 2013: 101-112

Naïve Solution: Simple Counting
map(did, content):
  for k in <1 ... σ>:
    for all k-grams in content:
      emit(k-gram, did)
reduce(n-gram, list<did>):
  if length(list<did>) >= τ:
    emit(n-gram, length(list<did>))

A Priori Based
A priori principle*: a k-gram can occur more than τ times only if its constituent (k-1)-grams occur at least τ times.
(a,b,c) qualifies only if (b,c), (a,b) and (a), (b), (c) qualify.
How to implement?
*) Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993
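The naïve counting approach above can be tried out locally. The following is a minimal sketch, not the authors' implementation: the shuffle is simulated with a dictionary, documents are assumed to be whitespace-tokenized strings, and the names `map_ngrams`, `reduce_ngram` and `naive_ngram_counts` are hypothetical.

```python
from collections import defaultdict


def map_ngrams(did, content, sigma):
    """Emit (k-gram, did) for every k-gram with 1 <= k <= sigma."""
    words = content.split()
    for k in range(1, sigma + 1):
        for i in range(len(words) - k + 1):
            yield tuple(words[i:i + k]), did


def reduce_ngram(ngram, dids, tau):
    """Emit the n-gram with its frequency if it occurs at least tau times."""
    if len(dids) >= tau:
        yield ngram, len(dids)


def naive_ngram_counts(docs, sigma, tau):
    groups = defaultdict(list)  # simulated shuffle: group emitted values by key
    for did, content in docs.items():
        for ngram, d in map_ngrams(did, content, sigma):
            groups[ngram].append(d)
    counts = {}
    for ngram, dids in groups.items():
        for key, count in reduce_ngram(ngram, dids, tau):
            counts[key] = count
    return counts


docs = {"d1": "a x b b a y", "d2": "a x b"}
print(naive_ngram_counts(docs, sigma=2, tau=2))
```

Note that every occurrence of every k-gram is emitted and shipped to a reducer, which is exactly the network cost that the A Priori and suffix-based variants try to reduce.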

A Priori Based (Cont'd)
Iterative implementation:
  First, 1-grams that occur τ times
  Then, 2-grams that occur τ times
Needs multiple MapReduce rounds (of full data scans).
Already determined k-grams are kept.

Suffix Based
Emit only suffixes in map phase.
Each of them represents multiple n-grams corresponding to its prefixes.
For instance, axbbay represents a, ax, axb, axbb, axbba, and axbbay.
map(did, content):
  for all suffixes in content:
    emit(suffix, did)

Suffix Based: Partitioning
Partition the suffixes by first word to ensure all n-grams end up properly for counting; that is, all occurrences of ax have to end up at the same reducer.
Suffix property: ax is only generated from suffixes that start with ax...
partition(suffix, did):
  return suffix[0] % m

Suffix Based: Sorting
Reducer has to generate n-grams based on suffixes:
  read prefixes
  count for each observed prefix its frequency
Optimization: sort suffixes in reverse lexicographic order; then simple counting using a stack.
compare(suffix0, suffix1):
  return -strcmp(suffix0, suffix1)
Example order: aacd, aaca, aabx, aaba, aab, ax, ...

Discussion
Assess the aforementioned algorithms with respect to properties like:
  multiple MapReduce jobs vs. single job
  amount of network traffic
  ease of implementation

GRAPH PROCESSING IN MAPREDUCE
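Before moving on to graphs, the suffix-based n-gram method (map suffixes, partition by first word, sort in reverse lexicographic order, count with a stack) can be sketched in a single process. Several details here are assumptions rather than slide content: suffixes are truncated to σ words to enforce the length limit, the partitioner uses Python's `hash` of the first word, and all function names are hypothetical.

```python
from collections import defaultdict


def map_suffixes(did, content, sigma):
    """Emit every suffix of the document, truncated to sigma words (assumption)."""
    words = content.split()
    for i in range(len(words)):
        yield tuple(words[i:i + sigma]), did


def reduce_suffixes(suffixes, tau):
    """Count all prefixes of the given suffixes with one stack."""
    counts = {}
    words, freqs = [], []  # parallel stacks: current prefix path and its counts

    def pop():
        freq = freqs.pop()
        ngram = tuple(words)
        words.pop()
        if freq >= tau:
            counts[ngram] = freq
        if freqs:  # occurrences of an n-gram also count for its prefix
            freqs[-1] += freq

    # Reverse lexicographic order: longer suffixes come before their prefixes.
    for suffix in sorted(suffixes, reverse=True):
        lcp = 0
        while lcp < min(len(words), len(suffix)) and words[lcp] == suffix[lcp]:
            lcp += 1
        while len(words) > lcp:  # n-grams not shared with this suffix are complete
            pop()
        if len(words) == len(suffix):
            freqs[-1] += 1
        else:
            for w in suffix[lcp:]:
                words.append(w)
                freqs.append(0)
            freqs[-1] = 1
    while words:
        pop()
    return counts


def suffix_ngram_counts(docs, sigma, tau, m=4):
    partitions = defaultdict(list)
    for did, content in docs.items():
        for suffix, _ in map_suffixes(did, content, sigma):
            partitions[hash(suffix[0]) % m].append(suffix)  # partition by first word
    result = {}
    for part in partitions.values():
        result.update(reduce_suffixes(part, tau))
    return result


docs = {"d1": "a x b b a y", "d2": "a x b"}
print(suffix_ngram_counts(docs, sigma=2, tau=2))
```

The stack holds at most σ entries at a time, so the reducer never needs to keep a count per distinct n-gram in memory; it only relies on all suffixes with a common prefix arriving contiguously.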

Refresher: Breadth First Search (BFS)
Q = FIFO queue
enqueue start node
while not found:
  n := Q.dequeue
  if n == target then break
  foreach c in n.childlist:
    Q.enqueue(c)
Example visiting order: (figure omitted in transcript)

Graph Processing in MapReduce
General: Graph Representation
Usually: adjacency list
  v1 -> v2, v4, v5
  v2 -> v4
  v3 -> v5
(Figure: example graph with nodes v1, v2, v3, v4, v5)

Graph Processing in MapReduce
No global state in MapReduce.
Need to pass on results AND graph structure.
map(id, node) {
  emit(id, node)
  partial_result = local_compute()
  for each neighbor in node.adjacencylist {
    emit(neighbor.id, partial_result)
  }
}

Graph Processing in MapReduce (2)
reduce(id, list) {
  foreach msg in list {
    if instanceof(msg) == Node
      node = msg                        // re-construct outgoing edges for next round
    else
      result = aggregate(result, msg)   // make use of incoming results
  }
  node.value = result
  emit(id, node)
}

BFS in MapReduce
How to implement Breadth First Search in MapReduce?
Hint: Need to pass on structure (as seen before).
Augment nodes with additional information: visited, distance.

Application: Computing PageRank
Link analysis model proposed by Brin & Page.
Compute authority scores in terms of incoming links (weights) from other pages.
Random surfer model.
S. Brin & L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf.
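One possible answer to the BFS question, following the hint to pass on the graph structure and to augment nodes with a distance, is sketched below as a local simulation. It is an illustration under assumptions, not the lecture's reference solution: nodes are plain dictionaries with hypothetical fields `adj` and `dist`, every node has its own record, rounds are repeated until nothing changes, and v4 and v5 are assumed to have no outgoing edges.

```python
import math
from collections import defaultdict


def bfs_map(nid, node):
    yield nid, node  # pass on the graph structure
    if node["dist"] != math.inf:  # only reached nodes send a tentative distance
        for neighbor in node["adj"]:
            yield neighbor, node["dist"] + 1


def bfs_reduce(nid, msgs):
    node, best = None, math.inf
    for msg in msgs:
        if isinstance(msg, dict):  # the node record itself
            node = msg
        else:  # a tentative distance sent by a predecessor
            best = min(best, msg)
    return nid, {"adj": node["adj"], "dist": min(node["dist"], best)}


def bfs(graph, source):
    nodes = {n: {"adj": adj, "dist": 0 if n == source else math.inf}
             for n, adj in graph.items()}
    while True:  # one iteration corresponds to one MapReduce round
        groups = defaultdict(list)
        for nid, node in nodes.items():
            for key, value in bfs_map(nid, node):
                groups[key].append(value)
        new_nodes = dict(bfs_reduce(nid, msgs) for nid, msgs in groups.items())
        if new_nodes == nodes:  # no distance changed: done
            return {n: node["dist"] for n, node in nodes.items()}
        nodes = new_nodes


graph = {"v1": ["v2", "v4", "v5"], "v2": ["v4"], "v3": ["v5"], "v4": [], "v5": []}
print(bfs(graph, "v1"))
```

In a real job the termination check would need something like a counter or a driver-side comparison between rounds, since there is no global state inside map or reduce.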

PageRank: Formal Definition
PageRank of a page q:
$$PR(q) = \varepsilon \cdot \sum_{p \,:\, p \to q} \frac{PR(p)}{out(p)} + (1 - \varepsilon) \cdot \frac{1}{N}$$
  N: total number of pages
  PR(p): PageRank of page p
  out(p): outdegree of p
  ε: weight of the link-following term; a random jump happens with probability 1 - ε
Iterative computation until convergence.
Dangling nodes: sinks. Solution: add random jump (uniform) to any other node.

Formal Model of Web Graph
Matrix representation of graphs: given a graph G, its adjacency matrix A is n x n, with
$$a_{ij} = \begin{cases} 1 & \text{if there is a link from node } i \text{ to node } j \\ 0 & \text{otherwise} \end{cases}$$
(Figure: example graph with nodes v1, ..., v5 and its adjacency matrix; matrix entries are not recoverable from the transcript.)

PageRank: Matrix Notation
A: matrix containing the transition probabilities,
$$A = \varepsilon P^{T} + (1 - \varepsilon) E$$
where $P_{ij} = 1/out(i)$ if there is a link from i to j, 0 otherwise; E is the random jumps matrix.
Probability distribution vector at time k:
$$x^{(k)} = A^{k} x^{(0)}$$
where $x^{(0)}$ is the starting vector.
PageRank: stationary distribution of the Markov chain described by A, i.e., principal eigenvector of A:
$$\text{PageRank} = \lim_{k \to \infty} x^{(k)}$$

Reconsider: PageRank in MapReduce
$$PR(q) = \varepsilon \cdot \sum_{p \,:\, p \to q} \frac{PR(p)}{out(p)} + (1 - \varepsilon) \cdot \frac{1}{N}$$
→ To compute PR(q) we need only information about PR scores and out degree of nodes that link to q.
Have info: (page q, PR) linking to page p1, p2, ...
→ Need to invert that pattern.

PR in MR: Map Phase
map(nid m, node M)                    // node has pagerank attribute and list of outgoing edges
  p = M.pageRank / |M.adjacencyList|
  emit(nid m, M)                      // send info about outgoing edges
  for all nid x in M.adjacencyList do
    emit(nid x, p)                    // send score mass to nodes M links to

PR in MR: After Map Phase
We have now, for page K (group by id of K):
  [pageIN1, PR(IN1)/INn1], [pageIN2, PR(IN2)/INn2], ...   (PR information from incoming links)
  [pageO1, pageO2, pageO3, ...]                           (information about outgoing links)
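To see the inversion pattern in action, one complete round (map, group by node id, then an update of each node's score) can be simulated locally. This is a hedged sketch: the update uses (1 - ε)/N + ε·s, where s is the sum of received score mass, to match the formula above; the example graph is assumed to have no dangling nodes, so no extra random-jump handling for sinks is included; the field names `adj` and `pr`, the function names, and the defaults `eps=0.85` and `rounds=30` are illustrative choices, not values from the lecture.

```python
from collections import defaultdict


def pr_map(nid, node):
    yield nid, node  # send info about outgoing edges
    if node["adj"]:
        p = node["pr"] / len(node["adj"])
        for x in node["adj"]:
            yield x, p  # send score mass to the nodes this node links to


def pr_reduce(nid, msgs, eps, n):
    node, s = None, 0.0
    for msg in msgs:
        if isinstance(msg, dict):  # recover outgoing edges
            node = msg
        else:  # sum up incoming PR scores
            s += msg
    return nid, {"adj": node["adj"], "pr": (1 - eps) / n + eps * s}


def pagerank(graph, eps=0.85, rounds=30):
    n = len(graph)
    nodes = {v: {"adj": adj, "pr": 1.0 / n} for v, adj in graph.items()}
    for _ in range(rounds):  # one iteration corresponds to one MapReduce job
        groups = defaultdict(list)
        for nid, node in nodes.items():
            for key, value in pr_map(nid, node):
                groups[key].append(value)
        nodes = dict(pr_reduce(nid, msgs, eps, n) for nid, msgs in groups.items())
    return {v: node["pr"] for v, node in nodes.items()}


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
print(scores, sum(scores.values()))
```

With dangling nodes present, the score mass they hold would be lost in each round under this sketch, which is why the definition above calls for a uniform random jump from sinks.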

PR in MR: Reduce Phase
reduce(nid m, [p1, p2, ...])
  s = 0; M = node
  for all p in [p1, p2, ...] do
    if p instanceof node then
      M = p            // recover outgoing edges
    else
      s += p           // sum up incoming PR scores
  M.pageRank = (1 - ε)/n + ε*s
  emit(nid m, node M)

Literature
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google Labs.
http://craig-henderson.blogspot.de/2009/11/dewitt-and-stonebrakers-mapreduce-major.html
Foto N. Afrati, Jeffrey D. Ullman: Optimizing joins in a map-reduce environment. EDBT 2010: 99-110
Alper Okcan, Mirek Riedewald: Processing theta-joins using MapReduce. SIGMOD Conference 2011
Klaus Berberich, Srikanta J. Bedathur: Computing n-gram statistics in MapReduce. EDBT 2013: 101-112
Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993
S. Brin & L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf.
Hadoop Book: Tom White. Hadoop: The Definitive Guide. O'Reilly, 3rd edition.

Literature (2)
Bloom, Burton H. (1970), "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM 13 (7)
Broder, Andrei; Mitzenmacher, Michael (2005), "Network Applications of Bloom Filters: A Survey", Internet Mathematics 1 (4)
Publicly available book: http://lintool.github.io/mapreducealgorithms/mapreduce-book-final.pdf

Distributed Data Management Summer Semester 2013 TU Kaiserslautern

Distributed Data Management Summer Semester 2013 TU Kaiserslautern Distributed Data Management Summer Semester 2013 TU Kaiserslautern Dr.- Ing. Sebas4an Michel smichel@mmci.uni- saarland.de Distributed Data Management, SoSe 2013, S. Michel 1 Oral Exams Dates Note: Last

More information

Computing n-gram Statistics in MapReduce

Computing n-gram Statistics in MapReduce Computing n-gram Statistics in MapReduce Klaus Berberich (kberberi@mpi-inf.mpg.de) Srikanta Bedathur (bedathur@iiitd.ac.in) n-gram Statistics Statistics about variable-length word sequences (e.g., lord

More information

Distributed Data Management Summer Semester 2013 TU Kaiserslautern

Distributed Data Management Summer Semester 2013 TU Kaiserslautern Distributed Data Management Summer Semester 2013 TU Kaiserslautern Dr.- Ing. Sebas4an Michel smichel@mmci.uni- saarland.de Distributed Data Management, SoSe 2013, S. Michel 1 Lecture 4 PIG/HIVE Distributed

More information

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data)

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data) Principles of Data Management Lecture #16 (MapReduce & DFS for Big Data) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s News Bulletin

More information

Generalizing Map- Reduce

Generalizing Map- Reduce Generalizing Map- Reduce 1 Example: A Map- Reduce Graph map reduce map... reduce reduce map 2 Map- reduce is not a solu;on to every problem, not even every problem that profitably can use many compute

More information

MapReduce: Algorithm Design for Relational Operations

MapReduce: Algorithm Design for Relational Operations MapReduce: Algorithm Design for Relational Operations Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec Projection π Projection in MapReduce Easy Map over tuples, emit

More information

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson Search Engines Informa1on Retrieval in Prac1ce Annotations by Michael L. Nelson All slides Addison Wesley, 2008 Indexes Indexes are data structures designed to make search faster Text search has unique

More information

PPI Network Alignment Advanced Topics in Computa8onal Genomics

PPI Network Alignment Advanced Topics in Computa8onal Genomics PPI Network Alignment 02-715 Advanced Topics in Computa8onal Genomics PPI Network Alignment Compara8ve analysis of PPI networks across different species by aligning the PPI networks Find func8onal orthologs

More information

CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University CS6200 Informa.on Retrieval David Smith College of Computer and Informa.on Science Northeastern University Indexing Process Indexes Indexes are data structures designed to make search faster Text search

More information

Lecture 2 Data Cube Basics

Lecture 2 Data Cube Basics CompSci 590.6 Understanding Data: Theory and Applica>ons Lecture 2 Data Cube Basics Instructor: Sudeepa Roy Email: sudeepa@cs.duke.edu 1 Today s Papers 1. Gray- Chaudhuri- Bosworth- Layman- Reichart- Venkatrao-

More information

Graphs (Part II) Shannon Quinn

Graphs (Part II) Shannon Quinn Graphs (Part II) Shannon Quinn (with thanks to William Cohen and Aapo Kyrola of CMU, and J. Leskovec, A. Rajaraman, and J. Ullman of Stanford University) Parallel Graph Computation Distributed computation

More information

Searching the Web [Arasu 01]

Searching the Web [Arasu 01] Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web

More information

SEMINAR: GRAPH-BASED METHODS FOR NLP

SEMINAR: GRAPH-BASED METHODS FOR NLP SEMINAR: GRAPH-BASED METHODS FOR NLP Organisatorisches: Seminar findet komplett im Mai statt Seminarausarbeitungen bis 15. Juli (?) Hilfen Seminarvortrag / Ausarbeitung auf der Webseite Tucan number for

More information

Performance and Scalability: Apriori Implementa6on

Performance and Scalability: Apriori Implementa6on Performance and Scalability: Apriori Implementa6on Apriori R. Agrawal and R. Srikant. Fast algorithms for mining associa6on rules. VLDB, 487 499, 1994 Reducing Number of Comparisons Candidate coun6ng:

More information

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins MapReduce 1 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins 2 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce

More information

CS 378 Big Data Programming

CS 378 Big Data Programming CS 378 Big Data Programming Fall 2015 Lecture 1 Introduc?on Class Logis?cs Class meets MW, 9:30 AM 11:00 AM Office Hours GDC 4.706 MTh 11:00 12:00 AM By appointment Email: dfranke@cs.utexas.edu Web page:

More information

It also suggests creating more reduce tasks to deal with problems in keeping reducer inputs in memory.

It also suggests creating more reduce tasks to deal with problems in keeping reducer inputs in memory. Reference 1. Processing Theta-Joins using MapReduce, Alper Okcan, Mirek Riedewald Northeastern University 1.1. Video of presentation on above paper 2. Optimizing Joins in map-reduce environment F.N. Afrati,

More information

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields Informa/on Retrieval CISC437/637, Lecture #23 Ben CartereAe Copyright Ben CartereAe 1 Text Search Consider a database consis/ng of long textual informa/on fields News ar/cles, patents, web pages, books,

More information

Graph Data Processing with MapReduce

Graph Data Processing with MapReduce Distributed data processing on the Cloud Lecture 5 Graph Data Processing with MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, 2015 (licensed under Creation Commons Attribution

More information

Teach A level Compu/ng: Algorithms and Data Structures

Teach A level Compu/ng: Algorithms and Data Structures Teach A level Compu/ng: Algorithms and Data Structures Eliot Williams @MrEliotWilliams Course Outline Representa+ons of data structures: Arrays, tuples, Stacks, Queues,Lists 2 Recursive Algorithms 3 Searching

More information

Lecture Map-Reduce. Algorithms. By Marina Barsky Winter 2017, University of Toronto

Lecture Map-Reduce. Algorithms. By Marina Barsky Winter 2017, University of Toronto Lecture 04.02 Map-Reduce Algorithms By Marina Barsky Winter 2017, University of Toronto Example 1: Language Model Statistical machine translation: Need to count number of times every 5-word sequence occurs

More information

CS 345A Data Mining. MapReduce

CS 345A Data Mining. MapReduce CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

M 2 R: Enabling Stronger Privacy in MapReduce Computa;on

M 2 R: Enabling Stronger Privacy in MapReduce Computa;on M 2 R: Enabling Stronger Privacy in MapReduce Computa;on Anh Dinh, Prateek Saxena, Ee- Chien Chang, Beng Chin Ooi, Chunwang Zhang School of Compu,ng Na,onal University of Singapore 1. Mo;va;on Distributed

More information

COSC 6339 Big Data Analytics. Graph Algorithms and Apache Giraph

COSC 6339 Big Data Analytics. Graph Algorithms and Apache Giraph COSC 6339 Big Data Analytics Graph Algorithms and Apache Giraph Parts of this lecture are adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Distributed Data Management Summer Semester 2013 TU Kaiserslautern

Distributed Data Management Summer Semester 2013 TU Kaiserslautern Distributed Data Management Summer Semester 2013 TU Kaiserslautern Dr.- Ing. Sebas9an Michel smichel@mmci.uni- saarland.de Lecture 1 MOTIVATION AND OVERVIEW Distributed Data Management, SoSe 2013, S. Michel

More information

MapReduce. Tom Anderson

MapReduce. Tom Anderson MapReduce Tom Anderson Last Time Difference between local state and knowledge about other node s local state Failures are endemic Communica?on costs ma@er Why Is DS So Hard? System design Par??oning of

More information

CS60092: Informa0on Retrieval

CS60092: Informa0on Retrieval Introduc)on to CS60092: Informa0on Retrieval Sourangshu Bha1acharya Today s lecture hypertext and links We look beyond the content of documents We begin to look at the hyperlinks between them Address ques)ons

More information

Informa(on Retrieval

Informa(on Retrieval Introduc)on to Informa)on Retrieval CS3245 Informa(on Retrieval Lecture 7: Scoring, Term Weigh9ng and the Vector Space Model 7 Last Time: Index Construc9on Sort- based indexing Blocked Sort- Based Indexing

More information

Link Analysis Informa0on Retrieval. Evangelos Kanoulas

Link Analysis Informa0on Retrieval. Evangelos Kanoulas Link Analysis Informa0on Retrieval Evangelos Kanoulas e.kanoulas@uva.nl How Search Works Logging Clicks Context Crawling Quality Freshness Spaminess Text processing & Indexing Ranking Algorithm Content

More information

CMPUT 391 Database Management Systems. Query Processing: The Basics. Textbook: Chapter 10. (first edition: Chapter 13) University of Alberta 1

CMPUT 391 Database Management Systems. Query Processing: The Basics. Textbook: Chapter 10. (first edition: Chapter 13) University of Alberta 1 CMPUT 391 Database Management Systems Query Processing: The Basics Textbook: Chapter 10 (first edition: Chapter 13) Based on slides by Lewis, Bernstein and Kifer University of Alberta 1 External Sorting

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece

More information

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies Informa)on Retrieval and Map- Reduce Implementa)ons Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies mas4108@louisiana.edu Map-Reduce: Why? Need to process 100TB datasets On 1 node:

More information

Similarity Joins in MapReduce

Similarity Joins in MapReduce Similarity Joins in MapReduce Benjamin Coors, Kristian Hunt, and Alain Kaeslin KTH Royal Institute of Technology {coors,khunt,kaeslin}@kth.se Abstract. This paper studies how similarity joins can be implemented

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

MapReduce Algorithms

MapReduce Algorithms Large-scale data processing on the Cloud Lecture 3 MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation Commons Attribution 3.0 License) Outline

More information

Parallel Nested Loops

Parallel Nested Loops Parallel Nested Loops For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on (S 1,T 1 ), (S 1,T 2 ),

More information

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011 Parallel Nested Loops Parallel Partition-Based For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on

More information

TI2736-B Big Data Processing. Claudia Hauff

TI2736-B Big Data Processing. Claudia Hauff TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Ctd. Graphs Pig Design Patterns Hadoop Ctd. Giraph Zoo Keeper Spark Spark Ctd. Learning objectives

More information

Introduc)on to Informa)on Retrieval. Index Construc.on. Slides by Manning, Raghavan, Schutze

Introduc)on to Informa)on Retrieval. Index Construc.on. Slides by Manning, Raghavan, Schutze Index Construc.on Slides by Manning, Raghavan, Schutze 1 Plan Last lecture: Dic.onary data structures Tolerant retrieval Wildcards Spell correc.on Soundex a-hu hy-m n-z $m mace madden mo among amortize

More information

Unsupervised learning: Data Mining. Associa6on rules and frequent itemsets mining

Unsupervised learning: Data Mining. Associa6on rules and frequent itemsets mining Unsupervised learning: Data Mining Associa6on rules and frequent itemsets mining Data Mining concepts Is the computa6onal process of discovering pa

More information

Link State Rou.ng Reading: Sec.ons 4.2 and 4.3.4

Link State Rou.ng Reading: Sec.ons 4.2 and 4.3.4 Link State Rou.ng Reading: Sec.ons. and.. COS 6: Computer Networks Spring 009 (MW :0 :50 in COS 05) Michael Freedman Teaching Assistants: WyaN Lloyd and Jeff Terrace hnp://www.cs.princeton.edu/courses/archive/spring09/cos6/

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 (Fall 2018) Part 4: Analyzing Graphs (1/2) October 4, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are

More information

Graph Algorithms. Revised based on the slides by Ruoming Kent State

Graph Algorithms. Revised based on the slides by Ruoming Kent State Graph Algorithms Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

On Page Rank. 1 Introduction

On Page Rank. 1 Introduction On Page Rank C. Hoede Faculty of Electrical Engineering, Mathematics and Computer Science University of Twente P.O.Box 217 7500 AE Enschede, The Netherlands Abstract In this paper the concept of page rank

More information

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.a-star.edu.sg Abstract. Processing XML queries over

More information

Large-Scale Duplicate Detection

Large-Scale Duplicate Detection Large-Scale Duplicate Detection Potsdam, April 08, 2013 Felix Naumann, Arvid Heise Outline 2 1 Freedb 2 Seminar Overview 3 Duplicate Detection 4 Map-Reduce 5 Stratosphere 6 Paper Presentation 7 Organizational

More information

Introduction to Database Systems CSE 444, Winter 2011

Introduction to Database Systems CSE 444, Winter 2011 Version March 15, 2011 Introduction to Database Systems CSE 444, Winter 2011 Lecture 20: Operator Algorithms Where we are / and where we go 2 Why Learn About Operator Algorithms? Implemented in commercial

More information

MapReduce Patterns, Algorithms, and Use Cases

MapReduce Patterns, Algorithms, and Use Cases MapReduce Patterns, Algorithms, and Use Cases In this article I digested a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found on the web

More information

Link Structure Analysis

Link Structure Analysis Link Structure Analysis Kira Radinsky All of the following slides are courtesy of Ronny Lempel (Yahoo!) Link Analysis In the Lecture HITS: topic-specific algorithm Assigns each page two scores a hub score

More information

Lecture 11: Graph algorithms! Claudia Hauff (Web Information Systems)!

Lecture 11: Graph algorithms! Claudia Hauff (Web Information Systems)! Lecture 11: Graph algorithms!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the scenes of MapReduce:

More information

Compila(on (Semester A, 2013/14)

Compila(on (Semester A, 2013/14) Compila(on 0368-3133 (Semester A, 2013/14) Lecture 4: Syntax Analysis (Top- Down Parsing) Modern Compiler Design: Chapter 2.2 Noam Rinetzky Slides credit: Roman Manevich, Mooly Sagiv, Jeff Ullman, Eran

More information

CSCI 599 Class Presenta/on. Zach Levine. Markov Chain Monte Carlo (MCMC) HMM Parameter Es/mates

CSCI 599 Class Presenta/on. Zach Levine. Markov Chain Monte Carlo (MCMC) HMM Parameter Es/mates CSCI 599 Class Presenta/on Zach Levine Markov Chain Monte Carlo (MCMC) HMM Parameter Es/mates April 26 th, 2012 Topics Covered in this Presenta2on A (Brief) Review of HMMs HMM Parameter Learning Expecta2on-

More information

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank Page rank computation HPC course project a.y. 2012-13 Compute efficient and scalable Pagerank 1 PageRank PageRank is a link analysis algorithm, named after Brin & Page [1], and used by the Google Internet

More information

Informa(on Retrieval

Informa(on Retrieval Introduc)on to Informa)on Retrieval CS3245 Informa(on Retrieval Lecture 7: Scoring, Term Weigh9ng and the Vector Space Model 7 Last Time: Index Compression Collec9on and vocabulary sta9s9cs: Heaps and

More information

RELATIONAL OPERATORS #1

RELATIONAL OPERATORS #1 RELATIONAL OPERATORS #1 CS 564- Spring 2018 ACKs: Jeff Naughton, Jignesh Patel, AnHai Doan WHAT IS THIS LECTURE ABOUT? Algorithms for relational operators: select project 2 ARCHITECTURE OF A DBMS query

More information

Recent Researches on Web Page Ranking

Recent Researches on Web Page Ranking Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through

More information

COMP Associa0on Rules

COMP Associa0on Rules COMP 4601 Associa0on Rules 1 Road map Basic concepts Apriori algorithm Different data formats for mining Mining with mul0ple minimum supports Mining class associa0on rules Summary 2 What Is Frequent Pattern

More information

MapReduce Patterns. MCSN - N. Tonellotto - Distributed Enabling Platforms

MapReduce Patterns. MCSN - N. Tonellotto - Distributed Enabling Platforms MapReduce Patterns 1 Intermediate Data Written locally Transferred from mappers to reducers over network Issue - Performance bottleneck Solution - Use combiners - Use In-Mapper Combining 2 Original Word

More information

Introduc)on to. CS60092: Informa0on Retrieval

Introduc)on to. CS60092: Informa0on Retrieval Introduc)on to CS60092: Informa0on Retrieval Ch. 4 Index construc)on How do we construct an index? What strategies can we use with limited main memory? Sec. 4.1 Hardware basics Many design decisions in

More information

Lecture 13: Abstract Data Types / Stacks

Lecture 13: Abstract Data Types / Stacks ....... \ \ \ / / / / \ \ \ \ / \ / \ \ \ V /,----' / ^ \ \.--..--. / ^ \ `--- ----` / ^ \. ` > < / /_\ \. ` / /_\ \ / /_\ \ `--' \ /. \ `----. / \ \ '--' '--' / \ / \ \ / \ / / \ \ (_ ) \ (_ ) / / \ \

More information

MapReduce for Graph Algorithms

MapReduce for Graph Algorithms Seminar: Massive-Scale Graph Analysis Summer Semester 2015 MapReduce for Graph Algorithms Modeling & Approach Ankur Sharma ankur@stud.uni-saarland.de May 8, 2015 Agenda 1 Map-Reduce Framework Big Data

More information

CompSci 516: Database Systems. Lecture 20. Parallel DBMS. Instructor: Sudeepa Roy

CompSci 516: Database Systems. Lecture 20. Parallel DBMS. Instructor: Sudeepa Roy CompSci 516 Database Systems Lecture 20 Parallel DBMS Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements HW3 due on Monday, Nov 20, 11:55 pm (in 2 weeks) See some

More information

MapReduce. Cloud Computing COMP / ECPE 293A

MapReduce. Cloud Computing COMP / ECPE 293A Cloud Computing COMP / ECPE 293A MapReduce Jeffrey Dean and Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, In Proceedings of the 6th conference on Symposium on Opera7ng Systems

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011 Data-Intensive Information Processing Applications! Session #5 Graph Algorithms Jordan Boyd-Graber University of Maryland Thursday, March 3, 2011 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis

Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis Elif Dede, Madhusudhan Govindaraju Lavanya Ramakrishnan, Dan Gunter, Shane Canon Department of Computer Science, Binghamton

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommender systems Clustering Community Detection Web advertising

Hypergraph Sparsification and Its Application to Partitioning

Hypergraph Sparsification and Its Application to Partitioning Hypergraph Sparsification and Its Application to Partitioning Mehmet Deveci 1,3, Kamer Kaya 1, Ümit V. Çatalyürek 1,2 1 Dept. of Biomedical Informatics, The Ohio State University 2 Dept. of Electrical & Computer

Gestione Avanzata dell'Informazione Part A Full-Text Information Management. Full-Text Indexing

Gestione Avanzata dell'Informazione Part A Full-Text Information Management. Full-Text Indexing Gestione Avanzata dell'Informazione Part A Full-Text Information Management Full-Text Indexing Contents Introduction Inverted Indices Construction Searching 2 GAvI - Full-Text Information Management:

Information Networks: PageRank

Information Networks: PageRank Information Networks: PageRank Web Science (VU) (706.716) Elisabeth Lex ISDS, TU Graz June 18, 2018 Elisabeth Lex (ISDS, TU Graz) Links June 18, 2018 1 / 38 Repetition Information Networks Shape of the

1. Introduction to MapReduce

1. Introduction to MapReduce Processing of massive data: MapReduce 1. Introduction to MapReduce 1 Origins: the Problem Google faced the problem of analyzing huge sets of data (order of petabytes) E.g. pagerank, web access logs, etc.

Query Processing: The Basics. External Sorting

Query Processing: The Basics. External Sorting Query Processing: The Basics Chapter 10 1 External Sorting Sorting is used in implementing many relational operations Problem: Relations are typically large, do not fit in main memory So cannot use traditional

Static Single Assignment (SSA) Form

Static Single Assignment (SSA) Form Static Single Assignment (SSA) Form SSA form Static single assignment form Intermediate representation of program in which every use of a variable is reached by exactly one definition Most programs do not

CS 378 Big Data Programming

CS 378 Big Data Programming CS 378 Big Data Programming Lecture 11 more on Data Organization Patterns CS 378 - Fall 2016 Big Data Programming 1 Assignment 5 - Review Define an Avro object for user session One user session for each

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big

Keyword query interpretation over structured data

Keyword query interpretation over structured data Keyword query interpretation over structured data Advanced Methods of Information Retrieval Elena Demidova SS 2018 Elena Demidova: Advanced Methods of Information Retrieval SS 2018 1 Recap Elena Demidova:

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 4: Index Construction Plan Last lecture: Dictionary data structures Tolerant retrieval

Lecture 27: Learning from relational data

Lecture 27: Learning from relational data Lecture 27: Learning from relational data STATS 202: Data mining and analysis December 2, 2017 1 / 12 Announcements Kaggle deadline is this Thursday (Dec 7) at 4pm. If you haven't already, make a submission

Index construction. Friday, 8 April 16 1

Index construction. Friday, 8 April 16 1 Index construction Informational Retrieval By Dr. Qaiser Abbas Department of Computer Science & IT, University of Sargodha, Sargodha, 40100, Pakistan qaiser.abbas@uos.edu.pk Friday, 8 April 16 1 4.3 Single-pass

CSE 444: Database Internals. Section 4: Query Optimizer

CSE 444: Database Internals. Section 4: Query Optimizer CSE 444: Database Internals Section 4: Query Optimizer Plan for Today Problem 1A, 1B: Estimating cost of a plan You try to compute the cost for 5 mins We go over the solution together Problem 2: Sellinger Optimizer

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #21: Data Mining and Warehousing

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #21: Data Mining and Warehousing CS 4604: Introduction to Database Management Systems B. Aditya Prakash Lecture #21: Data Mining and Warehousing Overview Traditional database systems are tuned to many, small, simple queries. New applications

Mining Social Network Graphs

Mining Social Network Graphs Mining Social Network Graphs Analysis of Large Graphs: Community Detection Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com Note to other teachers and users of these slides: We would be

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

Text search. CSE 392, Computers Playing Jeopardy!, Fall

Text search. CSE 392, Computers Playing Jeopardy!, Fall Text search CSE 392, Computers Playing Jeopardy!, Fall 2011 Stony Brook University http://www.cs.stonybrook.edu/~cse392 1 Today 2 parts: theoretical: costs of searching substrings, data structures for

Network Analysis Integrative Genomics module

Network Analysis Integrative Genomics module Network Analysis Integrative Genomics module Michael Inouye Centre for Systems Genomics University of Melbourne, Australia Summer Institute in Statistical Genetics 2016 Seattle, USA @minouye271 inouyelab.org

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech CSE 6242/ CX 4242 Feb 18, 2014 Graphs / Networks Centrality measures, algorithms, interactive applications Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) Sta306b May 11, 2012 PageRank: 1 Web search before Google (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) Sta306b May 11, 2012 PageRank: 2 Web search

Using Sequential Runtime Distributions for the Parallel Speedup Prediction of SAT Local Search

Using Sequential Runtime Distributions for the Parallel Speedup Prediction of SAT Local Search Using Sequential Runtime Distributions for the Parallel Speedup Prediction of SAT Local Search Alejandro Arbelaez - Charlotte Truchet - Philippe Codognet JFLI University of Tokyo LINA, UMR 6241 University of

CS 4700: Foundations of Artificial Intelligence. Bart Selman. Search Techniques R&N: Chapter 3

CS 4700: Foundations of Artificial Intelligence. Bart Selman. Search Techniques R&N: Chapter 3 CS 4700: Foundations of Artificial Intelligence Bart Selman Search Techniques R&N: Chapter 3 Outline Search: tree search and graph search Uninformed search: very briefly (covered before in other prerequisite

Uninformed search strategies

Uninformed search strategies Uninformed search strategies A search strategy is defined by picking the order of node expansion Uninformed search strategies use only the information available in the problem definition Breadth-first search

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

Frameworks for Graph-Based Problems

Frameworks for Graph-Based Problems Frameworks for Graph-Based Problems Dakshil Shah U.G. Student Computer Engineering Department Dwarkadas J. Sanghvi College of Engineering, Mumbai, India Chetashri Bhadane Assistant Professor Computer Engineering

Efficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migration

Efficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migration Efficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migration

Overview of this week

Overview of this week Overview of this week Debugging tips for ML algorithms Graph algorithms at scale A prototypical graph algorithm: PageRank n memory Putting more and more on disk Sampling from a graph What is a good sample

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank

An application of Markov Chains: PageRank. Finding relevant information on the Web

An application of Markov Chains: PageRank. Finding relevant information on the Web An application of Markov Chains: PageRank Finding relevant information on the Web Please Participate http://www.st.ewi.tudelft.nl/~marco/lectures.html How much do you know about PageRank? 1) Nothing. 2) I

Indexing Large-Scale Data

Indexing Large-Scale Data Indexing Large-Scale Data Serge Abiteboul Ioana Manolescu Philippe Rigaux Marie-Christine Rousset Pierre Senellart Web Data Management and Distribution http://webdam.inria.fr/textbook November 16, 2010
