Oral Exams Dates. Distributed Data Management Summer Semester 2013 TU Kaiserslautern. Recap: Map and Reduce. (Equi) Join of 3 Relations
Distributed Data Management
Summer Semester 2013, TU Kaiserslautern
Dr.-Ing. Sebastian Michel
smichel@mmci.uni-saarland.de

Oral Exams Dates
- Note: last week of teaching at the university in SS 13: July 15 - July 20, 2013.
- How about: all (around 5?) slots in July (after the end of teaching, or in the last week), or early August.
- Your preferences?

Distributed Data Management, SoSe 2013, S. Michel

Lecture 4
MAP REDUCE: APPLICATIONS (CONT'D)

Recap: Map and Reduce
- Map: (k, v) -> list(k2, v2)
- Reduce: (k2, list(v2)) -> list(k3, v3)
- Keys allow grouping data to machines/tasks.
- For instance: k = document identifier, v = document content; k2 = term, v2 = count; k3 = term, v3 = final count.

(Equi) Join of 3 Relations
- R(A,B) Join S(B,C) Join T(C,D)
- Can be implemented as two 2-way joins, e.g., R(A,B) Join S(B,C), and then the result joined with T(C,D).
- Or directly, how?

Join of 3 Relations: Considerations
- R(A,B) Join S(B,C) Join T(C,D)
- Send each tuple of S by its key (b,c). Tuples of R and T know only one half of that key, so they must be sent to many key combinations: a tuple of R to every (b,*), a tuple of T to every (*,c).
- Note: Theta joins (with an arbitrary join predicate) are much more complicated.
- Foto N. Afrati, Jeffrey D. Ullman: Optimizing joins in a map-reduce environment. EDBT 2010: 99-110
- Alper Okcan, Mirek Riedewald: Processing theta-joins using MapReduce. SIGMOD Conference 2011
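To make the "directly, how?" question concrete, here is a minimal single-round sketch in plain Python that simulates the map and reduce phases in memory. It follows the replication idea from the considerations slide, in the spirit of the cited Afrati and Ullman paper: reducers are arranged in a grid addressed by a pair of hash buckets for (b, c). The function name `three_way_join`, the grid sizes `kb` and `kc`, and the use of Python's built-in `hash` are assumptions made for illustration only; they do not come from the slides.

```python
from collections import defaultdict
from itertools import product


def three_way_join(R, S, T, kb=2, kc=2):
    """Simulate one MapReduce round for R(A,B) JOIN S(B,C) JOIN T(C,D).

    Reducers form a kb x kc grid; reducer (i, j) is responsible for all
    join keys (b, c) with hash(b) % kb == i and hash(c) % kc == j.
    """
    def hb(b):
        return hash(b) % kb

    def hc(c):
        return hash(c) % kc

    reducers = defaultdict(lambda: {"R": [], "S": [], "T": []})

    # Map phase: S goes to exactly one reducer, R and T are replicated.
    for a, b in R:
        for j in range(kc):              # c is unknown: send to every (b, *)
            reducers[(hb(b), j)]["R"].append((a, b))
    for b, c in S:
        reducers[(hb(b), hc(c))]["S"].append((b, c))
    for c, d in T:
        for i in range(kb):              # b is unknown: send to every (*, c)
            reducers[(i, hc(c))]["T"].append((c, d))

    # Reduce phase: each reducer joins its local fragments.
    out = []
    for bucket in reducers.values():
        r_by_b = defaultdict(list)
        for a, b in bucket["R"]:
            r_by_b[b].append(a)
        t_by_c = defaultdict(list)
        for c, d in bucket["T"]:
            t_by_c[c].append(d)
        for b, c in bucket["S"]:
            for a, d in product(r_by_b[b], t_by_c[c]):
                out.append((a, b, c, d))
    return sorted(out)


if __name__ == "__main__":
    R = [(1, "x"), (2, "x"), (3, "y")]
    S = [("x", "p"), ("y", "q"), ("z", "p")]
    T = [("p", 10), ("q", 20), ("q", 21)]
    for row in three_way_join(R, S, T):
        print(row)
```

Because every S tuple reaches exactly one reducer, each result tuple is produced once. The price is communication: every R tuple is copied `kc` times and every T tuple `kb` times, which is the trade-off against running two 2-way joins in sequence.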
n-Grams
- Statistics about variable-length word sequences (e.g., "lord of the rings", "at the end of", ...) have many applications in fields including:
  - Information Retrieval
  - Natural Language Processing
  - Digital Humanities

Example: Google Books Ngrams
- Example phrases: "thou shalt not", "don't ya"
- E.g., http://books.google.com/ngrams/
- An n-gram dataset is also available from there.
- (n-gram slides based on a talk by Klaus Berberich)

n-grams Example
- Document: a x b b a y
- Possible n-grams, e.g.: (a), (x), (b), (y), (ax), (xb), (bb), (axb), (xbb), (axbb), (xbba), (bbay), (axbba), (xbbay), (axbbay)

Task: Computing n-grams in MR
- How can we efficiently compute n-grams that occur at least τ times and consist of at most σ words using MapReduce?
- Klaus Berberich, Srikanta J. Bedathur: Computing n-gram statistics in MapReduce. EDBT 2013: 101-112

Naïve Solution: Simple Counting

    map(did, content):
        for k in <1 ... σ>:
            for all k-grams in content:
                emit(k-gram, did)

    reduce(n-gram, list<did>):
        if length(list<did>) >= τ:
            emit(n-gram, length(list<did>))

A Priori Based
- A priori principle*: a k-gram can occur more than τ times only if its constituent (k-1)-grams occur at least τ times.
- (a,b,c) qualifies only if (b,c), (a,b) and (a), (b), (c) qualify.
- How to implement?
- *) Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993
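The naive counting solution from the slide above can be tried out with a small in-memory simulation. The sketch below is plain Python rather than Hadoop code; the driver `run_naive` and the choice to split documents on whitespace are assumptions for illustration, while `sigma` and `tau` correspond to σ and τ on the slides.

```python
from collections import defaultdict


def map_ngrams(did, content, sigma):
    """Emit (k-gram, did) for every k-gram with 1 <= k <= sigma."""
    words = content.split()
    for k in range(1, sigma + 1):
        for i in range(len(words) - k + 1):
            yield tuple(words[i:i + k]), did


def reduce_ngram(ngram, dids, tau):
    """Keep an n-gram only if it was emitted at least tau times."""
    if len(dids) >= tau:
        yield ngram, len(dids)


def run_naive(docs, sigma, tau):
    """Simulate shuffle and reduce: group map output by n-gram, then count."""
    groups = defaultdict(list)
    for did, content in docs.items():
        for ngram, d in map_ngrams(did, content, sigma):
            groups[ngram].append(d)
    result = {}
    for ngram, dids in groups.items():
        for key, count in reduce_ngram(ngram, dids, tau):
            result[key] = count
    return result


if __name__ == "__main__":
    docs = {"d1": "a x b b a y", "d2": "a x"}
    print(run_naive(docs, sigma=2, tau=2))
```

Note how much intermediate data this produces: a document of n words yields roughly n records for each value of k, so about σ times n records in total, most of which are discarded by the τ filter in the reducer.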
A Priori Based (Cont'd)
- Iterative implementation:
  - First, 1-grams that occur at least τ times
  - Then, 2-grams that occur at least τ times
  - ...
- Needs multiple MapReduce rounds (of full data scans).
- Already determined k-grams are kept.

Suffix Based
- Emit only suffixes in the map phase.
- Each of them represents multiple n-grams, corresponding to its prefixes.
- For instance, axbbay represents a, ax, axb, axbb, axbba, and axbbay.

    map(did, content):
        for all suffixes in content:
            emit(suffix, did)

Suffix Based: Partitioning
- Partition the suffixes by first word to ensure that all n-grams end up properly for counting; that is, all occurrences of ax have to end up at the same reducer.
- Suffix property: ax is only generated from suffixes that start with ax...

    partition(suffix, did):
        return suffix[0] % m

Suffix Based: Sorting
- The reducer has to generate n-grams based on suffixes:
  - read prefixes
  - count for each observed prefix its frequency
- Optimization: sort suffixes in reverse lexicographic order; then: simple counting using a stack.

    compare(suffix0, suffix1):
        return -strcmp(suffix0, suffix1)

- Example of the resulting order: aacd, aaca, aabx, aaba, aab, ax, ...

Discussion
- Assess the aforementioned algorithms with respect to properties like:
  - multiple MapReduce jobs vs. a single job
  - amount of network traffic
  - ease of implementation

GRAPH PROCESSING IN MAPREDUCE
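Returning to the suffix-based method for n-grams: the three ingredients (emit suffixes, partition by first word, count prefixes over reverse-sorted suffixes with a stack) can be sketched as an in-memory simulation in Python. This is an illustrative sketch, not the implementation from the cited paper. Truncating each suffix to `sigma` words, hashing the first word in `partition`, and the helper names `reduce_suffixes` and `run_suffix_based` are assumptions introduced here.

```python
from collections import defaultdict


def map_suffixes(did, content, sigma):
    """Emit every suffix of the document, cut off after sigma words (assumption)."""
    words = content.split()
    for i in range(len(words)):
        yield tuple(words[i:i + sigma]), did


def partition(suffix, m):
    """All suffixes starting with the same word go to the same reducer."""
    return hash(suffix[0]) % m


def reduce_suffixes(suffixes, tau):
    """Count all prefixes of the suffixes using a stack of (term, count)."""
    terms, counts, out = [], [], {}

    def pop():
        ngram, c = tuple(terms), counts.pop()
        terms.pop()
        if c >= tau:
            out[ngram] = c
        if counts:                    # pass the count on to the shorter prefix
            counts[-1] += c

    # Reverse lexicographic order: a suffix is seen before its own prefixes.
    for s in sorted(suffixes, reverse=True):
        lcp = 0                       # length of common prefix with the stack
        while lcp < min(len(s), len(terms)) and s[lcp] == terms[lcp]:
            lcp += 1
        while len(terms) > lcp:       # prefixes that cannot occur again
            pop()
        for w in s[lcp:]:
            terms.append(w)
            counts.append(0)
        counts[-1] += 1               # this suffix itself
    while terms:
        pop()
    return out


def run_suffix_based(docs, sigma, tau, m=2):
    buckets = defaultdict(list)
    for did, content in docs.items():
        for suffix, _ in map_suffixes(did, content, sigma):
            buckets[partition(suffix, m)].append(suffix)
    result = {}
    for suffixes in buckets.values():
        result.update(reduce_suffixes(suffixes, tau))
    return result


if __name__ == "__main__":
    print(run_suffix_based({"d1": "a x b b a y"}, sigma=3, tau=2))
```

Compared with naive counting, the map phase emits one record per word position instead of one per n-gram, which speaks to the network-traffic point in the discussion; the extra work moves into the reducer's sort and stack pass.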
Graph Processing in MapReduce
- General: graph representation
- Usually: adjacency list

    v1 -> v2, v4, v5
    v2 -> v4
    v3 -> v5

[Figure: example graph with nodes v1, v2, v3, v4, v5]

Refresher: Breadth First Search (BFS)

    Q = FIFO queue
    enqueue start node
    while not found:
        n := Q.dequeue
        if n == target then break
        foreach c in n.childlist:
            Q.enqueue(c)

- Example visiting order: [figure not preserved in the transcript]

Graph Processing in MapReduce (2)
- No global state in MapReduce.
- Need to pass on results AND graph structure.

    map(id, node) {
        emit(id, node)                          // pass on graph structure
        partial_result = local_compute()
        for each neighbor in node.adjacencyList {
            emit(neighbor.id, partial_result)
        }
    }

    reduce(id, list) {
        foreach msg in list {
            if msg instanceof Node
                node = msg                      // re-construct outgoing edges for next round
            else
                result = aggregate(result, msg) // make use of incoming results
        }
        node.value = result
        emit(id, node)
    }

BFS in MapReduce
- How to implement Breadth First Search in MapReduce?
- Hint: need to pass on structure (as seen before).
- Augment nodes with additional information: visited, distance.

Application: Computing PageRank
- Link analysis model proposed by Brin & Page
- Compute authority scores
- In terms of: incoming links (weights) from other pages
- Random surfer model
- S. Brin & L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf.
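One possible answer to the BFS question above, sketched as an in-memory Python simulation of repeated MapReduce rounds. Each node record carries its adjacency list plus a distance, where an infinite distance plays the role of "not visited". The record layout, the tagged messages, and the stop criterion (no record changed in a round) are assumptions of this sketch; it also assumes every node appears as a key of the adjacency dictionary, including nodes without outgoing edges.

```python
import math
from collections import defaultdict


def bfs_map(nid, node):
    """Pass on the node itself and offer dist+1 to every neighbor."""
    yield nid, ("node", node)
    if node["dist"] < math.inf:           # only visited nodes send distances
        for nb in node["adj"]:
            yield nb, ("dist", node["dist"] + 1)


def bfs_reduce(nid, msgs):
    """Recover the node record and keep the smallest distance seen."""
    node, best = None, math.inf
    for kind, payload in msgs:
        if kind == "node":
            node = payload
        else:
            best = min(best, payload)
    return nid, {"adj": node["adj"], "dist": min(node["dist"], best)}


def bfs(adj, source):
    """Run rounds until no distance changes; returns {node: distance}."""
    graph = {
        n: {"adj": list(a), "dist": 0 if n == source else math.inf}
        for n, a in adj.items()
    }
    while True:
        groups = defaultdict(list)
        for nid, node in graph.items():
            for key, msg in bfs_map(nid, node):
                groups[key].append(msg)
        new_graph = dict(bfs_reduce(nid, msgs) for nid, msgs in groups.items())
        if new_graph == graph:
            return {n: rec["dist"] for n, rec in graph.items()}
        graph = new_graph


if __name__ == "__main__":
    # The adjacency list from the slide, with v4 and v5 added as empty entries.
    adj = {"v1": ["v2", "v4", "v5"], "v2": ["v4"], "v3": ["v5"], "v4": [], "v5": []}
    print(bfs(adj, "v1"))
```

Unlike the queue-based refresher, every round touches the whole graph, so the number of rounds grows with the largest distance from the source. In a real job the "nothing changed" test would need some global signal, for example a counter, which this sketch replaces with a direct comparison.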
PageRank: Formal Definition
- PageRank of a page q:

    $$PR(q) = \varepsilon \sum_{p \,:\, p \to q} \frac{PR(p)}{out(p)} + (1 - \varepsilon)\,\frac{1}{N}$$

- N: total number of pages
- PR(p): PageRank of page p
- out(p): outdegree of p
- ε: damping parameter; in the formula as written, a link is followed with probability ε and a random jump is taken with probability 1 - ε
- Iterative computation until convergence
- Dangling nodes: sinks. Solution: add a (uniform) random jump to any other node.

Formal Model of Web Graph
- Matrix representation of graphs
- Given a graph G, its adjacency matrix A is n x n, with
  - a_ij = 1 if there is a link from node i to node j
  - a_ij = 0 otherwise

[Figure: example graph with nodes v1 to v5 and its 5 x 5 adjacency matrix; the matrix entries are not recoverable from the transcript]

PageRank: Matrix Notation
- A: matrix containing the transition probabilities

    $$A = \varepsilon P^{T} + (1 - \varepsilon) E$$

  where P_ij = 1/out(i) if there is a link from i to j, and 0 otherwise; E is the random jumps matrix.
- Probability distribution vector at time k:

    $$x^{(k)} = A^{k} x^{(0)}$$

  where x^(0) is the starting vector.
- PageRank: stationary distribution of the Markov chain described by A, i.e., the principal eigenvector of A:

    $$PageRank = \lim_{k \to \infty} x^{(k)}$$

Reconsider: PageRank in MapReduce

    $$PR(q) = \varepsilon \sum_{p \,:\, p \to q} \frac{PR(p)}{out(p)} + (1 - \varepsilon)\,\frac{1}{N}$$

- => To compute PR(q) we need only information about the PR scores and out-degrees of the nodes that link to q.
- Have info: (page q, PR) linking to pages p1, p2, ...
- => Need to invert that pattern.

PR in MR: Map Phase

    map(nid m, node M)
        p = M.pageRank / |M.adjacencyList|   // node has pagerank attribute and list of outgoing edges
        emit(nid m, M)                       // send info about outgoing edges
        for all nid x in M.adjacencyList do
            emit(nid x, p)                   // send score mass to nodes M links to

PR in MR: After Map Phase
- We have now, for page K (group by id of K):
  - [pageIN1, PR(IN1)/out(IN1)], [pageIN2, PR(IN2)/out(IN2)], ...: PR information from incoming links
  - [pageO1, pageO2, pageO3, ...]: information about outgoing links
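As a cross-check of the matrix formulation, the iteration x(k) = A x(k-1) can be run directly on an adjacency list without building the matrix. The sketch below is plain Python; the value `eps=0.85`, the tolerance, and the handling of dangling nodes (their score is spread uniformly over all nodes, including themselves) are assumptions chosen for illustration rather than values given on the slides.

```python
def pagerank_power_iteration(adj, eps=0.85, tol=1e-12, max_iter=1000):
    """Iterate PR(q) = eps * sum_{p->q} PR(p)/out(p) + (1 - eps)/N.

    adj maps every node to the list of nodes it links to; nodes with an
    empty list are dangling and jump uniformly (assumption of this sketch).
    """
    nodes = sorted(adj)
    n = len(nodes)
    x = {v: 1.0 / n for v in nodes}            # starting vector x^(0)
    for _ in range(max_iter):
        dangling = sum(x[p] for p in nodes if not adj[p])
        nxt = {v: (1 - eps) / n + eps * dangling / n for v in nodes}
        for p in nodes:
            for q in adj[p]:
                nxt[q] += eps * x[p] / len(adj[p])
        if sum(abs(nxt[v] - x[v]) for v in nodes) < tol:
            return nxt
        x = nxt
    return x


if __name__ == "__main__":
    adj = {"v1": ["v2", "v4", "v5"], "v2": ["v4"], "v3": ["v5"], "v4": [], "v5": []}
    for node, score in sorted(pagerank_power_iteration(adj).items()):
        print(node, round(score, 4))
```

With this dangling-node treatment the scores keep summing to 1 in every step, which is what makes x(k) a probability distribution as stated in the matrix notation.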
PR in MR: Reduce Phase

    reduce(nid m, [p1, p2, ...])
        s = 0; M = node
        for all p in [p1, p2, ...] do
            if p instanceof node then
                M = p                    // recover outgoing edges
            else
                s += p                   // sum up incoming PR scores
        M.pageRank = (1 - ε)/n + ε * s
        emit(nid m, node M)

Literature
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google Labs.
- http://craig-henderson.blogspot.de/2009/11/dewitt-and-stonebrakers-mapreduce-major.html
- Foto N. Afrati, Jeffrey D. Ullman: Optimizing joins in a map-reduce environment. EDBT 2010: 99-110
- Alper Okcan, Mirek Riedewald: Processing theta-joins using MapReduce. SIGMOD Conference 2011
- Klaus Berberich, Srikanta J. Bedathur: Computing n-gram statistics in MapReduce. EDBT 2013: 101-112
- Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993
- S. Brin & L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf.
- Hadoop Book: Tom White. Hadoop: The Definitive Guide. O'Reilly, 3rd edition.

Literature (2)
- Bloom, Burton H. (1970), "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM 13 (7)
- Broder, Andrei; Mitzenmacher, Michael (2005), "Network Applications of Bloom Filters: A Survey", Internet Mathematics 1 (4)
- Publicly available book: http://lintool.github.io/mapreducealgorithms/mapreduce-book-final.pdf
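Putting the PageRank map and reduce pseudocode together, a single round can be simulated in memory as follows. This is a hedged sketch in plain Python: the dictionary layout of a node record (`adj`, `pr`), the driver `pagerank_round`, and `eps=0.85` are assumptions for illustration. Like the slide pseudocode, it does not treat dangling nodes specially, so the example graph has none.

```python
from collections import defaultdict


def pr_map(nid, node):
    """Pass on the node (graph structure) and send score mass to its targets."""
    yield nid, node
    out = node["adj"]
    if out:
        p = node["pr"] / len(out)
        for x in out:
            yield x, p


def pr_reduce(nid, values, n, eps):
    """Recover the node record and sum up the incoming score mass."""
    node, s = None, 0.0
    for v in values:
        if isinstance(v, dict):      # the node record itself
            node = v
        else:                        # a PageRank contribution
            s += v
    return nid, {"adj": node["adj"], "pr": (1 - eps) / n + eps * s}


def pagerank_round(graph, eps=0.85):
    """One MapReduce round: map every node, group by key, reduce every group."""
    n = len(graph)
    groups = defaultdict(list)
    for nid, node in graph.items():
        for key, value in pr_map(nid, node):
            groups[key].append(value)
    return dict(pr_reduce(nid, vals, n, eps) for nid, vals in groups.items())


if __name__ == "__main__":
    graph = {
        "a": {"adj": ["b", "c"], "pr": 1 / 3},
        "b": {"adj": ["c"], "pr": 1 / 3},
        "c": {"adj": ["a"], "pr": 1 / 3},
    }
    for _ in range(20):
        graph = pagerank_round(graph)
    print({nid: round(node["pr"], 4) for nid, node in graph.items()})
```

Each call corresponds to one job, so iterating "until convergence" means chaining jobs and re-reading the full graph every time, the same cost pattern as in the generic graph-processing template.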
More informationGraphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech
CSE 6242/ CX 4242 Feb 18, 2014 Graphs / Networks Centrality measures, algorithms, interactive applications Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey
More informationWeb search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)
' Sta306b May 11, 2012 $ PageRank: 1 Web search before Google (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) & % Sta306b May 11, 2012 PageRank: 2 Web search
More informationUsing Sequen+al Run+me Distribu+ons for the Parallel Speedup Predic+on of SAT Local Search
Using Sequen+al Run+me Distribu+ons for the Parallel Speedup Predic+on of SAT Local Search Alejandro Arbelaez - CharloBe Truchet - Philippe Codognet JFLI University of Tokyo LINA, UMR 6241 University of
More informationCS 4700: Foundations of Artificial Intelligence. Bart Selman. Search Techniques R&N: Chapter 3
CS 4700: Foundations of Artificial Intelligence Bart Selman Search Techniques R&N: Chapter 3 Outline Search: tree search and graph search Uninformed search: very briefly (covered before in other prerequisite
More informationUninformed search strategies
Uninformed search strategies A search strategy is defined by picking the order of node expansion Uninformed search strategies use only the informa:on available in the problem defini:on Breadth- first search
More informationPSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets
2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department
More informationFrameworks for Graph-Based Problems
Frameworks for Graph-Based Problems Dakshil Shah U.G. Student Computer Engineering Department Dwarkadas J. Sanghvi College of Engineering, Mumbai, India Chetashri Bhadane Assistant Professor Computer Engineering
More informationEfficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migra<on
Efficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migra
More informationOverview of this week
Overview of this week Debugging tips for ML algorithms Graph algorithms at scale A prototypical graph algorithm: PageRank n memory Putting more and more on disk Sampling from a graph What is a good sample
More informationCOMP5331: Knowledge Discovery and Data Mining
COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank
More informationAn applica)on of Markov Chains: PageRank. Finding relevant informa)on on the Web
An applica)on of Markov Chains: PageRank Finding relevant informa)on on the Web Please Par)cipate h>p://www.st.ewi.tudelc.nl/~marco/lectures.html How much do you know about PageRank? 1) Nothing. 2) I
More informationIndexing Large-Scale Data
Indexing Large-Scale Data Serge Abiteboul Ioana Manolescu Philippe Rigaux Marie-Christine Rousset Pierre Senellart Web Data Management and Distribution http://webdam.inria.fr/textbook November 16, 2010
More information