Data-Intensive Computing with MapReduce
|
|
- Janel Poole
- 5 years ago
- Views:
Transcription
1 Data-Intensive Computing with MapReduce Session 6: Similar Item Detection Jimmy Lin University of Maryland Thursday, February 28, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See for details
2 Developing big data intuitions Source: Wikipedia (Mahout)
3 Today s Agenda The problem of similar item detection Distances and representations Minhash Random projections End-to-end Wikipedia application
4 What s the Problem? Finding similar items with respect to some distance metric Two variants of the problem: l Offline: extract all similar pairs of objects from a large collection l Online: is this object similar to something I ve seen before? Similarities/differences with the clustering problem
5 Applications Near-duplicate detection of webpages l Offline vs. online variant of problem Recommender systems Forensic text analysis (e.g., plagiarism detection)
6 Near-Duplicate Detection of Webpages What s the source of the problem? l Mirror pages (legit) l Spam farms (non-legit) l Additional complications (e.g., nav bars) Naïve algorithm: l Compute cryptographic hash for webpage (e.g., MD5) l Insert hash values into a big hash table l Compute hash for new webpage: collision implies duplicate What s the issue? Intuition: l Hash function needs to be tolerant of minor differences l High similarity implies higher probability of hash collision
7 Literature Note Many communities have tackled similar problems: l Theoretical computer science l Information retrieval l Data mining l Databases l Issues l Slightly different terminology l Results not easy to compare
8 Four Steps Specify distance metric l Jaccard, Euclidean, cosine, etc. Compute representation l Shingling, tf.idf, etc. Project l Minhash, random projections, etc. Extract l Bucketing, sliding windows, etc.
9 Distances Source:
10 Distance Metrics 1. Non-negativity: 2. Identity: d(x, y) 0 3. Symmetry: d(x, y) = 0 () x=y d(x, y) = d(y, x) 4. Triangle Inequality d(x, y) apple d(x, z) + d(z, y)
11 Distance: Jaccard Given two sets A, B Jaccard similarity: J(A, B) = A \ B A [ B d(a, B) =1 J(A, B)
12 Distance: Norms Given: x=[x 1,x 2,...x n ] y=[y 1,y 2,...y n ] Euclidean distance (L 2 -norm) d(x, y) = Manhattan distance (L 1 -norm) d(x, y) = L r -norm d(x, y) = v ux t n (x i y i ) 2 i=0 nx x i y i i=0 " X n # 1/r x i y i r i=0
13 Distance: Cosine Given: x=[x 1,x 2,...x n ] y=[y 1,y 2,...y n ] Idea: measure distance between the vectors Thus: cos = x y x y sim(x, y) = pp n i=0 x2 i d(x, y) = 1 sim(x, y) P n i=0 x iy i pp n i=0 y2 i
14 Distance: Hamming Given two bit vectors Hamming distance: number of elements which differ
15 Representations
16 Representations: Text Unigrams (i.e., words) Shingles = n-grams l At the word level l At the character level Feature weights l boolean l tf.idf l BM25 l Which representation to use? What s the actual text? Use the hashing trick!
17 Representations: Beyond Text For recommender systems: l Items as features for users l Users as features for items For graphs: l Adjacency lists as features for vertices With log data: l Behaviors (clicks) as features
18 Minhash Source:
19 Minhash Seminal algorithm for near-duplicate detection of webpages l Used by AltaVista l For details see Broder et al. (1997) Setup: l Documents (HTML pages) represented by shingles (n-grams) l Jaccard similarity: dups are pairs with high similarity
20 Preliminaries: Representation Sets: l A = {e 1, e 3, e 7 } l B = {e 3, e 5, e 7 } Can be equivalently expressed as matrices: Element A B e e e e e e e 7 1 1
21 Preliminaries: Jaccard Element A B e e e e e e e Let: M 00 = # rows where both elements are 0 M 11 = # rows where both elements are 1 M 01 = # rows where A=0, B=1 M 10 = # rows where A=1, B=0 J(A, B) = M 11 M 01 + M 10 + M 11
22 Minhash Computing minhash l Start with the matrix representation of the set l Randomly permute the rows of the matrix l minhash is the first row with a one Example: Element A B e e e e e e e h(a) = e 3 h(b) = e 5 Element A B e e e e e e e 1 1 0
23 Minhash and Jaccard Element A B e e e e e e e M 00 M 00 M 01 M 11 M 11 M 00 M 10 P [h(a) =h(b)] = J(A, B) M 11 M 01 + M 10 + M 11 M 11 M 01 + M 10 + M 11
24 To Permute or Not to Permute? Permutations are expensive Interpret the hash value as the permutation Only need to keep track of the minimum hash value l Can keep track of multiple minhash values at once
25 Extracting Similar Pairs (LSH) We know: P [h(a) =h(b)] = J(A, B) Task: discover all pairs with similarity greater than s Algorithm: l For each object, compute its minhash value l Group objects by their hash values l Output all pairs within each group Analysis: l Probability we will discovered all pairs is s l Probability that any pair is invalid is (1 s) What s the fundamental issue?
26 Two Minhash Signatures Task: discover all pairs with similarity greater than s Algorithm: l For each object, compute two minhash values and concatenate together into a signature l Group objects by their signatures l Output all pairs within each group Analysis: l Probability we will discovered all pairs is s 2 l Probability that any pair is invalid is (1 s) 2
27 k Minhash Signatures Task: discover all pairs with similarity greater than s Algorithm: l For each object, compute k minhash values and concatenate together into a signature l Group objects by their signatures l Output all pairs within each group Analysis: l Probability we will discovered all pairs is s k l Probability that any pair is invalid is (1 s) k What s the issue now?
28 n different k Minhash Signatures Task: discover all pairs with similarity greater than s Algorithm: l For each object, compute n sets k minhash values l For each set, concatenate k minhash values together l Within each set: Group objects by their signatures Output all pairs within each group l De-dup pairs Analysis: l Probability we will miss a pair is (1 s k ) n l Probability that any pair is invalid is n(1 s) k
29 Practical Notes In some cases, checking all candidate pairs may be possible l Time cost is small relative to everything else l Easy method to discard false positives Most common practical implementation: l Generate M minhash values, randomly select k of them n times l Reduces amount of hash computations needed Determining authoritative version is non-trivial
30 MapReduce Implementation Map over objects: l Generate M minhash values, randomly select k of them n times l Each draw yields a signature: emit as intermediate key, value is object id Shuffle/sort: Reduce: l Receive all object ids with same signature, generate candidate pairs l Okay to buffer in memory? Second pass to de-dup
31 Offline Extraction vs. Online Querying Batch formulation of the problem: l Discover all pairs with similarity greater than s l Useful for post-hoc batch processing of web crawl Online formulation of the problem: l Given new webpage, is it similar to one I ve seen before? l Useful for incremental web crawl processing
32 Online Similarity Querying Preparing the existing collection: l For each object, compute n sets of k minhash values l For each set, concatenate k minhash values together l Keep each signature in hash table (in memory) l Note: can parallelize across multiple machines Querying and updating: l For new webpage, compute signatures and check for collisions l Collisions imply duplicate (determine which version to keep) l Update hash tables
33 Random Projections Source:
34 Limitations of Minhash Minhash is great for near-duplicate detection l Set high threshold for Jaccard similarity Limitations: l Jaccard similarity only l Set-based representation, no way to assign weights to features Random projections: l Works with arbitrary vectors using cosine similarity l Same basic idea, but details differ l Slower but more accurate: no free lunch!
35 Random Projection Hashing Generate a random vector r of unit length l Draw from univariate Gaussian for each component l Normalize length Define: h r (u) = 1 if r u 0 0 if r u < 0 Physical intuition?
36 RP Hash Collisions It can be shown that: P [h r (u) = h r (v)] = 1 (u, v) l Proof in (Goemans and Williamson, 1995) Physical intuition?
37 Random Projection Signature Given D random vectors: [r 1, r 2, r 3,...r D ] Convert each object into a D bit signature u! [h r1 (u),h r2 (u),h r3 (u),...h rd (u)] l We can derive: h sim(u, v) = cos D i hamming(s u, s v ) Thus: similarity boils down to comparison of hamming distances between signatures
38 One-RP Signature Task: discover all pairs with cosine similarity greater than s Algorithm: l Compute D-bit RP signature for every object l Take first bit, bucket objects into two sets l Perform brute force pairwise (hamming distance) comparison in each bucket, retain those below hamming distance threshold Analysis: l Probability we will discover all pairs: cos 1 (s) 1 l Efficiency: 2 N N 2 vs. 2 2 * * Note, this is actually a simplification: see Ture et al. (SIGIR 2011) for details.
39 Two-RP Signature Task: discover all pairs with cosine similarity greater than s Algorithm: l Compute D-bit RP signature for every object l Take first two bits, bucket objects into four sets l Perform brute force pairwise (hamming distance) comparison in each bucket, retain those below hamming distance threshold Analysis: l Probability we will discover all pairs: apple cos 1 2 (s) 1 l Efficiency: N 2 vs. 4 N 4 2
40 k-rp Signature Task: discover all pairs with cosine similarity greater than s Algorithm: l Compute D-bit RP signature for every object l Take first k bits, bucket objects into 2 k sets l Perform brute force pairwise (hamming distance) comparison in each bucket, retain those below hamming distance threshold Analysis: l Probability we will discover all pairs: apple cos 1 k (s) 1 l Efficiency: 2 N N 2 vs. 2 k 2 k
41 m Sets of k-rp Signature Task: discover all pairs with cosine similarity greater than s Algorithm: l Compute D-bit RP signature for every object l Choose m sets of k bits l For each set, use k selected bits to partition objects into 2 k sets l Perform brute force pairwise (hamming distance) comparison in each bucket (of each set), retain those below hamming distance threshold Analysis: l Probability we will discover all pairs: " apple cos 1 (s) l Efficiency: k # m N 2 vs. m 2 k N 2 k 2
42 Theoretical Results Threshold = 0.7 What s interesting here? Recall Precision (theoretical) 1 set 2 sets 3 sets 4 sets 0.7 Precision Recall Number of random projections Number of Random Projections
43 Online Querying Preprocessing: l Compute D-bit RP signature for every object l Choose m sets of k bits and use to bucket l Store signatures in memory (across multiple machines) Querying l Compute D-bit signature of query object, choose m sets of k bits in same way l Perform brute-force scan of correct bucket (in parallel)
44 Additional Issues to Consider Emphasis on recall, not precision Two sources of error: l From LSH l From using hamming distance as proxy for cosine similarity Balance of bucket sizes? Parameter tuning?
45 Sliding Window Algorithm Compute D-bit RP signature for every object For each object, permute bit signature m times For each permutation, sort bit signatures l Apply sliding window of width B over sorted l Compute hamming distances of bit signatures within window
46 MapReduce Implementation Mapper: l Process each individual object in parallel l Load in random vectors as side data l Compute bit signature l Permute m times, for each emit: key = (n, signature), where n = [ 1 m ] value = object id Reduce l Keep FIFO queue of B bit signatures l For each newly-encountered bit signature, compute hamming distance wrt all bit signatures in queue l Add new bit signature to end of queue, displacing oldest
47 Slightly Faster MapReduce Variant In reducer, don t perform hamming distance computations l Write directly to HDFS in large blocks Perform hamming distance computations in second mapperonly pass l Advantages?
48 Case Study: Wikipedia Ture et al., SIGIR 2011 Ture and Lin, NAACL 2012 Source: Wikipedia (Wikipedia)
49 Mining Bitext from Wikipedia Task: extract similar article pairs in different languages l Why is this useful? l Why not just use inter-wiki links? Use LSH! l Why is this non-trivial?
50 German Doc A MT translate English doc vector v A Machine Translation English Doc B doc vector v B Tradeoffs? German Doc A doc vector v A CLIR translate doc vector v A Vector Translation English Doc B doc vector v B tf 0 (e, D) = X f2f df 0 (e) = X f2f p(f e) tf(f,d) p(f e) df(f)
51 Translations are noisy! Question: To what extent are ground truth mutual translation pairs similar?
52 Architecture German articles English articles Preprocess document vectors (English) Signature generation Signatures LSH Similar article pairs
53 Evaluation Setup Collection: 3.44m English m German Wikipedia l Brute force: 5.05 trillion comparisons Task: extract similar documents with cosine similarity > 0.3 Representation: 1000 bit random projections Ground truth: l Sample ~1000 German Wikipedia articles Evaluation metrics: l Time l Relative cost l Recall
54 Results (16 node cluster) tables 200 tables 300 tables 64 million document pairs Time (hours) Window Size (B)
55 Sources of Error Hamming distance computation introduces error! l 1000-bit random projections yields ~0.04 absolute error l Maximum obtainable recall is 0.76 Why not just compute cosine similarity? l 20 times slower than computing hamming distance Let s renormalize with respect to hamming distance upper bound
56 No Free Lunch! 100 Relative Effectivenes (%) 95 95% recall 39% cost 99% recall 70% cost Relative Cost (%) 500 tables
57 No Free Lunch! % recall? Don t use LSH! Relative Effectivenes (%) 95 95% recall 39% cost 99% recall 62% cost 99% recall 70% cost Relative Cost (%) 500 tables 1500 tables
58 Parallel Sentence Extraction Model parallel sentence extraction as a classification problem l Given a candidate pair, classify as parallel or not parallel What s the problem? l 64m document pairs yields 400b raw sentence pairs What s that quote about premature optimization?
59 Two-Step Classification Approach: l Apply a few simple heuristics l Simple classifier: efficiently filters out irrelevant pairs l Complex classifier: effectively classifies remaining pairs
60 Join Algorithm 1 l Mapper processes both English and foreign collection Load all (d e, d f ) similarity pairs in memory as side data When encountering relevant document, emit (d e, d f ) pair as key, list of sentence vectors as value l Reducer: Gather sentence vectors, compute Cartesian product and apply classifier Algorithm 2 l Same idea as first, except with stripes pattern Which is faster?
61 Complete Pipeline source collection F target collection E Preprocess Preprocess doc vectors F doc vectors E Signature Generation Signature Generation signatures E Phase 1 Sliding Window Algorithm cross-lingual document pairs aligned bilingual sentence pairs 2-step Parallel Sentence Classifier candidate sentence pairs Candidate Generation Phase 2
62 Statistical Machine Translation Training Data i saw the small table vi la mesa pequeña Parallel Sentences Word Alignment Phrase Extraction (vi, i saw) (la mesa pequeña, the small table) he sat at the table the service was good Target-Language Text Language Model Translation Model Decoder maria no daba una bofetada a la bruja verde Foreign Input Sentence mary did not slap the green witch English Output Sentence ê 1 I = argmax e 1 I!P(e " I 1 f J 1 )# $ = argmax e 1 I!P(e " I 1 )P( f J 1 e I 1 )# $
63 Evaluation
64 Today s Agenda The problem of similar item detection Distances and representations Minhash Random projections End-to-end Wikipedia application
65 Questions? Source: Wikipedia (Japanese rock garden)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 9: Data Mining (3/4) March 8, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 9: Data Mining (3/4) March 7, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 9: Data Mining (4/4) March 9, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides
More informationNear Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri
Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions
More informationFinding Similar Items:Nearest Neighbor Search
Finding Similar Items:Nearest Neighbor Search Barna Saha February 23, 2016 Finding Similar Items A fundamental data mining task Finding Similar Items A fundamental data mining task May want to find whether
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 45/65 43/63 (Winter 08) Part 3: Analyzing Text (/) January 30, 08 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu /2/8 Jure Leskovec, Stanford CS246: Mining Massive Datasets 2 Task: Given a large number (N in the millions or
More informationTopic: Duplicate Detection and Similarity Computing
Table of Content Topic: Duplicate Detection and Similarity Computing Motivation Shingling for duplicate comparison Minhashing LSH UCSB 290N, 2013 Tao Yang Some of slides are from text book [CMS] and Rajaraman/Ullman
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS46: Mining Massive Datasets Jure Leskovec, Stanford University http://cs46.stanford.edu /7/ Jure Leskovec, Stanford C46: Mining Massive Datasets Many real-world problems Web Search and Text Mining Billions
More informationFinding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing
Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Goals Many Web-mining problems can be expressed as finding similar sets:. Pages with similar words, e.g., for classification
More informationData-Intensive Computing with MapReduce
Data-Intensive Computing with MapReduce Session 5: Graph Processing Jimmy Lin University of Maryland Thursday, February 21, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationCSE 5243 INTRO. TO DATA MINING
CSE 53 INTRO. TO DATA MINING Locality Sensitive Hashing (LSH) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan Parthasarathy @OSU MMDS Secs. 3.-3.. Slides
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 07) Week 4: Analyzing Text (/) January 6, 07 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides
More informationUS Patent 6,658,423. William Pugh
US Patent 6,658,423 William Pugh Detecting duplicate and near - duplicate files Worked on this problem at Google in summer of 2000 I have no information whether this is currently being used I know that
More informationFinding Similar Sets
Finding Similar Sets V. CHRISTOPHIDES vassilis.christophides@inria.fr https://who.rocq.inria.fr/vassilis.christophides/big/ Ecole CentraleSupélec Motivation Many Web-mining problems can be expressed as
More informationUniversity of Maryland. Tuesday, March 2, 2010
Data-Intensive Information Processing Applications Session #5 Graph Algorithms Jimmy Lin University of Maryland Tuesday, March 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 5: Analyzing Graphs (2/2) February 2, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These
More informationProblem 1: Complexity of Update Rules for Logistic Regression
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1
More informationTask Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Task Description: Finding Similar Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 11, 2017 Sham Kakade 2017 1 Document
More informationwith MapReduce The ischool University of Maryland JHU Summer School on Human Language Technology Wednesday, June 17, 2009
Data-Intensive Text Processing with MapReduce Jimmy Lin The ischool University of Maryland JHU Summer School on Human Language Technology Wednesday, June 17, 2009 This work is licensed under a Creative
More informationJordan Boyd-Graber University of Maryland. Thursday, March 3, 2011
Data-Intensive Information Processing Applications! Session #5 Graph Algorithms Jordan Boyd-Graber University of Maryland Thursday, March 3, 2011 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationData-Intensive Computing with MapReduce
Data-Intensive Computing with MapReduce Session 7: Clustering and Classification Jimmy Lin University of Maryland Thursday, March 7, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationCPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017
CPSC 340: Machine Learning and Data Mining Finding Similar Items Fall 2017 Assignment 1 is due tonight. Admin 1 late day to hand in Monday, 2 late days for Wednesday. Assignment 2 will be up soon. Start
More informationShingling Minhashing Locality-Sensitive Hashing. Jeffrey D. Ullman Stanford University
Shingling Minhashing Locality-Sensitive Hashing Jeffrey D. Ullman Stanford University 2 Wednesday, January 13 Computer Forum Career Fair 11am - 4pm Lawn between the Gates and Packard Buildings Policy for
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 5: Analyzing Relational Data (1/3) February 8, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationLocality-Sensitive Hashing
Locality-Sensitive Hashing & Image Similarity Search Andrew Wylie Overview; LSH given a query q (or not), how do we find similar items from a large search set quickly? Can t do all pairwise comparisons;
More informationKnowledge Discovery and Data Mining 1 (VO) ( )
Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 451/651 (Fall 2018) Part 4: Analyzing Graphs (1/2) October 4, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are
More informationMapReduce Locality Sensitive Hashing GPUs. NLP ML Web Andrew Rosenberg
MapReduce Locality Sensitive Hashing GPUs NLP ML Web Andrew Rosenberg Big Data What is Big Data? Data Analysis based on more data than previously considered. Analysis that requires new or different processing
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 1: MapReduce Algorithm Design (4/4) January 16, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationMahout in Action MANNING ROBIN ANIL SEAN OWEN TED DUNNING ELLEN FRIEDMAN. Shelter Island
Mahout in Action SEAN OWEN ROBIN ANIL TED DUNNING ELLEN FRIEDMAN II MANNING Shelter Island contents preface xvii acknowledgments about this book xx xix about multimedia extras xxiii about the cover illustration
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationDatabases 2 (VU) ( / )
Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:
More informationSome Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing
Some Interesting Applications of Theory PageRank Minhashing Locality-Sensitive Hashing 1 PageRank The thing that makes Google work. Intuition: solve the recursive equation: a page is important if important
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (2/2) January 12, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 10: Mutable State (1/2) March 15, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These
More informationAn Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia
An Overview of Search Engine Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia haixu@microsoft.com July 24, 2007 1 Outline History of Search Engine Difference Between Software and
More informationMining di Dati Web. Lezione 3 - Clustering and Classification
Mining di Dati Web Lezione 3 - Clustering and Classification Introduction Clustering and classification are both learning techniques They learn functions describing data Clustering is also known as Unsupervised
More informationTI2736-B Big Data Processing. Claudia Hauff
TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement
More informationCake and Grief Counseling Will be Available: Using Artificial Intelligence for Forensics Without Jeopardizing Humanity.
Cake and Grief Counseling Will be Available: Using Artificial Intelligence for Forensics Without Jeopardizing Humanity Jesse Kornblum Outline Introduction Artificial Intelligence Spam Detection Clustering
More informationScalable Techniques for Similarity Search
San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Fall 2015 Scalable Techniques for Similarity Search Siddartha Reddy Nagireddy San Jose State University
More informationCSE547: Machine Learning for Big Data Spring Problem Set 1. Please read the homework submission policies.
CSE547: Machine Learning for Big Data Spring 2019 Problem Set 1 Please read the homework submission policies. 1 Spark (25 pts) Write a Spark program that implements a simple People You Might Know social
More informationMapReduce Algorithms
Large-scale data processing on the Cloud Lecture 3 MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation Commons Attribution 3.0 License) Outline
More informationLecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!
Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm
More informationB490 Mining the Big Data. 5. Models for Big Data
B490 Mining the Big Data 5. Models for Big Data Qin Zhang 1-1 2-1 MapReduce MapReduce The MapReduce model (Dean & Ghemawat 2004) Input Output Goal Map Shuffle Reduce Standard model in industry for massive
More informationEvaluating Classifiers
Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 10: Mutable State (1/2) March 14, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These
More informationToday s lecture. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications
Today s lecture Introduction to Information Retrieval Web Crawling (Near) duplicate detection CS276 Information Retrieval and Web Search Chris Manning and Pandu Nayak Crawling and Duplicates 2 Sec. 20.2
More informationToday s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications
Today s lecture Introduction to Information Retrieval Web Crawling (Near) duplicate detection CS276 Information Retrieval and Web Search Chris Manning, Pandu Nayak and Prabhakar Raghavan Crawling and Duplicates
More informationGraph Algorithms. Revised based on the slides by Ruoming Kent State
Graph Algorithms Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/
More informationNew Issues in Near-duplicate Detection
New Issues in Near-duplicate Detection Martin Potthast and Benno Stein Bauhaus University Weimar Web Technology and Information Systems Motivation About 30% of the Web is redundant. [Fetterly 03, Broder
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams
More informationGoing nonparametric: Nearest neighbor methods for regression and classification
Going nonparametric: Nearest neighbor methods for regression and classification STAT/CSE 46: Machine Learning Emily Fox University of Washington May 3, 208 Locality sensitive hashing for approximate NN
More informationCPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016
CPSC 340: Machine Learning and Data Mining Non-Parametric Models Fall 2016 Assignment 0: Admin 1 late day to hand it in tonight, 2 late days for Wednesday. Assignment 1 is out: Due Friday of next week.
More informationHomework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please)
Virginia Tech. Computer Science CS 5614 (Big) Data Management Systems Spring 2017, Prakash Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please) Reminders:
More informationGene Clustering & Classification
BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering
More informationMining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams
/9/7 Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them
More informationEpilog: Further Topics
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Epilog: Further Topics Lecture: Prof. Dr. Thomas
More informationOverview of Clustering
based on Loïc Cerfs slides (UFMG) April 2017 UCBL LIRIS DM2L Example of applicative problem Student profiles Given the marks received by students for different courses, how to group the students so that
More informationInformation Retrieval
Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,
More informationLeveraging Set Relations in Exact Set Similarity Join
Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,
More informationChapter 8: Implementation- Clipping and Rasterization
Chapter 8: Implementation- Clipping and Rasterization Clipping Fundamentals Cohen-Sutherland Parametric Polygons Circles and Curves Text Basic Concepts: The purpose of clipping is to remove objects or
More informationMap-Reduce Applications: Counting, Graph Shortest Paths
Map-Reduce Applications: Counting, Graph Shortest Paths Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/
More informationUniversity of Maryland. Tuesday, March 23, 2010
Data-Intensive Information Processing Applications Session #7 MapReduce and databases Jimmy Lin University of Maryland Tuesday, March 23, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationAssociation Pattern Mining. Lijun Zhang
Association Pattern Mining Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction The Frequent Pattern Mining Model Association Rule Generation Framework Frequent Itemset Mining Algorithms
More informationSimilarity Joins in MapReduce
Similarity Joins in MapReduce Benjamin Coors, Kristian Hunt, and Alain Kaeslin KTH Royal Institute of Technology {coors,khunt,kaeslin}@kth.se Abstract. This paper studies how similarity joins can be implemented
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu [Kumar et al. 99] 2/13/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu
More informationStatistical Machine Translation Lecture 3. Word Alignment Models
p. Statistical Machine Translation Lecture 3 Word Alignment Models Stephen Clark based on slides by Philipp Koehn p. Statistical Modeling p Mary did not slap the green witch Maria no daba una bofetada
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationSupervised Learning: Nearest Neighbors
CS 2750: Machine Learning Supervised Learning: Nearest Neighbors Prof. Adriana Kovashka University of Pittsburgh February 1, 2016 Today: Supervised Learning Part I Basic formulation of the simplest classifier:
More informationJames Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!
James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation
More informationQuery Evaluation Strategies
Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Labs (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research
More informationVALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER
VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER CS6007-INFORMATION RETRIEVAL Regulation 2013 Academic Year 2018
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining Fundamentals of learning (continued) and the k-nearest neighbours classifier Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart.
More informationAdvanced Database Systems
Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed
More informationInstance-Based Learning: Nearest neighbor and kernel regression and classificiation
Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Emily Fox University of Washington February 3, 2017 Simplest approach: Nearest neighbor regression 1 Fit locally to each
More informationMachine Learning for NLP
Machine Learning for NLP Readings in unsupervised Learning Aurélie Herbelot 2018 Centre for Mind/Brain Sciences University of Trento 1 Hashing 2 Hashing: definition Hashing is the process of converting
More informationMapReduce Design Patterns
MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together
More informationGoing nonparametric: Nearest neighbor methods for regression and classification
Going nonparametric: Nearest neighbor methods for regression and classification STAT/CSE 46: Machine Learning Emily Fox University of Washington May 8, 28 Locality sensitive hashing for approximate NN
More informationInstance-Based Learning: Nearest neighbor and kernel regression and classificiation
Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Emily Fox University of Washington February 3, 2017 Simplest approach: Nearest neighbor regression 1 Fit locally to each
More informationPredicting Popular Xbox games based on Search Queries of Users
1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which
More informationInformation Integration of Partially Labeled Data
Information Integration of Partially Labeled Data Steffen Rendle and Lars Schmidt-Thieme Information Systems and Machine Learning Lab, University of Hildesheim srendle@ismll.uni-hildesheim.de, schmidt-thieme@ismll.uni-hildesheim.de
More informationLocality- Sensitive Hashing Random Projections for NN Search
Case Study 2: Document Retrieval Locality- Sensitive Hashing Random Projections for NN Search Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 18, 2017 Sham Kakade
More informationMapReduce. Stony Brook University CSE545, Fall 2016
MapReduce Stony Brook University CSE545, Fall 2016 Classical Data Mining CPU Memory Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining
More informationAlgorithms for Nearest Neighbors
Algorithms for Nearest Neighbors Classic Ideas, New Ideas Yury Lifshits Steklov Institute of Mathematics at St.Petersburg http://logic.pdmi.ras.ru/~yura University of Toronto, July 2007 1 / 39 Outline
More informationInformation Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.
Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.
More informationChapter 2 Basic Structure of High-Dimensional Spaces
Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 10: Mutable State (2/2) March 16, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These
More informationMap-Reduce Applications: Counting, Graph Shortest Paths
Map-Reduce Applications: Counting, Graph Shortest Paths Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/
More informationMachine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016
Machine Learning 10-701, Fall 2016 Nonparametric methods for Classification Eric Xing Lecture 2, September 12, 2016 Reading: 1 Classification Representing data: Hypothesis (classifier) 2 Clustering 3 Supervised
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation
More informationInformation Retrieval. Lecture 9 - Web search basics
Information Retrieval Lecture 9 - Web search basics Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Up to now: techniques for general
More informationTag-based Social Interest Discovery
Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture
More informationCS290N Summary Tao Yang
CS290N Summary 2015 Tao Yang Text books [CMS] Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, Publisher: Addison-Wesley, 2010. Book website. [MRS] Christopher
More informationBirkbeck (University of London)
Birkbeck (University of London) MSc Examination for Internal Students Department of Computer Science and Information Systems Information Retrieval and Organisation (COIY64H7) Credit Value: 5 Date of Examination:
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationVector Space Models: Theory and Applications
Vector Space Models: Theory and Applications Alexander Panchenko Centre de traitement automatique du langage (CENTAL) Université catholique de Louvain FLTR 2620 Introduction au traitement automatique du
More informationThanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman
Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman http://www.mmds.org Overview of Recommender Systems Content-based Systems Collaborative Filtering J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive
More information