Data-Intensive Computing with MapReduce

Size: px

Start display at page:

Download "Data-Intensive Computing with MapReduce"

Janel Poole
5 years ago
Views:

Data-Intensive Computing with MapReduce Session 6: Similar Item Detection Jimmy Lin University of Maryland Thursday, February 28, 2013 This work is

1 Data-Intensive Computing with MapReduce Session 6: Similar Item Detection Jimmy Lin University of Maryland Thursday, February 28, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See for details

2 Developing big data intuitions Source: Wikipedia (Mahout)

3 Today s Agenda The problem of similar item detection Distances and representations Minhash Random projections End-to-end Wikipedia application

4 What s the Problem? Finding similar items with respect to some distance metric Two variants of the problem: l Offline: extract all similar pairs of objects from a large collection l Online: is this object similar to something I ve seen before? Similarities/differences with the clustering problem

5 Applications Near-duplicate detection of webpages l Offline vs. online variant of problem Recommender systems Forensic text analysis (e.g., plagiarism detection)

6 Near-Duplicate Detection of Webpages What s the source of the problem? l Mirror pages (legit) l Spam farms (non-legit) l Additional complications (e.g., nav bars) Naïve algorithm: l Compute cryptographic hash for webpage (e.g., MD5) l Insert hash values into a big hash table l Compute hash for new webpage: collision implies duplicate What s the issue? Intuition: l Hash function needs to be tolerant of minor differences l High similarity implies higher probability of hash collision

7 Literature Note Many communities have tackled similar problems: l Theoretical computer science l Information retrieval l Data mining l Databases l Issues l Slightly different terminology l Results not easy to compare

8 Four Steps Specify distance metric l Jaccard, Euclidean, cosine, etc. Compute representation l Shingling, tf.idf, etc. Project l Minhash, random projections, etc. Extract l Bucketing, sliding windows, etc.

9 Distances Source:

10 Distance Metrics 1. Non-negativity: 2. Identity: d(x, y) 0 3. Symmetry: d(x, y) = 0 () x=y d(x, y) = d(y, x) 4. Triangle Inequality d(x, y) apple d(x, z) + d(z, y)

11 Distance: Jaccard Given two sets A, B Jaccard similarity: J(A, B) = A \ B A [ B d(a, B) =1 J(A, B)

12 Distance: Norms Given: x=[x 1,x 2,...x n ] y=[y 1,y 2,...y n ] Euclidean distance (L 2 -norm) d(x, y) = Manhattan distance (L 1 -norm) d(x, y) = L r -norm d(x, y) = v ux t n (x i y i ) 2 i=0 nx x i y i i=0 " X n # 1/r x i y i r i=0

13 Distance: Cosine Given: x=[x 1,x 2,...x n ] y=[y 1,y 2,...y n ] Idea: measure distance between the vectors Thus: cos = x y x y sim(x, y) = pp n i=0 x2 i d(x, y) = 1 sim(x, y) P n i=0 x iy i pp n i=0 y2 i

14 Distance: Hamming Given two bit vectors Hamming distance: number of elements which differ

15 Representations

16 Representations: Text Unigrams (i.e., words) Shingles = n-grams l At the word level l At the character level Feature weights l boolean l tf.idf l BM25 l Which representation to use? What s the actual text? Use the hashing trick!

17 Representations: Beyond Text For recommender systems: l Items as features for users l Users as features for items For graphs: l Adjacency lists as features for vertices With log data: l Behaviors (clicks) as features

18 Minhash Source:

19 Minhash Seminal algorithm for near-duplicate detection of webpages l Used by AltaVista l For details see Broder et al. (1997) Setup: l Documents (HTML pages) represented by shingles (n-grams) l Jaccard similarity: dups are pairs with high similarity

20 Preliminaries: Representation Sets: l A = {e 1, e 3, e 7 } l B = {e 3, e 5, e 7 } Can be equivalently expressed as matrices: Element A B e e e e e e e 7 1 1

21 Preliminaries: Jaccard Element A B e e e e e e e Let: M 00 = # rows where both elements are 0 M 11 = # rows where both elements are 1 M 01 = # rows where A=0, B=1 M 10 = # rows where A=1, B=0 J(A, B) = M 11 M 01 + M 10 + M 11

22 Minhash Computing minhash l Start with the matrix representation of the set l Randomly permute the rows of the matrix l minhash is the first row with a one Example: Element A B e e e e e e e h(a) = e 3 h(b) = e 5 Element A B e e e e e e e 1 1 0

23 Minhash and Jaccard Element A B e e e e e e e M 00 M 00 M 01 M 11 M 11 M 00 M 10 P [h(a) =h(b)] = J(A, B) M 11 M 01 + M 10 + M 11 M 11 M 01 + M 10 + M 11

24 To Permute or Not to Permute? Permutations are expensive Interpret the hash value as the permutation Only need to keep track of the minimum hash value l Can keep track of multiple minhash values at once

25 Extracting Similar Pairs (LSH) We know: P [h(a) =h(b)] = J(A, B) Task: discover all pairs with similarity greater than s Algorithm: l For each object, compute its minhash value l Group objects by their hash values l Output all pairs within each group Analysis: l Probability we will discovered all pairs is s l Probability that any pair is invalid is (1 s) What s the fundamental issue?

26 Two Minhash Signatures Task: discover all pairs with similarity greater than s Algorithm: l For each object, compute two minhash values and concatenate together into a signature l Group objects by their signatures l Output all pairs within each group Analysis: l Probability we will discovered all pairs is s 2 l Probability that any pair is invalid is (1 s) 2

27 k Minhash Signatures Task: discover all pairs with similarity greater than s Algorithm: l For each object, compute k minhash values and concatenate together into a signature l Group objects by their signatures l Output all pairs within each group Analysis: l Probability we will discovered all pairs is s k l Probability that any pair is invalid is (1 s) k What s the issue now?

28 n different k Minhash Signatures Task: discover all pairs with similarity greater than s Algorithm: l For each object, compute n sets k minhash values l For each set, concatenate k minhash values together l Within each set: Group objects by their signatures Output all pairs within each group l De-dup pairs Analysis: l Probability we will miss a pair is (1 s k ) n l Probability that any pair is invalid is n(1 s) k

29 Practical Notes In some cases, checking all candidate pairs may be possible l Time cost is small relative to everything else l Easy method to discard false positives Most common practical implementation: l Generate M minhash values, randomly select k of them n times l Reduces amount of hash computations needed Determining authoritative version is non-trivial

30 MapReduce Implementation Map over objects: l Generate M minhash values, randomly select k of them n times l Each draw yields a signature: emit as intermediate key, value is object id Shuffle/sort: Reduce: l Receive all object ids with same signature, generate candidate pairs l Okay to buffer in memory? Second pass to de-dup

31 Offline Extraction vs. Online Querying Batch formulation of the problem: l Discover all pairs with similarity greater than s l Useful for post-hoc batch processing of web crawl Online formulation of the problem: l Given new webpage, is it similar to one I ve seen before? l Useful for incremental web crawl processing

32 Online Similarity Querying Preparing the existing collection: l For each object, compute n sets of k minhash values l For each set, concatenate k minhash values together l Keep each signature in hash table (in memory) l Note: can parallelize across multiple machines Querying and updating: l For new webpage, compute signatures and check for collisions l Collisions imply duplicate (determine which version to keep) l Update hash tables

33 Random Projections Source:

34 Limitations of Minhash Minhash is great for near-duplicate detection l Set high threshold for Jaccard similarity Limitations: l Jaccard similarity only l Set-based representation, no way to assign weights to features Random projections: l Works with arbitrary vectors using cosine similarity l Same basic idea, but details differ l Slower but more accurate: no free lunch!

35 Random Projection Hashing Generate a random vector r of unit length l Draw from univariate Gaussian for each component l Normalize length Define: h r (u) = 1 if r u 0 0 if r u < 0 Physical intuition?

36 RP Hash Collisions It can be shown that: P [h r (u) = h r (v)] = 1 (u, v) l Proof in (Goemans and Williamson, 1995) Physical intuition?

37 Random Projection Signature Given D random vectors: [r 1, r 2, r 3,...r D ] Convert each object into a D bit signature u! [h r1 (u),h r2 (u),h r3 (u),...h rd (u)] l We can derive: h sim(u, v) = cos D i hamming(s u, s v ) Thus: similarity boils down to comparison of hamming distances between signatures

38 One-RP Signature Task: discover all pairs with cosine similarity greater than s Algorithm: l Compute D-bit RP signature for every object l Take first bit, bucket objects into two sets l Perform brute force pairwise (hamming distance) comparison in each bucket, retain those below hamming distance threshold Analysis: l Probability we will discover all pairs: cos 1 (s) 1 l Efficiency: 2 N N 2 vs. 2 2 * * Note, this is actually a simplification: see Ture et al. (SIGIR 2011) for details.

39 Two-RP Signature Task: discover all pairs with cosine similarity greater than s Algorithm: l Compute D-bit RP signature for every object l Take first two bits, bucket objects into four sets l Perform brute force pairwise (hamming distance) comparison in each bucket, retain those below hamming distance threshold Analysis: l Probability we will discover all pairs: apple cos 1 2 (s) 1 l Efficiency: N 2 vs. 4 N 4 2

40 k-rp Signature Task: discover all pairs with cosine similarity greater than s Algorithm: l Compute D-bit RP signature for every object l Take first k bits, bucket objects into 2 k sets l Perform brute force pairwise (hamming distance) comparison in each bucket, retain those below hamming distance threshold Analysis: l Probability we will discover all pairs: apple cos 1 k (s) 1 l Efficiency: 2 N N 2 vs. 2 k 2 k

41 m Sets of k-rp Signature Task: discover all pairs with cosine similarity greater than s Algorithm: l Compute D-bit RP signature for every object l Choose m sets of k bits l For each set, use k selected bits to partition objects into 2 k sets l Perform brute force pairwise (hamming distance) comparison in each bucket (of each set), retain those below hamming distance threshold Analysis: l Probability we will discover all pairs: " apple cos 1 (s) l Efficiency: k # m N 2 vs. m 2 k N 2 k 2

42 Theoretical Results Threshold = 0.7 What s interesting here? Recall Precision (theoretical) 1 set 2 sets 3 sets 4 sets 0.7 Precision Recall Number of random projections Number of Random Projections

43 Online Querying Preprocessing: l Compute D-bit RP signature for every object l Choose m sets of k bits and use to bucket l Store signatures in memory (across multiple machines) Querying l Compute D-bit signature of query object, choose m sets of k bits in same way l Perform brute-force scan of correct bucket (in parallel)

44 Additional Issues to Consider Emphasis on recall, not precision Two sources of error: l From LSH l From using hamming distance as proxy for cosine similarity Balance of bucket sizes? Parameter tuning?

45 Sliding Window Algorithm Compute D-bit RP signature for every object For each object, permute bit signature m times For each permutation, sort bit signatures l Apply sliding window of width B over sorted l Compute hamming distances of bit signatures within window

46 MapReduce Implementation Mapper: l Process each individual object in parallel l Load in random vectors as side data l Compute bit signature l Permute m times, for each emit: key = (n, signature), where n = [ 1 m ] value = object id Reduce l Keep FIFO queue of B bit signatures l For each newly-encountered bit signature, compute hamming distance wrt all bit signatures in queue l Add new bit signature to end of queue, displacing oldest

47 Slightly Faster MapReduce Variant In reducer, don t perform hamming distance computations l Write directly to HDFS in large blocks Perform hamming distance computations in second mapperonly pass l Advantages?

48 Case Study: Wikipedia Ture et al., SIGIR 2011 Ture and Lin, NAACL 2012 Source: Wikipedia (Wikipedia)

49 Mining Bitext from Wikipedia Task: extract similar article pairs in different languages l Why is this useful? l Why not just use inter-wiki links? Use LSH! l Why is this non-trivial?

50 German Doc A MT translate English doc vector v A Machine Translation English Doc B doc vector v B Tradeoffs? German Doc A doc vector v A CLIR translate doc vector v A Vector Translation English Doc B doc vector v B tf 0 (e, D) = X f2f df 0 (e) = X f2f p(f e) tf(f,d) p(f e) df(f)

51 Translations are noisy! Question: To what extent are ground truth mutual translation pairs similar?

52 Architecture German articles English articles Preprocess document vectors (English) Signature generation Signatures LSH Similar article pairs

53 Evaluation Setup Collection: 3.44m English m German Wikipedia l Brute force: 5.05 trillion comparisons Task: extract similar documents with cosine similarity > 0.3 Representation: 1000 bit random projections Ground truth: l Sample ~1000 German Wikipedia articles Evaluation metrics: l Time l Relative cost l Recall

54 Results (16 node cluster) tables 200 tables 300 tables 64 million document pairs Time (hours) Window Size (B)

55 Sources of Error Hamming distance computation introduces error! l 1000-bit random projections yields ~0.04 absolute error l Maximum obtainable recall is 0.76 Why not just compute cosine similarity? l 20 times slower than computing hamming distance Let s renormalize with respect to hamming distance upper bound

56 No Free Lunch! 100 Relative Effectivenes (%) 95 95% recall 39% cost 99% recall 70% cost Relative Cost (%) 500 tables

No Free Lunch! 100 100% recall? Don t use LSH!

57 No Free Lunch! % recall? Don t use LSH! Relative Effectivenes (%) 95 95% recall 39% cost 99% recall 62% cost 99% recall 70% cost Relative Cost (%) 500 tables 1500 tables

58 Parallel Sentence Extraction Model parallel sentence extraction as a classification problem l Given a candidate pair, classify as parallel or not parallel What s the problem? l 64m document pairs yields 400b raw sentence pairs What s that quote about premature optimization?

59 Two-Step Classification Approach: l Apply a few simple heuristics l Simple classifier: efficiently filters out irrelevant pairs l Complex classifier: effectively classifies remaining pairs

60 Join Algorithm 1 l Mapper processes both English and foreign collection Load all (d e, d f ) similarity pairs in memory as side data When encountering relevant document, emit (d e, d f ) pair as key, list of sentence vectors as value l Reducer: Gather sentence vectors, compute Cartesian product and apply classifier Algorithm 2 l Same idea as first, except with stripes pattern Which is faster?

61 Complete Pipeline source collection F target collection E Preprocess Preprocess doc vectors F doc vectors E Signature Generation Signature Generation signatures E Phase 1 Sliding Window Algorithm cross-lingual document pairs aligned bilingual sentence pairs 2-step Parallel Sentence Classifier candidate sentence pairs Candidate Generation Phase 2

Statistical Machine Translation Training Data i saw the small table vi la mesa pequeña Parallel Sentences Word Alignment Phrase Extraction (vi, i saw) (la mesa pequeña, the small table) he sat at the

62 Statistical Machine Translation Training Data i saw the small table vi la mesa pequeña Parallel Sentences Word Alignment Phrase Extraction (vi, i saw) (la mesa pequeña, the small table) he sat at the table the service was good Target-Language Text Language Model Translation Model Decoder maria no daba una bofetada a la bruja verde Foreign Input Sentence mary did not slap the green witch English Output Sentence ê 1 I = argmax e 1 I!P(e " I 1 f J 1 )# $ = argmax e 1 I!P(e " I 1 )P( f J 1 e I 1 )# $

63 Evaluation

64 Today s Agenda The problem of similar item detection Distances and representations Minhash Random projections End-to-end Wikipedia application

65 Questions? Source: Wikipedia (Japanese rock garden)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 9: Data Mining (3/4) March 8, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides