Topic: Duplicate Detection and Similarity Computing

Similar documents
Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CSE 5243 INTRO. TO DATA MINING

Today's lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Today's lecture. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Shingling Minhashing Locality-Sensitive Hashing. Jeffrey D. Ullman Stanford University

Finding Similar Sets

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

Finding Similar Items:Nearest Neighbor Search

An introduction to Web Mining part II

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Search Engines. Information Retrieval in Practice

Information Retrieval. Lecture 9 - Web search basics

Data-Intensive Computing with MapReduce

US Patent 6,658,423. William Pugh

Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16th, 9:30am in class, hard-copy please)

MapReduce Locality Sensitive Hashing GPUs. NLP ML Web Andrew Rosenberg

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

CS290N Summary Tao Yang

Crawling. CS6200: Information Retrieval. Slides by: Jesse Anderton

New Issues in Near-duplicate Detection

Scalable Techniques for Similarity Search

CSE547: Machine Learning for Big Data Spring Problem Set 1. Please read the homework submission policies.

Introduction to Programming in C Department of Computer Science and Engineering. Lecture No. #16 Loops: Matrix Using Nested for Loop

Big Data Analytics CSCI 4030

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Finding dense clusters in web graph

An efficient approach in Near Duplicate Document Detection using Shingling-MD5

Min-Hashing and Geometric min-hashing

Hashing and sketching

Information Retrieval and Web Search

Compressing Social Networks

CSE 332 Spring 2013: Midterm Exam (closed book, closed notes, no calculators)

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage

CS5112: Algorithms and Data Structures for Applications

Methods for High Degrees of Similarity. Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length

Text Technologies for Data Science INFR11145 Web Search Walid Magdy Lecture Objectives

CSE 562 Database Systems

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze's, linked from

Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU

Improvements to A-Priori. Bloom Filters Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results

High dim. data. Graph data. Infinite data. Machine learning. Apps. Locality sensitive hashing. Filtering data streams.

Unit VIII. Chapter 9. Link Analysis

Mining Web Data. Lijun Zhang

Web Characteristics CE-324: Modern Information Retrieval Sharif University of Technology

Jeffrey D. Ullman Stanford University

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URLs. Connectivity Server: Node table

2.3 Algorithms Using Map-Reduce

CS 572: Information Retrieval

Contents. CS 124 Final Exam Practice Problem 5/6/17. 1 Format and Logistics 2

Image Analysis & Retrieval. CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W Lec 18.

Finding Similar Items and Locality Sensitive Hashing. STA 325, Supplemental Material

Crawling the Web. Web Crawling. Main Issues I. Type of crawl

Achieving both High Precision and High Recall in Near-duplicate Detection

Roadmap. PCY Algorithm

9.5 Equivalence Relations

Locality-Sensitive Hashing

CS143: Index. Book Chapters: (4th) , (5th) , , 12.10

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

Introduction to Data Mining

Topics to Learn. Important concepts. Tree-based index. Hash-based index

Error-Correcting Codes

Chapter 12: Indexing and Hashing. Basic Concepts

Methods for High Degrees of Similarity. Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length

Practice Exam Sample Solutions

Hash-Based Improvements to A-Priori. Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms

MinHash Sketches: A Brief Survey. June 14, 2016

CS 231: Algorithmic Problem Solving

Access Methods. Basic Concepts. Index Evaluation Metrics. search key pointer. record. value. Value

Chapter 12: Indexing and Hashing

Introduction to Data Mining

Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras

CSE 332 Autumn 2013: Midterm Exam (closed book, closed notes, no calculators)

CS 245: Database System Principles

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

Advanced Computer Networks. Rab Nawaz Jadoon DCS. Assistant Professor COMSATS University, Lahore Pakistan. Department of Computer Science

CSC 505, Spring 2005 Week 6 Lectures page 1 of 9

Boolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok

MITOCW watch?v=kz7jjltq9r4

Exam Review Session. William Cohen

Antisymmetric Relations. Definition A relation R on A is said to be antisymmetric

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

CHAPTER 8. Copyright Cengage Learning. All rights reserved.

ENGR 1181 MATLAB 09: For Loops 2

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Indexing and Searching

CS 525: Advanced Database Organization 04: Indexing

Fundamentals of Database Systems Prof. Arnab Bhattacharya Department of Computer Science and Engineering Indian Institute of Technology, Kanpur

Adaptive Near-Duplicate Detection via Similarity Learning

CSE 373 Autumn 2012: Midterm #2 (closed book, closed notes, NO calculators allowed)

CSE 5243 INTRO. TO DATA MINING

An Efficient Recommender System Using Locality Sensitive Hashing

Chapter 3. Set Theory. 3.1 What is a Set?

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Transcription:

Duplicate Detection and Similarity Computing
UCSB 290N, 2013. Tao Yang. Some of the slides are from the textbook [CMS] and from Rajaraman/Ullman.

Table of Contents: Motivation; Shingling for duplicate comparison; Minhashing; LSH.

Applications of Duplicate Detection and Similarity Computing
Duplicate and near-duplicate documents occur in many situations: copies, versions, plagiarism, spam, mirror sites. Over 30% of the web pages in a large crawl are exact or near duplicates of pages in the other 70%. Duplicates consume significant resources during crawling, indexing, and search, yet have little value to most users. Related applications include similar query suggestions and advertising (coalition and spam detection).

Duplicate Detection
Exact duplicate detection is relatively easy, using content fingerprints: MD5, cyclic redundancy checks (CRC), or other checksum techniques. A checksum is a value computed from the content of the document, e.g., the sum of the bytes in the document file. It is possible for files with different text to have the same checksum.
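As a rough illustration, here is a minimal Python sketch contrasting an MD5 content fingerprint with the naive byte-sum checksum mentioned above; the function names and toy documents are my own and are not from the slides.

```python
import hashlib

def md5_fingerprint(text: str) -> str:
    # Content fingerprint: identical text always yields an identical digest.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def byte_sum_checksum(text: str) -> int:
    # Naive checksum from the slide: the sum of the bytes in the document.
    return sum(text.encode("utf-8")) % (1 << 32)

d1 = "a rose is a rose is a rose"
d2 = "a rose is a rose is a rose"   # exact copy of d1
d3 = "rose a is a rose is a rose"   # same bytes in a different order

print(md5_fingerprint(d1) == md5_fingerprint(d2))      # True: exact duplicate detected
print(md5_fingerprint(d1) == md5_fingerprint(d3))      # False: different text
print(byte_sum_checksum(d1) == byte_sum_checksum(d3))  # True: checksum collision
```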

Near-Duplicate Detection
A more challenging task: are web pages with the same text content but different advertising or formatting near-duplicates? Near-duplication means an approximate match: compute syntactic similarity with an edit-distance measure and apply a similarity threshold to detect near-duplicates, e.g., similarity > 80% means the documents are near-duplicates. Note that similarity is not transitive, though it is sometimes used transitively. (Example: near-duplicate news articles; see also SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections.)

Two Techniques for Computing Similarity
Search: find near-duplicates of a document D; this requires O(N) comparisons. Discovery: find all pairs of near-duplicate documents in the collection; this requires O(N^2) comparisons. IR techniques are effective for the search scenario. For discovery, where all-pair comparison is too expensive, other techniques are used to generate compact representations:
1. Shingling: convert documents, emails, etc., into sets of fingerprints (the set of strings of length k that appear in the document).
2. Minhashing: convert large sets into short signatures (short integer vectors) that represent the sets and reflect their similarity.

Computing Similarity with Shingles
Shingles are word k-grams [Brin95, Brod98]. For example, "a rose is a rose is a rose" yields the k = 4 shingles a_rose_is_a, rose_is_a_rose, is_a_rose_is. The similarity measure between two documents, viewed as sets of shingles, is the size of the intersection divided by the size of the union: the Jaccard measure. (Slide figures: fingerprint generation process and fingerprint example for web documents.)

Jaccard Similarity
The Jaccard similarity of two sets is the size of their intersection divided by the size of their union: Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|. Example: 3 elements in the intersection and 8 in the union gives a Jaccard similarity of 3/8.
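A minimal sketch, in Python, of the word k-gram shingling and Jaccard measure just described; the helper names (shingles, jaccard) and the second toy document are illustrative choices, not from the slides.

```python
def shingles(text: str, k: int = 4) -> set:
    # Word k-grams: join each run of k consecutive words with underscores.
    words = text.split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    # Size of intersection divided by size of union.
    return len(a & b) / len(a | b) if (a or b) else 0.0

d1 = "a rose is a rose is a rose"
d2 = "a rose is a flower which is a rose"

s1, s2 = shingles(d1), shingles(d2)
print(s1)               # {'a_rose_is_a', 'rose_is_a_rose', 'is_a_rose_is'}
print(jaccard(s1, s2))  # 0.125: the two documents share 1 of 8 distinct shingles
```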

Approximate Representation with Sketching
Computing the exact set intersection of shingles between all pairs of documents is expensive, so we approximate using a subset of shingles, called sketch vectors, created with minhashing. For a document d, sketch_d[i] is computed as follows: let f map all shingles in the universe to 0..2^m, let p_i be a specific random permutation on 0..2^m, and take the minimum of p_i(f(s)) over all shingles s in document d. (Slide figure: start with 64-bit shingles, permute them on the number line with p_i, and pick the minimum value.) Documents whose sketch vectors agree in more than a threshold t (say 80%) of their elements are considered similar; for each i, test whether Doc1.Sketch[i] = Doc2.Sketch[i].

Shingling with Minhashing
Given two documents d1 and d2, let S1 and S2 be their shingle sets. Resemblance = |S1 ∩ S2| / |S1 ∪ S2|. For a random permutation p, let Alpha = min(p(S1)) and Beta = min(p(S2)). Then Probability(Alpha = Beta) = Resemblance. We estimate this probability by sampling, e.g., testing whether the minima are equal for 200 random permutations p_1, p_2, ..., p_200.
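The sketch below approximates the random permutations p_i with random linear hash functions h(x) = (a*x + b) mod P, a common practical substitute that the slides do not spell out; the prime P, the function names, and the use of Python's built-in hash as the shingle fingerprint f are assumptions made for illustration.

```python
import random

PRIME = (1 << 61) - 1  # a large prime for the hash functions (an assumed choice)

def make_minhash_funcs(num_hashes: int, seed: int = 0):
    # Each (a, b) pair defines h(x) = (a*x + b) % PRIME, standing in for
    # one of the random permutations p_i described on the slides.
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def sketch(shingle_set: set, funcs) -> list:
    # sketch[i] = min over all shingles s of h_i(f(s)), where f fingerprints
    # the shingle (Python's hash here; the slides use 64-bit shingles).
    ids = [hash(s) & 0xFFFFFFFFFFFFFFFF for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in funcs]

def estimated_similarity(sk1: list, sk2: list) -> float:
    # The fraction of agreeing sketch positions estimates the Jaccard similarity.
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

funcs = make_minhash_funcs(200)   # e.g., 200 "permutations", as on the slide
s1 = {"a_rose_is_a", "rose_is_a_rose", "is_a_rose_is"}
s2 = {"a_rose_is_a", "rose_is_a_rose", "rose_is_a_daisy"}
print(estimated_similarity(sketch(s1, funcs), sketch(s2, funcs)))  # close to 2/4 = 0.5
```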

Proof with Boolean Matrices
Rows are the elements of the universal set; columns are the sets. There is a 1 in row e and column S if and only if e is a member of S. Column similarity is the Jaccard similarity of the sets of rows in which the columns have a 1: sim_J(C_i, C_j) = |C_i ∩ C_j| / |C_i ∪ C_j|. A typical matrix is sparse. Example:

C1 C2
 0  1
 1  0
 1  1
 0  0
 1  1
 0  1

Sim(C1, C2) = 2/5 = 0.4 (2 rows in the intersection, 5 in the union).

Key Observation
For columns C_i and C_j there are four types of rows:
Type A: C_i = 1, C_j = 1
Type B: C_i = 1, C_j = 0
Type C: C_i = 0, C_j = 1
Type D: C_i = 0, C_j = 0
Overloading notation, let A be the number of rows of type A (and similarly B, C, D). Claim: sim_J(C_i, C_j) = A / (A + B + C).

Minhashing
Imagine the rows permuted randomly. Define the hash function h(C) as the index of the first row (in the permuted order) with a 1 in column C. Use several (e.g., 100) independent hash functions to create a signature. The similarity of two signatures is the fraction of the hash functions in which they agree.

Property: the probability (over all permutations of the rows) that h(C_1) = h(C_2) is the same as Sim(C_1, C_2), i.e., P[h(C_i) = h(C_j)] = sim_J(C_i, C_j). Both are A / (A + B + C). Why? Look down the permuted columns C_1 and C_2 until we see a 1. If it is a type-A row, then h(C_1) = h(C_2); if it is a type-B or type-C row, then not.
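A small simulation, using the six-row Boolean matrix above, that checks this property empirically: over many random row permutations, the fraction of permutations with h(C1) = h(C2) should approach Sim(C1, C2) = 0.4.

```python
import random

# Columns C1, C2 of the Boolean matrix from the slide (rows = universe elements).
C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]

def minhash(column, perm):
    # Position (in the permuted order) of the first row with a 1 in this column.
    return min(perm[r] for r, bit in enumerate(column) if bit)

rng = random.Random(42)
trials = 100_000
agree = 0
for _ in range(trials):
    perm = list(range(len(C1)))
    rng.shuffle(perm)               # one random permutation of the rows
    agree += minhash(C1, perm) == minhash(C2, perm)

print(agree / trials)               # close to 0.4 = Sim(C1, C2) = 2/5
```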

All-Pair Comparison Is Expensive
We want to compare objects and find those pairs that are sufficiently similar, but comparing the signatures of all pairs of objects is quadratic in the number of objects. Example: 10^6 objects imply about 5*10^11 comparisons; at 1 microsecond per comparison, that is about 6 days.

Locality-Sensitive Hashing: The Big Picture
The pipeline: a document is converted into the set of strings of length k that appear in it (shingling); minhashing converts these large sets into signatures, short integer vectors that represent the sets and reflect their similarity; locality-sensitive hashing then produces candidate pairs, those pairs of signatures that we need to test for similarity. The general idea is to use a function f(x, y) that tells whether or not x and y form a candidate pair, i.e., a pair of elements whose similarity must be evaluated: map each document to many buckets, and make the elements of the same bucket candidate pairs.
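A quick back-of-the-envelope check of those numbers:

```python
n = 10**6
pairs = n * (n - 1) // 2        # all-pairs comparisons: about 5 * 10^11
seconds = pairs * 1e-6          # at 1 microsecond per comparison
print(pairs, seconds / 86_400)  # 499999500000 comparisons, roughly 5.8 days
```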

Another View of LSH: Produce Signatures with Bands
Divide each short signature into b bands with r rows per band, and check signature agreement of each pair band by band: do the bands agree, i.e., are they mapped into the same bucket? (Slide figure: the signature matrix M partitioned into b bands of r rows, with documents hashed into buckets per band; docs 2 and 6, which land in the same bucket, are probably identical, while docs 6 and 7 are surely different.)

Signature Generation and Bucket Comparison
Create b bands for each document, using r minhash values (r rows) per band. If the signatures of documents X and Y agree in the same band, they become a candidate pair. Tune b and r to catch most similar pairs but few non-similar pairs.
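A compact sketch of the banding step, assuming each signature's length is exactly b*r; the bucket structure (a dictionary keyed by the tuple of band values) and the toy signatures are illustrative choices rather than the slides' implementation.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures: dict, b: int, r: int) -> set:
    # signatures: doc_id -> minhash signature of length b * r.
    # Each band of r rows is hashed to a bucket; docs sharing a bucket in
    # any band become candidate pairs.
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            band_values = tuple(sig[band * r:(band + 1) * r])
            buckets[band_values].append(doc_id)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates

# Toy example: hand-made 4-value signatures, b = 2 bands of r = 2 rows.
sigs = {
    "doc1": [3, 7, 9, 2],
    "doc2": [3, 7, 5, 1],   # agrees with doc1 in band 0 -> candidate pair
    "doc3": [8, 0, 5, 1],   # agrees with doc2 in band 1 -> candidate pair
}
print(lsh_candidate_pairs(sigs, b=2, r=2))  # {('doc1', 'doc2'), ('doc2', 'doc3')}
```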

Analysis of LSH
Let s be the probability that the minhash signatures of C1 and C2 agree in one row (i.e., their similarity). Then:
Probability that C1, C2 are identical in one band: s^r.
Probability that C1, C2 disagree in at least one row of a band: 1 - s^r.
Probability that C1, C2 disagree in every one of the b bands: (1 - s^r)^b. This is the false-negative probability.
Probability that C1, C2 agree in at least one band: 1 - (1 - s^r)^b. This is the probability that we find such a pair.

Example: suppose C1 and C2 are 80% similar, and choose 20 bands of 5 integers per band. The probability that C1 and C2 are identical in one particular band is (0.8)^5 = 0.328. The probability that they agree in none of the 20 bands is (1 - 0.328)^20 = 0.00035, i.e., about 1/3000 of the 80%-similar column pairs are false negatives.

What We Want vs. What One Band Gives You
Ideally, the probability of sharing a bucket would be 1 whenever the similarity s of two docs exceeds a threshold t, with no chance of sharing a bucket when s < t. Remember: the probability of equal hash values equals the similarity, so a single hash comparison yields a diagonal line rather than this step function. (Slide figures: probability of sharing a bucket vs. similarity s of two docs, with threshold t.)
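Reproducing the slide's arithmetic in a few lines of Python:

```python
# Worked example from the slide: similarity s = 0.8, r = 5 rows per band, b = 20 bands.
s, r, b = 0.8, 5, 20
per_band = s ** r               # probability the pair is identical in one particular band
miss_all = (1 - per_band) ** b  # probability that no band agrees (false negative)
print(per_band)                 # 0.32768 (the slide rounds to 0.328)
print(miss_all)                 # ~0.00035: about 1/3000 of 80%-similar pairs are missed
print(1 - miss_all)             # probability the pair is found as a candidate
```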

What b Bands of r Rows Gives You
With b bands of r rows, the probability that a pair of documents with similarity s shares a bucket is 1 - (1 - s^r)^b: s^r is the probability that all rows of a band are equal, 1 - s^r that some row of the band is unequal, (1 - s^r)^b that no band is identical, and the complement that at least one band is identical. The resulting S-curve rises steeply around the threshold t ~ (1/b)^(1/r). Example with b = 20 and r = 5:

s     1 - (1 - s^r)^b
.2    .006
.3    .047
.4    .186
.5    .470
.6    .802
.7    .975
.8    .9996

LSH Summary
LSH gets almost all pairs with similar signatures while eliminating most pairs that do not have similar signatures; the candidate pairs are then checked to confirm that they really do have similar signatures. LSH involves a tradeoff: pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives against false negatives. Example: with only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up.
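A short sketch that reproduces the table and the approximate threshold; the function name p_candidate is an illustrative choice.

```python
def p_candidate(s: float, r: int, b: int) -> float:
    # 1 - (1 - s^r)^b: probability that a pair with similarity s shares at least one bucket.
    return 1 - (1 - s ** r) ** b

b, r = 20, 5
print("approximate threshold:", (1 / b) ** (1 / r))   # ~0.55, where the S-curve rises
for s in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    print(f"{s:.1f}  {p_candidate(s, r, b):.4f}")      # matches the slide's table
```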