US Patent 6,658,423. William Pugh

Similar documents
Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Topic: Duplicate Detection and Similarity Computing

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

Finding Similar Sets

Distributed computing: index building and use

Today s lecture. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Information Retrieval. Lecture 9 - Web search basics

Combinatorial Algorithms for Web Search Engines - Three Success Stories

Internet Basics & Beyond

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

FIGURING OUT WHAT MATTERS, WHAT DOESN T, AND WHY YOU SHOULD CARE

CS 106B Lecture 26: Esoteric Data Structures: Skip Lists and Bloom Filters

Finding dense clusters in web graph

Problem 1: Complexity of Update Rules for Logistic Regression

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Hashing and sketching

Data-Intensive Computing with MapReduce

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage

CS5112: Algorithms and Data Structures for Applications

Finding Similar Items:Nearest Neighbor Search

Query Processing with Indexes. Announcements (February 24) Review. CPS 216 Advanced Database Systems

! A Hash Table is used to implement a set, ! The table uses a function that maps an. ! The function is called a hash function.

Analysis, Dekalb Roofing Company Web Site

Distributed computing: index building and use

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

On Compressing Social Networks. Ravi Kumar. Yahoo! Research, Sunnyvale, CA. Jun 30, 2009 KDD 1

CSE547: Machine Learning for Big Data Spring Problem Set 1. Please read the homework submission policies.

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search

1.7 Limit of a Function

Shingling Minhashing Locality-Sensitive Hashing. Jeffrey D. Ullman Stanford University

PAIRS AND LISTS 6. GEORGE WANG Department of Electrical Engineering and Computer Sciences University of California, Berkeley

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CMPSCI 105: Lecture #12 Searching, Sorting, Joins, and Indexing PART #1: SEARCHING AND SORTING. Linear Search. Binary Search.

Compressing Social Networks

Information Retrieval

Stanford University Computer Science Department CS 240 Quiz 2 with Answers Spring May 24, total

Beginning Microsoft Word Crystal Lake Public Library

Information Retrieval

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

CS/COE 1501

Full-Text Search on Data with Access Control

Crawling the Web. Web Crawling. Main Issues I. Type of crawl

Session 10: Information Retrieval

Randomized Algorithms: Element Distinctness

Search Engines. Dr. Johan Hagelbäck.

Introduction to Information Retrieval

Implementation of Relational Operations: Other Operations

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

ASSIGNMENT 3. COMP-202, Winter 2015, All Sections. Due: Tuesday, February 24, 2015 (23:59)

Evaluation of Relational Operations: Other Techniques

Automated Fixing of Programs with Contracts

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

Replication. Feb 10, 2016 CPSC 416

From Routing to Traffic Engineering

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

CS5112: Algorithms and Data Structures for Applications

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

Lecture 7: Efficient Collections via Hashing

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

New Issues in Near-duplicate Detection

Evaluation of Relational Operations

Collection Building on the Web. Basic Algorithm

An introduction to Web Mining part II

Evaluation of Relational Operations: Other Techniques

MapReduce. Stony Brook University CSE545, Fall 2016

Python & Web Mining. Lecture Old Dominion University. Department of Computer Science CS 495 Fall 2012

THE WEB SEARCH ENGINE

Parallel Nested Loops

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

CS 645 : Lecture 6 Hashes, HMAC, and Authentication. Rachel Greenstadt May 16, 2012

Searching non-text information objects

Distributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016

Query Optimization. Query Optimization. Optimization considerations. Example. Interaction of algorithm choice and tree arrangement.

google SEO UpdatE the RiSE Of NOt provided and hummingbird october 2013

Bloom Filters. From this point on, I m going to refer to search queries as keys since that is the role they

CSE 5243 INTRO. TO DATA MINING

ML from Large Datasets

An efficient approach in Near Duplicate Document Detection using Shingling-MD5

The Effectiveness of Deduplication on Virtual Machine Disk Images

2.3 Algorithms Using Map-Reduce

Top-To-Bottom (And Beyond) On-Page Optimization Guidebook

Unit VIII. Chapter 9. Link Analysis

DER GOBBLE. Good Secure Crypto Wallet Practices. What is your wallet?

Tips & Tricks for Microsoft Word

Mining Web Data. Lijun Zhang

Querying Introduction to Information Retrieval INF 141 Donald J. Patterson. Content adapted from Hinrich Schütze

Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table

15-415/615 Faloutsos 1

CSE373: Data Structures & Algorithms Lecture 17: Hash Collisions. Kevin Quinn Fall 2015

Transcription:

US Patent 6,658,423 William Pugh

Detecting duplicate and near-duplicate files. I worked on this problem at Google in the summer of 2000. I have no information on whether this is currently being used; I know that other people at Google were exploring other approaches. I'll only discuss background work and what could be discovered from the patent application.

Reasons for confidentiality: competitors (e.g., Microsoft) and search engine optimizers.

Problem. In a web crawl, many duplicate and near-duplicate web pages are encountered; one study suggested that 30+% are duplicates. Causes include multiple URLs for the same page (http://www.cs.umd.edu/~pugh vs. http://www.cs.umd.edu/users/pugh), the same web site hosted on multiple host names, and web spammers.

Why near-duplicate? Identical web pages are easy to explain and easy to cope with: just use a hash function on the web pages. Near-duplicate web pages typically arise from small amounts of dynamic content and from web spammers.

Obvious O(n²) algorithm. We could compare each pair of web pages and compute the edit distance. This could be done at the time a query result is generated.

What would we do with the information? Identify mirrors or replicated web sites; avoid storing near-duplicate copies; avoid returning near-duplicate web pages in results; use it to improve PageRank calculations.

First approach. Break a document up into chunks (sentences, paragraphs, shingles, ...). Fingerprint each chunk. Two documents are similar if a large percentage of their fingerprints are in common. This still leaves lots of data to process: an iceberg query, hard to perform. A sketch of the idea follows.
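As a rough illustration, here is a minimal Python sketch of this chunk-and-fingerprint idea, assuming sentence chunks and Python's built-in hash as the fingerprint (a real system would use a stable fingerprint function):

    def chunk_fingerprints(doc):
        # Chunk on sentence boundaries (one simple choice) and fingerprint each chunk.
        chunks = [c.strip() for c in doc.split(".") if c.strip()]
        return {hash(c) for c in chunks}

    def similar(doc_a, doc_b, threshold=0.8):
        # Similar if a large percentage of fingerprints are in common.
        fa, fb = chunk_fingerprints(doc_a), chunk_fingerprints(doc_b)
        return len(fa & fb) / max(len(fa | fb), 1) >= threshold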

Broder's approach. Andrei Broder of Digital/Compaq/AltaVista had a number of papers on this problem.

Shingles. A k-shingle is a sequence of k consecutive words. For example, the 3-shingles of "The quick brown fox jumped over ..." are "the quick brown", "quick brown fox", "brown fox jumped", "fox jumped over", ...
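A minimal sketch of shingling in Python (k = 3 to match the example; real systems typically hash each shingle to a 64-bit value, as the next slide notes):

    def shingles(text, k=3):
        # The set of k-shingles: every run of k consecutive words.
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    print(shingles("The quick brown fox jumped over the lazy dog"))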

Resemblance. Let S(A) be the shingles contained in A, or the 64-bit hashes of the shingles contained in A. The resemblance of A and B is given by |S(A) ∩ S(B)| / |S(A) ∪ S(B)|.

Sampling minima. Let σ be a random permutation/hash function. Then Prob[min(σ(S(A))) = min(σ(S(B)))] = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|. Intuitively, the minimum of σ over S(A) ∪ S(B) is equally likely to be any element, and the two minima agree exactly when that element lies in S(A) ∩ S(B).

First implementation. Choose a set of t random min-wise independent permutations. For each document, keep a sketch of the t minima shingles (samples). Estimate similarity by counting common samples. A sketch follows.
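A minimal minhash sketch in Python, using salted 64-bit hashes in place of true min-wise independent permutations (an assumption for illustration; blake2b is just a convenient keyed hash):

    import hashlib

    def h64(value, seed):
        # Salted 64-bit hash standing in for a min-wise independent permutation.
        d = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8)
        return int.from_bytes(d.digest(), "big")

    def sketch(shingle_set, t=84):
        # One minimum per hash "permutation" gives a sketch of t samples.
        return [min(h64(s, seed) for s in shingle_set) for seed in range(t)]

    def estimate_resemblance(sk_a, sk_b):
        # The fraction of positions where the sketches agree estimates resemblance.
        return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)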

SuperShingles. Divide the samples into k groups of s samples (t = k*s). Fingerprint each group to obtain a feature. Two documents are considered near-duplicates if they have more than r features in common; see the sketch below.
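Continuing the minhash sketch above, the grouping might look like this (the group boundaries and the use of Python's built-in hash for the group fingerprint are assumptions):

    def supershingles(sk, k=6):
        # Split the t samples into k groups of s and fingerprint each group.
        s = len(sk) // k
        return {hash(tuple(sk[i * s:(i + 1) * s])) for i in range(k)}

    def near_duplicates(feat_a, feat_b, r=2):
        # Near-duplicates if at least r supershingle features coincide.
        return len(feat_a & feat_b) >= r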

Sample values. Looking for a resemblance of 90%: sketch size = 84, divided into 6 groups of 14 samples. Need r = 2 identical groups/features to be considered near-duplicates.

How does this work? The similarity model is OK; it has good and bad features. It is easy to compare two documents to see if they are similar, but expensive to find all similar document pairs.

Finding all near-duplicate document pairs. We want to find all document pairs that have more than r fingerprints in common. Discard all fingerprints/features that occur in only a single document. If r > 1, we now have an iceberg query: lots of fingerprints occur in two or more non-near-duplicate documents.

US Patent 6,658,423. Take the list of words (or shingles). Apply a hash function to each word. Use the hash value to determine which list the word is appended to. Compute a fingerprint of each list. Two documents are similar if they have any fingerprints in common. The example and sketch below illustrate this.

[Slide example: the words of "the quick brown fox jumped over the lazy dog" are hashed and sent to lists; two nearly identical documents yield mostly identical lists.]
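A minimal Python sketch of the scheme, assuming 4 lists and a salted hash (the h64 helper mirrors the minhash sketch above; nothing here is from the patent text itself beyond the steps just described):

    import hashlib

    def h64(value, seed):
        # Salted 64-bit hash (same helper as in the minhash sketch).
        d = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8)
        return int.from_bytes(d.digest(), "big")

    def list_fingerprints(words, num_lists=4):
        # Hash each word to pick a list; append the word to that list in order.
        lists = [[] for _ in range(num_lists)]
        for w in words:
            lists[h64(w, 0) % num_lists].append(w)
        # Fingerprint each list; documents are similar if any fingerprint matches.
        return {h64(" ".join(lst), 1) for lst in lists}

    a = list_fingerprints("the quick brown fox jumped over the lazy dog".split())
    b = list_fingerprints("the quick brown fox leaped over the lazy dog".split())
    print(len(a & b))  # a one-word replacement changes at most two of the lists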

Similarity. This uses a different metric of similarity. An edit consists of removing any or all occurrences of a word, or adding any number of occurrences of a word. The edit distance is the minimum number of edits required to convert one document into another. A sketch of one reading of this metric follows.
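One simple reading of this metric, treating a document as a bag of words (an assumption; the patent's lists are order-sensitive, so this is only an approximation):

    from collections import Counter

    def edit_distance(doc_a, doc_b):
        # One edit adds or removes occurrences of a single word, so each word
        # whose occurrence count differs costs one edit; a replacement costs two.
        ca, cb = Counter(doc_a.split()), Counter(doc_b.split())
        return sum(1 for w in ca.keys() | cb.keys() if ca[w] != cb[w])

    print(edit_distance("the quick brown fox", "the quick red fox"))  # 2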

Will two documents have a fingerprint in common? Assume we use 4 lists/fingerprints per document. Each edit will cause one randomly selected fingerprint to change. How many changes are needed to create a high probability that all fingerprints have been changed? A small calculation appears below, followed by the chart.
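By inclusion-exclusion (my calculation, not shown on the slides), the probability that all fingerprints have changed after m edits can be computed directly; the complement is the chance that some fingerprint survives:

    from math import comb

    def p_all_changed(m, lists=4):
        # P(every fingerprint changed after m edits), by inclusion-exclusion
        # over the subsets of lists that remain untouched.
        return sum((-1) ** i * comb(lists, i) * ((lists - i) / lists) ** m
                   for i in range(lists + 1))

    for m in (5, 10, 20, 40):
        print(m, f"{1 - p_all_changed(m):.4%}")  # chance a fingerprint survives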

[Chart: probability that some fingerprint survives (log scale, 100% down to 0.01%) vs. number of words changed, 0 to 40.]

False positive rate. 0.1% seems like a pretty low false positive rate, unless you are indexing billions of web pages. You need to be very careful about deciding to discard web pages from the index; you can be less careful about eliminating near-duplicates from query results.

Can run the test multiple times. [Chart: survival probability (log scale, 100% down to 0.01%) vs. words changed, 0 to 40, with curves for running the test once, twice, three times, and four times.]

Can vary the number of lists and the number of repeats. [Chart: survival probability vs. words changed, same axes as above.]

Discarding unique fingerprints. Hash fingerprints into bitmaps to remove unique fingerprints: two bits per hash value (0, 1, or many), with a first pass to count and a second pass to discard. Repeat the passes with a different hash function to handle collisions. Partition the fingerprints if we can't create bitmaps that are sparse enough. A sketch follows.
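A minimal Python sketch of the two counting passes, using a plain integer list as the two-bit table (the table size and the modular hashing are assumptions):

    def drop_unique(fingerprints, table_size=1 << 20):
        # Two bits per slot, saturating at 2 ("many"): first pass counts.
        counts = [0] * table_size
        for fp in fingerprints:
            slot = fp % table_size
            counts[slot] = min(counts[slot] + 1, 2)
        # Second pass keeps only fingerprints whose slot saw more than one.
        return [fp for fp in fingerprints if counts[fp % table_size] >= 2]

A collision can make a unique fingerprint look repeated, which is why the slide repeats the passes with a different hash function.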

Finding clusters. Sort the (fingerprint, docid) pairs by fingerprint. Perform union-find on the docids. The result is clusters of near-duplicate documents, with the assumption that similarity is transitive. A sketch follows.
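A minimal Python sketch of this clustering step (the union-find uses path halving; the grouping by sorted fingerprint is as described above):

    def cluster(pairs):
        # pairs: (fingerprint, docid) tuples. Sorting groups equal
        # fingerprints together; union-find then merges their docids.
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        prev = None
        for fp, doc in sorted(pairs):
            if prev is not None and prev[0] == fp:
                parent[find(prev[1])] = find(doc)  # shared fingerprint: union
            prev = (fp, doc)
        clusters = {}
        for doc in list(parent):
            clusters.setdefault(find(doc), set()).add(doc)
        return list(clusters.values())

    print(cluster([(1, "a"), (1, "b"), (2, "b"), (2, "c"), (9, "d")]))
    # -> [{"a", "b", "c"}]; "d" shares no fingerprint, so it joins no cluster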

Similarity measures. Replacement of one word with another counts as two changes, no matter how many replacements occur. Moving a big chunk of text is a big change, unless you use an order-insensitive hash function. The metric is an absolute difference, not a percentage difference.

Conclusion: unknown. I don't know what Google has done with this idea, although they seem to be using something. I haven't been able to talk with anyone else about this idea until now.