New Issues in Near-duplicate Detection

Similar documents
Desktop Crawls. Document Feeds. Information Retrieval

Topic: Duplicate Detection and Similarity Computing

Overview of the 6th International Competition on Plagiarism Detection

Search Engines. Information Retrieval in Practice

Analyzing a Large Corpus of Crowdsourced Plagiarism

Finding Similar Items: Nearest Neighbor Search

Overview of the 5th International Competition on Plagiarism Detection

Crawling and Mining Web Sources

Information Retrieval. Lecture 9 - Web search basics

Fingerprint-based Similarity Search and its Applications

Crawling. CS6200: Information Retrieval. Slides by: Jesse Anderton

US Patent 6,658,423. William Pugh

Hashing and Merging Heuristics for Text Reuse Detection

An Information Retrieval Approach for Source Code Plagiarism Detection

Large Crawls of the Web for Linguistic Purposes

Data-Intensive Computing with MapReduce

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Exploratory Search Missions for TREC Topics

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008

Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms

Query Session Detection as a Cascade

Detection of Text Plagiarism and Wikipedia Vandalism

Algorithms for Nearest Neighbors

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Plagiarism detection by similarity join

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

Finding Similar Sets

Information Retrieval Spring Web retrieval

A Fast Text Similarity Measure for Large Document Collections Using Multi-reference Cosine and Genetic Algorithm

Improving Synoptic Querying for Source Retrieval

Adaptive Near-Duplicate Detection via Similarity Learning

PROBABILISTIC SIMHASH MATCHING. A Thesis SADHAN SOOD

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Redundant Documents and Search Effectiveness

Today's lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Candidate Document Retrieval for Web-scale Text Reuse Detection

Webis at the TREC 2012 Session Track

Deliverable D Multilingual corpus acquisition software

Detecting the Origin of Text Segments Efficiently

Detecting Near-Duplicates in Large-Scale Short Text Databases

Information Retrieval May 15. Web retrieval

Chapter ML:XI (continued)

Web-scale Content Reuse Detection (extended) USC/ISI Technical Report ISI-TR-692, June 2014

EMC Celerra Replicator V2 with Silver Peak WAN Optimization

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

TABLE OF CONTENTS CHAPTER NO. TITLE PAGE NO. ABSTRACT 5 LIST OF TABLES LIST OF FIGURES LIST OF SYMBOLS AND ABBREVIATIONS xxi

The Chinese Duplicate Web Pages Detection Algorithm based on Edit Distance

CLSH: Cluster-based Locality-Sensitive Hashing

From Web Page Storage to Living Web Archives Thomas Risse

Creating a Classifier for a Focused Web Crawler

Problem 1: Complexity of Update Rules for Logistic Regression

Rongrong Ji (Columbia), Yu Gang Jiang (Fudan), June, 2012

Influence of Word Normalization on Text Classification

DATA MINING II - 1DL460. Spring 2014

REDUNDANCY REMOVAL IN WEB SEARCH RESULTS USING RECURSIVE DUPLICATION CHECK ALGORITHM. Pudukkottai, Tamil Nadu, India

SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content

Information Retrieval

SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections

Adaptive Binary Quantization for Fast Nearest Neighbor Search

TIC: A Topic-based Intelligent Crawler

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

An introduction to Web Mining part II

Evaluating the Usefulness of Sentiment Information for Focused Crawlers

Shingling Minhashing Locality-Sensitive Hashing. Jeffrey D. Ullman Stanford University

From Web Page Storage to Living Web Archives

Contents. List of Figures. List of Tables. Acknowledgements

Achieving both High Precision and High Recall in Near-duplicate Detection

Candidate Document Retrieval for Arabic-based Text Reuse Detection on the Web

Today's lecture. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Similarity Searching Techniques in Content-based Audio Retrieval via Hashing

Finding dense clusters in web graph

CHAPTER 2: LITERATURE REVIEW

An efficient approach in Near Duplicate Document Detection using Shingling-MD5

Predictive Indexing for Fast Search

Developing Focused Crawlers for Genre Specific Search Engines

SafeAssign. Blackboard Support Team V 2.0

Fast Indexing and Search. Lida Huang, Ph.D. Senior Member of Consulting Staff Magma Design Automation

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

Néonaute: mining web archives for linguistic analysis

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage

Chapter S:II. II. Search Space Representation

CHAPTER 3 NOISE REMOVAL FROM WEB PAGE FOR EFFECTUAL WEB CONTENT MINING

Cosine Approximate Nearest Neighbors

Information Retrieval. (M&S Ch 15)

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

Efficient Web Crawling for Large Text Corpora

Predictive Indexing for Fast Search

Building the Multilingual Web of Data. Integrating NLP with Linked Data and RDF using the NLP Interchange Format

SourcererCC -- Scaling Code Clone Detection to Big-Code

Evaluation of Retrieval Systems

Similarity Joins in MapReduce

Information Retrieval and Web Search

Topology-Based Spam Avoidance in Large-Scale Web Crawls

Automatic Detection and Visualisation of Overlap for Tracking of Information Flow

Transcription:

New Issues in Near-duplicate Detection Martin Potthast and Benno Stein Bauhaus University Weimar Web Technology and Information Systems

Motivation

About 30% of the Web is redundant. [Fetterly 03, Broder 06]

Content redundancy occurs in various forms:
- Mirrors
- Crawl artifacts, such as the same text with a different date or a different advertisement, available through multiple URLs
- Versions created for different delivery mechanisms (HTML, PDF, etc.)
- Annotated and unannotated copies of the same document
- Policies and procedures for the same purpose in different legislatures
- Boilerplate text such as license agreements or disclaimers
- Shared context such as summaries of other material or lists of links
- Syndicated news articles delivered in different venues
- Revisions and versions
- Reuse and republication of text, legitimate and otherwise [Zobel 06]

Nearly exact copies and modified copies with high similarity: near-duplicate documents.

Motivation

Contributions of near-duplicate detection to real-world tasks:
- Index size reduction
- Search result cleaning
- Web crawl prioritization
- Plagiarism analysis

Our contributions to near-duplicate detection:
- A classification of near-duplicate detection algorithms
- A new, tailored evaluation corpus
- A comparison of current algorithms, including previously unconsidered hashing technologies

Formalization

Consider a set of documents D. Given a document d_q, find all documents D_q ⊆ D with a high similarity to d_q.

Naive approach: compare d_q with each d ∈ D. In detail: construct document models for the documents in D and for d_q, and employ a similarity function ϕ : D × D → [0, 1].

Near-duplicate detection algorithms rely on purposefully constructed document models, called fingerprints. A fingerprint is a set of k natural numbers computed from extracts of the document. Two documents are considered duplicates if their fingerprints share at least k_d numbers, with k_d < k.
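
A minimal Python sketch (not the authors' code) contrasting the two tests just described; the similarity function phi, the threshold, and the value of k_d are illustrative placeholders:

```python
# Minimal sketch: naive pairwise comparison versus the
# fingerprint-based duplicate test described above.

def naive_near_duplicates(d_q, D, phi, threshold=0.8):
    """Compare d_q against every document d in D with a similarity function phi."""
    return [d for d in D if phi(d_q, d) >= threshold]

def fingerprint_near_duplicates(f_q, fingerprints, k_d=2):
    """Fingerprints are sets of natural numbers; two documents count as
    near-duplicates if their fingerprints share at least k_d numbers."""
    return [doc for doc, f in fingerprints.items() if len(f_q & f) >= k_d]
```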

Fingerprinting

Fingerprinting methods divide into chunking-based fingerprinting and similarity hashing.

Chunking: k chunks are selected from a document d. Each chunk is hashed, and the hashcodes form the fingerprint. Example: chunks c_1, c_2 yield hashcodes p_1 = h(c_1) = 351427 and p_2 = h(c_2) = 125497, giving the fingerprint F_d = {p_1, p_2} = {351427, 125497}. Chunks are also called n-grams or shingles.
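
A hedged sketch of chunking-based fingerprinting. Word 3-grams, MD5 as the chunk hash h, and "keep the k smallest hashcodes" as the selection heuristic are placeholder choices, not those of the talk:

```python
import hashlib

def chunks(text, n=3):
    """All word n-grams ("shingles") of a document."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def fingerprint(text, n=3, k=8):
    """Hash every chunk and keep the k smallest hashcodes as F_d."""
    codes = {int(hashlib.md5(c.encode()).hexdigest(), 16) % 10**6
             for c in chunks(text, n)}
    return set(sorted(codes)[:k])
```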

Fingerprinting

Chunking: k chunks are selected from a document d. The selection heuristics are based either on knowledge about D or on intelligent random choices:
- All chunks: n-gram model
- Collection-specific: local rare chunks; SPEX, I-Match
- (Pseudo-)random: random choice, every n-th chunk
- Cascading: super- and megashingling
- Synchronized: shingling, prefix anchors, hashed breakpoints, winnowing

Fingerprinting

Similarity hashing: k particular hash functions h_ϕ : D → U, U ⊆ ℕ, with the property

    h_ϕ(d) = h_ϕ(d_q)  ⟹  ϕ(d, d_q) ≥ 1 - ε,  where d ∈ D and 0 < ε ≪ 1,

are used to generate k hashcodes for a document d.

Fingerprinting

Similarity hashing divides into knowledge-based approaches (fuzzy-fingerprinting) and randomized approaches (locality-sensitive hashing). Hash function construction: domain knowledge vs. randomization.

Fingerprinting

For chunking-based algorithms, two fingerprints have to share more than one number (k_d > 1) for the documents to be recognized as duplicates. For similarity hashing algorithms, fingerprints need to share only one number (k_d = 1).

(Cascaded) Chunking: (Super-)Shingling (SSh) [Broder 97]

Shingling: the n-gram vector space model of a document, with n-grams (w_1 w_2 w_3), (w_2 w_3 w_4), (w_3 w_4 w_5), ..., (w_m-2 w_m-1 w_m), is mapped by hash value computation to hashcodes 12354, 15695, 59634, ..., 43586; a random choice among them forms the fingerprint {12354, 15695, ..., 55476}.

(Cascaded) Chunking: (Super-)Shingling (SSh) [Broder 97]

Super-shingling cascades this step: the shingling fingerprint {12354, 15695, ..., 55476} is written as the string representation "12354 15695 ... 55476", which is then shingled and hashed again to obtain the super-shingle fingerprint.
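
The cascade can be sketched as follows. The slide's "random choice" is stood in for by min-wise selection (keeping the k smallest hashcodes), which picks the same shingles for similar texts; the hash function and all parameters are placeholders:

```python
import hashlib

def h(s):
    """Placeholder hash function for chunks."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % 10**5

def shingles(text, n=3):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def shingle_fingerprint(text, n=3, k=6):
    """Shingling: hash all n-grams, keep k of them (min-wise stand-in
    for the random choice on the slide)."""
    return sorted({h(g) for g in shingles(text, n)})[:k]

def super_shingle_fingerprint(text, n=3, k=6, m=2):
    """Cascade: write the fingerprint as a string and shingle it again."""
    s = " ".join(str(code) for code in shingle_fingerprint(text, n, k))
    return {h(g) for g in shingles(s, m)}
```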

Similarity Hashing: Fuzzy-Fingerprinting (FF) [Stein 05]

Pipeline: a priori probabilities of prefix classes in the British National Corpus (BNC) → distribution of prefix classes in the sample → normalization and difference computation → fuzzification → fingerprint, e.g. {213235632, 157234594}. All words having the same prefix belong to the same prefix class.
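
A very rough sketch of this pipeline, for illustration only: the prefix classes, the a priori distribution (taken from the BNC in the talk), and the fuzzification threshold below are all made-up placeholders:

```python
PREFIXES = ["a", "b", "c", "d", "e"]    # placeholder prefix classes
A_PRIORI = [0.2, 0.2, 0.2, 0.2, 0.2]    # placeholder a priori distribution (BNC in the talk)

def fuzzy_fingerprint(text, eps=0.05):
    words = text.lower().split()
    # Distribution of prefix classes in the sample.
    dist = [sum(w.startswith(p) for w in words) / max(len(words), 1)
            for p in PREFIXES]
    # Difference to the a priori distribution, fuzzified to three values:
    # 0 = as expected, 1 = more frequent, 2 = less frequent.
    fuzzy = [0 if abs(d - a) <= eps else (1 if d > a else 2)
             for d, a in zip(dist, A_PRIORI)]
    # Encode the fuzzified deviation vector as a single hashcode.
    return hash(tuple(fuzzy))
```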

Similarity Hashing: Locality-Sensitive Hashing (LSH) [Indyk and Motwani 98, Datar et al. 04]

Pipeline: in the vector space, a sample document d is projected onto r random vectors a_1, ..., a_r by computing the dot products a_i^T · d; the results of the r dot products are summed and the sum is discretized on the real number line, yielding the fingerprint, e.g. {213235632}.
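
A hedged sketch of this projection step, assuming fixed-dimension document vectors; r, the dimension, and the bucket width w are placeholders:

```python
import math
import random

def lsh_hashcode(doc_vector, r=4, dim=100, w=0.5, seed=0):
    rng = random.Random(seed)  # fixed seed: the same random vectors for every document
    total = 0.0
    for _ in range(r):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]       # random vector a_i
        total += sum(x * y for x, y in zip(a, doc_vector))  # dot product a_i . d
    return math.floor(total / w)  # discretize the summed projections on the number line
```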

Wikipedia Snapshot including all Revisions

Existing standard corpora (TREC, Reuters) are not suited for large-scale evaluations of near-duplicate detection algorithms. Wikipedia is a rich resource of versioned and revisioned documents.

Benchmark data:
- approx. 6 million pages (documents)
- approx. 80 million revisions
- an XML file of approx. 1 TB

Wikipedia Snapshot including all Revisions

Experiments:
- The first revision of each Wikipedia page is in the role of d_q.
- d_q was compared with each of its revisions.
- d_q was compared with its immediately succeeding page.

Reference: vector space model with tf weights and cosine similarity. Precision and recall were recorded for similarity thresholds ranging from 0 to 1.

[Plot: percentage of similarities per similarity interval (0 to 1), log scale from 0.0001 to 0.1, for the Wikipedia and Reuters collections.]
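
The reference measure, sketched under the assumption of plain whitespace tokenization (preprocessing is not specified in the talk):

```python
from collections import Counter
from math import sqrt

def cosine_tf(text_a, text_b):
    """Cosine similarity over raw term-frequency vectors."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```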

Results

[Plot: recall over similarity (0 to 1) for FF, LSH, SSh, Sh, and HBC on the Wikipedia revision collection.]

Results

[Plot: precision over similarity (0 to 1) for FF, LSH, SSh, Sh, and HBC on the Wikipedia revision collection.]

Near-duplicate detection accuracy:
- FF outperforms the other algorithms in terms of recall.
- No algorithm outperforms the others in terms of precision.
- LSH performs poorly in both respects.

Wikipedia Revision Collection:
- may become a new standard for high-similarity evaluations
- allows for evaluations at Web scale

Conclusions:
- Similarity hashing is a promising technology for near-duplicate detection; there is still room for improvement.
- Chunking strategies are susceptible to versioned documents.

Thank you!