SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content

Similar documents
A Fast Text Similarity Measure for Large Document Collections Using Multi-reference Cosine and Genetic Algorithm

SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections

Detecting Near-Duplicates in Large-Scale Short Text Databases

Achieving both High Precision and High Recall in Near-duplicate Detection

Combinatorial Algorithms for Web Search Engines - Three Success Stories

SpotSigs: Near-Duplicate Detection in Web Page Collections

Adaptive Near-Duplicate Detection via Similarity Learning

Detection of Distinct URL and Removing DUST Using Multiple Alignments of Sequences

HYBRIDIZED MODEL FOR EFFICIENT MATCHING AND DATA PREDICTION IN INFORMATION RETRIEVAL

A LITERATURE SURVEY ON WEB CRAWLERS

Automated Path Ascend Forum Crawling

The Chinese Duplicate Web Pages Detection Algorithm based on Edit Distance

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

CADIAL Search Engine at INEX

Mining Quantitative Association Rules on Overlapped Intervals

Automation of URL Discovery and Flattering Mechanism in Live Forum Threads

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage

A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING

From Passages into Elements in XML Retrieval

Candidate Document Retrieval for Arabic-based Text Reuse Detection on the Web

Topic: Duplicate Detection and Similarity Computing

PROBABILISTIC SIMHASH MATCHING. A Thesis SADHAN SOOD

Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms

Clustering-Based Distributed Precomputation for Quality-of-Service Routing*

Metric Learning Applied for Automatic Large Image Classification

Making Retrieval Faster Through Document Clustering

A Language Independent Author Verifier Using Fuzzy C-Means Clustering

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Enabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines

Semantic Website Clustering

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique

Chapter 27 Introduction to Information Retrieval and Web Search

A Two-Tier Distributed Full-Text Indexing System

Towards a hybrid approach to Netflix Challenge

ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES

New Issues in Near-duplicate Detection

Exploiting Index Pruning Methods for Clustering XML Collections

A New Measure of the Cluster Hypothesis

Ternary Tree Optimalization for n-gram Indexing

Text Clustering Incremental Algorithm in Sensitive Topic Detection

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

A Supervised Method for Multi-keyword Web Crawling on Web Forums

Joint Entity Resolution

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Mining di Dati Web. Lezione 3 - Clustering and Classification

Index Terms:- Document classification, document clustering, similarity measure, accuracy, classifiers, clustering algorithms.

Learning the Three Factors of a Non-overlapping Multi-camera Network Topology

Web page recommendation using a stochastic process model

Finding Similar Items:Nearest Neighbor Search

Plagiarism detection by similarity join

Automatic Query Type Identification Based on Click Through Information

Online Document Clustering Using the GPU

Ranking Clustered Data with Pairwise Comparisons

Algorithms for Nearest Neighbors

A Model for Interactive Web Information Retrieval

C-NBC: Neighborhood-Based Clustering with Constraints

Cardinality Estimation: An Experimental Survey

Search Engines. Information Retrieval in Practice

Selection of n in K-Means Algorithm

Use of Locality Sensitive Hashing (LSH) Algorithm to Match Web of Science and SCOPUS

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS

A Modular k-nearest Neighbor Classification Method for Massively Parallel Text Categorization

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Mining High Average-Utility Itemsets

Static Index Pruning for Information Retrieval Systems: A Posting-Based Approach

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

Duplicate News Story Detection Revisited

Ranking Web Pages by Associating Keywords with Locations

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Encoding Words into String Vectors for Word Categorization

Using Statistical Properties of Text to Create. Metadata. Computer Science and Electrical Engineering Department

ResPubliQA 2010

STUDYING OF CLASSIFYING CHINESE SMS MESSAGES

Live Virtual Machine Migration with Efficient Working Set Prediction

Semi-Supervised Clustering with Partial Background Information

Leveraging Set Relations in Exact Set Similarity Join

Meshlization of Irregular Grid Resource Topologies by Heuristic Square-Packing Methods

REDUNDANCY REMOVAL IN WEB SEARCH RESULTS USING RECURSIVE DUPLICATION CHECK ALGORITHM. Pudukkottai, Tamil Nadu, India

Distance-based Outlier Detection: Consolidation and Renewed Bearing

A Clustering Framework to Build Focused Web Crawlers for Automatic Extraction of Cultural Information

Locality Preserving Scheme of Text Databases Representative in Distributed Information Retrieval Systems

K-Means Based Matching Algorithm for Multi-Resolution Feature Descriptors

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

FOCUS: ADAPTING TO CRAWL INTERNET FORUMS

Weighted Suffix Tree Document Model for Web Documents Clustering

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Classification of Page to the aspect of Crawl Web Forum and URL Navigation

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University

Compact Encoding of the Web Graph Exploiting Various Power Laws

Frame based Video Retrieval using Video Signatures

International Journal of Advanced Research in Computer Science and Software Engineering

Crawler with Search Engine based Simple Web Application System for Forum Mining

Online Stochastic Matching CMSC 858F: Algorithmic Game Theory Fall 2010

Improving Suffix Tree Clustering Algorithm for Web Documents

Application of Support Vector Machine Algorithm in Spam Filtering

Fast or furious? - User analysis of SF Express Inc

A Comparison of Algorithms used to measure the Similarity between two documents

Predictive Indexing for Fast Search

Transcription:

SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content

Xianling Mao, Xiaobing Liu, Nan Di, Xiaoming Li, and Hongfei Yan
Department of Computer Science and Technology, Peking University
{mxl,lxb,dn,lxm,yhf}@net.pku.edu.cn

Abstract. Detecting whether two Web pages are near replicas, in terms of their contents rather than their files, is of great importance in many Web-information-based applications, and many deduplication algorithms have been proposed. Nevertheless, analysis and experiments show that existing algorithms usually do not work well for short Web pages¹, because a relatively large portion of the corresponding files consists of noisy information, such as ads and site templates. In this paper, we analyze the critical issues in deduplicating short Web pages and present an algorithm (AF_SpotSigs) that incorporates them, which works 15% better than the state-of-the-art method. We then propose an algorithm (SizeSpotSigs) that takes the size of the page content into account and can handle both short and long Web pages. The contributions of SizeSpotSigs are three-fold: 1) we provide an analysis of the relation between noise-content ratio and similarity, and propose two rules for making such methods work better; 2) based on this analysis, we propose 3 new features for Chinese that improve effectiveness on short Web pages; 3) we present an algorithm named SizeSpotSigs for near-duplicate detection that considers the size of the core content of a Web page. Experiments confirm that SizeSpotSigs works better than state-of-the-art approaches such as SpotSigs over a demonstrative Mixer collection of manually assessed near-duplicate news articles, which includes both short and long Web pages.

Keywords: Deduplicate, Near Duplicate Detection, AF_SpotSigs, SizeSpotSigs, Information Retrieval.

1 Introduction

Detection of duplicate or near-duplicate Web pages is an important and difficult problem for Web search engines. Many algorithms have been proposed in recent years [6,8,20,13,18]. Most approaches can be characterized as different types of distance or overlap measures operating on the HTML strings. State-of-the-art algorithms, such as Broder et al.'s [2] and Charikar's [3], achieve reasonable precision or recall. In particular, SpotSigs [19] can avoid the step of removing noise from a Web page because of its smart feature selection.

¹ In this paper, Web pages are classified into long (Web) pages and short (Web) pages based on their core content size.

J.Z. Huang, L. Cao, and J. Srivastava (Eds.): PAKDD 2011, Part I, LNAI 6634, pp. 537-548, 2011. © Springer-Verlag Berlin Heidelberg 2011

Existing deduplication algorithms do not take the size of the page's core content into account. Essentially, these algorithms are better suited to long Web pages, because they use only surface features to represent documents. For short documents, however, this representation is not sufficient, and it becomes even worse when the documents contain noisy information such as ads. Our experiments in Section 5.3 also show that the state-of-the-art deduplication algorithm performs relatively poorly on short Web pages, reaching only 0.62 (F1) against 0.92 (F1) for long Web pages. In fact, there is a large number of short Web pages on the World Wide Web whose core content is duplicated, and they are often important: consider, for example, a central bank announcement such as an interest rate adjustment. Fig. 1 shows a pair of same-core Web pages that differ only in framing, advertisements, and navigational banners. Both articles exhibit almost identical core content, reporting on the match review between Uruguay and the Netherlands.

Fig. 1. Near-duplicate Web pages: identical core content with different framing and banners (additional ads and related links removed); the core contents are short.

So it is important and necessary to improve the effectiveness of deduplication for short Web pages.

1.1 Contribution

1. We analyze the relation between noise-content ratio and similarity, and propose two rules for making such methods work better;
2. Based on our analysis, we propose 3 new features for Chinese that improve effectiveness on short Web pages, which leads to the AF_SpotSigs algorithm;
3. We present an algorithm named SizeSpotSigs for near-duplicate detection that considers the size of the core content of a Web page.

2 Related Work

There are two families of methods for near-duplicate detection: content-based methods and non-content-based methods. Content-based methods detect near duplicates by computing similarity between the contents of documents, while non-content-based methods use non-content features [10,1,17] (e.g., URL patterns). Non-content-based methods can only detect near-duplicate pages within one web site, while content-based methods have no such limitation. Content-based algorithms can be further divided into two groups according to whether they require noise removal; most existing content-based deduplication algorithms require it.

Broder et al. [6] proposed the DSC algorithm (also called Shingling), which detects near duplicates by computing similarity among the shingle sets of the documents. The similarity between two documents is computed with the common Jaccard overlap measure over their shingle sets. To reduce the complexity of Shingling on large collections, DSC-SS (also called super shingles) was later proposed by Broder [5]. DSC-SS uses meta-shingles, i.e., shingles of shingles, with only a small decrease in precision. A variety of methods for choosing good shingles are investigated by Hoad and Zobel [14]. Buttcher and Clarke [7] focus on Kullback-Leibler divergence in the more general context of search. A large-scale evaluation was carried out by Henzinger [13] to compare the precision of the Shingling and SimHash algorithms, with their parameters adjusted to maintain almost the same recall. The experiments show that neither algorithm works well for finding near-duplicate pairs on the same site, because of the influence of templates, while both achieve higher precision for near-duplicate pairs on different sites. Yang and Callan [21] proposed that near-duplicate clustering should incorporate information about document attributes or the content structure.

Another widespread duplicate detection technique is to generate a document fingerprint, a compact description of the document, and then compute pair-wise similarity between document fingerprints, under the assumption that fingerprints can be compared much more quickly than complete documents. A common method of generating fingerprints is to select a set of character sequences from a document and to generate a fingerprint based on the hash values of these sequences. Similarity between two documents is measured by the Jaccard formula.

Different algorithms are characterized, and their computational costs determined, by the hash functions and by how the character sequences are selected. Manber [18] was the first to study this problem. The I-Match algorithm [9,16] uses external collection statistics and increases recall by using multiple fingerprints per document. Position-based schemes [4] select strings based on their offset in a document. Broder et al. [6] pick strings whose hash values are multiples of an integer. Indyk and Motwani [12,15] proposed Locality Sensitive Hashing (LSH), an approximate similarity search technique that scales to both large and high-dimensional data sets. There are many variants of LSH, such as LSH-Tree [3] and Hamming-LSH [11].

Generally, noise removal is an expensive operation, so a near-duplicate detection algorithm should avoid it if possible. Martin Theobald et al. proposed the SpotSigs algorithm [19], which uses word chains around stopwords as the features of a document. For example, consider the sentence: "On a street in Milton, in the city's inner-west, one woman wept as she toured her waterlogged home." Choosing the articles "a", "an", "the" and the verb "is" as antecedents, with a uniform spot distance of 1 and chain length of 2, we obtain the set of spot signatures S = {a:street:milton, the:city's:inner-west}. SpotSigs needs only a single pass over a corpus, which makes it efficient, easy to implement, and less error-prone, because expensive layout analysis is omitted; meanwhile, it remains largely independent of the input format. This method is taken as our baseline. In this paper, considering these merits, we focus on algorithms that do not require noise removal, and we also take the Jaccard overlap measure as our similarity measure.
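To make the two building blocks used below concrete, the following Python sketch extracts spot signatures for the example sentence above and computes the Jaccard overlap between feature sets. It is a minimal illustration only: the tokenizer, the small skip-word list used while forming chains, and the fixed spot distance of 1 are simplifying assumptions, not the exact SpotSigs implementation.

```python
import re

ANTECEDENTS = {"a", "an", "the", "is"}
# hypothetical skip list: words that never enter a chain
SKIP_WORDS = ANTECEDENTS | {"on", "in", "of", "to", "at", "as", "by", "for"}
CHAIN_LENGTH = 2


def spot_signatures(text, chain=CHAIN_LENGTH):
    """Return the set of spot signatures of a text (spot distance fixed to 1)."""
    tokens = re.findall(r"[\w'-]+", text.lower())
    signatures = set()
    for i, tok in enumerate(tokens):
        if tok not in ANTECEDENTS:
            continue
        # chain: the next `chain` non-skip-word tokens after the antecedent
        following = [t for t in tokens[i + 1:] if t not in SKIP_WORDS][:chain]
        if len(following) == chain:
            signatures.add(":".join([tok] + following))
    return signatures


def jaccard(p1, p2):
    """sim(P1, P2) = |P1 ∩ P2| / |P1 ∪ P2|."""
    return len(p1 & p2) / len(p1 | p2) if (p1 or p2) else 1.0


sentence = ("On a street in Milton, in the city's inner-west, "
            "one woman wept as she toured her waterlogged home.")
sigs = spot_signatures(sentence)
print(sigs)                 # e.g. {'a:street:milton', "the:city's:inner-west"}
print(jaccard(sigs, sigs))  # identical sets -> 1.0
```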

3 Relation between Noise-Content Ratio and Similarity

3.1 Concepts and Notation

To calculate similarity, we need to extract features from Web pages. We define all the features extracted from one page as its page-feature set, and we split this set into a content-feature set and a noise-feature set. A feature that comes from the core content of the page is called a content feature (element) and belongs to the content-feature set; otherwise, it is called a noise feature (element) and belongs to the noise-feature set. The noise-content (feature) ratio is the ratio between the size of the noise-feature set and the size of the content-feature set.

3.2 Theoretical Analysis

Let $sim(P_1,P_2) = |P_1 \cap P_2| / |P_1 \cup P_2|$ be the default Jaccard similarity defined over two sets $P_1$ and $P_2$, each consisting of the distinct page-feature set of a page in our case. $P_{1c}$ and $P_{2c}$ are the content-feature sets, and $P_{1n}$ and $P_{2n}$ are the noise-feature sets, which satisfy $P_{1c} \cup P_{1n} = P_1$ and $P_{2c} \cup P_{2n} = P_2$. The similarity between $P_{1c}$ and $P_{2c}$ is $sim(P_{1c},P_{2c}) = |P_{1c} \cap P_{2c}| / |P_{1c} \cup P_{2c}|$, which is the value we actually care about in near-duplicate detection.

In fact, near-duplicate detection is a comparison of the similarity of the core contents of two pages, but Web pages contain a lot of noisy content, such as banners and ads. Most algorithms therefore use $sim(P_1,P_2)$ to approximate $sim(P_{1c},P_{2c})$: if $sim(P_1,P_2)$ is close to $sim(P_{1c},P_{2c})$, the near-duplicate detection algorithm works well, and vice versa. To describe the difference between $sim(P_1,P_2)$ and $sim(P_{1c},P_{2c})$, we obtain Theorem 1 as follows:

Theorem 1. Given two sets $P_1$ and $P_2$ such that $P_{1c} \subseteq P_1$, $P_{1n} \subseteq P_1$ and $P_{1c} \cup P_{1n} = P_1$, and similarly $P_{2c} \subseteq P_2$, $P_{2n} \subseteq P_2$ and $P_{2c} \cup P_{2n} = P_2$; at the same time, $sim(P_1,P_2) = |P_1 \cap P_2| / |P_1 \cup P_2|$ and $sim(P_{1c},P_{2c}) = |P_{1c} \cap P_{2c}| / |P_{1c} \cup P_{2c}|$. Let the noise-content ratios satisfy $\frac{|P_{1n}|}{|P_{1c}|} \le \epsilon$ and $\frac{|P_{2n}|}{|P_{2c}|} \le \epsilon$, where $\epsilon$ is a small number. Then

$$-\frac{2\epsilon}{1+2\epsilon} \le sim(P_1,P_2) - sim(P_{1c},P_{2c}) \le 2\epsilon \quad (1)$$

Proof: Let $A = |P_{1c} \cap P_{2c}|$ and $B = |P_{1c} \cup P_{2c}|$. Then

$$A \le |(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})| \le A + 2\max\{|P_{1n}|,|P_{2n}|\} \quad (2)$$

$$B \le |(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})| \le B + 2\max\{|P_{1n}|,|P_{2n}|\} \quad (3)$$

From (2) and (3), we get the following inequality:

$$\frac{A}{B + 2\max\{|P_{1n}|,|P_{2n}|\}} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} \le \frac{A + 2\max\{|P_{1n}|,|P_{2n}|\}}{B} \quad (4)$$

From (4), we get the following inequality:

$$-\frac{2A\max\{|P_{1n}|,|P_{2n}|\}}{B\,(B + 2\max\{|P_{1n}|,|P_{2n}|\})} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} - \frac{A}{B} \le \frac{2\max\{|P_{1n}|,|P_{2n}|\}}{B} \quad (5)$$

Obviously, $A \le B$ and $B \ge \max\{|P_{1c}|,|P_{2c}|\}$. So we get

$$\frac{\max\{|P_{1n}|,|P_{2n}|\}}{B} \le \frac{\max\{|P_{1n}|,|P_{2n}|\}}{\max\{|P_{1c}|,|P_{2c}|\}} \le \epsilon \quad (6)$$

Another inequality is

$$\frac{2A\max\{|P_{1n}|,|P_{2n}|\}}{B\,(B + 2\max\{|P_{1n}|,|P_{2n}|\})} = \frac{2A}{B}\cdot\frac{\max\{|P_{1n}|,|P_{2n}|\}}{B + 2\max\{|P_{1n}|,|P_{2n}|\}} \le \frac{2\max\{|P_{1n}|,|P_{2n}|\}}{B + 2\max\{|P_{1n}|,|P_{2n}|\}} \le \frac{2\max\{|P_{1n}|,|P_{2n}|\}}{\max\{|P_{1c}|,|P_{2c}|\} + 2\max\{|P_{1n}|,|P_{2n}|\}} = \frac{2\,\frac{\max\{|P_{1n}|,|P_{2n}|\}}{\max\{|P_{1c}|,|P_{2c}|\}}}{1 + 2\,\frac{\max\{|P_{1n}|,|P_{2n}|\}}{\max\{|P_{1c}|,|P_{2c}|\}}} \le \frac{2\epsilon}{1+2\epsilon} \quad (7)$$

So, (5) can be rewritten as

$$-\frac{2\epsilon}{1+2\epsilon} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} - \frac{A}{B} \le 2\epsilon \quad (8)$$

That is,

$$-\frac{2\epsilon}{1+2\epsilon} \le sim(P_1,P_2) - sim(P_{1c},P_{2c}) \le 2\epsilon \quad (9)$$

Theorem 1 shows: (1) when $\epsilon$ is small enough, the similarity $sim(P_1,P_2)$ is close to the similarity $sim(P_{1c},P_{2c})$; (2) once $\epsilon$ reaches a certain small value, the difference between the two similarities is already small and varies little even if $\epsilon$ continues to decrease. That is, once the noise-content ratio reaches a certain small value, further gains in the effectiveness of a near-duplicate detection algorithm will be small.

Without loss of generality, assume $\frac{|P_{2n}|}{|P_{2c}|} \le \frac{|P_{1n}|}{|P_{1c}|} = \epsilon$. Then Formula (9) can be rewritten as

$$-\frac{2|P_{1n}|}{|P_{1c}| + 2|P_{1n}|} \le sim(P_1,P_2) - sim(P_{1c},P_{2c}) \le \frac{2|P_{1n}|}{|P_{1c}|} \quad (10)$$

Formula (10) shows that $|P_{1c}|$ should be large for the bounds to be robust; otherwise, a slight change of $|P_{1c}|$ or $|P_{1n}|$ causes a drastic change of the upper and lower bounds, which means the algorithm is not robust. For example, assume two upper bounds of 5/100 and 5/100; after combining the feature sets, the upper bound becomes (5+5)/(100+100), which equals 5/100, but (5+1)/100 > (5+5+1)/(100+100). Obviously, (5+5)/(100+100) is more robust than 5/100, even though they have the same value.

In a word, when $\epsilon$ is relatively large, we can make the algorithm work better by following two rules: (a) select features that have a small noise-content ratio, to improve effectiveness; (b) when the noise-content ratios of two types of features are the same, select the feature type with the larger content-feature set, to make the algorithm more robust. This implies that if the noise-content ratios of several feature types are very close, these feature types should be combined to increase robustness while effectiveness changes little.
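As a quick sanity check of Theorem 1, the following sketch (a toy experiment, not part of the paper) draws random content and noise feature sets whose noise-content ratios are at most ε and verifies that sim(P1,P2) − sim(P1c,P2c) always falls inside the interval of Formula (9). The set sizes, the perturbation of the second core set, and ε = 0.2 are arbitrary illustrative choices.

```python
import random

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


def random_case(eps, content_size=200, universe=10_000):
    # P1c: random core features; P2c: a near-duplicate copy with ~10% perturbed
    p1c = set(random.sample(range(universe), content_size))
    p2c = set(p1c)
    for x in random.sample(sorted(p1c), content_size // 10):
        p2c.discard(x)
        p2c.add(random.randrange(universe, 2 * universe))

    def noise(core):
        # at most eps * |core| noise features, drawn disjointly from both cores
        k = random.randint(0, int(eps * len(core)))
        return {random.randrange(2 * universe, 3 * universe) for _ in range(k)}

    return p1c, noise(p1c), p2c, noise(p2c)


random.seed(0)
eps = 0.2
for _ in range(1000):
    p1c, p1n, p2c, p2n = random_case(eps)
    diff = jaccard(p1c | p1n, p2c | p2n) - jaccard(p1c, p2c)
    # Formula (9): -2*eps/(1+2*eps) <= sim(P1,P2) - sim(P1c,P2c) <= 2*eps
    assert -2 * eps / (1 + 2 * eps) - 1e-12 <= diff <= 2 * eps + 1e-12
print("Formula (9) bound held in 1000 random trials")
```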

4 AF_SpotSigs and SizeSpotSigs Algorithms

SpotSigs [19] provides a stopword feature designed to filter natural-language text passages out of noisy Web page components; that is, its noise-content ratio is small. This gives the intuition that we should choose features that tend to occur mostly in the core content of Web documents and skip over advertisements, banners, and navigational components. In this paper, based on the idea behind SpotSigs and our analysis in Section 3.2, we develop four features that all have a small noise-content ratio. Details are as follows:

1) Stopword feature. This is similar to the feature in SpotSigs, i.e., a string consisting of a stopword and its neighboring words, except that the stopwords differ because the languages differ. Because stopwords occur less often in noisy content than in core content, these features lower the noise-content ratio compared with Shingling features. The Chinese stopwords and their markers used in this paper are listed in Fig. 2.

2) Chinese punctuation feature. In English, many punctuation marks coincide with special characters of HTML, so punctuation cannot be used to extract features. In Chinese, however, this is not the case. Since Chinese punctuation marks occur rarely in noisy areas, we choose a string consisting of a punctuation mark and its neighboring words as the Chinese punctuation feature, which keeps the noise-content ratio small. The Chinese punctuation marks and the corresponding English punctuation marks used in this paper are also listed in Fig. 2.

3) Sentence feature. The string between two Chinese punctuation marks is treated as a sentence. Since punctuated sentences are rare in noisy areas, sentence features reduce the noise-content ratio notably.

4) Sentence shingling feature. Assuming the length of a sentence is n, all 1-grams, 2-grams, ..., (n-1)-grams are taken as new features. This enlarges the content-feature set for robustness and effectiveness, and it also keeps the noise-content ratio small, building on the sentence feature.

Fig. 2. Table of the meaning of Chinese punctuation marks and table of the markers of Chinese stopwords used in this paper.

The stopword feature is the one used by the state-of-the-art algorithm, SpotSigs [19]. Although the stopwords differ because the languages differ, we still call the algorithm SpotSigs. The experiments in Section 5.3 show that SpotSigs reaches 0.92 (F1) on long Web pages but only 0.62 on short Web pages; clearly, SpotSigs cannot handle short Web pages well, and a new algorithm is needed. When all four features are used to detect near duplication, we call the algorithm AF_SpotSigs. The experiments in Section 5.3 show that AF_SpotSigs reaches 0.77 (F1) against 0.62 (F1) for SpotSigs on short Web pages, but gains only 0.04 (F1) at 28.8 times the time overhead on long Web pages. In other words, AF_SpotSigs works much better than SpotSigs on short Web pages, while on long Web pages its effectiveness is only slightly better and its cost is much higher. Balancing efficiency against effectiveness, we propose an algorithm called SizeSpotSigs that uses only stopword features to judge near duplication for long Web pages (namely SpotSigs) and all four feature types described above for short Web pages (namely AF_SpotSigs).
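To illustrate how the four feature types and the size-based dispatch could fit together, here is a rough Python sketch. It assumes a simplified character-level tokenization for Chinese; the stopword subset (De1, Di, De2, Shi, Le), the punctuation set, the neighboring-word length of 2, and the size threshold are illustrative placeholders rather than the exact values used in the paper (the actual markers and punctuation marks are listed in Fig. 2, and the parameters are tuned in Section 5.2).

```python
import re

CN_STOPWORDS = {"的", "地", "得", "是", "了"}   # illustrative subset; see Fig. 2
CN_PUNCT = "，。！？；：、"                      # illustrative subset; see Fig. 2
NEIGHBOR_LEN = 2
SHORT_PAGE_THRESHOLD = 2000  # bytes of core content (hypothetical cut-off)


def _tokens(text):
    """Treat every non-whitespace character as one token (simplification)."""
    return [c for c in text if not c.isspace()]


def stopword_features(text):
    """1) Stopword feature: a stopword plus its NEIGHBOR_LEN following tokens."""
    toks = _tokens(text)
    return {"".join(toks[i:i + NEIGHBOR_LEN + 1])
            for i, t in enumerate(toks) if t in CN_STOPWORDS}


def punctuation_features(text):
    """2) Chinese punctuation feature: a punctuation mark plus its neighbors."""
    toks = _tokens(text)
    return {"".join(toks[i:i + NEIGHBOR_LEN + 1])
            for i, t in enumerate(toks) if t in CN_PUNCT}


def sentence_features(text):
    """3) Sentence feature: the strings between two Chinese punctuation marks."""
    return {s.strip() for s in re.split("[" + CN_PUNCT + "]", text) if s.strip()}


def sentence_shingle_features(text):
    """4) Sentence shingling: all 1-grams .. (n-1)-grams of each sentence."""
    feats = set()
    for sent in sentence_features(text):
        n = len(sent)
        for k in range(1, n):
            feats.update(sent[i:i + k] for i in range(n - k + 1))
    return feats


def af_spotsigs_features(text):
    """AF_SpotSigs uses the union of all four feature types."""
    return (stopword_features(text) | punctuation_features(text) |
            sentence_features(text) | sentence_shingle_features(text))


def size_spotsigs_features(page_text, estimated_core_size):
    """SizeSpotSigs: stopword features only for long pages (i.e. SpotSigs),
    all four feature types for short pages (i.e. AF_SpotSigs). How to estimate
    the core-content size automatically is left open in Section 6."""
    if estimated_core_size >= SHORT_PAGE_THRESHOLD:
        return stopword_features(page_text)
    return af_spotsigs_features(page_text)
```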

5 Experiment

5.1 Data Set

To verify our algorithms, AF_SpotSigs and SizeSpotSigs, we construct 4 datasets. Details are as follows:

Collection Shorter / Collection Longer: we construct Collection Shorter and Collection Longer manually. Collection Shorter has 379 short Web pages in 48 clusters, and Collection Longer has 332 long Web pages in 40 clusters.

Collection Mixer / Collection Mixer Purity: Collection Shorter and Collection Longer are mixed to form Collection Mixer, which includes 88 clusters and 711 Web pages in total. For each Web page in Collection Mixer, we extract its core content according to human judgment, which yields Collection Mixer Purity.

5.2 Choice of Stopwords

Because the number of stopwords is large (more than 370 in Chinese), we need to select the most representative stopwords to improve performance. SpotSigs, however, was only evaluated on English collections, so it is not known how to choose the stopwords or the length of their neighboring word chains on a Chinese collection. For AF_SpotSigs we likewise need to choose the stopwords and the chain length. We find that F1 varies only slightly, by about 1 absolute percent, from a chain length of 1 to 3 (figures omitted), so we choose two words as the chain-length parameter for both algorithms.

In this section, we seek the best combination of stopwords for AF_SpotSigs and SpotSigs on Chinese. We consider variations in the choice of SpotSigs antecedents (stopwords and their neighboring words), aiming for a good compromise between extracting characteristic signatures and avoiding over-fitting these signatures to particular articles or sites. For SpotSigs, which suits long Web pages, the best combination was searched for on Collection Longer Sample, built by sampling 1/3 of the clusters of Collection Longer. For AF_SpotSigs, which suits short Web pages, we tune the parameter on Collection Shorter Sample, built by sampling 1/3 of the clusters of Collection Shorter.

Fig. 3(a) shows that we obtain the best F1 result for SpotSigs with the combination of De1, Di, De2, Shi, Ba and Le, which mostly occur in core content and are less likely to occur in ads or navigational banners. For AF_SpotSigs, Fig. 3(b) shows that the best F1 result is obtained with the single stopword De1. Using a full stopword list (here, the most frequent 40 stopwords) tends to yield overly generic signatures but still performs significantly well.

5.3 AF_SpotSigs vs. SpotSigs

After obtaining the parameters of AF_SpotSigs and SpotSigs, we can compare the two algorithms in terms of both F1 and computing cost. The two algorithms are therefore run on Collection Shorter and Collection Longer for comparison.

Fig. 3. (a) The effectiveness of SpotSigs with different stopwords on Collection Longer; (b) the effectiveness of AF_SpotSigs with different stopwords on Collection Shorter.

Fig. 4 shows that the F1 scores of AF_SpotSigs are better than those of SpotSigs on both Shorter and Longer. Moreover, the F1 score of SpotSigs is far worse than that of AF_SpotSigs on Shorter, while the F1 scores of the two algorithms are very close on Longer. However, Table 1 shows that AF_SpotSigs takes much more time than SpotSigs. To balance effectiveness and efficiency, we can partition a collection into two parts, a short part and a long part: SpotSigs works on the long part while AF_SpotSigs runs on the short part. This is the SizeSpotSigs algorithm.

Fig. 4. The effectiveness of SpotSigs and AF_SpotSigs on Shorter and Longer.

Table 1. The F1 value and cost of the two algorithms

                           Shorter   Longer
SpotSigs      F1           0.6223    0.9214
              Time (Sec.)  1.743     1.812
AF_SpotSigs   F1           0.7716    0.9597
              Time (Sec.)  21.17     52.31

Fig. 5. F1 values of SizeSpotSigs, AF_SpotSigs and SpotSigs on Collection Mixer Purity (a) and Collection Mixer (b), as a function of the cluster partition point.

5.4 SizeSpotSigs over SpotSigs and AF_SpotSigs

To verify SizeSpotSigs, all clusters in Mixer are sorted from small to large by the average size of their core contents. We select three partition points (22, 44, 66) to split the set of clusters. For example, if the partition point is 22, the first 22 clusters in the sorted order are taken as the small part while the remaining clusters form the large part. Table 2 shows the nature of the two parts for every partition. In particular, 0/88 means that all clusters go into the large part, which makes SizeSpotSigs become SpotSigs, while 88/0 means that all clusters belong to the small part, which makes SizeSpotSigs become AF_SpotSigs.

Fig. 5(b) shows that SizeSpotSigs works better than SpotSigs but worse than AF_SpotSigs. Moreover, the F1 value of SizeSpotSigs increases as the partition point increases. When the purified collection is used, the noise-content ratio is zero, so by Formula (9) $sim(P_1,P_2) = sim(P_{1c},P_{2c})$ and the F1 value depends entirely on $sim(P_{1c},P_{2c})$. Fig. 5(a) shows that the F1 of SizeSpotSigs rises and falls in an irregular manner but stays within a reasonable interval, always above 0.91. All details are listed in Table 3.

Table 2. The nature of the partitions

Partition        0/88        22/66            44/44            66/22             88/0
Avg Size (Byte)  0/2189.41   607.65/2561.43   898.24/3247.73   1290.25/4421.20   2189.41/0
File Num         0/711       136/575          321/390          514/197           711/0

Table 3. The F1 value and time for the 3 algorithms on the partitions (s is Sec.)

                          SpotSigs   AF_SpotSigs   SizeSpotSigs   SizeSpotSigs   SizeSpotSigs
                          (0/88)     (88/0)        (22/66)        (44/44)        (66/22)
Mixer         F1          0.6957     0.8216        0.7530         0.7793         0.8230
              Time (s)    3.6094     148.20        7.142          22.81          61.13
Mixer Purity  F1          0.9360     0.9122        0.9580         0.9306         0.9165
              Time (s)    2.2783     134.34        4.0118         15.99          47.00

6 Conclusions and Future Works

We analyzed the relation between noise-content ratio and similarity theoretically, which led to two rules that can make near-duplicate detection algorithms work better. The paper then proposed 3 new features to improve effectiveness and robustness on short Web pages, which led to our AF_SpotSigs method. Experiments confirm that the 3 new features are effective and that AF_SpotSigs works 15% better than the state-of-the-art method for short Web pages. Besides, SizeSpotSigs, which considers the size of the page core content, performs better than SpotSigs over different partition points. Future work will focus on 1) how to decide the size of the core content of a Web page automatically or approximately, and 2) designing more features that suit short Web pages to improve effectiveness, as well as generalizing the bounding approach toward other metrics such as Cosine.

Acknowledgments. This work is supported by NSFC Grant No. 70903008, 60933004 and 61073082, and FSSP 2010 Grant No. 15. We thank Jing He and Dongdong Shan for a quick review of our paper close to the submission deadline.

References

1. Agarwal, A., Koppula, H., Leela, K., Chitrapura, K., Garg, S., GM, P., Haty, C., Roy, A., Sasturkar, A.: URL normalization for de-duplication of web pages. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1987-1990. ACM, New York (2009)
2. Baeza-Yates, R., Ribeiro-Neto, B., et al.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
3. Bawa, M., Condie, T., Ganesan, P.: LSH forest: self-tuning indexes for similarity search. In: Proceedings of the 14th International Conference on World Wide Web, p. 660. ACM, New York (2005)
4. Brin, S., Davis, J., Garcia-Molina, H.: Copy detection mechanisms for digital documents. ACM SIGMOD Record 24(2), 409 (1995)
5. Broder, A.: Identifying and filtering near-duplicate documents. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 1-10. Springer, Heidelberg (2000)

6. Broder, A., Glassman, S., Manasse, M., Zweig, G.: Syntactic clustering of the web. Computer Networks and ISDN Systems 29(8-13), 1157-1166 (1997)
7. Buttcher, S., Clarke, C.: A document-centric approach to static index pruning in text retrieval systems. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, p. 189. ACM, New York (2006)
8. Charikar, M.: Similarity estimation techniques from rounding algorithms. In: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, p. 388. ACM, New York (2002)
9. Chowdhury, A., Frieder, O., Grossman, D., McCabe, M.: Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS) 20(2), 191 (2002)
10. Dasgupta, A., Kumar, R., Sasturkar, A.: De-duping URLs via rewrite rules. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 186-194. ACM, New York (2008)
11. Datar, M., Gionis, A., Indyk, P., Motwani, R., Ullman, J., et al.: Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering 13(1) (2001)
12. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases, pp. 518-529. Morgan Kaufmann Publishers Inc., San Francisco (1999)
13. Henzinger, M.: Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 284-291. ACM, New York (2006)
14. Hoad, T., Zobel, J.: Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology 54(3), 203-215 (2003)
15. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604-613. ACM, New York (1998)
16. Kolcz, A., Chowdhury, A., Alspector, J.: Improved robustness of signature-based near-replica detection via lexicon randomization. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p. 610. ACM, New York (2004)
17. Koppula, H., Leela, K., Agarwal, A., Chitrapura, K., Garg, S., Sasturkar, A.: Learning URL patterns for webpage de-duplication. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 381-390. ACM, New York (2010)
18. Manber, U.: Finding similar files in a large file system. In: Proceedings of the USENIX Winter 1994 Technical Conference, San Francisco, CA, USA, pp. 1-10 (1994)
19. Theobald, M., Siddharth, J., Paepcke, A.: SpotSigs: robust and efficient near duplicate detection in large web collections. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 563-570. ACM, New York (2008)
20. Whitten, A.: Scalable document fingerprinting. In: The USENIX Workshop on E-Commerce (1996)
21. Yang, H., Callan, J.: Near-duplicate detection by instance-level constrained clustering. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p. 428. ACM, New York (2006)