SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content

Similar documents
A Fast Text Similarity Measure for Large Document Collections Using Multi-reference Cosine and Genetic Algorithm

SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections

Detecting Near-Duplicates in Large-Scale Short Text Databases

Achieving both High Precision and High Recall in Near-duplicate Detection

Combinatorial Algorithms for Web Search Engines - Three Success Stories

SpotSigs: Near-Duplicate Detection in Web Page Collections

Adaptive Near-Duplicate Detection via Similarity Learning

Detection of Distinct URL and Removing DUST Using Multiple Alignments of Sequences

HYBRIDIZED MODEL FOR EFFICIENT MATCHING AND DATA PREDICTION IN INFORMATION RETRIEVAL

A LITERATURE SURVEY ON WEB CRAWLERS

Automated Path Ascend Forum Crawling

The Chinese Duplicate Web Pages Detection Algorithm based on Edit Distance

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

CADIAL Search Engine at INEX

Mining Quantitative Association Rules on Overlapped Intervals

Automation of URL Discovery and Flattering Mechanism in Live Forum Threads

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage

A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING

From Passages into Elements in XML Retrieval

Candidate Document Retrieval for Arabic-based Text Reuse Detection on the Web

Topic: Duplicate Detection and Similarity Computing

PROBABILISTIC SIMHASH MATCHING. A Thesis SADHAN SOOD

Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms

Clustering-Based Distributed Precomputation for Quality-of-Service Routing*

Metric Learning Applied for Automatic Large Image Classification

Making Retrieval Faster Through Document Clustering

A Language Independent Author Verifier Using Fuzzy C-Means Clustering

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Enabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines

Semantic Website Clustering

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique

Chapter 27 Introduction to Information Retrieval and Web Search

A Two-Tier Distributed Full-Text Indexing System

Towards a hybrid approach to Netflix Challenge

ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES

New Issues in Near-duplicate Detection

Exploiting Index Pruning Methods for Clustering XML Collections

A New Measure of the Cluster Hypothesis

Ternary Tree Optimalization for n-gram Indexing

Text Clustering Incremental Algorithm in Sensitive Topic Detection

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

A Supervised Method for Multi-keyword Web Crawling on Web Forums

Joint Entity Resolution

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Mining di Dati Web. Lezione 3 - Clustering and Classification

Index Terms:- Document classification, document clustering, similarity measure, accuracy, classifiers, clustering algorithms.

Learning the Three Factors of a Non-overlapping Multi-camera Network Topology

Web page recommendation using a stochastic process model

Finding Similar Items:Nearest Neighbor Search

Plagiarism detection by similarity join

Automatic Query Type Identification Based on Click Through Information

Online Document Clustering Using the GPU

Ranking Clustered Data with Pairwise Comparisons

Algorithms for Nearest Neighbors

A Model for Interactive Web Information Retrieval

C-NBC: Neighborhood-Based Clustering with Constraints

Cardinality Estimation: An Experimental Survey

Search Engines. Information Retrieval in Practice

Selection of n in K-Means Algorithm

Use of Locality Sensitive Hashing (LSH) Algorithm to Match Web of Science and SCOPUS

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS

A Modular k-nearest Neighbor Classification Method for Massively Parallel Text Categorization

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Mining High Average-Utility Itemsets

Static Index Pruning for Information Retrieval Systems: A Posting-Based Approach

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

Duplicate News Story Detection Revisited

Ranking Web Pages by Associating Keywords with Locations

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Encoding Words into String Vectors for Word Categorization

Using Statistical Properties of Text to Create. Metadata. Computer Science and Electrical Engineering Department

ResPubliQA 2010

STUDYING OF CLASSIFYING CHINESE SMS MESSAGES

Live Virtual Machine Migration with Efficient Working Set Prediction

Semi-Supervised Clustering with Partial Background Information

Leveraging Set Relations in Exact Set Similarity Join

Meshlization of Irregular Grid Resource Topologies by Heuristic Square-Packing Methods

REDUNDANCY REMOVAL IN WEB SEARCH RESULTS USING RECURSIVE DUPLICATION CHECK ALGORITHM. Pudukkottai, Tamil Nadu, India

Distance-based Outlier Detection: Consolidation and Renewed Bearing

A Clustering Framework to Build Focused Web Crawlers for Automatic Extraction of Cultural Information

Locality Preserving Scheme of Text Databases Representative in Distributed Information Retrieval Systems

K-Means Based Matching Algorithm for Multi-Resolution Feature Descriptors

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

FOCUS: ADAPTING TO CRAWL INTERNET FORUMS

Weighted Suffix Tree Document Model for Web Documents Clustering

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Classification of Page to the aspect of Crawl Web Forum and URL Navigation

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University

Compact Encoding of the Web Graph Exploiting Various Power Laws

Frame based Video Retrieval using Video Signatures

International Journal of Advanced Research in Computer Science and Software Engineering

Crawler with Search Engine based Simple Web Application System for Forum Mining

Online Stochastic Matching CMSC 858F: Algorithmic Game Theory Fall 2010

Improving Suffix Tree Clustering Algorithm for Web Documents

Application of Support Vector Machine Algorithm in Spam Filtering

Fast or furious? - User analysis of SF Express Inc

A Comparison of Algorithms used to measure the Similarity between two documents

Predictive Indexing for Fast Search

Transcription:

SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content

Xianling Mao, Xiaobing Liu, Nan Di, Xiaoming Li, and Hongfei Yan
Department of Computer Science and Technology, Peking University
{mxl,lxb,dn,lxm,yhf}@net.pku.edu.cn

Abstract. Detecting whether two Web pages are near replicas, in terms of their contents rather than their files, is of great importance in many Web-information-based applications, and many deduplication algorithms have been proposed. Nevertheless, analysis and experiments show that existing algorithms usually do not work well for short Web pages¹, because a relatively large portion of the corresponding files consists of noisy information, such as ads and site templates. In this paper, we analyze the critical issues in deduplicating short Web pages and present an algorithm (AF_SpotSigs) that incorporates them, which works 15% better than the state-of-the-art method. We then propose an algorithm (SizeSpotSigs) that takes the size of the page content into account and can handle both short and long Web pages. The contributions of SizeSpotSigs are three-fold: 1) we provide an analysis of the relation between noise-content ratio and similarity, and propose two rules for making such methods work better; 2) based on this analysis, we propose 3 new features for Chinese that improve effectiveness on short Web pages; 3) we present an algorithm named SizeSpotSigs for near-duplicate detection that considers the size of the core content of a Web page. Experiments confirm that SizeSpotSigs works better than state-of-the-art approaches such as SpotSigs over a demonstrative Mixer collection of manually assessed near-duplicate news articles, which includes both short and long Web pages.

Keywords: Deduplicate, Near Duplicate Detection, AF_SpotSigs, SizeSpotSigs, Information Retrieval.

1 Introduction

Detection of duplicate or near-duplicate Web pages is an important and difficult problem for Web search engines. Many algorithms have been proposed in recent years [6,8,20,13,18]. Most approaches can be characterized as different types of distance or overlap measures operating on the HTML strings. State-of-the-art algorithms, such as Broder et al.'s [2] and Charikar's [3], achieve reasonable precision or recall. In particular, SpotSigs [19] can avoid the step of removing noise from a Web page because of its smart feature selection.

¹ In this paper, Web pages are classified into long (Web) pages and short (Web) pages based on their core content size.

J.Z. Huang, L. Cao, and J. Srivastava (Eds.): PAKDD 2011, Part I, LNAI 6634, pp. 537-548, 2011. © Springer-Verlag Berlin Heidelberg 2011

Existing deduplication algorithms do not take the size of the page's core content into account. Essentially, these algorithms are better suited to long Web pages, because they use only surface features to represent documents. For short documents, however, this representation is not sufficient, and it becomes even worse when the documents contain noisy information such as ads. Our experiments in Section 5.3 also show that the state-of-the-art deduplication algorithm performs relatively poorly on short Web pages, reaching only 0.62 (F1) against 0.92 (F1) for long Web pages. In fact, there is a large number of short Web pages on the World Wide Web whose core content is duplicated, and they are often important: consider, for example, a central bank announcement such as an interest rate adjustment. Fig. 1 shows a pair of same-core Web pages that differ only in framing, advertisements, and navigational banners. Both articles exhibit almost identical core content, reporting on the match review between Uruguay and the Netherlands.

Fig. 1. Near-duplicate Web pages: identical core content with different framing and banners (additional ads and related links removed); the core contents are short.

So it is important and necessary to improve the effectiveness of deduplication for short Web pages.

1.1 Contribution

1. We analyze the relation between noise-content ratio and similarity, and propose two rules for making such methods work better;
2. Based on our analysis, we propose 3 new features for Chinese that improve effectiveness on short Web pages, which leads to the AF_SpotSigs algorithm;
3. We present an algorithm named SizeSpotSigs for near-duplicate detection that considers the size of the core content of a Web page.

2 Related Work

There are two families of methods for near-duplicate detection: content-based methods and non-content-based methods. Content-based methods detect near duplicates by computing similarity between the contents of documents, while non-content-based methods use non-content features [10,1,17] (e.g., URL patterns). Non-content-based methods can only detect near-duplicate pages within one web site, while content-based methods have no such limitation. Content-based algorithms can be further divided into two groups according to whether they require noise removal; most existing content-based deduplication algorithms require it.

Broder et al. [6] proposed the DSC algorithm (also called Shingling), which detects near duplicates by computing similarity among the shingle sets of the documents. The similarity between two documents is computed with the common Jaccard overlap measure over their shingle sets. To reduce the complexity of Shingling on large collections, DSC-SS (also called super shingles) was later proposed by Broder [5]. DSC-SS uses meta-shingles, i.e., shingles of shingles, with only a small decrease in precision. A variety of methods for choosing good shingles are investigated by Hoad and Zobel [14]. Buttcher and Clarke [7] focus on Kullback-Leibler divergence in the more general context of search. A large-scale evaluation was carried out by Henzinger [13] to compare the precision of the Shingling and SimHash algorithms, with their parameters adjusted to maintain almost the same recall. The experiments show that neither algorithm works well for finding near-duplicate pairs on the same site, because of the influence of templates, while both achieve higher precision for near-duplicate pairs on different sites. Yang and Callan [21] proposed that near-duplicate clustering should incorporate information about document attributes or the content structure.

Another widespread duplicate detection technique is to generate a document fingerprint, a compact description of the document, and then compute pair-wise similarity between document fingerprints, under the assumption that fingerprints can be compared much more quickly than complete documents. A common method of generating fingerprints is to select a set of character sequences from a document and to generate a fingerprint based on the hash values of these sequences. Similarity between two documents is measured by the Jaccard formula.

Different algorithms are characterized, and their computational costs determined, by the hash functions and by how the character sequences are selected. Manber [18] was the first to study this problem. The I-Match algorithm [9,16] uses external collection statistics and increases recall by using multiple fingerprints per document. Position-based schemes [4] select strings based on their offset in a document. Broder et al. [6] pick strings whose hash values are multiples of an integer. Indyk and Motwani [12,15] proposed Locality Sensitive Hashing (LSH), an approximate similarity search technique that scales to both large and high-dimensional data sets. There are many variants of LSH, such as LSH-Tree [3] and Hamming-LSH [11].

Generally, noise removal is an expensive operation, so a near-duplicate detection algorithm should avoid it if possible. Martin Theobald et al. proposed the SpotSigs algorithm [19], which uses word chains around stopwords as the features of a document. For example, consider the sentence: "On a street in Milton, in the city's inner-west, one woman wept as she toured her waterlogged home." Choosing the articles "a", "an", "the" and the verb "is" as antecedents, with a uniform spot distance of 1 and chain length of 2, we obtain the set of spot signatures S = {a:street:milton, the:city's:inner-west}. SpotSigs needs only a single pass over a corpus, which makes it efficient, easy to implement, and less error-prone, because expensive layout analysis is omitted; meanwhile, it remains largely independent of the input format. This method is taken as our baseline. In this paper, considering these merits, we focus on algorithms that do not require noise removal, and we also take the Jaccard overlap measure as our similarity measure.
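To make the two building blocks used below concrete, the following Python sketch extracts spot signatures for the example sentence above and computes the Jaccard overlap between feature sets. It is a minimal illustration only: the tokenizer, the small skip-word list used while forming chains, and the fixed spot distance of 1 are simplifying assumptions, not the exact SpotSigs implementation.

```python
import re

ANTECEDENTS = {"a", "an", "the", "is"}
# hypothetical skip list: words that never enter a chain
SKIP_WORDS = ANTECEDENTS | {"on", "in", "of", "to", "at", "as", "by", "for"}
CHAIN_LENGTH = 2


def spot_signatures(text, chain=CHAIN_LENGTH):
    """Return the set of spot signatures of a text (spot distance fixed to 1)."""
    tokens = re.findall(r"[\w'-]+", text.lower())
    signatures = set()
    for i, tok in enumerate(tokens):
        if tok not in ANTECEDENTS:
            continue
        # chain: the next `chain` non-skip-word tokens after the antecedent
        following = [t for t in tokens[i + 1:] if t not in SKIP_WORDS][:chain]
        if len(following) == chain:
            signatures.add(":".join([tok] + following))
    return signatures


def jaccard(p1, p2):
    """sim(P1, P2) = |P1 ∩ P2| / |P1 ∪ P2|."""
    return len(p1 & p2) / len(p1 | p2) if (p1 or p2) else 1.0


sentence = ("On a street in Milton, in the city's inner-west, "
            "one woman wept as she toured her waterlogged home.")
sigs = spot_signatures(sentence)
print(sigs)                 # e.g. {'a:street:milton', "the:city's:inner-west"}
print(jaccard(sigs, sigs))  # identical sets -> 1.0
```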

3 Relation between Noise-Content Ratio and Similarity

3.1 Concepts and Notation

To calculate similarity, we need to extract features from Web pages. We define all the features extracted from one page as its page-feature set, and we split this set into a content-feature set and a noise-feature set. A feature that comes from the core content of the page is called a content feature (element) and belongs to the content-feature set; otherwise, it is called a noise feature (element) and belongs to the noise-feature set. The noise-content (feature) ratio is the ratio between the size of the noise-feature set and the size of the content-feature set.

3.2 Theoretical Analysis

Let $sim(P_1,P_2) = |P_1 \cap P_2| / |P_1 \cup P_2|$ be the default Jaccard similarity defined over two sets $P_1$ and $P_2$, each consisting of the distinct page-feature set of a page in our case. $P_{1c}$ and $P_{2c}$ are the content-feature sets, and $P_{1n}$ and $P_{2n}$ are the noise-feature sets, which satisfy $P_{1c} \cup P_{1n} = P_1$ and $P_{2c} \cup P_{2n} = P_2$. The similarity between $P_{1c}$ and $P_{2c}$ is $sim(P_{1c},P_{2c}) = |P_{1c} \cap P_{2c}| / |P_{1c} \cup P_{2c}|$, which is the value we actually care about in near-duplicate detection.

In fact, near-duplicate detection is a comparison of the similarity of the core contents of two pages, but Web pages contain a lot of noisy content, such as banners and ads. Most algorithms therefore use $sim(P_1,P_2)$ to approximate $sim(P_{1c},P_{2c})$: if $sim(P_1,P_2)$ is close to $sim(P_{1c},P_{2c})$, the near-duplicate detection algorithm works well, and vice versa. To describe the difference between $sim(P_1,P_2)$ and $sim(P_{1c},P_{2c})$, we obtain Theorem 1 as follows:

Theorem 1. Given two sets $P_1$ and $P_2$ such that $P_{1c} \subseteq P_1$, $P_{1n} \subseteq P_1$ and $P_{1c} \cup P_{1n} = P_1$, and similarly $P_{2c} \subseteq P_2$, $P_{2n} \subseteq P_2$ and $P_{2c} \cup P_{2n} = P_2$; at the same time, $sim(P_1,P_2) = |P_1 \cap P_2| / |P_1 \cup P_2|$ and $sim(P_{1c},P_{2c}) = |P_{1c} \cap P_{2c}| / |P_{1c} \cup P_{2c}|$. Let the noise-content ratios satisfy $\frac{|P_{1n}|}{|P_{1c}|} \le \epsilon$ and $\frac{|P_{2n}|}{|P_{2c}|} \le \epsilon$, where $\epsilon$ is a small number. Then

$$-\frac{2\epsilon}{1+2\epsilon} \le sim(P_1,P_2) - sim(P_{1c},P_{2c}) \le 2\epsilon \quad (1)$$

Proof: Let $A = |P_{1c} \cap P_{2c}|$ and $B = |P_{1c} \cup P_{2c}|$. Then

$$A \le |(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})| \le A + 2\max\{|P_{1n}|,|P_{2n}|\} \quad (2)$$

$$B \le |(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})| \le B + 2\max\{|P_{1n}|,|P_{2n}|\} \quad (3)$$

From (2) and (3), we get the following inequality:

$$\frac{A}{B + 2\max\{|P_{1n}|,|P_{2n}|\}} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} \le \frac{A + 2\max\{|P_{1n}|,|P_{2n}|\}}{B} \quad (4)$$

From (4), we get the following inequality:

$$-\frac{2A\max\{|P_{1n}|,|P_{2n}|\}}{B\,(B + 2\max\{|P_{1n}|,|P_{2n}|\})} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} - \frac{A}{B} \le \frac{2\max\{|P_{1n}|,|P_{2n}|\}}{B} \quad (5)$$

Obviously, $A \le B$ and $B \ge \max\{|P_{1c}|,|P_{2c}|\}$. So we get

$$\frac{\max\{|P_{1n}|,|P_{2n}|\}}{B} \le \frac{\max\{|P_{1n}|,|P_{2n}|\}}{\max\{|P_{1c}|,|P_{2c}|\}} \le \epsilon \quad (6)$$

Another inequality is

$$\frac{2A\max\{|P_{1n}|,|P_{2n}|\}}{B\,(B + 2\max\{|P_{1n}|,|P_{2n}|\})} = \frac{2A}{B}\cdot\frac{\max\{|P_{1n}|,|P_{2n}|\}}{B + 2\max\{|P_{1n}|,|P_{2n}|\}} \le \frac{2\max\{|P_{1n}|,|P_{2n}|\}}{B + 2\max\{|P_{1n}|,|P_{2n}|\}} \le \frac{2\max\{|P_{1n}|,|P_{2n}|\}}{\max\{|P_{1c}|,|P_{2c}|\} + 2\max\{|P_{1n}|,|P_{2n}|\}} = \frac{2\,\frac{\max\{|P_{1n}|,|P_{2n}|\}}{\max\{|P_{1c}|,|P_{2c}|\}}}{1 + 2\,\frac{\max\{|P_{1n}|,|P_{2n}|\}}{\max\{|P_{1c}|,|P_{2c}|\}}} \le \frac{2\epsilon}{1+2\epsilon} \quad (7)$$

So, (5) can be rewritten as

$$-\frac{2\epsilon}{1+2\epsilon} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} - \frac{A}{B} \le 2\epsilon \quad (8)$$

That is,

$$-\frac{2\epsilon}{1+2\epsilon} \le sim(P_1,P_2) - sim(P_{1c},P_{2c}) \le 2\epsilon \quad (9)$$

Theorem 1 shows: (1) when $\epsilon$ is small enough, the similarity $sim(P_1,P_2)$ is close to the similarity $sim(P_{1c},P_{2c})$; (2) once $\epsilon$ reaches a certain small value, the difference between the two similarities is already small and varies little even if $\epsilon$ continues to decrease. That is, once the noise-content ratio reaches a certain small value, further gains in the effectiveness of a near-duplicate detection algorithm will be small.

Without loss of generality, assume $\frac{|P_{2n}|}{|P_{2c}|} \le \frac{|P_{1n}|}{|P_{1c}|} = \epsilon$. Then Formula (9) can be rewritten as

$$-\frac{2|P_{1n}|}{|P_{1c}| + 2|P_{1n}|} \le sim(P_1,P_2) - sim(P_{1c},P_{2c}) \le \frac{2|P_{1n}|}{|P_{1c}|} \quad (10)$$

Formula (10) shows that $|P_{1c}|$ should be large for the bounds to be robust; otherwise, a slight change of $|P_{1c}|$ or $|P_{1n}|$ causes a drastic change of the upper and lower bounds, which means the algorithm is not robust. For example, assume two upper bounds of 5/100 and 5/100; after combining the feature sets, the upper bound becomes (5+5)/(100+100), which equals 5/100, but (5+1)/100 > (5+5+1)/(100+100). Obviously, (5+5)/(100+100) is more robust than 5/100, even though they have the same value.

In a word, when $\epsilon$ is relatively large, we can make the algorithm work better by following two rules: (a) select features that have a small noise-content ratio, to improve effectiveness; (b) when the noise-content ratios of two types of features are the same, select the feature type with the larger content-feature set, to make the algorithm more robust. This implies that if the noise-content ratios of several feature types are very close, these feature types should be combined to increase robustness while effectiveness changes little.
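As a quick sanity check of Theorem 1, the following sketch (a toy experiment, not part of the paper) draws random content and noise feature sets whose noise-content ratios are at most ε and verifies that sim(P1,P2) − sim(P1c,P2c) always falls inside the interval of Formula (9). The set sizes, the perturbation of the second core set, and ε = 0.2 are arbitrary illustrative choices.

```python
import random

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


def random_case(eps, content_size=200, universe=10_000):
    # P1c: random core features; P2c: a near-duplicate copy with ~10% perturbed
    p1c = set(random.sample(range(universe), content_size))
    p2c = set(p1c)
    for x in random.sample(sorted(p1c), content_size // 10):
        p2c.discard(x)
        p2c.add(random.randrange(universe, 2 * universe))

    def noise(core):
        # at most eps * |core| noise features, drawn disjointly from both cores
        k = random.randint(0, int(eps * len(core)))
        return {random.randrange(2 * universe, 3 * universe) for _ in range(k)}

    return p1c, noise(p1c), p2c, noise(p2c)


random.seed(0)
eps = 0.2
for _ in range(1000):
    p1c, p1n, p2c, p2n = random_case(eps)
    diff = jaccard(p1c | p1n, p2c | p2n) - jaccard(p1c, p2c)
    # Formula (9): -2*eps/(1+2*eps) <= sim(P1,P2) - sim(P1c,P2c) <= 2*eps
    assert -2 * eps / (1 + 2 * eps) - 1e-12 <= diff <= 2 * eps + 1e-12
print("Formula (9) bound held in 1000 random trials")
```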

4 AF_SpotSigs and SizeSpotSigs Algorithms

SpotSigs [19] provides a stopword feature designed to filter natural-language text passages out of noisy Web page components; that is, its noise-content ratio is small. This gives the intuition that we should choose features that tend to occur mostly in the core content of Web documents and skip over advertisements, banners, and navigational components. In this paper, based on the idea behind SpotSigs and our analysis in Section 3.2, we develop four features that all have a small noise-content ratio. Details are as follows:

1) Stopword feature. This is similar to the feature in SpotSigs, i.e., a string consisting of a stopword and its neighboring words, except that the stopwords differ because the languages differ. Because stopwords occur less often in noisy content than in core content, these features lower the noise-content ratio compared with Shingling features. The Chinese stopwords and their markers used in this paper are listed in Fig. 2.

2) Chinese punctuation feature. In English, many punctuation marks coincide with special characters of HTML, so punctuation cannot be used to extract features. In Chinese, however, this is not the case. Since Chinese punctuation marks occur rarely in noisy areas, we choose a string consisting of a punctuation mark and its neighboring words as the Chinese punctuation feature, which keeps the noise-content ratio small. The Chinese punctuation marks and the corresponding English punctuation marks used in this paper are also listed in Fig. 2.

3) Sentence feature. The string between two Chinese punctuation marks is treated as a sentence. Since punctuated sentences are rare in noisy areas, sentence features reduce the noise-content ratio notably.

4) Sentence shingling feature. Assuming the length of a sentence is n, all 1-grams, 2-grams, ..., (n-1)-grams are taken as new features. This enlarges the content-feature set for robustness and effectiveness, and it also keeps the noise-content ratio small, building on the sentence feature.

Fig. 2. Table of the meaning of Chinese punctuation marks and table of the markers of Chinese stopwords used in this paper.

The stopword feature is the one used by the state-of-the-art algorithm, SpotSigs [19]. Although the stopwords differ because the languages differ, we still call the algorithm SpotSigs. The experiments in Section 5.3 show that SpotSigs reaches 0.92 (F1) on long Web pages but only 0.62 on short Web pages; clearly, SpotSigs cannot handle short Web pages well, and a new algorithm is needed. When all four features are used to detect near duplication, we call the algorithm AF_SpotSigs. The experiments in Section 5.3 show that AF_SpotSigs reaches 0.77 (F1) against 0.62 (F1) for SpotSigs on short Web pages, but gains only 0.04 (F1) at 28.8 times the time overhead on long Web pages. In other words, AF_SpotSigs works much better than SpotSigs on short Web pages, while on long Web pages its effectiveness is only slightly better and its cost is much higher. Balancing efficiency against effectiveness, we propose an algorithm called SizeSpotSigs that uses only stopword features to judge near duplication for long Web pages (namely SpotSigs) and all four feature types described above for short Web pages (namely AF_SpotSigs).
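To illustrate how the four feature types and the size-based dispatch could fit together, here is a rough Python sketch. It assumes a simplified character-level tokenization for Chinese; the stopword subset (De1, Di, De2, Shi, Le), the punctuation set, the neighboring-word length of 2, and the size threshold are illustrative placeholders rather than the exact values used in the paper (the actual markers and punctuation marks are listed in Fig. 2, and the parameters are tuned in Section 5.2).

```python
import re

CN_STOPWORDS = {"的", "地", "得", "是", "了"}   # illustrative subset; see Fig. 2
CN_PUNCT = "，。！？；：、"                      # illustrative subset; see Fig. 2
NEIGHBOR_LEN = 2
SHORT_PAGE_THRESHOLD = 2000  # bytes of core content (hypothetical cut-off)


def _tokens(text):
    """Treat every non-whitespace character as one token (simplification)."""
    return [c for c in text if not c.isspace()]


def stopword_features(text):
    """1) Stopword feature: a stopword plus its NEIGHBOR_LEN following tokens."""
    toks = _tokens(text)
    return {"".join(toks[i:i + NEIGHBOR_LEN + 1])
            for i, t in enumerate(toks) if t in CN_STOPWORDS}


def punctuation_features(text):
    """2) Chinese punctuation feature: a punctuation mark plus its neighbors."""
    toks = _tokens(text)
    return {"".join(toks[i:i + NEIGHBOR_LEN + 1])
            for i, t in enumerate(toks) if t in CN_PUNCT}


def sentence_features(text):
    """3) Sentence feature: the strings between two Chinese punctuation marks."""
    return {s.strip() for s in re.split("[" + CN_PUNCT + "]", text) if s.strip()}


def sentence_shingle_features(text):
    """4) Sentence shingling: all 1-grams .. (n-1)-grams of each sentence."""
    feats = set()
    for sent in sentence_features(text):
        n = len(sent)
        for k in range(1, n):
            feats.update(sent[i:i + k] for i in range(n - k + 1))
    return feats


def af_spotsigs_features(text):
    """AF_SpotSigs uses the union of all four feature types."""
    return (stopword_features(text) | punctuation_features(text) |
            sentence_features(text) | sentence_shingle_features(text))


def size_spotsigs_features(page_text, estimated_core_size):
    """SizeSpotSigs: stopword features only for long pages (i.e. SpotSigs),
    all four feature types for short pages (i.e. AF_SpotSigs). How to estimate
    the core-content size automatically is left open in Section 6."""
    if estimated_core_size >= SHORT_PAGE_THRESHOLD:
        return stopword_features(page_text)
    return af_spotsigs_features(page_text)
```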

5 Experiment

5.1 Data Set

To verify our algorithms, AF_SpotSigs and SizeSpotSigs, we construct 4 datasets. Details are as follows:

Collection Shorter / Collection Longer: we construct Collection Shorter and Collection Longer manually. Collection Shorter has 379 short Web pages in 48 clusters, and Collection Longer has 332 long Web pages in 40 clusters.

Collection Mixer / Collection Mixer Purity: Collection Shorter and Collection Longer are mixed to form Collection Mixer, which includes 88 clusters and 711 Web pages in total. For each Web page in Collection Mixer, we extract its core content according to human judgment, which yields Collection Mixer Purity.

5.2 Choice of Stopwords

Because the number of stopwords is large (more than 370 in Chinese), we need to select the most representative stopwords to improve performance. SpotSigs, however, was only evaluated on English collections, so it is not known how to choose the stopwords or the length of their neighboring word chains on a Chinese collection. For AF_SpotSigs we likewise need to choose the stopwords and the chain length. We find that F1 varies only slightly, by about 1 absolute percent, from a chain length of 1 to 3 (figures omitted), so we choose two words as the chain-length parameter for both algorithms.

In this section, we seek the best combination of stopwords for AF_SpotSigs and SpotSigs on Chinese. We consider variations in the choice of SpotSigs antecedents (stopwords and their neighboring words), aiming for a good compromise between extracting characteristic signatures and avoiding over-fitting these signatures to particular articles or sites. For SpotSigs, which suits long Web pages, the best combination was searched for on Collection Longer Sample, built by sampling 1/3 of the clusters of Collection Longer. For AF_SpotSigs, which suits short Web pages, we tune the parameter on Collection Shorter Sample, built by sampling 1/3 of the clusters of Collection Shorter.

Fig. 3(a) shows that we obtain the best F1 result for SpotSigs with the combination of De1, Di, De2, Shi, Ba and Le, which mostly occur in core content and are less likely to occur in ads or navigational banners. For AF_SpotSigs, Fig. 3(b) shows that the best F1 result is obtained with the single stopword De1. Using a full stopword list (here, the most frequent 40 stopwords) tends to yield overly generic signatures but still performs significantly well.

5.3 AF_SpotSigs vs. SpotSigs

After obtaining the parameters of AF_SpotSigs and SpotSigs, we can compare the two algorithms in terms of both F1 and computing cost. The two algorithms are therefore run on Collection Shorter and Collection Longer for comparison.

Fig. 3. (a) The effectiveness of SpotSigs with different stopwords on Collection Longer; (b) the effectiveness of AF_SpotSigs with different stopwords on Collection Shorter.

Fig. 4 shows that the F1 scores of AF_SpotSigs are better than those of SpotSigs on both Shorter and Longer. Moreover, the F1 score of SpotSigs is far worse than that of AF_SpotSigs on Shorter, while the F1 scores of the two algorithms are very close on Longer. However, Table 1 shows that AF_SpotSigs takes much more time than SpotSigs. To balance effectiveness and efficiency, we can partition a collection into two parts, a short part and a long part: SpotSigs works on the long part while AF_SpotSigs runs on the short part. This is the SizeSpotSigs algorithm.

Fig. 4. The effectiveness of SpotSigs and AF_SpotSigs on Shorter and Longer.

Table 1. The F1 value and cost of the two algorithms

                           Shorter   Longer
SpotSigs      F1           0.6223    0.9214
              Time (Sec.)  1.743     1.812
AF_SpotSigs   F1           0.7716    0.9597
              Time (Sec.)  21.17     52.31

Fig. 5. F1 values of SizeSpotSigs, AF_SpotSigs and SpotSigs on Collection Mixer Purity (a) and Collection Mixer (b), as a function of the cluster partition point.

5.4 SizeSpotSigs over SpotSigs and AF_SpotSigs

To verify SizeSpotSigs, all clusters in Mixer are sorted from small to large by the average size of their core contents. We select three partition points (22, 44, 66) to split the set of clusters. For example, if the partition point is 22, the first 22 clusters in the sorted order are taken as the small part while the remaining clusters form the large part. Table 2 shows the nature of the two parts for every partition. In particular, 0/88 means that all clusters go into the large part, which makes SizeSpotSigs become SpotSigs, while 88/0 means that all clusters belong to the small part, which makes SizeSpotSigs become AF_SpotSigs.

Fig. 5(b) shows that SizeSpotSigs works better than SpotSigs but worse than AF_SpotSigs. Moreover, the F1 value of SizeSpotSigs increases as the partition point increases. When the purified collection is used, the noise-content ratio is zero, so by Formula (9) $sim(P_1,P_2) = sim(P_{1c},P_{2c})$ and the F1 value depends entirely on $sim(P_{1c},P_{2c})$. Fig. 5(a) shows that the F1 of SizeSpotSigs rises and falls in an irregular manner but stays within a reasonable interval, always above 0.91. All details are listed in Table 3.

Table 2. The nature of the partitions

Partition        0/88        22/66            44/44            66/22             88/0
Avg Size (Byte)  0/2189.41   607.65/2561.43   898.24/3247.73   1290.25/4421.20   2189.41/0
File Num         0/711       136/575          321/390          514/197           711/0

Table 3. The F1 value and time for the 3 algorithms on the partitions (s is Sec.)

                          SpotSigs   AF_SpotSigs   SizeSpotSigs   SizeSpotSigs   SizeSpotSigs
                          (0/88)     (88/0)        (22/66)        (44/44)        (66/22)
Mixer         F1          0.6957     0.8216        0.7530         0.7793         0.8230
              Time (s)    3.6094     148.20        7.142          22.81          61.13
Mixer Purity  F1          0.9360     0.9122        0.9580         0.9306         0.9165
              Time (s)    2.2783     134.34        4.0118         15.99          47.00

6 Conclusions and Future Works

We analyzed the relation between noise-content ratio and similarity theoretically, which led to two rules that can make near-duplicate detection algorithms work better. The paper then proposed 3 new features to improve effectiveness and robustness on short Web pages, which led to our AF_SpotSigs method. Experiments confirm that the 3 new features are effective and that AF_SpotSigs works 15% better than the state-of-the-art method for short Web pages. Besides, SizeSpotSigs, which considers the size of the page core content, performs better than SpotSigs over different partition points. Future work will focus on 1) how to decide the size of the core content of a Web page automatically or approximately, and 2) designing more features that suit short Web pages to improve effectiveness, as well as generalizing the bounding approach toward other metrics such as Cosine.

Acknowledgments. This work is supported by NSFC Grant No. 70903008, 60933004 and 61073082, and FSSP 2010 Grant No. 15. We thank Jing He and Dongdong Shan for a quick review of our paper close to the submission deadline.

References

1. Agarwal, A., Koppula, H., Leela, K., Chitrapura, K., Garg, S., GM, P., Haty, C., Roy, A., Sasturkar, A.: URL normalization for de-duplication of web pages. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1987-1990. ACM, New York (2009)
2. Baeza-Yates, R., Ribeiro-Neto, B., et al.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
3. Bawa, M., Condie, T., Ganesan, P.: LSH forest: self-tuning indexes for similarity search. In: Proceedings of the 14th International Conference on World Wide Web, p. 660. ACM, New York (2005)
4. Brin, S., Davis, J., Garcia-Molina, H.: Copy detection mechanisms for digital documents. ACM SIGMOD Record 24(2), 409 (1995)
5. Broder, A.: Identifying and filtering near-duplicate documents. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 1-10. Springer, Heidelberg (2000)

6. Broder, A., Glassman, S., Manasse, M., Zweig, G.: Syntactic clustering of the web. Computer Networks and ISDN Systems 29(8-13), 1157-1166 (1997)
7. Buttcher, S., Clarke, C.: A document-centric approach to static index pruning in text retrieval systems. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, p. 189. ACM, New York (2006)
8. Charikar, M.: Similarity estimation techniques from rounding algorithms. In: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, p. 388. ACM, New York (2002)
9. Chowdhury, A., Frieder, O., Grossman, D., McCabe, M.: Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS) 20(2), 191 (2002)
10. Dasgupta, A., Kumar, R., Sasturkar, A.: De-duping URLs via rewrite rules. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 186-194. ACM, New York (2008)
11. Datar, M., Gionis, A., Indyk, P., Motwani, R., Ullman, J., et al.: Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering 13(1) (2001)
12. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases, pp. 518-529. Morgan Kaufmann Publishers Inc., San Francisco (1999)
13. Henzinger, M.: Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 284-291. ACM, New York (2006)
14. Hoad, T., Zobel, J.: Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology 54(3), 203-215 (2003)
15. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604-613. ACM, New York (1998)
16. Kolcz, A., Chowdhury, A., Alspector, J.: Improved robustness of signature-based near-replica detection via lexicon randomization. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p. 610. ACM, New York (2004)
17. Koppula, H., Leela, K., Agarwal, A., Chitrapura, K., Garg, S., Sasturkar, A.: Learning URL patterns for webpage de-duplication. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 381-390. ACM, New York (2010)
18. Manber, U.: Finding similar files in a large file system. In: Proceedings of the USENIX Winter 1994 Technical Conference, San Francisco, CA, USA, pp. 1-10 (1994)
19. Theobald, M., Siddharth, J., Paepcke, A.: SpotSigs: robust and efficient near duplicate detection in large web collections. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 563-570. ACM, New York (2008)
20. Whitten, A.: Scalable document fingerprinting. In: The USENIX Workshop on E-Commerce (1996)
21. Yang, H., Callan, J.: Near-duplicate detection by instance-level constrained clustering. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p. 428. ACM, New York (2006)