SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content


Xianling Mao, Xiaobing Liu, Nan Di, Xiaoming Li, and Hongfei Yan
Department of Computer Science and Technology, Peking University

Abstract. Detecting whether two Web pages are near replicas, in terms of their contents rather than their files, is of great importance in many Web-information-based applications. As a result, many deduplication algorithms have been proposed. Nevertheless, analysis and experiments show that existing algorithms usually do not work well for short Web pages¹, because a relatively large portion of the corresponding files is noisy information, such as ads and site templates. In this paper, we analyze the critical issues in deduplicating short Web pages and present an algorithm (AF SpotSigs) that addresses them, which works 15% better than the state-of-the-art method. We then propose an algorithm (SizeSpotSigs) that takes the size of page contents into account and can handle both short and long Web pages. The contributions of SizeSpotSigs are three-fold: 1) we analyze the relation between noise-content ratio and similarity, and propose two rules for making the methods work better; 2) based on the analysis, we propose three new features for Chinese to improve effectiveness on short Web pages; 3) we present an algorithm named SizeSpotSigs for near-duplicate detection that considers the size of the core content of a Web page. Experiments confirm that SizeSpotSigs works better than state-of-the-art approaches such as SpotSigs on a demonstrative collection, Mixer, of manually assessed near-duplicate news articles that includes both short and long Web pages.

Keywords: Deduplicate, Near Duplicate Detection, AF SpotSigs, SizeSpotSigs, Information Retrieval.

1 Introduction

Detection of duplicate or near-duplicate Web pages is an important and difficult problem for Web search engines.
Many algorithms have been proposed in recent years [6,8,20,13,18]. Most approaches can be characterized as different types of distance or overlap measures operating on the HTML strings. State-of-the-art algorithms, such as Broder et al.'s [2] and Charikar's [3], achieve reasonable precision or recall. In particular, SpotSigs [19] can skip the noise-removal step for Web pages thanks to its smart feature selection. Existing deduplication algorithms do not take the size of the page core content into account. Essentially, these algorithms are better suited to long Web pages because they use only surface features to represent documents. For short documents, however, this representation is not sufficient, and it becomes worse when the documents contain noisy information such as ads within the Web page. Our experiments in Section 5.3 also show that the state-of-the-art deduplication algorithm performs relatively poorly on short Web pages: an F1 of just 0.62, against 0.92 for long Web pages.

In fact, there are a large number of short Web pages with duplicated core content on the World Wide Web. They are also very important: consider, for example, a central bank announcing an interest rate adjustment. Fig. 1 shows a pair of same-core Web pages that differ only in framing, advertisements, and navigational banners. Both articles exhibit almost identical core contents, reporting on the match between Uruguay and the Netherlands.

Fig. 1. Near-duplicate Web pages: identical core content with different framing and banners (additional ads and related links removed); the core contents are short

It is therefore important and necessary to improve the effectiveness of deduplication for short Web pages.

¹ In this paper, Web pages are classified into long (Web) pages and short (Web) pages based on their core content size.

J.Z. Huang, L. Cao, and J. Srivastava (Eds.): PAKDD 2011, Part I, LNAI 6634. © Springer-Verlag Berlin Heidelberg 2011

Contribution

1. We analyze the relation between noise-content ratio and similarity, and propose two rules for making the methods work better;
2. Based on our analysis, we propose three new features for Chinese to improve effectiveness on short Web pages, which leads to the AF SpotSigs algorithm;
3. We present an algorithm named SizeSpotSigs for near-duplicate detection that considers the size of the core content of a Web page.

2 Related Work

There are two families of methods for near-duplicate detection: content-based methods and non-content-based methods. Content-based methods detect near duplicates by computing the similarity between the contents of documents, while non-content-based methods rely on non-content features [10,1,17] (e.g., URL patterns). Non-content-based methods can only detect near-duplicate pages within one Web site, whereas content-based methods have no such limitation. Content-based algorithms can be further divided into two groups according to whether they require noise removal.

Most existing content-based deduplication algorithms require noise removal. Broder et al. [6] proposed the DSC algorithm (also called Shingling), which detects near duplicates by computing the similarity among the shingle sets of the documents. The similarity between two documents is computed with the common Jaccard overlap measure between their shingle sets. To reduce the complexity of Shingling on large collections, DSC-SS (also called super shingles) was later proposed by Broder [5]. DSC-SS uses meta-shingles, i.e., shingles of shingles, with only a small decrease in precision. A variety of methods for obtaining good shingles were investigated by Hoad and Zobel [14]. Buttcher and Clarke [7] focus on Kullback-Leibler divergence in the more general context of search.
A larger-scale evaluation was carried out by Henzinger [13], who compared the precision of the shingling and simhash algorithms with their parameters adjusted to maintain almost the same recall. The experiment shows that neither algorithm works well for finding near-duplicate pairs on the same site because of the influence of templates, while both achieve higher precision for near-duplicate pairs on different sites. Yang and Callan [21] proposed that near-duplicate clustering should incorporate information about document attributes or the content structure.

Another widespread duplicate detection technique is to generate a document fingerprint, which is a compact description of the document, and then to compute the pairwise similarity of document fingerprints. The assumption is that fingerprints can be compared much more quickly than complete documents. A common way of generating fingerprints is to select a set of character sequences from a document and to build the fingerprint from the hash values of these sequences. Similarity between two documents is measured by the Jaccard formula. Different algorithms are characterized, and their computational costs determined, by the hash functions and by how the character sequences are selected. Manber [18] initiated this line of research. The I-Match algorithm [9,16] uses external collection statistics and improves recall by using multiple fingerprints per document. Position-based schemes [4] select strings based on their offset in a document. Broder et al. [6] pick strings whose hash values are multiples of an integer. Indyk and Motwani [12,15] proposed Locality Sensitive Hashing (LSH), an approximate similarity search technique that scales to both large and high-dimensional data sets. There are many variants of LSH, such as LSH-Tree [3] and Hamming-LSH [11].

Generally, noise removal is an expensive operation, and a near-duplicate detection algorithm should avoid it if possible. Theobald et al. proposed the SpotSigs algorithm [19], which uses word chains around stopwords as features to construct the feature set. For example, consider the sentence: "On a street in Milton, in the city's inner-west, one woman wept as she toured her waterlogged home." Choosing the articles a, an, the and the verb is as antecedents, with a uniform spot distance of 1 and a chain length of 2, we obtain the set of spot signatures S = {a:street:milton, the:city's:inner-west}. SpotSigs needs only a single pass over a corpus, which makes it more efficient, easier to implement, and less error-prone because expensive layout analysis is omitted. It also remains largely independent of the input format. This method is taken as our baseline.

In this paper, given these merits, we focus on algorithms without noise removal, and we take the Jaccard overlap measure as our similarity measure.

3 Relation between Noise-Content Ratio and Similarity

3.1 Concepts and Notation

To calculate the similarity, we need to extract features from Web pages.
We define the set of all features from one page as the page-feature set, and we split it into a content-feature set and a noise-feature set. A feature that comes from the core content of the page is called a content feature (element) and belongs to the content-feature set; otherwise, it is called a noise feature (element) and belongs to the noise-feature set. The noise-content (feature) ratio is the ratio between the size of the noise-feature set and the size of the content-feature set.

3.2 Theoretical Analysis

Let $\mathrm{sim}(P_1, P_2) = |P_1 \cap P_2| / |P_1 \cup P_2|$ be the default Jaccard similarity defined over two sets $P_1$ and $P_2$, each being the page-feature set of one page in our case. $P_{1c}$ and $P_{2c}$ are the content-feature sets, and $P_{1n}$ and $P_{2n}$ are the noise-feature sets, subject to $P_{1c} \cup P_{1n} = P_1$ and $P_{2c} \cup P_{2n} = P_2$. The similarity between $P_{1c}$ and $P_{2c}$ is $\mathrm{sim}(P_{1c}, P_{2c}) = |P_{1c} \cap P_{2c}| / |P_{1c} \cup P_{2c}|$, which is the value we actually care about in near-duplicate detection.
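
As a small illustration of these definitions, the following Python sketch compares the page-level Jaccard similarity with the content-level one. The feature names and set sizes are made up for the example and do not come from the paper's data.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap |a ∩ b| / |a ∪ b|; taken as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical feature sets: two pages sharing the same core content
# but carrying different noise (ads, navigation).
p1c = {"c1", "c2", "c3", "c4"}   # content features of page 1
p2c = {"c1", "c2", "c3", "c4"}   # content features of page 2
p1n = {"ad_a", "nav_a"}          # noise features of page 1
p2n = {"ad_b", "nav_b"}          # noise features of page 2

p1 = p1c | p1n                   # page-feature set of page 1
p2 = p2c | p2n                   # page-feature set of page 2

content_sim = jaccard(p1c, p2c)  # similarity we actually care about
page_sim = jaccard(p1, p2)       # similarity an algorithm observes
noise_content_ratio = len(p1n) / len(p1c)

print(content_sim, page_sim, noise_content_ratio)
```

With a noise-content ratio of 0.5 on each page, the observed page-level similarity drops to half of the content-level similarity in this toy case, which is the kind of gap the analysis in this section quantifies.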

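In fact, near-duplicate detection is meant to compare the similarity of the core contents of two pages, but Web pages contain much noisy content, such as banners and ads. Most algorithms use $\mathrm{sim}(P_1, P_2)$ to approximate $\mathrm{sim}(P_{1c}, P_{2c})$. If $\mathrm{sim}(P_1, P_2)$ is close to $\mathrm{sim}(P_{1c}, P_{2c})$, the near-duplicate detection algorithm works well, and vice versa. To describe the difference between $\mathrm{sim}(P_1, P_2)$ and $\mathrm{sim}(P_{1c}, P_{2c})$, we state Theorem 1:

Theorem 1. Given two sets $P_1$ and $P_2$, subject to $P_{1c} \subseteq P_1$, $P_{1n} \subseteq P_1$ and $P_{1c} \cup P_{1n} = P_1$; similarly, $P_{2c} \subseteq P_2$, $P_{2n} \subseteq P_2$ and $P_{2c} \cup P_{2n} = P_2$. Let $\mathrm{sim}(P_1, P_2) = |P_1 \cap P_2| / |P_1 \cup P_2|$ and $\mathrm{sim}(P_{1c}, P_{2c}) = |P_{1c} \cap P_{2c}| / |P_{1c} \cup P_{2c}|$. Let the noise-content ratios satisfy $|P_{1n}| / |P_{1c}| \le \epsilon$ and $|P_{2n}| / |P_{2c}| \le \epsilon$, where $\epsilon$ is a small number. Then

$$-\frac{2\epsilon}{1+2\epsilon} \le \mathrm{sim}(P_1, P_2) - \mathrm{sim}(P_{1c}, P_{2c}) \le 2\epsilon \qquad (1)$$

Proof: Let $A = |P_{1c} \cap P_{2c}|$ and $B = |P_{1c} \cup P_{2c}|$. Then

$$A \le |(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})| \le A + 2\max\{|P_{1n}|, |P_{2n}|\} \qquad (2)$$

$$B \le |(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})| \le B + 2\max\{|P_{1n}|, |P_{2n}|\} \qquad (3)$$

From (2) and (3), we get the following inequality:

$$\frac{A}{B + 2\max\{|P_{1n}|, |P_{2n}|\}} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} \le \frac{A + 2\max\{|P_{1n}|, |P_{2n}|\}}{B} \qquad (4)$$

From (4), we get the following inequality:

$$-\frac{2A\max\{|P_{1n}|, |P_{2n}|\}}{B\,(B + 2\max\{|P_{1n}|, |P_{2n}|\})} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} - \frac{A}{B} \le \frac{2\max\{|P_{1n}|, |P_{2n}|\}}{B} \qquad (5)$$

Obviously, $A \le B$ and $B \ge \max\{|P_{1c}|, |P_{2c}|\}$. So we get:

$$\frac{2\max\{|P_{1n}|, |P_{2n}|\}}{B} \le \frac{2\max\{|P_{1n}|, |P_{2n}|\}}{\max\{|P_{1c}|, |P_{2c}|\}} \le 2\epsilon \qquad (6)$$

Another inequality is:

$$-\frac{2A\max\{|P_{1n}|, |P_{2n}|\}}{B\,(B + 2\max\{|P_{1n}|, |P_{2n}|\})} = -\frac{A}{B} \cdot \frac{2\max\{|P_{1n}|, |P_{2n}|\}}{B + 2\max\{|P_{1n}|, |P_{2n}|\}} \ge -\frac{2\max\{|P_{1n}|, |P_{2n}|\}}{B + 2\max\{|P_{1n}|, |P_{2n}|\}} \ge -\frac{2\max\{|P_{1n}|, |P_{2n}|\}}{\max\{|P_{1c}|, |P_{2c}|\} + 2\max\{|P_{1n}|, |P_{2n}|\}} = -\frac{\dfrac{2\max\{|P_{1n}|, |P_{2n}|\}}{\max\{|P_{1c}|, |P_{2c}|\}}}{1 + \dfrac{2\max\{|P_{1n}|, |P_{2n}|\}}{\max\{|P_{1c}|, |P_{2c}|\}}} \ge -\frac{2\epsilon}{1+2\epsilon} \qquad (7)$$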

So (5) can be rewritten as:

$$-\frac{2\epsilon}{1+2\epsilon} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} - \frac{A}{B} \le 2\epsilon \qquad (8)$$

That is,

$$-\frac{2\epsilon}{1+2\epsilon} \le \mathrm{sim}(P_1, P_2) - \mathrm{sim}(P_{1c}, P_{2c}) \le 2\epsilon \qquad (9)$$

Theorem 1 shows: (1) when $\epsilon$ is small enough, the similarity $\mathrm{sim}(P_1, P_2)$ is close to the similarity $\mathrm{sim}(P_{1c}, P_{2c})$; (2) once $\epsilon$ reaches a certain small value, the difference between the two similarities is already small, and making $\epsilon$ even smaller changes it little. That is, once the noise-content ratio is small enough, further reductions bring little gain in the effectiveness of a near-duplicate detection algorithm.

Without loss of generality, we assume $|P_{2n}| / |P_{2c}| \le |P_{1n}| / |P_{1c}| = \epsilon$. Then formula (9) can be rewritten as:

$$-\frac{2|P_{1n}|}{|P_{1c}| + 2|P_{1n}|} \le \mathrm{sim}(P_1, P_2) - \mathrm{sim}(P_{1c}, P_{2c}) \le \frac{2|P_{1n}|}{|P_{1c}|} \qquad (10)$$

Formula (10) shows that $|P_{1c}|$ should be large for robustness; otherwise, a slight change in $|P_{1c}|$ or $|P_{1n}|$ causes a sharp change in the upper and lower bounds, which means the algorithm is not robust. For example, assume two upper bounds, 5/100 and 5/100. After combining the feature sets, the upper bound becomes (5+5)/(100+100), which equals 5/100, but (5+1)/100 > (5+5+1)/(100+100). Clearly (5+5)/(100+100) is more robust than 5/100, although they have the same value.

In short, when $\epsilon$ is relatively large, we can make the algorithm work better with the following two rules:
(a) Select features that have a small noise-content ratio to improve effectiveness;
(b) When the noise-content ratios of two types of features are the same, select the feature with the larger content-feature set to make the algorithm robust. This implies that if the noise-content ratios of several types of features are very close, these features should be combined to increase robustness while effectiveness changes little.
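
The bound of Theorem 1 can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: it draws random content and noise sets from two disjoint, made-up feature pools, takes $\epsilon$ as the larger of the two observed noise-content ratios, and verifies that the difference between page-level and content-level Jaccard similarity stays inside the interval of formula (9). The pool sizes, set sizes, and function names are choices made for this example, not parameters from the paper.

```python
import random


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def theorem1_bounds(eps):
    """Lower and upper bound of sim(P1,P2) - sim(P1c,P2c) from formula (9)."""
    return -2 * eps / (1 + 2 * eps), 2 * eps


def check_theorem1(trials=1000, seed=0):
    """Return True if every random trial respects the Theorem 1 interval."""
    rng = random.Random(seed)
    content_pool = [f"c{i}" for i in range(60)]   # hypothetical content features
    noise_pool = [f"n{i}" for i in range(60)]     # hypothetical noise features
    for _ in range(trials):
        p1c = set(rng.sample(content_pool, rng.randint(10, 40)))
        p2c = set(rng.sample(content_pool, rng.randint(10, 40)))
        p1n = set(rng.sample(noise_pool, rng.randint(0, 20)))
        p2n = set(rng.sample(noise_pool, rng.randint(0, 20)))
        # eps is the larger noise-content ratio of the two pages.
        eps = max(len(p1n) / len(p1c), len(p2n) / len(p2c))
        diff = jaccard(p1c | p1n, p2c | p2n) - jaccard(p1c, p2c)
        low, high = theorem1_bounds(eps)
        # Small tolerance guards against floating-point rounding only.
        if not (low - 1e-12 <= diff <= high + 1e-12):
            return False
    return True


print(check_theorem1())
```

Such a check does not replace the proof; it only confirms on random instances that the interval is not violated, and it makes it easy to see how loose the bound becomes as the noise-content ratio grows.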
4 AF SpotSigs and SizeSpotSigs Algorithm

SpotSigs [19] introduced a stopword feature that aims to filter natural-language text passages out of noisy Web page components, so that its noise-content ratio is small. This gave us the intuition that we should choose features that tend to occur mostly in the core content of Web documents and skip over advertisements, banners, and navigational components. In this paper, based on the idea behind SpotSigs and our analysis in Section 3.2, we developed four features, all of which have a small noise-content ratio. Details are as follows:

1) Stopword feature. It is similar to the feature in SpotSigs, that is, a string made of a stopword and its neighboring words, except that the stopwords differ because the languages differ. Because stopwords are less frequent in noisy content than in core content, these features lower the noise-content ratio compared with Shingling features. The Chinese stopwords and their markers used in this paper are listed in Fig. 2.

2) Chinese punctuation feature. In English, many punctuation marks coincide with special characters of the HTML language, so punctuation cannot be used to extract features. In Chinese, however, this is not the case: Chinese punctuation marks occur less often in noisy areas. We take a string made of a punctuation mark and its neighboring words as the Chinese punctuation feature, which keeps the noise-content ratio small. The Chinese punctuation marks and their English counterparts used in this paper are also listed in Fig. 2.

3) Sentence feature. The string between two Chinese punctuation marks is treated as a sentence. Since sentences delimited by punctuation are rare in noisy areas, sentence features reduce the noise-content ratio notably.

4) Sentence shingling feature. For a sentence of length n, all 1-grams, 2-grams, ..., (n-1)-grams are taken as new features. This aims to enlarge the content-feature set for robustness and effectiveness and, building on the sentence feature, also keeps the noise-content ratio small.

Fig. 2. Table of Meaning of Chinese Punctuations and Table of Markers of Chinese Stopwords

The stopword feature is the one used by the state-of-the-art algorithm, SpotSigs [19]. Although the stopwords differ because the languages differ, we still call the algorithm SpotSigs.
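
The feature types described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions: text is already segmented into whitespace-separated tokens (Chinese word segmentation is left to an external tool), chain words are the nearest non-stopword tokens after the antecedent, sentence length is counted in tokens, and the stopword list, the punctuation set `CN_PUNCT`, and all function names are hypothetical.

```python
import re
import string

CN_PUNCT = "，。！？；：、"  # assumed subset of Chinese punctuation marks


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding ASCII punctuation."""
    tokens = (t.strip(string.punctuation) for t in text.lower().split())
    return [t for t in tokens if t]


def spot_signatures(tokens, antecedents, stopwords, distance=1, chain_len=2):
    """Chains 'antecedent:word:word' built from non-stopword tokens that follow
    each antecedent, taking every `distance`-th one up to `chain_len` words."""
    sigs = set()
    for i, tok in enumerate(tokens):
        if tok not in antecedents:
            continue
        following = [t for t in tokens[i + 1:] if t not in stopwords]
        chain = following[distance - 1::distance][:chain_len]
        if len(chain) == chain_len:
            sigs.add(":".join([tok] + chain))
    return sigs


def sentences(text, marks=CN_PUNCT):
    """Sentence feature: strings between two punctuation marks."""
    parts = re.split("[" + re.escape(marks) + "]", text)
    return [p.strip() for p in parts if p.strip()]


def sentence_shingles(tokens):
    """Sentence shingling feature: all k-grams of a sentence, k = 1 .. n-1."""
    n = len(tokens)
    return {" ".join(tokens[i:i + k]) for k in range(1, n) for i in range(n - k + 1)}


# Stopword feature on the English example sentence quoted from SpotSigs.
text = ("On a street in Milton, in the city's inner-west, one woman wept "
        "as she toured her waterlogged home.")
english_stopwords = {"a", "an", "the", "is", "in", "on", "one", "as", "she", "her"}
sigs = spot_signatures(tokenize(text), {"a", "an", "the", "is"}, english_stopwords)

# Punctuation feature: the same chaining with a punctuation mark as antecedent.
punct_sigs = spot_signatures(["加息", "，", "市场", "反应"], {"，"}, set())

# Sentence and sentence-shingling features on pre-segmented Chinese text.
sents = sentences("央行 宣布 加息，市场 反应 平稳。")
shingles = sentence_shingles(sents[0].split())

print(sigs, punct_sigs, sents, shingles)
```

Under these assumptions the English sentence yields the two spot signatures given in Section 2, and a three-token sentence yields five shingles (three 1-grams and two 2-grams), which shows how sentence shingling enlarges the content-feature set of a short page.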
The experiments in Section 5.3 show that SpotSigs reaches an F1 of 0.92 on long Web pages but only 0.62 on short Web pages. Clearly, SpotSigs cannot process short Web pages well, and a new algorithm is needed. When all four features are used to detect near duplicates, the algorithm is called AF SpotSigs. The experiments in Section 5.3 show that AF SpotSigs reaches an F1 of 0.77, against 0.62 for SpotSigs, on short Web pages, but improves F1 by only 0.04 at 28.8 times the time overhead on long Web pages. In other words, AF SpotSigs works better than SpotSigs on short Web pages, while on long Web pages it is only slightly more effective at a much higher cost. To balance efficiency and effectiveness, we propose an algorithm called SizeSpotSigs, which uses only stopword features to judge near duplication for long Web pages (namely SpotSigs) and all four types of features for short Web pages (namely AF SpotSigs).
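
The size-based choice between the two feature sets can be sketched as below. The paper does not specify how the core-content size is measured or where the short/long boundary lies, so `core_size`, `size_threshold`, the similarity threshold `tau`, and the two toy extractors are all hypothetical stand-ins introduced only to make the dispatch runnable.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def choose_extractor(core_size, size_threshold, long_fn, short_fn):
    """Long pages use stopword features only (SpotSigs); short pages use
    all feature types (AF SpotSigs)."""
    return long_fn if core_size >= size_threshold else short_fn


def near_duplicates(text_a, text_b, core_size, size_threshold,
                    long_fn, short_fn, tau):
    """Compare two pages with the extractor selected by core-content size."""
    extract = choose_extractor(core_size, size_threshold, long_fn, short_fn)
    return jaccard(extract(text_a), extract(text_b)) >= tau


# Toy stand-ins for the real extractors (assumptions for illustration only).
def toy_stopword_features(text):
    words = text.split()
    return {f"{a}:{b}" for a, b in zip(words, words[1:]) if a == "the"}


def toy_all_features(text):
    return toy_stopword_features(text) | set(text.split())


a = "the bank raised the rate today"
b = "the bank raised the rate again"

as_long = near_duplicates(a, b, core_size=5000, size_threshold=1000,
                          long_fn=toy_stopword_features,
                          short_fn=toy_all_features, tau=0.8)
as_short = near_duplicates(a, b, core_size=50, size_threshold=1000,
                           long_fn=toy_stopword_features,
                           short_fn=toy_all_features, tau=0.8)
print(as_long, as_short)
```

The sketch applies one size value to both pages of a pair so that the two feature sets being compared are of the same type; how that size is obtained in practice is left open here.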

5 Experiment

5.1 Data Set

To verify our algorithms, AF SpotSigs and SizeSpotSigs, we construct four datasets. Details are as follows:

Collection Shorter / Collection Longer: we constructed the collections Shorter and Longer manually. Collection Shorter has 379 short Web pages in 48 clusters, and Collection Longer has 332 long Web pages in 40 clusters.

Collection Mixer / Collection Mixer Purity: Collection Shorter and Collection Longer are merged into Collection Mixer, which contains 88 clusters and 711 Web pages in total. For each Web page in Collection Mixer, we extract its core content by human judgment, which yields Collection Mixer Purity.

5.2 Choice of Stopwords

Because the number of stopwords is large (e.g., more than 370 in Chinese), we need to select the most representative stopwords to improve performance. SpotSigs, however, was only evaluated on an English collection, so we do not know how to choose the stopwords or the length of their neighboring words for a Chinese collection. The same choices have to be made for AF SpotSigs. We find that F1 varies only slightly, by about 1 absolute percent, from a chain length of 1 to a distance of 3 (figures omitted). So we choose two words as the length parameter for both algorithms.

In this section, we seek the best combination of stopwords for AF SpotSigs and SpotSigs for Chinese. We consider variations in the choice of SpotSigs antecedents (stopwords and their neighboring words), aiming to find a good compromise between extracting characteristic signatures and avoiding overfitting these signatures to particular articles or sites. For SpotSigs, which suits long Web pages, the best combination was searched on the collection Longer Sample, obtained by sampling 1/3 of the clusters from the collection Longer.
For AF SpotSigs, which suits short Web pages, we tune the parameter on the collection Shorter Sample, obtained by sampling 1/3 of the clusters from the collection Shorter. Fig. 3(a) shows that the best F1 for SpotSigs is obtained with the combination De1, Di, De2, Shi, Ba, Le, stopwords that mostly occur in core contents and are less likely to occur in ads or navigational banners. For AF SpotSigs, Fig. 3(b) shows that the best F1 is obtained with the single stopword De1. Using a full stopword list (here, the 40 most frequent stopwords) already tends to yield overly generic signatures but still performs well.

5.3 AF SpotSigs vs. SpotSigs

After obtaining the parameters of AF SpotSigs and SpotSigs, we compare the two algorithms in terms of F1 and computing cost. To this end, both algorithms are run on the collections Shorter and Longer.

[Fig. 3 panels: (a) SpotSigs on Longer; (b) AF_SpotSigs on Shorter. Y-axis: F1. X-axis: stopword sets {De1}, {Di}, {De2}, {De1,Di}, {De1,Di,De2}, {De1,Di,De2,Ba,Le}, {De1,Di,De2,Shi,Ba,Le}, {De1,Di,De2,Yu1,He,Mei}, {Yi,Le,Ba,Suo,Dou,Yu2}, Full stopword list.]

Fig. 3. (a) The effectiveness of SpotSigs with different stopwords on the Longer collection; (b) the effectiveness of AF SpotSigs with different stopwords on the Shorter collection

Fig. 4 shows that the F1 scores of AF SpotSigs are better than those of SpotSigs on both Shorter and Longer. Moreover, the F1 score of SpotSigs is far worse than that of AF SpotSigs on Shorter, while the F1 scores of the two algorithms are very close on Longer. However, Table 1 shows that AF SpotSigs takes much more time than SpotSigs. To balance effectiveness and efficiency, we can partition a collection into two parts, a short part and a long part, and run SpotSigs on the long part and AF SpotSigs on the short part. This is the SizeSpotSigs algorithm.

[Fig. 4 chart: series SpotSigs and AF_SpotSigs. Y-axis: F1. X-axis: Shorter, Longer.]

Fig. 4. The effectiveness of SpotSigs and AF SpotSigs on Shorter and Longer

Table 1. The F1 value and cost of two algorithms
Columns: Shorter, Longer
Rows: SpotSigs F1, SpotSigs Time(Sec.), AF SpotSigs F1, AF SpotSigs Time(Sec.)

[Fig. 5 panels: (a) Mixer_Purity; (b) Mixer. Y-axis: F1. X-axis: cluster partition point. Series: SizeSpotSigs, AF_SpotSigs, SpotSigs.]

Fig. 5. F1 values of SizeSpotSigs, AF SpotSigs and SpotSigs on Collection Mixer Purity (a) and Mixer (b)

5.4 SizeSpotSigs over SpotSigs and AF SpotSigs

To verify SizeSpotSigs, all clusters in Mixer are sorted in ascending order of the average size of their core contents. We select three partition points (22, 44, 66) to partition the set of clusters. For example, if the partition point is 22, the first 22 clusters in the sorted list form the small part and the remaining clusters form the large part. Table 2 describes the two parts for every partition. In particular, 0/88 means that all clusters are put in the large part, which reduces SizeSpotSigs to SpotSigs, while 88/0 means that all clusters belong to the small part, which reduces SizeSpotSigs to AF SpotSigs. Fig. 5(b) shows that SizeSpotSigs works better than SpotSigs but worse than AF SpotSigs. Moreover, the F1 value of SizeSpotSigs increases as the partition point increases.

When the purified collection is used, the noise-content ratio is zero. So, by formula (9), $\mathrm{sim}(P_1, P_2) = \mathrm{sim}(P_{1c}, P_{2c})$, and the F1 value depends entirely on $\mathrm{sim}(P_{1c}, P_{2c})$. Fig. 5(a) shows that the F1 of SizeSpotSigs rises and falls irregularly but stays within a reasonable interval, all above [value missing]. All details are listed in Table 3.

Table 2. The Nature of Partitions
Partition          0/88    22/66    44/44    66/22    88/0
Avg Size (Byte)    0/      /        /        /        /0
File Num           0/      /        /        /        /0
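
The partitioning step of Section 5.4 can be sketched as follows. The input layout (a dictionary from cluster id to the core-content sizes of its pages) and the toy numbers are assumptions made for illustration; only the procedure itself (sort clusters by average core-content size, then split at a partition point) follows the description above.

```python
from statistics import mean


def partition_clusters(clusters, partition_point):
    """Sort clusters by average core-content size (ascending) and split them.

    `clusters` maps a cluster id to the core-content sizes (bytes) of its pages.
    Returns (small_part, large_part) as lists of cluster ids.
    """
    ordered = sorted(clusters, key=lambda cid: mean(clusters[cid]))
    return ordered[:partition_point], ordered[partition_point:]


# Hypothetical clusters with made-up core-content sizes.
toy_clusters = {
    "a": [100, 120],
    "b": [900, 1100],
    "c": [300],
    "d": [50, 70],
}

small, large = partition_clusters(toy_clusters, 2)
all_large = partition_clusters(toy_clusters, 0)                   # SizeSpotSigs reduces to SpotSigs
all_small = partition_clusters(toy_clusters, len(toy_clusters))   # reduces to AF SpotSigs
print(small, large)
```

The small part would then be processed with all four feature types and the large part with stopword features only, matching the 0/88 and 88/0 extremes described for Table 2.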

Table 3. The F1 value and time for the three algorithms on partitions (s is Sec.)
Columns: SpotSigs (0/88), AF SpotSigs (88/0), SizeSpotSigs (22/66), SizeSpotSigs (44/44), SizeSpotSigs (66/22)
Rows: Mixer F1, Mixer Time(s), Mixer Purity F1, Mixer Purity Time(s)

6 Conclusions and Future Works

We analyzed the relation between noise-content ratio and similarity theoretically, which led to two rules that can make a near-duplicate detection algorithm work better. We then proposed three new features to improve effectiveness and robustness for short Web pages, which led to our AF SpotSigs method. Experiments confirm that the three new features are effective and that AF SpotSigs works 15% better than the state-of-the-art method on short Web pages. In addition, SizeSpotSigs, which considers the size of the page core content, performs better than SpotSigs over different partition points.

Future work will focus on: 1) how to determine the size of the core content of a Web page automatically or approximately; 2) designing more features suited to short Web pages to improve effectiveness, as well as generalizing the bounding approach to other metrics such as Cosine.

Acknowledgments

This work is supported by NSFC Grant No , and , FSSP 2010 Grant No.15. We also thank Jing He and Dongdong Shan for a quick review of our paper close to the submission deadline.

References

1. Agarwal, A., Koppula, H., Leela, K., Chitrapura, K., Garg, S., GM, P., Haty, C., Roy, A., Sasturkar, A.: URL normalization for de-duplication of web pages. In: Proceeding of the 18th ACM Conference on Information and Knowledge Management. ACM, New York (2009)
2. Baeza-Yates, R., Ribeiro-Neto, B., et al.: Modern information retrieval. Addison-Wesley, Reading (1999)
3. Bawa, M., Condie, T., Ganesan, P.: LSH forest: self-tuning indexes for similarity search. In: Proceedings of the 14th International Conference on World Wide Web. ACM, New York (2005)
4. Brin, S., Davis, J., Garcia-Molina, H.: Copy detection mechanisms for digital documents. ACM SIGMOD Record 24(2), 409 (1995)
5. Broder, A.: Identifying and filtering near-duplicate documents. In: Giancarlo, R., Sankoff, D. (eds.) CPM. LNCS, vol. 1848. Springer, Heidelberg (2000)
6. Broder, A., Glassman, S., Manasse, M., Zweig, G.: Syntactic clustering of the web. Computer Networks and ISDN Systems 29(8-13) (1997)
7. Buttcher, S., Clarke, C.: A document-centric approach to static index pruning in text retrieval systems. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management. ACM, New York (2006)
8. Charikar, M.: Similarity estimation techniques from rounding algorithms. In: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing. ACM, New York (2002)
9. Chowdhury, A., Frieder, O., Grossman, D., McCabe, M.: Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS) 20(2), 191 (2002)
10. Dasgupta, A., Kumar, R., Sasturkar, A.: De-duping URLs via rewrite rules. In: Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York (2008)
11. Datar, M., Gionis, A., Indyk, P., Motwani, R., Ullman, J., et al.: Finding Interesting Associations without Support Pruning. IEEE Transactions on Knowledge and Data Engineering 13(1) (2001)
12. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., San Francisco (1999)
13. Henzinger, M.: Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York (2006)
14. Hoad, T., Zobel, J.: Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology 54(3) (2003)
15. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing. ACM, New York (1998)
16. Kołcz, A., Chowdhury, A., Alspector, J.: Improved robustness of signature-based near-replica detection via lexicon randomization. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York (2004)
17. Koppula, H., Leela, K., Agarwal, A., Chitrapura, K., Garg, S., Sasturkar, A.: Learning URL patterns for webpage de-duplication. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, New York (2010)
18. Manber, U.: Finding similar files in a large file system. In: Proceedings of the USENIX Winter 1994 Technical Conference, San Francisco, CA, USA (1994)
19. Theobald, M., Siddharth, J., Paepcke, A.: SpotSigs: robust and efficient near duplicate detection in large web collections. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York (2008)
20. Whitten, A.: Scalable Document Fingerprinting. In: The USENIX Workshop on E-Commerce (1996)
21. Yang, H., Callan, J.: Near-duplicate detection by instance-level constrained clustering. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York (2006)

A Fast Text Similarity Measure for Large Document Collections Using Multi-reference Cosine and Genetic Algorithm

A Fast Text Similarity Measure for Large Document Collections Using Multi-reference Cosine and Genetic Algorithm A Fast Text Similarity Measure for Large Document Collections Using Multi-reference Cosine and Genetic Algorithm Hamid Mohammadi Department of Computer Engineering K. N. Toosi University of Technology

More information

SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections

SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections Martin Theobald Jonathan Siddharth Andreas Paepcke Stanford University Department of Computer Science 353 Serra Mall, Stanford

More information

Detecting Near-Duplicates in Large-Scale Short Text Databases

Detecting Near-Duplicates in Large-Scale Short Text Databases Detecting Near-Duplicates in Large-Scale Short Text Databases Caichun Gong 1,2, Yulan Huang 1,2, Xueqi Cheng 1, and Shuo Bai 1 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing,

More information

Achieving both High Precision and High Recall in Near-duplicate Detection

Achieving both High Precision and High Recall in Near-duplicate Detection Achieving both High Precision and High Recall in Near-duplicate Detection Lian en Huang Institute of Network Computing and Distributed Systems, Peking University Beijing 100871, P.R.China hle@net.pku.edu.cn

More information

Combinatorial Algorithms for Web Search Engines - Three Success Stories

Combinatorial Algorithms for Web Search Engines - Three Success Stories Combinatorial Algorithms for Web Search Engines - Three Success Stories Monika Henzinger Abstract How much can smart combinatorial algorithms improve web search engines? To address this question we will

More information

SpotSigs: Near-Duplicate Detection in Web Page Collections

SpotSigs: Near-Duplicate Detection in Web Page Collections SpotSigs: Near-Duplicate Detection in Web Page Collections Masters Thesis Report Siddharth Jonathan Computer Science, Stanford University ABSTRACT Motivated by our work with political scientists we present

More information

Adaptive Near-Duplicate Detection via Similarity Learning

Adaptive Near-Duplicate Detection via Similarity Learning Adaptive Near-Duplicate Detection via Similarity Learning Hannaneh Hajishirzi University of Illinois 201 N Goodwin Ave Urbana, IL, USA hajishir@uiuc.edu Wen-tau Yih Microsoft Research One Microsoft Way

More information

Detection of Distinct URL and Removing DUST Using Multiple Alignments of Sequences

Detection of Distinct URL and Removing DUST Using Multiple Alignments of Sequences Detection of Distinct URL and Removing DUST Using Multiple Alignments of Sequences Prof. Sandhya Shinde 1, Ms. Rutuja Bidkar 2,Ms. Nisha Deore 3, Ms. Nikita Salunke 4, Ms. Neelay Shivsharan 5 1 Professor,

More information

HYBRIDIZED MODEL FOR EFFICIENT MATCHING AND DATA PREDICTION IN INFORMATION RETRIEVAL

HYBRIDIZED MODEL FOR EFFICIENT MATCHING AND DATA PREDICTION IN INFORMATION RETRIEVAL International Journal of Mechanical Engineering & Computer Sciences, Vol.1, Issue 1, Jan-Jun, 2017, pp 12-17 HYBRIDIZED MODEL FOR EFFICIENT MATCHING AND DATA PREDICTION IN INFORMATION RETRIEVAL BOMA P.

More information

A LITERATURE SURVEY ON WEB CRAWLERS

A LITERATURE SURVEY ON WEB CRAWLERS A LITERATURE SURVEY ON WEB CRAWLERS V. Rajapriya School of Computer Science and Engineering, Bharathidasan University, Trichy, India rajpriyavaradharajan@gmail.com ABSTRACT: The web contains large data

More information

Automated Path Ascend Forum Crawling

Ms. Joycy Joy, PG Scholar Department of CSE, Saveetha Engineering College, Thandalam, Chennai-602105 Ms. Manju. A, Assistant Professor, Department of CSE, Saveetha Engineering

The Chinese Duplicate Web Pages Detection Algorithm based on Edit Distance

1666 JOURNAL OF SOFTWARE, VOL. 8, NO. 7, JULY 2013 Junxiu An Chengdu University of Information Technology, Chengdu, P.R.China

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions

CADIAL Search Engine at INEX

Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr

Mining Quantitative Association Rules on Overlapped Intervals

Qiang Tong 1,3, Baoping Yan 2, and Yuanchun Zhou 1,3 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China {tongqiang,

Automation of URL Discovery and Flattering Mechanism in Live Forum Threads

T.Nagajothi 1, M.S.Thanabal 2 PG Student, Department of CSE, P.S.N.A College of Engineering and Technology, Tamilnadu, India 1

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage

Zhe Wang Princeton University Jim Gemmell Microsoft Research September 2005 Technical Report MSR-TR-2006-30 Microsoft Research Microsoft

A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING

Lavanya Pamulaparty 1, Dr. C.V Guru Rao 2 and Dr. M. Sreenivasa Rao 3 1 Department of CSE, Methodist college of Engg. & Tech., OU,

From Passages into Elements in XML Retrieval

Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles

Candidate Document Retrieval for Arabic-based Text Reuse Detection on the Web

Leena Lulu, Boumediene Belkhouche, Saad Harous College of Information Technology United Arab Emirates University Al Ain, United

Topic: Duplicate Detection and Similarity Computing

Table of Content Motivation Shingling for duplicate comparison Minhashing LSH UCSB 290N, 2013 Tao Yang Some of slides are from text book [CMS] and Rajaraman/Ullman

PROBABILISTIC SIMHASH MATCHING. A Thesis SADHAN SOOD

A Thesis by SADHAN SOOD Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms

Monika Henzinger Google Inc. & Ecole Fédérale de Lausanne (EPFL) monika@google.com ABSTRACT Broder et al.'s [3] shingling algorithm

Clustering-Based Distributed Precomputation for Quality-of-Service Routing*

Yong Cui and Jianping Wu Department of Computer Science, Tsinghua University, Beijing, P.R.China, 100084 cy@csnet1.cs.tsinghua.edu.cn,

Metric Learning Applied for Automatic Large Image Classification

September, 2014 UPC Supervisors SAHILU WENDESON / IT4BI TOON CALDERS (PhD)/ULB SALIM JOUILI (PhD)/EuraNova Image Database Classification

Making Retrieval Faster Through Document Clustering

RESEARCH REPORT IDIAP David Grangier 1 Alessandro Vinciarelli 2 IDIAP RR 04-02 January 23, 2004 Dalle Molle Institute

A Language Independent Author Verifier Using Fuzzy C-Means Clustering

Notebook for PAN at CLEF 2014 Pashutan Modaresi 1,2 and Philipp Gross 1 1 pressrelations GmbH, Düsseldorf, Germany {pashutan.modaresi,

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Week 9: Data Mining (3/4) March 8, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

Enabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines

Appears in WWW 04 Workshop: Measuring Web Effectiveness: The User Perspective, New York, NY, May 18, 2004 Anselm

Semantic Website Clustering

I-Hsuan Yang, Yu-tsun Huang, Yen-Ling Huang 1. Abstract We propose a new approach to cluster the web pages. Utilizing an iterative reinforced algorithm, the model extracts semantic

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique

Neminath Hubballi 1, Bidyut Kr. Patra 2, and Sukumar Nandi 1 1 Department of Computer Science & Engineering, Indian Institute of Technology

Chapter 27 Introduction to Information Retrieval and Web Search

Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

A Two-Tier Distributed Full-Text Indexing System

Appl. Math. Inf. Sci. 8, No. 1, 321-326 (2014) 321 Applied Mathematics & Information Sciences An International Journal http://dx.doi.org/10.12785/amis/080139

Towards a hybrid approach to Netflix Challenge

Abhishek Gupta, Abhijeet Mohapatra, Tejaswi Tenneti March 12, 2009 1 Introduction Today Recommendation systems [3] have become indispensable because of the

ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES

Fidel Cacheda, Alberto Pan, Lucía Ardao, Angel Viña Department of Tecnoloxías da Información e as Comunicacións, Facultad

New Issues in Near-duplicate Detection

Martin Potthast and Benno Stein Bauhaus University Weimar Web Technology and Information Systems Motivation About 30% of the Web is redundant. [Fetterly 03, Broder

Exploiting Index Pruning Methods for Clustering XML Collections

Ismail Sengor Altingovde, Duygu Atilgan and Özgür Ulusoy Department of Computer Engineering, Bilkent University, Ankara, Turkey {ismaila,

A New Measure of the Cluster Hypothesis

Mark D. Smucker 1 and James Allan 2 1 Department of Management Sciences University of Waterloo 2 Center for Intelligent Information Retrieval Department of Computer

Ternary Tree Optimalization for n-gram Indexing

Daniel Robenek, Jan Platoš, Václav Snášel Department of Computer Science, FEI, VSB Technical

Text Clustering Incremental Algorithm in Sensitive Topic Detection

International Journal of Information and Communication Sciences 2018; 3(3): 88-95 http://www.sciencepublishinggroup.com/j/ijics doi: 10.11648/j.ijics.20180303.12 ISSN: 2575-1700 (Print); ISSN: 2575-1719

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

A Supervised Method for Multi-keyword Web Crawling on Web Forums

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 2, February 2014,

Joint Entity Resolution

Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Week 9: Data Mining (3/4) March 7, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

American Journal of Applied Sciences (): -, ISSN -99 Science Publications Ibrahiem M.M. El Emary and Ja'far

Mining di Dati Web. Lezione 3 - Clustering and Classification

Introduction Clustering and classification are both learning techniques They learn functions describing data Clustering is also known as Unsupervised

Index Terms:- Document classification, document clustering, similarity measure, accuracy, classifiers, clustering algorithms.

International Journal of Scientific & Engineering Research, Volume 5, Issue 10, October-2014 559 DCCR: Document Clustering by Conceptual Relevance as a Factor of Unsupervised Learning Annaluri Sreenivasa

Learning the Three Factors of a Non-overlapping Multi-camera Network Topology

Xiaotang Chen, Kaiqi Huang, and Tieniu Tan National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy

Web page recommendation using a stochastic process model

Data Mining VII: Data, Text and Web Mining and their Business Applications 233 B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

Finding Similar Items: Nearest Neighbor Search

Barna Saha February 23, 2016 Finding Similar Items A fundamental data mining task May want to find whether

Plagiarism detection by similarity join

Version of August 11, 2009 R. Schellenberger THESIS submitted in partial fulfillment of the requirements for the degree

Automatic Query Type Identification Based on Click Through Information

Yiqun Liu 1, Min Zhang 1, Liyun Ru 2, and Shaoping Ma 1 1 State Key Lab of Intelligent Tech. & Sys., Tsinghua University, Beijing, China

Online Document Clustering Using the GPU

Benjamin E. Teitler, Jagan Sankaranarayanan, Hanan Samet Center for Automation Research Institute for Advanced Computer Studies Department of Computer Science University

Ranking Clustered Data with Pairwise Comparisons

Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

Algorithms for Nearest Neighbors

Classic Ideas, New Ideas Yury Lifshits Steklov Institute of Mathematics at St.Petersburg http://logic.pdmi.ras.ru/~yura University of Toronto, July 2007 1 / 39 Outline

A Model for Interactive Web Information Retrieval

Orland Hoeber and Xue Dong Yang University of Regina, Regina, SK S4S 0A2, Canada {hoeber, yang}@uregina.ca Abstract. The interaction model supported by

C-NBC: Neighborhood-Based Clustering with Constraints

Piotr Lasek Chair of Computer Science, University of Rzeszów ul. Prof. St. Pigonia 1, 35-310 Rzeszów, Poland lasek@ur.edu.pl Abstract. Clustering is

Cardinality Estimation: An Experimental Survey

and Felix Naumann VLDB 2018 Estimation and Approximation Session Rio de Janeiro-Brazil 29th August 2018 Information System Group Hasso Plattner Institut University of Potsdam

Search Engines. Information Retrieval in Practice

All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly

Selection of n in K-Means Algorithm

International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 6 (2014), pp. 577-582 International Research Publications House http://www.irphouse.com

Use of Locality Sensitive Hashing (LSH) Algorithm to Match Web of Science and SCOPUS

Mehmet Ali Abdulhayoglu 1,2 (0000-0002-1288-2181), Bart Thijs 1 (0000-0003-0446-8332) 1 ECOOM, Center for R&D Monitoring,

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS

KULWADEE SOMBOONVIWAT Graduate School of Information Science and Technology, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033,

A Modular k-nearest Neighbor Classification Method for Massively Parallel Text Categorization

Hai Zhao and Bao-Liang Lu Department of Computer Science and Engineering, Shanghai Jiao Tong University, 1954

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

Mining High Average-Utility Itemsets

Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Tzung-Pei Hong Dept of Computer Science and Information Engineering

Static Index Pruning for Information Retrieval Systems: A Posting-Based Approach

Linh Thai Nguyen Department of Computer Science Illinois Institute of Technology Chicago, IL 60616 USA +1-312-567-5330 nguylin@iit.edu

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

Duplicate News Story Detection Revisited

Omar Alonso Microsoft Corporation omalonso@microsoft.com Dennis Fetterly Microsoft Research, Silicon Valley Lab fetterly@microsoft.com Mark Manasse Microsoft Research,

Ranking Web Pages by Associating Keywords with Locations

Peiquan Jin, Xiaoxiang Zhang, Qingqing Zhang, Sheng Lin, and Lihua Yue University of Science and Technology of China, 230027, Hefei, China jpq@ustc.edu.cn

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

Encoding Words into String Vectors for Word Categorization

Int'l Conf. Artificial Intelligence ICAI'16 271 Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

Using Statistical Properties of Text to Create. Metadata. Computer Science and Electrical Engineering Department

Using Statistical Properties of Text to Create Metadata Grace Crowder crowder@cs.umbc.edu Charles Nicholas nicholas@cs.umbc.edu Computer Science and Electrical Engineering Department University of Maryland

ResPubliQA 2010

SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

STUDYING OF CLASSIFYING CHINESE SMS MESSAGES

STUDYING OF CLASSIFYING CHINESE SMS MESSAGES BASED ON BAYESIAN CLASSIFICATION 1 LI FENG, 2 LI JIGANG 1,2 Computer Science Department, DongHua University, Shanghai, China E-mail: 1 Lifeng@dhu.edu.cn, 2

Live Virtual Machine Migration with Efficient Working Set Prediction

2011 International Conference on Network and Electronics Engineering IPCSIT vol.11 (2011) (2011) IACSIT Press, Singapore Ei Phyu Zaw

Semi-Supervised Clustering with Partial Background Information

Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

Leveraging Set Relations in Exact Set Similarity Join

Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

Meshlization of Irregular Grid Resource Topologies by Heuristic Square-Packing Methods

Uei-Ren Chen 1, Chin-Chi Wu 2, and Woei Lin 3 1 Department of Electronic Engineering, Hsiuping Institute of Technology

REDUNDANCY REMOVAL IN WEB SEARCH RESULTS USING RECURSIVE DUPLICATION CHECK ALGORITHM. Pudukkottai, Tamil Nadu, India

Dr. S. RAVICHANDRAN 1 E.ELAKKIYA 2 1 Head, Dept. of Computer Science, H. H. The Rajah's College, Pudukkottai, Tamil

Distance-based Outlier Detection: Consolidation and Renewed Bearing

Gustavo. H. Orair, Carlos H. C. Teixeira, Wagner Meira Jr., Ye Wang, Srinivasan Parthasarathy September 15, 2010 Table of contents Introduction

A Clustering Framework to Build Focused Web Crawlers for Automatic Extraction of Cultural Information

George E. Tsekouras *, Damianos Gavalas, Stefanos Filios, Antonios D. Niros, and George Bafaloukas

Locality Preserving Scheme of Text Databases Representative in Distributed Information Retrieval Systems

Mohammad Hassan, Yaser A. Al-Lahham Zarqa University Jordan mohdzita@zu.edu.jo Journal of Digital

K-Means Based Matching Algorithm for Multi-Resolution Feature Descriptors

Shao-Tzu Huang, Chen-Chien Hsu, Wei-Yen Wang International Science Index, Electrical and Computer Engineering waset.org/publication/0007607

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL

Lim Bee Huang 1, Vimala Balakrishnan 2, Ram Gopal Raj 3 1,2 Department of Information System, 3 Department

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

FOCUS: ADAPTING TO CRAWL INTERNET FORUMS

T.K. Arunprasath, Dr. C. Kumar Charlie Paul Abstract Internet is emergent exponentially and has become progressively more. Now, it is complicated to retrieve relevant

Weighted Suffix Tree Document Model for Web Documents Clustering

ISBN 978-952-5726-09-1 (Print) Proceedings of the Second International Symposium on Networking and Network Security (ISNNS 10) Jinggangshan, P. R. China, 2-4, April. 2010, pp. 165-169

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Goals Many Web-mining problems can be expressed as finding similar sets: Pages with similar words, e.g., for classification

Classification of Page to the aspect of Crawl Web Forum and URL Navigation

Yerragunta Kartheek*1, T.Sunitha Rani*2 M.Tech Scholar, Dept of CSE, QISCET, ONGOLE, Dist: Prakasam, AP, India. Associate Professor,

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University

{tedhong, dtsamis}@stanford.edu Abstract This paper analyzes the performance of various KNNs techniques as applied to the

Compact Encoding of the Web Graph Exploiting Various Power Laws

Statistical Reason Behind Link Database Yasuhito Asano, Tsuyoshi Ito 2, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 Department

Frame based Video Retrieval using Video Signatures

Siva Kumar Avula Assistant Professor Dept. of Computer Science & Engg. Ashokrao Mane Group of Institutions-Vathar Shubhangi C Deshmukh Assistant Professor

International Journal of Advanced Research in Computer Science and Software Engineering

Volume 3, Issue 3, March 2013 ISSN: 2277 128X Research Paper Available online at: www.ijarcsse.com Special Issue:

Crawler with Search Engine based Simple Web Application System for Forum Mining

IJSRD - International Journal for Scientific Research & Development Vol. 3, Issue 04, 2015 ISSN (online): 2321-0613 Parina

Online Stochastic Matching CMSC 858F: Algorithmic Game Theory Fall 2010

Barna Saha, Vahid Liaghat Abstract This summary is mostly based on the work of Saberi et al. [1] on online stochastic matching problem

Improving Suffix Tree Clustering Algorithm for Web Documents

International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Yan Zhuang Computer Center East China Normal

Application of Support Vector Machine Algorithm in E-Mail Spam Filtering

Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

Fast or furious? - User analysis of SF Express Inc

CS 229 PROJECT, DEC. 2017 1 Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

A Comparison of Algorithms used to measure the Similarity between two documents

Khuat Thanh Tung, Nguyen Duc Hung, Le Thi My Hanh Abstract Nowadays, measuring the similarity of documents plays an important

Predictive Indexing for Fast Search

Sharad Goel Yahoo! Research New York, NY 10018 goel@yahoo-inc.com John Langford Yahoo! Research New York, NY 10018 jl@yahoo-inc.com Alex Strehl Yahoo! Research New York,