SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content
SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content

Xianling Mao, Xiaobing Liu, Nan Di, Xiaoming Li, and Hongfei Yan
Department of Computer Science and Technology, Peking University

Abstract. Detecting whether two Web pages are near replicas, in terms of their contents rather than their files, is of great importance in many Web-information-based applications, and many deduplication algorithms have been proposed. Nevertheless, analysis and experiments show that existing algorithms usually do not work well for short Web pages¹, due to the relatively large portion of noisy information, such as ads and site templates, in the corresponding files. In this paper, we analyze the critical issues in deduplicating short Web pages and present an algorithm (AF_SpotSigs) that incorporates them, which works 15% better than the state-of-the-art method. We then propose an algorithm (SizeSpotSigs) that takes the size of the page content into account and can handle both short and long Web pages. The contributions of SizeSpotSigs are three-fold: 1) we analyze the relation between noise-content ratio and similarity, and propose two rules for making such methods work better; 2) based on this analysis, we propose 3 new features for Chinese that improve effectiveness on short Web pages; 3) we present an algorithm named SizeSpotSigs for near-duplicate detection that considers the size of the core content of a Web page. Experiments confirm that SizeSpotSigs works better than state-of-the-art approaches such as SpotSigs over a demonstrative Mixer collection of manually assessed near-duplicate news articles, including both short and long Web pages.

Keywords: Deduplicate, Near Duplicate Detection, AF_SpotSigs, SizeSpotSigs, Information Retrieval.

1 Introduction

Detection of duplicate or near-duplicate Web pages is an important and difficult problem for Web search engines.
Lots of algorithms have been proposed in recent years [6,8,20,13,18]. Most approaches can be characterized as different types of distance or overlap measures operating on the HTML strings. State-of-the-art algorithms, such as Broder et al.'s [2] and Charikar's [3], achieve reasonable precision or recall. In particular, SpotSigs [19] can avert the process of removing noise from a Web page because of its smart feature selection.

¹ In this paper, Web pages are classified into long (Web) pages and short (Web) pages based on their core content size.

J.Z. Huang, L. Cao, and J. Srivastava (Eds.): PAKDD 2011, Part I, LNAI 6634, pp , c Springer-Verlag Berlin Heidelberg 2011

Existing deduplication
algorithms do not take the size of the page's core content into account. Essentially, these algorithms are more suitable for long Web pages because they use only surface features to represent documents. For short documents, however, this representation is not sufficient, especially when the documents contain noisy information such as ads. Our experiments in Section 5.3 also show that the state-of-the-art deduplication algorithm performs relatively poorly on short Web pages: just 0.62 (F1) against 0.92 (F1) for long Web pages. In fact, there is a large number of short Web pages with duplicated core content on the World Wide Web, and they are also very important; consider, for example, a central bank announcing a message such as an interest rate adjustment. Fig. 1 shows a pair of same-core Web pages that only differ in the framing, advertisements, and navigational banners. Both articles exhibit almost identical core contents, reporting on the match review between Uruguay and the Netherlands.

Fig. 1. Near-duplicate Web pages: identical core content with different framing and banners (additional ads and related links removed); the core contents are short.

So it is important and necessary to improve the effectiveness of deduplication for short Web pages.
Our contributions are:

1. Analyze the relation between noise-content ratio and similarity, and propose two rules for making such methods work better;
2. Based on our analysis, propose 3 new features for Chinese that improve effectiveness on short Web pages, which leads to the AF_SpotSigs algorithm;
3. Present an algorithm named SizeSpotSigs for near-duplicate detection that considers the size of the core content of a Web page.

2 Related Work

There are two families of methods for near-duplicate detection: content-based methods and non-content-based methods. Content-based methods detect near duplicates by computing the similarity between the contents of documents, while non-content-based methods make use of non-content features [10,1,17] (e.g., URL patterns). Non-content-based methods can only detect near-duplicate pages within one web site, while content-based methods have no such limitation. Content-based algorithms can be further divided into two groups according to whether they require noise removal; most existing content-based deduplication algorithms do. Broder et al. [6] proposed the DSC algorithm (also called Shingling), which detects near duplicates by computing similarity among the shingle sets of the documents. The similarity between two documents is computed with the common Jaccard overlap measure between their shingle sets. To reduce the complexity of Shingling on large collections, DSC-SS (also called super shingles) was later proposed by Broder [5]. DSC-SS makes use of meta-shingles, i.e., shingles of shingles, with only a little decrease in precision. A variety of methods for choosing good shingles are investigated by Hoad and Zobel [14]. Buttcher and Clarke [7] focus on Kullback-Leibler divergence in the more general context of search.
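The shingle-set comparison at the heart of DSC can be sketched as follows. This is a minimal illustration, not Broder's exact implementation: the window size is arbitrary and the hashing/sampling step that makes Shingling scale is omitted.

```python
def shingles(text, k=4):
    """Word-level k-grams (shingles) of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

def resemblance(doc1, doc2, k=4):
    """Jaccard overlap between the two documents' shingle sets."""
    s1, s2 = shingles(doc1, k), shingles(doc2, k)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 1.0
```

Two documents sharing most of their k-grams then score close to 1, which is the signal DSC thresholds on.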
A larger-scale evaluation was carried out by Henzinger [13], comparing the precision of the shingling and simhash algorithms with their parameters adjusted to maintain almost the same recall. The experiment shows that neither algorithm works well for finding near-duplicate pairs on the same site, because of the influence of templates, while both achieve a higher precision for near-duplicate pairs on different sites. [21] proposed that near-duplicate clustering should incorporate information about document attributes or the content structure. Another widespread duplicate detection technique is to generate a document fingerprint, a compact description of the document, and then to compute the pair-wise similarity of document fingerprints; the assumption is that fingerprints can be compared much more quickly than complete documents. A common method of generating fingerprints is to select a set of character sequences from a document and to generate a fingerprint based on the hash values of these sequences. Similarity between two documents is measured by the Jaccard
formula. Different algorithms are characterized, and their computational costs determined, by the hash functions and by how the character sequences are selected. Manber [18] started this line of research. The I-Match algorithm [9,16] uses external collection statistics and increases recall by using multiple fingerprints per document. Position-based schemes [4] select strings based on their offset in a document. Broder et al. [6] pick strings whose hash values are multiples of an integer. Indyk and Motwani [12,15] proposed Locality Sensitive Hashing (LSH), an approximate similarity search technique that scales to both large and high-dimensional data sets. There are many variants of LSH, such as LSH-Tree [3] and Hamming-LSH [11]. Generally, noise removal is an expensive operation; if possible, a near-duplicate detection algorithm should avoid it. Theobald et al. proposed the SpotSigs [19] algorithm, which uses word chains around stop words as features to construct the feature set. For example, consider the sentence: "On a street in Milton, in the city's inner-west, one woman wept as she toured her waterlogged home." Choosing the articles a, an, the and the verb is as antecedents with a uniform spot distance of 1 and chain length of 2, we obtain the set of spot signatures S = {a:street:milton, the:city's:inner-west}. SpotSigs needs only a single pass over a corpus, which is much more efficient, easier to implement, and less error-prone because expensive layout analysis is omitted; meanwhile, it remains largely independent of the input format. This method is taken as our baseline. In this paper, considering these merits, we focus on algorithms without noise removal, and we also take the Jaccard overlap measure as our similarity measure.

3 Relation between Noise-Content Ratio and Similarity

3.1 Concepts and Notation

For calculating the similarity, we need to extract features from Web pages.
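As a concrete aside, the spot-signature construction described in Section 2 can be sketched as below. It is a simplified reading of SpotSigs, not the reference implementation: the stopword and antecedent lists are tiny illustrative subsets, and tokenization is naive.

```python
STOPWORDS = {"on", "a", "an", "the", "is", "in", "one", "as", "she"}
ANTECEDENTS = {"a", "an", "the", "is"}

def spot_signatures(text, chain_len=2, distance=1):
    """For each antecedent occurrence, chain the following `chain_len`
    non-stopwords, stepping `distance` non-stopwords at a time."""
    words = [w.strip(".,;:'") for w in text.lower().split()]
    sigs = set()
    for i, w in enumerate(words):
        if w in ANTECEDENTS:
            rest = [x for x in words[i + 1:] if x not in STOPWORDS]
            chain = rest[distance - 1::distance][:chain_len]
            if len(chain) == chain_len:
                sigs.add(":".join([w] + chain))
    return sigs
```

On the example sentence this yields signatures such as a:street:milton, because the stopword "in" is skipped when the chain is formed.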
We define all the features from one page as the page-feature set, and we split these features into a content-feature set and a noise-feature set. A feature that comes from the core content of a page is called a content feature (element) and belongs to the content-feature set; otherwise, the feature is called a noise feature (element) and belongs to the noise-feature set. The noise-content (feature) ratio is the ratio between the size of the noise-feature set and the size of the content-feature set.

3.2 Theoretical Analysis

Let sim(P_1, P_2) = |P_1 \cap P_2| / |P_1 \cup P_2| be the default Jaccard similarity defined over two sets P_1 and P_2, each consisting of a distinct page-feature set in our case. P_{1c} and P_{2c} are the content-feature sets; P_{1n} and P_{2n} are the noise-feature sets, subject to P_{1c} \cup P_{1n} = P_1 and P_{2c} \cup P_{2n} = P_2. The similarity between P_{1c} and P_{2c} is sim(P_{1c}, P_{2c}) = |P_{1c} \cap P_{2c}| / |P_{1c} \cup P_{2c}|, which is the real value we care about in near-duplicate detection.
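The notation above can be made concrete with a small worked example; the feature names below are hypothetical stand-ins for extracted features.

```python
def jaccard(p1, p2):
    """sim(P1, P2) = |P1 ∩ P2| / |P1 ∪ P2|."""
    return len(p1 & p2) / len(p1 | p2) if p1 | p2 else 1.0

# Hypothetical page-feature sets: content features plus noise features.
p1c, p1n = {"f1", "f2", "f3", "f4"}, {"ad1"}   # content / noise of page 1
p2c, p2n = {"f1", "f2", "f3", "f5"}, {"ad2"}   # content / noise of page 2

observed = jaccard(p1c | p1n, p2c | p2n)   # sim(P1, P2): what we can compute
target = jaccard(p1c, p2c)                 # sim(P1c, P2c): what we actually want
noise_content_ratio = len(p1n) / len(p1c)  # 1/4 = 0.25 for page 1
```

Here the noise features drag the observed similarity (3/7) below the true core similarity (3/5), which is exactly the gap the next section bounds.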
As we know, near-duplicate detection in fact compares the similarity of the core contents of two pages, but Web pages contain much noisy content, such as banners and ads. Most algorithms use sim(P_1, P_2) to approximate sim(P_{1c}, P_{2c}). If sim(P_1, P_2) is close to sim(P_{1c}, P_{2c}), the near-duplicate detection algorithm works well, and vice versa. To describe the difference between sim(P_1, P_2) and sim(P_{1c}, P_{2c}), we have Theorem 1:

Theorem 1. Given two sets P_1 and P_2, subject to P_{1c} \subseteq P_1, P_{1n} \subseteq P_1 and P_{1c} \cup P_{1n} = P_1; similarly, P_{2c} \subseteq P_2, P_{2n} \subseteq P_2 and P_{2c} \cup P_{2n} = P_2. At the same time, sim(P_1,P_2) = |P_1 \cap P_2| / |P_1 \cup P_2| and sim(P_{1c},P_{2c}) = |P_{1c} \cap P_{2c}| / |P_{1c} \cup P_{2c}|. Let the noise-content ratios satisfy |P_{1n}|/|P_{1c}| \le \varepsilon and |P_{2n}|/|P_{2c}| \le \varepsilon, where \varepsilon is a small number. Then

  -\frac{2\varepsilon}{1+2\varepsilon} \le \mathrm{sim}(P_1,P_2) - \mathrm{sim}(P_{1c},P_{2c}) \le 2\varepsilon    (1)

Proof: Let A = |P_{1c} \cap P_{2c}|, B = |P_{1c} \cup P_{2c}|, and write m = \max\{|P_{1n}|, |P_{2n}|\} for short. Then

  A \le |(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})| \le A + 2m    (2)
  B \le |(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})| \le B + 2m    (3)

From (2) and (3), we get the following inequality:

  \frac{A}{B + 2m} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} \le \frac{A + 2m}{B}    (4)

From (4), subtracting A/B throughout:

  -\frac{2Am}{B(B + 2m)} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} - \frac{A}{B} \le \frac{2m}{B}    (5)

Obviously, A \le B and B \ge \max\{|P_{1c}|, |P_{2c}|\}. So we get:

  \frac{m}{B} \le \frac{m}{\max\{|P_{1c}|, |P_{2c}|\}} \le \varepsilon    (6)

For the other bound:

  \frac{2Am}{B(B + 2m)} = \frac{2A}{B} \cdot \frac{m}{B + 2m} \le \frac{2m}{B + 2m} \le \frac{2m}{\max\{|P_{1c}|, |P_{2c}|\} + 2m} = \frac{2m / \max\{|P_{1c}|, |P_{2c}|\}}{1 + 2m / \max\{|P_{1c}|, |P_{2c}|\}} \le \frac{2\varepsilon}{1 + 2\varepsilon}    (7)
So (5) can be rewritten as:

  -\frac{2\varepsilon}{1+2\varepsilon} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} - \frac{A}{B} \le 2\varepsilon    (8)

That is,

  -\frac{2\varepsilon}{1+2\varepsilon} \le \mathrm{sim}(P_1,P_2) - \mathrm{sim}(P_{1c},P_{2c}) \le 2\varepsilon    (9)

Theorem 1 shows: (1) when \varepsilon is small enough, sim(P_1,P_2) is close to sim(P_{1c},P_{2c}); (2) once \varepsilon reaches a certain small value, the difference between the two similarities is already small and varies little even as \varepsilon continues to shrink. That is, once the noise-content ratio falls below a certain small value, further improvement in the effectiveness of a near-duplicate detection algorithm will be small. Without loss of generality, assume |P_{2n}|/|P_{2c}| \le |P_{1n}|/|P_{1c}| = \varepsilon. Then Formula (9) can be rewritten as:

  -\frac{2|P_{1n}|}{|P_{1c}| + 2|P_{1n}|} \le \mathrm{sim}(P_1,P_2) - \mathrm{sim}(P_{1c},P_{2c}) \le \frac{2|P_{1n}|}{|P_{1c}|}    (10)

Formula (10) shows that |P_{1c}| should be large for robustness; otherwise, a slight change in |P_{1c}| or |P_{1n}| causes a fierce change in the upper and lower bounds, which shows the algorithm is not robust. For example, assume two upper bounds of 5/100 each; after combining the feature sets, the upper bound becomes (5+5)/(100+100), which is equal to 5/100, but (5+1)/100 > (5+5+1)/(100+100). Obviously, (5+5)/(100+100) is more robust than 5/100, though they have the same value. In a word, when \varepsilon is relatively large, we can make the algorithm work better by two rules: (a) select features with a small noise-content ratio to improve effectiveness; (b) when the noise-content ratios of two types of features are the same, select the feature type with the larger content-feature set to make the algorithm robust; this implies that if the noise-content ratios of several feature types are very close, these features should be combined to increase robustness while effectiveness changes little.
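Theorem 1 can be checked numerically. The randomized sanity check below is illustrative only (the set sizes and \varepsilon = 0.2 are arbitrary choices, not values from the paper); noise elements are tagged tuples so they never collide with content elements.

```python
import random

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

random.seed(0)
eps = 0.2
for _ in range(1000):
    # Content sets drawn from a shared universe; noise sets are disjoint
    # from content and bounded so |P1n|/|P1c| <= eps and |P2n|/|P2c| <= eps.
    p1c = set(random.sample(range(200), 50))
    p2c = set(random.sample(range(200), 50))
    p1n = {("n1", i) for i in range(random.randint(0, int(eps * len(p1c))))}
    p2n = {("n2", i) for i in range(random.randint(0, int(eps * len(p2c))))}
    diff = jaccard(p1c | p1n, p2c | p2n) - jaccard(p1c, p2c)
    assert -2 * eps / (1 + 2 * eps) <= diff <= 2 * eps
```

Every sampled pair falls inside the interval of Formula (9), as the theorem guarantees.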
4 AF_SpotSigs and SizeSpotSigs Algorithms

SpotSigs [19] provided a stopword feature aimed at filtering natural-language text passages out of noisy Web page components; that is, its noise-content ratio was small. This gives us the intuition that we should choose features that tend to occur mostly in the core content of Web documents and skip over advertisements, banners, and navigational components. In this paper, based on the idea behind SpotSigs and our analysis in Section 3.2, we developed four features which
all have a small noise-content ratio. Details are as follows:

Fig. 2. Table of meanings of the Chinese punctuation marks and table of markers of the Chinese stopwords used in the paper.

1) Stopword feature. It is similar to the feature in SpotSigs, a string of a stopword and its neighboring words, except that the stopwords are different because the languages are different. Because stopwords occur less often in noisy content than in core content, these features decrease the noise-content ratio compared with Shingling features. The Chinese stopwords and their corresponding markers used in this paper are listed in Fig. 2. 2) Chinese punctuation feature. In English, many punctuation marks coincide with special characters of the HTML language, so in English we cannot use punctuation to extract features. In Chinese, however, this is not the case. As we know, Chinese punctuation occurs less often in noisy areas. We choose a string of a punctuation mark and its neighboring words as the Chinese punctuation feature, which keeps the noise-content ratio small. The Chinese punctuation marks and their corresponding English punctuation marks used in this paper are also listed in Fig. 2. 3) Sentence feature. The string between two Chinese punctuation marks is taken as a sentence; since sentences with punctuation are rare in noisy areas, sentence features decrease the noise-content ratio notably. 4) Sentence shingling feature. Assuming the length of a sentence is n, all 1-grams, 2-grams, ..., (n-1)-grams are taken as new features, aiming to increase the size of the content-feature set for robustness and effectiveness; building on the sentence feature, this also keeps the noise-content ratio small.

The stopword feature is used by the state-of-the-art algorithm, SpotSigs [19]. Though the stopwords are different because the languages are different, we still call the algorithm SpotSigs.
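Features 3) and 4) can be sketched as follows. This is a minimal sketch: the punctuation string covers only a few common marks, shingles are character-level, and no value here comes from the paper's Fig. 2.

```python
import re

CH_PUNCT = "，。！？；："   # a few common Chinese punctuation marks

def sentence_features(text):
    """Features 3) and 4): sentences delimited by Chinese punctuation,
    plus every 1..(n-1)-gram of each sentence as shingles."""
    feats = set()
    for sent in re.split("[" + CH_PUNCT + "]", text):
        if not sent:
            continue
        feats.add(sent)                            # 3) sentence feature
        n = len(sent)
        for k in range(1, n):                      # 4) sentence shingling
            feats.update(sent[i:i + k] for i in range(n - k + 1))
    return feats
```

Each sentence contributes itself plus all of its sub-grams, which is what inflates the content-feature set for robustness.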
The experiments in Section 5.3 show that SpotSigs reaches 0.92 (F1) on long Web pages, but only 0.62 on short Web pages. Obviously, SpotSigs cannot process short Web pages well, and a new algorithm is needed. If all four features are used to detect near duplicates, we call the algorithm AF_SpotSigs. The experiments in Section 5.3 show that AF_SpotSigs reaches 0.77 (F1) against 0.62 (F1) for SpotSigs on short Web pages, but gains only 0.04 (F1) at 28.8 times the time overhead on long Web pages. This shows that AF_SpotSigs works better than SpotSigs for short Web pages, while for long Web pages its effectiveness is only slightly better and its cost is much higher. Balancing efficiency against effectiveness, we propose an algorithm called SizeSpotSigs that chooses only stopword features to judge near duplication for long Web pages (namely SpotSigs), while choosing all four feature types mentioned above for short Web pages (namely AF_SpotSigs).
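The SizeSpotSigs control flow amounts to a size-based dispatch; a sketch follows. The byte threshold is a hypothetical cut-off for illustration: the paper selects the short/long partition point empirically in Section 5.4 rather than fixing a byte count.

```python
def size_spotsigs(page_core_size, spotsigs_detect, af_spotsigs_detect,
                  threshold=3000):
    """Pick the feature extractor by core-content size (in bytes).
    Short pages get the richer but costlier four-feature AF_SpotSigs;
    long pages get the cheap stopword-only SpotSigs."""
    if page_core_size < threshold:
        return af_spotsigs_detect   # short page: all four feature types
    return spotsigs_detect          # long page: stopword features only
```

The two detectors are passed in as callables so the dispatch stays independent of how the features themselves are extracted.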
5 Experiment

5.1 Data Set

To verify our algorithms, AF_SpotSigs and SizeSpotSigs, we construct 4 datasets. Details are as follows. Collection Shorter / Collection Longer: we construct the collections Shorter and Longer manually. Collection Shorter has 379 short Web pages in 48 clusters; Collection Longer has 332 long Web pages in 40 clusters. Collection Mixer / Collection Mixer_Purity: Collections Shorter and Longer are mixed into Collection Mixer, which includes 88 clusters and 711 Web pages in total. For each Web page in Collection Mixer, we extract its core content according to human judgment, which yields Collection Mixer_Purity.

5.2 Choice of Stopwords

Because the number of stopwords is large (more than 370 in Chinese), we need to select the most representative stopwords to improve performance. SpotSigs, however, only experimented on an English collection, so it is unknown how to choose the stopwords or the length of their neighboring word chains on a Chinese collection. At the same time, for AF_SpotSigs, we also need to choose the stopwords and the chain length. We find that F1 varies only slightly, about 1 absolute percent, from a chain length of 1 to 3 (figures omitted), so we choose two words as the length parameter for both algorithms. In this section, we seek the best combination of stopwords for AF_SpotSigs and SpotSigs on Chinese. We consider variations in the choice of SpotSigs antecedents (stopwords and their neighboring words), aiming at a good compromise between extracting characteristic signatures and avoiding over-fitting of these signatures to particular articles or sites. For SpotSigs, which fits long Web pages, the best combination was searched on the collection Longer_Sample, sampled as 1/3 of the clusters from collection Longer.
Moreover, for AF_SpotSigs, which fits short Web pages, we tune the parameter over the collection Shorter_Sample, sampled as 1/3 of the clusters from collection Shorter. Fig. 3(a) shows that we obtain the best F1 result for SpotSigs with the combination {De1, Di, De2, Shi, Ba, Le}, which mostly occur in core contents and are less likely to occur in ads or navigational banners. Meanwhile, for AF_SpotSigs, Fig. 3(b) shows the best F1 result is obtained with the stopword De1 alone. Using a full stopword list (here, the 40 most frequent stopwords) already tends to yield overly generic signatures but still performs significantly well.

5.3 AF_SpotSigs vs. SpotSigs

After obtaining the parameters of AF_SpotSigs and SpotSigs, we compare the two algorithms on F1 value and computing cost by running both on Collections Shorter and Longer.
Fig. 3. (a) The effectiveness of SpotSigs with different stopword combinations on the Longer collection; (b) the effectiveness of AF_SpotSigs with different stopword combinations on the Shorter collection.

Fig. 4 shows that the F1 scores of AF_SpotSigs are better than those of SpotSigs on both Shorter and Longer. Moreover, the F1 score of SpotSigs is far worse than that of AF_SpotSigs on Shorter, while the F1 scores of the two algorithms are very close on Longer. However, Table 1 shows that AF_SpotSigs takes much more time than SpotSigs. Balancing effectiveness against efficiency, we can partition a collection into two parts, a short part and a long part: SpotSigs works on the long part while AF_SpotSigs runs on the short part. This is the SizeSpotSigs algorithm.

Fig. 4. The effectiveness of SpotSigs and AF_SpotSigs on Shorter and Longer.

Table 1. The F1 value and cost (in seconds) of the two algorithms.
Fig. 5. F1 values of SizeSpotSigs, AF_SpotSigs and SpotSigs on Collections Mixer_Purity (a) and Mixer (b), over cluster partition points.

5.4 SizeSpotSigs over SpotSigs and AF_SpotSigs

To verify SizeSpotSigs, all clusters in Mixer are sorted from small to large by the average size of their core contents. We select three partition points (22, 44, 66) to split the set of clusters. For example, if the partition point is 22, the first 22 clusters in the sorted order are taken as the small part while the remaining clusters form the large part. Table 2 shows the nature of the two parts for every partition. In particular, 0/88 means that all clusters fall into the large part, which makes SizeSpotSigs become SpotSigs, while 88/0 means that all clusters belong to the small part, which makes SizeSpotSigs become AF_SpotSigs. Fig. 5(b) shows that SizeSpotSigs works better than SpotSigs but worse than AF_SpotSigs; moreover, the F1 value of SizeSpotSigs increases as the partition point increases. When the purified collection is used, the noise-content ratio is zero, so by Formula (9), sim(P_1,P_2) = sim(P_{1c},P_{2c}), and the F1 value depends completely on sim(P_{1c},P_{2c}). Fig. 5(a) shows that the F1 of SizeSpotSigs rises and falls irregularly, but within a reasonable interval. All details are listed in Table 3.

Table 2. The nature of the partitions (0/88, 22/66, 44/44, 66/22, 88/0): average size in bytes and file number of each part.
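The partitioning step described above can be sketched as follows; the mapping from cluster id to a list of core-content sizes is a hypothetical input shape chosen for illustration.

```python
def partition_clusters(cluster_sizes, point):
    """Sort clusters by average core-content size (ascending) and split at
    `point`: the first `point` clusters form the small part (handled by
    AF_SpotSigs), the rest the large part (handled by SpotSigs)."""
    ordered = sorted(cluster_sizes,
                     key=lambda c: sum(cluster_sizes[c]) / len(cluster_sizes[c]))
    return ordered[:point], ordered[point:]
```

With point = 0 every cluster lands in the large part (pure SpotSigs); with point equal to the number of clusters, everything lands in the small part (pure AF_SpotSigs), matching the 0/88 and 88/0 cases.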
Table 3. The F1 value and time (in seconds) for the 3 algorithms on the partitions, on Mixer and Mixer_Purity: SpotSigs (0/88), AF_SpotSigs (88/0), and SizeSpotSigs (22/66, 44/44, 66/22).

6 Conclusions and Future Work

We analyzed the relation between noise-content ratio and similarity theoretically, which leads to two rules that make near-duplicate detection algorithms work better. The paper then proposed 3 new features to improve the effectiveness and robustness for short Web pages, which led to our AF_SpotSigs method. Experiments confirm that the 3 new features are effective, and our AF_SpotSigs works 15% better than the state-of-the-art method for short Web pages. Besides, SizeSpotSigs, which considers the size of the page core content, performs better than SpotSigs over different partition points. Future work will focus on: 1) how to decide the size of the core content of a Web page automatically or approximately; 2) designing more features fit for short Web pages to improve effectiveness, as well as generalizing the bounding approach toward other metrics such as Cosine.

Acknowledgments. This work is supported by NSFC Grant No , and , FSSP 2010 Grant No.15. We thank Jing He and Dongdong Shan for a quick review of our paper close to the submission deadline.

References

1. Agarwal, A., Koppula, H., Leela, K., Chitrapura, K., Garg, S., GM, P., Haty, C., Roy, A., Sasturkar, A.: URL normalization for de-duplication of web pages. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, New York (2009)
2. Baeza-Yates, R., Ribeiro-Neto, B., et al.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
3. Bawa, M., Condie, T., Ganesan, P.: LSH forest: self-tuning indexes for similarity search. In: Proceedings of the 14th International Conference on World Wide Web. ACM, New York (2005)
4.
Brin, S., Davis, J., Garcia-Molina, H.: Copy detection mechanisms for digital documents. ACM SIGMOD Record 24(2), 409 (1995)
5. Broder, A.: Identifying and filtering near-duplicate documents. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848. Springer, Heidelberg (2000)
6. Broder, A., Glassman, S., Manasse, M., Zweig, G.: Syntactic clustering of the web. Computer Networks and ISDN Systems 29(8-13) (1997)
7. Buttcher, S., Clarke, C.: A document-centric approach to static index pruning in text retrieval systems. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management. ACM, New York (2006)
8. Charikar, M.: Similarity estimation techniques from rounding algorithms. In: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing. ACM, New York (2002)
9. Chowdhury, A., Frieder, O., Grossman, D., McCabe, M.: Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS) 20(2), 191 (2002)
10. Dasgupta, A., Kumar, R., Sasturkar, A.: De-duping URLs via rewrite rules. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York (2008)
11. Datar, M., Gionis, A., Indyk, P., Motwani, R., Ullman, J., et al.: Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering 13(1) (2001)
12. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., San Francisco (1999)
13. Henzinger, M.: Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York (2006)
14. Hoad, T., Zobel, J.: Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology 54(3) (2003)
15. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing. ACM, New York (1998)
16.
Kolcz, A., Chowdhury, A., Alspector, J.: Improved robustness of signature-based near-replica detection via lexicon randomization. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York (2004)
17. Koppula, H., Leela, K., Agarwal, A., Chitrapura, K., Garg, S., Sasturkar, A.: Learning URL patterns for webpage de-duplication. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, New York (2010)
18. Manber, U.: Finding similar files in a large file system. In: Proceedings of the USENIX Winter 1994 Technical Conference, San Francisco, CA, USA (1994)
19. Theobald, M., Siddharth, J., Paepcke, A.: SpotSigs: robust and efficient near duplicate detection in large web collections. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York (2008)
20. Whitten, A.: Scalable document fingerprinting. In: The USENIX Workshop on E-Commerce (1996)
21. Yang, H., Callan, J.: Near-duplicate detection by instance-level constrained clustering. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York (2006)
A Fast Text Similarity Measure for Large Document Collections Using Multi-reference Cosine and Genetic Algorithm
A Fast Text Similarity Measure for Large Document Collections Using Multi-reference Cosine and Genetic Algorithm Hamid Mohammadi Department of Computer Engineering K. N. Toosi University of Technology
More informationSpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections
SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections Martin Theobald Jonathan Siddharth Andreas Paepcke Stanford University Department of Computer Science 353 Serra Mall, Stanford
More informationDetecting Near-Duplicates in Large-Scale Short Text Databases
Detecting Near-Duplicates in Large-Scale Short Text Databases Caichun Gong 1,2, Yulan Huang 1,2, Xueqi Cheng 1, and Shuo Bai 1 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing,
More informationAchieving both High Precision and High Recall in Near-duplicate Detection
Achieving both High Precision and High Recall in Near-duplicate Detection Lian en Huang Institute of Network Computing and Distributed Systems, Peking University Beijing 100871, P.R.China hle@net.pku.edu.cn
More informationCombinatorial Algorithms for Web Search Engines - Three Success Stories
Combinatorial Algorithms for Web Search Engines - Three Success Stories Monika Henzinger Abstract How much can smart combinatorial algorithms improve web search engines? To address this question we will
More informationSpotSigs: Near-Duplicate Detection in Web Page Collections
SpotSigs: Near-Duplicate Detection in Web Page Collections Masters Thesis Report Siddharth Jonathan Computer Science, Stanford University ABSTRACT Motivated by our work with political scientists we present
More informationAdaptive Near-Duplicate Detection via Similarity Learning
Adaptive Near-Duplicate Detection via Similarity Learning Hannaneh Hajishirzi University of Illinois 201 N Goodwin Ave Urbana, IL, USA hajishir@uiuc.edu Wen-tau Yih Microsoft Research One Microsoft Way
More informationDetection of Distinct URL and Removing DUST Using Multiple Alignments of Sequences
Detection of Distinct URL and Removing DUST Using Multiple Alignments of Sequences Prof. Sandhya Shinde 1, Ms. Rutuja Bidkar 2,Ms. Nisha Deore 3, Ms. Nikita Salunke 4, Ms. Neelay Shivsharan 5 1 Professor,
More informationHYBRIDIZED MODEL FOR EFFICIENT MATCHING AND DATA PREDICTION IN INFORMATION RETRIEVAL
International Journal of Mechanical Engineering & Computer Sciences, Vol.1, Issue 1, Jan-Jun, 2017, pp 12-17 HYBRIDIZED MODEL FOR EFFICIENT MATCHING AND DATA PREDICTION IN INFORMATION RETRIEVAL BOMA P.
More informationA LITERATURE SURVEY ON WEB CRAWLERS
A LITERATURE SURVEY ON WEB CRAWLERS V. Rajapriya School of Computer Science and Engineering, Bharathidasan University, Trichy, India rajpriyavaradharajan@gmail.com ABSTRACT: The web contains large data
More informationAutomated Path Ascend Forum Crawling
Automated Path Ascend Forum Crawling Ms. Joycy Joy, PG Scholar Department of CSE, Saveetha Engineering College,Thandalam, Chennai-602105 Ms. Manju. A, Assistant Professor, Department of CSE, Saveetha Engineering
More informationThe Chinese Duplicate Web Pages Detection Algorithm based on Edit Distance
1666 JOURNAL OF SOFTWARE, VOL. 8, NO. 7, JULY 2013 The Chinese Duplicate Web Pages Detection Algorithm based on Edit Distance Junxiu An Chengdu University of Information Technology, Chengdu, P.R.China
More informationNear Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri
Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions
More informationCADIAL Search Engine at INEX
CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr
Mining Quantitative Association Rules on Overlapped Intervals Qiang Tong 1,3, Baoping Yan 2, and Yuanchun Zhou 1,3 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China {tongqiang,
Automation of URL Discovery and Flattering Mechanism in Live Forum Threads T.Nagajothi 1, M.S.Thanabal 2 PG Student, Department of CSE, P.S.N.A College of Engineering and Technology, Tamilnadu, India 1
Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Zhe Wang Princeton University Jim Gemmell Microsoft Research September 2005 Technical Report MSR-TR-2006-30 Microsoft Research Microsoft
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING Lavanya Pamulaparty 1, Dr. C.V Guru Rao 2 and Dr. M. Sreenivasa Rao 3 1 Department of CSE, Methodist college of Engg. & Tech., OU,
From Passages into Elements in XML Retrieval Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles
Candidate Document Retrieval for Arabic-based Text Reuse Detection on the Web Leena Lulu, Boumediene Belkhouche, Saad Harous College of Information Technology United Arab Emirates University Al Ain, United
Table of Content Topic: Duplicate Detection and Similarity Computing Motivation Shingling for duplicate comparison Minhashing LSH UCSB 290N, 2013 Tao Yang Some of slides are from text book [CMS] and Rajaraman/Ullman
PROBABILISTIC SIMHASH MATCHING A Thesis by SADHAN SOOD Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms Monika Henzinger Google Inc. & Ecole Fédérale de Lausanne (EPFL) monika@google.com ABSTRACT Broder et al. s [3] shingling algorithm
Clustering-Based Distributed Precomputation for Quality-of-Service Routing* Yong Cui and Jianping Wu Department of Computer Science, Tsinghua University, Beijing, P.R.China, 100084 cy@csnet1.cs.tsinghua.edu.cn,
September, 2014 UPC Metric Learning Applied for Automatic Large Image Classification Supervisors SAHILU WENDESON / IT4BI TOON CALDERS (PhD)/ULB SALIM JOUILI (PhD)/EuraNova Image Database Classification
RESEARCH REPORT IDIAP Making Retrieval Faster Through Document Clustering David Grangier 1 Alessandro Vinciarelli 2 IDIAP RR 04-02 January 23, 2004 Dalle Molle Institute
A Language Independent Author Verifier Using Fuzzy C-Means Clustering Notebook for PAN at CLEF 2014 Pashutan Modaresi 1,2 and Philipp Gross 1 1 pressrelations GmbH, Düsseldorf, Germany {pashutan.modaresi,
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 9: Data Mining (3/4) March 8, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides
Appears in WWW 04 Workshop: Measuring Web Effectiveness: The User Perspective, New York, NY, May 18, 2004 Enabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines Anselm
Semantic Website Clustering I-Hsuan Yang, Yu-tsun Huang, Yen-Ling Huang 1. Abstract We propose a new approach to cluster the web pages. Utilizing an iterative reinforced algorithm, the model extracts semantic
NDoT: Nearest Neighbor Distance Based Outlier Detection Technique Neminath Hubballi 1, Bidyut Kr. Patra 2, and Sukumar Nandi 1 1 Department of Computer Science & Engineering, Indian Institute of Technology
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
Appl. Math. Inf. Sci. 8, No. 1, 321-326 (2014) 321 Applied Mathematics & Information Sciences An International Journal http://dx.doi.org/10.12785/amis/080139 A Two-Tier Distributed Full-Text Indexing System
Towards a hybrid approach to Netflix Challenge Abhishek Gupta, Abhijeet Mohapatra, Tejaswi Tenneti March 12, 2009 1 Introduction Today Recommendation systems [3] have become indispensible because of the
ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES Fidel Cacheda, Alberto Pan, Lucía Ardao, Angel Viña Department of Tecnoloxías da Información e as Comunicacións, Facultad
New Issues in Near-duplicate Detection Martin Potthast and Benno Stein Bauhaus University Weimar Web Technology and Information Systems Motivation About 30% of the Web is redundant. [Fetterly 03, Broder
Exploiting Index Pruning Methods for Clustering XML Collections Ismail Sengor Altingovde, Duygu Atilgan and Özgür Ulusoy Department of Computer Engineering, Bilkent University, Ankara, Turkey {ismaila,
A New Measure of the Cluster Hypothesis Mark D. Smucker 1 and James Allan 2 1 Department of Management Sciences University of Waterloo 2 Center for Intelligent Information Retrieval Department of Computer
Ternary Tree Optimalization for n-gram Indexing Daniel Robenek, Jan Platoš, Václav Snášel Department of Computer Science, FEI, VSB Technical
International Journal of Information and Communication Sciences 2018; 3(3): 88-95 http://www.sciencepublishinggroup.com/j/ijics doi: 10.11648/j.ijics.20180303.12 ISSN: 2575-1700 (Print); ISSN: 2575-1719
The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 2, February 2014,
Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 9: Data Mining (3/4) March 7, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides
American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far
Mining di Dati Web Lezione 3 - Clustering and Classification Introduction Clustering and classification are both learning techniques They learn functions describing data Clustering is also known as Unsupervised
International Journal of Scientific & Engineering Research, Volume 5, Issue 10, October-2014 559 DCCR: Document Clustering by Conceptual Relevance as a Factor of Unsupervised Learning Annaluri Sreenivasa
Learning the Three Factors of a Non-overlapping Multi-camera Network Topology Xiaotang Chen, Kaiqi Huang, and Tieniu Tan National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy
Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,
Finding Similar Items:Nearest Neighbor Search Barna Saha February 23, 2016 Finding Similar Items A fundamental data mining task Finding Similar Items A fundamental data mining task May want to find whether
Plagiarism detection by similarity join Version of August 11, 2009 R. Schellenberger Plagiarism detection by similarity join THESIS submitted in partial fulfillment of the requirements for the degree
Automatic Query Type Identification Based on Click Through Information Yiqun Liu 1,MinZhang 1,LiyunRu 2, and Shaoping Ma 1 1 State Key Lab of Intelligent Tech. & Sys., Tsinghua University, Beijing, China
Online Document Clustering Using the GPU Benjamin E. Teitler, Jagan Sankaranarayanan, Hanan Samet Center for Automation Research Institute for Advanced Computer Studies Department of Computer Science University
Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances
Algorithms for Nearest Neighbors Classic Ideas, New Ideas Yury Lifshits Steklov Institute of Mathematics at St.Petersburg http://logic.pdmi.ras.ru/~yura University of Toronto, July 2007 1 / 39 Outline
A Model for Interactive Web Information Retrieval Orland Hoeber and Xue Dong Yang University of Regina, Regina, SK S4S 0A2, Canada {hoeber, yang}@uregina.ca Abstract. The interaction model supported by
C-NBC: Neighborhood-Based Clustering with Constraints Piotr Lasek Chair of Computer Science, University of Rzeszów ul. Prof. St. Pigonia 1, 35-310 Rzeszów, Poland lasek@ur.edu.pl Abstract. Clustering is
: An Experimental Survey and Felix Naumann VLDB 2018 Estimation and Approximation Session Rio de Janeiro-Brazil 29 th August 2018 Information System Group Hasso Plattner Institut University of Potsdam
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly
International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 6 (2014), pp. 577-582 International Research Publications House http://www. irphouse.com Selection of n in
Use of Locality Sensitive Hashing (LSH) Algorithm to Match Web of Science and SCOPUS Mehmet Ali Abdulhayoglu 1,2 (0000-0002-1288-2181), Bart Thijs 1 (0000-0003- 0446-8332) 1 ECOOM, Center for R&D Monitoring,
A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS KULWADEE SOMBOONVIWAT Graduate School of Information Science and Technology, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033,
A Modular k-nearest Neighbor Classification Method for Massively Parallel Text Categorization Hai Zhao and Bao-Liang Lu Department of Computer Science and Engineering, Shanghai Jiao Tong University, 1954
Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-
Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Mining High Itemsets Tzung-Pei Hong Dept of Computer Science and Information Engineering
Static Index Pruning for Information Retrieval Systems: A Posting-Based Approach Linh Thai Nguyen Department of Computer Science Illinois Institute of Technology Chicago, IL 60616 USA +1-312-567-5330 nguylin@iit.edu
IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk
Duplicate News Story Detection Revisited Omar Alonso Microsoft Corporation omalonso@microsoft.com Dennis Fetterly Microsoft Research, Silicon Valley Lab fetterly@microsoft.com Mark Manasse Microsoft Research,
Ranking Web Pages by Associating Keywords with Locations Peiquan Jin, Xiaoxiang Zhang, Qingqing Zhang, Sheng Lin, and Lihua Yue University of Science and Technology of China, 230027, Hefei, China jpq@ustc.edu.cn
Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan
Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,
Using Statistical Properties of Text to Create Metadata Grace Crowder crowder@cs.umbc.edu Charles Nicholas nicholas@cs.umbc.edu Computer Science and Electrical Engineering Department University of Maryland
SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first
STUDYING OF CLASSIFYING CHINESE SMS MESSAGES BASED ON BAYESIAN CLASSIFICATION 1 LI FENG, 2 LI JIGANG 1,2 Computer Science Department, DongHua University, Shanghai, China E-mail: 1 Lifeng@dhu.edu.cn, 2
2011 International Conference on Network and Electronics Engineering IPCSIT vol.11 (2011) (2011) IACSIT Press, Singapore Live Virtual Machine Migration with Efficient Working Set Prediction Ei Phyu Zaw
Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject
Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,
Meshlization of Irregular Grid Resource Topologies by Heuristic Square-Packing Methods Uei-Ren Chen 1, Chin-Chi Wu 2, and Woei Lin 3 1 Department of Electronic Engineering, Hsiuping Institute of Technology
REDUNDANCY REMOVAL IN WEB SEARCH RESULTS USING RECURSIVE DUPLICATION CHECK ALGORITHM Dr. S. RAVICHANDRAN 1 E.ELAKKIYA 2 1 Head, Dept. of Computer Science, H. H. The Rajah s College, Pudukkottai, Tamil
Distance-based Outlier Detection: Consolidation and Renewed Bearing Gustavo. H. Orair, Carlos H. C. Teixeira, Wagner Meira Jr., Ye Wang, Srinivasan Parthasarathy September 15, 2010 Table of contents Introduction
A Clustering Framework to Build Focused Web Crawlers for Automatic Extraction of Cultural Information George E. Tsekouras *, Damianos Gavalas, Stefanos Filios, Antonios D. Niros, and George Bafaloukas
Locality Preserving Scheme of Text Databases Representative in Distributed Information Retrieval Systems Mohammad Hassan, Yaser A. Al-Lahham Zarqa University Jordan mohdzita@zu.edu.jo Journal of Digital
K-Means Based Matching Algorithm for Multi-Resolution Feature Descriptors Shao-Tzu Huang, Chen-Chien Hsu, Wei-Yen Wang International Science Index, Electrical and Computer Engineering waset.org/publication/0007607
IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL Lim Bee Huang 1, Vimala Balakrishnan 2, Ram Gopal Raj 3 1,2 Department of Information System, 3 Department
Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University
FOCUS: ADAPTING TO CRAWL INTERNET FORUMS T.K. Arunprasath, Dr. C. Kumar Charlie Paul Abstract Internet is emergent exponentially and has become progressively more. Now, it is complicated to retrieve relevant
ISBN 978-952-5726-09-1 (Print) Proceedings of the Second International Symposium on Networking and Network Security (ISNNS 10) Jinggangshan, P. R. China, 2-4, April. 2010, pp. 165-169 Weighted Suffix Tree
Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Goals Many Web-mining problems can be expressed as finding similar sets:. Pages with similar words, e.g., for classification
Classification of Page to the aspect of Crawl Web Forum and URL Navigation Yerragunta Kartheek*1, T.Sunitha Rani*2 M.Tech Scholar, Dept of CSE, QISCET, ONGOLE, Dist: Prakasam, AP, India. Associate Professor,
Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University {tedhong, dtsamis}@stanford.edu Abstract This paper analyzes the performance of various KNNs techniques as applied to the
Compact Encoding of the Web Graph Exploiting Various Power Laws Statistical Reason Behind Link Database Yasuhito Asano, Tsuyoshi Ito 2, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 Department
Frame based Video Retrieval using Video Signatures Siva Kumar Avula Assistant Professor Dept. of Computer Science & Engg. Ashokrao Mane Group of Institutions-Vathar Shubhangi C Deshmukh Assistant Professor
Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:
IJSRD - International Journal for Scientific Research & Development Vol. 3, Issue 04, 2015 ISSN (online): 2321-0613 Crawler with Search Engine based Simple Web Application System for Forum Mining Parina
Online Stochastic Matching CMSC 858F: Algorithmic Game Theory Fall 2010 Barna Saha, Vahid Liaghat Abstract This summary is mostly based on the work of Saberi et al. [1] on online stochastic matching problem
International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal
Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification
CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood
A Comparison of Algorithms used to measure the Similarity between two documents Khuat Thanh Tung, Nguyen Duc Hung, Le Thi My Hanh Abstract Nowadays, measuring the similarity of documents plays an important
Predictive Indexing for Fast Search Sharad Goel Yahoo! Research New York, NY 10018 goel@yahoo-inc.com John Langford Yahoo! Research New York, NY 10018 jl@yahoo-inc.com Alex Strehl Yahoo! Research New York,