An Improved Topic Relevance Algorithm for Focused Crawling


Hong-Wei Hao 1, Cui-Xia Mu 1,2, Xu-Cheng Yin 1,*, Shen Li 1, Zhi-Bin Wang 1
1 Department of Computer Science, School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
2 Department of Computer Science, China Women's University, Beijing 100101, China
hhw@ustb.edu.cn; nwuc_mucuixia@163.com; xuchengyin@ustb.edu.cn; lishenxydlgzs@163.com; wzb88@yahoo.cn

Abstract: Topic relevance of pages and hyperlinks is the key issue in focused crawling. This paper proposes an improved topic relevance algorithm for focused crawling. First, we implement a prototype focused crawler, a topic-specific news gathering system, for comparative experiments on different similarity measures applied to anchor text. Second, experiments on a Chinese text corpus show that LSI (Latent Semantic Indexing) outperforms TF-IDF (term frequency-inverse document frequency) both for predicting the topic relevance of hyperlinks and for calculating the topic relevance of pages. Third, in real crawling experiments on the prototype system, the crawler using TF-IDF performs well at the beginning of the crawl, with the accumulated topic relevance increasing quickly, whereas the crawler using LSI finds more related pages and can tunnel through irrelevant ones. Fourth, combining the advantages of LSI and TF-IDF, we propose the TFIDF+LSI algorithm to guide the crawl. Last, the crawler using TFIDF+LSI performs the same crawl task and demonstrates the combined advantages of TF-IDF and LSI. The experiment suggests that the crawler's performance using TFIDF+LSI is greatly superior to that using either TF-IDF or LSI alone.

Keywords: topical crawler, link prediction, topic relevance, Latent Semantic Indexing, Web information retrieval

I. INTRODUCTION

Currently, the Internet is becoming the main way to obtain information.
Statistics indicate that Google could search about a trillion pages by the end of 2008 [1]. However, the explosive growth of information on the Web has resulted in information overload and left people submerged in a sea of information, so universal search engines cannot fully meet people's needs. As artificial intelligence technology matures and information services diversify, search engine technology must become intelligent, personalized and domain-specific. Vertical search engines have therefore become a research focus. Vertical search engines are domain-specific professional search engines, which narrow the search scope to increase search accuracy and improve the capacity for tracking Web resources. As the core part of a vertical search engine, the topical crawler collects and updates information from the Web. Unlike the universal breadth-first crawler, the topical crawler uses a priority calculation to selectively download pages relevant to a specific topic.

* Corresponding author: Xu-Cheng Yin. This work is supported by the R&D Special Fund for Public Welfare Industry (Meteorology) of China under Grant No. GYHY and the Fundamental Research Funds for the Central Universities under Grant No. FRF-BR-0-034B.

The locality principle of Web topics holds that page authors tend to choose hyperlinks related to the topic of their own page. A topical crawler therefore starts from a few relevant seed pages and tries to traverse all pages relevant to the subject. While crawling, it analyzes each hyperlink and tries to predict which ones are relevant to the subject; it follows relevant URLs first and discards irrelevant ones. The first focused crawler was Fish-Search [2], followed by Shark-Search [3], a more aggressive variant of Fish-Search. A variety of topical crawling algorithms now exist.
Classifier-based focused crawlers [4] use a classifier to calculate the topic relevance of Web pages and first follow the hyperlinks on pages with high topic relevance. Focused crawlers based on hyperlink analysis [5] use PageRank algorithms to analyze the importance of hyperlinks and first follow the sub-hyperlinks of important hyperlinks. Focused crawlers using context graphs [6] emphasize contextual knowledge of the topic.

The priority of a URL in the crawl depends directly on its topic relevance, so calculating the topic relevance of pages and URLs is important and worth further study [7]. Topic relevance refers to the result of comparing the topic of a query with the topic of a document. We treat a single page, or a part of one, as a document, and the user's query to the search engine as a query to the retrieval system. Topic relevance of a page therefore mainly refers to a text similarity measure computed on the effective text of topic pages.

Text similarity measurement involves several key issues: text representation, text feature selection and the text similarity algorithm. The VSM (Vector Space Model) [8], proposed in 1958, has been widely used in IRS (Information Retrieval Systems) in recent years. The Extended Boolean Model [9] was developed to overcome the binary nature of the Boolean model. Probabilistic models have also been applied in IRS. LSI (Latent Semantic Indexing) [10][11], a natural language processing method proposed in 1990, can address the problems of synonymy and polysemy. The main difficulty of text processing is that the data are high-dimensional and sparse; feature selection and feature extraction reduce both the number of dimensions and the noise. Widely used feature selection methods include document frequency (DF), mutual information (MI), CHI statistics, expected cross entropy (ECE), odds ratio (OR) and weight of evidence for text (WET).
Commonly used similarity functions include cosine similarity, Dice coefficient similarity and Jaccard coefficient similarity.
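As a minimal sketch of the three similarity functions just named, the following Python code computes them on plain lists of term weights. The function names and the toy vectors are illustrative assumptions, not part of the prototype system, and the zero-vector handling (returning 0.0) is a convenience choice.

```python
import math


def dot(u, v):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(q, d):
    """Cosine similarity: dot product divided by the product of the norms."""
    denom = math.sqrt(dot(q, q)) * math.sqrt(dot(d, d))
    return dot(q, d) / denom if denom else 0.0


def dice(q, d):
    """Dice coefficient for weighted vectors."""
    denom = dot(q, q) + dot(d, d)
    return 2.0 * dot(q, d) / denom if denom else 0.0


def jaccard(q, d):
    """Jaccard coefficient for weighted vectors."""
    denom = dot(q, q) + dot(d, d) - dot(q, d)
    return dot(q, d) / denom if denom else 0.0


# Toy weight vectors over a three-term vocabulary (hypothetical values).
query = [1.0, 0.0, 2.0]
doc = [2.0, 1.0, 2.0]

print(cosine(query, doc), dice(query, doc), jaccard(query, doc))
```

All three functions share the same numerator (the dot product) and differ only in how they normalize it, which is why they tend to rank documents similarly while producing different absolute scores.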

Most topical crawlers guide the crawl using VSM with TF-IDF [9] or an improved variant. Because TF-IDF is essentially a strict string matching algorithm, it cannot handle approximate meaning, so many researchers use query expansion to add topic-related keywords and thereby address tunneling. With some probability, a focused crawler should be allowed to follow a series of bad pages in order to reach a good one. This procedure of traversing irrelevant Web pages to reach more relevant ones is called tunneling through [16]. LSI uses SVD (singular value decomposition) to capture latent semantics, but its use in focused crawling has rarely been studied. We believe that a semantic relation exists between the anchor text and the body text of a topic page (referred to as page text), and that LSI should therefore perform better in guiding the crawl. Combining the advantages of TF-IDF and LSI, we propose the TFIDF+LSI algorithm and apply it to the prototype system. The experiment suggests that the crawler's performance using TFIDF+LSI is superior to that using either TF-IDF or LSI alone.

The rest of this paper is organized as follows. Section II describes the related similarity measures, including TF-IDF and LSI, and reports comparative experiments on a Chinese text corpus. Section III proposes the TFIDF+LSI algorithm and describes comparative experiments using TF-IDF, LSI and TFIDF+LSI on the prototype system. Section IV presents conclusions and discussion.

II. SIMILARITY MEASURES

A. Vector Space Model and TF-IDF

The Vector Space Model [8] is widely used in information retrieval and was first introduced in the SMART system. In the vector space model, documents and queries are represented as vectors, that is, sequences of features and their weights. The dimension of the feature space equals the number of distinct words in all of the documents and queries. A document vector is represented as

$$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j}) \tag{1}$$

and a query vector as

$$q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q}) \tag{2}$$

Each axis in this space corresponds to a term. Here t is the number of terms in all documents and queries, $w_{i,j}$ is the weight of the i-th term in document $d_j$, and $w_{i,q}$ is the weight of the i-th term in query q. The methods of assigning weights to terms vary. The most common TF-IDF scheme [8] gives the weight of term i in a document as

$$w_i = tf_i \cdot \log\left(\frac{N}{df_i}\right) \tag{3}$$

where $tf_i$ is the frequency of the i-th term in the document, N is the number of all documents, and $df_i$ is the number of documents containing the i-th term. Other forms exist.

In the vector space model, the similarity between a document and a query is usually based on the distance between their vectors under some metric. Writing $w_{tq}$ and $w_{td}$ for the weights of term t in query q and document d, the cosine similarity measure is the most common:

$$\mathrm{Sim}_{\cos}(q, d) = \frac{\sum_t w_{tq}\, w_{td}}{\sqrt{\sum_t w_{tq}^2}\,\sqrt{\sum_t w_{td}^2}} \tag{4}$$

Many other similarity measures exist for particular purposes, for example Dice coefficient similarity (5) and Jaccard coefficient similarity (6):

$$\mathrm{Sim}_{\mathrm{Dice}}(q, d) = \frac{2 \sum_t w_{tq}\, w_{td}}{\sum_t w_{tq}^2 + \sum_t w_{td}^2} \tag{5}$$

$$\mathrm{Sim}_{\mathrm{Jaccard}}(q, d) = \frac{\sum_t w_{tq}\, w_{td}}{\sum_t w_{tq}^2 + \sum_t w_{td}^2 - \sum_t w_{tq}\, w_{td}} \tag{6}$$

B. LSI

LSI [11][12] is based on a high-dimensional term-document vector space. Latent Semantic Indexing assumes that the variability of word choice partially obscures the semantic structure of a document. It therefore uses the singular value decomposition (SVD) to reduce the dimensions of the term-document space, which reveals the underlying semantic relationships between documents or queries and eliminates much of the noise (differences in word usage, terms that do not help to distinguish documents, etc.).

LSI [11][12] is performed as follows. First, a t × d matrix X of terms and documents is formed, where each matrix element is the term frequency or another weight, t is the number of rows of X, d is the number of columns of X, and m is the rank of X. X can be decomposed into the product of three other matrices:

$$X = T_0 S_0 D_0' \tag{7}$$

This is called the singular value decomposition of X.
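To make the decomposition above concrete, the sketch below recovers only the largest singular value of a tiny term-document matrix, together with its left and right singular vectors, and forms the corresponding rank-1 matrix. Power iteration is used purely to keep the example dependency-free; it is an illustrative substitute and not the SVD routine used in the experiments. The matrix values and all function names are assumptions, and the code assumes a non-zero matrix with non-negative entries so that the all-ones start vector is not orthogonal to the leading singular vector.

```python
import math


def matvec(a, x):
    """Multiply matrix a (list of rows) by vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def top_singular_triplet(x, iters=200):
    """Return (sigma, t, d): the largest singular value of x with its
    unit-length left (term) and right (document) singular vectors."""
    xt = transpose(x)
    d = [1.0] * len(x[0])              # start vector over documents
    for _ in range(iters):
        d = matvec(xt, matvec(x, d))   # one application of X'X
        n = norm(d)
        d = [v / n for v in d]
    xd = matvec(x, d)
    sigma = norm(xd)
    t = [v / sigma for v in xd]
    return sigma, t, d


def rank1(sigma, t, d):
    """Rank-1 matrix sigma * t * d' built from one singular triplet."""
    return [[sigma * ti * dj for dj in d] for ti in t]


# Hypothetical 3-term x 2-document frequency matrix.
X = [[2.0, 1.0],
     [1.0, 0.0],
     [0.0, 1.0]]

sigma, t, d = top_singular_triplet(X)
approx = rank1(sigma, t, d)
print(sigma)   # close to sqrt(6), since X'X has eigenvalues 6 and 1
print(approx)
```

Keeping more than one triplet works the same way in principle (deflate and repeat), but a library SVD is the sensible choice for anything beyond a toy matrix.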
$T_0$ and $D_0$ are the matrices of left and right singular vectors, and $S_0$ is the diagonal matrix of singular values. The SVD is unique up to certain row, column and sign permutations; by convention the diagonal elements of $S_0$ are constructed to be all positive and ordered by decreasing magnitude. Because the singular values in $S_0$ are ordered by size, the k largest may be kept and the remaining smaller ones set to zero. The product of the resulting matrices is a matrix $\hat{X}$ that is only approximately equal to X. Deleting the zero rows and columns of $S_0$ gives a new diagonal matrix S, and deleting the corresponding columns of $T_0$ and $D_0$ gives T and D respectively. The result is a reduced model,

$$X \approx \hat{X} = T S D' \tag{8}$$

which is the rank-k model with the best possible least-squares fit to X. The choice of k is critical to our work. The proper way to make such choices is an open issue in the factor-analytic literature.

The dot product between two row vectors of $\hat{X}$ reflects the extent to which two terms have a similar pattern of occurrence across the set of documents. It is easy to verify that

$$\hat{X}\hat{X}' = T S D' D S T' = T S^2 T' \tag{9}$$

where the i, j cell of $\hat{X}\hat{X}'$ is the similarity between term i and term j. Likewise, the matrix $\hat{X}'\hat{X}$ contains the document-to-document dot products:

$$\hat{X}'\hat{X} = D S T' T S D' = D S^2 D' \tag{10}$$

where the i, j cell of $\hat{X}'\hat{X}$ is the similarity between document i and document j. The fundamental comparison between a term and a document is the value of an individual cell of $\hat{X}$:

$$X \approx \hat{X} = T S D' \tag{11}$$

To compare a query with other documents, we start with its term vector $X_q$ and derive a representation $\hat{q}$ in the reduced-dimensional vector space:

$$\hat{q} = X_q' T S^{-1} \tag{12}$$

The vector $\hat{q}$ can then be used in between-comparisons or within-comparisons in the same way as a document vector. The cosine similarity between query q and document d is

$$\mathrm{Sim}(q, d) = \frac{\sum_{i=1}^{t} q_i d_i}{\sqrt{\sum_{i=1}^{t} q_i^2}\,\sqrt{\sum_{i=1}^{t} d_i^2}} \tag{13}$$

where, in the latent semantic space, $q_i$ and $d_i$ are respectively the normalized weights of term i in query q and document d.

C. Experiments on Chinese Text Corpus

Although TF-IDF based on VSM is widely used in IRS, it has some limitations. For example, it cannot deal with semantic relevance, and its partial matching results in low precision. In addition, it is not suitable for long documents. Compared with TF-IDF, LSI normalizes all vectors, which removes the bias toward long or short documents. In addition, LSI linearly maps the high-dimensional term-document vector space onto a lower-dimensional subspace in a provably optimal way, which reveals the semantic relationships between documents.

We believe that a latent semantic relationship exists between the anchor text of a hyperlink and the body text of the topic page (referred to as page text) [12][13], and that the LSI algorithm should perform better in guiding topical crawlers. We therefore design experiments on the Chinese text corpus from Sogou Lab to verify the effectiveness of LSI in predicting the topic relevance of hyperlinks. We define a set of topic keywords and then use TF-IDF similarity and LSI similarity respectively to obtain two groups of data. One group contains the topic relevance of the anchor texts, and the other contains the topic relevance of the corresponding page texts. Ideally, the topic relevance of the anchor text and that of the page text should show a positive linear correlation.

The Chinese text corpus includes ten categories: car, finance, IT, health, sports, tourism, education, recruitment, culture and military. Since SVD consumes substantial system resources, LSI uses only a partial corpus and performs the SVD of a 00-dimensional sparse matrix. TF-IDF is evaluated on both the global and the local corpus. Here, the local corpus is the part of the Chinese text corpus related to the topic, and the global corpus is the whole corpus.

First, using the global and local corpus respectively, TF-IDF is used to explore the correlation between the topic relevance of the anchor text and that of the page text, shown in Fig. 1 and Fig. 2. Each point in Fig. 1, Fig. 2 and Fig. 3 represents a 2-tuple <topic relevance of the anchor text, topic relevance of the page text>. Second, using the local corpus, LSI is used to explore the same correlation, shown in Fig. 3.

Figure 1. Correlation between anchor relevance and page relevance using TF-IDF (global corpus)
Figure 2. Correlation between anchor relevance and page relevance using TF-IDF (local corpus)
Figure 3. Correlation between anchor relevance and page relevance using LSI (global corpus)

Fig. 1 and Fig. 2 show only a weak linear correlation under the TF-IDF similarity measure. In contrast, Fig. 3 shows a stronger, monotonically increasing linear correlation under the LSI algorithm. In addition, regression analysis gives a coefficient of determination R² of 0.2 for the former and a larger value for the latter. Using LSI to guide focused crawling is therefore feasible.

III. TFIDF+LSI ALGORITHM AND EXPERIMENTS

In this section, we detail the implementation of the topic-specific news gathering system, the new TFIDF+LSI algorithm, and the comparison of different topic relevance algorithms on real crawling tasks.

A. Implementation of Topic-specific News Gathering System

A topic-specific news gathering system [14], the prototype focused crawler, is implemented on top of the open source Web crawler Heritrix from SourceForge. Heritrix has a modular design with excellent reusability and scalability.

Figure 4. Data flow of Heritrix crawling

Next, we analyze the system framework and data processing to identify the extension points needed to build the focused crawler, shown in Fig. 4. Heritrix performs page fetching, page parsing and hyperlink extraction through CrawlURI. The modules relevant to building the focused crawler are CrawlURI and CandidateURI. We choose the Extractor in the ProcessorChains as the most important part to extend, because the Extractor is mainly responsible for extracting the related hyperlinks from downloaded pages. The Heritrix Extractor passes hyperlinks to the EPA (External Page Analyzer), as shown in Fig. 5.

Figure 5. Data flow between Heritrix and EPA

The EPA analyzes each hyperlink, estimates its topic relevance, and then sorts the hyperlinks for the URL queue. We therefore put most emphasis on the EPA. Due to space limitations, we describe only the interface implementation of the similarity measure. For flexibility and scalability, we implement a flexible interface when designing the relevance calculation module. For example, class TopicalURL includes the URL string and a LinkTheme, and class LinkTheme holds the anchor text, the information near the anchor text and the title of the page corresponding to the URL. To make it easy to add similarity calculation methods, we define only the interface DistanceMetric, which declares the two necessary member methods (similarity and distance) for distance-based similarity. We design the TfIdfDistance module to calculate the similarity.

B. TF-IDF and LSI Experiments on the Prototype System

We choose NETEASE's Military Channel as the seed page and "missile" as the topic. Using three different algorithms to guide the crawl by anchor text, the crawler collects 1000 valid topic pages each time. As the crawl proceeds, the topic relevance of the 1000 page texts is accumulated. We therefore use the accumulated topic relevance of page text to evaluate the performance of the crawler under different similarity measures:

$$ATR = \sum_{i=1}^{1000} \mathrm{Sim}(P(i), T) \tag{14}$$

where ATR is the accumulated topic relevance of page text, P(i) is the text vector of the i-th page, and T is the topic vector. In theory, the better the method, the higher the accumulated topic relevance of page text. The three algorithms are breadth-first, TF-IDF and LSI. Breadth-first serves as the baseline, so the other algorithms should outperform it. The result is shown in Fig. 6.
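The accumulated topic relevance defined above is a running sum over pages in crawl order, so the curve plotted for each crawler can be produced in one pass. The sketch below assumes the per-page similarity scores have already been computed; the function name and the sample scores are hypothetical.

```python
from itertools import accumulate


def accumulated_topic_relevance(similarities):
    """Running sum of per-page similarity scores, in crawl order.

    Element k is the accumulated topic relevance after k + 1 pages;
    the last element is the ATR for the whole crawl.
    """
    return list(accumulate(similarities))


# Hypothetical Sim(P(i), T) values for four downloaded pages.
sims = [0.5, 0.25, 0.0, 0.25]
atr_curve = accumulated_topic_relevance(sims)
print(atr_curve)
```

A flat stretch in this curve corresponds to pages with zero similarity to the topic, which is how a tunneling phase or a stalled crawl shows up in the plots.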

Figure 6. Accumulated topic relevance of page text when crawling with breadth-first, TF-IDF and LSI respectively

We analyze the result as follows. First, for the crawler using breadth-first, the accumulated topic relevance increases steadily and slowly throughout the crawl. Second, at the beginning of the crawl, the crawler using TF-IDF performs well, with the accumulated topic relevance increasing quickly; however, the accumulated topic relevance stops increasing after 20 pages. The crawl ends because no hyperlinks/URLs remain whose anchor text contains exactly the keyword "missile". Third, for the crawler using LSI, the accumulated topic relevance initially increases more slowly than that of TF-IDF and more quickly than that of breadth-first, and it eventually surpasses that of TF-IDF. This shows that the crawler using LSI can crawl beyond the scope of the crawler using TF-IDF and can find more topic-related pages whose hyperlink anchor text does not directly contain the keyword "missile". From about 350 to 600 pages, the accumulated topic relevance remains unchanged while the crawler passes through a tunneling area; after a while, it continues to increase. The crawler using LSI therefore has the ability to tunnel through.

C. TFIDF+LSI Algorithm

From the above results, although the crawler using TF-IDF stops early, its accumulated topic relevance increases more quickly. The crawler using LSI, in turn, finds more topic-related pages and tunnels through, although its accumulated topic relevance increases more slowly than that of TF-IDF. Combining their advantages, we propose the TFIDF+LSI algorithm to guide focused crawling, as shown in Fig. 7.

The main idea of TFIDF+LSI is as follows. Two priority queues are used, the main queue and the secondary queue. First, calculate the topic relevance of the hyperlinks (by anchor text) in downloaded pages using TF-IDF, and put the hyperlinks whose topic relevance is higher than a certain threshold into the main queue in ascending order. Then calculate the topic relevance of the remaining hyperlinks using LSI, and put the hyperlinks whose topic relevance is higher than a certain threshold into the secondary queue in ascending order. When crawling, fetch hyperlinks/URLs from the main queue first, and fetch from the secondary queue only when the main queue is empty.

Figure 7. Crawler framework using TFIDF+LSI

The TFIDF+LSI algorithm is described as follows:

    handlepage(string content) {
        // extract all hyperlinks from the downloaded page
        NodeList LinkNodeList = ParseLink(content);
        for (link link in LinkNodeList) {
            // score the anchor text against the topic with both measures
            mainsim = TFIDFSim(link.getAnchor(), topicstr);
            assistsim = LSISim(link.getAnchor(), topicstr);
            if (mainsim > mainthresh) {
                MainQueue.insert(link, mainsim);
            } else if (assistsim > assistthresh) {
                AssistQueue.insert(link, assistsim);
            }
        }
    }

D. TFIDF+LSI Experiments on the Prototype System

We now apply TFIDF+LSI to the focused crawler to guide the crawl by anchor text and let it perform the same crawl task as above. The results are shown in Fig. 8 and Fig. 9, which suggest that TFIDF+LSI outperforms the other algorithms both in the start-up phase and over the whole process. From Fig. 8, in the start-up phase, the crawler using TFIDF+LSI inherits the high performance of the crawler using TF-IDF, with the accumulated topic relevance of page text increasing quickly; moreover, the accumulated topic relevance of page text keeps increasing at the former rate when the crawler using TF-IDF stops.
In addition, the crawler using TFIDF+LSI crawls more quickly at the beginning than the crawlers using LSI or breadth-first. From Fig. 9, after 1000 pages, the accumulated topic relevance of page text using TFIDF+LSI is 25.8 times that using LSI, … times that using TF-IDF and … times that using breadth-first. TFIDF+LSI therefore shows greatly superior performance when guiding the crawler by anchor text.
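The two-queue scheduling described in Section III.C can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the thresholds and the two toy scoring functions (a keyword match standing in for TF-IDF and a small related-term table standing in for LSI) are hypothetical, and the sketch assumes that the most relevant link in a queue is fetched first.

```python
import heapq
from itertools import count


class TwoQueueFrontier:
    """Main queue filled by one scorer, secondary queue by another."""

    def __init__(self, main_sim, assist_sim, main_thresh, assist_thresh):
        self.main_sim = main_sim
        self.assist_sim = assist_sim
        self.main_thresh = main_thresh
        self.assist_thresh = assist_thresh
        self._main = []
        self._assist = []
        self._tie = count()  # preserves insertion order among equal scores

    def add(self, url, anchor_text):
        """Score a link by its anchor text and enqueue it if relevant."""
        main_score = self.main_sim(anchor_text)
        if main_score > self.main_thresh:
            heapq.heappush(self._main, (-main_score, next(self._tie), url))
            return
        assist_score = self.assist_sim(anchor_text)
        if assist_score > self.assist_thresh:
            heapq.heappush(self._assist, (-assist_score, next(self._tie), url))
        # Links below both thresholds are dropped.

    def pop(self):
        """Next URL: main queue first, secondary only when main is empty."""
        for queue in (self._main, self._assist):
            if queue:
                return heapq.heappop(queue)[2]
        return None


# Hypothetical stand-ins for the TF-IDF and LSI scorers.
RELATED = {"rocket": 0.6, "launch": 0.4}


def keyword_sim(text):
    return 1.0 if "missile" in text.lower() else 0.0


def related_sim(text):
    lowered = text.lower()
    return max((w for term, w in RELATED.items() if term in lowered), default=0.0)


frontier = TwoQueueFrontier(keyword_sim, related_sim, 0.5, 0.3)
for link_url, anchor in [("u1", "sports news"), ("u2", "rocket test"),
                         ("u3", "missile defence"), ("u4", "launch site")]:
    frontier.add(link_url, anchor)

order = []
next_url = frontier.pop()
while next_url is not None:
    order.append(next_url)
    next_url = frontier.pop()
print(order)
```

Because the secondary queue is consulted only when the main queue is empty, exact keyword matches are exhausted first, and semantically related links then keep the crawl going instead of letting it stop.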

However, the crawler using TFIDF+LSI can only collect pages within a certain range, because of the limitations of LSI and of using only anchor text. In another experiment, without a limit on the number of downloaded pages, topic drift occurs and the accumulated topic relevance of page text stops increasing after 2500 pages, as shown in Fig. 10. Evidently, the tunneling-through ability of LSI is limited to gray tunneling, so the crawler using TFIDF+LSI cannot migrate to another information region after exhausting the related pages in one region.

Figure 8. Accumulated topic relevance of page text using TFIDF+LSI at the beginning of crawling
Figure 9. Accumulated topic relevance of page text using TFIDF+LSI over the whole crawl
Figure 10. Topic drift using TFIDF+LSI when crawling

IV. CONCLUSIONS

In this paper, a topic-specific news gathering system is implemented based on the open source crawler Heritrix, and experiments are performed to explore issues related to topic relevance. First, in the Chinese text corpus experiments, LSI outperforms TF-IDF, showing a stronger, monotonically increasing linear correlation between the topic relevance of anchor text and the topic relevance of page text. Second, in real crawling experiments on the topic-specific news gathering system, TF-IDF and LSI each have their strengths. We therefore propose the TFIDF+LSI algorithm, which guides the crawl by combining the advantages of TF-IDF and LSI. The crawler's performance using TFIDF+LSI is greatly superior to that using either TF-IDF or LSI alone. However, owing to the limitations of LSI, the use of anchor text only, and other factors, the topical crawler using TFIDF+LSI may still suffer from topic drift. Further research will combine structured data such as ontologies [15] with LSI, which is expected to perform even better in hyperlink prediction.

REFERENCES

[1] We knew the web was big. Available at …, July.
[2] P. De Bra, G.-J. Houben, Y. Kornatzky and R. Post, Information retrieval in distributed hypertexts, Proceedings of RIAO'94, Intelligent Multimedia, Information Retrieval Systems and Management, New York, 1994.
[3] M. Hersovici, M. Jacovi, Y. S. Maarek, et al., The shark-search algorithm. An application: tailored Web site mapping, Proceedings of the Seventh International Conference on World Wide Web, Brisbane, vol. 30, April 1998.
[4] S. Chakrabarti, B. Dom and P. Indyk, Enhanced hypertext categorization using hyperlinks, Proceedings of the International Conference on SIGMOD 98, 1998.
[5] P. Srinivasan, F. Menczer and G. Pant, A general evaluation framework for topical crawlers, Information Retrieval, vol. 8(3).
[6] M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles and M. Gori, Focused crawling using context graphs, Proceedings of the 26th International Conference on Very Large Databases, Roma.
[7] L.-T. Lv, L.-P. Chen and H.-F. Zhou, An improved topic relevance algorithm for vertical search engines, ICWAPR '08, Hong Kong, Aug.
[8] E. A. Fox, Extending the Boolean and vector space models of information retrieval with p-norm queries and multiple concept types, Dissertation Abstracts International Part B: Science and Engineering, NYC: Cornell University, vol. 44, no. 9, pp. 386, 1984.
[9] G. Salton, E. Fox and H. Wu, Extended Boolean information retrieval, Communications of the ACM, vol. 26, 1983.
[10] S. Deerwester, S. T. Dumais and R. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, vol. 41(6), 1990.
[11] T. K. Landauer, P. W. Foltz and D. Laham, An introduction to latent semantic analysis, Discourse Processes, vol. 25.
[12] G. Almpanidis, C. Kotropoulos and I. Pitas, Combining text and link analysis for focused crawling, Information Systems, vol. 32.
[13] J. Gelernter, D. Cao and J. Carbonell, Studies on relevance, ranking and results display, Journal of Computing, vol. 2, pp. 7-20, 200.
[14] S. Chakrabarti, M. van den Berg and B. Dom, Focused crawling: a new approach to topic-specific Web resource discovery, Proceedings of the Eighth International Conference on World Wide Web, 1999.
[15] M. Ehrig and A. Maedche, Ontology-focused crawling of Web documents, Proceedings of the 2003 ACM Symposium on Applied Computing, Melbourne.
[16] D. Bergmark, C. Lagoze and A. Sbityakov, Focused crawls, tunneling, and digital libraries, Proc. of the 6th European Conference on Research and Advanced Technology for Digital Libraries, pp. 91-106.
