Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler

Mukesh Kumar and Renu Vig
University Institute of Engineering and Technology, Panjab University, Chandigarh, India
mukesh_rai9@yahoo.com, mukesh_rai9@pu.ac.in, renuvig@hotmail.com

Abstract. The rapidly growing size of the World-Wide-Web poses unprecedented challenges for general-purpose crawlers and search engines; it is impossible for any search engine to index the complete Web. Focused crawlers cope with this growth by selectively seeking out pages that are relevant to a predefined set of topics and avoiding irrelevant regions of the Web. Rather than collecting all accessible Web documents, a focused crawler analyses its crawl boundary to find the links most likely to lead to pages relevant to the crawl. This paper presents a focused crawler that uses a TIDS (Term-frequency Inverse-Document frequency Definition Semantic) score, derived from the set of documents marked as highly relevant to the domain by Web users while browsing, to decide which links to include in future crawls.

Keywords: Focused Web crawler, information retrieval, Tf-Idf, semantics, search engine, indexing.

1 Introduction

The World Wide Web currently contains billions of publicly available documents. Besides its huge size, the Web is characterized by high growth and change rates: it grows rapidly in terms of new servers, sites and documents, while the addresses and contents of existing documents change and documents are removed from the Web. As more information becomes available on the Web, it becomes more difficult to find the relevant part of it. Web search engines such as Google and AltaVista provide access to Web documents. A search engine's crawler [14] collects Web documents and periodically revisits the pages to update the search engine's index.
Due to the Web's immense size and dynamic nature, no crawler is able to cover the entire Web or keep up with all its changes. This fact has motivated the development of focused crawlers [8, 10, 11, 12]. Focused crawlers are designed to download Web documents that are relevant to a predefined domain and to avoid irrelevant areas of the Web. The benefit of the focused crawling approach is that it finds a large proportion of the relevant documents on the particular domain while effectively discarding irrelevant documents, leading to significant savings in both computation and communication resources, and to high-quality retrieval results. In this paper a focused crawler architecture that retrieves documents based upon a TIDS score is proposed.

P.V. Krishna, M.R. Babu, and E. Ariwa (Eds.): ObCom 2011, Part II, CCIS 270, pp. 31-36, 2012. Springer-Verlag Berlin Heidelberg 2012

2 Related Work

In some early work on focused collection of data from the Web, crawling was modelled as a group of fish migrating on the Web [9]. In the so-called fish search, each URL corresponds to a fish whose survivability depends on visited-page relevance and remote-server speed. Page relevance is estimated by a binary classification using a simple keyword or regular-expression match. Fish die off only after traversing a specified number of irrelevant pages; consequently, the fish migrate in the general direction of relevant pages, which are then presented as results. J. Cho, H. Garcia-Molina and L. Page [4] proposed calculating the PageRank [6] score on the graph induced by the pages downloaded so far and then using this score as the priority of URLs extracted from a page. They show some improvement over the standard breadth-first algorithm; the improvement, however, is not large. This may be because the PageRank score is calculated on a very small, non-random subset of the Web, and because the PageRank algorithm is too general for use in topic-driven tasks. M. Ehrig and A. Maedche [7] considered an ontology-based algorithm for page-relevance computation. After pre-processing, entities (words occurring in the ontology) are extracted from the page and counted. Relevance of the page with regard to user-selected entities of interest is then computed using several measures on the ontology graph (e.g. direct match, taxonomic and more complex relationships). Most existing focused crawlers [1, 2, 3] are based either on simple keyword matching or on very complex machine-learning techniques for guiding future crawls.
3 Proposed Work

The Tf-Idf [13] (Term frequency-Inverse document frequency) weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally with the number of times a word appears in the document but is offset by the frequency of the word in the corpus, and in turn in the domain. Given a corpus of documents that are all highly related to a specific domain, the Tf-Idf score of a term in a document gives the importance of that term for that document with respect to the whole corpus. If we then sum the Tf-Idf scores obtained by a term over all documents in the corpus, the resulting value can be seen as a meaningful semantic score for that term with respect to the whole corpus. Based upon this idea a TIDS Score Table is constructed, whose entries help the crawler decide on future crawls. The TIDS Score Table is adaptive in nature, meaning that after each complete run of the crawler, the Web page with the highest relevancy to the domain, if not already present in the Relevant Page Set, is added to the Relevant Page Set, and the TIDS Score Table is regenerated and used for the future crawl. The TIDS Score Table generation algorithm is given below:
Algorithm 1: TIDS Score Table Generation

1. User browses the Web for domain-related pages.
2. If the page is not highly relevant to the domain then GOTO 1.
3. Remove stop words from the page.
4. Apply stemmer to the page.
5. Add the page to the Relevant_Page_Set.
6. If the Relevant_Page_Set limit is not reached then GOTO 1.
7. Generate the Tf-Idf Score Inverted Index Table for all the documents in the Relevant_Page_Set.
8. For each term t in the Tf-Idf Score Inverted Index Table Do
   8.1. Calculate the sum of the Tf-Idf scores obtained by t in all documents from the Tf-Idf Score Inverted Index Table; let it be TIDS_Score.
   8.2. Insert the entry <t, TIDS_Score> into the TIDS Score Table.
   8.3. Normalize the TIDS_Score values in the TIDS Score Table.

According to the TIDS Score Table Generation Algorithm, the user, while browsing the Web, marks the pages that seem most relevant to the specific domain. Before a page is added to the Relevant_Page_Set, stop-word removal and stemming, the process of reducing inflected (or sometimes derived) words to their stem, base or root form, are performed on the page. After construction of a sufficiently large Relevant_Page_Set, the Tf-Idf scores of the collection are calculated. The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d, and df_t is the document frequency of t, i.e. the number of documents that contain t. df_t is an inverse measure of the informativeness of t, and df_t <= N, where N is the total number of documents in the Relevant_Page_Set.
Then the idf (inverse document frequency) of t is given by

    idf_t = log(N / df_t)                                  (1)

The Tf-Idf weight w_t,d of a term t in the document d is the product of its tf weight and its idf weight and is given by

    w_t,d = log(1 + tf_t,d) * log(N / df_t)                (2)

The TIDS_Score of a term t is given by

    TIDS_Score(t) = sum over d in Relevant_Page_Set of w_t,d    (3)

The TIDS Score Table is used by the crawler, which works according to Algorithm 2. According to the TIDS Crawler Algorithm, SeedUrls, the set of preferred URLs that act as starting points for the crawler, is initialized with the URLs returned by various existing popular search engines for the particular domain. A similarity score for each seed URL is calculated, and all the SeedUrls are inserted into a priority queue, CrawlQueue, along with their description-text similarity score.
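Algorithm 1 and equations (1)-(3) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the stop-word list and the one-rule stemmer below are stand-ins for the unspecified stop-word remover and stemmer, and pages are plain strings rather than downloaded HTML.

```python
# Sketch of TIDS Score Table generation (Algorithm 1, Eqs. (1)-(3)).
# STOP_WORDS and the trailing-'s' stemmer are illustrative assumptions.
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # assumed list

def preprocess(text):
    """Steps 3-4: lowercase/tokenize, drop stop words, apply a toy stemmer."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [t[:-1] if t.endswith("s") else t for t in tokens]

def tids_score_table(relevant_page_set):
    """Steps 7-8: sum each term's Tf-Idf weight over all documents,
    then normalize so the largest TIDS_Score is 1 (step 8.3)."""
    docs = [Counter(preprocess(p)) for p in relevant_page_set]
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for d in docs:
        df.update(d.keys())
    table = {}
    for t, dft in df.items():
        idf = math.log(n / dft)                       # Eq. (1)
        table[t] = sum(math.log(1 + d[t]) * idf       # Eq. (2), summed
                       for d in docs if t in d)       # over docs, Eq. (3)
    top = max(table.values(), default=1.0) or 1.0
    return {t: score / top for t, score in table.items()}
```

A term that occurs often but only in a few domain pages (high tf, high idf) ends up near the top of the table, which is exactly the behaviour the crawler relies on.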
Now the highest-score URL is dequeued from the CrawlQueue, the URL's page is downloaded, and its text similarity score is calculated. An anchor similarity score for each link present in the page is calculated, and the text similarity score of the parent is added to it to obtain the final score for the link. If the calculated score is greater than some Relevancy_Threshold then the link is enqueued to the CrawlQueue.

Algorithm 2: TIDS Crawler

1. Initialize SeedUrls.
2. Create TIDS Score Table from the user's browsing patterns.
3. While SeedUrls is not empty
   3.1 URL = SeedUrls.Next();
   3.2 URL_Score = Similarity score of URL.description terms from TIDS Score Table.
   3.3 Enqueue(CrawlQueue, URL, URL_Score);
4. While CrawlQueue is not empty
   4.1 URL = Dequeue(URL_with_maximum_score, CrawlQueue);
   4.2 Doc = Download(URL);
   4.3 If Doc is not present in the Crawler Repository then add Doc to the Crawler Repository else GOTO 4.
   4.4 Doc_Score = Similarity score of URL.text terms from TIDS Score Table.
   4.5 If Doc_Score is greater than or equal to the text similarity score of the Relevant Page Set pages and Doc is not present in the Relevant Page Set
       4.5.1 Add Doc to the Relevant Page Set and regenerate the TIDS Score Table.
   4.6 For all Link in Doc.links
       4.6.1 LinkScore = Similarity score of Link.anchor terms from TIDS Score Table.
       4.6.2 Score = Doc_Score + LinkScore;
       4.6.3 If Score > Relevancy_Threshold
             4.6.3.1 Enqueue(CrawlQueue, Link, Score);

4 Experimental Results

The proposed TIDS crawler is implemented in Java using MySQL Server as backend, on a machine with the Windows 7 (64-bit) operating system, 3.0 GB of RAM and an Intel Core 2 Duo 3.0 GHz processor. The experiment is conducted on the Industry domain, i.e. we want to retrieve pages belonging to Industry. SeedUrls is initialized with the top 20 links returned by Google for domain-specific queries.
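The priority-queue loop of Algorithm 2 can be sketched as follows. This is an illustrative skeleton, not the paper's Java implementation: `fetch` and `extract_links` are hypothetical stubs standing in for real HTTP download and HTML parsing, the adaptive regeneration of the TIDS Score Table (step 4.5) is omitted, and a `max_pages` cap is added so the sketch terminates.

```python
# Sketch of the TIDS crawler loop (Algorithm 2). fetch() and
# extract_links() are assumed helpers supplied by the caller.
import heapq

def similarity(terms, tids_table):
    """Similarity score of a bag of terms against the TIDS Score Table."""
    return sum(tids_table.get(t, 0.0) for t in terms)

def tids_crawl(seed_urls, tids_table, fetch, extract_links,
               relevancy_threshold, max_pages=100):
    # Step 3: enqueue seeds scored on their description text. heapq is a
    # min-heap, so scores are negated to dequeue the best URL first.
    queue = [(-similarity(desc, tids_table), url) for url, desc in seed_urls]
    heapq.heapify(queue)
    repository = {}                            # Crawler Repository: url -> score
    while queue and len(repository) < max_pages:
        _, url = heapq.heappop(queue)          # step 4.1: highest-score URL
        if url in repository:                  # step 4.3: skip duplicates
            continue
        doc_terms = fetch(url)                 # step 4.2: download page text
        doc_score = similarity(doc_terms, tids_table)      # step 4.4
        repository[url] = doc_score
        for link, anchor_terms in extract_links(url):      # step 4.6
            score = doc_score + similarity(anchor_terms, tids_table)
            if score > relevancy_threshold:                # step 4.6.3
                heapq.heappush(queue, (-score, link))
    return repository
```

Because a link's score adds the parent page's score to its own anchor score, links found on highly relevant pages are explored before equally worded links found on marginal ones.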
Initially the TIDS Score Table is generated from 300 highly domain-relevant pages in the Relevant Page Set, marked by the users while browsing the Web for Industry-related material, and precision is calculated for various time durations for the proposed crawler. Let N_d denote the total number of Web pages discarded by the TIDS crawler (Step 4.6.3 of the TIDS Crawler Algorithm) up to a certain time, N_c the total number of pages present in the Crawler Repository at that time, and N_r the number of pages in the Crawler Repository that are relevant to the domain. Then precision is given by

    Precision = N_r / N_c                                  (4)

and the discard ratio by

    Discard_Ratio = N_d / (N_d + N_c)                      (5)

The results for the described setup are plotted as a graph in Fig. 1.

Fig. 1. Precision and Discard Ratio Graph

The graph shows that with the passage of time the discard ratio tends to decrease while precision tends to increase. This is because, as time passes, the Relevant Page Set is progressively enriched and the links chosen for future crawls tend to be more relevant to the domain.

5 Conclusion

A focused crawler is proposed that uses a TIDS (Term-frequency Inverse-Document frequency Definition Semantic) score, derived from the set of documents marked as highly relevant to the domain by Web users while browsing, to guide future crawls. The results show that the proposed crawler tends to increase the proportion of relevant pages retrieved over time, while the share of discarded pages tends to decrease, indicating the quality of the pages being retrieved.
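A minimal sketch of the two evaluation metrics, under the assumption that precision is the fraction of repository pages that are relevant and the discard ratio is the fraction of all processed links that were discarded; the original symbol names were lost in extraction, so the parameter names here are illustrative.

```python
# Evaluation metrics for the crawl, under the assumed definitions above.
def precision(num_relevant, repo_size):
    """Fraction of pages in the Crawler Repository relevant to the domain."""
    return num_relevant / repo_size if repo_size else 0.0

def discard_ratio(num_discarded, repo_size):
    """Fraction of processed links discarded at step 4.6.3."""
    total = num_discarded + repo_size
    return num_discarded / total if total else 0.0
```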
References

[1] Aggarwal, C., Al-Garawi, F., Yu, P.: Intelligent Crawling on the World Wide Web with Arbitrary Predicates. In: 10th International WWW Conference, Hong Kong (2001)
[2] Bergmark, D., Lagoze, C., Sbityakov, A.: Focused Crawls, Tunneling, and Digital Libraries. In: 6th European Conference on Research and Advanced Technology for Digital Libraries, pp. 91-106 (2002)
[3] Ester, M., Groß, M., Kriegel, H.-P.: Focused Web Crawling: A Generic Framework for Specifying the User Interest and for Adaptive Crawling Strategies. Technical report, Institute for Computer Science, University of Munich (2001)
[4] Cho, J., Garcia-Molina, H., Page, L.: Efficient Crawling Through URL Ordering. In: 7th International WWW Conference, Brisbane, Australia (1998)
[5] Cho, J., Garcia-Molina, H.: Parallel Crawlers. In: WWW (2002)
[6] Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies Project
[7] Ehrig, M., Maedche, A.: Ontology-Focused Crawling of Web Documents. In: ACM Symposium on Applied Computing (2003)
[8] Cho, J., Garcia-Molina, H.: The Evolution of the Web and Implications for an Incremental Crawler. In: VLDB, Cairo, Egypt (2000)
[9] De Bra, P.M.E., Post, R.D.J.: Information Retrieval in the World-Wide Web: Making Client-Based Searching Feasible. Computer Networks and ISDN Systems 27(2), 183-192 (1994)
[10] Chakrabarti, S., van den Berg, M., Dom, B.: Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. In: 8th International World Wide Web Conference, Toronto, Canada (1999)
[11] Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems 30(1), 107-117 (1998)
[12] http://www.google.co.in
[13] http://www.wikipedia.org
[14] Boldi, P., Codenotti, B., Santini, M., Vigna, S.: UbiCrawler: A Scalable Fully Distributed Web Crawler. Software: Practice and Experience 34(8), 711-726 (2004)