Exploiting routing information encoded into backlinks to improve topical crawling


2009 International Conference of Soft Computing and Pattern Recognition

Alban Mouton, Valoria, European University of Brittany, Vannes, France
Pierre-Francois Marteau, Valoria, European University of Brittany, Vannes, France

Abstract

Local link analysis of topical graphs on the Web makes it possible to experiment with focused crawling strategies in detail. In this setting, the models, parameters and metrics used to orient the crawler can be better understood, tuned and evaluated. We develop a methodological and experimental approach that exploits link analysis to determine what constitutes a good content analysis metric, one able to guide topical crawlers efficiently toward highly relevant areas of the Web. Our experiments show that partial knowledge of the local topology of a topical graph improves our understanding of the routing capabilities of various metrics. Furthermore, our experiments demonstrate that a significant improvement in crawling efficiency can be reached.

Index Terms: Topical crawling, Web topology, backlinks

1. Introduction

The size and heterogeneity of the Web in terms of content and structure make Web information retrieval particularly difficult. General purpose search systems such as the most popular Web search engines show their limits both in precision (results are often noisy) and in recall. As an alternative to general purpose search engines, topical or focused crawling is a popular approach to the problem of focused information retrieval in the vastness of the Internet. Unfortunately, modeling new topical crawlers and improving their heuristics and metrics is a complex task. This is partly due to the lack of homogeneity of the evaluation methods used in research projects. This problem is addressed by P. Srinivasan et al., who propose an evaluation framework [1] mostly based on the resources made available by the Open Directory Project (ODP).

The difficulty is also linked to the lack of extensive knowledge of the Web environment, and therefore of the nature of the task, that is needed for detailed design and evaluation. As the study of local graphs of the Web allows for a better representation of a topical crawler's task, we propose in this paper to exploit partial knowledge of the local topology of the Web surrounding some target pages that define the topic of interest. We then use this knowledge of the local topology to evaluate the routing capabilities of several similarity measures (from simple keyword based metrics to ontology based metrics) as the distance to the target pages increases.

To date, the best existing maps of the Web are the ones described in the databases of the main search engines. Some of this information is available to the public through restricted APIs. For our experiments we have built partial graphs surrounding topical zones of the Web by issuing inlink (or backlink) requests to the Yahoo Search API, starting from ODP topical pages.

The second section of the paper briefly presents related work. The proposed approach is formalized in the third section. The experiments and evaluations are provided in the fourth section, and a concluding section summarizes the main results and presents some perspectives.

2. Related work

As precursors in Web topology analysis, Albert et al. [2], [3] characterized large scale properties of the Web from the local connectivity of Web documents. In the late nineties Gibson et al. [4] worked on the inference of Web communities from the study of the link topology of the Web. In 2004 Chandrasekar et al. [5] proposed to improve Web information retrieval by exploiting user defined subweb information.
Haveliwala [6] and, more recently, Nie et al. [7] proposed methods that use topical link based authority to improve Web search and annotation, in addition to the general link based authority exploited by algorithms such as PageRank [8] and HITS [9].

Topical crawling was first introduced by Menczer et al. [10]. It is a research field in Web information retrieval that focuses on developing automated Web surfing programs to collect documents relevant to specific topics. As is usual in research fields, and more specifically when the exploration domain is as vast as the Web, the problem of standardized evaluation is central. Srinivasan et al. [1] propose a working framework, built around the data of the Open Directory Project, for experimenting with topical crawling.

Studying Web topology in order to improve search algorithms and metrics is not a new idea, and adding a topical analysis level is also a popular approach. However, to our knowledge, studying topical subgraphs of the Web for the purpose of designing topical crawlers is a new approach.

3. Formalized approach

3.1. Topics data

Srinivasan et al. [1] propose an evaluation framework for topical crawlers in which they describe the extraction of topics from the Open Directory Project, or ODP [dmoz.org/]. ODP is a vast, human edited, general purpose Web directory project referencing millions of pages. We followed the recommendations of [1] to create our workbench. Each topic matches an ODP node having a specified depth and generality. For each topic we have at our disposal human edited relevant links with their descriptions. We call these links the targets of the topic. For our experiments we consider various representations of the topic:
- A compilation of the textual descriptions of the targets in ODP.
- The closest matching Wikipedia page in HTML.
- A simple keyword or list of keywords.
- Some topical ontologies extracted from OpenCyc and WordNet [11].

3.2. Backlinks collection

Topical graphs on the Web are embedded in the huge Web structure in which topical crawlers try to find the best paths toward targets. We collect backlinks of sample targets in order to get a subset of such paths. The collection of backlinks at increasing distance from the targets is described as follows:
Let $T(t)$ be the subset of target pages for topic $t$; $p \in T(t)$ means that the page $p$ is a sample target for topic $t$.
Let $FL(p)$ be the set of pages that belong to the forelinks of page $p$; $p_2 \in FL(p_1)$ means that there exists a link from $p_1$ to $p_2$.
Let $BL(p)$ be the set of pages that belong to the backlinks of page $p$:
$p_1 \in BL(p_2) \iff p_2 \in FL(p_1)$.
By extension, if $s$ is a set of pages, $BL(s)$ is the set of pages in the backlinks of at least one page of $s$.
$BLD_h(t)$ is the set of pages located at exactly $h$ hops from a target page for topic $t$:

$$BLD_0(t) = T(t), \qquad BLD_h(t) = BL(BLD_{h-1}(t)) \setminus \bigcup_{i=0}^{h-1} BLD_i(t)$$

Figure 1. Example of a Web subgraph around some targets with hop distances

Unfortunately, for a given page $p$, the complete set $BL(p)$ of its backlinks is not easy to obtain. We use instead, as an approximation, a subset of $BL(p)$ obtained by querying the Yahoo Search API. The number of backlinks for a set of pages can quickly become extremely large, even when limited by the constraints of a specific API, and after a few hops the amount of data can be intractable. Therefore, to create a usable workbench, we use a random sample of the backlinks at each hop $h$ to approximate $BLD_h(t)$ and to fetch the next one, i.e. $BLD_{h+1}(t)$.

3.3. Content metrics

Topical crawlers are guided by a document content metric that is used to evaluate the relevance of a page to the topic specification. This relevance is then exploited by the crawler according to its strategy. For example, simple Best First crawlers sort their frontier according to the relevance of the pages from which the links were extracted. More sophisticated crawlers analyze the contexts of the links more precisely and then again apply a content metric in order to sort their priority. Our purpose is to compare different content metrics in the context of topical crawling with respect to their routing capability. In this study we consider a vector model that exploits simple cosine similarity and term frequency, using the topical data described in section 3.1.

Let $Similarity(d, t)$ be the similarity between a document $d$ and a request (or textual representation of a topic) $t$:

$$Similarity(d, t) = \frac{D \cdot T}{\|D\| \, \|T\|}$$

where $D$ and $T$ are the term frequency (tf) vectors of $d$ and $t$, and $tf_{i,j}$ is the number of occurrences of term $i$ in $j$.
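The hop sets of section 3.2 and the cosine similarity of section 3.3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: `backlink_layers`, `cosine_similarity`, the `backlinks_of` callable (standing in for a backlink query to a search API) and the toy graph `TOY_BACKLINKS` are all hypothetical names introduced here, and documents are assumed to be already tokenized.

```python
import math
import random
from collections import Counter


def backlink_layers(targets, backlinks_of, max_hops, sample_size, seed=0):
    """Approximate BLD_0(t) .. BLD_max_hops(t) from a set of target pages.

    backlinks_of is a hypothetical callable returning the known backlinks
    of a page. Only a random sample of each layer is expanded, mirroring
    the sampling described in the text.
    """
    rng = random.Random(seed)
    layers = [set(targets)]      # BLD_0(t) = T(t)
    seen = set(targets)          # union of all layers built so far
    for _ in range(max_hops):
        frontier = sorted(layers[-1])
        sample = rng.sample(frontier, min(sample_size, len(frontier)))
        candidates = set()
        for page in sample:
            candidates.update(backlinks_of(page))
        layer = candidates - seen  # drop pages already found at a shorter distance
        layers.append(layer)
        seen |= layer
    return layers


def cosine_similarity(doc_tokens, topic_tokens):
    """Cosine similarity between the term frequency vectors of two token lists."""
    d, t = Counter(doc_tokens), Counter(topic_tokens)
    dot = sum(d[term] * t[term] for term in d.keys() & t.keys())
    norm = math.sqrt(sum(v * v for v in d.values())) * math.sqrt(
        sum(v * v for v in t.values())
    )
    return dot / norm if norm else 0.0


# Toy backlink graph (assumption): page -> pages that link to it.
TOY_BACKLINKS = {"t1": ["a", "b"], "t2": ["b"], "a": ["c", "t2"], "b": ["c"], "c": ["d"]}
layers = backlink_layers(
    {"t1", "t2"}, lambda p: TOY_BACKLINKS.get(p, []), max_hops=3, sample_size=10
)
```

On this toy graph the sample size exceeds every layer, so the layers are exact; with a smaller `sample_size` the result is only an approximation of the hop sets, as in the workbench described above.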
In addition to the textual representations of the topics, we use weighted lexical views of ontological data. The building, weighting and matching of these ontological data with documents is still at an early stage, but it allows us to illustrate semantic enlargement of the topical request.

Let $Similarity(d, l)$ be the similarity between a document $d$ and an ontological lexical dictionary $l$:

$$Similarity(d, l) = \frac{D \cdot L}{\|D\| \, \|L\|}$$

where $D$ is the term frequency (tf) vector of $d$, $L$ is the weight (w) vector of $l$, and $w_{i,j}$ is the weight of term $i$ in $j$.

To create a weighted topical terminology dictionary out of ontological data we use a few relations and a simple algorithm (figures 2 and 3).
Let $L(c)$ be the set of expressions associated to the label of the concept $c$ given by the standard rdfs:label relation.
Let $R$ be the set of relations authorized to build the terminology dictionary (we filter out semantically irrelevant relations, depending on the resource that we use).
Let $R(O, c)$ be the set of concepts associated to $c$ by at least one of the relations in $R$ inside the set of ontologies $O$.

INPUT: Ontology resource set O, Set of core topical concepts C, Weight propagation ratio r, Minimal weight threshold m
lexic = dictionary()
for concept in C do
    lexic = buildlexic(lexic, concept, 1, set())
end for
Figure 2. Ontology lexic building algorithm

STATIC INPUT: Ontology resource set O, Weight propagation ratio r, Minimal weight threshold m
INPUT: Lexical dictionary to fill lexic, Current concept concept, Current weight weight, Previously visited concepts pastconcepts
pastconcepts.add(concept)
if weight > m then
    for label in L(concept) do
        if label in lexic then
            lexic[label] = max(weight, lexic[label])
        else
            lexic[label] = weight
        end if
    end for
    for c in R(O, concept) do
        lexic = buildlexic(lexic, c, weight * r / |R(O, concept)|, pastconcepts)
    end for
end if
Figure 3. buildlexic() recursive ontology lexical building method

Figure 4. Example of small weighted topical ontology with lexic. [Diagram: the concept OpenCyc: artificial intelligence (lexic 'artificial intelligence', 'computational intelligence'; weight 1) is linked by a "broader" relation to OpenCyc: computer science (lexic 'computer science', 'computing'; weight 0.2) and by a "sameas" relation to Wordnet: artificial intelligence (lexic 'artificial intelligence', 'ai').]

3.4. A custom Target Recall metric using ODP targets

To evaluate the success of topical crawlers in the context of this study we use a custom definition of Target Recall using ODP targets.
Let $T$ be the set of known targets for a given topic.
Let $TR(d)$ be a function from the set of documents into $\{0, 1\}$. This function characterizes whether a document $d$ is an extended target of a topic or not: $TR(d) = 1$ means that there exists $t$ in $T$ such that $Similarity(d, t) > 0.8$. Otherwise $TR(d) = 0$.
By extension, $TR(D)$ is the sum of $TR(d)$ for $d$ in $D$, and $TR_i(H)$ is the sum of $TR(d)$ for $d$ in the crawl history $H$ truncated at rank $i$.

4. Experimentation

4.1. Experimental data

For each topic we collected backlinks up to seven hops away from the targets. Note that the Web is known to be a small world network with a mean radius around seven. At each iteration we used a sample of one thousand pages to fetch the next hop. For each studied topic we built six topical representations to be matched with the collected documents:
- Keywords: minimal keywords request.
- ODP: concatenation of all descriptions of the topical targets in ODP.
- Wikipedia: textual content of the Wikipedia page closest to the topic.
- OpenCyc Min: terminology dictionary built on the OpenCyc ontology with a weight threshold of 0.9 and a weight propagation ratio of 0.0.
- OpenCyc Short: terminology dictionary built on the OpenCyc ontology with a weight threshold of 0.1 and a weight propagation ratio of 0.5.
- OpenCyc Wide: terminology dictionary built on the OpenCyc ontology with a weight threshold of 0.01 and a weight propagation ratio of 0.9.

Data collection and analysis concentrated on three sample topics from ODP:

Artificial Intelligence
ODP: http://www.dmoz.org/computers/artificial Intelligence/
Keywords: Artificial Intelligence
Wikipedia: http://en.wikipedia.[...]intelligence
OpenCyc core concept: [...]org/concepts/artificialintelligence
Number of targets: 1148

Mac OS
ODP: [...]org/computers/software/ Operating Systems/Mac OS/
Keywords: Mac OS
Wikipedia: http://en.wikipedia.org/wiki/mac os
OpenCyc core concept: http://www.cycfoundation.[...]org/concepts/macos
Number of targets: 366

Robotics
ODP: [...]
Keywords: Robotics
Wikipedia: [...]
OpenCyc core concept: [...]org/concepts/robotics
Number of targets: 745

4.2. Backlinks results

Figures 5, 6, 7, 8, 9 and 10 show the similarity of backlink pages from the three studied topics with all six types of requests used. The shape of the downward similarity gradient for each metric illustrates its routing capability toward the targets of the topic. The more progressive and regular the curve, the better the content analysis indicates the proximity of a group of documents to targets, and therefore the appropriateness of their ranking in the frontier of a topical crawler. Along with the general shape of these curves, the mean error indicates the reliability of the metric for estimating the distance to the targets of the topic.

Figures 5, 6 and 7 show the similarity of backlink pages from the three studied topics with the textual requests Keywords, ODP and Wikipedia. The ODP similarity metric, which is supposed to best match the content of the targets themselves, behaves in much the same way in all three cases, with a regular downward curve all the way through the seven hops and a mean error of about half the mean similarity value. On the other hand, the Keywords and Wikipedia metrics behave more unpredictably. The Keywords based metric leads to a steep but regular curve with the smallest mean error for the Artificial Intelligence topic until distance 3 (in hops) is reached, but is otherwise quite irregular with a high mean error. The mean error of Wikipedia is generally about the same as that of ODP; its curve has a quite regular shape on Robotics and Mac OS, but much less so on Artificial Intelligence.

Figure 5. Textual metrics similarities with back links of topic Artificial Intelligence
Figure 6. Textual metrics similarities with back links of topic Mac OS
Figure 7. Textual metrics similarities with back links of topic Robotics

Figures 8, 9 and 10 show the similarity of backlink pages from the three studied topics with the ontological requests OpenCyc Min, Short and Wide. In all cases those metrics behave in a very similar way, with a regular downward gradient through the seven backlink hops but with high mean errors. The larger terminology dictionary of OpenCyc Wide leads to a slightly more progressive downward curve.

Figure 8. Ontology metrics similarities with back links of topic Artificial Intelligence
Figure 9. Ontology metrics similarities with back links of topic Mac OS
Figure 10. Ontology metrics similarities with back links of topic Robotics

These observations tend to show that the a priori knowledge embedded in the ODP metric pays off and leads to the most reliable content metric among the six metrics tested. However, no obvious topic independent ranking of the metrics can be deduced, because the Wikipedia, Keywords and ontology metrics also demonstrate interesting qualities on specific topics. In a very general way, the regularity of most similarity gradients shows that statistical textual content metrics can be good indicators of the distance of Web pages from topical targets and therefore have good routing capabilities for topical crawling. Still, the high mean errors show that the gradient can only be used in an ensemble approach (using a population of crawling agents) and that there is no guarantee for the efficiency of a single crawler using local and partial data only.

4.3. Experimental crawlers

To match the observations of backlink similarity data with actual crawling results we run a simple Best First crawler on multiple test cases. For each test we specify the topic targeted, the content metric used to sort the frontier, the length of the crawl and the distance of the seeds from the targets of the topic. For example, at distance one of the topic Robotics, two hundred seeds are randomly selected from the set of one thousand backlinks collected at one hop from the targets. For each topic we experiment with crawlers using each content metric at distances 1, 3 and 6. All crawlers are run 8 times. Crawlers starting at distances 1 and 3 have a length of 3000 documents, while crawlers starting at distance 6 have a length of [...] documents. Target recall curves are drawn using the custom Target Recall metric defined in section 3.4. Calculation time being a hard constraint in this context, we were unable to complete all results for the longer tests of length [...].

4.4. Crawler results

Some Target Recall curves are shown in figures 11, 12 and 13. The clarity of the results suffers from very large standard deviations (which are not displayed, to keep the charts readable), and due to a lack of experimental data we cannot yet draw a clear conclusion about a correlation between backlink observations and crawler success. Most results, like those displayed in figure 11, are too close to call, and the differences observed could be attributed to the lack of test cases. Still, some interesting observations can be made among the clearest results collected. For example, the Keywords metric on the Artificial Intelligence topic displays the cleanest backlink similarity gradient and a quite small standard deviation until distance 3 (see figure 5) and is also very successful in the crawling experiment (see figure 12).
Also, in the case where the crawlers start at a distance of 6 from the targets of the Mac OS topic (see figure 13), the metrics ODP, OpenCyc Short and OpenCyc Wide, which are semantically rich and display regular backlink similarity gradients (see figures 6 and 9), perform much better than Wikipedia and Keywords, which have noisier backlink similarity results (see figure 6). These encouraging results can form the basis of metric selection criteria that depend on the selected topic, as a function of the shape of the gradient, the associated standard deviation and the estimated distance of the targets (i.e. the difficulty of the topic). Results show that all tested metrics have the potential to bring good results as well as bad ones depending on the test case. The very large variations illustrate what was said about the necessity of an ensemble approach and the absence of any guarantee of success for a single crawling agent. To see the impact of the various components of a topical crawler design, such as its routing metric, it is necessary to aggregate the results of a large population of crawling agents. In the same way, such a population approach is needed to exploit the clear gradient derived from our observations of topical backlinks.
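The evaluation loop of sections 3.4, 4.3 and 4.4 can be sketched as follows, under stated assumptions: `best_first_crawl`, `target_recall_curve`, `median_curve`, the `fetch` and `score` callables and the four page `TOY_WEB` are hypothetical names introduced for illustration, pages are assumed to be pre-tokenized, and the crawler is a simplified Best First variant that ranks each link by the score of the page it was extracted from.

```python
import heapq
import math
import statistics
from collections import Counter


def cosine(a, b):
    """Cosine similarity between the term frequency vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def best_first_crawl(seeds, fetch, score, max_pages):
    """Visit pages in order of the score of the page that linked to them.

    fetch(url) -> (tokens, outlinks) and score(tokens) -> float are
    hypothetical callables supplied by the caller.
    """
    frontier = [(0.0, url) for url in seeds]  # min-heap of (-priority, url)
    heapq.heapify(frontier)
    queued = set(seeds)
    history = []
    while frontier and len(history) < max_pages:
        _, url = heapq.heappop(frontier)
        tokens, outlinks = fetch(url)
        history.append((url, tokens))
        priority = score(tokens)
        for link in outlinks:
            if link not in queued:
                queued.add(link)
                heapq.heappush(frontier, (-priority, link))
    return history


def target_recall_curve(history_tokens, target_docs, threshold=0.8):
    """Cumulative count of extended targets (similarity above threshold) by rank."""
    curve, found = [], 0
    for tokens in history_tokens:
        if any(cosine(tokens, t) > threshold for t in target_docs):
            found += 1
        curve.append(found)
    return curve


def median_curve(curves):
    """Pointwise median of several recall curves, truncated to the shortest one."""
    length = min(len(c) for c in curves)
    return [statistics.median(c[i] for c in curves) for i in range(length)]


# Toy web (assumption): url -> (tokens, outlinks).
TOY_WEB = {
    "s1": (["robot", "news"], ["t"]),
    "s2": (["cooking"], ["x"]),
    "t": (["robot", "arm", "kit"], []),
    "x": (["recipes"], []),
}
TOPIC = ["robot", "arm", "kit"]
history = best_first_crawl(
    ["s1", "s2"], TOY_WEB.__getitem__, lambda tokens: cosine(tokens, TOPIC), max_pages=4
)
curve = target_recall_curve([tokens for _, tokens in history], [TOPIC])
```

In an ensemble setting, one such curve would be produced per run (the paper uses 8 runs per test case) and the runs aggregated with `median_curve`, which is one way to read the median target recall values reported in the figures.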

Figure 11. Median target recall values of Best First crawlers on topic Artificial Intelligence with seeds at distance 1 and history length of 3000
Figure 12. Median target recall values of Best First crawlers on topic Artificial Intelligence with seeds at distance 3 and history length of [...]
Figure 13. Median target recall values of Best First crawlers on topic Mac OS with seeds at distance 6 and history length of [...]

5. Conclusion

We described a method to retrieve partial topical graph data from the Web. Among other uses, studying these graphs is one of the keys to developing better topical crawlers or experimenting with existing crawlers. In this first application we validate the guiding capability, for focused crawlers, of document content metrics based on word frequency and ontology knowledge. Results show both the importance of using a rich representation of the topic and of avoiding ambiguity and noise in the information. Finally, we have shown that ensemble approaches are necessary to exploit the guiding capability of the metrics. Another potentially major improvement is the exploitation of the context in which a link occurs within a page to improve the ranking of the crawler frontier. We are in the process of building a consolidated study based on a multi agent architecture for crawling topical data from the Web, in order to confirm and extend our main results.

References

[1] P. Srinivasan, F. Menczer, and G. Pant, A general evaluation framework for topical crawlers, Inf. Retr., vol. 8, no. 3.
[2] R. Albert, H. Jeong, and A. L. Barabasi, The diameter of the world wide web, Nature, vol. 401.
[3] A.-L. Barabasi, R. Albert, and H. Jeong, Scale-free characteristics of random networks: The topology of the world-wide web.
[4] D. Gibson, J. Kleinberg, and P. Raghavan, Inferring web communities from link topology, in HYPERTEXT 98: Proceedings of the ninth ACM conference on Hypertext and hypermedia: links, objects, time and space structure in hypermedia systems. New York, NY, USA: ACM, 1998.
[5] R. Chandrasekar, H. Chen, S. Corston-Oliver, and E. Brill, Subwebs for specialized search, in SIGIR 04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM, 2004.
[6] T. H. Haveliwala, Topic-sensitive pagerank, in WWW 02: Proceedings of the 11th international conference on World Wide Web. New York, NY, USA: ACM, 2002.
[7] L. Nie, B. D. Davison, and X. Qi, Topical link analysis for web search, in SIGIR 06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM, 2006.
[8] L. Page, S. Brin, R. Motwani, and T. Winograd, The pagerank citation ranking: Bringing order to the web.
[9] J. M. Kleinberg, Authoritative sources in a hyperlinked environment, J. ACM, vol. 46, no. 5.
[10] F. Menczer, R. K. Belew, and W. Willuhn, Artificial life applied to adaptive information agents, in AAAI Spring Symposium on Information Gathering.
[11] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, May [...]. [Online]. Available: citeulike07-20&path=asin/ x
