The WEBFIND tool for finding scientific papers over the worldwide web

Alvaro E. Monge and Charles P. Elkan
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, California
Phone: (619) Fax: (619)

1 Introduction

Information retrieval in the worldwide web environment poses unique challenges. The worldwide web is a distributed, always changing, and ever expanding collection of documents. These features of the web make it difficult to find information about a specific topic. The most common approaches involve indexing, but indexes introduce centralization and can never be up-to-date. Available information retrieval software has been designed for very different environments, with typical tools [Salton and McGill, 1983; Salton and Buckley, 1988] working on an unchanging corpus, with the entire corpus available for direct access.

This paper describes WEBFIND, an application that discovers scientific papers made available by their authors on the web. WEBFIND uses a novel approach to performing information retrieval on the worldwide web. The approach is to use a combination of external information sources as a guide for locating where to look for information on the web. The external information sources used by WEBFIND are MELVYL and NETFIND. MELVYL is a University of California library service that includes comprehensive databases of bibliographic records, including a science and engineering database called INSPEC [University of California, 1996]. NETFIND is a white pages service that gives internet host addresses and people's addresses [Schwartz and Pu, 1994]. Separately, these services do not provide enough information to locate papers on the web. WEBFIND integrates the information provided by each in order to find a path for discovering on the web the information actually wanted by a user.

2 Approaches to worldwide web information retrieval

The most common approach to resource discovery over the web is to use an index to store information about web documents. This approach involves periodic automatic searching of the web and gathering of information about all the documents found in these searches. AltaVista [Digital Equipment Corporation, 1996], WebCrawler [Pinkerton, 1994], Lycos [Mauldin and Leavitt, 1994], Infoseek [Randall, 1995], and Inktomi [Brewer and Gauthier, 1995] are the most important examples of applications which use this indexing approach. WebCrawler can also use its own index to suggest starting points for online searches.

The main alternative to resource discovery based on offline indexing is to perform automated online searching. Such online searching requires sophisticated heuristic reasoning to be sufficiently focused. The most developed example of this approach is the so-called Internet Softbot [Etzioni and Weld, 1994]. The
Softbot is a software agent that transforms a user's query into a goal and applies a planning algorithm to generate a sequence of actions that should achieve the goal. The Softbot planner possesses extensive knowledge (some acquired through learning) about the information sources available to it.

The WEBFIND approach to resource discovery is similar to the WebCrawler and Softbot approaches, in that WEBFIND performs online searching of the web. However, unlike the WebCrawler, the starting points used by WEBFIND are suggested by inference from information provided by reliable external sources, not by a precomputed index of the web. Unlike the Softbot, WEBFIND uses application-specific algorithms for reasoning with its information sources. In principle a planning algorithm could generate the reasoning algorithm used by WEBFIND, but in practice the WEBFIND algorithms are more sophisticated than it is feasible to synthesize automatically.

3 The design of WEBFIND

This section describes the protocol followed by WEBFIND when retrieving a scientific paper over the worldwide web. The two main phases are, first, integrating information provided by INSPEC and NETFIND, and second, discovering a worldwide web server, an author's home page, and finally the location of the wanted paper.

3.1 INSPEC and NETFIND integration

A WEBFIND search starts with the user providing keywords to identify the paper, exactly as he or she would in searching INSPEC directly. A paper can be identified using any combination of the names of its authors, words from its title or abstract, or other bibliographic information. After the user confirms that the right paper has been identified, WEBFIND queries INSPEC to find the institutional affiliation of the principal author of the paper. Then, WEBFIND uses NETFIND to provide the internet address of a host computer with the same institutional affiliation. A query to NETFIND is a set of keywords describing an affiliation.
Useful keywords are typically words in the name of the institution or in the name of the city, state, and/or country where it is located [Schwartz and Pu, 1994]. The NETFIND query engine cannot process abbreviations, so WEBFIND chooses full words found in the affiliation given by INSPEC, with a few common abbreviations expanded, such as univ. to university.

In general, the result of a NETFIND query is all hosts whose affiliation contains the keywords in the query. There can be many such hosts, and WEBFIND must determine which of them is best. Since institutions are designated very differently in INSPEC and NETFIND, it is non-trivial to decide when an INSPEC institution corresponds to a NETFIND institution. WEBFIND uses the recursive field matching algorithm described in Monge and Elkan [1996] to do this. The algorithm returns a score between 0.0 and 1.0, where 1.0 means certain equivalence and 0.0 means certain non-equivalence. The internet host selected is the one whose NETFIND affiliation has the highest matching score with the INSPEC affiliation.

3.2 Discovery phase

The searching of the worldwide web done by WEBFIND is real-time in two senses. First, the search takes place while the user is waiting for a response to his or her query. Second, information gathered from one retrieved document is analyzed and used to guide what documents are retrieved next.
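
The host selection step of Section 3.1 can be illustrated with a minimal Python sketch. It does not reproduce the recursive field matching algorithm of Monge and Elkan [1996], whose details are not given in this paper. As a stated assumption, it substitutes a much simpler word-level score: the average, over the INSPEC keywords, of the best match among the NETFIND keywords, where two words match if they are equal or one is a prefix of the other. The abbreviation table (beyond univ. to university), the function names, and the sample NETFIND descriptions are all hypothetical.

```python
import re

# Hypothetical abbreviation table; the paper itself only names "univ." -> "university".
ABBREVIATIONS = {
    "univ": "university",
    "dept": "department",
    "comput": "computer",
    "sci": "science",
    "eng": "engineering",
    "lab": "laboratory",
}


def affiliation_keywords(affiliation: str) -> list[str]:
    """Lowercase the affiliation, split it into words, and expand known abbreviations."""
    words = re.findall(r"[a-z]+", affiliation.lower())
    return [ABBREVIATIONS.get(word, word) for word in words]


def word_match(a: str, b: str) -> float:
    """Base case (assumed): 1.0 if the words are equal or one is a prefix of the other."""
    return 1.0 if a.startswith(b) or b.startswith(a) else 0.0


def match_score(inspec: str, netfind: str) -> float:
    """Simplified stand-in for field matching; returns a value between 0.0 and 1.0."""
    left = affiliation_keywords(inspec)
    right = affiliation_keywords(netfind)
    if not left or not right:
        return 0.0
    best_matches = (max(word_match(a, b) for b in right) for a in left)
    return sum(best_matches) / len(left)


def select_host(inspec: str, hosts: dict[str, str]) -> str:
    """Return the host whose NETFIND description scores highest against INSPEC."""
    return max(hosts, key=lambda host: match_score(inspec, hosts[host]))


# Hypothetical NETFIND descriptions, for illustration only.
sample_hosts = {
    "cs.ucsd.edu": "computer science engineering university california san diego",
    "cs.cornell.edu": "computer science cornell university ithaca new york",
}
chosen = select_host("Dept. of Comput. Sci., Cornell Univ., Ithaca, NY, USA", sample_hosts)
print(chosen)
```

Taking the maximum over hosts mirrors the rule stated above, that the selected host is the one with the highest matching score. A tie between several hosts is resolved arbitrarily by this sketch.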

The first step in the discovery phase is to find a worldwide web server on the chosen internet host. This step uses heuristics based on common patterns for naming servers. The most widely used convention is to use the prefix www. or www-. WEBFIND tests the existence of a server named with either of these prefixes by calling the Unix ping utility. If either prefix yields a server, then WEBFIND continues with the next step of the discovery phase. Otherwise, WEBFIND strips off the first segment of the internet host name and applies the same heuristics again. For example, cs.ucsd.edu is transformed to ucsd.edu, and then the potential server names formed with the two prefixes, such as www-ucsd.edu, are pinged.

Once a worldwide web server has been identified, WEBFIND follows links until the wanted article is found. This search proceeds in two stages: find a web page for the principal author, and find a web page that is the wanted article. Each stage of the search uses a priority queue whose entries are candidate links to follow. The priority of each link in the queue is equal to the estimated relevance of the link. For the first stage, the priority queue initially has a single link, the link for the main page of the server. For the second stage, the priority queue initially contains just the result of the first stage.

When a link is added to the priority queue, its relevance is estimated using the recursive field matching algorithm applied to the context of the link and each of two sets of keywords, a primary set and a secondary set. The context of a link is its anchor text and the two lines before and two after the line containing the link, provided no other link appears in those lines. Links are ranked lexicographically, first using degree of match to the primary set, and then using degree of match to the secondary set.

In the first stage of search, the primary set of keywords is the name of the principal author, while the secondary set is {staff, people, faculty}.
Intuitively, the main objective is to find a home page for the author, while the fall-back objective is to find a page with a list of people at the institution. In the second stage of search, the primary set of keywords is the title of the wanted article, while the secondary set has keywords {publications, papers, reports}. Here, the main objective is to find the actual wanted paper, while the fall-back objective is to find a page with pointers to papers in general.

At each stage, the search procedure is to repeatedly remove the first link from the priority queue and retrieve the pointed-to web page. The search succeeds when this page is the wanted page. If the page is not the wanted page, all links on it are added to the priority queue with their relevance estimated as just described. The search fails when the queue is empty.

Even if a stage of the search fails, the user still receives useful information. If the first stage fails, the user is given the web page of the author's institution. If the second stage fails, the user is given the web page of the author's institution and the author's own home page.

4 Experimental results

This section reports on experiments performed with the initial implementation of WEBFIND. The aim of the experiments was to identify which aspects of this first version of WEBFIND are the limiting factors in its ability to locate authors and their papers on the worldwide web. Figure 1 shows an example of a WEBFIND discovery session. The experiments discussed here used queries in different areas of computer science concerning papers by authors at ten different institutions.
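
The discovery phase of Section 3.2 can be sketched in Python as follows. The sketch covers two pieces: generating candidate server names by applying the www. and www- prefixes and stripping leading host name segments, and a best-first search driven by a priority queue ordered lexicographically by (primary, secondary) relevance. It is an illustration under stated assumptions, not the WEBFIND implementation. The function names, the `fetch_links`, `relevance`, and `is_wanted` callbacks, the set of already seen links, the page limit, and the rule of stopping at two host name segments are all assumptions. Nothing is pinged or fetched; the toy link graph at the end is hypothetical.

```python
import heapq
from typing import Callable, Iterable, Iterator, Optional


def candidate_servers(host: str, min_segments: int = 2) -> Iterator[str]:
    """Yield www. and www- names for the host, then for each shorter suffix of it."""
    segments = host.split(".")
    while len(segments) >= min_segments:
        domain = ".".join(segments)
        yield "www." + domain
        yield "www-" + domain
        segments = segments[1:]  # strip the first segment: cs.ucsd.edu -> ucsd.edu


def best_first_search(
    start: str,
    fetch_links: Callable[[str], Iterable[tuple[str, str]]],
    relevance: Callable[[str], tuple[float, float]],
    is_wanted: Callable[[str], bool],
    max_pages: int = 50,
) -> Optional[str]:
    """Follow the most relevant link first; return the wanted page or None.

    fetch_links(url) returns (target_url, link_context) pairs.
    relevance(context) returns (primary_score, secondary_score).
    """
    # heapq is a min-heap, so scores are negated to pop the best link first.
    # The counter breaks ties in insertion order and avoids comparing URLs.
    queue = [((0.0, 0.0), 0, start)]
    seen = {start}
    counter = 1
    pages = 0
    while queue and pages < max_pages:
        _, _, url = heapq.heappop(queue)
        pages += 1
        if is_wanted(url):
            return url
        for target, context in fetch_links(url):
            if target not in seen:
                seen.add(target)
                primary, secondary = relevance(context)
                heapq.heappush(queue, ((-primary, -secondary), counter, target))
                counter += 1
    return None  # queue empty (or page limit reached): the search fails


# Hypothetical toy link graph: page name -> list of (target, link context).
toy_web = {
    "home": [("news", "campus news"), ("people", "faculty and staff")],
    "people": [("other", "someone else"), ("author", "Alvaro Monge")],
    "news": [],
    "author": [],
    "other": [],
}


def toy_relevance(context: str) -> tuple[float, float]:
    """Crude stand-in for field matching: primary = author name, secondary = people pages."""
    text = context.lower()
    primary = 1.0 if "monge" in text else 0.0
    secondary = 1.0 if any(k in text for k in ("staff", "people", "faculty")) else 0.0
    return primary, secondary


found = best_first_search("home", lambda url: toy_web[url], toy_relevance,
                          lambda url: url == "author")
print(list(candidate_servers("cs.ucsd.edu")))
print(found)
```

In the toy graph the link labeled "faculty and staff" is followed before "campus news" because it matches the secondary keyword set, which is the fall-back behavior described in Section 3.2.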

Figure 1: Results from a WEBFIND discovery session

Dept. of Comput. Sci. & Eng., California Univ., San Diego, La Jolla, CA, USA
Dept. of Cognitive Sci., California Univ., San Diego, La Jolla, CA, USA
Dept. of Electr. Eng. & Comput. Sci., California Univ., Berkeley, CA, USA
Dept. Comput. Sci. Eng., Washington Univ., Seattle, WA, USA
Lab. for Comput. Sci., MIT, Cambridge, MA, USA
Dept. of Comput. Sci., Cornell Univ., Ithaca, NY, USA
Dept. of Comput. Sci., Texas Univ., Austin, TX, USA
Dept. of Electr. Eng. & Comput. Sci., Illinois Univ., Chicago, IL, USA
Dept. of Comput. Sci., Waterloo Univ., Ont., Canada
Dept. of Comput. Sci., Columbia Univ., New York, NY, USA

INSPEC author affiliations.

We report on the ability of WEBFIND to map affiliations to internet hosts, to discover worldwide web servers, to discover home pages for authors, and finally to discover the wanted paper.

WEBFIND correctly associated eight of the ten INSPEC affiliations to internet hosts in NETFIND. The first affiliation that WEBFIND did not correctly identify was Dept. of Cognitive Sci., California Univ., San
Diego. The reason for this failure was that NETFIND does not have an entry for this department, although its internet host is cogsci.ucsd.edu. In future work we intend to quantify the comprehensiveness of the coverage of NETFIND, and if necessary we will extend WEBFIND to use additional white page resources.

The other affiliation that WEBFIND did not find a correct host for was Lab. for Comput. Sci., MIT. The reason here is that fifteen different internet hosts all have a NETFIND description equivalent to Lab. for Comput. Sci., MIT. Each of these hosts corresponds to a different research group (for example cag.lcs.mit.edu belongs to the computer architecture group) but this information is not available in either the INSPEC or NETFIND affiliation descriptions. The next version of WEBFIND will overcome this problem by adding keywords to the INSPEC and/or the NETFIND affiliations, if necessary. Added INSPEC affiliation keywords will be subject keywords, while added NETFIND affiliation keywords will be host name segments. For example, adding the subject keywords computer architecture to Lab. for Comput. Sci., MIT would give a specific match to the host name cag.lcs.mit.edu. Note that this will often involve matching of abbreviations, e.g. of cag and computer architecture.

Of the eight internet hosts that WEBFIND found correctly, there was only one that it could not find a worldwide web server for. Given the simple heuristic used for finding a server (Section 3.2), this is encouraging.

WEBFIND found the home page for five principal authors on the seven worldwide web servers it searched. The other two principal authors did not have home pages of any kind on the servers found by WEBFIND. In these two cases, the authors were no longer affiliated with the institution that INSPEC provided. We will solve this problem in the next version of WEBFIND by using the most recent information that INSPEC can provide.
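
The paper does not say how an abbreviation such as cag would be matched against subject keywords such as computer architecture. One simple possibility, shown here purely as an assumption and not as the planned WEBFIND method, is to test whether the initials of the keyword phrase form a prefix of the first host name segment. The function names are hypothetical, and the second host name in the example is made up for contrast; only cag.lcs.mit.edu comes from the text.

```python
def initials(phrase: str) -> str:
    """Return the lowercase first letters of the words in the phrase."""
    return "".join(word[0] for word in phrase.lower().split())


def segment_matches(segment: str, phrase: str) -> bool:
    """True if the phrase's initials are a non-empty prefix of the host name segment."""
    letters = initials(phrase)
    return bool(letters) and segment.lower().startswith(letters)


def matching_hosts(hosts: list[str], subject_keywords: str) -> list[str]:
    """Keep the hosts whose first host name segment matches the subject keywords."""
    return [host for host in hosts
            if segment_matches(host.split(".")[0], subject_keywords)]


# "example-group.lcs.mit.edu" is a made-up host name used only for contrast.
candidates = ["example-group.lcs.mit.edu", "cag.lcs.mit.edu"]
narrowed = matching_hosts(candidates, "computer architecture")
print(narrowed)
```

A prefix test is used rather than exact equality because the segment cag carries a trailing letter (for group) that the subject keywords do not supply. A rule this loose would also accept unrelated segments that happen to begin with the same letters, so it would need to be combined with the affiliation score in practice.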
Finally, WEBFIND successfully discovered two papers starting from the five authors' home pages found. The low rate is due to the type of author pages which were discovered. Two of the five pages were not personal home pages, but rather they were annual reports or research statements which did not provide any outgoing links, so the wanted papers were not in fact available through their authors' home pages.

In summary, our experiments show that WEBFIND is successful at finding worldwide web servers and finding web pages designated for authors. WEBFIND is less successful at finding actual papers, mainly because many authors have not yet published their papers on the worldwide web.

5 Conclusion

This paper describes a novel approach to the task of finding information relevant to a user's inquiry on the worldwide web. Existing approaches, namely the indexing of worldwide web pages, are plagued with problems caused by the size and distributed, dynamic nature of the worldwide web. Our approach uses external information sources to restrict the part of the worldwide web which is searched. This integration requires a flexible heuristic algorithm for detecting equivalence between alternative ways of writing the names of entities such as people and institutions. A first experimental evaluation indicates that our approach is effective, and that its present limitations are not fundamental.

References

[Brewer and Gauthier, 1995] Eric Brewer and Paul Gauthier. Inktomi search engine. URL,
[Digital Equipment Corporation, 1996] Digital Equipment Corporation. AltaVista search engine. URL,
[Etzioni and Weld, 1994] Oren Etzioni and Daniel Weld. A softbot-based interface to the internet. Communications of the ACM, 37(7):72-76, July
[Mauldin and Leavitt, 1994] Michael Mauldin and John Leavitt. Web-agent related research at the center for machine translation. In Proceedings of the ACM Special Interest Group on Networked Information Discovery and Retrieval, McLean, VA, August
[Monge and Elkan, 1996] Alvaro E. Monge and Charles P. Elkan. The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, To appear.
[Pinkerton, 1994] Brian Pinkerton. Finding what people want: Experiences with the WebCrawler. In Electronic Proceedings of the Second International Conference on the World Wide Web, Chicago, October Elsevier Science BV.
[Randall, 1995] Neil Randall. The search engine that could. (locating world wide web sites through search engines). PC/Computing, 8(9):165 (4 pages), September
[Salton and Buckley, 1988] Gerard Salton and Chris Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5): ,
[Salton and McGill, 1983] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw Hill,
[Schwartz and Pu, 1994] Michael Schwartz and Calton Pu. Applying an information gathering architecture to Netfind: a white pages tool for a changing and growing internet. Technical Report 5, Department of Computer Science, University of Colorado, October
[University of California, 1996] Division of Library Automation University of California. Melvyl system welcome page. URL, May
