Finding Scientific Papers with HomePageSearch and MOPS

Gerd Hoff
Universität Trier, FB IV Informatik, Trier, Germany

Martin Mundhenk
Friedrich-Schiller-Universität Jena, Fakultät für Mathematik und Informatik, Institut für Informatik, Jena, Germany

ABSTRACT

The fast dissemination of new research results on the worldwide web poses new challenges for search engines. In this paper we describe a new approach to seeking scientific papers relevant to a pre-defined research area. Different from other approaches, we do not search for web pages which contain certain keywords, but for web pages which are created by scientists who are active in the research area under consideration. The names of these scientists are obtained from the DBLP server [9]. The HomePageSearch system finds the Home Pages corresponding to the names, and Mops finds research papers close to these Home Pages. It creates an index of these papers and makes it accessible on the web. We conclude that such focused crawling is very effective for building high-quality collections and indices of scientific papers using ordinary desktop hardware.

Keywords: scientific literature on the web, intelligent search techniques, focused navigation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGDOC '01, October 21-24, 2001, Santa Fe, New Mexico, USA. Copyright 2001 ACM /01/ $5.00.

1. INTRODUCTION

In many research areas, scientists make their newest results electronically available on their web sites, long before the results appear in conference proceedings or in journals. Whereas a decade ago the state of the art in a research area could be determined by reading conference proceedings and journals in the local library, nowadays it is additionally necessary to find these electronic publications on the web. Traditional search engines do not help with this task, because they do not index, e.g., postscript documents, which is the electronic format of many preprints appearing on the web. The few existing searchable indices for postscript documents either cover fields that are too large (all of computer science, for example) to be really helpful, or they depend on some submission procedure which delays the appearance of the documents on the web.

We present an approach for constructing a searchable index for scientific papers which is specialized to a relatively small research area and also allows finding the latest new documents. This proceeds in three steps. In the first step, a list of names of people who are active in the research area under consideration is constructed. This information is obtained from electronic computer science bibliographies, and therefore the names found can be seen as certified. In the second step, the Home Pages of these people are found. In the third step, these Home Pages are used as starting points for a search engine which collects scientific papers in the area close to these Home Pages. The documents are used to create a searchable index which is accessible on a web server.

In the rest of this article we describe the three steps in more detail. Section 2 describes related work.
In Section 3 we describe how the list of scientists' names can be constructed and maintained automatically. Finding their Home Pages is the technically most complicated task; Section 4 describes how our information system HomePageSearch performs it. The search engine Mops, which finds the scientific documents, is described in Section 5. All three systems work independently of each other. The rough structure of their interconnection is shown in Figure 1. We tested our approach by creating two example indices: the research area for one index is complexity theory, and for the other it is BDDs (binary decision diagrams, a data structure for VLSI design and verification).

2. RELATED WORK

There are many web information systems helping users to search and browse the web. A good and systematic overview is given in Chapter 13 of [3]. We focus here on systems for finding scientific literature, mainly in the field of Computer Science (CS).

The system Ahoy! [17] searches for personal Home Pages. Candidate references are detected with the help of search engines. Title, URL, and description retrieved by the search engines are examined. If there is no good hit and the user has entered an institute name, Ahoy! guesses a Home Page URL. Unlike Ahoy!, HomePageSearch does not guess URLs; it loads web documents and stores important information in its database for direct access. Moreover, the priority of our system is on finding pages of scientists.

There are several searchable indices for computer science papers. They can be divided into two classes: one gets the papers by searching the web, and the other gets them from submissions. The Research Index (formerly CiteSeer) [8] is a full-text search engine for CS research papers. The system provides an excellent handling of citations (Autonomous Citation Indexing) and an excellent search index.

Figure 1: Creation of a preprint index (DBLP interface → HPSearch → Home Page URLs → Mops → searchable index)

However, the starting points for its search are obtained by keyword search. This is where our approach differs: we search in a more focused way for relevant Home Pages. From our experience, we assume that Research Index does not necessarily cover research papers which are relatively new (two months or so). By the focused choice of starting points for our search engine, the search space is kept small and new research papers are found much faster.

Cora [16, 13] is an engine for CS research papers that uses machine learning techniques to categorize the papers into a topic hierarchy in a Yahoo-like fashion. Compared to our index, it covers relatively few papers. Both of these engines try to derive from the structure of the net where papers can be found, but they cannot assess the importance of what they find. With our approach, the careful selection of starting points determines which environments are searched, so the chance that important papers are found becomes higher.

There are several search indices for technical reports, e.g. the New Zealand Digital Library [11], the ACM Research Repository [2], or NCSTRL [15]. All of them eventually depend on some submission procedure, which delays the appearance of the research papers.

3. FINDING SCIENTISTS

The quality of the searchable index strongly depends on finding the scientists active in the considered research area. Traditionally, the community acknowledges the importance of a research contribution by accepting it for presentation at a conference or for publication in a journal. Therefore, the people active in a research area are those who have papers in conference proceedings or in journals. Hence, one way to find their names is to browse the tables of contents of conference proceedings and journals. We do this in an automated way. The Computer Science Bibliography DBLP [9] can be searched for authors and titles of papers in given publications. There are several good web services which provide similar bibliographic information in computer science [7, 1]. For our example indices, we used the DBLP server. The list of people for our complexity theory index consists of those who presented papers at the conference Computational Complexity during the last years. For the BDD index, we took the conference Computer Aided Verification, and moreover the authors of publications whose titles contain the word BDD or binary decision diagram. Using the output of DBLP, these lists are constructed and maintained automatically.

4. HomePageSearch

4.1 Overview

HomePageSearch is a web-based, topic-oriented information system for finding and watching scientifically relevant personal Home Pages. Personal Home Pages have obtained an important position in scientific communication: preprints and long versions of published papers, project descriptions, course material, and contact information can be found there. HomePageSearch has two advantages in comparison to standard search engines: our system has collected a large amount of topic-specific data (the Home Pages of 75,000 scientists were searched, with their names mainly provided by the DBLP), and it has built up domain-specific knowledge.

HomePageSearch works as follows. A set of candidate Home Pages is formed with the help of the usual search engines. Each page is rated and a ranking is built.
The highest rated candidate Home Pages are visited by an agent and rated again; a final ranking is built. The result is stored in a database. The user queries HomePageSearch through a Web interface and receives a hypertext view of the results. He also has the possibility to start a Web search if the name is not in the database. HomePageSearch is implemented in Java. In order to keep the information content up to date, the results are reconsidered regularly.

Figure 2: Start page of HomePageSearch
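In code, this overview amounts to a gather, rate, and refine loop. The following Java sketch is purely illustrative: the SearchEngine interface, the Candidate record, and the helper methods are hypothetical stand-ins, not HomePageSearch's actual classes; the toy first-step rating uses two of the page characteristics discussed later in Section 4.2.

import java.util.*;

// Illustrative gather-rate-refine loop; not the actual implementation.
// SearchEngine, Candidate, and the helpers below are hypothetical.
class TwoPhaseHomePageSearch {
    interface SearchEngine { List<Candidate> query(String personName); }
    record Candidate(String url, String title, String description, double score) {}

    List<Candidate> search(String name, List<SearchEngine> engines, int visitLimit) {
        List<Candidate> candidates = new ArrayList<>();
        for (SearchEngine e : engines)              // candidates from the usual engines
            candidates.addAll(e.query(name));
        candidates.sort(Comparator.comparingDouble( // presorting on engine metadata
                (Candidate c) -> rateEntry(c, name)).reversed());

        List<Candidate> finalRanking = new ArrayList<>();
        for (Candidate c : candidates.subList(0, Math.min(visitLimit, candidates.size()))) {
            String page = fetchPage(c.url());       // an agent visits the best candidates
            double score = ratePage(page, name);    // rerating on the full document
            finalRanking.add(new Candidate(c.url(), c.title(), c.description(), score));
        }
        finalRanking.sort(Comparator.comparingDouble(Candidate::score).reversed());
        return finalRanking;                        // stored in the database
    }

    // Toy first-step rating using two characteristics from Section 4.2.
    double rateEntry(Candidate c, String name) {
        double s = 0;
        if (c.title().toLowerCase().contains(name.toLowerCase())) s += 2; // name in title
        if (c.url().contains("~")) s += 1;                               // ~ in the URL
        return s;
    }
    double ratePage(String html, String name) { return 0; }  // stub
    String fetchPage(String url) { return ""; }              // stub
}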

In the rest of this section we discuss personal Home Pages (4.2), the rating function and ranking (4.3), maintenance of information (4.4), the architecture of the system (4.5), the working plan of the search (4.6), and experiences (4.7).

4.2 Personal Home Pages

A scientific personal Home Page is a hypertext page (currently in HTML) which is created by the scientist himself or on his behalf. Its main function is to give users an overview of his work. The classification of hypertext pages, and especially the identification of personal Home Pages, is a difficult task. There are no fixed rules which describe scientists' personal Home Pages. We determine characteristics with the help of a set of training documents (the reference set); these hypertext pages are known to be personal Home Pages of computer scientists. A second set of pages (determined by the results of the search engines) is used for control (the control set). Both sets have a size of 1,000 elements.

A characteristic C has to be both relevant, i.e. a certain percentage R of the documents in the reference set have it, and significant, i.e. the ratio

S = (percentage of documents with characteristic C in the reference set) / (percentage of documents with characteristic C in the control set)

has to reach a certain value. Our experience shows that expedient values are R = 2.5 % and S = 2 (the sketch at the end of Section 4.3 makes this selection concrete). We obtain about 500 such characteristics; the most important observations are:

- The probability that certain words appear varies strongly between different sections of a page. It proved reasonable to distinguish the sections title, first header tag on a page, all other header tags, labels of links, and other text within an HTML page. For example, the word publications appears in 0.2 % (2.3 %) of the titles, 0.7 % (1.0 %) of the first headers, 12.2 % (1.4 %) of the other headers, 25.4 % (3.4 %) of the labels, and 16.6 % (5.5 %) of the other text in the reference set (values in brackets: control set).
- The name of the person is mostly found in the title (89.8 % versus 20.0 % in the control set), the first header (51.0 % vs. 4.8 %), or the URL (44.0 % vs. 3.0 %).
- The character ~ (65.5 %) or strings like /people appear, in combination with the name, in the URL.
- There are references to the person's publications.
- The size of a personal Home Page is small: the median size in the reference set is 3 KB, in the control set it is 9 KB.

The domain-specific knowledge itself changes permanently, too. A comparable test from May 5, 1997 shows, for example, that 46.0 % of the documents then had a ~ character in the URL. Therefore, we repeat the above procedure every month automatically.

4.3 The rating function

We assign points for characteristics to rate candidate Home Pages. From the characteristics worked out, a two-step rating function is formed. We use our rating function instead of classical text classification algorithms like naive Bayes, nearest neighbor, or decision trees (see for example [10, 14]) for the following reasons:

- The structure of the document is hard to handle with pure text classification.
- The number of characteristics is too high.
- We rate some off-the-page criteria (e.g. URL structure, information provided by the search engines, links from other Web pages).
- In the first step of the search, we do not have the entire document.
- Everything in the system, including the rating criteria, is highly dynamic. With the automatically updatable rating function, the system remains maintainable.
- Since there is a query term, the name of the person searched for, our application is not a pure classification problem.
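To make the selection criterion of Section 4.2 concrete, here is a minimal Java sketch of the relevance and significance test, assuming a characteristic can be modelled as a predicate over a document; the generic document type and the predicates are placeholders, while the thresholds are the values R = 2.5 % and S = 2 reported above.

import java.util.*;
import java.util.function.Predicate;

// Sketch of the characteristic selection of Section 4.2: keep a candidate
// characteristic if it is relevant (appears in at least R percent of the
// reference set) and significant (its frequency in the reference set is at
// least S times its frequency in the control set).
class CharacteristicSelector {
    static final double R = 2.5;  // minimum percentage in the reference set
    static final double S = 2.0;  // minimum reference/control frequency ratio

    static <D> boolean keep(Predicate<D> characteristic,
                            List<D> referenceSet, List<D> controlSet) {
        double pRef = percentage(characteristic, referenceSet);
        double pCtl = percentage(characteristic, controlSet);
        boolean relevant = pRef >= R;
        boolean significant = (pCtl == 0) ? pRef > 0 : pRef / pCtl >= S;
        return relevant && significant;
    }

    // Percentage of documents satisfying c; both sets are assumed non-empty
    // (the paper uses 1,000 documents each).
    static <D> double percentage(Predicate<D> c, List<D> docs) {
        long hits = docs.stream().filter(c).count();
        return 100.0 * hits / docs.size();
    }
}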
In step 1, HomePageSearch evaluates the entries received from the search engines: title, URL, description, and the position assigned by the search engines' ranking algorithms are made available. After creating a first ranking (presorting), HomePageSearch reads the best candidate Home Pages, whereby headers, links, meta tags, and normal text are evaluated (step 2). A final ranking is built.

4.4 Maintenance of information

Maintaining information is a primary task of HomePageSearch: high actuality must be kept. Therefore, HomePageSearch has two parametrically controllable mechanisms which work periodically and event-oriented.

Evaluation of DBLP access logs: daily, HomePageSearch evaluates the access logs of the person pages in DBLP. Names unknown to HomePageSearch, or names for which the last search took place a long time ago, are searched for. For names in a middle-sized time window, a function which depends on the date of the last search and the score of the best page decides whether a search is started or not. This direct interconnection to the access log of DBLP guarantees that the data collected by HomePageSearch is up to date. (See Figure 3.)

Validation of URLs: weekly, HomePageSearch checks for each name whether the URL of the best-rated page still exists. Each URL is checked at least twice before it is removed from the HomePageSearch database. Then HomePageSearch starts a new search for that name. (See Figure 4.)

It is important to note that the above algorithms and the determination of characteristics (4.2) work completely automatically.
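Figures 3 and 4 below give the decision rules in pseudocode. The decision function f for the middle time window is not specified further in the paper; the following Java sketch therefore shows one plausible shape of the rule, and both the threshold values and the linear form of f are assumptions made purely for illustration.

// Sketch of the search-trigger decision of Figure 3. The roles of
// SEARCHDAYS1/SEARCHDAYS2 follow the paper; the concrete values and the
// shape of f (search more eagerly the older and weaker the best hit is)
// are assumptions -- the paper leaves both as parameters.
class SearchTrigger {
    static final int SEARCHDAYS1 = 30;   // assumed example value
    static final int SEARCHDAYS2 = 180;  // assumed example value

    static boolean startSearch(boolean known, int daysSinceLastSearch, double bestScore) {
        if (!known) return true;                            // name not in the database
        if (daysSinceLastSearch > SEARCHDAYS2) return true; // entry too old
        if (daysSinceLastSearch < SEARCHDAYS1) return false;// entry fresh enough
        return f(daysSinceLastSearch, bestScore);           // middle time window
    }

    // Hypothetical f: interpolate the age inside the window and compare it
    // against the (normalized, 0..1) score of the best page found so far.
    static boolean f(int days, double bestScore) {
        double age = (days - SEARCHDAYS1) / (double) (SEARCHDAYS2 - SEARCHDAYS1);
        return age > bestScore;  // old entry plus weak best page: search again
    }
}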

parameter: SEARCHDAYS1, SEARCHDAYS2 with SEARCHDAYS1 < SEARCHDAYS2
input: name of a person
output: startsearch (search will be started)

if (name not in HomePageSearch) then startsearch = true
elsif (days since last search > SEARCHDAYS2) then startsearch = true
elsif (days since last search < SEARCHDAYS1) then startsearch = false
else startsearch = f(days since last search, best score)

Figure 3: Evaluation of DBLP access logs

parameter: CHECKDAYS1, CHECKDAYS2 with CHECKDAYS2 = CHECKDAYS1 + 7
input: URL
output: startsearch (search will be started)

if (days since last visit of the URL < CHECKDAYS1) then startsearch = false
elsif (URL is valid) then enter new date of visit; startsearch = false
elsif (days since last visit of the URL > CHECKDAYS2) then startsearch = true
else startsearch = false

Figure 4: Validation of URLs

4.5 Architecture

The architecture of HomePageSearch is shown in Figure 5. The core of HomePageSearch is connected with all components; the arrows depict the data flow. HomePageSearch sends queries to search engines, which return result pages. Candidate Home Pages are read from the Web. Results are stored persistently in the database, which runs on DB2 [6]. A hypertext view is generated to show the results (Figures 6, 7, and 8). An export file for DBLP is generated. Through the user interface, HomePageSearch can be queried.

4.6 Working plan of the search

The search is started by a user or by the mechanisms described in 4.4:

1. The name is ascertained.
2. Queries for the different search engines are generated.
3. The response pages are parsed and entries are extracted.
4. The entries are rated and a ranking is determined (presorting).
5. The best hits are visited and rerated; a final ranking is determined.
6. The result is stored in the database.
7. A hypertext page is generated.

Beyond that working plan, there are some minor but important extensions. We perform two forms of URL unification independently: a string index.* or a final / character is not significant, and host names are mapped to their IP numbers. For instance, the URLs ~hoff/index.eng.html, ~hoff/index.html, and ~hoff are unified to the same address. Furthermore, we expand the set of candidate Home Pages by shortening a URL to a user's start page, and by adding pages which are linked from pages in the process where the label of that link contains the name of the searched person.
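A minimal Java sketch of the two unification steps follows, assuming absolute URLs and ignoring error handling; the exact index.* pattern is an assumption.

import java.net.InetAddress;
import java.net.URI;

// Sketch of the two URL unification steps of Section 4.6:
// (1) a trailing index.* file or final "/" is not significant,
// (2) host names are mapped to their IP number.
class UrlUnifier {
    static String unify(String url) throws Exception {
        URI u = new URI(url);
        String path = (u.getPath() == null) ? "" : u.getPath();
        // drop a trailing index.* component, e.g. index.html, index.eng.html
        path = path.replaceFirst("/index\\.[^/]*$", "/");
        // drop a final slash
        if (path.endsWith("/")) path = path.substring(0, path.length() - 1);
        // map the host name to its IP number
        String host = InetAddress.getByName(u.getHost()).getHostAddress();
        return u.getScheme() + "://" + host + path;
    }
}

With these rules, http://www.example.org/~hoff/index.eng.html and http://www.example.org/~hoff (a made-up host, used only for illustration) would unify to the same string.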

4.7 Experiences

HomePageSearch provides about 75,000 names for which a probably good Home Page was found. In an evaluation test, HomePageSearch searched for persons whose Home Pages had been entered manually into DBLP. For 84 % of them, HomePageSearch found a correct personal Home Page. In DBLP, only 79 % of the manually entered URLs were correct; the other URLs were broken or out of date. The best ordinary search engine shows the correct personal Home Page in 60 % of the cases at the top position.

Figure 5: Architecture of HomePageSearch (the HPSearch core connected to the WWW, candidate Home Pages, a database system, a user interface, a hypertext view, and the search engines Alta Vista, Fast, Google, Lycos, Excite, Hotbot, Infoseek, and Northern Light)

Some users want to submit correct personal Home Pages or to remove wrong ones. We have built a form for entering such pages. The manually entered Home Pages from DBLP are included in the HomePageSearch database after URL checking.

5. Mops

Mops is a relatively simple search engine. Its goal is to provide an index of scientific papers found on or close to a given list of web addresses. From these documents, an index is created which can be searched through a web interface. It answers queries by giving links to the documents containing the search terms, their initial lines, and the choice to browse their contents in ASCII format.

5.1 Finding and sorting documents

The starting point of Mops is a list of web addresses. Given a web address, it starts a search from that address, following links down to a given depth. It turned out to be sufficient to set the depth to 1 or 2. Therefore, only a small area around each given web address is searched, which makes the search quite fast. Mops collects all postscript, dvi, and pdf documents (also in compressed form) it finds, because scientific papers in computer science are mostly available in these formats (a sketch of this collection loop follows at the end of this section). For each document, the date when it was found, the sequence of links which led to it, and the name of the link (i.e. the text <a href=URL>name of the link</a> used by the web page's author to describe the link) are stored. If a document was already found earlier and its last collection date is not too long ago, it is not collected again; its sequence of links and its name are updated nevertheless. Depending on the web address of the document, it is decided whether it is included in the collection of scientific documents or in the collection of documents which have to do with lectures, classes, or other non-scientific matters. Moreover, there are documents which appear under different addresses, e.g. because the web server is accessible under different names, or because a scientist belongs to different research institutes. Mops tries to find these duplicates and to prevent them from appearing more than once in the index.

Figure 6: 1st level of the 2-level author tree

5.2 Creating the index

Each document is stored in ASCII format, which is generated using pstotext [4]. This allows glimpseindex [12] to be used for creating an index for each collection. Searching this index is possible through a web interface which uses glimpse [12]. Besides searching for documents in which certain terms appear, it is also possible to search directly for links, or to search through the names of the documents. For each document found, its original URL, its description, its initial lines, and the first matching lines are shown. It is possible to browse the complete text of each document and to view all lines matching the query. Each search result can be refined by further queries.

5.3 Implementation notes

Mops consists of a bunch of Perl scripts. It runs on an ordinary PC under a Linux operating system, and it uses a lot of freely available software such as glimpse, perl, pstotext, cgiwrap, and apache. The search engine usually runs for 30 hours during the weekend; afterwards, the index is updated. We produced a CD-ROM which contains all the data used for the index search of Mops and the scripts for the Web interface.
To install the Web interface on a standard Linux PC, it suffices to copy one file from the CD-ROM to the disk. One is then able to search the complete index from the CD-ROM without an Internet connection.
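Mops itself is a set of Perl scripts; purely as an illustration of the depth-limited collection described in Section 5.1, here is a Java sketch in which the fetching and link-extraction helpers are hypothetical and the suffix list follows the formats named above.

import java.util.*;

// Illustrative sketch of the depth-limited crawl of Section 5.1 (the real
// Mops is Perl). fetch() and extractLinks() are hypothetical helpers.
class MopsCrawl {
    static final List<String> DOC_SUFFIXES = List.of(
            ".ps", ".ps.gz", ".ps.Z", ".dvi", ".dvi.gz", ".pdf", ".pdf.gz");

    // Collect candidate document URLs reachable within `depth` links.
    static Set<String> collect(String startUrl, int depth) {
        Set<String> documents = new LinkedHashSet<>();
        Set<String> visited = new HashSet<>();
        crawl(startUrl, depth, documents, visited);
        return documents;
    }

    static void crawl(String url, int depth, Set<String> docs, Set<String> visited) {
        if (depth < 0 || !visited.add(url)) return;    // depth exhausted or seen
        for (String link : extractLinks(fetch(url))) {
            if (DOC_SUFFIXES.stream().anyMatch(link::endsWith))
                docs.add(link);                        // a paper candidate
            else
                crawl(link, depth - 1, docs, visited); // follow ordinary links
        }
    }

    // Hypothetical stand-ins for HTTP download and <a href=...> extraction.
    static String fetch(String url) { return ""; }
    static List<String> extractLinks(String html) { return List.of(); }
}

The collected files would then be converted with pstotext and indexed with glimpseindex, as described in Section 5.2.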

Figure 7: 2nd level of the author tree

6. HomePageSearch JOINS Mops

The quality of the Mops index depends on the list of starting points for the search. Mops searches only a small area around each starting point. If the starting points are too far away from the documents, the documents either will not be found, or the search takes too much time. Usually, scientists provide links to their scientific papers either on their Home Page or on a neighbouring web page. Therefore, Home Pages are good starting points for the search.

The first Mops index was created for scientific papers in complexity theory. A hand-written list of about 60 web addresses and a list automatically created using the interface to DBLP and HomePageSearch are used as starting points for the search. This list consists of Home Pages of complexity theorists and of technical report servers of several universities. Currently, there are about 8,400 documents in the collection. On average, 40 new documents are found each week.

Another index (omops/) was created for scientific papers in the BDD area. The first step was to get the names of the scientists in this field. This was managed by querying the DBLP server for publications which have BDD or Binary Decision Diagram in their titles. The bibliographic data obtained in this way yields the names of the scientists of the considered field. With these names, HomePageSearch produced a list of personal Home Pages, which is used as starting points for the search of Mops. Within two runs of the search engine, about 1,000 scientific papers and about 250 other papers were collected. Meanwhile, a few hand-written addresses were added. The index now contains 3,300 papers. The BDD portal, for example, uses this service.

Figure 8: Hypertext view: personalised page for Jeffrey D. Ullman

7. CONCLUSION AND OUTLOOK

The web has essentially changed the way scientists provide themselves with literature. But we have to be attentive that this literature keeps a certain level of quality. Normal search engines and web directories do not provide a helpful solution for this task. Topic-specific crawling as presented, for example, in [5] is a better approach, but most approaches in this area lack topic-oriented knowledge. With the combination of three web information systems developed at the University of Trier, we have demonstrated that focused crawling with topic-oriented knowledge is an excellent way to produce a high-quality collection of papers for a small topic. Due to its modular structure, our approach can be extended in many ways: other strategies to produce the person names, the Home Pages, or the paper index can replace the existing ones. With little effort, the approach can be applied to other topics: only the topic-specific parts must be replaced. Further efforts in this field are indispensable.

Acknowledgement

We thank Michael Ley for providing us with data from DBLP [9] and for very helpful discussions.

8. REFERENCES

[1] Alf-Christian Achilles. The Collection of Computer Science Bibliographies.
[2] ACM. The ACM Research Repository.

Figure 9: The home page of Mops

[3] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley-Longman, May 1999.
[4] Andrew Birrell and Paul McJones. pstotext. pstotext.html.
[5] Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the Eighth World-Wide Web Conference, 1999.
[6] IBM. DB2 Product Family.
[7] David M. Jones. The Hypertext Bibliography Project. dmjones/hbp/.
[8] Steve Lawrence, C. Lee Giles, and Kurt Bollacker. Digital libraries and autonomous citation indexing. IEEE Computer, 32(6):67-71, 1999.
[9] Michael Ley. DBLP Computer Science Bibliography.
[10] Y. H. Li and Anil K. Jain. Classification of text documents. The Computer Journal, 41(8), 1998.
[11] New Zealand Digital Library. Computer Science Technical Reports.

Figure 10: The result of a Mops search for instance AND complexity

[12] Udi Manber and Sun Wu. GLIMPSE: a tool to search through entire file systems. In Usenix Winter 1994 Technical Conference, pages 23-32, 1994.
[13] Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2), 2000.
[14] Thomas M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[15] NCSTRL. Networked Computer Science Technical Reference Library.
[16] Jason Rennie and Andrew McCallum. Using reinforcement learning to spider the web efficiently. In Proceedings of the Sixteenth International Conference on Machine Learning, 1999.
[17] Jonathan Shakes, Marc Langheinrich, and Oren Etzioni. Dynamic reference sifting: A case study in the homepage domain. In Proceedings of the Sixth International World Wide Web Conference, 1997.
