
Harvest: A Scalable, Customizable Discovery and Access System

C. Mic Bowman, Transarc Corp.
Udi Manber, University of Arizona
Peter B. Danzig, University of Southern California
Michael F. Schwartz, University of Colorado - Boulder
Darren R. Hardy, University of Colorado - Boulder
Duane P. Wessels, University of Colorado - Boulder

Technical Report CU-CS, Department of Computer Science, University of Colorado - Boulder

March 12, 1995 (Original Date: August 1994; Revised March 1995)

Abstract

Rapid growth in data volume, user base, and data diversity render Internet-accessible information increasingly difficult to use effectively. In this paper we introduce Harvest, a system that provides an integrated set of customizable tools for gathering information from diverse repositories, building topic-specific content indexes, flexibly searching the indexes, widely replicating them, and caching objects as they are retrieved across the Internet. The system interoperates with WWW clients and with HTTP, FTP, Gopher, and NetNews information resources. We discuss the design and implementation of Harvest and its subsystems, give examples of its uses, and provide measurements indicating that Harvest can significantly reduce server load, network traffic, and space requirements when building indexes, compared with previous systems. We also discuss several popular indexes we have built using Harvest, underscoring the customizability and scalability of the system.

1 Introduction

Over the past few years a progression of Internet publishing tools has appeared. Until 1992, FTP [45] and NetNews [42] were the principal publishing tools. Around 1992, Gopher [39] and WAIS [31] gained popularity because they simplified network interactions and provided better ways to navigate through information. With the introduction of Mosaic [1] in 1993, publishing information on the World Wide Web [2] gained widespread use, because of Mosaic's attractive graphical interface and ease of use for accessing multimedia data reachable via WWW links.

While Internet publishing has become easy and popular, making effective use of Internet-accessible information has become more difficult. As the volume of Internet-accessible information continues to grow, it becomes increasingly difficult to locate relevant information. Moreover, current information systems experience serious server and network bottlenecks as a rapidly growing user populace attempts to access networked information. Finally, current systems primarily support text and graphics intended for end user viewing; they provide little support for more complex data, as might be found in a digital library. For a more detailed discussion of these problems, the reader is referred to [9].

In this paper we discuss a system that addresses these problems using a variety of techniques. We call the system Harvest, to connote its focus on reaping the growing collection of Internet information. Harvest supports resource discovery through topic-specific content indexing made possible by a very efficient distributed information gathering architecture. It avoids bottlenecks through topology-adaptive index replication and object caching. Finally, Harvest supports structured data through a combination of structure-preserving indexes, flexible search engines, and data type-specific manipulation and integration mechanisms. Because it is highly customizable, Harvest can be used in many different situations.
The remainder of the paper is organized as follows. In Section 2 we discuss related work. In Section 3 we present an overview of the Harvest system. In Sections 4-9 we discuss each of the subsystems, and provide measurements of these subsystems. In Section 10 we discuss several demonstrations of Harvest, and provide WWW pointers where readers can try these demonstrations. In Section 11 we discuss work in progress, and in Section 12 we summarize Harvest's contributions to the state of resource discovery.

2 Related Work

While it is impossible to discuss all related work, we touch on some of the better-known efforts here.

Resource Discovery

Because of the difficulty of keeping a large information space organized, the labor intensity of traversing large information systems, and the subjective nature of organization, many resource discovery systems create indexes of network-accessible information. Many of the early indexing tools fall into one of two categories: file name or menu name indexes of widely distributed information (such as Archie [20], Veronica [21], or WWWW [38]) and full content indexes of individual databases (such as Gifford's Semantic File System [23], WAIS [31], and local Gopher [39] indexes). Name-only indexes are very space efficient, but support limited queries. For example, it is only possible to query Archie for "graphics packages" whose file names

happen to reflect their contents. Moreover, global flat indexes become less useful as the information space grows (causing queries to match too much information). One of the advantages of Harvest is that it provides a means of building partial content indexes in a fashion that supports much more powerful searches than file name indexes, with much smaller space requirements than full content indexes. Recently, a number of efforts have been initiated to create indexes of widely distributed sites (e.g., Gifford's Content Router [48] and Pinkerton's WebCrawler [43]). One of the goals of Harvest is to provide a flexible and efficient system upon which such systems can be built. We discuss this point in Section 4.

The WHOIS and Network Information Look Up Service Working Group in the Internet Engineering Task Force has defined an Internet standard called "WHOIS++", which gathers concise descriptions (called "centroids") of each indexed database [49]. In contrast to our approach, WHOIS++ does not provide an automatic data gathering architecture, nor many of the scaling features (such as caching and replication) that Harvest provides. However, Harvest can be used to build WHOIS++. Aliweb [33] collects site description templates formatted according to the WHOIS++ specification.

An increasingly popular approach to resource discovery is the use of Web robots [32]. These are programs that attempt to locate a large number of WWW documents by recursively enumerating hypertext links, starting with some known set of documents. As will be demonstrated in Section 4, Harvest provides a much more efficient architecture for distributed indexing than robots can, because of its optimized gathering architecture, support for incremental updates, and coordinated gathering. Moreover, we believe robot-based indexes will become decreasingly useful over time because, unlike Harvest, they do not focus their content on a specific topic or community.
Many of the ideas and some of the code for Harvest were derived from our previous work on the agrep string search tool [50], the Essence customized information extraction system [28], the Indie distributed indexing system [17], and the Univers attribute-based name service [11].

Caching and Replication

A great deal of caching and replication research has been carried out in the operating systems community over the past 20 years (e.g., the Andrew File System [29] and ISIS [4]). More recently, a number of efforts have begun to build object caches into HTTP servers (e.g., Lagoon [12]), and to support replication [24] in Internet information systems such as Archie. In contrast to the flat organization of existing Internet caches, Harvest supports a hierarchical arrangement of object caches, modeled after the Domain Naming System's caching architecture. We believe this is the most appropriate arrangement because of a simulation study we performed using NSFNET backbone trace data [16]. We also note that one of the biggest scaling problems facing the Andrew File System is its callback-based invalidation protocol, which would be reduced if caches were arranged hierarchically.

Current replication systems (such as USENET news [42] and the Archie replication system) require replicas to be placed and configured manually. Harvest's approach is to derive the replica configuration automatically, adapting to measured physical network changes. We believe this approach is more appropriate in the large, dynamic Internet environment.

Structured Data Support

At present there is essentially no support for structured data in Internet information systems. When a WWW client such as Mosaic encounters an object with internal structure (such as a relational database or some time series data), it simply asks the user where to save the file on the local disk, for later perusal outside of Mosaic. The database community has done a great deal of work with more structured distributed information systems, but to date has fielded no widespread systems on the Internet. Similarly, a number of object-oriented systems have been developed (e.g., CORBA [25] and OLE [30]), but have yet to be deployed on the Internet at large. At present, Harvest supports structured data through the use of attribute-value structured indexes. We are currently extending the system to support a more object-oriented paradigm to index and access complex data.

3 System Overview

To motivate the Harvest design, Figure 1 illustrates two important inefficiencies experienced by most of the indexing systems described in Section 2. In this figure and the remainder of the paper, a Provider indicates a server running one or more of the standard Internet information services (FTP, Gopher, HTTP, and NetNews). Bold boxes indicate excessive load being placed on servers, primarily because the indexing systems gather information from Providers that fork a separate process to handle each retrieval request. Similarly, bold edges indicate excessive traffic being placed on network links, primarily because the indexing systems retrieve entire objects and then discard most of their contents (for example, retaining only HTML anchors and links in the index). These inefficiencies are compounded by the fact that current indexing systems gather information independently, without coordinating the effort among each other.

[Figure 1: Inefficiencies Caused by Uncoordinated Gathering of Information from Object Retrieval Systems - several independent Index systems each retrieve objects directly from the same Providers]

Figure 2 illustrates the Harvest approach to information gathering. A Gatherer collects and extracts indexing information from one or more Providers. For example, a Gatherer can collect PostScript files and extract text from them, it can collect UNIX tar files and extract their contents, or it can collect NetNews articles and extract abstracts and author names.

[Figure 2: Harvest Information Gathering Approach - Gatherers run either across the network or on the Provider hosts, feeding filtered summaries to Brokers (indexes), which can in turn feed other Brokers]

A Broker provides the indexing and the query interface to the gathered information. Brokers retrieve information from one or more Gatherers or other Brokers, and incrementally update their indexes. Gatherers and Brokers can be arranged in various ways to achieve much more flexible and efficient use of network bandwidth and server capacity than the approach illustrated in Figure 1.

The left half of Figure 2 shows a Gatherer accessing a Provider from across the network (using the native FTP, Gopher, HTTP, or NetNews protocols). This arrangement mimics the configuration shown in Figure 1, and is primarily useful for interoperating with sites that do not run the Harvest software. This arrangement experiences the same inefficiencies as discussed above, although the gathered information can be shared by many Brokers in this arrangement, thereby reducing network and server load. In the right half of Figure 2, the Gatherer runs on the Provider site hosts. As suggested by the lack of bold boxes and lines, this configuration can save a great deal of server load and network traffic. We discuss the techniques used by the Gatherer to achieve these efficiency gains in Section 4. A Broker can collect information from many Gatherers (to build an index of widely distributed data) and a Gatherer can feed information to many Brokers (thereby saving repeated gathering costs).
By retrieving information from other Brokers (illustrated in the rightmost part of Figure 2), Brokers can also cascade indexed views from one another, using the Broker's query interface to filter or refine the information from one Broker to the next. In effect, this Gatherer/Broker design provides an efficient network pipe facility, allowing information to be gathered, transmitted, and shared by many different types of indexes, with a great deal of flexibility in how information is filtered and interconnected between providers and

Brokers. Efficiency and flexibility are important because we want to enable the construction of many different topic-specific Brokers. For example, one Broker might index scientific papers for microbiologists, while another Broker indexes PC software archives. By focusing index contents per topic and per community, a Broker can avoid many of the vocabulary [22] and scaling problems of unfocused global indexes (such as Archie and WWWW).

Harvest includes a distinguished Broker called the Harvest Server Registry (HSR), which allows users to register information about each Harvest Gatherer, Broker, Cache, and Replicator in the Internet. The HSR is useful when searching for an appropriate Broker, and when constructing new Gatherers and Brokers, to avoid duplication of effort.

Gatherers and Brokers communicate using an attribute-value stream protocol called the Summary Object Interchange Format (SOIF) [27], intended to be easily parsed yet sufficiently expressive to handle many kinds of objects. It delineates streams of object summaries, and allows for multiple levels of nested detail. SOIF is based on a combination of the Internet Anonymous FTP Archives (IAFA) IETF Working Group templates [19] and BibTeX [34]. Each template contains a type, a Uniform Resource Locator (URL) [3], and a list of byte-count delimited attribute-value pairs. We define a set of mandatory and recommended attributes for Harvest system components. For example, attributes for a Broker describe the server's administrator, location, software version, and the type of objects it contains. An example SOIF record is illustrated in Figure 3.^1

Figure 4 illustrates the Harvest architecture in more detail, showing the Gatherer/Broker interconnection arrangements and SOIF protocol, the internal components of the Broker (including the Index/Search Engine and Query Manager, which will be discussed in Section 5), and the caching, replication, and client interface subsystems.
The Harvest Replicator can be used to replicate servers, to enhance user-base scalability. For example, the HSR will likely become heavily replicated, since it acts as a point of first contact for searches and new server deployment efforts. The Replication subsystem can also be used to divide the gathering process among many servers (e.g., letting one server index each U.S. regional network), distributing the partial updates among the replicas.

The Harvest Object Cache reduces network load, server load, and response latency when accessing located information objects. In the future it will also be used for caching access methods, when we incorporate an object-oriented access mechanism into Harvest. At present Harvest simply retrieves and displays objects using the standard HTML mechanisms, but we are extending the system to allow providers to associate access methods with objects (see Section 11). Given this support, it will be possible to perform much more flexible display and access operations. We discuss each of the Harvest subsystems below.

4 The Gatherer Subsystem

As indexing the Web becomes increasingly popular, two types of problems arise: data collection inefficiencies and duplication of implementation effort. Harvest addresses these problems by providing an efficient and flexible system upon which to build many different indexes. Rather than building one indexer from scratch to index WWW pages, another to index Computer Science technical reports, etc., Harvest can be configured in various ways to produce all of these indexes, at great savings of Provider-site server load and network traffic.

^1 While the attributes in this example are limited to ASCII values, SOIF allows arbitrary binary data.

@FILE {
update-time{9}:
description{27}:	About this document...
time-to-live{8}:
refresh-rate{7}:
gatherer-name{57}:	Networked Information Discovery and Retrieval
gatherer-host{21}:	bruno.cs.colorado.edu
gatherer-version{3}:	1.0
type{4}:	HTML
file-size{4}:	2551
md5{32}:	bea4c43ce6b976b3403c24f6a353f610
author{42}:	Darren Hardy Wed Feb 15 13:27:56 MST 1995
keywords{68}:	about document drakos harvest html index latex manual nikos this user
url-references{274}:	user-manual.html node98.html node3.html...
partial-text{601}:	About this document
}

Figure 3: Example SOIF Template

Data Collection Efficiency

Koster offers a number of guidelines for constructing Web "robots" [32]. For example, he suggests using breadth-first rather than depth-first traversal, to minimize rapid-fire requests to a single server. These guidelines make a good deal of sense for uncoordinated information gathering. In contrast, Harvest focuses on coordinating and optimizing the gathering process.

The biggest inefficiency in current Web indexing tools arises from collecting indexing information from systems designed to support object retrieval. For example, to build an index of HTML documents, each document must be retrieved from a server and scanned for hypertext links. This causes a great deal of load, because each object retrieval requires creating a TCP connection, forking a UNIX process, changing directories several levels deep, transmitting the object, and terminating the connection. An index of FTP file names can be built more efficiently using recursive directory listing operations, but this still requires an expensive file system traversal. More recently there have also been efforts to reduce the costs of HTTP retrievals by creating servers that do not fork [41]. The Gatherer dramatically reduces these inefficiencies through the use of Provider site-resident software optimized for indexing. This software scans the objects periodically and maintains a
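The byte-count-delimited attribute encoding shown in Figure 3 can be sketched as a short serializer. This is our own illustrative sketch, not the Harvest implementation, and the helper name `soif_encode` is invented; the real SOIF framing [27] has details (nesting, binary values) this omits:

```python
def soif_encode(template_type, url, attrs):
    """Serialize one object summary in a SOIF-like template:
    @TYPE { URL, then one attribute{byte-count}:<tab>value per line.
    The byte counts let a parser skip over values (even binary ones)
    without scanning for a delimiter."""
    lines = ["@%s { %s" % (template_type, url)]
    for name, value in attrs.items():
        data = value.encode("utf-8")
        lines.append("%s{%d}:\t%s" % (name, len(data), value))
    lines.append("}")
    return "\n".join(lines)

# Hypothetical URL, for illustration only.
record = soif_encode("FILE", "http://harvest.example/user-manual.html",
                     {"type": "HTML", "file-size": "2551"})
```

Note the byte counts are computed from the encoded value, which is what makes the format easy to parse in a single pass over the stream.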

[Figure 4: Harvest System Architecture - a Client issues a search (1) to the Broker's Query Manager, then retrieves objects and access methods (2) from the Provider via the Object Cache; the Broker comprises a Collector, Query Manager, Thesaurus, Replication Manager, and Storage Manager and Index/Search Engine, and receives SOIF streams from Gatherers and from other local or remote Brokers]

cache of indexing information, so that separate traversals are not required for each request.^2 More importantly, it allows a Provider's indexing information to be retrieved in a single stream, rather than requiring separate requests for each object. This results in a huge reduction in server load. The combination of traversal caching and response streaming can reduce server load significantly.

The Gatherer uses four techniques to reduce network traffic substantially. First, it uses the Essence system [28] to extract content summaries before passing the data to a remote Broker. Essence uses type-specific procedures to extract the most relevant parts of documents as content summaries; for example, extracting author and title information from LaTeX documents, and routine names from executable files. Hence, where Web robots might retrieve a set of HTML documents and extract anchors and URLs from each, Essence extracts content summaries at the Provider site, significantly reducing the amount of data to be transmitted. Second, the Gatherer transmits indexing data in compressed form. Third, the Gatherer supports incremental updates by allowing Brokers to request only summaries that have changed since a specified time. For access methods that do not support time stamps, the Gatherer extracts and compares "MD5" [46] cryptographic checksums of each object to determine if the object has changed since the last time the Gatherer generated a summary for it. Finally, the Gatherer maintains a cache of recently retrieved objects, to reduce the load of starting the system back up after a crash.^3 We present measurements of these savings later in this section.
^2 Some FTP sites provide a similar optimization by placing a file containing the results of a recent recursive directory listing in a high-level directory. The Gatherer uses these files when they are available.
^3 The Gatherer uses its own cache rather than the normal Harvest Cache because gathering isn't likely to exhibit much locality, and would hence affect the Cache's performance.
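The MD5-based change test described above (used when an access method lacks time stamps) amounts to recording a checksum at summary time and comparing it on the next pass. A minimal sketch, with invented function names:

```python
import hashlib

def summary_checksum(content: bytes) -> str:
    """MD5 checksum recorded when a summary is generated."""
    return hashlib.md5(content).hexdigest()

def needs_resummarizing(url: str, content: bytes, recorded: dict) -> bool:
    """Re-summarize only if the object's checksum differs from the
    one recorded at the last gathering pass (or no checksum exists)."""
    return recorded.get(url) != summary_checksum(content)

# Checksums recorded at the previous pass (hypothetical URL).
recorded = {"ftp://site.example/paper.ps": summary_checksum(b"old contents")}
```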

Flexibility of Information Gathering and Extraction

In addition to reducing the amount of indexing data that must be transmitted across the network, Essence's summarizing mechanism permits a great deal of flexibility in how information is extracted. Site administrators can customize how object types are recognized (typically by examining file naming conventions and contents), which objects get indexed (e.g., excluding executable code for which there are corresponding source files), and how information is extracted from each type of object (e.g., extracting title and author information from bibliographic databases, and copyright strings from executable programs). Essence can also extract files with arbitrarily deep presentation-layer nesting (e.g., compressed "tar" files). An example of Essence processing is illustrated in Figure 5.

[Figure 5: Example of Essence Customized Information Extraction - Essence.tar.Z passes through customized unnesting, type recognition, candidate selection, and summarizing methods; type recognition labels README [README], doc/essence.1 [manpage], doc/essence.ms [troff ms], doc/essence.ps [PostScript], and src/pstext/pstext/dvi.c [C source]; candidate selection keeps a subset; the resulting content summary is SOIF records containing keywords from README, the usage line from essence.1, and author, title, and abstract fields from essence.ms]

The Gatherer allows objects to be specified either individually (which we call "LeafNodes") or through type-specific recursive enumeration (which we call "RootNodes"). For example, FTP objects are enumerated using a recursive directory listing, while HTTP objects are enumerated by recursively extracting the URL links that point to other HTML objects.
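A RootNode-style HTTP enumeration, with the controls the Gatherer exposes (URL stop lists, a depth limit, and a URL count limit), can be sketched as a bounded breadth-first traversal. This is our own illustration: the link graph is passed in as a dictionary standing in for fetching pages and extracting their HTML anchors, and all names are invented:

```python
import re
from collections import deque

def enumerate_rootnode(root, links, url_stop=(), depth_limit=3, max_urls=250):
    """Breadth-first enumeration from a RootNode URL.
    links: url -> list of URLs it references.
    url_stop: regular expressions; matching URLs are never enumerated.
    depth_limit / max_urls: conservative bounds to keep a misconfigured
    Gatherer from traversing large parts of the Web."""
    stop = [re.compile(p) for p in url_stop]
    seen = {root}
    order = []
    queue = deque([(root, 0)])
    while queue and len(order) < max_urls:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= depth_limit:
            continue  # depth limit reached: record but do not expand
        for child in links.get(url, ()):
            if child in seen or any(p.search(child) for p in stop):
                continue
            seen.add(child)
            queue.append((child, depth + 1))
    return order

# Toy link graph (hypothetical URLs abbreviated to single letters).
links = {"a": ["b", "c", "skip-me"], "b": ["d"], "d": ["e"]}
```

Breadth-first order matches Koster's guideline cited earlier, spreading requests across servers instead of hammering one.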
The RootNode enumeration mechanism provides several means by which users can control the enumeration, including site and URL stop lists (both based on regular expressions), enumeration depth limits, and enumerated URL count limits. The default limits are conservative (for example, disallowing HTTP enumerations beyond the scope of a single server), to prevent a misconfigured Gatherer from traversing large parts of the Web.

The combination of Gatherer enumeration support and content extraction flexibility allows the site administrator to specify a Gatherer's configuration very easily. Given this specification, the system automatically enumerates, gathers, unnests, and summarizes a wide variety of information resources, generating content summaries that range in detail from full-text to highly abridged and specialized "signatures." The Gatherer also allows users to include manually generated information (such as hand-chosen keywords or taxonometric classification references) along with the

Essence-extracted information. This allows users to improve index quality for data that warrants the added manual labor. For example, Harvest can incorporate site markup records specified using IETF-specified IAFA templates. The Gatherer stores manually and automatically generated information separately, so that periodic automated summarizing will not overwrite the manually created information.

The other type of flexibility supported by the Gatherer is in the distribution of the Gatherer process. Ideally, a Gatherer will be placed at each Provider site, to support efficient data gathering (discussed below). Because this may not always be feasible, we also allow Gatherers to be run remotely, accessing objects using FTP/Gopher/HTTP/NetNews.

Measurements

In this subsection we present measurements of the Gatherer's efficiency with respect to provider site load and network traffic.

Provider Site Load Measurements

To quantify Provider-site server savings from running local Gatherers, we measured the time taken to gather information three different ways: via individual FTP retrievals, via individual local file system retrievals, and from the Gatherer's cache. To focus these measurements on server load, we collected only empty files and used the network loopback interface in the case where data would normally go across the network. We collected the measurements over 1,000 iterations on an otherwise unloaded DEC Alpha workstation. Our results indicate that gathering via the local file system causes 7.0 times less load than gathering via FTP.^4 More importantly, retrieving an entire stream of SOIF records from the Gatherer's cache incurs 2.7 times less server load than retrieving a single object via FTP (and 2.8 times less load for retrieving all objects at a site, because of an optimization we implemented that keeps the data for this commonly-requested case cached).
Thus, assuming an average archive site size of 2,300 files,^5 serving indexing information via a Gatherer will reduce server load per data collection attempt by a factor of 6,429. Collecting all objects at the site increases this reduction to 6,631, because of the aforementioned common-case cache.

As noted earlier, Harvest allows information to be collected remotely so that it can interoperate with sites that do not run the Harvest software. Because this poses so much more load on Provider-site servers, we implemented a connection caching mechanism, which attempts to keep connections open across individual object retrievals.^6 When the available UNIX file descriptors are exhausted (or timed out with inactivity), they are closed according to a replacement policy. This mechanism reduces server load by a factor of 7.0 for FTP transfers.

Network Traffic Measurements

As discussed previously, Harvest reduces network traffic four ways: extracting indexing data before transmitting across the network, supporting incremental updates, transmitting the data in compressed form from Gatherers to Brokers, and by using a local cache of recently retrieved objects. Here we discuss the first three ways, since the cache is primarily needed for insulating the network from re-retrievals in case of a system crash.

The savings from pre-extraction are derived from measurements we performed for an earlier paper about Essence [28]. We found that Essence content summaries require substantially less space than the data they summarize, depending on data type. For example, the reduction for PostScript files is quite high (since a good deal of PostScript content is positioning information that can be discarded when extracting indexing terms), while the reduction for short text files is fairly low (since Essence extracts full keyword content for short text files).

To measure the savings from incremental gathering, we sampled "last update" times for a set of approximately 4,600 HTTP objects distributed among approximately 2,000 sites, between September 1 and November 1. We then computed the cost of incremental gathering for a range of update periods, based on the average update period for each object and the object's size. Figure 6 plots the proportion of bytes saved by incremental vs. full gathering for the sampled objects. For example, this figure shows that full gathering sends approximately 4.8 times as many bytes as incremental gathering does, for daily updates. The savings drop off rapidly as the gathering period increases, because many objects are updated frequently.

[Figure 6: Measured Savings from Incremental Gathering for Sampled HTTP Documents - proportion of bytes saved by incremental vs. full gathering, plotted against the update interval in days]

While a factor of 4.8 is clearly worthwhile, we were initially surprised the savings were not higher. The explanation is that the savings depends heavily on how quickly the data change.

^4 Harvest provides a means by which site administrators can specify mappings from URL roots to local file system names, allowing URLs listed in the Gatherer configuration file to be accessed via the local file system.
^5 We computed this average from measurements of the total number of sites and files indexed by Archie [47].
^6 At present this is only possible with the FTP control connection and NetNews. Gopher and HTTP close connections after each retrieval.
The average lifetime of the sampled objects was approximately 44 days, with over 28% of the objects being updated at least every 10 days, and 1% being updated at least daily.^7 We believe that WWW data volatility is particularly high at present because the WWW is a new phenomenon. The WWW may reach a somewhat less volatile steady state over time. Older systems like FTP probably experience much slower change rates, and hence incremental gathering would be of more value for such systems.

^7 In some cases HTTP objects appear to be updated very frequently because they include constructs that recompute a value upon each access, such as the current user readership count.
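The shape of the measured curve can be reproduced with a back-of-the-envelope model: per gathering pass, incremental gathering resends an object only if it changed since the last pass, which we approximate by the ratio of the gathering period to the object's mean update period. This is our own simplified model for intuition, not the paper's exact computation:

```python
def full_vs_incremental_ratio(objects, gather_period_days):
    """objects: list of (size_bytes, mean_update_period_days).
    Returns how many times more bytes full gathering sends than
    incremental gathering, per gathering pass. An object's chance of
    having changed since the last pass is approximated as
    gather_period / update_period, capped at 1."""
    full = sum(size for size, _ in objects)
    incremental = sum(size * min(1.0, gather_period_days / period)
                      for size, period in objects)
    return full / incremental

# Toy mix: one fast-changing object, one slow-changing one.
mix = [(1000, 2), (1000, 50)]
```

As the gathering period grows, the `min(1.0, ...)` terms saturate and the ratio falls toward 1, matching the rapid drop-off visible in the measurements: incremental gathering pays off most for frequent update intervals.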

One other consideration is that frequent incremental gathering allows more current indexing information to be maintained. For example, most Web "robots" perform infrequent gathering (e.g., only every several months), which means the majority of their content is out-of-date, given the average object lifetime of 44 days.

The last way that Harvest reduces network traffic while gathering indexing data is through the use of compression. The savings from compression is again data type-dependent, providing an additional factor of 2-10 in savings.

Examples of Overall Traffic and Provider Site Savings from Harvest Gathering

The combination of pre-filtering, incremental updates, and compression can reduce network traffic significantly. As an example of these savings, our content index of Computer Science technical reports (see Section 10) starts with 2.44 GB of data distributed around the Internet, and extracts 111 MB worth of content summaries. It then transmits these data as a 41 MB compressed stream to requesting Brokers, for a 59.4-fold network transmission savings over collecting the documents individually and building a full-text index. Note that these computations do not include the additional savings from doing incremental updates.

While one might suspect that Essence's use of content summaries causes the resulting indexes to lose useful keywords, our measurements indicate that Essence achieves nearly the same precision^8 and 70% of the recall of WAIS, but requires only 3-11% as much index space and 70% as much summarizing and indexing time as WAIS [28]. For users who need the added recall and precision, however, Essence also supports a full-text extraction option.

As an overall illustration of the savings on Provider-site server load, it took 10 hours to gather and extract the information for our AT&T Broker (see Section 10), pulling the data across the Internet for each of 4,100 HTML files. (This Gatherer ran remotely from the Provider site.)
Once we had computed the compressed SOIF summary stream for these data, we could retrieve the entire stream across the Internet in approximately 3 minutes: a 200-fold reduction! This reduction is explained in part by the reduced amount of data that needed to be sent (2.9 MB for the compressed SOIF summary stream vs. 20 MB for the original data), partly by the fact that content extraction is performed only when the data are first gathered, and mostly by the fact that retrieving from the Gatherer can be done without any forking, whereas retrieving via HTTP required a fork for each of the 4,100 HTML objects. Once we had gathered the data from the AT&T machine, we could serve it to other Brokers much more efficiently than could be accomplished by gathering the information independently.

5 The Broker Subsystem

As illustrated in Figure 4, the Broker consists of four software modules: the Collector, the Registry, the Storage Manager and Index/Search Engine, and the Query Manager. The Registry and Storage Manager maintain the authoritative list of summary objects that exist in the Broker. For each object, the Registry records a time-to-live value and a unique object identifier (OID) consisting of the object's URL plus information about the Gatherer that generated

^8 Intuitively, precision is the probability that all of the retrieved documents will be relevant, while recall is the probability that all of the relevant documents will be retrieved [6].

the summary object. The Storage Manager archives a collection of summary objects on disk, storing each as a file in the underlying file system. The Broker also eliminates duplicate objects (e.g., those reached via multiple HTML pointers) based on a combination of MD5 signatures and Gatherer IDs.

The Collector periodically requests updates from each Gatherer or Broker specified in its configuration file. The Gatherer responds with a list of object summaries to be created, removed, or updated, based on a timestamp passed from the requesting Broker. The Collector parses the response and forwards it to the Registry for processing. When an object is created, an OID is added to the Registry, the summary object is archived by the Storage Manager, and a handle is given to the Index/Search Engine. When the collection is finished, the Registry is written to disk to complete the transaction. When an object is destroyed, the OID is removed and the summary object is deleted by the Storage Manager. When an object is refreshed, its time-to-live is recomputed. If the time-to-live expires for an object, the object is removed from the Broker. Since the state of the Storage Manager may become inconsistent with the Registry, periodic garbage collection removes any unreferenced objects from the Storage Manager.

The Query Manager exports objects to the network. It accepts a query, translates it into an intermediate representation, and then passes it to the search engine. The search engine responds with a list of OIDs and some search engine-specific data for each. For example, one of our search engines (Glimpse; see Section 6) returns the matching SOIF attribute for each summary object in the result list. The Query Manager constructs a short object description from the OID, the search engine-specific data, and the summary object. It then returns the list of object descriptions to the client.
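The duplicate-elimination step described above, based on MD5 signatures plus Gatherer IDs, can be sketched as follows. The `Registry` class and its fields are illustrative assumptions, not Harvest's actual data structures:

```python
import hashlib

def content_key(summary_body: bytes, gatherer_id: str) -> tuple:
    """Duplicate key: MD5 signature of the object's content plus the
    Gatherer ID. Two URLs reached via different HTML pointers that name
    the same content yield the same key."""
    return (hashlib.md5(summary_body).hexdigest(), gatherer_id)

class Registry:
    def __init__(self):
        self.seen = {}   # content key -> OID of the first registered copy

    def register(self, url: str, summary_body: bytes, gatherer_id: str):
        oid = (url, gatherer_id)   # OID = URL plus Gatherer information
        key = content_key(summary_body, gatherer_id)
        if key in self.seen:
            return self.seen[key]  # duplicate: collapse to the first OID
        self.seen[key] = oid
        return oid

reg = Registry()
a = reg.register("http://a/x.html", b"same page", "g1")
b = reg.register("http://b/alias.html", b"same page", "g1")
print(a == b)   # True: duplicate content collapses to one entry
```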
A Broker can gather objects directly from another Broker using a bulk transfer protocol within the Query Manager. When a query is sent to the source Broker, instead of returning the usual short description, the Query Manager returns the entire summary object. The query filters summary objects, so that one Broker can gather a subset of the objects in another Broker. In this way a Broker can create an index over a specialized set of summary objects by filtering them from other Brokers.

The Query Manager supports remote administration of the Broker's configuration and operations. Several Broker parameters can be manipulated, including the list of Gatherers to contact, the frequency of contact with a Gatherer, the default time-to-live for summary objects, and the frequency of garbage collection.

6 The Index and Search Subsystem

To accommodate diverse indexing and searching needs, Harvest defines a general Broker-Index/Search Engine interface that can accommodate a variety of "backend" search engines. The principal requirements are that the backend support Boolean combinations of attribute-based queries, and that it support incremental updates. One can therefore use a variety of different backends inside a Broker, such as Ingres or WAIS (although some versions of WAIS do not support all the needed features). We have developed two particular search and index subsystems for Harvest, each optimized for different uses. Glimpse [36] supports space-efficient indexes and flexible interactive queries, while

Nebula [10] supports fast searches and views based on complex standing queries that scan the data on a regular basis and extract relevant information.

Glimpse Index/Search Subsystem

The main novelty of Glimpse is that it requires a very small index (as low as 2-4% of the size of the data), yet allows very flexible content queries, including the use of Boolean expressions, regular expressions, and approximate matching (i.e., allowing misspellings). Another advantage of Glimpse is that the index can be built quickly and modified incrementally.

The index used by Glimpse is similar in principle to an inverted index, with one additional feature. The main part of the index consists of a list of all words that appear in all files. For each word there is a pointer to a list of occurrences of the word. But instead of pointing to the exact occurrence, Glimpse points to an adjustable-sized area that contains that occurrence. This area can be simply the file that contains the word, a group of files, or perhaps a paragraph [9]. By combining pointers for several occurrences of each word and by minimizing the size of the pointers (since it is not necessary to store the exact location), the index becomes rather small. This allows Glimpse to use agrep [50] to search the index, and then to search the areas found by the pointers in the index. In this way Glimpse can easily support approximate matching and other agrep features.

Since the index is quite small, it is possible to modify it quickly. Glimpse has an incremental indexing option that scans all file modification times, compares each to the last index creation time, and reindexes only the new and modified files. In a file system with 6,000 files totaling about 70 MB, incremental indexing takes about 2-4 minutes on a Sparcstation IPC. A complete indexing takes about 20 minutes.
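The two-phase, block-addressed scheme described above can be sketched in a few lines. This is a toy model under simplifying assumptions (fixed-size file groups, exact word match), not Glimpse's actual index format:

```python
from collections import defaultdict

def build_block_index(files, block_size=2):
    """Map each word to the set of *blocks* (groups of files) containing it,
    rather than to exact occurrences; coarse pointers keep the index small."""
    index = defaultdict(set)
    for i, (_, text) in enumerate(files):
        for word in set(text.lower().split()):
            index[word].add(i // block_size)  # adjustable-sized area: a file group
    return index

def search(files, index, word, block_size=2):
    """Two phases, as in Glimpse: look up candidate blocks in the small
    index, then scan only those blocks for the actual matches."""
    hits = []
    for block in sorted(index.get(word.lower(), ())):
        for name, text in files[block * block_size:(block + 1) * block_size]:
            if word.lower() in text.lower().split():
                hits.append(name)
    return hits

files = [("a.txt", "incremental indexing"), ("b.txt", "query language"),
         ("c.txt", "indexing speed"), ("d.txt", "network cache")]
idx = build_block_index(files)
print(search(files, idx, "indexing"))   # ['a.txt', 'c.txt']
```

Coarsening the pointers trades a little scan work at query time for a much smaller index, which is exactly what makes incremental rebuilds cheap.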
The combination of Glimpse features allows users to perform much more selective queries than simple keyword searches when the Gatherer was able to extract attributes from the gathered objects (such as "author's name," "institution," and "abstract"). For example, one can search for all articles about "incremental indexing" by "unknown author," allowing one spelling error in the author's name.

Harvest's use of Essence content summarizing and Glimpse indexing provides a great deal of space savings compared with WAIS. In Section 4 we cited our earlier Essence measurements showing that Essence requires only 3-11% as much index space as WAIS. Because we now use Glimpse to index the space-efficient summaries generated by Essence, the space savings are drastically improved. Combining these two changes, we compute that Harvest now requires up to a factor of 42.6 less index space than WAIS. This savings makes it possible for Harvest to index a great deal of information using a moderate amount of disk space; a concrete example is a Broker we made of Computer Science technical reports from 280 sites around the world (see Section 10). This Broker would have taken 9 GB with a standard WAIS index, but required only 270 MB using Harvest.

Nebula Index/Search Subsystem

In contrast to Glimpse, Nebula focuses on providing fast queries at the expense of index size. It also provides particular support for constructing views, which are defined by standing queries against the database of indexed objects. Because views exist over time, it is easy to refine and extend

[9] Paragraph-level indexing is not implemented yet.

them, and to observe the effect of query changes interactively. A view can scope a search, in the sense that it constrains the search to some subset of the database. For example, within the scope of a view that contains computer science technical reports, a user may search for "networks" without matching information about social networks.

Nebula's views facilitate the construction of hierarchical classification schemes for indexed objects. For example, a simple classification scheme for our PC software Broker (see Section 10) implements, at the top level, a view that selects all summary objects that represent PC software. A more specific class, like software for Microsoft Windows, is constructed from objects in the PC software view; i.e., the PC software view scopes the Windows view.

Nebula can also be extended to include domain-specific query resolution functions. For example, a server that contains bibliographic citations can be extended with a resolution function that prefers keywords found in "abstract" and "title" attributes to keywords found in the text of the document. The resolution function is constructed with domain-specific information, namely, that the title and abstract contain the most important and descriptive information about a document.

For performance reasons, each attribute tag has a corresponding index that maps values to objects. The indexing mechanism can differ from tag to tag, depending on the expected queries. Attributes like "description" and "summary" use a mechanism that indexes individual keywords in the value. The keyword mechanism supports queries based on pattern matching. Other tags, like "name" and "size", index values with a dynamic hash table for fast, precise retrieval. While Nebula prefers fast queries over small indexes, for structured summary objects some of the cost of indexing is reclaimed by storing a single copy of an attribute that appears in many summary objects.
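The view-scoping idea above (a standing query whose scope is another view) can be sketched as follows; the attribute names and the PC-software example data are illustrative, not Nebula's actual schema:

```python
# Sketch of Nebula-style views: a view is a standing query (predicate)
# over summary objects, and one view can scope another.
class View:
    def __init__(self, predicate, parent=None):
        self.predicate = predicate
        self.parent = parent   # scoping view, if any

    def matches(self, obj):
        # An object is in the view only if it is also in the scoping view.
        if self.parent and not self.parent.matches(obj):
            return False
        return self.predicate(obj)

    def select(self, objects):
        return [o for o in objects if self.matches(o)]

objects = [
    {"name": "winzip",  "category": "pc-software", "platform": "windows"},
    {"name": "pkunzip", "category": "pc-software", "platform": "dos"},
    {"name": "tr-95-1", "category": "techreport",  "platform": None},
]

pc_software = View(lambda o: o["category"] == "pc-software")
# The PC-software view scopes the more specific Windows view.
windows = View(lambda o: o["platform"] == "windows", parent=pc_software)

print([o["name"] for o in windows.select(objects)])   # ['winzip']
```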
Because common attributes are shared in this way, for a database of 20,000 bibliographic citations the index adds only about 9% overhead. A query on this database with two attribute expressions and one Boolean operator is resolved in less than 100 milliseconds on a Sparc IPC.

7 The Query Interface

There are two conflicting goals in designing a query interface for use in as diverse an environment as the Internet. On the one hand, we want to provide a great degree of flexibility and the ability to customize for different communities and for different users. On the other hand, we want a reasonably simple, uniform interface, so that users can move from one domain to another without being overwhelmed.

Uniform Interface

Harvest supports a uniform client interface through the Query Manager in the Broker. The Query Manager in each Broker implements the same query language, which supports Boolean combinations of keyword- and attribute-based expressions. Query results are presented in a uniform format that includes some indexer-specific information plus the object identifier for each object. The Query Manager uses an HTML form to collect the query from the user, and then passes the query to the Harvest gateway. The gateway retrieves the query from the form and sends it to the Query Manager. The set of objects returned by the Query Manager is converted into an HTML page with hypertext links to the complete summary objects in the Broker's database and to the original resource described by each summary object. An example query is shown in Figure 7.

Figure 7: Example of Query Interface

Currently, the Query Manager maintains no state across queries. Thus, the Broker's uniform interface does not support query refinement. However, simple query refinement is implemented by most WWW clients. For example, Mosaic caches recently accessed documents in virtual memory; previous queries can be retrieved and refined by moving through this cache. We also support a customizable means of displaying query results, which allows users to define result layout, displayed fields, string manipulations, warning messages, and many other aspects of the display output.

Interface Customization

Harvest supports query interface customization through direct access to each Index/Search Engine. For example, the Glimpse-specific interface supports additional switches for approximate matching and record-specific document retrieval, while the Nebula-specific interface supports extensive query refinement and improved integration of organization and search. Direct access lets the user take advantage of the specific strengths of each Index/Search Engine. While this means users must learn multiple interfaces, it also increases flexibility. A domain-specific Index/Search Engine can use the Broker to collect information from several sites. This information can be accessed by neophytes through the uniform interface, while expert users benefit from the additional functionality provided by the Index/Search Engine through the direct interface. In this way, Harvest facilitates the construction of domain-specific databases (such as image or genome databases) that require a unique query language or result presentation. We are currently working on several tools to allow much greater customization.

8 The Replication Subsystem

Harvest replicates its Brokers with the mirror-d replication tool. Mirror-d maintains a weakly consistent, replicated directory tree of files. Mirror-d implements eventual consistency [5]: if all new updates cease, the replicas eventually converge.
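The eventual-consistency guarantee just stated can be illustrated with a toy anti-entropy loop. This is a hypothetical model for intuition only, not mirror-d's actual protocol: each replica repeatedly adopts the newest directory-tree version among its logical neighbors, and once updates cease, all replicas converge.

```python
def synchronize(versions, neighbors):
    """One anti-entropy round: each replica adopts the newest version
    among itself and its logical neighbors."""
    return {
        r: max([versions[r]] + [versions[n] for n in neighbors[r]])
        for r in versions
    }

neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}  # logical topology
versions = {"A": 3, "B": 1, "C": 1}                    # A holds the newest tree

rounds = 0
while len(set(versions.values())) > 1:
    versions = synchronize(versions, neighbors)
    rounds += 1
print(versions, rounds)   # all replicas reach version 3 after 2 rounds
```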
Each replica belongs to one or more data propagation groups and, within a group, mirror-d's companion process, flood-d [15], estimates the available network bandwidth and observed round-trip delay between each pair of replicas. A group master periodically computes a graph that specifies which replicas should synchronize and propagate updates among each other. We call this the group's logical topology. Replicas synchronize updates with their neighbors in the logical topology graph using the "ftp-mirror" software [40]. Basically, mirror-d dynamically configures the ftp-mirror software between replicas in a replication group so that updates propagate along the highest-bandwidth, lowest-delay network paths. Since groups can be organized hierarchically, mirror-d should be able to scale to thousands of autonomously operated replicas. Groups become stitched together when one or more replicas belong to both groups. Mirror-d is written in Perl and flood-d is written in C; both speak HTTP and HTML.

Each replication group is named, and one replica in each group is distinguished as the group leader. Flood-d does not elect new leaders, because the access control list of sites that can join the group resides at the group leader. Access control can be disabled or limited to a specific list of sites. If the leader fails, replication proceeds, but the logical topology is not updated and new replicas cannot join the group.

Implementation Notes

Upon startup, a new replica issues a join request. When the group master receives the request, it generates a new topology with the new member included and distributes the topology to the new group members. Frequently during the first hour of operation, and less frequently afterwards, a replica selects another replica, measures the elapsed round-trip time for a null UDP-based remote procedure call, and measures the elapsed time to transfer a 64 KB data block via TCP. Periodically, replicas flood these estimates to one another.

Flood-d does not implement an explicit group leave operation. Rather, if a replica is silent for a long time, each group member independently reclaims the resources associated with it, and the group leader, when it recomputes the group topology, leaves out the silent replica. When new members join the group, or when a timer expires, the group leader computes and floods a new logical topology to its members.

Periodically, or when requested through its HTTP interface, a replica's mirror-d process attempts to synchronize with its logical neighbors. Mirror-d obtains the list of its logical neighbors from its flood-d process and then requests that each neighbor return the version number of its directory tree. If no neighbor has a more recent version, then mirror-d does nothing. If one or more neighboring replicas have a more recent version, mirror-d spawns an ftp-mirror process that synchronizes the two replicas. For ease of synchronization, mirror-d will spawn only one concurrent ftp-mirror process, and enqueues other relevant update notifications. If more than one neighbor has the highest version number, mirror-d chooses the neighbor with the highest bandwidth-to-delay ratio.

The group leader constructs a two- or three-connected logical topology that attempts to minimize the sum of the logical link costs. These costs are derived from the estimated bandwidth-to-delay ratio.
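The neighbor-selection rule described above (pull from the most recent version; break ties by bandwidth-to-delay ratio) can be sketched as follows. The host names and measurement values are illustrative assumptions:

```python
def choose_sync_neighbor(neighbors, my_version):
    """Among logical neighbors with a newer directory-tree version,
    pick one holding the highest version; break ties by preferring
    the highest bandwidth-to-delay ratio."""
    newer = [n for n in neighbors if n["version"] > my_version]
    if not newer:
        return None   # already up to date; mirror-d does nothing
    best_version = max(n["version"] for n in newer)
    candidates = [n for n in newer if n["version"] == best_version]
    return max(candidates, key=lambda n: n["bandwidth"] / n["delay"])

neighbors = [
    {"host": "usc-lab-3", "version": 7, "bandwidth": 8e6,  "delay": 0.02},
    {"host": "colorado",  "version": 7, "bandwidth": 1e6,  "delay": 0.05},
    {"host": "home-ppp",  "version": 6, "bandwidth": 28e3, "delay": 0.20},
]
print(choose_sync_neighbor(neighbors, my_version=6)["host"])   # usc-lab-3
```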
While we originally employed a search algorithm to compute this graph, we recently switched to a heuristic: take a minimum spanning tree and add low-cost links until the graph's connectivity requirement is met. Since neighbors attempt to propagate updates along the minimum spanning tree, data transfers usually do not take place over a replica's less desirable links.

Performance and Experience

We have tested mirror-d configurations of thirty replicas, organized into two or three overlapping groups. Two of the project members replicate the 60 MB Harvest www-home-pages Broker at our homes, over 28 KB/s PPP links. At the time of writing, four sites replicate Harvest's www-home-pages Broker, and during testing we frequently run another ten replicas. Had these sites been replicated with ftp-mirror rather than flood-d, all of them would have pulled their updates from the primary copy of the www-home-pages Broker. With mirror-d, the home machines mirror off of a set of machines in a computing laboratory at the University of Southern California, the ten laboratory machines mirror off of each other, and one USC lab machine mirrors off of a machine at the University of Colorado.

9 The Object Caching Subsystem

To meet ever-increasing demand on network links and information servers, Harvest includes a hierarchically organized Object Caching subsystem [13], as illustrated in Figure 8. At each level there can be several neighbor caches, allowing some load-sharing among cache servers.


Chapter The LRU* WWW proxy cache document replacement algorithm Chapter The LRU* WWW proxy cache document replacement algorithm Chung-yi Chang, The Waikato Polytechnic, Hamilton, New Zealand, itjlc@twp.ac.nz Tony McGregor, University of Waikato, Hamilton, New Zealand,

More information

FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 23

FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 23 FILE SYSTEMS CS124 Operating Systems Winter 2015-2016, Lecture 23 2 Persistent Storage All programs require some form of persistent storage that lasts beyond the lifetime of an individual process Most

More information

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Web crawlers Retrieving web pages Crawling the web» Desktop crawlers» Document feeds File conversion Storing the documents Removing noise Desktop Crawls! Used

More information

Configuring STP and RSTP

Configuring STP and RSTP 7 CHAPTER Configuring STP and RSTP This chapter describes the IEEE 802.1D Spanning Tree Protocol (STP) and the ML-Series implementation of the IEEE 802.1W Rapid Spanning Tree Protocol (RSTP). It also explains

More information

SMD149 - Operating Systems - File systems

SMD149 - Operating Systems - File systems SMD149 - Operating Systems - File systems Roland Parviainen November 21, 2005 1 / 59 Outline Overview Files, directories Data integrity Transaction based file systems 2 / 59 Files Overview Named collection

More information

Technology Insight Series

Technology Insight Series IBM ProtecTIER Deduplication for z/os John Webster March 04, 2010 Technology Insight Series Evaluator Group Copyright 2010 Evaluator Group, Inc. All rights reserved. Announcement Summary The many data

More information

[8] Peter W. Foltz and Susan T. Dumais. Personalized information delivery: An analysis of information

[8] Peter W. Foltz and Susan T. Dumais. Personalized information delivery: An analysis of information [8] Peter W. Foltz and Susan T. Dumais. Personalized information delivery: An analysis of information ltering methods. Communications of the ACM, 35(12), December 1992. [9] Christopher Fox. A stop list

More information

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING Larson Hogstrom, Mukarram Tahir, Andres Hasfura Massachusetts Institute of Technology, Cambridge, Massachusetts, USA 18.337/6.338

More information

Frank Miller, George Apostolopoulos, and Satish Tripathi. University of Maryland. College Park, MD ffwmiller, georgeap,

Frank Miller, George Apostolopoulos, and Satish Tripathi. University of Maryland. College Park, MD ffwmiller, georgeap, Simple Input/Output Streaming in the Operating System Frank Miller, George Apostolopoulos, and Satish Tripathi Mobile Computing and Multimedia Laboratory Department of Computer Science University of Maryland

More information

Benchmarking the CGNS I/O performance

Benchmarking the CGNS I/O performance 46th AIAA Aerospace Sciences Meeting and Exhibit 7-10 January 2008, Reno, Nevada AIAA 2008-479 Benchmarking the CGNS I/O performance Thomas Hauser I. Introduction Linux clusters can provide a viable and

More information

Technische Universitat Munchen. Institut fur Informatik. D Munchen.

Technische Universitat Munchen. Institut fur Informatik. D Munchen. Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl

More information

Data reduction for CORSIKA. Dominik Baack. Technical Report 06/2016. technische universität dortmund

Data reduction for CORSIKA. Dominik Baack. Technical Report 06/2016. technische universität dortmund Data reduction for CORSIKA Technical Report Dominik Baack 06/2016 technische universität dortmund Part of the work on this technical report has been supported by Deutsche Forschungsgemeinschaft (DFG) within

More information

Andrew T. Campbell, Javier Gomez. Center for Telecommunications Research, Columbia University, New York. [campbell,

Andrew T. Campbell, Javier Gomez. Center for Telecommunications Research, Columbia University, New York. [campbell, An Overview of Cellular IP Andrew T. Campbell, Javier Gomez Center for Telecommunications Research, Columbia University, New York [campbell, javierg]@comet.columbia.edu Andras G. Valko Ericsson Research

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Routing Basics. What is Routing? Routing Components. Path Determination CHAPTER

Routing Basics. What is Routing? Routing Components. Path Determination CHAPTER CHAPTER 5 Routing Basics This chapter introduces the underlying concepts widely used in routing protocols Topics summarized here include routing protocol components and algorithms In addition, the role

More information

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate Searching Information Servers Based on Customized Proles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California

More information

Stackable Layers: An Object-Oriented Approach to. Distributed File System Architecture. Department of Computer Science

Stackable Layers: An Object-Oriented Approach to. Distributed File System Architecture. Department of Computer Science Stackable Layers: An Object-Oriented Approach to Distributed File System Architecture Thomas W. Page Jr., Gerald J. Popek y, Richard G. Guy Department of Computer Science University of California Los Angeles

More information

INTELLIGENT CACHING FOR WORLD-WIDE WEB OBJECTS DUANE WESSELS. B.S. Physics, Washington State University. A thesis submitted to the

INTELLIGENT CACHING FOR WORLD-WIDE WEB OBJECTS DUANE WESSELS. B.S. Physics, Washington State University. A thesis submitted to the INTELLIGENT CACHING FOR WORLD-WIDE WEB OBJECTS by DUANE WESSELS B.S. Physics, Washington State University A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial

More information

Layered Architecture

Layered Architecture 1 Layered Architecture Required reading: Kurose 1.7 CSE 4213, Fall 2006 Instructor: N. Vlajic Protocols and Standards 2 Entity any device capable of sending and receiving information over the Internet

More information

Configuring Rapid PVST+

Configuring Rapid PVST+ This chapter describes how to configure the Rapid per VLAN Spanning Tree (Rapid PVST+) protocol on Cisco NX-OS devices using Cisco Data Center Manager (DCNM) for LAN. For more information about the Cisco

More information

TCP over Wireless Networks Using Multiple. Saad Biaz Miten Mehta Steve West Nitin H. Vaidya. Texas A&M University. College Station, TX , USA

TCP over Wireless Networks Using Multiple. Saad Biaz Miten Mehta Steve West Nitin H. Vaidya. Texas A&M University. College Station, TX , USA TCP over Wireless Networks Using Multiple Acknowledgements (Preliminary Version) Saad Biaz Miten Mehta Steve West Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX

More information

Microsoft SQL Server Fix Pack 15. Reference IBM

Microsoft SQL Server Fix Pack 15. Reference IBM Microsoft SQL Server 6.3.1 Fix Pack 15 Reference IBM Microsoft SQL Server 6.3.1 Fix Pack 15 Reference IBM Note Before using this information and the product it supports, read the information in Notices

More information

The Measured Cost of. Conservative Garbage Collection. Benjamin Zorn. Department of Computer Science. Campus Box 430

The Measured Cost of. Conservative Garbage Collection. Benjamin Zorn. Department of Computer Science. Campus Box 430 The Measured Cost of Conservative Garbage Collection Benjamin Zorn Department of Computer Science Campus Box #430 University of Colorado, Boulder 80309{0430 CU-CS-573-92 April 1992 University of Colorado

More information

Visual Basic Live Techniques

Visual Basic Live Techniques Page 1 of 8 Visual Basic Live Techniques Mike Wills Microsoft Corporation Revised December 1998 Summary: Discusses common Web site development techniques discovered during the development of the Visual

More information

Query based Site Selection for Distributed Search Engines

Query based Site Selection for Distributed Search Engines Query based Site Selection for Distributed Search Engines Nobuyoshi SATO, Minoru DAAWA, Minoru EHARA, Yoshifumi SAKAI, Hideki MORI {jju,ti980039}@ds.cs.toyo.ac.jp, {uehara,sakai,mori}@cs.toyo.ac.jp Department

More information

Capacity Planning for Next Generation Utility Networks (PART 1) An analysis of utility applications, capacity drivers and demands

Capacity Planning for Next Generation Utility Networks (PART 1) An analysis of utility applications, capacity drivers and demands Capacity Planning for Next Generation Utility Networks (PART 1) An analysis of utility applications, capacity drivers and demands Utility networks are going through massive transformations towards next

More information

Client (meet lib) tac_firewall ch_firewall ag_vm. (3) Create ch_firewall. (5) Receive and examine (6) Locate agent (7) Create pipes (8) Activate agent

Client (meet lib) tac_firewall ch_firewall ag_vm. (3) Create ch_firewall. (5) Receive and examine (6) Locate agent (7) Create pipes (8) Activate agent Performance Issues in TACOMA Dag Johansen 1 Nils P. Sudmann 1 Robbert van Renesse 2 1 Department of Computer Science, University oftroms, NORWAY??? 2 Department of Computer Science, Cornell University,

More information

Wave-Indices: Indexing Evolving Databases. Stanford, CA fshiva, Abstract

Wave-Indices: Indexing Evolving Databases. Stanford, CA fshiva, Abstract Wave-Indices: Indexing Evolving Databases Narayanan Shivakumar, Hector Garcia-Molina Department of Computer Science Stanford, CA 94305. fshiva, hectorg@cs.stanford.edu Abstract In many applications, new

More information

2 Application Support via Proxies Onion Routing can be used with applications that are proxy-aware, as well as several non-proxy-aware applications, w

2 Application Support via Proxies Onion Routing can be used with applications that are proxy-aware, as well as several non-proxy-aware applications, w Onion Routing for Anonymous and Private Internet Connections David Goldschlag Michael Reed y Paul Syverson y January 28, 1999 1 Introduction Preserving privacy means not only hiding the content of messages,

More information

Early Measurements of a Cluster-based Architecture for P2P Systems

Early Measurements of a Cluster-based Architecture for P2P Systems Early Measurements of a Cluster-based Architecture for P2P Systems Balachander Krishnamurthy, Jia Wang, Yinglian Xie I. INTRODUCTION Peer-to-peer applications such as Napster [4], Freenet [1], and Gnutella

More information

Interme diate DNS. Local browse r. Authorit ative ... DNS

Interme diate DNS. Local browse r. Authorit ative ... DNS WPI-CS-TR-00-12 July 2000 The Contribution of DNS Lookup Costs to Web Object Retrieval by Craig E. Wills Hao Shang Computer Science Technical Report Series WORCESTER POLYTECHNIC INSTITUTE Computer Science

More information

Performance of relational database management

Performance of relational database management Building a 3-D DRAM Architecture for Optimum Cost/Performance By Gene Bowles and Duke Lambert As systems increase in performance and power, magnetic disk storage speeds have lagged behind. But using solidstate

More information

Configuring Rapid PVST+ Using NX-OS

Configuring Rapid PVST+ Using NX-OS Configuring Rapid PVST+ Using NX-OS This chapter describes how to configure the Rapid per VLAN Spanning Tree (Rapid PVST+) protocol on Cisco NX-OS devices. This chapter includes the following sections:

More information

Cumulative Percentage of File Requests

Cumulative Percentage of File Requests An Analysis of Geographical Push-Caching James Gwertzman, Microsoft Corporation Margo Seltzer, Harvard University Abstract Most caching schemes in wide-area, distributed systems are client-initiated. Decisions

More information

Heap Management. Heap Allocation

Heap Management. Heap Allocation Heap Management Heap Allocation A very flexible storage allocation mechanism is heap allocation. Any number of data objects can be allocated and freed in a memory pool, called a heap. Heap allocation is

More information

Networks. Wu-chang Fengy Dilip D. Kandlurz Debanjan Sahaz Kang G. Shiny. Ann Arbor, MI Yorktown Heights, NY 10598

Networks. Wu-chang Fengy Dilip D. Kandlurz Debanjan Sahaz Kang G. Shiny. Ann Arbor, MI Yorktown Heights, NY 10598 Techniques for Eliminating Packet Loss in Congested TCP/IP Networks Wu-chang Fengy Dilip D. Kandlurz Debanjan Sahaz Kang G. Shiny ydepartment of EECS znetwork Systems Department University of Michigan

More information

HTRC Data API Performance Study

HTRC Data API Performance Study HTRC Data API Performance Study Yiming Sun, Beth Plale, Jiaan Zeng Amazon Indiana University Bloomington {plale, jiaazeng}@cs.indiana.edu Abstract HathiTrust Research Center (HTRC) allows users to access

More information

Extensions to RTP to support Mobile Networking: Brown, Singh 2 within the cell. In our proposed architecture [3], we add a third level to this hierarc

Extensions to RTP to support Mobile Networking: Brown, Singh 2 within the cell. In our proposed architecture [3], we add a third level to this hierarc Extensions to RTP to support Mobile Networking Kevin Brown Suresh Singh Department of Computer Science Department of Computer Science University of South Carolina Department of South Carolina Columbia,

More information

Khoral Research, Inc. Khoros is a powerful, integrated system which allows users to perform a variety

Khoral Research, Inc. Khoros is a powerful, integrated system which allows users to perform a variety Data Parallel Programming with the Khoros Data Services Library Steve Kubica, Thomas Robey, Chris Moorman Khoral Research, Inc. 6200 Indian School Rd. NE Suite 200 Albuquerque, NM 87110 USA E-mail: info@khoral.com

More information

6. Results. This section describes the performance that was achieved using the RAMA file system.

6. Results. This section describes the performance that was achieved using the RAMA file system. 6. Results This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to/from server disks per second, excluding

More information

A Study of Query Execution Strategies. for Client-Server Database Systems. Department of Computer Science and UMIACS. University of Maryland

A Study of Query Execution Strategies. for Client-Server Database Systems. Department of Computer Science and UMIACS. University of Maryland A Study of Query Execution Strategies for Client-Server Database Systems Donald Kossmann Michael J. Franklin Department of Computer Science and UMIACS University of Maryland College Park, MD 20742 f kossmann

More information

DHCP for IPv6. Palo Alto, CA Digital Equipment Company. Nashua, NH mentions a few directions about the future of DHCPv6.

DHCP for IPv6. Palo Alto, CA Digital Equipment Company. Nashua, NH mentions a few directions about the future of DHCPv6. DHCP for IPv6 Charles E. Perkins and Jim Bound Sun Microsystems, Inc. Palo Alto, CA 94303 Digital Equipment Company Nashua, NH 03062 Abstract The Dynamic Host Conguration Protocol (DHCPv6) provides a framework

More information

Virtual Multi-homing: On the Feasibility of Combining Overlay Routing with BGP Routing

Virtual Multi-homing: On the Feasibility of Combining Overlay Routing with BGP Routing Virtual Multi-homing: On the Feasibility of Combining Overlay Routing with BGP Routing Zhi Li, Prasant Mohapatra, and Chen-Nee Chuah University of California, Davis, CA 95616, USA {lizhi, prasant}@cs.ucdavis.edu,

More information

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems On Object Orientation as a Paradigm for General Purpose Distributed Operating Systems Vinny Cahill, Sean Baker, Brendan Tangney, Chris Horn and Neville Harris Distributed Systems Group, Dept. of Computer

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Replication in Distributed Systems

Replication in Distributed Systems Replication in Distributed Systems Replication Basics Multiple copies of data kept in different nodes A set of replicas holding copies of a data Nodes can be physically very close or distributed all over

More information

Concepts Introduced in Chapter 6. Warehouse-Scale Computers. Programming Models for WSCs. Important Design Factors for WSCs

Concepts Introduced in Chapter 6. Warehouse-Scale Computers. Programming Models for WSCs. Important Design Factors for WSCs Concepts Introduced in Chapter 6 Warehouse-Scale Computers A cluster is a collection of desktop computers or servers connected together by a local area network to act as a single larger computer. introduction

More information

Configuring Spanning Tree Protocol

Configuring Spanning Tree Protocol Finding Feature Information, page 1 Restrictions for STP, page 1 Information About Spanning Tree Protocol, page 2 How to Configure Spanning-Tree Features, page 14 Monitoring Spanning-Tree Status, page

More information

Spanning Trees and IEEE 802.3ah EPONs

Spanning Trees and IEEE 802.3ah EPONs Rev. 1 Norman Finn, Cisco Systems 1.0 Introduction The purpose of this document is to explain the issues that arise when IEEE 802.1 bridges, running the Spanning Tree Protocol, are connected to an IEEE

More information

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL Jun Sun, Yasushi Shinjo and Kozo Itano Institute of Information Sciences and Electronics University of Tsukuba Tsukuba,

More information

A Client Oriented, IP Level Redirection Mechanism. Sumit Gupta. Dept. of Elec. Engg. College Station, TX Abstract

A Client Oriented, IP Level Redirection Mechanism. Sumit Gupta. Dept. of Elec. Engg. College Station, TX Abstract A Client Oriented, Level Redirection Mechanism Sumit Gupta A. L. Narasimha Reddy Dept. of Elec. Engg. Texas A & M University College Station, TX 77843-3128 Abstract This paper introduces a new approach

More information

Table of Contents. Cisco Introduction to EIGRP

Table of Contents. Cisco Introduction to EIGRP Table of Contents Introduction to EIGRP...1 Introduction...1 Before You Begin...1 Conventions...1 Prerequisites...1 Components Used...1 What is IGRP?...2 What is EIGRP?...2 How Does EIGRP Work?...2 EIGRP

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

How to Design and Maintain a Secure ICS Network

How to Design and Maintain a Secure ICS Network How to Design and Maintain a Secure ICS Network Support IEC 62443 Compliance with SilentDefense Authors Daniel Trivellato, PhD - Product Manager Industrial Line Dennis Murphy, MSc - Senior ICS Security

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

FACETs. Technical Report 05/19/2010

FACETs. Technical Report 05/19/2010 F3 FACETs Technical Report 05/19/2010 PROJECT OVERVIEW... 4 BASIC REQUIREMENTS... 4 CONSTRAINTS... 5 DEVELOPMENT PROCESS... 5 PLANNED/ACTUAL SCHEDULE... 6 SYSTEM DESIGN... 6 PRODUCT AND PROCESS METRICS...

More information

Parallel Pipeline STAP System

Parallel Pipeline STAP System I/O Implementation and Evaluation of Parallel Pipelined STAP on High Performance Computers Wei-keng Liao, Alok Choudhary, Donald Weiner, and Pramod Varshney EECS Department, Syracuse University, Syracuse,

More information