
Harvest: A Scalable, Customizable Discovery and Access System

C. Mic Bowman, Transarc Corp.
Udi Manber, University of Arizona
Peter B. Danzig, University of Southern California
Michael F. Schwartz, University of Colorado - Boulder
Darren R. Hardy, University of Colorado - Boulder
Duane P. Wessels, University of Colorado - Boulder

Technical Report CU-CS, Department of Computer Science, University of Colorado - Boulder

March 12, 1995 (Original Date: August 1994; Revised March 1995)

Abstract

Rapid growth in data volume, user base, and data diversity render Internet-accessible information increasingly difficult to use effectively. In this paper we introduce Harvest, a system that provides an integrated set of customizable tools for gathering information from diverse repositories, building topic-specific content indexes, flexibly searching the indexes, widely replicating them, and caching objects as they are retrieved across the Internet. The system interoperates with WWW clients and with HTTP, FTP, Gopher, and NetNews information resources. We discuss the design and implementation of Harvest and its subsystems, give examples of its uses, and provide measurements indicating that Harvest can significantly reduce server load, network traffic, and space requirements when building indexes, compared with previous systems. We also discuss several popular indexes we have built using Harvest, underscoring the customizability and scalability of the system.

1 Introduction

Over the past few years a progression of Internet publishing tools has appeared. Until 1992, FTP [45] and NetNews [42] were the principal publishing tools. Around 1992, Gopher [39] and WAIS [31] gained popularity because they simplified network interactions and provided better ways to navigate through information. With the introduction of Mosaic [1] in 1993, publishing information on the World Wide Web [2] gained widespread use, because of Mosaic's attractive graphical interface and ease of use for accessing multimedia data reachable via WWW links.

While Internet publishing has become easy and popular, making effective use of Internet-accessible information has become more difficult. As the volume of Internet-accessible information continues to grow, it becomes increasingly difficult to locate relevant information. Moreover, current information systems experience serious server and network bottlenecks as a rapidly growing user populace attempts to access networked information. Finally, current systems primarily support text and graphics intended for end user viewing; they provide little support for more complex data, as might be found in a digital library. For a more detailed discussion of these problems, the reader is referred to [9].

In this paper we discuss a system that addresses these problems using a variety of techniques. We call the system Harvest, to connote its focus on reaping the growing collection of Internet information. Harvest supports resource discovery through topic-specific content indexing made possible by a very efficient distributed information gathering architecture. It avoids bottlenecks through topology-adaptive index replication and object caching. Finally, Harvest supports structured data through a combination of structure-preserving indexes, flexible search engines, and data type-specific manipulation and integration mechanisms. Because it is highly customizable, Harvest can be used in many different situations.
The remainder of the paper is organized as follows. In Section 2 we discuss related work. In Section 3 we present an overview of the Harvest system. In Sections 4-9 we discuss each of the subsystems, and provide measurements of these subsystems. In Section 10 we discuss several demonstrations of Harvest, and provide WWW pointers where readers can try these demonstrations. In Section 11 we discuss work in progress, and in Section 12 we summarize Harvest's contributions to the state of resource discovery.

2 Related Work

While it is impossible to discuss all related work, we touch on some of the better-known efforts here.

Resource Discovery

Because of the difficulty of keeping a large information space organized, the labor intensity of traversing large information systems, and the subjective nature of organization, many resource discovery systems create indexes of network-accessible information. Many of the early indexing tools fall into one of two categories: file name or menu name indexes of widely distributed information (such as Archie [20], Veronica [21], or WWWW [38]) and full content indexes of individual databases (such as Gifford's Semantic File System [23], WAIS [31], and local Gopher [39] indexes). Name-only indexes are very space efficient, but support limited queries. For example, it is only possible to query Archie for "graphics packages" whose file names

happen to reflect their contents. Moreover, global flat indexes become less useful as the information space grows (causing queries to match too much information). One of the advantages of Harvest is that it provides a means of building partial content indexes in a fashion that supports much more powerful searches than file name indexes, with much smaller space requirements than full content indexes. Recently, a number of efforts have been initiated to create indexes of widely distributed sites (e.g., Gifford's Content Router [48] and Pinkerton's WebCrawler [43]). One of the goals of Harvest is to provide a flexible and efficient system upon which such systems can be built. We discuss this point in Section 4.

The WHOIS and Network Information Look Up Service Working Group in the Internet Engineering Task Force has defined an Internet standard called "WHOIS++", which gathers concise descriptions (called "centroids") of each indexed database [49]. In contrast to our approach, WHOIS++ does not provide an automatic data gathering architecture, nor many of the scaling features (such as caching and replication) that Harvest provides. However, Harvest can be used to build WHOIS++. Aliweb [33] collects site description templates formatted according to the WHOIS++ specification.

An increasingly popular approach to resource discovery is the use of Web robots [32]. These are programs that attempt to locate a large number of WWW documents by recursively enumerating hypertext links, starting with some known set of documents. As will be demonstrated in Section 4, Harvest provides a much more efficient architecture for distributed indexing than robots can, because of its optimized gathering architecture, support for incremental updates, and coordinated gathering. Moreover, we believe robot-based indexes will become decreasingly useful over time because, unlike Harvest, they do not focus their content on a specific topic or community.
Many of the ideas and some of the code for Harvest were derived from our previous work on the agrep string search tool [50], the Essence customized information extraction system [28], the Indie distributed indexing system [17], and the Univers attribute-based name service [11].

Caching and Replication

A great deal of caching and replication research has been carried out in the operating systems community over the past 20 years (e.g., the Andrew File System [29] and ISIS [4]). More recently, a number of efforts have begun to build object caches into HTTP servers (e.g., Lagoon [12]), and to support replication [24] in Internet information systems such as Archie. In contrast to the flat organization of existing Internet caches, Harvest supports a hierarchical arrangement of object caches, modeled after the Domain Naming System's caching architecture. We believe this is the most appropriate arrangement because of a simulation study we performed using NSFNET backbone trace data [16]. We also note that one of the biggest scaling problems facing the Andrew File System is its callback-based invalidation protocol, which would be reduced if caches were arranged hierarchically.

Current replication systems (such as USENET news [42] and the Archie replication system) require replicas to be placed and configured manually. Harvest's approach is to derive the replica configuration automatically, adapting to measured physical network changes. We believe this approach is more appropriate in the large, dynamic Internet environment.

Structured Data Support

At present there is essentially no support for structured data in Internet information systems. When a WWW client such as Mosaic encounters an object with internal structure (such as a relational database or some time series data), it simply asks the user where to save the file on the local disk, for later perusal outside of Mosaic. The database community has done a great deal of work with more structured distributed information systems, but to date has fielded no widespread systems on the Internet. Similarly, a number of object-oriented systems have been developed (e.g., CORBA [25] and OLE [30]), but have yet to be deployed on the Internet at large. At present, Harvest supports structured data through the use of attribute-value structured indexes. We are currently extending the system to support a more object-oriented paradigm to index and access complex data.

3 System Overview

To motivate the Harvest design, Figure 1 illustrates two important inefficiencies experienced by most of the indexing systems described in Section 2. In this figure and the remainder of the paper, a Provider indicates a server running one or more of the standard Internet information services (FTP, Gopher, HTTP, and NetNews). Bold boxes indicate excessive load being placed on servers, primarily because the indexing systems gather information from Providers that fork a separate process to handle each retrieval request. Similarly, bold edges indicate excessive traffic being placed on network links, primarily because the indexing systems retrieve entire objects and then discard most of their contents (for example, retaining only HTML anchors and links in the index). These inefficiencies are compounded by the fact that current indexing systems gather information independently, without coordinating the effort among each other.

[Figure 1: Inefficiencies Caused by Uncoordinated Gathering of Information from Object Retrieval Systems - several independent Index systems each retrieve objects directly from the same Providers]

Figure 2 illustrates the Harvest approach to information gathering. A Gatherer collects and extracts indexing information from one or more Providers. For example, a Gatherer can collect PostScript files and extract text from them, it can collect UNIX tar files and extract their contents, or it can collect NetNews articles and extract abstracts and author names.

[Figure 2: Harvest Information Gathering Approach - Gatherers run either across the network or on the Provider hosts, feeding filtered summaries to Brokers (indexes), which can in turn feed other Brokers]

A Broker provides the indexing and the query interface to the gathered information. Brokers retrieve information from one or more Gatherers or other Brokers, and incrementally update their indexes. Gatherers and Brokers can be arranged in various ways to achieve much more flexible and efficient use of network bandwidth and server capacity than the approach illustrated in Figure 1.

The left half of Figure 2 shows a Gatherer accessing a Provider from across the network (using the native FTP, Gopher, HTTP, or NetNews protocols). This arrangement mimics the configuration shown in Figure 1, and is primarily useful for interoperating with sites that do not run the Harvest software. This arrangement experiences the same inefficiencies as discussed above, although the gathered information can be shared by many Brokers in this arrangement, thereby reducing network and server load. In the right half of Figure 2, the Gatherer runs on the Provider site hosts. As suggested by the lack of bold boxes and lines, this configuration can save a great deal of server load and network traffic. We discuss the techniques used by the Gatherer to achieve these efficiency gains in Section 4. A Broker can collect information from many Gatherers (to build an index of widely distributed data) and a Gatherer can feed information to many Brokers (thereby saving repeated gathering costs).
By retrieving information from other Brokers (illustrated in the rightmost part of Figure 2), Brokers can also cascade indexed views from one another, using the Broker's query interface to filter or refine the information from one Broker to the next. In effect, this Gatherer/Broker design provides an efficient network pipe facility, allowing information to be gathered, transmitted, and shared by many different types of indexes, with a great deal of flexibility in how information is filtered and interconnected between providers and

Brokers. Efficiency and flexibility are important because we want to enable the construction of many different topic-specific Brokers. For example, one Broker might index scientific papers for microbiologists, while another Broker indexes PC software archives. By focusing index contents per topic and per community, a Broker can avoid many of the vocabulary [22] and scaling problems of unfocused global indexes (such as Archie and WWWW).

Harvest includes a distinguished Broker called the Harvest Server Registry (HSR), which allows users to register information about each Harvest Gatherer, Broker, Cache, and Replicator in the Internet. The HSR is useful when searching for an appropriate Broker, and when constructing new Gatherers and Brokers, to avoid duplication of effort.

Gatherers and Brokers communicate using an attribute-value stream protocol called the Summary Object Interchange Format (SOIF) [27], intended to be easily parsed yet sufficiently expressive to handle many kinds of objects. It delineates streams of object summaries, and allows for multiple levels of nested detail. SOIF is based on a combination of the Internet Anonymous FTP Archives (IAFA) IETF Working Group templates [19] and BibTeX [34]. Each template contains a type, a Uniform Resource Locator (URL) [3], and a list of byte-count delimited attribute-value pairs. We define a set of mandatory and recommended attributes for Harvest system components. For example, attributes for a Broker describe the server's administrator, location, software version, and the type of objects it contains. An example SOIF record is illustrated in Figure 3.^1

Figure 4 illustrates the Harvest architecture in more detail, showing the Gatherer/Broker interconnection arrangements and SOIF protocol, the internal components of the Broker (including the Index/Search Engine and Query Manager, which will be discussed in Section 5), and the caching, replication, and client interface subsystems.
The Harvest Replicator can be used to replicate servers, to enhance user-base scalability. For example, the HSR will likely become heavily replicated, since it acts as a point of first contact for searches and new server deployment efforts. The Replication subsystem can also be used to divide the gathering process among many servers (e.g., letting one server index each U.S. regional network), distributing the partial updates among the replicas.

The Harvest Object Cache reduces network load, server load, and response latency when accessing located information objects. In the future it will also be used for caching access methods, when we incorporate an object-oriented access mechanism into Harvest. At present Harvest simply retrieves and displays objects using the standard HTML mechanisms, but we are extending the system to allow providers to associate access methods with objects (see Section 11). Given this support, it will be possible to perform much more flexible display and access operations. We discuss each of the Harvest subsystems below.

4 The Gatherer Subsystem

As indexing the Web becomes increasingly popular, two types of problems arise: data collection inefficiencies and duplication of implementation effort. Harvest addresses these problems by providing an efficient and flexible system upon which to build many different indexes. Rather than building one indexer from scratch to index WWW pages, another to index Computer Science technical reports, etc., Harvest can be configured in various ways to produce all of these indexes, at great savings of Provider-site server load and network traffic.

^1 While the attributes in this example are limited to ASCII values, SOIF allows arbitrary binary data.

@FILE {
update-time{9}:
description{27}:	About this document...
time-to-live{8}:
refresh-rate{7}:
gatherer-name{57}:	Networked Information Discovery and Retrieval
gatherer-host{21}:	bruno.cs.colorado.edu
gatherer-version{3}:	1.0
type{4}:	HTML
file-size{4}:	2551
md5{32}:	bea4c43ce6b976b3403c24f6a353f610
author{42}:	Darren Hardy Wed Feb 15 13:27:56 MST 1995
keywords{68}:	about document drakos harvest html index latex manual nikos this user
url-references{274}:	user-manual.html node98.html node3.html...
partial-text{601}:	About this document
}

Figure 3: Example SOIF Template

Data Collection Efficiency

Koster offers a number of guidelines for constructing Web "robots" [32]. For example, he suggests using breadth-first rather than depth-first traversal, to minimize rapid-fire requests to a single server. These guidelines make a good deal of sense for uncoordinated information gathering. In contrast, Harvest focuses on coordinating and optimizing the gathering process.

The biggest inefficiency in current Web indexing tools arises from collecting indexing information from systems designed to support object retrieval. For example, to build an index of HTML documents, each document must be retrieved from a server and scanned for hypertext links. This causes a great deal of load, because each object retrieval requires creating a TCP connection, forking a UNIX process, changing directories several levels deep, transmitting the object, and terminating the connection. An index of FTP file names can be built more efficiently using recursive directory listing operations, but this still requires an expensive file system traversal. More recently there have also been efforts to reduce the costs of HTTP retrievals by creating servers that do not fork [41]. The Gatherer dramatically reduces these inefficiencies through the use of Provider site-resident software optimized for indexing. This software scans the objects periodically and maintains a
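The byte-count-delimited attribute encoding shown in Figure 3 can be sketched as a short serializer. This is our own illustrative sketch, not the Harvest implementation, and the helper name `soif_encode` is invented; the real SOIF framing [27] has details (nesting, binary values) this omits:

```python
def soif_encode(template_type, url, attrs):
    """Serialize one object summary in a SOIF-like template:
    @TYPE { URL, then one attribute{byte-count}:<tab>value per line.
    The byte counts let a parser skip over values (even binary ones)
    without scanning for a delimiter."""
    lines = ["@%s { %s" % (template_type, url)]
    for name, value in attrs.items():
        data = value.encode("utf-8")
        lines.append("%s{%d}:\t%s" % (name, len(data), value))
    lines.append("}")
    return "\n".join(lines)

# Hypothetical URL, for illustration only.
record = soif_encode("FILE", "http://harvest.example/user-manual.html",
                     {"type": "HTML", "file-size": "2551"})
```

Note the byte counts are computed from the encoded value, which is what makes the format easy to parse in a single pass over the stream.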

[Figure 4: Harvest System Architecture - a Client issues a search (1) to the Broker's Query Manager, then retrieves objects and access methods (2) from the Provider via the Object Cache; the Broker comprises a Collector, Query Manager, Thesaurus, Replication Manager, and Storage Manager and Index/Search Engine, and receives SOIF streams from Gatherers and from other local or remote Brokers]

cache of indexing information, so that separate traversals are not required for each request.^2 More importantly, it allows a Provider's indexing information to be retrieved in a single stream, rather than requiring separate requests for each object. This results in a huge reduction in server load. The combination of traversal caching and response streaming can reduce server load significantly.

The Gatherer uses four techniques to reduce network traffic substantially. First, it uses the Essence system [28] to extract content summaries before passing the data to a remote Broker. Essence uses type-specific procedures to extract the most relevant parts of documents as content summaries; for example, extracting author and title information from LaTeX documents, and routine names from executable files. Hence, where Web robots might retrieve a set of HTML documents and extract anchors and URLs from each, Essence extracts content summaries at the Provider site, significantly reducing the amount of data to be transmitted. Second, the Gatherer transmits indexing data in compressed form. Third, the Gatherer supports incremental updates by allowing Brokers to request only summaries that have changed since a specified time. For access methods that do not support time stamps, the Gatherer extracts and compares "MD5" [46] cryptographic checksums of each object to determine if the object has changed since the last time the Gatherer generated a summary for it. Finally, the Gatherer maintains a cache of recently retrieved objects, to reduce the load of starting the system back up after a crash.^3 We present measurements of these savings later in this section.
^2 Some FTP sites provide a similar optimization by placing a file containing the results of a recent recursive directory listing in a high-level directory. The Gatherer uses these files when they are available.
^3 The Gatherer uses its own cache rather than the normal Harvest Cache because gathering isn't likely to exhibit much locality, and would hence affect the Cache's performance.
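The MD5-based change test described above (used when an access method lacks time stamps) amounts to recording a checksum at summary time and comparing it on the next pass. A minimal sketch, with invented function names:

```python
import hashlib

def summary_checksum(content: bytes) -> str:
    """MD5 checksum recorded when a summary is generated."""
    return hashlib.md5(content).hexdigest()

def needs_resummarizing(url: str, content: bytes, recorded: dict) -> bool:
    """Re-summarize only if the object's checksum differs from the
    one recorded at the last gathering pass (or no checksum exists)."""
    return recorded.get(url) != summary_checksum(content)

# Checksums recorded at the previous pass (hypothetical URL).
recorded = {"ftp://site.example/paper.ps": summary_checksum(b"old contents")}
```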

Flexibility of Information Gathering and Extraction

In addition to reducing the amount of indexing data that must be transmitted across the network, Essence's summarizing mechanism permits a great deal of flexibility in how information is extracted. Site administrators can customize how object types are recognized (typically by examining file naming conventions and contents), which objects get indexed (e.g., excluding executable code for which there are corresponding source files), and how information is extracted from each type of object (e.g., extracting title and author information from bibliographic databases, and copyright strings from executable programs). Essence can also extract files with arbitrarily deep presentation-layer nesting (e.g., compressed "tar" files). An example of Essence processing is illustrated in Figure 5.

[Figure 5: Example of Essence Customized Information Extraction - Essence.tar.Z passes through customized unnesting, type recognition, candidate selection, and summarizing methods; type recognition labels README [README], doc/essence.1 [manpage], doc/essence.ms [troff ms], doc/essence.ps [PostScript], and src/pstext/pstext/dvi.c [C source]; candidate selection keeps a subset; the resulting content summary is SOIF records containing keywords from README, the usage line from essence.1, and author, title, and abstract fields from essence.ms]

The Gatherer allows objects to be specified either individually (which we call "LeafNodes") or through type-specific recursive enumeration (which we call "RootNodes"). For example, FTP objects are enumerated using a recursive directory listing, while HTTP objects are enumerated by recursively extracting the URL links that point to other HTML objects.
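A RootNode-style HTTP enumeration, with the controls the Gatherer exposes (URL stop lists, a depth limit, and a URL count limit), can be sketched as a bounded breadth-first traversal. This is our own illustration: the link graph is passed in as a dictionary standing in for fetching pages and extracting their HTML anchors, and all names are invented:

```python
import re
from collections import deque

def enumerate_rootnode(root, links, url_stop=(), depth_limit=3, max_urls=250):
    """Breadth-first enumeration from a RootNode URL.
    links: url -> list of URLs it references.
    url_stop: regular expressions; matching URLs are never enumerated.
    depth_limit / max_urls: conservative bounds to keep a misconfigured
    Gatherer from traversing large parts of the Web."""
    stop = [re.compile(p) for p in url_stop]
    seen = {root}
    order = []
    queue = deque([(root, 0)])
    while queue and len(order) < max_urls:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= depth_limit:
            continue  # depth limit reached: record but do not expand
        for child in links.get(url, ()):
            if child in seen or any(p.search(child) for p in stop):
                continue
            seen.add(child)
            queue.append((child, depth + 1))
    return order

# Toy link graph (hypothetical URLs abbreviated to single letters).
links = {"a": ["b", "c", "skip-me"], "b": ["d"], "d": ["e"]}
```

Breadth-first order matches Koster's guideline cited earlier, spreading requests across servers instead of hammering one.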
The RootNode enumeration mechanism provides several means by which users can control the enumeration, including site and URL stop lists (both based on regular expressions), enumeration depth limits, and enumerated URL count limits. The default limits are conservative (for example, disallowing HTTP enumerations beyond the scope of a single server), to prevent a misconfigured Gatherer from traversing large parts of the Web.

The combination of Gatherer enumeration support and content extraction flexibility allows the site administrator to specify a Gatherer's configuration very easily. Given this specification, the system automatically enumerates, gathers, unnests, and summarizes a wide variety of information resources, generating content summaries that range in detail from full-text to highly abridged and specialized "signatures." The Gatherer also allows users to include manually generated information (such as hand-chosen keywords or taxonometric classification references) along with the

Essence-extracted information. This allows users to improve index quality for data that warrants the added manual labor. For example, Harvest can incorporate site markup records specified using IETF-specified IAFA templates. The Gatherer stores manually and automatically generated information separately, so that periodic automated summarizing will not overwrite the manually created information.

The other type of flexibility supported by the Gatherer is in the distribution of the Gatherer process. Ideally, a Gatherer will be placed at each Provider site, to support efficient data gathering (discussed below). Because this may not always be feasible, we also allow Gatherers to be run remotely, accessing objects using FTP/Gopher/HTTP/NetNews.

Measurements

In this subsection we present measurements of the Gatherer's efficiency with respect to provider site load and network traffic.

Provider Site Load Measurements

To quantify Provider-site server savings from running local Gatherers, we measured the time taken to gather information three different ways: via individual FTP retrievals, via individual local file system retrievals, and from the Gatherer's cache. To focus these measurements on server load, we collected only empty files and used the network loopback interface in the case where data would normally go across the network. We collected the measurements over 1,000 iterations on an otherwise unloaded DEC Alpha workstation. Our results indicate that gathering via the local file system causes 7.0 times less load than gathering via FTP.^4 More importantly, retrieving an entire stream of SOIF records from the Gatherer's cache incurs 2.7 times less server load than retrieving a single object via FTP (and 2.8 times less load for retrieving all objects at a site, because of an optimization we implemented that keeps the data for this commonly-requested case cached).
Thus, assuming an average archive site size of 2,300 files,^5 serving indexing information via a Gatherer will reduce server load per data collection attempt by a factor of 6,429. Collecting all objects at the site increases this reduction to 6,631, because of the aforementioned common-case cache.

As noted earlier, Harvest allows information to be collected remotely so that it can interoperate with sites that do not run the Harvest software. Because this poses so much more load on Provider-site servers, we implemented a connection caching mechanism, which attempts to keep connections open across individual object retrievals.^6 When the available UNIX file descriptors are exhausted (or timed out with inactivity), they are closed according to a replacement policy. This mechanism reduces server load by a factor of 7.0 for FTP transfers.

Network Traffic Measurements

As discussed previously, Harvest reduces network traffic four ways: extracting indexing data before transmitting across the network, supporting incremental updates, transmitting the data in compressed form from Gatherers to Brokers, and by using a local cache of recently retrieved objects. Here we discuss the first three ways, since the cache is primarily needed for insulating the network from re-retrievals in case of a system crash.

The savings from pre-extraction are derived from measurements we performed for an earlier paper about Essence [28]. We found that Essence content summaries require substantially less space than the data they summarize, depending on data type. For example, the reduction for PostScript files is quite high (since a good deal of PostScript content is positioning information that can be discarded when extracting indexing terms), while the reduction for short text files is fairly low (since Essence extracts full keyword content for short text files).

To measure the savings from incremental gathering, we sampled "last update" times for a set of approximately 4,600 HTTP objects distributed among approximately 2,000 sites, between September 1 and November 1. We then computed the cost of incremental gathering for a range of update periods, based on the average update period for each object and the object's size. Figure 6 plots the proportion of bytes saved by incremental vs. full gathering for the sampled objects. For example, this figure shows that full gathering sends approximately 4.8 times as many bytes as incremental gathering does, for daily updates. The savings drop off rapidly as the gathering period increases, because many objects are updated frequently.

[Figure 6: Measured Savings from Incremental Gathering for Sampled HTTP Documents - proportion of bytes saved by incremental vs. full gathering, plotted against the update interval in days]

While a factor of 4.8 is clearly worthwhile, we were initially surprised the savings were not higher. The explanation is that the savings depends heavily on how quickly the data change.

^4 Harvest provides a means by which site administrators can specify mappings from URL roots to local file system names, allowing URLs listed in the Gatherer configuration file to be accessed via the local file system.
^5 We computed this average from measurements of the total number of sites and files indexed by Archie [47].
^6 At present this is only possible with the FTP control connection and NetNews. Gopher and HTTP close connections after each retrieval.
The average lifetime of the sampled objects was approximately 44 days, with over 28% of the objects being updated at least every 10 days, and 1% being updated at least daily.^7 We believe that WWW data volatility is particularly high at present because the WWW is a new phenomenon. The WWW may reach a somewhat less volatile steady state over time. Older systems like FTP probably experience much slower change rates, and hence incremental gathering would be of more value for such systems.

^7 In some cases HTTP objects appear to be updated very frequently because they include constructs that recompute a value upon each access, such as the current user readership count.
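The shape of the measured curve can be reproduced with a back-of-the-envelope model: per gathering pass, incremental gathering resends an object only if it changed since the last pass, which we approximate by the ratio of the gathering period to the object's mean update period. This is our own simplified model for intuition, not the paper's exact computation:

```python
def full_vs_incremental_ratio(objects, gather_period_days):
    """objects: list of (size_bytes, mean_update_period_days).
    Returns how many times more bytes full gathering sends than
    incremental gathering, per gathering pass. An object's chance of
    having changed since the last pass is approximated as
    gather_period / update_period, capped at 1."""
    full = sum(size for size, _ in objects)
    incremental = sum(size * min(1.0, gather_period_days / period)
                      for size, period in objects)
    return full / incremental

# Toy mix: one fast-changing object, one slow-changing one.
mix = [(1000, 2), (1000, 50)]
```

As the gathering period grows, the `min(1.0, ...)` terms saturate and the ratio falls toward 1, matching the rapid drop-off visible in the measurements: incremental gathering pays off most for frequent update intervals.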

One other consideration is that frequent incremental gathering allows more current indexing information to be maintained. For example, most Web "robots" perform infrequent gathering (e.g., only every several months), which means the majority of their content is out-of-date, given the average object lifetime of 44 days.

The last way that Harvest reduces network traffic while gathering indexing data is through the use of compression. The savings from compression is again data type-dependent, providing an additional factor of 2-10 in savings.

Examples of Overall Traffic and Provider Site Savings from Harvest Gathering

The combination of pre-filtering, incremental updates, and compression can reduce network traffic significantly. As an example of these savings, our content index of Computer Science technical reports (see Section 10) starts with 2.44 GB of data distributed around the Internet, and extracts 111 MB worth of content summaries. It then transmits these data as a 41 MB compressed stream to requesting Brokers, for a 59.4-fold network transmission savings over collecting the documents individually and building a full-text index. Note that these computations do not include the additional savings from doing incremental updates.

While one might suspect that Essence's use of content summaries causes the resulting indexes to lose useful keywords, our measurements indicate that Essence achieves nearly the same precision^8 and 70% of the recall of WAIS, but requires only 3-11% as much index space and 70% as much summarizing and indexing time as WAIS [28]. For users who need the added recall and precision, however, Essence also supports a full-text extraction option.

As an overall illustration of the savings on Provider-site server load, it took 10 hours to gather and extract the information for our AT&T Broker (see Section 10), pulling the data across the Internet for each of 4,100 HTML files. (This Gatherer ran remotely from the Provider site.)
Once we had computed the compressed SOIF summary stream for these data, we could retrieve the entire stream across the Internet in approximately 3 minutes: a 200-fold reduction! This reduction is explained in part by the reduced amount of data that needed to be sent (2.9 MB for the compressed SOIF summary stream vs. 20 MB for the original data), partly by the fact that content extraction is performed only when the data are first gathered, and mostly by the fact that retrieving from the Gatherer can be done without any forking, whereas retrieving via HTTP required a fork for each of the 4,100 HTML objects. Once we had gathered the data from the AT&T machine, we could serve it to other Brokers much more efficiently than could be accomplished by gathering the information independently.

5 The Broker Subsystem

As illustrated in Figure 4, the Broker consists of four software modules: the Collector, the Registry, the Storage Manager and Index/Search Engine, and the Query Manager. The Registry and Storage Manager maintain the authoritative list of summary objects that exist in the Broker. For each object, the Registry records a time-to-live value and a unique object identifier (OID) consisting of the object's URL plus information about the Gatherer that generated

^8 Intuitively, precision is the probability that all of the retrieved documents will be relevant, while recall is the probability that all of the relevant documents will be retrieved [6].

the summary object. The Storage Manager archives a collection of summary objects on disk, storing each as a file in the underlying file system. The Broker also eliminates duplicate objects (e.g., those reached via multiple HTML pointers) based on a combination of MD5 signatures and Gatherer IDs.

The Collector periodically requests updates from each Gatherer or Broker specified in its configuration file. The Gatherer responds with a list of object summaries to be created, removed, or updated, based on a timestamp passed from the requesting Broker. The Collector parses the response and forwards it to the Registry for processing. When an object is created, an OID is added to the Registry, the summary object is archived by the Storage Manager, and a handle is given to the Index/Search Engine. When the collection is finished, the Registry is written to disk to complete the transaction. When an object is destroyed, the OID is removed and the summary object is deleted by the Storage Manager. When an object is refreshed, its time-to-live is recomputed. If the time-to-live expires for an object, the object is removed from the Broker. Since the state of the Storage Manager may become inconsistent with the Registry, periodic garbage collection removes any unreferenced objects from the Storage Manager.

The Query Manager exports objects to the network. It accepts a query, translates it into an intermediate representation, and then passes it to the search engine. The search engine responds with a list of OIDs and some search engine-specific data for each. For example, one of our search engines (Glimpse; see Section 6) returns the matching SOIF attribute for each summary object in the result list. The Query Manager constructs a short object description from the OID, the search engine-specific data, and the summary object. It then returns the list of object descriptions to the client.
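The duplicate-elimination step described above, based on MD5 signatures plus Gatherer IDs, can be sketched as follows. The `Registry` class and its fields are illustrative assumptions, not Harvest's actual data structures:

```python
import hashlib

def content_key(summary_body: bytes, gatherer_id: str) -> tuple:
    """Duplicate key: MD5 signature of the object's content plus the
    Gatherer ID. Two URLs reached via different HTML pointers that name
    the same content yield the same key."""
    return (hashlib.md5(summary_body).hexdigest(), gatherer_id)

class Registry:
    def __init__(self):
        self.seen = {}   # content key -> OID of the first registered copy

    def register(self, url: str, summary_body: bytes, gatherer_id: str):
        oid = (url, gatherer_id)   # OID = URL plus Gatherer information
        key = content_key(summary_body, gatherer_id)
        if key in self.seen:
            return self.seen[key]  # duplicate: collapse to the first OID
        self.seen[key] = oid
        return oid

reg = Registry()
a = reg.register("http://a/x.html", b"same page", "g1")
b = reg.register("http://b/alias.html", b"same page", "g1")
print(a == b)   # True: duplicate content collapses to one entry
```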
A Broker can gather objects directly from another Broker using a bulk transfer protocol within the Query Manager. When a query is sent to the source Broker, instead of returning the usual short description, the Query Manager returns the entire summary object. The query filters summary objects, so that one Broker can gather a subset of the objects in another Broker. In this way a Broker can create an index over a specialized set of summary objects by filtering them from other Brokers.

The Query Manager supports remote administration of the Broker's configuration and operations. Several Broker parameters can be manipulated, including the list of Gatherers to contact, the frequency of contact with a Gatherer, the default time-to-live for summary objects, and the frequency of garbage collection.

6 The Index and Search Subsystem

To accommodate diverse indexing and searching needs, Harvest defines a general Broker-Index/Search Engine interface that can accommodate a variety of "backend" search engines. The principal requirements are that the backend support Boolean combinations of attribute-based queries, and that it support incremental updates. One can therefore use a variety of different backends inside a Broker, such as Ingres or WAIS (although some versions of WAIS do not support all the needed features). We have developed two particular search and index subsystems for Harvest, each optimized for different uses. Glimpse [36] supports space-efficient indexes and flexible interactive queries, while

Nebula [10] supports fast searches and views based on complex standing queries that scan the data on a regular basis and extract relevant information.

Glimpse Index/Search Subsystem

The main novelty of Glimpse is that it requires a very small index (as low as 2-4% of the size of the data), yet allows very flexible content queries, including the use of Boolean expressions, regular expressions, and approximate matching (i.e., allowing misspellings). Another advantage of Glimpse is that the index can be built quickly and modified incrementally.

The index used by Glimpse is similar in principle to an inverted index, with one additional feature. The main part of the index consists of a list of all words that appear in all files. For each word there is a pointer to a list of occurrences of the word. But instead of pointing to the exact occurrence, Glimpse points to an adjustable-sized area that contains that occurrence. This area can be simply the file that contains the word, a group of files, or perhaps a paragraph [9]. By combining pointers for several occurrences of each word and by minimizing the size of the pointers (since it is not necessary to store the exact location), the index becomes rather small. This allows Glimpse to use agrep [50] to search the index, and then to search the areas found by the pointers in the index. In this way Glimpse can easily support approximate matching and other agrep features.

Since the index is quite small, it is possible to modify it quickly. Glimpse has an incremental indexing option that scans all file modification times, compares each to the last index creation time, and reindexes only the new and modified files. In a file system with 6,000 files totaling about 70 MB, incremental indexing takes about 2-4 minutes on a Sparcstation IPC. A complete indexing takes about 20 minutes.
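The two-phase, block-addressed scheme described above can be sketched in a few lines. This is a toy model under simplifying assumptions (fixed-size file groups, exact word match), not Glimpse's actual index format:

```python
from collections import defaultdict

def build_block_index(files, block_size=2):
    """Map each word to the set of *blocks* (groups of files) containing it,
    rather than to exact occurrences; coarse pointers keep the index small."""
    index = defaultdict(set)
    for i, (_, text) in enumerate(files):
        for word in set(text.lower().split()):
            index[word].add(i // block_size)  # adjustable-sized area: a file group
    return index

def search(files, index, word, block_size=2):
    """Two phases, as in Glimpse: look up candidate blocks in the small
    index, then scan only those blocks for the actual matches."""
    hits = []
    for block in sorted(index.get(word.lower(), ())):
        for name, text in files[block * block_size:(block + 1) * block_size]:
            if word.lower() in text.lower().split():
                hits.append(name)
    return hits

files = [("a.txt", "incremental indexing"), ("b.txt", "query language"),
         ("c.txt", "indexing speed"), ("d.txt", "network cache")]
idx = build_block_index(files)
print(search(files, idx, "indexing"))   # ['a.txt', 'c.txt']
```

Coarsening the pointers trades a little scan work at query time for a much smaller index, which is exactly what makes incremental rebuilds cheap.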
The combination of Glimpse features allows users to perform much more selective queries than simple keyword searches when the Gatherer was able to extract attributes from the gathered objects (such as "author's name," "institution," and "abstract"). For example, one can search for all articles about "incremental indexing" by "unknown author," allowing one spelling error in the author's name.

Harvest's use of Essence content summarizing and Glimpse indexing provides a great deal of space savings compared with WAIS. In Section 4 we cited our earlier Essence measurements showing that Essence requires only 3-11% as much index space as WAIS. Because we now use Glimpse to index the space-efficient summaries generated by Essence, the space savings are drastically improved. Combining these two changes, we compute that Harvest now requires up to a factor of 42.6 less index space than WAIS. This savings makes it possible for Harvest to index a great deal of information using a moderate amount of disk space; a concrete example is a Broker we made of Computer Science technical reports from 280 sites around the world (see Section 10). This Broker would have taken 9 GB with a standard WAIS index, but required only 270 MB using Harvest.

Nebula Index/Search Subsystem

In contrast to Glimpse, Nebula focuses on providing fast queries at the expense of index size. It also provides particular support for constructing views, which are defined by standing queries against the database of indexed objects. Because views exist over time, it is easy to refine and extend

[9] Paragraph-level indexing is not implemented yet.

them, and to observe the effect of query changes interactively. A view can scope a search, in the sense that it constrains the search to some subset of the database. For example, within the scope of a view that contains computer science technical reports, a user may search for "networks" without matching information about social networks.

Nebula's views facilitate the construction of hierarchical classification schemes for indexed objects. For example, a simple classification scheme for our PC software Broker (see Section 10) implements, at the top level, a view that selects all summary objects that represent PC software. A more specific class, like software for Microsoft Windows, is constructed from objects in the PC software view; i.e., the PC software view scopes the Windows view.

Nebula can also be extended to include domain-specific query resolution functions. For example, a server that contains bibliographic citations can be extended with a resolution function that prefers keywords found in "abstract" and "title" attributes to keywords found in the text of the document. The resolution function is constructed with domain-specific information, namely, that the title and abstract contain the most important and descriptive information about a document.

For performance reasons, each attribute tag has a corresponding index that maps values to objects. The indexing mechanism can differ from tag to tag, depending on the expected queries. Attributes like "description" and "summary" use a mechanism that indexes individual keywords in the value. The keyword mechanism supports queries based on pattern matching. Other tags, like "name" and "size", index values with a dynamic hash table for fast, precise retrieval. While Nebula prefers fast queries over small indexes, for structured summary objects some of the cost of indexing is reclaimed by storing a single copy of an attribute that appears in many summary objects.
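The view-scoping idea above (a standing query whose scope is another view) can be sketched as follows; the attribute names and the PC-software example data are illustrative, not Nebula's actual schema:

```python
# Sketch of Nebula-style views: a view is a standing query (predicate)
# over summary objects, and one view can scope another.
class View:
    def __init__(self, predicate, parent=None):
        self.predicate = predicate
        self.parent = parent   # scoping view, if any

    def matches(self, obj):
        # An object is in the view only if it is also in the scoping view.
        if self.parent and not self.parent.matches(obj):
            return False
        return self.predicate(obj)

    def select(self, objects):
        return [o for o in objects if self.matches(o)]

objects = [
    {"name": "winzip",  "category": "pc-software", "platform": "windows"},
    {"name": "pkunzip", "category": "pc-software", "platform": "dos"},
    {"name": "tr-95-1", "category": "techreport",  "platform": None},
]

pc_software = View(lambda o: o["category"] == "pc-software")
# The PC-software view scopes the more specific Windows view.
windows = View(lambda o: o["platform"] == "windows", parent=pc_software)

print([o["name"] for o in windows.select(objects)])   # ['winzip']
```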
Because common attributes are shared in this way, for a database of 20,000 bibliographic citations the index adds only about 9% overhead. A query on this database with two attribute expressions and one Boolean operator is resolved in less than 100 milliseconds on a Sparc IPC.

7 The Query Interface

There are two conflicting goals in designing a query interface for use in as diverse an environment as the Internet. On the one hand, we want to provide a great degree of flexibility and the ability to customize for different communities and for different users. On the other hand, we want a reasonably simple, uniform interface, so that users can move from one domain to another without being overwhelmed.

Uniform Interface

Harvest supports a uniform client interface through the Query Manager in the Broker. The Query Manager in each Broker implements the same query language, which supports Boolean combinations of keyword- and attribute-based expressions. Query results are presented in a uniform format that includes some indexer-specific information plus the object identifier for each object. The Query Manager uses an HTML form to collect the query from the user, and then passes the query to the Harvest gateway. The gateway retrieves the query from the form and sends it to the Query Manager. The set of objects returned by the Query Manager is converted into an HTML page with hypertext links to the complete summary objects in the Broker's database and to the original resource described by each summary object. An example query is shown in Figure 7.

Figure 7: Example of Query Interface

Currently, the Query Manager maintains no state across queries. Thus, the Broker's uniform interface does not support query refinement. However, simple query refinement is implemented by most WWW clients. For example, Mosaic caches recently accessed documents in virtual memory; previous queries can be retrieved and refined by moving through this cache. We also support a customizable means of displaying query results, which allows users to define result layout, displayed fields, string manipulations, warning messages, and many other aspects of the display output.

Interface Customization

Harvest supports query interface customization through direct access to each Index/Search Engine. For example, the Glimpse-specific interface supports additional switches for approximate matching and record-specific document retrieval, while the Nebula-specific interface supports extensive query refinement and improved integration of organization and search. Direct access lets the user take advantage of the specific strengths of each Index/Search Engine. While this means users must learn multiple interfaces, it also increases flexibility. A domain-specific Index/Search Engine can use the Broker to collect information from several sites. This information can be accessed by neophytes through the uniform interface, while expert users benefit from the additional functionality provided by the Index/Search Engine through the direct interface. In this way, Harvest facilitates the construction of domain-specific databases (such as image or genome databases) that require a unique query language or result presentation. We are currently working on several tools to allow much greater customization.

8 The Replication Subsystem

Harvest replicates its Brokers with the mirror-d replication tool. Mirror-d maintains a weakly consistent, replicated directory tree of files. Mirror-d implements eventual consistency [5]: if all new updates cease, the replicas eventually converge.
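The eventual-consistency guarantee just stated can be illustrated with a toy anti-entropy loop. This is a hypothetical model for intuition only, not mirror-d's actual protocol: each replica repeatedly adopts the newest directory-tree version among its logical neighbors, and once updates cease, all replicas converge.

```python
def synchronize(versions, neighbors):
    """One anti-entropy round: each replica adopts the newest version
    among itself and its logical neighbors."""
    return {
        r: max([versions[r]] + [versions[n] for n in neighbors[r]])
        for r in versions
    }

neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}  # logical topology
versions = {"A": 3, "B": 1, "C": 1}                    # A holds the newest tree

rounds = 0
while len(set(versions.values())) > 1:
    versions = synchronize(versions, neighbors)
    rounds += 1
print(versions, rounds)   # all replicas reach version 3 after 2 rounds
```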
Each replica belongs to one or more data propagation groups and, within a group, mirror-d's companion process, flood-d [15], estimates the available network bandwidth and observed round-trip delay between each pair of replicas. A group master periodically computes a graph that specifies which replicas should synchronize and propagate updates among each other. We call this the group's logical topology. Replicas synchronize updates with their neighbors in the logical topology graph using the "ftp-mirror" software [40]. Basically, mirror-d dynamically configures the ftp-mirror software between replicas in a replication group so that updates propagate along the highest-bandwidth, lowest-delay network paths. Since groups can be organized hierarchically, mirror-d should be able to scale to thousands of autonomously operated replicas. Groups become stitched together when one or more replicas belong to both groups. Mirror-d is written in Perl and flood-d is written in C; both speak HTTP and HTML.

Each replication group is named, and one replica in each group is distinguished as the group leader. Flood-d does not elect new leaders, because the access control list of sites that can join the group resides at the group leader. Access control can be disabled or limited to a specific list of sites. If the leader fails, replication proceeds, but the logical topology is not updated and new replicas cannot join the group.

Implementation Notes

Upon startup, a new replica issues a join request. When the group master receives the request, it generates a new topology with the new member included and distributes the topology to the new group members. Frequently during the first hour of operation, and less frequently afterwards, a replica selects another replica, measures the elapsed round-trip time for a null UDP-based remote procedure call, and measures the elapsed time to transfer a 64 KB data block via TCP. Periodically, replicas flood these estimates to one another.

Flood-d does not implement an explicit group leave operation. Rather, if a replica is silent for a long time, each group member independently reclaims the resources associated with it, and the group leader, when it recomputes the group topology, leaves out the silent replica. When new members join the group, or when a timer expires, the group leader computes and floods a new logical topology to its members.

Periodically, or when requested through its HTTP interface, a replica's mirror-d process attempts to synchronize with its logical neighbors. Mirror-d obtains the list of its logical neighbors from its flood-d process and then requests that each neighbor return the version number of its directory tree. If no neighbor has a more recent version, then mirror-d does nothing. If one or more neighboring replicas have a more recent version, mirror-d spawns an ftp-mirror process that synchronizes the two replicas. For ease of synchronization, mirror-d will spawn only one concurrent ftp-mirror process, and enqueues other relevant update notifications. If more than one neighbor has the highest version number, mirror-d chooses the neighbor with the highest bandwidth-to-delay ratio.

The group leader constructs a two- or three-connected logical topology that attempts to minimize the sum of the logical link costs. These costs are derived from the estimated bandwidth-to-delay ratio.
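The neighbor-selection rule described above (pull from the most recent version; break ties by bandwidth-to-delay ratio) can be sketched as follows. The host names and measurement values are illustrative assumptions:

```python
def choose_sync_neighbor(neighbors, my_version):
    """Among logical neighbors with a newer directory-tree version,
    pick one holding the highest version; break ties by preferring
    the highest bandwidth-to-delay ratio."""
    newer = [n for n in neighbors if n["version"] > my_version]
    if not newer:
        return None   # already up to date; mirror-d does nothing
    best_version = max(n["version"] for n in newer)
    candidates = [n for n in newer if n["version"] == best_version]
    return max(candidates, key=lambda n: n["bandwidth"] / n["delay"])

neighbors = [
    {"host": "usc-lab-3", "version": 7, "bandwidth": 8e6,  "delay": 0.02},
    {"host": "colorado",  "version": 7, "bandwidth": 1e6,  "delay": 0.05},
    {"host": "home-ppp",  "version": 6, "bandwidth": 28e3, "delay": 0.20},
]
print(choose_sync_neighbor(neighbors, my_version=6)["host"])   # usc-lab-3
```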
While we originally employed a search algorithm to compute this graph, we recently switched to a heuristic: take a minimum spanning tree and add low-cost links until the graph's connectivity requirement is met. Since neighbors attempt to propagate updates along the minimum spanning tree, data transfers usually do not take place over a replica's less desirable links.

Performance and Experience

We have tested mirror-d configurations of thirty replicas, organized into two or three overlapping groups. Two of the project members replicate the 60 MB Harvest www-home-pages Broker at our homes, over 28 KB/s PPP links. At the time of writing, four sites replicate Harvest's www-home-pages Broker, and during testing we frequently run another ten replicas. Had these sites been replicated with ftp-mirror rather than flood-d, all of them would have pulled their updates from the primary copy of the www-home-pages Broker. With mirror-d, the home machines mirror off of a set of machines in a computing laboratory at the University of Southern California, the ten laboratory machines mirror off of each other, and one USC lab machine mirrors off of a machine at the University of Colorado.

9 The Object Caching Subsystem

To meet ever-increasing demand on network links and information servers, Harvest includes a hierarchically organized Object Caching subsystem [13], as illustrated in Figure 8. At each level there can be several neighbor caches, allowing some load-sharing among cache servers.


Chapter The LRU* WWW proxy cache document replacement algorithm Chapter The LRU* WWW proxy cache document replacement algorithm Chung-yi Chang, The Waikato Polytechnic, Hamilton, New Zealand, itjlc@twp.ac.nz Tony McGregor, University of Waikato, Hamilton, New Zealand,

More information

FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 23

FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 23 FILE SYSTEMS CS124 Operating Systems Winter 2015-2016, Lecture 23 2 Persistent Storage All programs require some form of persistent storage that lasts beyond the lifetime of an individual process Most

More information

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Web crawlers Retrieving web pages Crawling the web» Desktop crawlers» Document feeds File conversion Storing the documents Removing noise Desktop Crawls! Used

More information

Configuring STP and RSTP

Configuring STP and RSTP 7 CHAPTER Configuring STP and RSTP This chapter describes the IEEE 802.1D Spanning Tree Protocol (STP) and the ML-Series implementation of the IEEE 802.1W Rapid Spanning Tree Protocol (RSTP). It also explains

More information

SMD149 - Operating Systems - File systems

SMD149 - Operating Systems - File systems SMD149 - Operating Systems - File systems Roland Parviainen November 21, 2005 1 / 59 Outline Overview Files, directories Data integrity Transaction based file systems 2 / 59 Files Overview Named collection

More information

Technology Insight Series

Technology Insight Series IBM ProtecTIER Deduplication for z/os John Webster March 04, 2010 Technology Insight Series Evaluator Group Copyright 2010 Evaluator Group, Inc. All rights reserved. Announcement Summary The many data

More information

[8] Peter W. Foltz and Susan T. Dumais. Personalized information delivery: An analysis of information

[8] Peter W. Foltz and Susan T. Dumais. Personalized information delivery: An analysis of information [8] Peter W. Foltz and Susan T. Dumais. Personalized information delivery: An analysis of information ltering methods. Communications of the ACM, 35(12), December 1992. [9] Christopher Fox. A stop list

More information

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING Larson Hogstrom, Mukarram Tahir, Andres Hasfura Massachusetts Institute of Technology, Cambridge, Massachusetts, USA 18.337/6.338

More information

Frank Miller, George Apostolopoulos, and Satish Tripathi. University of Maryland. College Park, MD ffwmiller, georgeap,

Frank Miller, George Apostolopoulos, and Satish Tripathi. University of Maryland. College Park, MD ffwmiller, georgeap, Simple Input/Output Streaming in the Operating System Frank Miller, George Apostolopoulos, and Satish Tripathi Mobile Computing and Multimedia Laboratory Department of Computer Science University of Maryland

More information

Benchmarking the CGNS I/O performance

Benchmarking the CGNS I/O performance 46th AIAA Aerospace Sciences Meeting and Exhibit 7-10 January 2008, Reno, Nevada AIAA 2008-479 Benchmarking the CGNS I/O performance Thomas Hauser I. Introduction Linux clusters can provide a viable and

More information

Technische Universitat Munchen. Institut fur Informatik. D Munchen.

Technische Universitat Munchen. Institut fur Informatik. D Munchen. Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl

More information

Data reduction for CORSIKA. Dominik Baack. Technical Report 06/2016. technische universität dortmund

Data reduction for CORSIKA. Dominik Baack. Technical Report 06/2016. technische universität dortmund Data reduction for CORSIKA Technical Report Dominik Baack 06/2016 technische universität dortmund Part of the work on this technical report has been supported by Deutsche Forschungsgemeinschaft (DFG) within

More information

Andrew T. Campbell, Javier Gomez. Center for Telecommunications Research, Columbia University, New York. [campbell,

Andrew T. Campbell, Javier Gomez. Center for Telecommunications Research, Columbia University, New York. [campbell, An Overview of Cellular IP Andrew T. Campbell, Javier Gomez Center for Telecommunications Research, Columbia University, New York [campbell, javierg]@comet.columbia.edu Andras G. Valko Ericsson Research

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Routing Basics. What is Routing? Routing Components. Path Determination CHAPTER

Routing Basics. What is Routing? Routing Components. Path Determination CHAPTER CHAPTER 5 Routing Basics This chapter introduces the underlying concepts widely used in routing protocols Topics summarized here include routing protocol components and algorithms In addition, the role

More information

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate Searching Information Servers Based on Customized Proles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California

More information

Stackable Layers: An Object-Oriented Approach to. Distributed File System Architecture. Department of Computer Science

Stackable Layers: An Object-Oriented Approach to. Distributed File System Architecture. Department of Computer Science Stackable Layers: An Object-Oriented Approach to Distributed File System Architecture Thomas W. Page Jr., Gerald J. Popek y, Richard G. Guy Department of Computer Science University of California Los Angeles

More information

INTELLIGENT CACHING FOR WORLD-WIDE WEB OBJECTS DUANE WESSELS. B.S. Physics, Washington State University. A thesis submitted to the

INTELLIGENT CACHING FOR WORLD-WIDE WEB OBJECTS DUANE WESSELS. B.S. Physics, Washington State University. A thesis submitted to the INTELLIGENT CACHING FOR WORLD-WIDE WEB OBJECTS by DUANE WESSELS B.S. Physics, Washington State University A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial

More information

Layered Architecture

Layered Architecture 1 Layered Architecture Required reading: Kurose 1.7 CSE 4213, Fall 2006 Instructor: N. Vlajic Protocols and Standards 2 Entity any device capable of sending and receiving information over the Internet

More information

Configuring Rapid PVST+

Configuring Rapid PVST+ This chapter describes how to configure the Rapid per VLAN Spanning Tree (Rapid PVST+) protocol on Cisco NX-OS devices using Cisco Data Center Manager (DCNM) for LAN. For more information about the Cisco

More information

TCP over Wireless Networks Using Multiple. Saad Biaz Miten Mehta Steve West Nitin H. Vaidya. Texas A&M University. College Station, TX , USA

TCP over Wireless Networks Using Multiple. Saad Biaz Miten Mehta Steve West Nitin H. Vaidya. Texas A&M University. College Station, TX , USA TCP over Wireless Networks Using Multiple Acknowledgements (Preliminary Version) Saad Biaz Miten Mehta Steve West Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX

More information

Microsoft SQL Server Fix Pack 15. Reference IBM

Microsoft SQL Server Fix Pack 15. Reference IBM Microsoft SQL Server 6.3.1 Fix Pack 15 Reference IBM Microsoft SQL Server 6.3.1 Fix Pack 15 Reference IBM Note Before using this information and the product it supports, read the information in Notices

More information

The Measured Cost of. Conservative Garbage Collection. Benjamin Zorn. Department of Computer Science. Campus Box 430

The Measured Cost of. Conservative Garbage Collection. Benjamin Zorn. Department of Computer Science. Campus Box 430 The Measured Cost of Conservative Garbage Collection Benjamin Zorn Department of Computer Science Campus Box #430 University of Colorado, Boulder 80309{0430 CU-CS-573-92 April 1992 University of Colorado

More information

Visual Basic Live Techniques

Visual Basic Live Techniques Page 1 of 8 Visual Basic Live Techniques Mike Wills Microsoft Corporation Revised December 1998 Summary: Discusses common Web site development techniques discovered during the development of the Visual

More information

Query based Site Selection for Distributed Search Engines

Query based Site Selection for Distributed Search Engines Query based Site Selection for Distributed Search Engines Nobuyoshi SATO, Minoru DAAWA, Minoru EHARA, Yoshifumi SAKAI, Hideki MORI {jju,ti980039}@ds.cs.toyo.ac.jp, {uehara,sakai,mori}@cs.toyo.ac.jp Department

More information

Capacity Planning for Next Generation Utility Networks (PART 1) An analysis of utility applications, capacity drivers and demands

Capacity Planning for Next Generation Utility Networks (PART 1) An analysis of utility applications, capacity drivers and demands Capacity Planning for Next Generation Utility Networks (PART 1) An analysis of utility applications, capacity drivers and demands Utility networks are going through massive transformations towards next

More information

Client (meet lib) tac_firewall ch_firewall ag_vm. (3) Create ch_firewall. (5) Receive and examine (6) Locate agent (7) Create pipes (8) Activate agent

Client (meet lib) tac_firewall ch_firewall ag_vm. (3) Create ch_firewall. (5) Receive and examine (6) Locate agent (7) Create pipes (8) Activate agent Performance Issues in TACOMA Dag Johansen 1 Nils P. Sudmann 1 Robbert van Renesse 2 1 Department of Computer Science, University oftroms, NORWAY??? 2 Department of Computer Science, Cornell University,

More information

Wave-Indices: Indexing Evolving Databases. Stanford, CA fshiva, Abstract

Wave-Indices: Indexing Evolving Databases. Stanford, CA fshiva, Abstract Wave-Indices: Indexing Evolving Databases Narayanan Shivakumar, Hector Garcia-Molina Department of Computer Science Stanford, CA 94305. fshiva, hectorg@cs.stanford.edu Abstract In many applications, new

More information

2 Application Support via Proxies Onion Routing can be used with applications that are proxy-aware, as well as several non-proxy-aware applications, w

2 Application Support via Proxies Onion Routing can be used with applications that are proxy-aware, as well as several non-proxy-aware applications, w Onion Routing for Anonymous and Private Internet Connections David Goldschlag Michael Reed y Paul Syverson y January 28, 1999 1 Introduction Preserving privacy means not only hiding the content of messages,

More information

Early Measurements of a Cluster-based Architecture for P2P Systems

Early Measurements of a Cluster-based Architecture for P2P Systems Early Measurements of a Cluster-based Architecture for P2P Systems Balachander Krishnamurthy, Jia Wang, Yinglian Xie I. INTRODUCTION Peer-to-peer applications such as Napster [4], Freenet [1], and Gnutella

More information

Interme diate DNS. Local browse r. Authorit ative ... DNS

Interme diate DNS. Local browse r. Authorit ative ... DNS WPI-CS-TR-00-12 July 2000 The Contribution of DNS Lookup Costs to Web Object Retrieval by Craig E. Wills Hao Shang Computer Science Technical Report Series WORCESTER POLYTECHNIC INSTITUTE Computer Science

More information

Performance of relational database management

Performance of relational database management Building a 3-D DRAM Architecture for Optimum Cost/Performance By Gene Bowles and Duke Lambert As systems increase in performance and power, magnetic disk storage speeds have lagged behind. But using solidstate

More information

Configuring Rapid PVST+ Using NX-OS

Configuring Rapid PVST+ Using NX-OS Configuring Rapid PVST+ Using NX-OS This chapter describes how to configure the Rapid per VLAN Spanning Tree (Rapid PVST+) protocol on Cisco NX-OS devices. This chapter includes the following sections:

More information

Cumulative Percentage of File Requests

Cumulative Percentage of File Requests An Analysis of Geographical Push-Caching James Gwertzman, Microsoft Corporation Margo Seltzer, Harvard University Abstract Most caching schemes in wide-area, distributed systems are client-initiated. Decisions

More information

Heap Management. Heap Allocation

Heap Management. Heap Allocation Heap Management Heap Allocation A very flexible storage allocation mechanism is heap allocation. Any number of data objects can be allocated and freed in a memory pool, called a heap. Heap allocation is

More information

Networks. Wu-chang Fengy Dilip D. Kandlurz Debanjan Sahaz Kang G. Shiny. Ann Arbor, MI Yorktown Heights, NY 10598

Networks. Wu-chang Fengy Dilip D. Kandlurz Debanjan Sahaz Kang G. Shiny. Ann Arbor, MI Yorktown Heights, NY 10598 Techniques for Eliminating Packet Loss in Congested TCP/IP Networks Wu-chang Fengy Dilip D. Kandlurz Debanjan Sahaz Kang G. Shiny ydepartment of EECS znetwork Systems Department University of Michigan

More information

HTRC Data API Performance Study

HTRC Data API Performance Study HTRC Data API Performance Study Yiming Sun, Beth Plale, Jiaan Zeng Amazon Indiana University Bloomington {plale, jiaazeng}@cs.indiana.edu Abstract HathiTrust Research Center (HTRC) allows users to access

More information

Extensions to RTP to support Mobile Networking: Brown, Singh 2 within the cell. In our proposed architecture [3], we add a third level to this hierarc

Extensions to RTP to support Mobile Networking: Brown, Singh 2 within the cell. In our proposed architecture [3], we add a third level to this hierarc Extensions to RTP to support Mobile Networking Kevin Brown Suresh Singh Department of Computer Science Department of Computer Science University of South Carolina Department of South Carolina Columbia,

More information

Khoral Research, Inc. Khoros is a powerful, integrated system which allows users to perform a variety

Khoral Research, Inc. Khoros is a powerful, integrated system which allows users to perform a variety Data Parallel Programming with the Khoros Data Services Library Steve Kubica, Thomas Robey, Chris Moorman Khoral Research, Inc. 6200 Indian School Rd. NE Suite 200 Albuquerque, NM 87110 USA E-mail: info@khoral.com

More information

6. Results. This section describes the performance that was achieved using the RAMA file system.

6. Results. This section describes the performance that was achieved using the RAMA file system. 6. Results This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to/from server disks per second, excluding

More information

A Study of Query Execution Strategies. for Client-Server Database Systems. Department of Computer Science and UMIACS. University of Maryland

A Study of Query Execution Strategies. for Client-Server Database Systems. Department of Computer Science and UMIACS. University of Maryland A Study of Query Execution Strategies for Client-Server Database Systems Donald Kossmann Michael J. Franklin Department of Computer Science and UMIACS University of Maryland College Park, MD 20742 f kossmann

More information

DHCP for IPv6. Palo Alto, CA Digital Equipment Company. Nashua, NH mentions a few directions about the future of DHCPv6.

DHCP for IPv6. Palo Alto, CA Digital Equipment Company. Nashua, NH mentions a few directions about the future of DHCPv6. DHCP for IPv6 Charles E. Perkins and Jim Bound Sun Microsystems, Inc. Palo Alto, CA 94303 Digital Equipment Company Nashua, NH 03062 Abstract The Dynamic Host Conguration Protocol (DHCPv6) provides a framework

More information

Virtual Multi-homing: On the Feasibility of Combining Overlay Routing with BGP Routing

Virtual Multi-homing: On the Feasibility of Combining Overlay Routing with BGP Routing Virtual Multi-homing: On the Feasibility of Combining Overlay Routing with BGP Routing Zhi Li, Prasant Mohapatra, and Chen-Nee Chuah University of California, Davis, CA 95616, USA {lizhi, prasant}@cs.ucdavis.edu,

More information

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems On Object Orientation as a Paradigm for General Purpose Distributed Operating Systems Vinny Cahill, Sean Baker, Brendan Tangney, Chris Horn and Neville Harris Distributed Systems Group, Dept. of Computer

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Replication in Distributed Systems

Replication in Distributed Systems Replication in Distributed Systems Replication Basics Multiple copies of data kept in different nodes A set of replicas holding copies of a data Nodes can be physically very close or distributed all over

More information

Concepts Introduced in Chapter 6. Warehouse-Scale Computers. Programming Models for WSCs. Important Design Factors for WSCs

Concepts Introduced in Chapter 6. Warehouse-Scale Computers. Programming Models for WSCs. Important Design Factors for WSCs Concepts Introduced in Chapter 6 Warehouse-Scale Computers A cluster is a collection of desktop computers or servers connected together by a local area network to act as a single larger computer. introduction

More information

Configuring Spanning Tree Protocol

Configuring Spanning Tree Protocol Finding Feature Information, page 1 Restrictions for STP, page 1 Information About Spanning Tree Protocol, page 2 How to Configure Spanning-Tree Features, page 14 Monitoring Spanning-Tree Status, page

More information

Spanning Trees and IEEE 802.3ah EPONs

Spanning Trees and IEEE 802.3ah EPONs Rev. 1 Norman Finn, Cisco Systems 1.0 Introduction The purpose of this document is to explain the issues that arise when IEEE 802.1 bridges, running the Spanning Tree Protocol, are connected to an IEEE

More information

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL Jun Sun, Yasushi Shinjo and Kozo Itano Institute of Information Sciences and Electronics University of Tsukuba Tsukuba,

More information

A Client Oriented, IP Level Redirection Mechanism. Sumit Gupta. Dept. of Elec. Engg. College Station, TX Abstract

A Client Oriented, IP Level Redirection Mechanism. Sumit Gupta. Dept. of Elec. Engg. College Station, TX Abstract A Client Oriented, Level Redirection Mechanism Sumit Gupta A. L. Narasimha Reddy Dept. of Elec. Engg. Texas A & M University College Station, TX 77843-3128 Abstract This paper introduces a new approach

More information

Table of Contents. Cisco Introduction to EIGRP

Table of Contents. Cisco Introduction to EIGRP Table of Contents Introduction to EIGRP...1 Introduction...1 Before You Begin...1 Conventions...1 Prerequisites...1 Components Used...1 What is IGRP?...2 What is EIGRP?...2 How Does EIGRP Work?...2 EIGRP

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

How to Design and Maintain a Secure ICS Network

How to Design and Maintain a Secure ICS Network How to Design and Maintain a Secure ICS Network Support IEC 62443 Compliance with SilentDefense Authors Daniel Trivellato, PhD - Product Manager Industrial Line Dennis Murphy, MSc - Senior ICS Security

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

FACETs. Technical Report 05/19/2010

FACETs. Technical Report 05/19/2010 F3 FACETs Technical Report 05/19/2010 PROJECT OVERVIEW... 4 BASIC REQUIREMENTS... 4 CONSTRAINTS... 5 DEVELOPMENT PROCESS... 5 PLANNED/ACTUAL SCHEDULE... 6 SYSTEM DESIGN... 6 PRODUCT AND PROCESS METRICS...

More information

Parallel Pipeline STAP System

Parallel Pipeline STAP System I/O Implementation and Evaluation of Parallel Pipelined STAP on High Performance Computers Wei-keng Liao, Alok Choudhary, Donald Weiner, and Pramod Varshney EECS Department, Syracuse University, Syracuse,

More information