USING THE WEB EFFICIENTLY: MOBILE CRAWLERS


Jan Fiedler and Joachim Hammer
University of Florida, Gainesville, FL

ABSTRACT

Search engines have become important tools for Web navigation. In order to provide powerful search facilities, search engines maintain comprehensive indices of documents available on the Web. The creation and maintenance of Web indices is done by Web crawlers, which recursively traverse and download Web pages on behalf of search engines. Analysis of the collected information is performed after the data has been downloaded. In this research, we propose an alternative, more efficient approach to building Web indices based on mobile crawlers. Our proposed crawlers are transferred to the source(s) where the data resides in order to filter out any unwanted data locally before transferring it back to the search engine. This reduces network load and speeds up the indexing phase inside the search engine. Our approach to Web crawling is particularly well suited for implementing so-called smart crawling algorithms, which determine an efficient crawling path based on the contents of the Web pages that have been visited so far. In order to demonstrate the viability of our approach, we have built a prototype for mobile Web crawling that uses the University of Florida Intranet as its testbed. Based on this experimental prototype, we provide a cost-benefit analysis that outlines the advantages of mobile Web crawling.

1. Introduction

The World Wide Web (Web) is a large distributed hypertext system, consisting of an estimated 1.6 million sites as of April. Due to its distributed and decentralized structure, virtually anybody with access to the Web can add new documents, links, and even servers. As a consequence, the Web changes constantly; for example, [KAH96] estimates that 40% of the Web contents change within a month. The Web also lacks structure: users navigate within this large information system by following hypertext links, which connect different resources with one another.
One of the shortcomings of this navigation approach is that it requires the user to traverse a possibly significant portion of the Web before finding a particular resource (e.g., a document which matches certain criteria). Considering the growth rate of the Web, locating relevant information in a timely manner is becoming increasingly difficult. The current approach to Web searching is to create and maintain indices for Web pages, much like the indices of a library catalog or the access paths for tuples in a database. However, before the pages can be indexed they must first be collected and returned to the indexing engine. This is done by Web crawlers, which systematically traverse the Web using various crawling algorithms (e.g., breadth-first, depth-first). The pages are downloaded to a search engine, which parses the text and creates and stores the index. For examples of Web search engines see Google [BRI97], AltaVista [AltaVista], Infoseek [Infoseek], etc. However, there are several problems associated with this method of indexing, which are due to the rapidly growing number of Web pages that have to be indexed on the one hand, and the relatively slow increase in network bandwidth on the other. To illustrate these problems, let us look at the following statistics. Within the last couple of years, search engine technology had to scale dramatically to keep up with the growing amount of information available on the Web: from 110,000 pages in 1994 to over 110 million pages in 1998 [SUL98], an increase by a factor of 1,000 in only 4 years!

(Author's current address: Intershop Communications GmbH, Leutragraben 2-4, Jena, Germany.)

And the Web is expected to

continue to grow in a rapid fashion, doubling in size (in terms of number of pages) every few months [KAH96]. By projecting this trend into the near future, we expect that a comprehensive Web index will soon have to contain about 1 billion pages. In addition, Kahle [KAH96] reports that the average online time of a page is only 75 days, which leads to an update rate of 600GB per month. Although updating pages does not necessarily affect the total index size, it nevertheless causes an index to age rapidly. Thus, in order to keep the indices of a search engine up to date, crawlers must constantly retrieve Web pages as fast as possible. According to [SUL98], the Web crawlers of commercial search engines crawl up to 10 million pages per day. Assuming an average page size of 6K [BRI97], the crawling activities of a single commercial search engine add a daily retrieval load of 60GB to the ongoing Web activities. Given this explosive growth, we see the following specific problems with the way current search engines index the Web:

Scaling. The concept of download-first-and-index-later will likely not scale given the limitations in the infrastructure and the projected growth rate of the Web. Using the estimates for the growth of Web indices provided in [SUL98], a Web crawler running in the year 2000 would have to retrieve Web data at a rate of 45Mbit per second in order to download the estimated 480GB of pages per day that are necessary to maintain the index. Looking at the fundamental limitations of storage technology and communication networks, it is highly unlikely that Web indices of this size can be maintained efficiently.

Efficiency. Current search engines add unnecessary traffic to the already overloaded Internet. While current approaches are the only alternative for general-purpose search engines trying to build a comprehensive Web index, there are many scenarios where it is more efficient to download and index only selected pages.
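The load figures quoted above are easy to sanity-check. The short script below (ours, not from the paper) recomputes the daily retrieval volume and the sustained link rate implied by them:

```python
# Sanity check of the crawling-load figures cited above:
# 10 million pages/day at 6 KB/page, and a 480 GB/day index refresh.

def daily_load_gb(pages_per_day, page_kb):
    """Total data downloaded per day, in (decimal) GB."""
    return pages_per_day * page_kb * 1e3 / 1e9

def required_mbit_per_s(gb_per_day):
    """Sustained link rate needed to move gb_per_day in 24 hours."""
    return gb_per_day * 1e9 * 8 / (24 * 3600) / 1e6

print(daily_load_gb(10_000_000, 6))          # 60.0 GB/day, the figure cited
print(round(required_mbit_per_s(480), 1))    # 44.4 Mbit/s, i.e. roughly the 45 Mbit/s cited
```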
We call these systems specialized search engines and justify their usefulness in this paper.

Quality of Index. The results of Web searches are overwhelming and require the user to act as part of the query processor. Current commercial search engines maintain Web indices of up to 110 million pages [SUL98] and easily find several thousand matches for an average query. Thus, increasing the size of the Web index does not automatically improve the quality of the search results if it simply causes the search engine to return twice as many matches to a query as before. Since we cannot limit the number of pages on the Web, we have to find ways to improve the search results in such a way that we can accommodate the rapid growth of the Web. Therefore, we expect a new generation of specialized search engines to emerge in the near future. These specialized engines may challenge the dominance of today's general search engines by providing superior search results for specific subject areas, using a variety of new technologies from areas such as data mining, data visualization, and graph theory. In addition, we believe that a new crawling approach is needed to improve the efficiency of data collection when used in combination with specialized search engines.

The outline of the paper is as follows. We discuss the general idea behind our approach and describe its advantages using an example in Sec. 2. In Sec. 3 we analyze the costs and potential benefits of mobile crawling. While we do not claim that this approach is a solution for all search engines, we believe that many applications can benefit from mobile crawling. In Sec. 4 we outline our proposed architecture, which supports mobile Web crawling, and describe the experimental prototype that we have built to verify our results. In Sec. 5 we describe relevant related research and conclude the paper in Sec. 6.

2. A Mobile Approach to Web Crawling

In this paper, we propose an alternative approach to Web crawling based on mobile crawlers. Crawler mobility allows for more sophisticated crawling algorithms [CHO97] and avoids some of the inefficiencies associated with the brute-force strategies exercised by current crawlers. We see mobile crawling as an efficient, scalable solution for establishing a specialized search index in the highly distributed, decentralized and dynamic environment of the Web. By looking at the general anatomy of traditional search engines, we realize that their architecture is strictly centralized, while the data being accessed by search engine crawlers is highly distributed. This centralized architecture is a mismatch for the distributed organization of the Web because it requires data to be downloaded before it can be processed. The shortcomings of the current approach become clear when we look at specialized search engines which are only interested in certain Web pages. If we use a stationary crawler, we download many pages which are discarded immediately because they do not meet the subject area of the search engine. Obviously, this behavior is not very desirable because network bandwidth is wasted by downloading irrelevant information. The Web crawling approach described in this paper addresses this problem by making the data retrieval component, namely the Web crawler, distributed. We define mobility in the context of Web crawling as the ability of a crawler to transfer itself to each Web server of interest before collecting pages on that server. After completing the collection process on a particular server, the crawler, together with the collected data, moves to the next server or back to its home system. Mobile crawlers are managed by a crawler manager, which supplies each crawler with a list of target Web sites and monitors the location of each crawler. This is necessary in order to intervene in case one or more crawlers happen to interfere with each other (i.e., crawl the same Web space).
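The paper does not prescribe a concrete interface for this division of labor, but it might be sketched as follows. All class and method names here are our own illustration, and "migration" is simulated by ordinary method calls rather than actual code shipping to a remote runtime:

```python
# Illustrative sketch of the crawler manager / mobile crawler split.
# All names are hypothetical, not part of the authors' prototype.

class MobileCrawler:
    def __init__(self, crawl_fn, targets):
        self.crawl_fn = crawl_fn        # crawling strategy uploaded by the manager
        self.targets = list(targets)    # seed URLs assigned to this crawler
        self.pages = []                 # pages collected while "on" remote hosts

    def run(self):
        # Visit each assigned site in turn, then return home with the data.
        for site in self.targets:
            self.pages.extend(self.crawl_fn(site))
        return self.pages

class CrawlerManager:
    def dispatch(self, crawl_fn, seed_urls, n_crawlers):
        # Hand each crawler a disjoint slice of the seed list so that no
        # two crawlers traverse the same Web space.
        return [MobileCrawler(crawl_fn, seed_urls[i::n_crawlers])
                for i in range(n_crawlers)]
```

For instance, dispatching two crawlers over four seed sites assigns sites 1 and 3 to the first crawler and sites 2 and 4 to the second, so their crawl spaces never overlap.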
However, the crawl strategy and the path taken are controlled separately by each crawler through its crawling algorithm. In addition, the crawler manager provides the necessary functionality for extracting the collected data from the crawler for use by the indexer (see Sec. 4). Figure 1 provides an overview of mobile Web crawling.

[Figure 1 shows the crawler manager and index at the search engine, with mobile crawlers migrating to HTTP servers on remote hosts across the Web.]

Figure 1: An overview of mobile Web crawling.

2.1 A Mobile Crawling Example

In order to demonstrate the capabilities of mobile crawling, let us look at an example. Consider a search engine which wants to support high-quality searches for a particular application domain (e.g., preventive health care, gardening, sports) by building an index of relevant Web pages. The creation and maintenance of a suitable index to support such a specialized search engine using a traditional

crawling approach is highly inefficient. This is due to the fact that traditional crawlers must download much more data than is effectively used (in the worst-case scenario, the whole Web). In contrast, in our approach, a mobile crawler is sent to each Web source that is expected to contain relevant information in order to perform a local pre-selection of pages. Initially, the crawler obtains a list of target locations from the crawler manager. These addresses are referred to as seed URLs since they indicate the beginning of the crawling process. In addition, the manager also uploads the crawling strategy into the crawler in the form of a program. This program tells the crawler which pages are considered relevant and should be collected. In addition, it also generates the path through the Web site. As we have mentioned before, a considerable amount of research has focused on optimal crawling strategies [CHO97]. We see our mobile crawling approach as a complement to this work by providing an efficient infrastructure for implementing crawling strategies. Returning to our example, we now briefly describe the events that take place during the crawling process, using a sample crawling algorithm as a guideline. A more detailed explanation can be found in the full version of this paper, which is available as a technical report [FIE98] from our ftp server at ftp.dbcenter.cise.ufl. Before the actual crawling begins, the crawler must migrate to a specific remote site using one of the seed URLs as the target address. After the crawler has successfully migrated to the remote host, the crawling algorithm is executed. This part of mobile crawling is very similar to traditional crawling techniques since pages are retrieved and analyzed recursively. In fact, any breadth-first or depth-first crawling strategy can be employed. However, data collection can be sped up considerably by using more sophisticated ("smart") algorithms.
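As a concrete (if toy) rendering of such a crawling algorithm, the sketch below runs a breadth-first crawl of a single site the way a mobile crawler would execute it locally on the remote host. The site model and all names are our own illustration; `relevant` stands for the relevance test uploaded by the crawler manager:

```python
from collections import deque

def crawl_site(site, seed, relevant):
    """Breadth-first crawl of one site, keeping only relevant pages.

    `site` models the server as a dict: url -> (content, outgoing links).
    """
    seen, queue, collected = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        content, links = site[url]
        if relevant(content):
            collected.append((url, content))       # local pre-selection
        for link in links:
            if link in site and link not in seen:  # stay on this server
                seen.add(link)
                queue.append(link)
    return collected

# Toy site for a health-care search engine:
site = {
    "/": ("welcome", ["/care", "/sports"]),
    "/care": ("health care tips", []),
    "/sports": ("sports scores", []),
}
print(crawl_site(site, "/", lambda text: "health" in text))
# [('/care', 'health care tips')]
```

Only the pre-selected pages travel back over the network; the irrelevant ones are discarded at the source.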
Since these crawling strategies need access to the contents of the crawled pages in order to optimize their crawl path, mobile crawlers are particularly well suited for implementing smart crawling strategies. When the crawler finishes, it either returns to the crawler manager ("home") or, in case the list of seed URLs is not empty, migrates to the next Web site on the list and continues. Once the mobile crawler has successfully migrated back to its home, all pages retrieved by the crawler are transferred to the search engine via the crawler manager. Once the pages have been downloaded, the search engine can generate the index as before. The main difference is that the set of pages to be indexed is significantly smaller and only contains those pages that are relevant to the underlying search topic. Note that a significant part of the transmission cost can be saved by compressing the pages as well as the crawler code prior to migration. In case a mobile crawler does not find any relevant information on a particular server, no data besides the crawler code itself will be transmitted. Further note that since the crawler depends on the resources of the remote host for its processing power, it is not possible to predict how much memory is available to the crawler. This means that a crawler has to be able to dynamically interrupt its collection process whenever the available memory is exhausted, and to transmit the collected pages back to the search engine to free up resources. We are currently in the process of implementing a second prototype version of our mobile crawler that has the ability to transmit collected pages back to its home base whenever necessary. In addition, we are implementing a control interface that lets the programmer specify ahead of time how resource-intensively the crawler operates.
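The interruptible collection behavior described here might look as follows in outline. The `send_home` callback and the byte limit are our own illustrative assumptions, standing in for the transmission step back to the search engine and for whatever memory the remote host grants the crawler:

```python
def collect_with_flush(pages, send_home, max_buffer_bytes):
    """Collect (url, content) pairs, shipping batches home when memory fills."""
    buffer, used = [], 0
    for url, content in pages:
        buffer.append((url, content))
        used += len(content)
        if used >= max_buffer_bytes:   # memory exhausted: transmit and free it
            send_home(buffer)
            buffer, used = [], 0
    if buffer:                         # ship whatever remains at the end
        send_home(buffer)
```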
Specifically, the programmer can specify, for example, that the crawler never uses more than a certain percentage of the available memory, or that it should pause for t seconds between page accesses to reduce the load on the host server.

2.2 Issues Associated With Mobile Crawling

So far, we have focused mainly on the advantages of crawler migration and remote page processing. However, several issues still have to be resolved before mobile crawling can become widely used. These issues can be categorized roughly as policy issues. Although specific solutions to most of them are

beyond the scope of this work, we enumerate some of the important ones in an attempt to further the understanding of the underlying problems and to generate momentum for their solutions. Please note that mobile crawlers are mainly a special case of mobile agents, and thus some of the important issues related to mobile crawling should also be addressed in the broader context of mobile computing. The two most important issues with respect to our proposed solution are related to the fact that (1) a crawler must have permission from the owner of a Web site to execute locally on the server, and (2) our current version of the crawler code needs a runtime environment to be present at each site before it can execute. As far as the permission problem (1) is concerned, we currently see no short-term solution. As far as the runtime environment (2) is concerned, we see this as an installation problem that can be alleviated in two ways. In the short term, the installation can be simplified as much as possible by making the runtime environment a small server process which can be installed by the Web master of each participating Web site. In the long run, a better solution would be to standardize the runtime environment and make it an optional part of each Web server. Given the emergence of mobile agents, a standardized runtime environment could benefit other roaming agents besides our mobile crawlers. Currently, we avoid the above problems by testing our crawlers in secured intranets where we either have control over the participating Web servers or can obtain the necessary execution permission without problems. In this sense, our environment at the University is not unlike an Intranet within a large corporation, which could use mobile crawlers for setting up search indexes with relatively little effort.
3. A Cost-Benefit Analysis of Mobile Web Crawling

We have analyzed the behavior of a mobile crawler using the following four parameters:

Data Access: By migrating to a remote Web server, mobile crawlers can access Web pages locally with respect to the server. This saves network bandwidth by eliminating the request/response messages used for data retrieval.

Remote Page Selection: By migrating to a remote Web server, mobile crawlers can select only the relevant pages before transmitting them over the network. This saves network bandwidth by discarding irrelevant information directly at the data source.

Remote Page Filtering: By migrating to a remote Web server, mobile crawlers can reduce the contents of Web pages before transmitting them over the network. This saves network bandwidth by discarding irrelevant portions of the retrieved pages.

Remote Page Compression: By migrating to a remote Web server, mobile crawlers can compress the contents of Web pages before transmitting them over the network. This saves network bandwidth by reducing the size of the retrieved data.

We now examine the effects that each of these factors has on crawling efficiency in more detail.

3.1 Localized Data Access

In the context of traditional search engines, a stationary crawler is an HTTP (Hypertext Transfer Protocol) client which tries to recursively download all documents managed by one or more Web servers. Due to the HTTP request/response paradigm [BER96], downloading the contents of a Web server involves significant overhead caused by the request messages, which have to be sent separately for each Web page to be retrieved. Using a mobile crawler, we reduce the HTTP overhead by transferring the crawler to the source of the data. The crawler can then issue all HTTP requests locally with respect to the HTTP server. This approach still requires one HTTP request per document, but there is no need to transmit these requests

over the network anymore. Figure 2 summarizes the data retrieval process based on mobile crawlers as introduced above.

[Figure 2 shows a mobile crawler, dispatched by the search engine, issuing HTTP requests locally at the HTTP server and returning HTML pages to the search engine index.]

Figure 2: HTTP-based data retrieval using mobile crawlers.

Naturally, this approach only pays off if the reduction in Web traffic due to local data access is more significant than the traffic caused by the initial crawler transmission. The important question here is how soon the overhead due to the transmission of the mobile crawler to the data source is less than the transmission overhead caused by HTTP request and response messages. To answer this question, we have derived the following formulas describing the network load L as a function of the number of crawled pages N, where s_x denotes the size of an object x (e.g., message, Web page, crawler) in KB:

  L_traditional(N) = N * (s_request + s_response + s_page)
  L_mobile(N) = N * s_page + 2 * s_crawler

The formula for the network load caused by a mobile crawler assumes that the crawler is transmitted back to its home along with the retrieved pages. Based on these formulas we can derive another function describing the savings S in network load due to mobile crawlers:

  S(N) = L_traditional(N) - L_mobile(N) = N * (s_request + s_response) - 2 * s_crawler

By using averages for the size of the crawler and the sizes of the HTTP request and response messages, we can express the savings as a linear function in the number of pages:

  s_request = s_response = 0.1KB, s_crawler = 1.5KB
  S(N) = 0.2KB * N - 3KB

Figure 3 depicts the savings function as derived above. Note the exponential scale of the X-axis and the negative savings when crawling small sites. Based on the savings function S, we expect mobile crawlers to operate more efficiently than traditional crawlers if the number of pages to be crawled exceeds 15. Please note that this result is based on the assumption that we are considering no other factors besides localized data access.

3.2 Remote Page Selection

Another advantage of our approach is remote page selection.
Once a mobile crawler has been transferred to a Web server, it can analyze each Web page before transmitting it to the search engine. This allows mobile crawlers to determine whether a page is relevant with respect to a particular application domain. Web pages considered relevant are stored within the crawler and are eventually transmitted over the network when the mobile crawler returns to its home.

[Figure 3 plots the savings function S(N), saved bandwidth in KB versus the number of pages.]

Figure 3: Localized data access savings function.

By looking at remote page selection from a more abstract point of view, it compares favorably with classical approaches in database systems. If we consider the Web as a large remote database, the task of a crawler is akin to querying this database in order to extract certain portions of it. In this context, the main difference between traditional and mobile crawlers is the way queries are issued. Traditional crawlers implement the data shipping approach of database systems because they download the whole database before they can issue the query which identifies the relevant portion. If the query is very specific (e.g., establish an index of all English health care pages), a major part of the database has been downloaded without being useful in any way. In contrast to this, mobile crawlers use the query shipping approach of database systems because all the information needed to identify the relevant data portion is transferred directly to the data source along with the mobile crawler. After the query has been executed remotely, only the query result is transferred over the network and can be used to establish the desired index. Since the query shipping approach has proven to be more efficient in the context of database systems (e.g., SQL servers), we consider its application in the context of Web crawling to be superior to the traditional approach. As done in the previous section, let us look at the formulas describing the situation. The formulas given below focus on the remote page selection feature and neglect the savings in network load due to localized data access as analyzed in the previous section. The percentage of pages considered relevant by the mobile crawler is expressed as an additional factor SF in the formulas:

  L_traditional(N) = N * s_page
  L_mobile(N, SF) = (SF * N) * s_page + 2 * s_crawler

Note that the load function for mobile crawlers is a function in N and SF since the network load depends on both.
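Both the localized-access savings function from the previous section and the selection savings implied by these load functions can be checked numerically. The script below is our own sketch, using the average sizes quoted in the text (6KB pages, 1.5KB crawler, 0.1KB HTTP messages):

```python
S_PAGE, S_CRAWLER = 6.0, 1.5          # KB, averages used in the text
S_REQUEST = S_RESPONSE = 0.1          # KB, average HTTP message sizes

def savings_local_access(n):
    # Sec. 3.1: S(N) = N*(s_request + s_response) - 2*s_crawler
    return n * (S_REQUEST + S_RESPONSE) - 2 * S_CRAWLER

def savings_selection(n, sf):
    # Sec. 3.2: S(N, SF) = (1 - SF)*N*s_page - 2*s_crawler
    return (1 - sf) * n * S_PAGE - 2 * S_CRAWLER

def break_even(savings):
    # Smallest N with strictly positive savings (tolerance guards float noise).
    return next(n for n in range(1, 10_000) if savings(n) > 1e-9)

print(break_even(savings_local_access))                   # 16: pays off once N exceeds 15
print(break_even(lambda n: savings_selection(n, 0.95)))   # 11: pays off once N exceeds 10
print(break_even(lambda n: savings_selection(n, 0.80)))   # 3: pays off once N exceeds 2
```

The three break-even points match the figures discussed in the text.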
The selection factor SF in the second formula relates to N since it scales the number of pages transmitted over the network. Again, we are interested in how soon mobile crawlers start to outperform traditional crawlers and derive the savings function S(N, SF), which this time is a function in N and SF:

  S(N, SF) = L_traditional(N) - L_mobile(N, SF) = (1 - SF) * N * s_page - 2 * s_crawler

Assuming the same average numbers for crawler and page size, we can derive the following linear savings function:

  s_crawler = 1.5KB, s_page = 6KB
  S(N, SF) = 6KB * N * (1 - SF) - 3KB

Figure 4 depicts the savings function S(N, SF) for some example values of SF. Note the exponential scale of the X-axis. The values chosen for SF are conservative in that they assume that 80-95% of the crawled pages are considered to be relevant. Depending on the degree of specialization of the underlying search engine for which the crawler is working, the actual values can be significantly smaller.

[Figure 4 plots the savings function S(N, SF), saved bandwidth in KB versus the number of pages, for several values of SF.]

Figure 4: Remote page selection savings function.

Based on Figure 4, we see that the break-even point (in terms of number of pages) for mobile crawlers heavily depends on the percentage of pages considered relevant by the crawler. For example, assuming the crawler considers 95% of the crawled pages as relevant, a mobile crawler will operate more efficiently than a traditional one as soon as the number of pages crawled exceeds 10. A mobile crawler considering only 80% of the pages as relevant reaches this point as soon as more than 2 pages are crawled.

3.3 Remote Page Filtering

Remote page filtering extends the concept of remote page selection to the contents of a Web page. The idea behind remote page filtering is to allow the crawler to control the granularity of the data it retrieves. With stationary crawlers, the granularity of retrieved data is the Web page itself. This is because HTTP only allows page-level access. For this reason, stationary crawlers always have to retrieve an entire page before the indexer can extract the relevant page portion. Depending on the ratio of relevant to irrelevant information, significant portions of network bandwidth are wasted by transmitting useless data. A mobile crawler addresses this problem through its ability to operate directly at the data source. After retrieving a page, a mobile crawler can filter out all irrelevant content, keeping only information which is relevant to the search engine for which the crawler is working.
To get an idea of how significant the potential savings are, we have again derived the formulas that describe the situation. First, we establish the network load functions for the traditional and the mobile crawling approach. For simplicity, the overhead due to HTTP request and response messages is neglected. As before, s_x denotes object size.

  L_traditional(N) = N * s_page
  L_mobile(N, FF) = N * (FF * s_page) + 2 * s_crawler

As in the last section, the load function for the mobile crawling approach is a function in the number of crawled pages N and a factor FF (filter factor). Here, the factor FF states the percentage of the page contents used to represent the page. To estimate the potential savings due to mobile crawling we derive the savings function:

  S(N, FF) = L_traditional(N) - L_mobile(N, FF) = (1 - FF) * N * s_page - 2 * s_crawler

This savings function is very similar to the result of the last section. This is due to the fact that page filtering is represented in the formulas the same way as page selection. The difference is that page filtering scales the size of the pages and not the number of pages (of course, this does not make a difference from the mathematical point of view). We set the parentheses in the formulas to emphasize this. Using the same values for page and crawler size as before, we get the following linear savings function:

  s_crawler = 1.5KB, s_page = 6KB
  S(N, FF) = N * ((1 - FF) * 6KB) - 3KB

As in the last section, the savings due to remote page filtering depend heavily on the filter factor FF. Since the filter factor has the same impact on the saved bandwidth as the selection factor in the last section, we do not provide a separate figure for the calculated savings here. By replacing the selection factor SF with the filter factor FF, Figure 4 also depicts the savings function for remote page filtering for a search engine which extracts between 80 and 95% of the page contents for the index. As in the case of remote page selection, these factors are conservative estimates. To get a feel for the magnitude of the filter factor FF, consider a search engine index which relies on the page URL, page title, and a set of page keywords only. Assuming a page URL length of 60 characters, a page title length of 80 characters, and a set of 15 keywords with an average size of 10 characters each, the mobile crawler needs to keep only 290 bytes of the total page.
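The arithmetic behind this worked example is simply:

```python
# Compact page representation from the example above.
url_bytes     = 60        # average URL length
title_bytes   = 80        # average title length
keyword_bytes = 15 * 10   # 15 keywords of 10 characters each

kept = url_bytes + title_bytes + keyword_bytes
print(kept)                      # 290 bytes kept per page
print(round(kept / 6144, 3))     # FF ~ 0.047, i.e. about 5% of a 6KB page
```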
This equals a filter factor of about 5%, assuming an average page size of 6KB. With such a high filtering degree, a mobile crawler will always operate more efficiently than a traditional one. Therefore, remote page filtering is especially useful for search engines which use a specialized representation of Web pages (e.g., URL, title, modification date, keywords) instead of storing the complete page contents.

3.4 Remote Page Compression

Remote page selection and filtering are two important features that directly reduce the network traffic caused by Web crawlers. Both perform well in the context of specialized search engines, which cover only a certain portion of the Web. However, the situation is different for a crawler which is supposed to establish a comprehensive fulltext index of the Web. In this case, as many Web pages as possible need to be retrieved by the crawler. Techniques like remote page selection and filtering are not applicable in such cases since every page is considered relevant. In order to reduce the amount of data that has to be transmitted back to the search engine, we introduce remote page compression as an additional feature of mobile crawlers. Once a mobile crawler has finished crawling a Web site, it has identified a set of relevant Web pages, which are kept in the crawler's data repository. In order to reduce the bandwidth required to transfer the crawler along with the data it contains back to the search engine, the mobile crawler applies compression techniques to reduce its size prior to transmission. As before, we estimate the benefits by deriving the load functions

describing the network load caused by traditional and mobile crawlers. This time we focus on the effects of page compression and neglect the other features:

  L_traditional(N) = N * s_page
  L_mobile(N, CR) = CR * (N * s_page + 2 * s_crawler)

The load function for mobile crawlers is a function in the number of crawled pages N and the achieved compression ratio CR. The load function states that the compression ratio applies to both the pages and the crawler code. This is due to the fact that our prototype system compresses not only the crawled pages but also the crawler code itself. To estimate the benefits of remote page compression we derive the savings function:

  S(N, CR) = L_traditional(N) - L_mobile(N, CR) = (1 - CR) * N * s_page - 2 * CR * s_crawler

Using our average numbers for page and crawler size, we get the following linear representation of the savings function:

  s_crawler = 1.5KB, s_page = 6KB
  S(N, CR) = N * (1 - CR) * 6KB - CR * 3KB

Figure 5 depicts the savings function for some typical compression ratios.

[Figure 5 plots the savings function S(N, CR), saved bandwidth in KB versus the number of pages, for several compression ratios CR.]

Figure 5: Remote page compression savings function.

The diagram shows that remote page compression makes mobile crawling an attractive approach even for traditional search engines, which do not benefit from remote page selection and filtering due to their comprehensive fulltext-indexing scheme.

3.5 Combined Benefits

So far, we have examined the benefits of mobile crawling in an isolated fashion, neglecting the fact that the above parameters must be examined in combination in order to get a more realistic picture of the performance of mobile crawling. For example, remote page compression can always be activated and will further reduce the network load independently of any other features. The set of features which can be used by a mobile crawler to reduce the total network load depends on the type of search engine for which the crawler is working. For mobile crawlers, two characteristic features of search engines are

important: the first is whether the search engine tries to establish a comprehensive Web index; the second is whether the search engine relies on a fulltext index or a compact index. Table 1 summarizes the applicable crawling features for some typical search engine types.

  Search engine type                             Localized     Remote      Remote      Remote
                                                 data access   selection   filtering   compression
  Comprehensive coverage, fulltext index         Yes           No          No          Yes
  Comprehensive coverage, specialized index      Yes           No          Yes         Yes
  Subject-specific coverage, fulltext index      Yes           Yes         No          Yes
  Subject-specific coverage, specialized index   Yes           Yes         Yes         Yes

  Table 1: Applicable mobile crawling features by search engine type.

Given these combinations of applicable features, we derive general load functions based on the individual load functions identified in Sections 3.1 through 3.4. The general load function will serve as the baseline for an analytical comparison of mobile and traditional crawling techniques:

  L_traditional(N) = N * (s_request + s_response + s_page)
  L_mobile(N, CR, SF, FF) = CR * (SF * N * (FF * s_page) + 2 * s_crawler)

Deriving our savings function from these load functions leads to the following equation:

  S(N, CR, SF, FF) = L_traditional(N) - L_mobile(N, CR, SF, FF)
                   = N * (s_request + s_response + (1 - CR * SF * FF) * s_page) - 2 * CR * s_crawler

Although this function looks complicated, we can limit our analysis to the portion containing the filter, selection, and compression factors, which scale the size of the pages to be transmitted. By neglecting the less significant terms (the message sizes s_request and s_response and the constant crawler term) and by assuming an average page size of 6KB, we can derive a simplified yet sufficiently exact version of the savings function:

  S(N, CR, SF, FF) ~ N * (1 - CR * SF * FF) * 6KB

This simplified version of the savings function allows us to estimate the benefits of mobile crawling for the general search engine types identified in Table 1. Table 2 shows the benefits of mobile crawling in terms of bandwidth saved for some typical values of CR, SF, and FF.
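Per page, the simplified function says that a fraction 1 - CR*SF*FF of the 6KB is saved. As a sketch of ours (not the authors' code): for a comprehensive fulltext index nothing can be selected away or filtered out (SF = FF = 1), so a 1:3 compression ratio alone already accounts for the roughly 66% savings reported in Table 2:

```python
def fraction_saved(cr, sf, ff):
    # Simplified per-page savings fraction: 1 - CR*SF*FF
    return 1 - cr * sf * ff

# Comprehensive coverage, fulltext index: SF = FF = 1, CR = 1:3.
print(round(fraction_saved(1 / 3, 1.0, 1.0), 2))   # 0.67, i.e. roughly the 66% row of Table 2
```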
The results in Table 2 suggest that a significant portion of the network bandwidth required for Web crawling can be saved when using mobile instead of traditional crawlers. Based on our analysis and the data we gathered using our prototype system, we claim that mobile crawlers easily outperform traditional crawlers in terms of network efficiency. Most importantly, this claim holds independent of the type of search engine for which the crawler is working. The result of our analysis shows that mobile crawlers do very well when used in the context of specialized search engines since all mobile crawling specific features are applicable at the same time. Here, mobility not only has an impact in the space domain (i.e., network bandwidth saved) but also in the time domain (i.e., the time needed to finish crawling a site). This is due to the fact that the network is still a bottleneck when crawling the Web. By significantly reducing the amount of data to be transmitted over the network, the time needed for data transmission, and therefore the time needed to finish crawling, decreases.

Search engine type                             SF (selection factor)   FF (filter factor)   CR (compression ratio)   Bandwidth saved
Comprehensive coverage, fulltext index         1.0                     1.0                  1:3                      66%
Comprehensive coverage, specialized index      1.0                     0.3                  1:3                      90%
Subject specific coverage, fulltext index      0.3                     1.0                  1:3                      90%
Subject specific coverage, specialized index   0.3                     0.3                  1:3                      97%

Table 2: Bandwidth savings due to mobile crawlers by search engine type.

4. An Architecture for Mobile Web Crawling

Given the current Web infrastructure, mobile crawlers are not applicable yet. This is due to the fact that the cornerstones of our approach, migration and remote execution of crawler code, are not supported by the current Web infrastructure. In order to confirm the results of our analysis and to establish a proof of concept, we implemented a prototype system which provides the infrastructure required by mobile crawlers. The implemented prototype system extends the current Web architecture with a distributed runtime system for mobile crawlers and provides some additional components for the management and control of mobile crawlers. The overall system architecture and essential system components of the prototype system are depicted in Figure 6.

[Figure 6 shows the distributed crawler runtime environment (HTTP servers on the network, each augmented with a virtual machine and a communication subsystem) connected to the application framework architecture (inbox, outbox, crawler specifications, crawler manager, archive manager, query engine, command manager, connection manager, and an SQL database).]

Figure 6: System architecture overview.

The architecture depicted in Figure 6 consists of two major parts. The first part, the distributed runtime environment, provides the base functionality for the transfer and the execution of mobile crawler code. The runtime environment establishes a distributed execution environment in which application specific mobile crawlers can operate.
The second major part of the system, the application framework architecture, serves as an application independent interface to the distributed runtime environment. The framework architecture provides functionality for mobile crawler creation and

management. In addition, the framework provides a query interface, which allows applications to access the data retrieved by mobile crawlers. Since a detailed discussion of the prototype's architecture is beyond the scope of this paper, we only provide an overview of the most essential system components. For an in-depth discussion of the system components and their implementation, refer to [FIE98].

4.1 Mobile Crawlers

In our prototype, mobile crawlers serve as mobile containers for the crawling algorithm as well as for the collected data. To provide real mobility, a crawler needs to be able to save its runtime state, transfer it over the network, and restore it at the remote location. For interoperability, crawlers need to use a machine independent representation for their runtime state. Since this kind of interoperability is difficult to achieve, we decided to minimize the runtime state needed by our crawlers as much as possible. As a result, we decided to specify crawler programs based on rules and facts. The execution of a crawler program is equivalent to applying rules upon the facts inside the crawler's knowledge base. The advantage of this approach is that rule based programs do not have a real runtime state. With carefully designed rules, the runtime state of the crawler program can be represented by facts only. Thus, saving the runtime state of our crawlers involves stopping the rule application process and saving the current fact base. Once this is finished, the crawler can migrate since all relevant data (rules and facts) are now represented as simple ASCII strings within the crawler. The crawler object, which carries the rules and facts, migrates using the object serialization facilities built into the Java language.

4.2 Virtual Machine

The virtual machine is the heart of the distributed runtime environment. Its main purpose is to provide an environment in which crawler code received through the network can be executed. Since crawler programs are specified based on rules, we can model our virtual machine using an inference engine, which takes care of the rule application process.
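The state model described above, a crawler whose rules and facts are plain strings, requires very little machinery. A minimal sketch (the names are ours; the real prototype wraps the Jess engine rather than bare lists):

```java
import java.io.*;
import java.util.*;

// Illustrative sketch: a crawler whose entire state is its rules and
// facts, both plain strings, migrated via Java object serialization.
public class MobileCrawler implements Serializable {
    final List<String> rules = new ArrayList<>();  // the crawler program
    final List<String> facts = new ArrayList<>();  // doubles as runtime state

    // Suspend the crawler and capture it as a byte stream for transfer.
    byte[] migrate() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(this);  // rules and facts travel together
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reconstruct the crawler at the remote virtual machine.
    static MobileCrawler arrive(byte[] payload) {
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(payload))) {
            return (MobileCrawler) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Migration then reduces to shipping the byte array produced by migrate() to the remote host and calling arrive() there.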
To start the execution of a crawler, we initialize the inference engine with the rules and facts of the crawler to be executed. Starting the rule application process of the inference engine is equivalent to starting crawler execution. Once the rule application has finished (either because no rule is applicable or due to an external signal), the rules and facts now stored in the inference engine are extracted and stored back in the crawler. Thus, an inference engine establishes a virtual machine with respect to the crawler. The concrete implementation of our virtual machine uses an extended version of the Jess inference engine [FRI97], which in turn is basically a Java port of the well-known CLIPS system [GIA97].

4.3 Query Engine

The query engine is part of the application framework architecture and is responsible for the communication between crawler and application. Since our mobile crawlers are application independent, they have no information about the semantics of the data they retrieve. In order to use the retrieved data within an application, the data needs to be extracted from the crawler's information base. To provide efficient access to this information, we implemented a query engine, which evaluates application specific queries upon the information base. The query result is represented as structured data tuples, very similar to relational database systems. Since the information base consists of facts generated by the rule based crawler program, the query engine implementation is based on the same inference engine used for the virtual machine implementation. Application specific queries are translated into special query rules, which identify matching facts within the information base.

4.4 Prototype and Lessons Learned

The prototype system based on the architecture outlined in the previous sections has been implemented and is fully operational within an experimental testbed at the University of Florida at

Gainesville. The prototype is used as a proof of concept and to further evaluate our approach to Web crawling. For details, refer to [FIE98].

One of the most important requirements for our prototype implementation was the ability to run on multiple host platforms (code interoperability). Specifically, the runtime system has to provide a common environment to mobile crawlers while running on different platforms, operating systems, and Web servers. To achieve the required platform independence, we implemented the prototype using Java. So far, the crawler and runtime environment have been successfully tested on Unix and Windows machines. For example, within the University of Florida Intranet, mobile crawlers successfully migrated between host servers running the Unix and Windows operating systems and collected data sets managed by more than ten different Web servers across campus. We are currently in the process of extending our prototype system with new components that address the critical issues identified in Section 2.2. Our focus here is on improving the security and stability of the distributed runtime environment. We identified this as the most crucial point to be addressed prior to using mobile crawlers in a real network environment. Once this is done, we plan to install our runtime environment on numerous Web servers outside of the University of Florida in order to evaluate our crawling approach in a broader context using mobile crawler enabled servers. We will report on the results of these tests in future reports.

5. Related Research

Previous work related to this paper falls into four categories.

Search Engine Technology. Due to the short history of search engines, there has been little time to research this technology area.
One of the first papers in this area introduced the architecture of the World Wide Web Worm [MCB94] (one of the first search engines for the Web) and was published in 1994. Between 1994 and 1997, the first experimental search engines were followed by larger commercial engines such as WebCrawler [PIN94, WebCrawler], Lycos [Lycos, MAU97], AltaVista [AltaVista], Infoseek [Infoseek], Excite [Excite], and HotBot [HotBot]. Due to their commercial orientation, there is very little information available about these search engines and their underlying technology. Only two papers about architectural aspects of WebCrawler and Lycos are publicly available on the Web. The Google project [BRI97] at Stanford University recently brought large-scale search engine research back into the academic domain.

Web Crawling Research. A good source of information on Web crawling techniques is the Stanford Google project [BRI97], which we used as our primary architecture model. Based on the Google project, researchers at Stanford compared the performance of different crawling algorithms and the impact of URL ordering [CHO97] on the crawling process. A comprehensive Web directory providing information about crawlers developed for different research projects can be found on the robots homepage [KOS97]. Another project, which investigates Web crawling and Web indices in a broader context, is the Harvest project [BOW95]. Harvest supports resource discovery through topic-specific indexing made possible by an efficient distributed information gathering architecture. Harvest can therefore be seen as a base architecture upon which different resource discovery tools (e.g., search engines) can be built. A major goal of the Harvest project is the reduction of network and server load associated with the creation of Web indices. To address this issue, Harvest uses distributed crawlers (called gatherers) which can be installed at the site of the information provider to create and maintain a provider specific index.
The indices of different providers are then made available to external resource discovery systems by so-called brokers, which can use multiple gatherers (or even other brokers) as their information base.

Besides technical aspects, there is a social aspect to Web crawling, too. A Web crawler consumes significant network resources by accessing Web documents at a fast pace. More importantly, by downloading the complete contents of a Web server, a crawler might significantly hurt the performance of the server. For this reason, Web crawlers have earned a bad reputation, and their usefulness is sometimes questioned, as discussed by Koster [KOS95]. To address this problem, a set of guidelines for crawler developers has been published [KOS93]. In addition to these general guidelines, a specific Web crawling protocol, the Robot Exclusion Protocol [KOS96], has been proposed by the same author. This protocol enables webmasters to specify to crawlers which pages not to crawl. However, this protocol is not yet enforced, and Web crawlers implement it on a voluntary basis only.

Rule Based Systems. An example of a rule based system is CLIPS [GIA97] (C Language Integrated Production System), a popular expert system developed by the Software Technology Branch at the NASA/Lyndon B. Johnson Space Center. CLIPS allows the development of software systems which model human knowledge and expertise by specifying rules and facts. Rule based software programs do not need an explicit static control structure because rules are used to dynamically reason about facts in order to respond appropriately to the current situation. In the context of our prototype system we use a Java version of CLIPS called Jess (Java Expert System Shell) [FRI97]. Jess provides the core CLIPS functionality and is implemented at the Sandia National Laboratories. The main advantage of Jess is that it can be used on any platform which provides a Java virtual machine, which is ideal for our purposes.

Mobile Code. Mobile code has become popular in the last couple of years, especially due to the development of Java [GOS96]. The best examples are Java applets, which are small pieces of code, downloadable from a Web server for execution on a client.
The form of mobility introduced by Java applets is usually called remote execution, since the mobile code is executed completely once it has been downloaded. Since Java applets do not return to the server, there is no need to preserve the state of an applet during the transfer. Thus, remote execution is characterized by stateless code transmission. Another form of mobile code, called code migration, is due to mobile agent research. With code migration it is possible to transfer the dynamic execution state along with the program code to a different location. This allows mobile agents to change their location dynamically without affecting the progress of the execution. Initial work in this area has been done by General Magic [WHI96]. Software agents are an active research area with many publications focusing on different aspects of agents such as agent communication, code interoperability, and agent system architecture. Some general information about software agents can be found in papers by Harrison [HAR96], Nwana [NWA96], and Wooldridge [WOO95]. Different aspects and categories of software agents are discussed by Maes ([MAE94] and [MAE95]). Communication aspects of mobile agents are the main focus of a paper by Finin [FIN94].

6. Conclusion

We have introduced an alternative approach to Web crawling based on mobile crawlers. The proposed approach surpasses the centralized architecture of current Web crawling systems by distributing the data retrieval process across the network. In particular, using mobile crawlers we are able to perform remote operations such as data analysis and data compression at the data source before the data is transmitted over the network. This allows for more intelligent crawling techniques and addresses the needs of applications which are only interested in certain subsets of the available data.
We have developed and implemented an application framework, which demonstrates our mobile Web crawling approach and allows applications to take advantage of mobile crawling.

The performance results of our approach are very promising. Mobile crawlers can significantly reduce the network load caused by crawling by reducing the amount of data transferred over the network. Mobile crawlers achieve this reduction in network traffic by performing data analysis and data compression at the data source. Therefore, mobile crawlers transmit only relevant information, in compressed form, over the network.

The prototype implementation of our mobile crawler framework provides an initial step towards mobile Web crawling. We have identified several issues which need to be addressed before mobile crawling can be used on a larger scale:

Security. Crawler migration and remote execution of crawler code cause severe security problems because a mobile crawler might contain harmful code. We suggest introducing an identification mechanism for mobile crawlers based on digital signatures. Based on this identification scheme, a system administrator would be able to grant execution permission to certain crawlers only, excluding crawlers from unknown (and therefore unsafe) sources. In addition to this, the crawler virtual machine needs to be secured such that crawlers cannot get access to critical system resources. This is already implemented in part due to the execution of mobile crawlers within the Jess inference engine. By restricting the functionality of the Jess inference engine, a secure sandbox scheme (similar to Java) can be implemented relatively easily.

Integration of the mobile crawler virtual machine into the Web. The availability of a mobile crawler virtual machine on as many Web servers as possible is crucial for the effectiveness of mobile crawling. This integration can be achieved through Java Servlets, for example, which extend Web server functionality with special Java programs. We realize, of course, that before this can be done, some effort has to be spent on standardizing the functionality of such runtime environments.

Research in mobile crawling algorithms. None of the current crawling algorithms have been designed with mobility in mind.
For this reason, it seems worthwhile to spend some effort on the development of new algorithms that take advantage of mobility. In particular, these algorithms have to deal with the loss of centralized control over the crawling process due to mobility.

References

[AltaVista] AltaVista, AltaVista Search Engine, WWW.
[BEL97] BellCore, Netsizer Internet Growth Statistics Tool, Bell Communication Research, 1997.
[BER96] Berners-Lee, T., Hypertext Transfer Protocol HTTP/1.0, RFC 1945, Network Working Group, 1996.
[BOW95] Bowman, C. M., Danzig, P. B., Hardy, D. R., Manber, U., Schwartz, M. F., Wessels, D. P., Harvest: A Scalable, Customizable Discovery and Access System, Technical Report, University of Colorado, Boulder, Colorado, USA, 1995.
[BRI97] Brin, S., Page, L., The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Science Department, Stanford University, Stanford, CA, USA, 1997.
[CHO97] Cho, J., Garcia-Molina, H., Page, L., Efficient Crawling Through URL Ordering, Computer Science Department, Stanford University, Stanford, CA, USA, 1997.
[Excite] Excite, Excite Search Engine, WWW.
[FIE98] Fiedler, J., Hammer, J., Using the Web Efficiently: Mobile Crawlers, Technical Report, University of Florida, Gainesville, FL, November 1998, ftp://ftp.dbcenter.cise.ufl.edu/pub/publications/mobile-Crawling.pdf.


More information

APPENDIX 3 Tuning Tips for Applications That Use SAS/SHARE Software

APPENDIX 3 Tuning Tips for Applications That Use SAS/SHARE Software 177 APPENDIX 3 Tuning Tips for Applications That Use SAS/SHARE Software Authors 178 Abstract 178 Overview 178 The SAS Data Library Model 179 How Data Flows When You Use SAS Files 179 SAS Data Files 179

More information

Advanced Solutions of Microsoft SharePoint 2013

Advanced Solutions of Microsoft SharePoint 2013 Course 20332A :Advanced Solutions of Microsoft SharePoint 2013 Page 1 of 9 Advanced Solutions of Microsoft SharePoint 2013 Course 20332A: 4 days; Instructor-Led About the Course This four-day course examines

More information

Around the Web in Six Weeks: Documenting a Large-Scale Crawl

Around the Web in Six Weeks: Documenting a Large-Scale Crawl Around the Web in Six Weeks: Documenting a Large-Scale Crawl Sarker Tanzir Ahmed, Clint Sparkman, Hsin- Tsang Lee, and Dmitri Loguinov Internet Research Lab Department of Computer Science and Engineering

More information

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho How to Crawl the Web Hector Garcia-Molina Stanford University Joint work with Junghoo Cho Stanford InterLib Technologies Information Overload Service Heterogeneity Interoperability Economic Concerns Information

More information

Lotus Sametime 3.x for iseries. Performance and Scaling

Lotus Sametime 3.x for iseries. Performance and Scaling Lotus Sametime 3.x for iseries Performance and Scaling Contents Introduction... 1 Sametime Workloads... 2 Instant messaging and awareness.. 3 emeeting (Data only)... 4 emeeting (Data plus A/V)... 8 Sametime

More information

How To Construct A Keyword Strategy?

How To Construct A Keyword Strategy? Introduction The moment you think about marketing these days the first thing that pops up in your mind is to go online. Why is there a heck about marketing your business online? Why is it so drastically

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Implementation Techniques

Implementation Techniques V Implementation Techniques 34 Efficient Evaluation of the Valid-Time Natural Join 35 Efficient Differential Timeslice Computation 36 R-Tree Based Indexing of Now-Relative Bitemporal Data 37 Light-Weight

More information

A scalable lightweight distributed crawler for crawling with limited resources

A scalable lightweight distributed crawler for crawling with limited resources University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2008 A scalable lightweight distributed crawler for crawling with limited

More information

A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS

A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS Satwinder Kaur 1 & Alisha Gupta 2 1 Research Scholar (M.tech

More information

Capacity Planning for Next Generation Utility Networks (PART 1) An analysis of utility applications, capacity drivers and demands

Capacity Planning for Next Generation Utility Networks (PART 1) An analysis of utility applications, capacity drivers and demands Capacity Planning for Next Generation Utility Networks (PART 1) An analysis of utility applications, capacity drivers and demands Utility networks are going through massive transformations towards next

More information

CHAPTER 6 SOLUTION TO NETWORK TRAFFIC PROBLEM IN MIGRATING PARALLEL CRAWLERS USING FUZZY LOGIC

CHAPTER 6 SOLUTION TO NETWORK TRAFFIC PROBLEM IN MIGRATING PARALLEL CRAWLERS USING FUZZY LOGIC CHAPTER 6 SOLUTION TO NETWORK TRAFFIC PROBLEM IN MIGRATING PARALLEL CRAWLERS USING FUZZY LOGIC 6.1 Introduction The properties of the Internet that make web crawling challenging are its large amount of

More information

DATABASE SCALABILITY AND CLUSTERING

DATABASE SCALABILITY AND CLUSTERING WHITE PAPER DATABASE SCALABILITY AND CLUSTERING As application architectures become increasingly dependent on distributed communication and processing, it is extremely important to understand where the

More information

Analysis of the effects of removing redundant header information in persistent HTTP connections

Analysis of the effects of removing redundant header information in persistent HTTP connections Analysis of the effects of removing redundant header information in persistent HTTP connections Timothy Bower, Daniel Andresen, David Bacon Department of Computing and Information Sciences 234 Nichols

More information

= a hypertext system which is accessible via internet

= a hypertext system which is accessible via internet 10. The World Wide Web (WWW) = a hypertext system which is accessible via internet (WWW is only one sort of using the internet others are e-mail, ftp, telnet, internet telephone... ) Hypertext: Pages of

More information

1 Connectionless Routing

1 Connectionless Routing UCSD DEPARTMENT OF COMPUTER SCIENCE CS123a Computer Networking, IP Addressing and Neighbor Routing In these we quickly give an overview of IP addressing and Neighbor Routing. Routing consists of: IP addressing

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Automated Path Ascend Forum Crawling

Automated Path Ascend Forum Crawling Automated Path Ascend Forum Crawling Ms. Joycy Joy, PG Scholar Department of CSE, Saveetha Engineering College,Thandalam, Chennai-602105 Ms. Manju. A, Assistant Professor, Department of CSE, Saveetha Engineering

More information

Performance Modeling and Evaluation of Web Systems with Proxy Caching

Performance Modeling and Evaluation of Web Systems with Proxy Caching Performance Modeling and Evaluation of Web Systems with Proxy Caching Yasuyuki FUJITA, Masayuki MURATA and Hideo MIYAHARA a a Department of Infomatics and Mathematical Science Graduate School of Engineering

More information

HONORS ACTIVITY #2 EXPONENTIAL GROWTH & DEVELOPING A MODEL

HONORS ACTIVITY #2 EXPONENTIAL GROWTH & DEVELOPING A MODEL Name HONORS ACTIVITY #2 EXPONENTIAL GROWTH & DEVELOPING A MODEL SECTION I: A SIMPLE MODEL FOR POPULATION GROWTH Goal: This activity introduces the concept of a model using the example of a simple population

More information

Immidio White Paper Things You Always Wanted To Know About Windows Profile Management

Immidio White Paper Things You Always Wanted To Know About Windows Profile Management Immidio White Paper Things You Always Wanted To Know About Windows Profile Management Abstract Why are Windows user profiles so critically important for corporate IT environments and how can they be managed

More information

How Turner Broadcasting can avoid the Seven Deadly Sins That. Can Cause a Data Warehouse Project to Fail. Robert Milton Underwood, Jr.

How Turner Broadcasting can avoid the Seven Deadly Sins That. Can Cause a Data Warehouse Project to Fail. Robert Milton Underwood, Jr. How Turner Broadcasting can avoid the Seven Deadly Sins That Can Cause a Data Warehouse Project to Fail Robert Milton Underwood, Jr. 2000 Robert Milton Underwood, Jr. Page 2 2000 Table of Contents Section

More information

Computer Fundamentals : Pradeep K. Sinha& Priti Sinha

Computer Fundamentals : Pradeep K. Sinha& Priti Sinha Computer Fundamentals Pradeep K. Sinha Priti Sinha Chapter 18 The Internet Slide 1/23 Learning Objectives In this chapter you will learn about: Definition and history of the Internet Its basic services

More information

Turbo King: Framework for Large- Scale Internet Delay Measurements

Turbo King: Framework for Large- Scale Internet Delay Measurements Turbo King: Framework for Large- Scale Internet Delay Measurements Derek Leonard Joint work with Dmitri Loguinov Internet Research Lab Department of Computer Science Texas A&M University, College Station,

More information

Implementation of Enhanced Web Crawler for Deep-Web Interfaces

Implementation of Enhanced Web Crawler for Deep-Web Interfaces Implementation of Enhanced Web Crawler for Deep-Web Interfaces Yugandhara Patil 1, Sonal Patil 2 1Student, Department of Computer Science & Engineering, G.H.Raisoni Institute of Engineering & Management,

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

Postgres Plus and JBoss

Postgres Plus and JBoss Postgres Plus and JBoss A New Division of Labor for New Enterprise Applications An EnterpriseDB White Paper for DBAs, Application Developers, and Enterprise Architects October 2008 Postgres Plus and JBoss:

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

LECTURE 11: Applications

LECTURE 11: Applications LECTURE 11: Applications An Introduction to MultiAgent Systems http://www.csc.liv.ac.uk/~mjw/pubs/imas 11-1 Application Areas Agents are usefully applied in domains where autonomous action is required.

More information

Tips and Guidance for Analyzing Data. Executive Summary

Tips and Guidance for Analyzing Data. Executive Summary Tips and Guidance for Analyzing Data Executive Summary This document has information and suggestions about three things: 1) how to quickly do a preliminary analysis of time-series data; 2) key things to

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

Categorizing Migrations

Categorizing Migrations What to Migrate? Categorizing Migrations A version control repository contains two distinct types of data. The first type of data is the actual content of the directories and files themselves which are

More information

SEO WITH SHOPIFY: DOES SHOPIFY HAVE GOOD SEO?

SEO WITH SHOPIFY: DOES SHOPIFY HAVE GOOD SEO? TABLE OF CONTENTS INTRODUCTION CHAPTER 1: WHAT IS SEO? CHAPTER 2: SEO WITH SHOPIFY: DOES SHOPIFY HAVE GOOD SEO? CHAPTER 3: PRACTICAL USES OF SHOPIFY SEO CHAPTER 4: SEO PLUGINS FOR SHOPIFY CONCLUSION INTRODUCTION

More information

ATA DRIVEN GLOBAL VISION CLOUD PLATFORM STRATEG N POWERFUL RELEVANT PERFORMANCE SOLUTION CLO IRTUAL BIG DATA SOLUTION ROI FLEXIBLE DATA DRIVEN V

ATA DRIVEN GLOBAL VISION CLOUD PLATFORM STRATEG N POWERFUL RELEVANT PERFORMANCE SOLUTION CLO IRTUAL BIG DATA SOLUTION ROI FLEXIBLE DATA DRIVEN V ATA DRIVEN GLOBAL VISION CLOUD PLATFORM STRATEG N POWERFUL RELEVANT PERFORMANCE SOLUTION CLO IRTUAL BIG DATA SOLUTION ROI FLEXIBLE DATA DRIVEN V WHITE PAPER Create the Data Center of the Future Accelerate

More information

A Novel Architecture of Ontology based Semantic Search Engine

A Novel Architecture of Ontology based Semantic Search Engine International Journal of Science and Technology Volume 1 No. 12, December, 2012 A Novel Architecture of Ontology based Semantic Search Engine Paras Nath Gupta 1, Pawan Singh 2, Pankaj P Singh 3, Punit

More information

3 Media Web. Understanding SEO WHITEPAPER

3 Media Web. Understanding SEO WHITEPAPER 3 Media Web WHITEPAPER WHITEPAPER In business, it s important to be in the right place at the right time. Online business is no different, but with Google searching more than 30 trillion web pages, 100

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. # 20 Concurrency Control Part -1 Foundations for concurrency

More information

The Internet Advanced Research Projects Agency Network (ARPANET) How the Internet Works Transport Control Protocol (TCP)

The Internet Advanced Research Projects Agency Network (ARPANET) How the Internet Works Transport Control Protocol (TCP) The Internet, Intranets, and Extranets 1 The Internet The Internet is a collection of interconnected network of computers, all freely exchanging information. These computers use specialized software to

More information

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management WHITE PAPER: ENTERPRISE AVAILABILITY Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management White Paper: Enterprise Availability Introduction to Adaptive

More information

August 14th - 18th 2005, Oslo, Norway. Web crawling : The Bibliothèque nationale de France experience

August 14th - 18th 2005, Oslo, Norway. Web crawling : The Bibliothèque nationale de France experience World Library and Information Congress: 71th IFLA General Conference and Council "Libraries - A voyage of discovery" August 14th - 18th 2005, Oslo, Norway Conference Programme: http://www.ifla.org/iv/ifla71/programme.htm

More information

RSA INCIDENT RESPONSE SERVICES

RSA INCIDENT RESPONSE SERVICES RSA INCIDENT RESPONSE SERVICES Enabling early detection and rapid response EXECUTIVE SUMMARY Technical forensic analysis services RSA Incident Response services are for organizations that need rapid access

More information

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2 Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,

More information

WHITE PAPER Cloud FastPath: A Highly Secure Data Transfer Solution

WHITE PAPER Cloud FastPath: A Highly Secure Data Transfer Solution WHITE PAPER Cloud FastPath: A Highly Secure Data Transfer Solution Tervela helps companies move large volumes of sensitive data safely and securely over network distances great and small. We have been

More information

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Indexing Week 14, Spring 2005 Edited by M. Naci Akkøk, 5.3.2004, 3.3.2005 Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Overview Conventional indexes B-trees Hashing schemes

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

JXTA for J2ME Extending the Reach of Wireless With JXTA Technology

JXTA for J2ME Extending the Reach of Wireless With JXTA Technology JXTA for J2ME Extending the Reach of Wireless With JXTA Technology Akhil Arora Carl Haywood Kuldip Singh Pabla Sun Microsystems, Inc. 901 San Antonio Road Palo Alto, CA 94303 USA 650 960-1300 The Wireless

More information

Introduction to the Internet and Web

Introduction to the Internet and Web Introduction to the Internet and Web Internet It is the largest network in the world that connects hundreds of thousands of individual networks all over the world. The popular term for the Internet is

More information

CHAPTER. The Role of PL/SQL in Contemporary Development

CHAPTER. The Role of PL/SQL in Contemporary Development CHAPTER 1 The Role of PL/SQL in Contemporary Development 4 Oracle PL/SQL Performance Tuning Tips & Techniques When building systems, it is critical to ensure that the systems will perform well. For example,

More information

THE HISTORY & EVOLUTION OF SEARCH

THE HISTORY & EVOLUTION OF SEARCH THE HISTORY & EVOLUTION OF SEARCH Duration : 1 Hour 30 Minutes Let s talk about The History Of Search Crawling & Indexing Crawlers / Spiders Datacenters Answer Machine Relevancy (200+ Factors)

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

CTW in Dasher: Summary and results.

CTW in Dasher: Summary and results. CTW in Dasher: Summary and results. After finishing my graduation thesis Using CTW as a language modeler in Dasher, I have visited the Inference group of the Physics department of the University of Cambridge,

More information

TN3270 AND TN5250 INTERNET STANDARDS

TN3270 AND TN5250 INTERNET STANDARDS 51-10-55 DATA COMMUNICATIONS MANAGEMENT TN3270 AND TN5250 INTERNET STANDARDS Ed Bailey INSIDE Enterprise Data and Logic; User Productivity and Confidence; Newer Platforms and Devices; How Standardization

More information

Part I: Future Internet Foundations: Architectural Issues

Part I: Future Internet Foundations: Architectural Issues Part I: Future Internet Foundations: Architectural Issues Part I: Future Internet Foundations: Architectural Issues 3 Introduction The Internet has evolved from a slow, person-to-machine, communication

More information

Assignment #2. Csci4211 Spring Due on March 6th, Notes: There are five questions in this assignment. Each question has 10 points.

Assignment #2. Csci4211 Spring Due on March 6th, Notes: There are five questions in this assignment. Each question has 10 points. Assignment #2 Csci4211 Spring 2017 Due on March 6th, 2017 Notes: There are five questions in this assignment. Each question has 10 points. 1. (10 pt.) Design and describe an application-level protocol

More information

Quest Central for DB2

Quest Central for DB2 Quest Central for DB2 INTEGRATED DATABASE MANAGEMENT TOOLS Supports DB2 running on Windows, Unix, OS/2, OS/390 and z/os Integrated database management components are designed for superior functionality

More information