USING THE WEB EFFICIENTLY: MOBILE CRAWLERS


Jan Fiedler and Joachim Hammer
University of Florida, Gainesville, FL

ABSTRACT

Search engines have become important tools for Web navigation. In order to provide powerful search facilities, search engines maintain comprehensive indices of documents available on the Web. The creation and maintenance of Web indices is done by Web crawlers, which recursively traverse and download Web pages on behalf of search engines. Analysis of the collected information is performed after the data has been downloaded. In this research, we propose an alternative, more efficient approach to building Web indices based on mobile crawlers. Our proposed crawlers are transferred to the source(s) where the data resides in order to filter out any unwanted data locally before transferring it back to the search engine. This reduces network load and speeds up the indexing phase inside the search engine. Our approach to Web crawling is particularly well suited for implementing so-called smart crawling algorithms, which determine an efficient crawling path based on the contents of the Web pages that have been visited so far. In order to demonstrate the viability of our approach, we have built a prototype for mobile Web crawling that uses the University of Florida Intranet as its testbed. Based on this experimental prototype, we provide a cost-benefit analysis that outlines the advantages of mobile Web crawling.

1. Introduction

The World Wide Web (Web) is a large distributed hypertext system, consisting of an estimated 1.6 million sites as of April. Due to its distributed and decentralized structure, virtually anybody with access to the Web can add new documents, links, and even servers. As a consequence, the Web changes constantly; for example, [KAH96] estimates that 40% of the Web contents change within a month. The Web also lacks structure: users navigate within this large information system by following hypertext links, which connect different resources with one another.
One of the shortcomings of this navigation approach is that it requires the user to traverse a possibly significant portion of the Web before finding a particular resource (e.g., a document which matches certain criteria). Considering the growth rate of the Web, locating relevant information in a timely manner is becoming increasingly difficult. The current approach to Web searching is to create and maintain indices for Web pages, much like the indices of a library catalog or the access paths for tuples in a database. However, before the pages can be indexed they must first be collected and returned to the indexing engine. This is done by Web crawlers, which systematically traverse the Web using various crawling algorithms (e.g., breadth-first, depth-first). The pages are downloaded to a search engine, which parses the text and creates and stores the index. For examples of Web search engines see Google [BRI97], AltaVista [AltaVista], Infoseek [Infoseek], etc. However, there are several problems associated with this method of indexing, which are due to the rapidly growing number of Web pages that have to be indexed on the one hand, and the relatively slow increase in network bandwidth on the other. To illustrate these problems, let us look at the following statistics. Within the last couple of years, search engine technology had to scale dramatically to keep up with the growing amount of information available on the Web: from 110,000 pages in 1994 to over 110 million pages in 1998 [SUL98], an increase by a factor of 1,000 in only 4 years!

(Author's current address: Intershop Communications GmbH, Leutragraben 2-4, Jena, Germany.)

And the Web is expected to

continue to grow in a rapid fashion, doubling in size (in terms of number of pages) every few months [KAH96]. By projecting this trend into the near future, we expect that a comprehensive Web index will soon have to contain about 1 billion pages. In addition, Kahle [KAH96] reports that the average online time of a page is only 75 days, which leads to an update rate of 600GB per month. Although updating pages does not necessarily affect the total index size, it nevertheless causes an index to age rapidly. Thus, in order to keep the indices of a search engine up to date, crawlers must constantly retrieve Web pages as fast as possible. According to [SUL98], the Web crawlers of commercial search engines crawl up to 10 million pages per day. Assuming an average page size of 6K [BRI97], the crawling activities of a single commercial search engine add a daily retrieval load of 60GB to the ongoing Web activities. Given this explosive growth, we see the following specific problems with the way current search engines index the Web:

Scaling. The concept of download-first-and-index-later will likely not scale given the limitations in the infrastructure and the projected growth rate of the Web. Using the estimates for the growth of Web indices provided in [SUL98], a Web crawler running in the year 2000 would have to retrieve Web data at a rate of 45Mbit per second in order to download the estimated 480GB of pages per day that are necessary to maintain the index. Looking at the fundamental limitations of storage technology and communication networks, it is highly unlikely that Web indices of this size can be maintained efficiently.

Efficiency. Current search engines add unnecessary traffic to the already overloaded Internet. While current approaches are the only alternative for general-purpose search engines trying to build a comprehensive Web index, there are many scenarios where it is more efficient to download and index only selected pages.
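The load figures quoted above are easy to sanity-check. The short script below (ours, not from the paper) recomputes the daily retrieval volume and the sustained link rate implied by them:

```python
# Sanity check of the crawling-load figures cited above:
# 10 million pages/day at 6 KB/page, and a 480 GB/day index refresh.

def daily_load_gb(pages_per_day, page_kb):
    """Total data downloaded per day, in (decimal) GB."""
    return pages_per_day * page_kb * 1e3 / 1e9

def required_mbit_per_s(gb_per_day):
    """Sustained link rate needed to move gb_per_day in 24 hours."""
    return gb_per_day * 1e9 * 8 / (24 * 3600) / 1e6

print(daily_load_gb(10_000_000, 6))          # 60.0 GB/day, the figure cited
print(round(required_mbit_per_s(480), 1))    # 44.4 Mbit/s, i.e. roughly the 45 Mbit/s cited
```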
We call these systems specialized search engines and justify their usefulness in this paper.

Quality of Index. The results of Web searches are overwhelming and require the user to act as part of the query processor. Current commercial search engines maintain Web indices of up to 110 million pages [SUL98] and easily find several thousand matches for an average query. Thus, increasing the size of the Web index does not automatically improve the quality of the search results if it simply causes the search engine to return twice as many matches to a query as before. Since we cannot limit the number of pages on the Web, we have to find ways to improve the search results in such a way that we can accommodate the rapid growth of the Web. Therefore, we expect a new generation of specialized search engines to emerge in the near future. These specialized engines may challenge the dominance of today's general search engines by providing superior search results for specific subject areas, using a variety of new technologies from areas such as data mining, data visualization, and graph theory. In addition, we believe that a new crawling approach is needed to improve the efficiency of data collection when used in combination with specialized search engines.

The outline of the paper is as follows. We discuss the general idea behind our approach and describe its advantages using an example in Sec. 2. In Sec. 3 we analyze the costs and potential benefits of mobile crawling. While we do not claim that this approach is a solution for all search engines, we believe that many applications can benefit from mobile crawling. In Sec. 4 we outline our proposed architecture, which supports mobile Web crawling, and describe the experimental prototype that we have built to verify our results. In Sec. 5 we describe relevant related research and conclude the paper in Sec. 6.

2. A Mobile Approach to Web Crawling

In this paper, we propose an alternative approach to Web crawling based on mobile crawlers. Crawler mobility allows for more sophisticated crawling algorithms [CHO97] and avoids some of the inefficiencies associated with the brute-force strategies exercised by current crawlers. We see mobile crawling as an efficient, scalable solution for establishing a specialized search index in the highly distributed, decentralized and dynamic environment of the Web. By looking at the general anatomy of traditional search engines, we realize that their architecture is strictly centralized, while the data being accessed by search engine crawlers is highly distributed. This centralized architecture is a mismatch for the distributed organization of the Web because it requires data to be downloaded before it can be processed. The shortcomings of the current approach become clear when we look at specialized search engines which are only interested in certain Web pages. If we use a stationary crawler, we download many pages which are discarded immediately because they do not meet the subject area of the search engine. Obviously, this behavior is not very desirable because network bandwidth is wasted by downloading irrelevant information. The Web crawling approach described in this paper addresses this problem by making the data retrieval component, namely the Web crawler, distributed. We define mobility in the context of Web crawling as the ability of a crawler to transfer itself to each Web server of interest before collecting pages on that server. After completing the collection process on a particular server, the crawler, together with the collected data, moves to the next server or back to its home system. Mobile crawlers are managed by a crawler manager, which supplies each crawler with a list of target Web sites and monitors the location of each crawler. This is necessary in order to intervene in case one or more crawlers happen to interfere with each other (i.e., crawl the same Web space).
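The paper does not prescribe a concrete interface for this division of labor, but it might be sketched as follows. All class and method names here are our own illustration, and "migration" is simulated by ordinary method calls rather than actual code shipping to a remote runtime:

```python
# Illustrative sketch of the crawler manager / mobile crawler split.
# All names are hypothetical, not part of the authors' prototype.

class MobileCrawler:
    def __init__(self, crawl_fn, targets):
        self.crawl_fn = crawl_fn        # crawling strategy uploaded by the manager
        self.targets = list(targets)    # seed URLs assigned to this crawler
        self.pages = []                 # pages collected while "on" remote hosts

    def run(self):
        # Visit each assigned site in turn, then return home with the data.
        for site in self.targets:
            self.pages.extend(self.crawl_fn(site))
        return self.pages

class CrawlerManager:
    def dispatch(self, crawl_fn, seed_urls, n_crawlers):
        # Hand each crawler a disjoint slice of the seed list so that no
        # two crawlers traverse the same Web space.
        return [MobileCrawler(crawl_fn, seed_urls[i::n_crawlers])
                for i in range(n_crawlers)]
```

For instance, dispatching two crawlers over four seed sites assigns sites 1 and 3 to the first crawler and sites 2 and 4 to the second, so their crawl spaces never overlap.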
However, the crawl strategy and the path taken are controlled separately by each crawler through its crawling algorithm. In addition, the crawler manager provides the necessary functionality for extracting the collected data from the crawler for use by the indexer (see Sec. 4). Figure 1 provides an overview of mobile Web crawling.

[Figure 1 shows the crawler manager and index at the search engine, with mobile crawlers migrating to HTTP servers on remote hosts across the Web.]

Figure 1: An overview of mobile Web crawling.

2.1 A Mobile Crawling Example

In order to demonstrate the capabilities of mobile crawling, let us look at an example. Consider a search engine which wants to support high-quality searches for a particular application domain (e.g., preventive health care, gardening, sports) by building an index of relevant Web pages. The creation and maintenance of a suitable index to support such a specialized search engine using a traditional

crawling approach is highly inefficient. This is due to the fact that traditional crawlers must download much more data than is effectively used (in the worst-case scenario, the whole Web). In contrast, in our approach, a mobile crawler is sent to each Web source that is expected to contain relevant information in order to perform a local pre-selection of pages. Initially, the crawler obtains a list of target locations from the crawler manager. These addresses are referred to as seed URLs since they indicate the beginning of the crawling process. In addition, the manager also uploads the crawling strategy into the crawler in the form of a program. This program tells the crawler which pages are considered relevant and should be collected. In addition, it also generates the path through the Web site. As we have mentioned before, a considerable amount of research has focused on optimal crawling strategies [CHO97]. We see our mobile crawling approach as a complement to this work by providing an efficient infrastructure for implementing crawling strategies. Returning to our example, we now briefly describe the events that take place during the crawling process, using a sample crawling algorithm as a guideline. A more detailed explanation can be found in the full version of this paper, which is available as a technical report [FIE98] from our ftp server at ftp.dbcenter.cise.ufl. Before the actual crawling begins, the crawler must migrate to a specific remote site using one of the seed URLs as the target address. After the crawler has successfully migrated to the remote host, the crawling algorithm is executed. This part of mobile crawling is very similar to traditional crawling techniques since pages are retrieved and analyzed recursively. In fact, any breadth-first or depth-first crawling strategy can be employed. However, data collection can be sped up considerably by using more sophisticated ("smart") algorithms.
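As a concrete (if toy) rendering of such a crawling algorithm, the sketch below runs a breadth-first crawl of a single site the way a mobile crawler would execute it locally on the remote host. The site model and all names are our own illustration; `relevant` stands for the relevance test uploaded by the crawler manager:

```python
from collections import deque

def crawl_site(site, seed, relevant):
    """Breadth-first crawl of one site, keeping only relevant pages.

    `site` models the server as a dict: url -> (content, outgoing links).
    """
    seen, queue, collected = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        content, links = site[url]
        if relevant(content):
            collected.append((url, content))       # local pre-selection
        for link in links:
            if link in site and link not in seen:  # stay on this server
                seen.add(link)
                queue.append(link)
    return collected

# Toy site for a health-care search engine:
site = {
    "/": ("welcome", ["/care", "/sports"]),
    "/care": ("health care tips", []),
    "/sports": ("sports scores", []),
}
print(crawl_site(site, "/", lambda text: "health" in text))
# [('/care', 'health care tips')]
```

Only the pre-selected pages travel back over the network; the irrelevant ones are discarded at the source.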
Since these crawling strategies need access to the contents of the crawled pages in order to optimize their crawl path, mobile crawlers are particularly well suited for implementing smart crawling strategies. When the crawler finishes, it either returns to the crawler manager ("home") or, in case the list of seed URLs is not empty, migrates to the next Web site on the list and continues. Once the mobile crawler has successfully migrated back to its home, all pages retrieved by the crawler are transferred to the search engine via the crawler manager. Once the pages have been downloaded, the search engine can generate the index as before. The main difference is that the set of pages to be indexed is significantly smaller and only contains those pages that are relevant to the underlying search topic. Note that a significant part of the transmission cost can be saved by compressing the pages as well as the crawler code prior to migration. In case a mobile crawler does not find any relevant information on a particular server, no data besides the crawler code itself will be transmitted. Further note that since the crawler depends on the resources of the remote host for its processing power, it is not possible to predict how much memory is available to the crawler. This means that a crawler has to be able to dynamically interrupt its collection process whenever the available memory is exhausted, and to transmit the collected pages back to the search engine to free up resources. We are currently in the process of implementing a second prototype version of our mobile crawler that has the ability to transmit collected pages back to its home base whenever necessary. In addition, we are implementing a control interface that lets the programmer specify ahead of time how resource-intensively the crawler operates.
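The interruptible collection behavior described here might look as follows in outline. The `send_home` callback and the byte limit are our own illustrative assumptions, standing in for the transmission step back to the search engine and for whatever memory the remote host grants the crawler:

```python
def collect_with_flush(pages, send_home, max_buffer_bytes):
    """Collect (url, content) pairs, shipping batches home when memory fills."""
    buffer, used = [], 0
    for url, content in pages:
        buffer.append((url, content))
        used += len(content)
        if used >= max_buffer_bytes:   # memory exhausted: transmit and free it
            send_home(buffer)
            buffer, used = [], 0
    if buffer:                         # ship whatever remains at the end
        send_home(buffer)
```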
Specifically, the programmer can specify, for example, that the crawler never uses more than a certain percentage of the available memory, or that it should pause for t seconds between page accesses to reduce the load on the host server.

2.2 Issues Associated With Mobile Crawling

So far, we have focused mainly on the advantages of crawler migration and remote page processing. However, several issues still have to be resolved before mobile crawling can become widely used. These issues can be categorized roughly as policy issues. Although specific solutions to most of them are

beyond the scope of this work, we enumerate some of the important ones in an attempt to further the understanding of the underlying problems and to generate momentum for their solutions. Please note that mobile crawlers are mainly a special case of mobile agents, and thus some of the important issues related to mobile crawling should also be addressed in the broader context of mobile computing. The two most important issues with respect to our proposed solution are related to the fact that (1) a crawler must have permission from the owner of a Web site to execute locally on the server, and (2) our current version of the crawler code needs a runtime environment to be present at each site before it can execute. As far as the permission problem (1) is concerned, we currently see no short-term solution. As far as the runtime environment (2) is concerned, we see this as an installation problem that can be alleviated in two ways. In the short term, the installation can be simplified as much as possible by making the runtime environment a small server process which can be installed by the Web master of each participating Web site. In the long run, a better solution would be to standardize the runtime environment and make it an optional part of each Web server. Given the emergence of mobile agents, a standardized runtime environment could benefit other roaming agents besides our mobile crawlers. Currently, we avoid the above problems by testing our crawlers in secured intranets where we either have control over the participating Web servers or can obtain the necessary execution permission without problems. In this sense, our environment at the University is not unlike an Intranet within a large corporation, which could use mobile crawlers for setting up search indexes with relatively little effort.
3. A Cost-Benefit Analysis of Mobile Web Crawling

We have analyzed the behavior of a mobile crawler using the following four parameters:

Data Access: By migrating to a remote Web server, mobile crawlers can access Web pages locally with respect to the server. This saves network bandwidth by eliminating the request/response messages used for data retrieval.

Remote Page Selection: By migrating to a remote Web server, mobile crawlers can select only the relevant pages before transmitting them over the network. This saves network bandwidth by discarding irrelevant information directly at the data source.

Remote Page Filtering: By migrating to a remote Web server, mobile crawlers can reduce the contents of Web pages before transmitting them over the network. This saves network bandwidth by discarding irrelevant portions of the retrieved pages.

Remote Page Compression: By migrating to a remote Web server, mobile crawlers can compress the contents of Web pages before transmitting them over the network. This saves network bandwidth by reducing the size of the retrieved data.

We now examine the effects that each of these factors has on crawling efficiency in more detail.

3.1 Localized Data Access

In the context of traditional search engines, a stationary crawler is an HTTP (Hypertext Transfer Protocol) client which tries to recursively download all documents managed by one or more Web servers. Due to the HTTP request/response paradigm [BER96], downloading the contents of a Web server involves significant overhead caused by the request messages, which have to be sent separately for each Web page to be retrieved. Using a mobile crawler, we reduce the HTTP overhead by transferring the crawler to the source of the data. The crawler can then issue all HTTP requests locally with respect to the HTTP server. This approach still requires one HTTP request per document, but there is no need to transmit these requests

over the network anymore. Figure 2 summarizes the data retrieval process based on mobile crawlers as introduced above.

[Figure 2 shows a mobile crawler, dispatched by the search engine, issuing HTTP requests locally at the HTTP server and returning HTML pages to the search engine index.]

Figure 2: HTTP-based data retrieval using mobile crawlers.

Naturally, this approach only pays off if the reduction in Web traffic due to local data access is more significant than the traffic caused by the initial crawler transmission. The important question here is how soon the overhead due to the transmission of the mobile crawler to the data source is less than the transmission overhead caused by HTTP request and response messages. To answer this question, we have derived the following formulas describing the network load L as a function of the number of crawled pages N, where s_x denotes the size of an object x (e.g., message, Web page, crawler) in KB:

  L_traditional(N) = N * (s_request + s_response + s_page)
  L_mobile(N) = N * s_page + 2 * s_crawler

The formula for the network load caused by a mobile crawler assumes that the crawler is transmitted back to its home along with the retrieved pages. Based on these formulas we can derive another function describing the savings S in network load due to mobile crawlers:

  S(N) = L_traditional(N) - L_mobile(N) = N * (s_request + s_response) - 2 * s_crawler

By using averages for the size of the crawler and the sizes of the HTTP request and response messages, we can express the savings as a linear function in the number of pages:

  s_request = s_response = 0.1KB, s_crawler = 1.5KB
  S(N) = 0.2KB * N - 3KB

Figure 3 depicts the savings function as derived above. Note the exponential scale of the X-axis and the negative savings when crawling small sites. Based on the savings function S, we expect mobile crawlers to operate more efficiently than traditional crawlers if the number of pages to be crawled exceeds 15. Please note that this result is based on the assumption that we are considering no other factors besides localized data access.

3.2 Remote Page Selection

Another advantage of our approach is remote page selection.
Once a mobile crawler has been transferred to a Web server, it can analyze each Web page before transmitting it to the search engine. This allows mobile crawlers to determine whether a page is relevant with respect to a particular application domain. Web pages considered relevant are stored within the crawler and are eventually transmitted over the network when the mobile crawler returns to its home.

[Figure 3 plots the savings function S(N), saved bandwidth in KB versus the number of pages.]

Figure 3: Localized data access savings function.

By looking at remote page selection from a more abstract point of view, it compares favorably with classical approaches in database systems. If we consider the Web as a large remote database, the task of a crawler is akin to querying this database in order to extract certain portions of it. In this context, the main difference between traditional and mobile crawlers is the way queries are issued. Traditional crawlers implement the data shipping approach of database systems because they download the whole database before they can issue the query which identifies the relevant portion. If the query is very specific (e.g., establish an index of all English health care pages), a major part of the database has been downloaded without being useful in any way. In contrast to this, mobile crawlers use the query shipping approach of database systems because all the information needed to identify the relevant data portion is transferred directly to the data source along with the mobile crawler. After the query has been executed remotely, only the query result is transferred over the network and can be used to establish the desired index. Since the query shipping approach has proven to be more efficient in the context of database systems (e.g., SQL servers), we consider its application in the context of Web crawling to be superior to the traditional approach. As done in the previous section, let us look at the formulas describing the situation. The formulas given below focus on the remote page selection feature and neglect the savings in network load due to localized data access as analyzed in the previous section. The percentage of pages considered relevant by the mobile crawler is expressed as an additional factor SF in the formulas:

  L_traditional(N) = N * s_page
  L_mobile(N, SF) = (SF * N) * s_page + 2 * s_crawler

Note that the load function for mobile crawlers is a function in N and SF since the network load depends on both.
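Both the localized-access savings function from the previous section and the selection savings implied by these load functions can be checked numerically. The script below is our own sketch, using the average sizes quoted in the text (6KB pages, 1.5KB crawler, 0.1KB HTTP messages):

```python
S_PAGE, S_CRAWLER = 6.0, 1.5          # KB, averages used in the text
S_REQUEST = S_RESPONSE = 0.1          # KB, average HTTP message sizes

def savings_local_access(n):
    # Sec. 3.1: S(N) = N*(s_request + s_response) - 2*s_crawler
    return n * (S_REQUEST + S_RESPONSE) - 2 * S_CRAWLER

def savings_selection(n, sf):
    # Sec. 3.2: S(N, SF) = (1 - SF)*N*s_page - 2*s_crawler
    return (1 - sf) * n * S_PAGE - 2 * S_CRAWLER

def break_even(savings):
    # Smallest N with strictly positive savings (tolerance guards float noise).
    return next(n for n in range(1, 10_000) if savings(n) > 1e-9)

print(break_even(savings_local_access))                   # 16: pays off once N exceeds 15
print(break_even(lambda n: savings_selection(n, 0.95)))   # 11: pays off once N exceeds 10
print(break_even(lambda n: savings_selection(n, 0.80)))   # 3: pays off once N exceeds 2
```

The three break-even points match the figures discussed in the text.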
The selection factor SF in the second formula relates to N since it scales the number of pages transmitted over the network. Again, we are interested in how soon mobile crawlers start to outperform traditional crawlers and derive the savings function S(N, SF), which this time is a function in N and SF:

  S(N, SF) = L_traditional(N) - L_mobile(N, SF) = (1 - SF) * N * s_page - 2 * s_crawler

Assuming the same average numbers for crawler and page size, we can derive the following linear savings function:

  s_crawler = 1.5KB, s_page = 6KB
  S(N, SF) = 6KB * N * (1 - SF) - 3KB

Figure 4 depicts the savings function S(N, SF) for some example values of SF. Note the exponential scale of the X-axis. The values chosen for SF are conservative in that they assume that 80-95% of the crawled pages are considered to be relevant. Depending on the degree of specialization of the underlying search engine for which the crawler is working, the actual values can be significantly smaller.

[Figure 4 plots the savings function S(N, SF), saved bandwidth in KB versus the number of pages, for several values of SF.]

Figure 4: Remote page selection savings function.

Based on Figure 4, we see that the break-even point (in terms of number of pages) for mobile crawlers heavily depends on the percentage of pages considered relevant by the crawler. For example, assuming the crawler considers 95% of the crawled pages as relevant, a mobile crawler will operate more efficiently than a traditional one as soon as the number of pages crawled exceeds 10. A mobile crawler considering only 80% of the pages as relevant reaches this point as soon as more than 2 pages are crawled.

3.3 Remote Page Filtering

Remote page filtering extends the concept of remote page selection to the contents of a Web page. The idea behind remote page filtering is to allow the crawler to control the granularity of the data it retrieves. With stationary crawlers, the granularity of retrieved data is the Web page itself. This is because HTTP only allows page-level access. For this reason, stationary crawlers always have to retrieve an entire page before the indexer can extract the relevant page portion. Depending on the ratio of relevant to irrelevant information, significant portions of network bandwidth are wasted by transmitting useless data. A mobile crawler addresses this problem through its ability to operate directly at the data source. After retrieving a page, a mobile crawler can filter out all irrelevant content, keeping only information which is relevant to the search engine for which the crawler is working.
To get an idea of how significant the potential savings are, we have again derived the formulas that describe the situation. First, we establish the network load functions for the traditional and the mobile crawling approach. For simplicity, the overhead due to HTTP request and response messages is neglected. As before, s_x denotes object size.

  L_traditional(N) = N * s_page
  L_mobile(N, FF) = N * (FF * s_page) + 2 * s_crawler

As in the last section, the load function for the mobile crawling approach is a function in the number of crawled pages N and a factor FF (filter factor). Here, the factor FF states the percentage of the page contents used to represent the page. To estimate the potential savings due to mobile crawling we derive the savings function:

  S(N, FF) = L_traditional(N) - L_mobile(N, FF) = (1 - FF) * N * s_page - 2 * s_crawler

This savings function is very similar to the result of the last section. This is due to the fact that page filtering is represented in the formulas the same way as page selection. The difference is that page filtering scales the size of the pages and not the number of pages (of course, this does not make a difference from the mathematical point of view). We set the parentheses in the formulas to emphasize this. Using the same values for page and crawler size as before, we get the following linear savings function:

  s_crawler = 1.5KB, s_page = 6KB
  S(N, FF) = N * ((1 - FF) * 6KB) - 3KB

As in the last section, the savings due to remote page filtering depend heavily on the filter factor FF. Since the filter factor has the same impact on the saved bandwidth as the selection factor in the last section, we do not provide a separate figure for the calculated savings here. By replacing the selection factor SF with the filter factor FF, Figure 4 also depicts the savings function for remote page filtering for a search engine which extracts between 80 and 95% of the page contents for the index. As in the case of remote page selection, these factors are conservative estimates. To get a feel for the magnitude of the filter factor FF, consider a search engine index which relies on the page URL, page title, and a set of page keywords only. Assuming a page URL length of 60 characters, a page title length of 80 characters, and a set of 15 keywords with an average size of 10 characters each, the mobile crawler needs to keep only 290 bytes of the total page.
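The arithmetic behind this worked example is simply:

```python
# Compact page representation from the example above.
url_bytes     = 60        # average URL length
title_bytes   = 80        # average title length
keyword_bytes = 15 * 10   # 15 keywords of 10 characters each

kept = url_bytes + title_bytes + keyword_bytes
print(kept)                      # 290 bytes kept per page
print(round(kept / 6144, 3))     # FF ~ 0.047, i.e. about 5% of a 6KB page
```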
This equals a filter factor of about 5%, assuming an average page size of 6KB. With such a high filtering degree, a mobile crawler will always operate more efficiently than a traditional one. Therefore, remote page filtering is especially useful for search engines which use a specialized representation of Web pages (e.g., URL, title, modification date, keywords) instead of storing the complete page contents.

3.4 Remote Page Compression

Remote page selection and filtering are two important features that directly reduce the network traffic caused by Web crawlers. Both perform well in the context of specialized search engines, which cover only a certain portion of the Web. However, the situation is different for a crawler which is supposed to establish a comprehensive fulltext index of the Web. In this case, as many Web pages as possible need to be retrieved by the crawler. Techniques like remote page selection and filtering are not applicable in such cases since every page is considered relevant. In order to reduce the amount of data that has to be transmitted back to the search engine, we introduce remote page compression as an additional feature of mobile crawlers. Once a mobile crawler has finished crawling a Web site, it has identified a set of relevant Web pages, which are kept in the crawler's data repository. In order to reduce the bandwidth required to transfer the crawler along with the data it contains back to the search engine, the mobile crawler applies compression techniques to reduce its size prior to transmission. As before, we estimate the benefits by deriving the load functions

describing the network load caused by traditional and mobile crawlers. This time we focus on the effects of page compression and neglect the other features:

  L_traditional(N) = N * s_page
  L_mobile(N, CR) = CR * (N * s_page + 2 * s_crawler)

The load function for mobile crawlers is a function in the number of crawled pages N and the achieved compression ratio CR. The load function states that the compression ratio applies to both the pages and the crawler code. This is due to the fact that our prototype system compresses not only the crawled pages but also the crawler code itself. To estimate the benefits of remote page compression we derive the savings function:

  S(N, CR) = L_traditional(N) - L_mobile(N, CR) = (1 - CR) * N * s_page - 2 * CR * s_crawler

Using our average numbers for page and crawler size, we get the following linear representation of the savings function:

  s_crawler = 1.5KB, s_page = 6KB
  S(N, CR) = N * (1 - CR) * 6KB - CR * 3KB

Figure 5 depicts the savings function for some typical compression ratios.

[Figure 5 plots the savings function S(N, CR), saved bandwidth in KB versus the number of pages, for several compression ratios CR.]

Figure 5: Remote page compression savings function.

The diagram shows that remote page compression makes mobile crawling an attractive approach even for traditional search engines, which do not benefit from remote page selection and filtering due to their comprehensive fulltext-indexing scheme.

3.5 Combined Benefits

So far, we have examined the benefits of mobile crawling in an isolated fashion, neglecting the fact that the above parameters must be examined in combination in order to get a more realistic picture of the performance of mobile crawling. For example, remote page compression can always be activated and will further reduce the network load independently of any other features. The set of features which can be used by a mobile crawler to reduce the total network load depends on the type of search engine for which the crawler is working. For mobile crawlers, two characteristic features of search engines are

important: the first is whether the search engine tries to establish a comprehensive Web index; the second is whether the search engine relies on a fulltext index or a compact index. Table 1 summarizes the applicable crawling features for some typical search engine types.

  Search engine type                             Localized     Remote      Remote      Remote
                                                 data access   selection   filtering   compression
  Comprehensive coverage, fulltext index         Yes           No          No          Yes
  Comprehensive coverage, specialized index      Yes           No          Yes         Yes
  Subject-specific coverage, fulltext index      Yes           Yes         No          Yes
  Subject-specific coverage, specialized index   Yes           Yes         Yes         Yes

  Table 1: Applicable mobile crawling features by search engine type.

Given these combinations of applicable features, we derive general load functions based on the individual load functions identified in Sections 3.1 through 3.4. The general load function will serve as the baseline for an analytical comparison of mobile and traditional crawling techniques:

  L_traditional(N) = N * (s_request + s_response + s_page)
  L_mobile(N, CR, SF, FF) = CR * (SF * N * (FF * s_page) + 2 * s_crawler)

Deriving our savings function from these load functions leads to the following equation:

  S(N, CR, SF, FF) = L_traditional(N) - L_mobile(N, CR, SF, FF)
                   = N * (s_request + s_response + (1 - CR * SF * FF) * s_page) - 2 * CR * s_crawler

Although this function looks complicated, we can limit our analysis to the portion containing the filter, selection, and compression factors, which scale the size of the pages to be transmitted. By neglecting the less significant terms (the message sizes s_request and s_response and the constant crawler term) and by assuming an average page size of 6KB, we can derive a simplified yet sufficiently exact version of the savings function:

  S(N, CR, SF, FF) ~ N * (1 - CR * SF * FF) * 6KB

This simplified version of the savings function allows us to estimate the benefits of mobile crawling for the general search engine types identified in Table 1. Table 2 shows the benefits of mobile crawling in terms of bandwidth saved for some typical values of CR, SF, and FF.
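Per page, the simplified function says that a fraction 1 - CR*SF*FF of the 6KB is saved. As a sketch of ours (not the authors' code): for a comprehensive fulltext index nothing can be selected away or filtered out (SF = FF = 1), so a 1:3 compression ratio alone already accounts for the roughly 66% savings reported in Table 2:

```python
def fraction_saved(cr, sf, ff):
    # Simplified per-page savings fraction: 1 - CR*SF*FF
    return 1 - cr * sf * ff

# Comprehensive coverage, fulltext index: SF = FF = 1, CR = 1:3.
print(round(fraction_saved(1 / 3, 1.0, 1.0), 2))   # 0.67, i.e. roughly the 66% row of Table 2
```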
The results in Table 2 suggest that a significant portion of the network bandwidth required for Web crawling can be saved when using mobile instead of traditional crawlers. Based on our analysis and the data we gathered using our prototype system, we claim that mobile crawlers easily outperform traditional crawlers in terms of network efficiency. Most importantly, this claim holds independent of the type of search engine for which the crawler is working. The result of our analysis shows that mobile crawlers do very well when used in the context of specialized search engines since all mobile crawling specific features are applicable at the same time. Here, mobility not only has an impact in the space domain (i.e., network bandwidth saved) but also in the time domain (i.e., the time needed to finish crawling a site). This is due to the fact that the network is still a bottleneck when crawling the Web. By significantly reducing the amount of data to be transmitted over the network, the time needed for data transmission, and therefore the time needed to finish crawling, decreases.

Search engine type                             SF (selection factor)   FF (filter factor)   CR (compression ratio)   Bandwidth saved
Comprehensive coverage, fulltext index         1.0                     1.0                  1:3                      66%
Comprehensive coverage, specialized index      1.0                     0.3                  1:3                      90%
Subject specific coverage, fulltext index      0.3                     1.0                  1:3                      90%
Subject specific coverage, specialized index   0.3                     0.3                  1:3                      97%

Table 2: Bandwidth savings due to mobile crawlers by search engine type.

4. An Architecture for Mobile Web Crawling

Given the current Web infrastructure, mobile crawlers are not applicable yet. This is due to the fact that the cornerstones of our approach, migration and remote execution of crawler code, are not supported by the current Web infrastructure. In order to confirm the results of our analysis and to establish a proof of concept, we implemented a prototype system which provides the infrastructure required by mobile crawlers. The implemented prototype system extends the current Web architecture with a distributed runtime system for mobile crawlers and provides some additional components for the management and control of mobile crawlers. The overall system architecture and essential system components of the prototype system are depicted in Figure 6.

[Figure 6 shows the distributed crawler runtime environment (HTTP servers on the network, each augmented with a virtual machine and a communication subsystem) connected to the application framework architecture (inbox, outbox, crawler specifications, crawler manager, archive manager, query engine, command manager, connection manager, and an SQL database).]

Figure 6: System architecture overview.

The architecture depicted in Figure 6 consists of two major parts. The first part, the distributed runtime environment, provides the base functionality for the transfer and the execution of mobile crawler code. The runtime environment establishes a distributed execution environment in which application specific mobile crawlers can operate.
The second major part of the system, the application framework architecture, serves as an application independent interface to the distributed runtime environment. The framework architecture provides functionality for mobile crawler creation and

management. In addition, the framework provides a query interface, which allows applications to access the data retrieved by mobile crawlers. Since a detailed discussion of the prototype's architecture is beyond the scope of this paper, we only provide an overview of the most essential system components. For an in-depth discussion of the system components and their implementation, refer to [FIE98].

4.1 Mobile Crawlers

In our prototype, mobile crawlers serve as mobile containers for the crawling algorithm as well as for the collected data. To provide real mobility, a crawler needs to be able to save its runtime state, transfer it over the network, and restore it at the remote location. For interoperability, crawlers need to use a machine independent representation for their runtime state. Since this kind of interoperability is difficult to achieve, we decided to minimize the runtime state needed by our crawlers as much as possible. As a result, we decided to specify crawler programs based on rules and facts. The execution of a crawler program is equivalent to applying rules upon the facts inside the crawler's knowledge base. The advantage of this approach is that rule based programs do not have a real runtime state. With carefully designed rules, the runtime state of the crawler program can be represented by facts only. Thus, saving the runtime state of our crawlers involves stopping the rule application process and saving the current fact base. Once this is finished, the crawler can migrate since all relevant data (rules and facts) are now represented as simple ASCII strings within the crawler. The crawler object, which carries the rules and facts, migrates using the object serialization facilities built into the Java language.

4.2 Virtual Machine

The virtual machine is the heart of the distributed runtime environment. Its main purpose is to provide an environment in which crawler code received through the network can be executed. Since crawler programs are specified based on rules, we can model our virtual machine using an inference engine, which takes care of the rule application process.
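The state model described above, a crawler whose rules and facts are plain strings, requires very little machinery. A minimal sketch (the names are ours; the real prototype wraps the Jess engine rather than bare lists):

```java
import java.io.*;
import java.util.*;

// Illustrative sketch: a crawler whose entire state is its rules and
// facts, both plain strings, migrated via Java object serialization.
public class MobileCrawler implements Serializable {
    final List<String> rules = new ArrayList<>();  // the crawler program
    final List<String> facts = new ArrayList<>();  // doubles as runtime state

    // Suspend the crawler and capture it as a byte stream for transfer.
    byte[] migrate() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(this);  // rules and facts travel together
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reconstruct the crawler at the remote virtual machine.
    static MobileCrawler arrive(byte[] payload) {
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(payload))) {
            return (MobileCrawler) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Migration then reduces to shipping the byte array produced by migrate() to the remote host and calling arrive() there.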
To start the execution of a crawler, we initialize the inference engine with the rules and facts of the crawler to be executed. Starting the rule application process of the inference engine is equivalent to starting crawler execution. Once the rule application has finished (either because no rule is applicable or due to an external signal), the rules and facts now stored in the inference engine are extracted and stored back in the crawler. Thus, an inference engine establishes a virtual machine with respect to the crawler. The concrete implementation of our virtual machine uses an extended version of the Jess inference engine [FRI97], which in turn is basically a Java port of the well-known CLIPS system [GIA97].

4.3 Query Engine

The query engine is part of the application framework architecture and is responsible for the communication between crawler and application. Since our mobile crawlers are application independent, they have no information about the semantics of the data they retrieve. In order to use the retrieved data within an application, the data needs to be extracted from the crawler's information base. To provide efficient access to this information, we implemented a query engine, which evaluates application specific queries upon the information base. The query result is represented as structured data tuples, very similar to relational database systems. Since the information base consists of facts generated by the rule based crawler program, the query engine implementation is based on the same inference engine used for the virtual machine implementation. Application specific queries are translated into special query rules, which identify matching facts within the information base.

4.4 Prototype and Lessons Learned

The prototype system based on the architecture outlined in the previous sections has been implemented and is fully operational within an experimental testbed at the University of Florida at

Gainesville. The prototype is used as a proof of concept and to further evaluate our approach to Web crawling. For details, refer to [FIE98].

One of the most important requirements for our prototype implementation was the ability to run on multiple host platforms (code interoperability). Specifically, the runtime system has to provide a common environment to mobile crawlers while running on different platforms, operating systems, and Web servers. To achieve the required platform independence, we implemented the prototype using Java. So far, the crawler and runtime environment have been successfully tested on Unix and Windows machines. For example, within the University of Florida Intranet, mobile crawlers successfully migrated between host servers running the Unix and Windows operating systems and collected data sets managed by more than ten different Web servers across campus. We are currently in the process of extending our prototype system with new components that address the critical issues identified in Section 2.2. Our focus here is on improving the security and stability of the distributed runtime environment. We identified this as the most crucial point to be addressed prior to using mobile crawlers in a real network environment. Once this is done, we plan to install our runtime environment on numerous Web servers outside of the University of Florida in order to evaluate our crawling approach in a broader context using mobile crawler enabled servers. We will report on the results of these tests in future reports.

5. Related Research

Previous work related to this paper falls into four categories.

Search Engine Technology. Due to the short history of search engines, there has been little time to research this technology area.
One of the first papers in this area introduced the architecture of the World Wide Web Worm [MCB94] (one of the first search engines for the Web) and was published in 1994. Between 1994 and 1997, the first experimental search engines were followed by larger commercial engines such as WebCrawler [PIN94, WebCrawler], Lycos [Lycos, MAU97], AltaVista [AltaVista], Infoseek [Infoseek], Excite [Excite], and HotBot [HotBot]. Due to their commercial orientation, there is very little information available about these search engines and their underlying technology. Only two papers about architectural aspects of WebCrawler and Lycos are publicly available on the Web. The Google project [BRI97] at Stanford University recently brought large-scale search engine research back into the academic domain.

Web Crawling Research. A good source of information on Web crawling techniques is the Stanford Google project [BRI97], which we used as our primary architecture model. Based on the Google project, researchers at Stanford compared the performance of different crawling algorithms and the impact of URL ordering [CHO97] on the crawling process. A comprehensive Web directory providing information about crawlers developed for different research projects can be found on the robots homepage [KOS97]. Another project, which investigates Web crawling and Web indices in a broader context, is the Harvest project [BOW95]. Harvest supports resource discovery through topic-specific indexing made possible by an efficient distributed information gathering architecture. Harvest can therefore be seen as a base architecture upon which different resource discovery tools (e.g., search engines) can be built. A major goal of the Harvest project is the reduction of network and server load associated with the creation of Web indices. To address this issue, Harvest uses distributed crawlers (called gatherers) which can be installed at the site of the information provider to create and maintain a provider specific index.
The indices of different providers are then made available to external resource discovery systems by so-called brokers, which can use multiple gatherers (or even other brokers) as their information base.

Besides technical aspects, there is a social aspect to Web crawling, too. A Web crawler consumes significant network resources by accessing Web documents at a fast pace. More importantly, by downloading the complete contents of a Web server, a crawler might significantly hurt the performance of the server. For this reason, Web crawlers have earned a bad reputation, and their usefulness is sometimes questioned, as discussed by Koster [KOS95]. To address this problem, a set of guidelines for crawler developers has been published [KOS93]. In addition to these general guidelines, a specific Web crawling protocol, the Robot Exclusion Protocol [KOS96], has been proposed by the same author. This protocol enables webmasters to specify to crawlers which pages not to crawl. However, this protocol is not yet enforced, and Web crawlers implement it on a voluntary basis only.

Rule Based Systems. An example of a rule based system is CLIPS [GIA97] (C Language Integrated Production System), a popular expert system developed by the Software Technology Branch at the NASA/Lyndon B. Johnson Space Center. CLIPS allows the development of software systems which model human knowledge and expertise by specifying rules and facts. Rule based software programs do not need an explicit static control structure because rules are used to dynamically reason about facts in order to respond appropriately to the current situation. In the context of our prototype system we use a Java version of CLIPS called Jess (Java Expert System Shell) [FRI97]. Jess provides the core CLIPS functionality and is implemented at the Sandia National Laboratories. The main advantage of Jess is that it can be used on any platform which provides a Java virtual machine, which is ideal for our purposes.

Mobile Code. Mobile code has become popular in the last couple of years, especially due to the development of Java [GOS96]. The best examples are Java applets, which are small pieces of code, downloadable from a Web server for execution on a client.
The form of mobility introduced by Java applets is usually called remote execution, since the mobile code is executed completely once it has been downloaded. Since Java applets do not return to the server, there is no need to preserve the state of an applet during the transfer. Thus, remote execution is characterized by stateless code transmission. Another form of mobile code, called code migration, is due to mobile agent research. With code migration it is possible to transfer the dynamic execution state along with the program code to a different location. This allows mobile agents to change their location dynamically without affecting the progress of the execution. Initial work in this area has been done by General Magic [WHI96]. Software agents are an active research area with many publications focusing on different aspects of agents such as agent communication, code interoperability, and agent system architecture. Some general information about software agents can be found in papers by Harrison [HAR96], Nwana [NWA96], and Wooldridge [WOO95]. Different aspects and categories of software agents are discussed by Maes ([MAE94] and [MAE95]). Communication aspects of mobile agents are the main focus of a paper by Finin [FIN94].

6. Conclusion

We have introduced an alternative approach to Web crawling based on mobile crawlers. The proposed approach surpasses the centralized architecture of current Web crawling systems by distributing the data retrieval process across the network. In particular, using mobile crawlers we are able to perform remote operations such as data analysis and data compression at the data source before the data is transmitted over the network. This allows for more intelligent crawling techniques and addresses the needs of applications which are only interested in certain subsets of the available data.
We have developed and implemented an application framework, which demonstrates our mobile Web crawling approach and allows applications to take advantage of mobile crawling.

The performance results of our approach are very promising. Mobile crawlers can significantly reduce the network load caused by crawling by reducing the amount of data transferred over the network. Mobile crawlers achieve this reduction in network traffic by performing data analysis and data compression at the data source. Therefore, mobile crawlers transmit only relevant information, in compressed form, over the network.

The prototype implementation of our mobile crawler framework provides an initial step towards mobile Web crawling. We have identified several issues which need to be addressed before mobile crawling can be used on a larger scale:

Security. Crawler migration and remote execution of crawler code cause severe security problems because a mobile crawler might contain harmful code. We suggest introducing an identification mechanism for mobile crawlers based on digital signatures. Based on this identification scheme, a system administrator would be able to grant execution permission to certain crawlers only, excluding crawlers from unknown (and therefore unsafe) sources. In addition to this, the crawler virtual machine needs to be secured such that crawlers cannot get access to critical system resources. This is already implemented in part due to the execution of mobile crawlers within the Jess inference engine. By restricting the functionality of the Jess inference engine, a secure sandbox scheme (similar to Java) can be implemented relatively easily.

Integration of the mobile crawler virtual machine into the Web. The availability of a mobile crawler virtual machine on as many Web servers as possible is crucial for the effectiveness of mobile crawling. This integration can be achieved through Java Servlets, for example, which extend Web server functionality with special Java programs. We realize, of course, that before this can be done, some effort has to be spent on standardizing the functionality of such runtime environments.

Research in mobile crawling algorithms. None of the current crawling algorithms have been designed with mobility in mind.
For this reason, it seems worthwhile to spend some effort on the development of new algorithms that take advantage of mobility. In particular, these algorithms have to deal with the loss of centralized control over the crawling process due to mobility.

References

[AltaVista] AltaVista, AltaVista Search Engine, WWW.
[BEL97] BellCore, Netsizer Internet Growth Statistics Tool, Bell Communication Research, 1997.
[BER96] Berners-Lee, T., Hypertext Transfer Protocol HTTP/1.0, RFC 1945, Network Working Group, 1996.
[BOW95] Bowman, C. M., Danzig, P. B., Hardy, D. R., Manber, U., Schwartz, M. F., Wessels, D. P., Harvest: A Scalable, Customizable Discovery and Access System, Technical Report, University of Colorado, Boulder, Colorado, USA, 1995.
[BRI97] Brin, S., Page, L., The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Science Department, Stanford University, Stanford, CA, USA, 1997.
[CHO97] Cho, J., Garcia-Molina, H., Page, L., Efficient Crawling Through URL Ordering, Computer Science Department, Stanford University, Stanford, CA, USA, 1997.
[Excite] Excite, Excite Search Engine, WWW.
[FIE98] Fiedler, J., Hammer, J., Using the Web Efficiently: Mobile Crawlers, Technical Report, University of Florida, Gainesville, FL, November 1998, ftp://ftp.dbcenter.cise.ufl.edu/pub/publications/mobile-Crawling.pdf.


More information

APPENDIX 3 Tuning Tips for Applications That Use SAS/SHARE Software

APPENDIX 3 Tuning Tips for Applications That Use SAS/SHARE Software 177 APPENDIX 3 Tuning Tips for Applications That Use SAS/SHARE Software Authors 178 Abstract 178 Overview 178 The SAS Data Library Model 179 How Data Flows When You Use SAS Files 179 SAS Data Files 179

More information

Advanced Solutions of Microsoft SharePoint 2013

Advanced Solutions of Microsoft SharePoint 2013 Course 20332A :Advanced Solutions of Microsoft SharePoint 2013 Page 1 of 9 Advanced Solutions of Microsoft SharePoint 2013 Course 20332A: 4 days; Instructor-Led About the Course This four-day course examines

More information

Around the Web in Six Weeks: Documenting a Large-Scale Crawl

Around the Web in Six Weeks: Documenting a Large-Scale Crawl Around the Web in Six Weeks: Documenting a Large-Scale Crawl Sarker Tanzir Ahmed, Clint Sparkman, Hsin- Tsang Lee, and Dmitri Loguinov Internet Research Lab Department of Computer Science and Engineering

More information

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho How to Crawl the Web Hector Garcia-Molina Stanford University Joint work with Junghoo Cho Stanford InterLib Technologies Information Overload Service Heterogeneity Interoperability Economic Concerns Information

More information

Lotus Sametime 3.x for iseries. Performance and Scaling

Lotus Sametime 3.x for iseries. Performance and Scaling Lotus Sametime 3.x for iseries Performance and Scaling Contents Introduction... 1 Sametime Workloads... 2 Instant messaging and awareness.. 3 emeeting (Data only)... 4 emeeting (Data plus A/V)... 8 Sametime

More information

How To Construct A Keyword Strategy?

How To Construct A Keyword Strategy? Introduction The moment you think about marketing these days the first thing that pops up in your mind is to go online. Why is there a heck about marketing your business online? Why is it so drastically

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Implementation Techniques

Implementation Techniques V Implementation Techniques 34 Efficient Evaluation of the Valid-Time Natural Join 35 Efficient Differential Timeslice Computation 36 R-Tree Based Indexing of Now-Relative Bitemporal Data 37 Light-Weight

More information

A scalable lightweight distributed crawler for crawling with limited resources

A scalable lightweight distributed crawler for crawling with limited resources University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2008 A scalable lightweight distributed crawler for crawling with limited

More information

A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS

A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS Satwinder Kaur 1 & Alisha Gupta 2 1 Research Scholar (M.tech

More information

Capacity Planning for Next Generation Utility Networks (PART 1) An analysis of utility applications, capacity drivers and demands

Capacity Planning for Next Generation Utility Networks (PART 1) An analysis of utility applications, capacity drivers and demands Capacity Planning for Next Generation Utility Networks (PART 1) An analysis of utility applications, capacity drivers and demands Utility networks are going through massive transformations towards next

More information

CHAPTER 6 SOLUTION TO NETWORK TRAFFIC PROBLEM IN MIGRATING PARALLEL CRAWLERS USING FUZZY LOGIC

CHAPTER 6 SOLUTION TO NETWORK TRAFFIC PROBLEM IN MIGRATING PARALLEL CRAWLERS USING FUZZY LOGIC CHAPTER 6 SOLUTION TO NETWORK TRAFFIC PROBLEM IN MIGRATING PARALLEL CRAWLERS USING FUZZY LOGIC 6.1 Introduction The properties of the Internet that make web crawling challenging are its large amount of

More information

DATABASE SCALABILITY AND CLUSTERING

DATABASE SCALABILITY AND CLUSTERING WHITE PAPER DATABASE SCALABILITY AND CLUSTERING As application architectures become increasingly dependent on distributed communication and processing, it is extremely important to understand where the

More information

Analysis of the effects of removing redundant header information in persistent HTTP connections

Analysis of the effects of removing redundant header information in persistent HTTP connections Analysis of the effects of removing redundant header information in persistent HTTP connections Timothy Bower, Daniel Andresen, David Bacon Department of Computing and Information Sciences 234 Nichols

More information

= a hypertext system which is accessible via internet

= a hypertext system which is accessible via internet 10. The World Wide Web (WWW) = a hypertext system which is accessible via internet (WWW is only one sort of using the internet others are e-mail, ftp, telnet, internet telephone... ) Hypertext: Pages of

More information

1 Connectionless Routing

1 Connectionless Routing UCSD DEPARTMENT OF COMPUTER SCIENCE CS123a Computer Networking, IP Addressing and Neighbor Routing In these we quickly give an overview of IP addressing and Neighbor Routing. Routing consists of: IP addressing

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Automated Path Ascend Forum Crawling

Automated Path Ascend Forum Crawling Automated Path Ascend Forum Crawling Ms. Joycy Joy, PG Scholar Department of CSE, Saveetha Engineering College,Thandalam, Chennai-602105 Ms. Manju. A, Assistant Professor, Department of CSE, Saveetha Engineering

More information

Performance Modeling and Evaluation of Web Systems with Proxy Caching

Performance Modeling and Evaluation of Web Systems with Proxy Caching Performance Modeling and Evaluation of Web Systems with Proxy Caching Yasuyuki FUJITA, Masayuki MURATA and Hideo MIYAHARA a a Department of Infomatics and Mathematical Science Graduate School of Engineering

More information

HONORS ACTIVITY #2 EXPONENTIAL GROWTH & DEVELOPING A MODEL

HONORS ACTIVITY #2 EXPONENTIAL GROWTH & DEVELOPING A MODEL Name HONORS ACTIVITY #2 EXPONENTIAL GROWTH & DEVELOPING A MODEL SECTION I: A SIMPLE MODEL FOR POPULATION GROWTH Goal: This activity introduces the concept of a model using the example of a simple population

More information

Immidio White Paper Things You Always Wanted To Know About Windows Profile Management

Immidio White Paper Things You Always Wanted To Know About Windows Profile Management Immidio White Paper Things You Always Wanted To Know About Windows Profile Management Abstract Why are Windows user profiles so critically important for corporate IT environments and how can they be managed

More information

How Turner Broadcasting can avoid the Seven Deadly Sins That. Can Cause a Data Warehouse Project to Fail. Robert Milton Underwood, Jr.

How Turner Broadcasting can avoid the Seven Deadly Sins That. Can Cause a Data Warehouse Project to Fail. Robert Milton Underwood, Jr. How Turner Broadcasting can avoid the Seven Deadly Sins That Can Cause a Data Warehouse Project to Fail Robert Milton Underwood, Jr. 2000 Robert Milton Underwood, Jr. Page 2 2000 Table of Contents Section

More information

Computer Fundamentals : Pradeep K. Sinha& Priti Sinha

Computer Fundamentals : Pradeep K. Sinha& Priti Sinha Computer Fundamentals Pradeep K. Sinha Priti Sinha Chapter 18 The Internet Slide 1/23 Learning Objectives In this chapter you will learn about: Definition and history of the Internet Its basic services

More information

Turbo King: Framework for Large- Scale Internet Delay Measurements

Turbo King: Framework for Large- Scale Internet Delay Measurements Turbo King: Framework for Large- Scale Internet Delay Measurements Derek Leonard Joint work with Dmitri Loguinov Internet Research Lab Department of Computer Science Texas A&M University, College Station,

More information

Implementation of Enhanced Web Crawler for Deep-Web Interfaces

Implementation of Enhanced Web Crawler for Deep-Web Interfaces Implementation of Enhanced Web Crawler for Deep-Web Interfaces Yugandhara Patil 1, Sonal Patil 2 1Student, Department of Computer Science & Engineering, G.H.Raisoni Institute of Engineering & Management,

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

Postgres Plus and JBoss

Postgres Plus and JBoss Postgres Plus and JBoss A New Division of Labor for New Enterprise Applications An EnterpriseDB White Paper for DBAs, Application Developers, and Enterprise Architects October 2008 Postgres Plus and JBoss:

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

LECTURE 11: Applications

LECTURE 11: Applications LECTURE 11: Applications An Introduction to MultiAgent Systems http://www.csc.liv.ac.uk/~mjw/pubs/imas 11-1 Application Areas Agents are usefully applied in domains where autonomous action is required.

More information

Tips and Guidance for Analyzing Data. Executive Summary

Tips and Guidance for Analyzing Data. Executive Summary Tips and Guidance for Analyzing Data Executive Summary This document has information and suggestions about three things: 1) how to quickly do a preliminary analysis of time-series data; 2) key things to

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

Categorizing Migrations

Categorizing Migrations What to Migrate? Categorizing Migrations A version control repository contains two distinct types of data. The first type of data is the actual content of the directories and files themselves which are

More information

SEO WITH SHOPIFY: DOES SHOPIFY HAVE GOOD SEO?

SEO WITH SHOPIFY: DOES SHOPIFY HAVE GOOD SEO? TABLE OF CONTENTS INTRODUCTION CHAPTER 1: WHAT IS SEO? CHAPTER 2: SEO WITH SHOPIFY: DOES SHOPIFY HAVE GOOD SEO? CHAPTER 3: PRACTICAL USES OF SHOPIFY SEO CHAPTER 4: SEO PLUGINS FOR SHOPIFY CONCLUSION INTRODUCTION

More information

ATA DRIVEN GLOBAL VISION CLOUD PLATFORM STRATEG N POWERFUL RELEVANT PERFORMANCE SOLUTION CLO IRTUAL BIG DATA SOLUTION ROI FLEXIBLE DATA DRIVEN V

ATA DRIVEN GLOBAL VISION CLOUD PLATFORM STRATEG N POWERFUL RELEVANT PERFORMANCE SOLUTION CLO IRTUAL BIG DATA SOLUTION ROI FLEXIBLE DATA DRIVEN V ATA DRIVEN GLOBAL VISION CLOUD PLATFORM STRATEG N POWERFUL RELEVANT PERFORMANCE SOLUTION CLO IRTUAL BIG DATA SOLUTION ROI FLEXIBLE DATA DRIVEN V WHITE PAPER Create the Data Center of the Future Accelerate

More information

A Novel Architecture of Ontology based Semantic Search Engine

A Novel Architecture of Ontology based Semantic Search Engine International Journal of Science and Technology Volume 1 No. 12, December, 2012 A Novel Architecture of Ontology based Semantic Search Engine Paras Nath Gupta 1, Pawan Singh 2, Pankaj P Singh 3, Punit

More information

3 Media Web. Understanding SEO WHITEPAPER

3 Media Web. Understanding SEO WHITEPAPER 3 Media Web WHITEPAPER WHITEPAPER In business, it s important to be in the right place at the right time. Online business is no different, but with Google searching more than 30 trillion web pages, 100

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. # 20 Concurrency Control Part -1 Foundations for concurrency

More information

The Internet Advanced Research Projects Agency Network (ARPANET) How the Internet Works Transport Control Protocol (TCP)

The Internet Advanced Research Projects Agency Network (ARPANET) How the Internet Works Transport Control Protocol (TCP) The Internet, Intranets, and Extranets 1 The Internet The Internet is a collection of interconnected network of computers, all freely exchanging information. These computers use specialized software to

More information

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management WHITE PAPER: ENTERPRISE AVAILABILITY Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management White Paper: Enterprise Availability Introduction to Adaptive

More information

August 14th - 18th 2005, Oslo, Norway. Web crawling : The Bibliothèque nationale de France experience

August 14th - 18th 2005, Oslo, Norway. Web crawling : The Bibliothèque nationale de France experience World Library and Information Congress: 71th IFLA General Conference and Council "Libraries - A voyage of discovery" August 14th - 18th 2005, Oslo, Norway Conference Programme: http://www.ifla.org/iv/ifla71/programme.htm

More information

RSA INCIDENT RESPONSE SERVICES

RSA INCIDENT RESPONSE SERVICES RSA INCIDENT RESPONSE SERVICES Enabling early detection and rapid response EXECUTIVE SUMMARY Technical forensic analysis services RSA Incident Response services are for organizations that need rapid access

More information

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2 Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,

More information

WHITE PAPER Cloud FastPath: A Highly Secure Data Transfer Solution

WHITE PAPER Cloud FastPath: A Highly Secure Data Transfer Solution WHITE PAPER Cloud FastPath: A Highly Secure Data Transfer Solution Tervela helps companies move large volumes of sensitive data safely and securely over network distances great and small. We have been

More information

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Indexing Week 14, Spring 2005 Edited by M. Naci Akkøk, 5.3.2004, 3.3.2005 Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Overview Conventional indexes B-trees Hashing schemes

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

JXTA for J2ME Extending the Reach of Wireless With JXTA Technology

JXTA for J2ME Extending the Reach of Wireless With JXTA Technology JXTA for J2ME Extending the Reach of Wireless With JXTA Technology Akhil Arora Carl Haywood Kuldip Singh Pabla Sun Microsystems, Inc. 901 San Antonio Road Palo Alto, CA 94303 USA 650 960-1300 The Wireless

More information

Introduction to the Internet and Web

Introduction to the Internet and Web Introduction to the Internet and Web Internet It is the largest network in the world that connects hundreds of thousands of individual networks all over the world. The popular term for the Internet is

More information

CHAPTER. The Role of PL/SQL in Contemporary Development

CHAPTER. The Role of PL/SQL in Contemporary Development CHAPTER 1 The Role of PL/SQL in Contemporary Development 4 Oracle PL/SQL Performance Tuning Tips & Techniques When building systems, it is critical to ensure that the systems will perform well. For example,

More information

THE HISTORY & EVOLUTION OF SEARCH

THE HISTORY & EVOLUTION OF SEARCH THE HISTORY & EVOLUTION OF SEARCH Duration : 1 Hour 30 Minutes Let s talk about The History Of Search Crawling & Indexing Crawlers / Spiders Datacenters Answer Machine Relevancy (200+ Factors)

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

CTW in Dasher: Summary and results.

CTW in Dasher: Summary and results. CTW in Dasher: Summary and results. After finishing my graduation thesis Using CTW as a language modeler in Dasher, I have visited the Inference group of the Physics department of the University of Cambridge,

More information

TN3270 AND TN5250 INTERNET STANDARDS

TN3270 AND TN5250 INTERNET STANDARDS 51-10-55 DATA COMMUNICATIONS MANAGEMENT TN3270 AND TN5250 INTERNET STANDARDS Ed Bailey INSIDE Enterprise Data and Logic; User Productivity and Confidence; Newer Platforms and Devices; How Standardization

More information

Part I: Future Internet Foundations: Architectural Issues

Part I: Future Internet Foundations: Architectural Issues Part I: Future Internet Foundations: Architectural Issues Part I: Future Internet Foundations: Architectural Issues 3 Introduction The Internet has evolved from a slow, person-to-machine, communication

More information

Assignment #2. Csci4211 Spring Due on March 6th, Notes: There are five questions in this assignment. Each question has 10 points.

Assignment #2. Csci4211 Spring Due on March 6th, Notes: There are five questions in this assignment. Each question has 10 points. Assignment #2 Csci4211 Spring 2017 Due on March 6th, 2017 Notes: There are five questions in this assignment. Each question has 10 points. 1. (10 pt.) Design and describe an application-level protocol

More information

Quest Central for DB2

Quest Central for DB2 Quest Central for DB2 INTEGRATED DATABASE MANAGEMENT TOOLS Supports DB2 running on Windows, Unix, OS/2, OS/390 and z/os Integrated database management components are designed for superior functionality

More information