Integrating WWW Caches and Search Engines*


Global Internet: Application and Technology

W. Meira Jr., R. Fonseca, M. Cesario, N. Ziviani
Department of Computer Science
Universidade Federal de Minas Gerais
Belo Horizonte - MG - Brazil
{meira, rfonseca, magc, nivio}@dcc.ufmg.br

Abstract

In this paper we propose the concept of cache plugins, which are customized programs that run on WWW cache servers and perform some of the search engine tasks. We describe a prototype implementation of a cache plugin that answers client requests directed to a large search engine, using a nearby cache server to store static objects. Experimental results using actual logs show a significant improvement in the quality of service of the search engine: its predictability doubles, its availability improves by a factor of 24, its response time drops by 8%, and the network traffic drops by a factor of 20.

1 Introduction

The rapid and uncontrolled growth of the WWW has increased the importance of search engines as a means of finding relevant information, documents and services. These engines have become the actual gateways through which users start navigation, and a measure of their popularity and massive use is the cost of advertising on their sites, which starts at a few thousand dollars a month.

Because of their widespread use and relevance, efficient and fast searches are a necessity. However, the performance of the search servers depends not only on the number of requests, but also on their complexity. The amount of data returned varies greatly among requests, imposing a variable load on the servers and on the networks that connect them to the clients. Another recent trend that has been gaining popularity is the use of meta-search engines [1], which aim at minimizing the users' effort by simultaneously querying several search engines and combining the results in a single page.
However, this can worsen the problem of performance degradation, since it multiplies the traffic generated by a single user.

Standard caching strategies [4, 5], employed to minimize both response time and network traffic, are not adequate for caching responses to search engine queries, which are intrinsically dynamic. These strategies only support static objects, i.e., those which do not change frequently. Moreover, content providers avoid the replication of pages containing advertising material, since they can no longer control banner replacement policies, nor can they account for individual accesses to their pages. Even more elaborate strategies proposed for caching dynamic content did not perform satisfactorily, increasing the client response time by up to 400% [3].

In this paper we propose cache plugins as a new strategy for handling requests for non-static documents. Cache plugins are small programs that execute on hosts that act as WWW cache servers, enabling the cache to store static information from which the non-static pages can be built. Furthermore, they can act as concentrators for page view accounting.

The paper is organized as follows. Section 2 describes the cache plugin architecture. Section 3 introduces the Miner Family of Web agents as a good candidate for being improved in terms of its quality of service. Section 4 presents experimental results, and Section 5 presents conclusions and future work.

Global Telecommunications Conference - Globecom '99 (IEEE)

2 Cache Plugin Architecture

In this section we describe the cache plugin architecture in detail, together with an implementation using Squid [6]. Cache plugins are programs that run either on the cache server or on a nearby (connectionwise) machine. Plugins answer requests for dynamic objects (i.e., objects that change frequently), which WWW caches normally ignore and forward to remote servers. A busy content provider or search engine may implement a cache plugin for its site and make it available so that it can be installed in cache servers close to users. Once operational, the plugin is able to perform several tasks on behalf of the represented server, such as page generation and page view accounting, lowering the load on both the server and the network.

In our prototype implementation, requests reach the plugin through the redirection capability provided by Squid. Redirectors are scripts that rewrite the URLs of requests, as specified by the cache administrator. To install a plugin we simply need to enable a redirection clause that causes the cache server to redirect requests originally targeting a represented server to the respective local plugin.

Upon receiving the request, the plugin parses it and determines all static objects, such as banner images, that are required to build the response. These objects are then requested from the cache server, which treats them as ordinary requests: if a local copy exists in the cache, it is returned; otherwise, the object is fetched from the remote host, sent to the plugin, and cached locally for future requests. Once all the required objects are available, the plugin combines them appropriately, building the response page, which is sent to the cache server and finally to the client. Furthermore, the plugin can gather information, such as accounting of page accesses and clicks on advertising banners, and report it periodically to the server it represents.
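The request path described above can be sketched in a few lines. This is a minimal illustration in Python, not the authors' Perl scripts and not Squid's actual redirector interface: the host names, the Cache and Plugin classes, and the required_objects and compose callbacks are all hypothetical names introduced here.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical hosts: the represented server and the local plugin endpoint.
REPRESENTED_HOST = "search.example.com"
PLUGIN_HOST = "localhost:8081"


def redirect(url):
    """Rewrite a URL aimed at the represented server so it reaches the local plugin."""
    parts = urlsplit(url)
    if parts.netloc == REPRESENTED_HOST:
        return urlunsplit(parts._replace(netloc=PLUGIN_HOST))
    return url  # every other request passes through unchanged


class Cache:
    """Toy stand-in for the cache server: return a local copy, or fetch and store it."""

    def __init__(self, fetch_from_origin):
        self.store = {}
        self.fetch_from_origin = fetch_from_origin
        self.misses = 0

    def get(self, url):
        if url not in self.store:
            self.misses += 1  # miss: the object comes from the remote host
            self.store[url] = self.fetch_from_origin(url)
        return self.store[url]


class Plugin:
    """Builds a response page from static objects obtained through the cache."""

    def __init__(self, cache, required_objects, compose):
        self.cache = cache
        self.required_objects = required_objects  # request URL -> static object URLs
        self.compose = compose                    # (request URL, objects) -> page
        self.page_views = 0                       # accounting to report to the server

    def handle(self, url):
        objects = [self.cache.get(u) for u in self.required_objects(url)]
        self.page_views += 1
        return self.compose(url, objects)
```

In this sketch only the first request for a given static object reaches the origin; later requests are served from the local store, while the plugin keeps counting page views that it could report to the represented server.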
This process is completely transparent to the client, and is summarized in Figure 1. The numbers in the figure refer to the following seven steps:

1) the client requests a non-static page from a cache server;
2) the cache server redirects the request to the cache plugin;
3) the cache plugin parses the request and requests static objects from the cache server;
4) static objects may be misses, in which case they are requested from the original servers;
5) the plugin composes the response page and answers the cache server's request;
6) the cache server returns the page to the client; and
7) the cache plugin notifies the server about accesses and may retrieve new procedures for response page generation.

[Figure 1: General Plugin Architecture]

It is important to note that each content provider implements and distributes its own plugin, which is therefore tailored to understand requests directed to its represented server, being able to determine what data to request from the server and how to interpret the data. This can involve the use of encryption to enforce security, for example. Moreover, step 7 in Figure 1, the plugin reporting back to the server, can involve any kind of communication, since the protocol is defined by the content provider.

The freshness of the objects in the cache can be controlled by ordinary TTL (time to live) mechanisms. The server can set the TTL of each static object it returns based on how often the object is updated.

The use of cache plugins is beneficial for both the remote server, which experiences a reduction in processor and network load, and the client making the request, which sees a lower response time. Obviously, the gains from employing cache plugins depend on the reference locality in the request stream to which the plugin is subject.

3 The Miner Family of Web Agents

An example of an application that can benefit from the cache plugin architecture is the Miner Family of Web Agents [1].
The Miner Family is a collection of individual programs, called agents, that can act both as searching utilities and as electronic catalogs, and can also provide brokerage services. The Miner Family was developed mainly for Portuguese language-based services. The search utility services provided by the Miner Family include MetaMiner, which is a meta-search engine that uses Brazilian and international search engines, among others. The Miner Family was coded in Java and comprises about 23,000 lines of code that run on a Netscape Enterprise Server.

All members of the Miner Family work similarly, and the main steps to answer a request can be summarized as follows: (1) a user submits a query; (2) the Miner server gets the query and dispatches its agents; (3) each agent queries its target engine, store, or site; (4) each agent receives and parses the query results; and (5) the server unifies, formats, and sends the results to the user.

The opportunity for caching lies in the fact that all agents, at some point, combine static objects to form a response. Each query to an agent of the Miner Family is specified by one or more words and a few additional parameters. Using the MetaMiner agent as an example, if the server executing the agent holds lists of URLs for each of the words making up the query, building the response to the query just requires the agent to merge the lists appropriately. These lists can also be sent to a remote plugin, cached, and combined locally for future requests.

4 Experimental Results

In order to verify the efficiency of the cache plugin architecture, we evaluated the gains observed in terms of quality of service of the search engine, using a workload derived from MetaMiner logs. This workload drives an experimental environment consisting of clients that submit requests to a cache server on which a cache plugin is running. This cache plugin represents a search engine server that runs on a separate machine. Both the cache plugin and the search engine server build the response to a query similarly: they merge the URL lists corresponding to the words contained in the query.
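The list merge can be sketched as follows, reading "merge" as keeping the URLs that appear in the list of every query word. This is a minimal Python illustration under assumptions: the function names, the get_list lookup, and the choice to preserve the order of the first list are introduced here and are not taken from the prototype.

```python
def merge_url_lists(lists):
    """Return the URLs present in every list, in the order of the first list."""
    if not lists:
        return []
    common = set(lists[0]).intersection(*map(set, lists[1:]))
    seen = set()
    result = []
    for url in lists[0]:
        if url in common and url not in seen:  # skip duplicates within the first list
            seen.add(url)
            result.append(url)
    return result


def answer_query(query, get_list):
    """Build the result for a query from per-word URL lists.

    get_list is a hypothetical lookup: on the search engine server it would
    read the local lists, on the plugin it would request them via the cache.
    """
    words = query.split()
    return merge_url_lists([get_list(word) for word in words])
```

Because the per-word lists are the only inputs, the same function can run on the server or on the plugin; only the get_list lookup differs.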
Since these lists remain unchanged for a relatively long period, they are considered to be static here, and can be cached normally.

To obtain the workload for our experiments we combined data from two different sources: (1) logs from the MetaMiner engine, which hold a record of each request submitted, with all the options the user chose when submitting it, and (2) text documents retrieved from the Web (those with MIME type text/*) that were stored on the disks of POP-MG, a large proxy cache server in Brazil. The MetaMiner logs span 2 months and comprise a total of 925,042 requests to the engine. We noticed a high degree of reference locality among search terms, with the hit ratio for a period of 24 hours averaging 80%. The requests had an average of 1.63 words, with 52% consisting of only one word and the largest query being 12 words long.

We processed the logs and isolated all the words that made up the queries, for a total of 125,342 unique words. We searched for each of these words in the database of text documents retrieved from the POP-MG cache server, and built a list containing all documents where that word appeared. We retrieved 85,426 documents from the cache, which comprise a total of [...] megabytes of text data. The average size of the URL lists for each word is 38,90[...] has 6967 URLs. By using this approach to generate the workload, we guaranteed a realistic distribution of both the size of the lists and the frequency with which each list is requested, since the popularity of a term in actual documents and in actual searches are both taken into account.

The experiments consisted of several clients submitting requests to the cache server. In each run, the requests are a subset of the requests present in the MetaMiner logs, and are formed by one or more words. Our experimental environment consists of Pentium machines running Solaris and FreeBSD in the same switched LAN.
The machine that acts as the search engine server is a Pentium Pro SMP with four processors, 128 MB of memory, and Ultra-wide SCSI disks. The cache server is a dual Pentium Pro SMP with 128 MB of memory and Ultra-wide SCSI disks. Finally, the clients are 166 MHz Pentium machines with 64 MB of memory.

The clients are processes that emulate a set of browsers performing queries. Each client queries a single cache server, keeping a configurable number of requests open. The clients handle several simultaneous connections efficiently by using asynchronous communication primitives, as proposed in [2]. The server is a Perl script that parses the queries, retrieves the list of URLs for each word that composes the query, joins these lists by identifying the URLs that are present in all of them, and generates a page to be returned to the client. The cache redirector and the plugin are also Perl scripts. The redirector is quite simple and just translates server requests into requests to the plugin. The plugin parses the request and determines which lists of URLs should be requested from the cache server and later combined to generate the response to the client. The generated pages, in contrast to the URL lists, are not cached, in order to optimize the cache disk utilization.

We evaluated the efficacy of cache plugins through four metrics that quantify the quality of service of a search engine: response time, availability, predictability, and scalability. In order to evaluate these metrics, we performed two experiments that differ only in whether cache plugins are used. There were two client processes per experiment; each client kept 20 connections open simultaneously and performed 80,000 requests from the workload described earlier in this section. The cache server keeps 20 instances of the cache plugin running. Table 1 shows some performance measures that we discuss next.

[Table 1: Profiles. Columns: Plugin, No Plugin. Rows: Average Response Time (sec.), Response Time Relative Variance, Errors, Server CPU Load (usr/sys/wio), Server Disk Load, Cache to Server Requests, Bytes Transferred/sec.]

We can observe that the use of cache plugins reduced the client response time by 8%. This relatively small improvement is explained by the prototype implementation of the plugins, which performs several forks, an operation well known to be computationally expensive. Moreover, both the response time variance and the number of errors (i.e., requests that timed out without a response) are much higher for the non-plugin configuration, indicating its lower predictability of service latency and lower service availability, respectively. In Figure 2 we show the average response time over the course of the experiments for one of the clients. One can observe the higher variance and the slightly higher average response time for the non-plugin configuration.

Regarding server load, both CPU and disk loads decreased by one order of magnitude when employing cache plugins. The last two rows in the table present the gains in terms of the number of requests to the search engine server, which was reduced by a factor of 10, and the network traffic (measured in bytes transferred per second), which was reduced by a factor of 20.
Finally, we evaluated the service scalability by varying the number of simultaneous requests submitted by the clients. We performed similar experiments with just 5 simultaneous requests per client and compared the variations in response time. The average response time when we used the cache plugin was 1.35 seconds, compared to 1.82 seconds without the plugin, an improvement rate of 24%. The lower improvement rate obtained under heavier workloads is explained by the CPU saturation experienced by the cache server. Note that although the search engine server is twice as powerful as the cache server, we were still able to perform significantly better. We believe that the gains would be even higher in real environments, where there may be several cache servers representing a single search engine, and where worse network conditions make access to the search engine more difficult.

Recall that the gains just presented were a consequence of the reference locality present in the stream of requests, which can be further exploited by prefetching popular lists of URLs. Prefetching would allow not only faster responses, but also congestion avoidance, since it can be performed during off-peak periods. In order to evaluate the applicability of prefetching, we ran additional experiments in which each client warmed up the cache server with the 16,000 most popular URL lists from the first experiment, and then submitted a different stream of 80,000 requests from the workload. The results are presented in the graph of Figure 2 ("Plugin with Prefetching"), where we can see that prefetching further improved the average response time by 10%, decreasing it to 1.22 seconds.

5 Conclusions and Future Work

In this paper we presented Cache Plugins, a novel strategy to improve the quality of service of searches in the WWW by integrating search engines and cache servers.
Using the proposed architecture, result pages can be effectively cached, as can any dynamic content page that is normally not cacheable. This strategy reduces both the response time seen by the client and the server and network load. The implementation of such cache plugins is straightforward, and they also allow accounting of page visits and advertising clickthrough, the lack of which is one of the main restrictions on caching certain pages with dynamic content. We plan to investigate the benefits of the architecture with data from a real search engine, which has larger lists of URLs, and the implementation of more sophisticated protocols between the plugin and the server, so that the plugin can decide whether to request the dynamic page directly or the static objects necessary to form the response. We also intend to investigate the use of cache plugins in other scenarios, such as electronic commerce.

[Figure 2: Average client response time. Series: Plugin, Plugin with Prefetching, No Plugin. Horizontal axis: Requests (Thousands).]

References

[1] V. Almeida, W. Meira Jr., V. Ribeiro, and N. Ziviani. Efficiency analysis of e-brokers in the electronic marketplace. In Proceedings of WWW8.
[2] Gaurav Banga and Peter Druschel. Measuring the Capacity of a Web Server. In Usenix Symposium on Internet Technologies and Systems, Monterey, December.
[3] P. Cao, J. Zhang, and Kevin Beach. Active cache: Caching dynamic contents on the web. In Proc. of IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 1998.
[4] Carlos Maltzahn, Kathy J. Richardson, and Dirk Grunwald. Performance Issues of Enterprise Level Web Proxies. ACM Sigmetrics '97.
[5] C. Roadknight and I. Marshall. Variations in cache behavior. In Proceedings of WWW7.
[6] Duane Wessels and K. Claffy. Squid Internet Object Cache, http://www.nlanr.net/squid.
