Collaborative Web Caching Based on Proxy Affinities


Jiong Yang, IBM T. J. Watson Research Center (jiyang@us.ibm.com) · Wei Wang, IBM T. J. Watson Research Center (ww1@us.ibm.com) · Richard Muntz, Computer Science Department, UCLA (muntz@cs.ucla.edu)

ABSTRACT

With the exponential growth of hosts and traffic workloads on the Internet, collaborative web caching has been recognized as an efficient solution to alleviate web page server bottlenecks and reduce traffic. However, cache discovery, i.e., locating where a page is cached, is a challenging problem, especially in the fast-growing World Wide Web environment, where the number of participating proxies can be very large. In this paper, we propose a new scheme which employs proxy affinities to maintain a dynamic distributed collaborative caching infrastructure. Web pages are partitioned into clusters according to proxy reference patterns, and all proxies which frequently access some page(s) in the same web page cluster form an "information group". When web pages belonging to a web page cluster are deleted from or added into a proxy's cache, only the proxies in the associated information group are notified. This scheme can be shown to greatly reduce the number of messages and other overhead on individual proxies while maintaining a high cache hit rate. Finally, we employ trace-driven simulation with three web access trace logs to verify that our caching structure provides significant benefits on real workloads.

1 INTRODUCTION

As traffic on the Internet continues to grow, how to improve Internet performance has become a challenging issue. Recent research to tackle this problem falls into three categories.

Server load balancing. It is well known that accesses over the Internet are not uniformly distributed. A common approach adopted by popular web sites to alleviate the server bottleneck is to provide a virtual URL interface and use a distributed server architecture underneath. Some internal mechanism to dynamically assign client requests to the web servers is required to achieve scalability and transparency to the clients. This assignment decision can be made at the IP level or at the domain name system level. In addition to their individual drawbacks [5], both schemes require the set-up of additional server(s) at local or remote sites to handle request overloads. This approach can be taken to avoid request overload at web sites which are continuously popular (e.g., the DBLP bibliography web site) or predictably popular during some period (e.g., the Olympic games web site). However, due to its static nature, this mechanism cannot handle unforeseeable bursty requests effectively.

Intra-net collaborative caching. Within a medium area network (MAN) (e.g., a campus-wide network), several proxies are usually connected via high-speed links. Web pages fetched from the content server by one proxy can potentially fulfill later requests (for the same web page) from other proxies within the same MAN. If multiple requests for the same web page are submitted from the same MAN within a short period of time, only the first one needs to go to the content server. Since the proxy is usually much "closer" to the client than the content server, the average latency of web page retrieval is reduced. In addition, only a fraction of all requests need to be handled by the server, so the server workload and Internet traffic are also reduced. Summary cache [11] is an example of this type of technique. However, this approach still cannot fully eliminate the server bottleneck caused by unforeseeable Internet-wide bursty accesses, since a cached copy of a web page is only shared by clients within the same MAN.
Inter-net collaborative caching. Inter-net collaborative caching has been recognized as another feasible approach to improving Internet performance, based on the following observations: (1) retrieving data from a nearby proxy is usually much faster than fetching the data from a more distant server (less latency and, most likely, higher bandwidth); (2) a proxy that fetched a web page can potentially serve as the provider of this web page at a later time (as long as it still caches an up-to-date copy of the page). However, the number of proxies connected via the Internet is large. Without incurring large overhead, how to identify a potential cache provider and, moreover, how to choose an optimal one if multiple providers exist, is crucial to the scalable performance of the caching scheme.

Harvest [9] [6] uses a proxy hierarchy to conduct the search for a potential provider. This hierarchy is formed statically, independent of the proxies' affinities for web pages. Thus, it is likely that the cache providers are far apart (in the hierarchy) from the requester, in which case large overhead is incurred for cache discovery. [19] also tries to solve this problem, by maintaining a dynamic routing table. In our proposed approach, instead of employing a fixed hierarchy, proxy affinities are used to group proxies dynamically.

Intuitively, from the viewpoint of a proxy, each web page request falls into one of the following three categories: (1) the proxy locally caches an up-to-date version of the requested web page; (2) an up-to-date version of the requested web page is cached by some other nearby proxy; (3) the requested web page has to be obtained from the content server. Web pages in the first and third categories are easy to handle, since the operations are local to each individual proxy. However, collaboration among proxies is required to locate cached web pages in the second category, and a carefully designed caching scheme is crucial in order to avoid large overhead. In our approach, proxies that have similar affinities cooperate and share their cache contents with each other in such a manner that little overhead is required to achieve a relatively high cache hit ratio. Whenever a proxy fetches a web page, it serves as a cache provider until this copy becomes obsolete. Our ultimate goal is to minimize the average response time of web page retrievals. Two aspects are involved: (1) What should be cached, and where? (2) How does a proxy locate a remote cache containing a desired page?
The first question can be addressed by algorithms for prefetching [14] [12]. Due to space limitations, we do not address this issue but rather focus on the second one: how to discover which proxy has a cached copy of a particular web page. Basically, there are two approaches to cache discovery: pull and push. With the pull technique, when a proxy tries to locate a cache, it first finds which proxy caches the web page and then retrieves the page; this usually increases the average response time. With the push technique, when the cache contents of a proxy change, the proxy tells other proxies about the change. However, the push technique can communicate a large amount of unnecessary information, which generates significant overhead not only on the network but also on the proxies, due to message processing. In addition, to locate a cache for a particular web page, a proxy has to maintain information about other proxies' cache contents, and this can consume a significant amount of resources when the number of proxies is large. Therefore, a proxy should be selective concerning which proxies it sends the updates of its cache contents to, and how often. The trend of increasing diversity of web page popularity [2] also indicates that a static global caching structure that disregards individual proxy affinities will become less effective; in fact, an optimal caching scheme should adapt well to diverse proxy affinities and to their changes. In our scheme, a dynamic distributed collaborative caching infrastructure is built according to the proxy affinities, so that unnecessary communication between proxies of distinct affinities can be avoided and changes of proxy affinities can be adapted to easily and quickly. Moreover, scalability is an important aspect of a web caching scheme, since the number of proxies and servers could range up to hundreds of thousands in the Internet environment. It will be shown in a later section that our approach scales well when the number of proxies is large.

In this paper, we propose a scheme based on proxy profiles and information groups derived from web page access patterns. The goal of this scheme is to reduce the average number of messages among proxies for updating cache status while maintaining a high cache hit rate. In our scheme, a proxy profile contains a list of URLs of the web pages that are frequently accessed by the proxy, while an information group is essentially a multicast group that links proxies which are collaboratively caching web pages of common interest. We group web pages into a set of overlapping web page clusters. A proxy may choose to join an information group if the intersection of the associated page cluster and the proxy's profile is significant; generally, a proxy joins enough information groups to cover its most frequently accessed pages. Since a proxy's profile may evolve over time, the web page cluster generation and information group formation are performed periodically (and incrementally) to adapt to changes in the proxy's access pattern. When a host wants to access a web page, it sends a request to its local proxy. If the requested web page is not cached in the local proxy, the proxy checks whether other proxies in the information group(s) for this page currently have the page in their cache; the request is then sent to the "nearest" site which has this page. Once there are enough changes in the cache contents of a proxy, the proxy notifies the other proxies in the same information group(s) of the changes. Therefore, in order to achieve optimal performance, each information group should be of moderate size: a large group incurs significant bookkeeping overhead, while a small group suffers from a low cache hit ratio.
The remainder of this paper is organized as follows. We discuss some related work in Section 2. Section 3 presents the objective model for our caching scheme. The simulation set-up is presented in Section 4. Our page clustering algorithm and information group formation algorithm are presented in Sections 5 and 6, respectively. In Sections 7 and 8, we discuss the web page retrieval process and the estimation of the optimal information group size, respectively. Finally, conclusions are drawn in Section 9.

2 RELATED WORK

In this section, we discuss some other web caching schemes proposed recently. We note that most previous cooperative caching schemes [6] [19] [11] focused on how caches cooperate with each other; they did not take into account the correlation among proxy affinities.

How to best utilize reference pattern information to improve Web performance has become a topic of increasing interest [8]. One contribution of our work is a scheme that efficiently analyzes reference patterns in order to organize proxies into collaborative caching groups: a high cache hit rate can still be maintained at the cost of far fewer inter-proxy messages than in previous schemes.

2.1 Caching in Harvest

The caching scheme in Harvest is presented in [9] [6]. Caches are organized in a hierarchy, and each cache in the hierarchy independently decides whether to fetch the reference from the object's home site or from its parent or sibling caches, using a resolution protocol. If the URL contains any of a configurable list of substrings, then the object is fetched directly from the object's home rather than through the cache hierarchy; this feature is used to force the cache to resolve non-cacheable URLs (e.g., cgi-bin), dynamically formed pages, and local URLs directly from the object's home. Otherwise, when a cache receives a request for a URL that misses, it performs an ICP (Internet Cache Protocol) query to all of its siblings and parents, checking whether the URL hits any sibling or parent. Of all the sites which report having the object, the object is retrieved from the site with the lowest measured latency.

2.2 Adaptive Web Caching

An adaptive web caching scheme is proposed in [19]. Cache servers are self-organizing: they form a tight mesh of overlapping multicast groups and adapt as necessary to changing conditions. This mesh of overlapping groups forms a scalable, implicit hierarchy that is used to efficiently diffuse popular web content towards the demand. Adaptive web caches exchange a description of their content state with the other members of their cache groups to eliminate the delay and the unnecessary use of resources of explicit cache probing. For each request, a cache first determines whether a copy of the requested web page is cached by one of its group members. If not, the request is forwarded to another cache (in the web caching infrastructure) which is significantly more likely to have the data. This routing is accomplished using a URL routing table maintained at each web cache; the information in the URL routing table is learned from the source-based and content-based entries obtained from other web caches.

2.3 Summary Cache

Summary cache is a scalable wide-area web cache sharing protocol proposed in [11]. In this protocol, each proxy keeps a summary of the cache directory of each participating proxy. When a user request misses in the local cache, the proxy checks these summaries for potential hits. If a hit occurs, the proxy sends out requests to the relevant proxies to fetch the document; otherwise, the proxy sends the request directly to the web server. In order to reduce the network overhead, a proxy does not update the copy of its summary stored at other proxies upon every modification of its directory: rather, it waits until a certain percentage of its cached documents is not reflected in the other proxies. In addition, Bloom filters [3] are used to build the summaries in order to keep each individual summary small. Experiments have shown that summary cache significantly reduces the number of inter-cache protocol messages, the bandwidth consumption, and the protocol CPU overhead while maintaining almost the same cache hit ratio as ICP. However, when the number of proxies is large, the network traffic for updating cache contents could also be large; thus summary cache may have poor scalability.

2.4 Web Caching Based on Dynamic Access Patterns

A "local" caching algorithm based on dynamic access patterns was presented in [18].
This caching algorithm flexibly adapts its parameters (such as the number of documents, the size of the cache, the actual documents in the cache, etc.) based on the observation that there is a strong relationship between frequency and probability of access. The analysis is based upon a model from psychological research on human memory: the probability of future document access is estimated from the frequency of prior document accesses. However, this algorithm does not address the issue of cooperative caching.

2.5 Server Volumes and Proxy Filters

[8] proposed an end-to-end approach to improve Web performance using server volumes and proxy filters. The server groups related resources into volumes based on access patterns (captured by conditional probability) and the file system's directory structure, and this information is piggybacked in the server's response message to the requesting proxy. A proxy-generated filter, which indicates the type of information of interest to the proxy, is used to tailor the piggyback information. The piggyback information can be used to improve the effectiveness of a variety of proxy policies such as cache coherency, prefetching, cache replacement, and so on. Experimental results show that probability-based volumes can achieve a higher prediction rate with a lower piggyback size than the directory-based structure. Further reductions in processing and memory overhead are possible by limiting the calculation of probability implications to pairs of resources that have the same directory prefix, at the expense of missing associations between resources in different directories.

3 OBJECTIVE MODEL

The objective of our web caching scheme is to minimize the average latency (or response time) of web page retrieval. The following is a list of parameters for estimating the average response time.

σ: local cache hit ratio, the ratio of the number of requests which are served by the local proxy over the total number of requests.

ω: remote cache hit ratio, the ratio of the number of requests which are served by remote proxies over the number of outgoing requests, i.e., the requests which cannot be satisfied locally.

Local Cost: the average response time of serving a request by the local proxy.

Remote Cost: the average response time to serve a request through a remote proxy.

Server Cost: the average response time to serve a request from the content server.

Locating Cost: the average time to find whether (and where) a cached web page exists.

Push Cost: the average overhead (per outgoing request) incurred by multicasting the changes of cache content to remote proxies.

Usually, the volume of valid web pages cached at each proxy remains roughly the same if the system is stable; in other words, the web page expiration rate is commensurate with the fetch-in rate. Therefore, the change of cache content can be measured by the amount of new web pages fetched in. Assume that whenever the cache content changes by δ%, the proxy multicasts a description of the changes to the other proxies, and assume that this δ% change is the result of q (q > 1) outgoing requests on average, which take a period of time of length t_q on average. Then each proxy multicasts, on average, once every t_q units of time. Assume that a proxy receives m multicast messages (from other proxies) on average between two of its own consecutive multicasts. A proxy then needs to spend t_s + m·t_r time to process multicast messages, where t_s is the average time to generate and send out a multicast message to other proxies and t_r is the average time to receive a multicast message sent by another proxy and update the local information. Thus, the total push cost over all proxies during a period of time t_q is N_p·(t_s + m·t_r) + N_p·t_n, where N_p is the total number of proxies, t_n is the average additional delay to other network traffic due to the transmission of such a multicast message (transmitting the message consumes a certain amount of network bandwidth and hence might delay the delivery of other packets), and N_p·t_n is therefore the overall overhead incurred by delivering all the multicast messages sent out by all proxies during t_q. Since the multicast information is used to facilitate the remote cache discovery process, we should factor this cost into the search cost of each outgoing request. Since there are on average N_p·q outgoing requests overall during the period t_q, the average overhead for each outgoing request is

Push Cost = (t_s + m·t_r + t_n) / q.

Search Cost: Push Cost + Locating Cost.

Cost: the average response time of retrieving a web page,

Cost = σ·Local Cost + (1 − σ)·Search Cost + (1 − σ)·ω·Remote Cost + (1 − σ)·(1 − ω)·Server Cost.

To simplify the problem, we assume that a web caching algorithm can only control the parameters Cost, Locating Cost, Push Cost, Search Cost, and ω, but has no control over the rest of the parameters, which can therefore be viewed as constants. To minimize Cost, we thus need to minimize the following function:

Search Cost + ω·Remote Cost + (1 − ω)·Server Cost
    = Push Cost + Locating Cost + ω·Remote Cost + (1 − ω)·Server Cost,        (1)

where Push Cost = (t_s + m·t_r + t_n)/q and Server Cost > Remote Cost.

Each of ω and Search Cost can be viewed as a function of the number of proxies in collaboration (denoted by m): the more proxies are involved, the higher both the remote cache hit ratio and the search cost. Thus, a moderate value of m should be chosen to balance the tradeoff between the remote cache hit ratio and the search cost. Moreover, an optimal strategy is a scheme which achieves a higher remote cache hit ratio for a given search cost, or a lower search cost for a given remote cache hit ratio. It follows that, during the push process, the update information should only be sent to those proxies which have an affinity for the involved web pages; by similar reasoning, during the remote cache discovery process, only those proxies that have an affinity for the requested page need to be queried. In our proposed scheme, proxies that share some common affinities form an information group, which serves as the basis of collaboration. All parameters in Function (1) other than m can be either collected or estimated at run time, and the optimal size of an information group (i.e., the number of proxies in collaboration) can then be calculated accordingly. In addition, instead of enforcing a unique information group size, we employ two parameters I_min and I_max to indicate the range of the optimal value of m, and we require that the size of each information group be within the range [I_min, I_max]. This not only accommodates error in the estimation of m, but also provides some flexibility in the construction of information groups (without sacrificing the quality of collaboration). Values of I_min and I_max are discussed in Section 8.
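To make the tradeoff concrete, the following Python sketch evaluates Function (1) over candidate group sizes and picks the minimizer. The timing constants and the shape of ω(m) are illustrative assumptions only, not values measured in the paper.

    # Illustrative sketch of choosing the information group size m that minimizes
    # Function (1).  The timing constants and the shape of omega(m) below are
    # made-up placeholders, not values measured in the paper.
    def push_cost(m, t_s=0.002, t_r=0.001, t_n=0.0, q=50):
        # Per outgoing request: (t_s + m*t_r + t_n) / q, as derived above.
        return (t_s + m * t_r + t_n) / q

    def omega(m, max_hit=0.35):
        # Hypothetical diminishing-returns curve: more collaborating proxies give a
        # higher remote cache hit ratio, with decreasing marginal benefit.
        return max_hit * (1.0 - 1.0 / (1.0 + 0.1 * m))

    def outgoing_cost(m, locating=0.01, remote=0.15, server=0.60):
        # Function (1): Search Cost + omega*Remote Cost + (1-omega)*Server Cost,
        # where Search Cost = Push Cost + Locating Cost and Server Cost > Remote Cost.
        w = omega(m)
        return push_cost(m) + locating + w * remote + (1.0 - w) * server

    best = min(range(2, 201), key=outgoing_cost)
    i_min, i_max = max(2, best - 10), best + 10
    print("estimated optimal group size:", best, "candidate range:", (i_min, i_max))

With diminishing returns in ω(m) and a push cost that grows linearly in m, the minimizer lies at a moderate group size, which is exactly the behaviour [I_min, I_max] is meant to bracket.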
4 TRACES AND SIMULATIONS

In this paper, we use three sets of traces of HTTP requests to verify our design.

DEC traces [10]: Digital Equipment Corporation web proxy server traces from Aug. 29 to Sep. 4, 1996.

ClarkNet traces [7]: ClarkNet is a full Internet access provider for the metro Baltimore-Washington DC area. The traces contain all requests to the ClarkNet server from August 24 to September 10, 1995.

L traces [13]: 100 days' worth of logs of HTTP requests (Aug. 31 to Oct. 7, 1997) from about 3,000 distinct clients, containing a total of 10 million requests; a total of 12,000 servers and 300,000 distinct web pages are represented in the trace.

Since the information in the trace logs is insufficient for the purpose of simulation (e.g., the trace logs do not reveal the connectivity of the hosts), we had to approximate the needed information for a detailed simulation. To build the topology of these hosts, we first need to find the number of servers. We assume that web pages served by the same server all have the same first-level URL component; for example, the URLs of all web pages served by the ACM server start with "www.acm.org". By examining the IP address of each user/host, we can discover the number of subnets. Furthermore, we assume that the hosts in a subnet are connected by a 100 Mbit/s fast Ethernet and that there is one proxy for each subnet. We then generate a random connected graph in which each node corresponds to a server or a subnet; the edges of the graph correspond to the communication links between the server and client subnets, and the bandwidth on these links is 10 Mbit/s. We generate a background workload which consumes, on average, half of the bandwidth on each communication link. In this paper we assume that the average lifetime of a web page is 7 days, and that after a proxy fetches a web page, the copy resides in the proxy's cache until it becomes obsolete (this is feasible since all cached web pages can be stored on disk). Last but not least, the page size follows a log-normal distribution with an average of 8 KB [2]. Table 1 shows some statistical information on the above three traces.
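A rough sketch of this topology construction is given below; the request format, the /24 approximation of a subnet, and the amount of extra random edges are our own assumptions.

    import math
    import random
    from urllib.parse import urlparse

    def server_of(url):
        # Pages served by the same server share the first-level URL component,
        # e.g. "http://www.acm.org/pubs/x.html" -> "www.acm.org".
        return urlparse(url).netloc

    def subnet_of(ip):
        # Approximate a client subnet by the first three octets of its IP address.
        return ".".join(ip.split(".")[:3])

    def build_topology(requests, extra_edge_ratio=0.5, seed=0):
        """requests: iterable of (client_ip, url) pairs.
        Returns (nodes, edges): one node per server and per client subnet (each
        subnet also hosts one proxy), joined by a random connected graph."""
        requests = list(requests)
        rng = random.Random(seed)
        servers = {server_of(u) for _, u in requests}
        subnets = {subnet_of(ip) for ip, _ in requests}
        nodes = sorted(servers | subnets)
        rng.shuffle(nodes)
        # A chain over the shuffled nodes guarantees connectivity; extra random
        # edges (the 10 Mbit/s links of the simulation) are sprinkled on top.
        edges = {tuple(sorted((nodes[i], nodes[i + 1]))) for i in range(len(nodes) - 1)}
        max_edges = len(nodes) * (len(nodes) - 1) // 2
        target = min(max_edges, len(edges) + int(extra_edge_ratio * len(nodes)))
        while len(edges) < target:
            edges.add(tuple(sorted(rng.sample(nodes, 2))))
        return nodes, edges

    def page_size_bytes(rng, mean_kb=8.0, sigma=1.0):
        # Log-normal page sizes with an 8 KB mean; sigma is a guessed shape parameter.
        mu = math.log(mean_kb * 1024) - sigma * sigma / 2.0
        return rng.lognormvariate(mu, sigma)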

5 PAGE CLUSTER

Web pages are identified by their URLs. Intuitively, different sets of web pages may have different popularity: some web pages are hot spots whereas others are rarely accessed. Experimental results have shown that the majority of references go to a small percentage of the web pages [4], so it is much more beneficial to improve the cache hit rate for these hot web pages. Therefore, we focus on the set of frequently accessed web pages, for the following reasons. (1) If a page is not frequently accessed by any proxy, then it is unlikely that an up-to-date copy of this page is cached by some proxy; when this page is needed by a proxy, the proxy has to fetch it from the server in most cases. There is therefore no incentive to analyze the access patterns of this set of web pages, because we can rarely benefit from it. (2) The analysis of access patterns has a certain overhead, which is quickly reduced by eliminating from consideration all infrequently referenced pages.

Each proxy keeps a log of the web page reference history of its local clients, which is a sequence of web page references. Given a web page and a trace log, the ratio of the number of accesses to this web page over the total number of accesses within the trace log is referred to as the frequency of the web page within the trace log. For example, if there are 100 web page accesses overall and a web page is referenced 30 times, then the frequency of that web page is 30% in this trace. We use a threshold β (specified as a percentage) to determine whether a page is considered frequently accessed: a page is considered frequently accessed if and only if its reference frequency is at least β. Intuitively, the frequency can be viewed as an indicator of the popularity of the page. Table 2 shows the effect of varying the value of β. For example, when β is chosen as 0.01% (i.e., a page is considered frequently accessed only if it is accessed at least once per 10,000 references), then 4% of all pages are considered frequently accessed, and 40% of all requests go to these 4% of the pages referenced in the DEC trace (these values are in the page coverage and request coverage columns, respectively). We discuss the effect of the value of β on the cache hit ratio, the average number of messages, the average size of a message, etc., in Section 7.
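A minimal sketch of this per-proxy profile computation, assuming the local log is simply a list of requested URLs and β is given as a fraction:

    # Sketch of building a proxy profile: the pages whose reference frequency in the
    # local trace log is at least beta (e.g. beta = 0.0001 corresponds to 0.01%).
    from collections import Counter

    def frequently_accessed(trace, beta):
        """trace: sequence of requested URLs.  Returns the set of URLs whose share
        of all references is at least beta."""
        counts = Counter(trace)
        total = len(trace)
        return {url for url, c in counts.items() if c / total >= beta}

    # Worked example from the text: a page referenced 30 times out of 100 accesses
    # has frequency 30%.
    log = ["a.html"] * 30 + ["b.html"] * 5 + ["c.html"] * 65
    print(frequently_accessed(log, beta=0.10))   # a.html and c.html qualify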
Page clusters (of frequently referenced web pages) can be formed according to the combined access patterns of all proxies. Grouping web pages into clusters based on these logs consists of two steps:

1. Each proxy sends its profile (i.e., its locally frequently accessed web pages) to a central site S.

2. An optimal or near-optimal partition of the frequently accessed web pages is generated.

Each proxy maintains a profile which consists of the URLs of its locally frequently accessed web pages; this can easily be done by examining the local trace log. We will explain later that this profile is used not only by the central site to group the page clusters but also by the proxy itself to guide which information groups it joins. At the site S, all reported web pages are grouped into a set of page clusters (see Figure 1). If a proxy frequently accesses at least one page in a page cluster, then we say that it has an affinity for that page cluster. An information group is formed for each cluster for collaborative caching, and a proxy may choose to join an information group if it has an affinity for the corresponding cluster. However, due to the overlaps among page clusters, a proxy might not join an information group even though it has an affinity for the corresponding page cluster, since the proxy may choose to join another information group that also accounts for those pages. Let I_max be the maximum number of proxies allowed in an information group. The size of each page cluster is determined in such a way that the number of proxies that join the corresponding information group is at most I_max (e.g., I_max = 3 in Figure 1; the optimal value of I_max is much larger in reality).

As mentioned before, our goal is to reduce both the average number of messages processed by each proxy for updating other proxies' cache status and the size of these messages, while maintaining the cache hit rate. Moreover, we want to minimize the average number of information groups a proxy may join, so that the bookkeeping overhead is minimized; replication is allowed during grouping (i.e., two page clusters may contain some common web pages) for this purpose. However, allowing a large amount of replication causes another problem: the number of collaborating proxies for a web page tends to become small (we explain this later in this section), and as a result the cache hit ratio could be impacted severely. In order to prevent such an occurrence, some mechanism has to be employed to restrict the number of replicas allowed for a page during the web page clustering procedure. Let I_min be the minimum number of collaborating proxies required to maintain an acceptable cache hit ratio for a web page (I_min = 2 in Figure 1). The problem then becomes how to guarantee that a web page A is allowed to replicate into a page cluster PG_i only if doing so would not cause the size of any information group to fall below I_min.

Table 1: Statistical Information of Trace Logs

                          DEC Traces     ClarkNet Traces   L Traces
  Number of Requests      3.54 million   3.3 million       10 million
  Number of Proxies       16             2000              546
  Number of Clients/Hosts 10,000         21,000            3000
  Number of Servers       70,000         1                 12,000

Table 2: Page/Request Coverage for Different Frequencies

            DEC Traces             ClarkNet Traces        L Traces
  β         page       request     page       request     page       request
            coverage   coverage    coverage   coverage    coverage   coverage
  0.005%    9%         60%         10%        58%         10%        62%
  0.01%     4%         40%         4%         39%         4%         41%
  0.05%     0.4%       17%         0.4%       16%         0.4%       17%

[Figure 1: General Scenario — proxies (Proxy 1 ... Proxy n) and their affinities for frequently accessed web pages grouped into page clusters.]

[Figure 2: Map to a Hyper-graph — web pages become vertices and each proxy's affinity set becomes a hyper-edge.]

The information group size is crucial to the overall performance. We analyze the effects of different information group sizes and how to determine the optimal size in later sections.

General Approach. This is a hyper-graph partition problem with replication, as follows. Each frequently accessed web page is mapped to a vertex, and each proxy's profile (i.e., its affinity set) is mapped to a hyper-link among the corresponding vertices (see Figure 2). The objective is to minimize the average number of information groups a proxy must join in order to cover all its frequently accessed web pages. A move-based algorithm [1] can be employed to form the page clusters. Let N_p and n_p be the number of proxies and the average number of "frequently accessed" web pages per proxy, respectively. We first group all web pages into N_c = ⌈ N_p·n_p / ((I_max + I_min)/2) ⌉ clusters without any replication if no page clusters exist previously; otherwise, the previous page clusters are taken as the initial grouping. Beginning from this initial grouping, a series of passes is made to improve the quality of the grouping by moving pages among clusters, replicating pages into some clusters, or removing replicas from some clusters. During each pass, pages are successively examined until each page has been examined exactly once. Given the current grouping P0, the previously unexamined page with the highest gain (defined shortly) is examined and the corresponding action (moving, replicating, or unreplicating) that yields that gain is taken to modify the current grouping. Given the current grouping P0, the gain of an action Ac is defined as the reduction in the average number of information groups a proxy has to join if action Ac is taken; note that the gain can be either positive or negative. After each pass, the best grouping observed during the pass becomes the initial grouping for a new pass (an individual action during a pass might not always improve the grouping quality, but it may create an opportunity for a better grouping to be obtained later). The procedure terminates when a pass fails to improve the quality of its initial grouping.

[Figure 3: Data Structures — (a) proxy bucket: proxy ID, list of frequently accessed pages, list of interested clusters; (b) page bucket: page ID, list of frequently accessing proxies, list of participating clusters, gain array of all possible actions; (c) cluster bucket: cluster ID, cluster size, list of pages, list of frequently accessing proxies.]
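As a small illustration of the mapping and of the initial cluster count, the sketch below builds the hyper-graph from proxy profiles and computes N_c; the variable names and toy profiles are ours.

    # Map proxy profiles onto a hyper-graph (pages are vertices, each profile is a
    # hyper-edge) and compute the initial cluster count
    # N_c = ceil(N_p * n_p / ((I_max + I_min) / 2)) used when no previous clustering
    # exists.
    import math

    def build_hypergraph(profiles):
        """profiles: {proxy_id: set of frequently accessed pages}."""
        vertices = set().union(*profiles.values())
        hyperedges = {proxy: set(pages) for proxy, pages in profiles.items()}
        return vertices, hyperedges

    def initial_cluster_count(profiles, i_min, i_max):
        num_proxies = len(profiles)                                        # N_p
        avg_pages = sum(len(p) for p in profiles.values()) / num_proxies   # n_p
        return math.ceil(num_proxies * avg_pages / ((i_max + i_min) / 2))

    profiles = {"p1": {"a", "b"}, "p2": {"b", "c"}, "p3": {"d"}}
    print(build_hypergraph(profiles)[0], initial_cluster_count(profiles, i_min=2, i_max=4))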

Some necessary data structures are maintained to facilitate the partition process; they are built by a scan of all proxies' messages. A proxy table, a page table, and a cluster table are built for all proxies, for all web pages frequently accessed by some proxy, and for all web page clusters, respectively. Without loss of generality, we assume that each proxy (page / cluster) can be uniquely identified by its ID, which is a number between 1 and N_p (N_w / N_c), where N_p (N_w / N_c) is the total number of proxies (frequently accessed web pages / web page clusters); in a general scenario where this assumption does not hold, a hash table can always be employed to provide a fast mapping from a proxy ID (page ID / cluster ID) to its associated information. In the proxy table, each slot, referred to as a proxy bucket (Figure 3(a)), consists of three components: the proxy ID, a linked list of the web pages this proxy frequently accesses, and a linked list of the web page clusters to which this proxy has an affinity. In the page table, each slot, referred to as a page bucket (Figure 3(b)), maintains the following information:

1. the page ID;
2. a linked list of the proxies which frequently access this page;
3. a linked list of the web page clusters which contain this page;
4. gainarray: an array of buckets, each of which contains the gain of some action.

The action on a page can be one of the following: move to another cluster, replicate in another cluster, or remove the replica from this cluster. Since the size of the information group corresponding to a cluster cannot exceed I_max, once a cluster PG_i grows to the point where its estimated information group size reaches I_max, pages in other clusters are not allowed to move to PG_i or to be replicated in PG_i until some page in PG_i is removed from PG_i.

Every time an action is taken on a page A, the gains of the relevant pages (those pages which are also frequently accessed by the proxies which frequently access A) are updated. For example, pages B and C are relevant to A in Figure 1 since they are also frequently accessed by Proxy 1, which accesses A frequently. In general, as shown in Figure 4(a), when a web page A moves from cluster PG_i to PG_j, the information group for PG_j tends to grow, since proxies which have an affinity for A may join this information group; on the other hand, the information group for PG_i tends to shrink, since proxies which have an affinity for A but not for other web pages in PG_i will withdraw. Similarly, given that A ∈ PG_i1 ∩ PG_i2 ∩ ... ∩ PG_ik (Figure 4(b)), replicating A into PG_j (j ≠ i1, i2, ..., ik) causes the information group for PG_j to grow and the information groups for PG_i1, PG_i2, ..., PG_ik to shrink, for the following two reasons.

[Figure 4: Moving and Replicating a Page — (a) move A from PG_i to PG_j; (b) replicate A into PG_j.]

1. Before the replication, all proxies which have an affinity only for A and for some other pages in the set PG_j ∩ (PG_i1 ∪ ... ∪ PG_ik) (the light shaded area in Figure 4(b)) join at least one of the information groups for PG_i1, PG_i2, ..., PG_ik. After the replication, some of these proxies may join PG_j instead. Note that, in this case, the average number of proxies in each information group is unchanged, since the replication only causes some proxies to move from the information groups for PG_i1, PG_i2, ..., PG_ik to that for PG_j.

2. Those proxies which have an affinity for A and for PG_j but not for any other web pages in PG_i1 ∪ ... ∪ PG_ik (the dark shaded area in Figure 4(b)) no longer join the information groups for PG_i1, PG_i2, ..., PG_ik; they now only need to join the information group for PG_j. In other words, the information groups for PG_i1, PG_i2, ..., PG_ik shrink while the size of the information group for PG_j stays the same, and as a consequence the average number of proxies in an information group decreases.

Therefore, replication tends to reduce the average number of proxies in each information group. Besides the estimated information group size of PG_j, another criterion for deciding whether A should be replicated into a cluster PG_j is that the estimated information group sizes of PG_i1, PG_i2, ..., PG_ik and PG_j must each remain at least I_min if the replication is performed.
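The following much-simplified sketch shows the gain computation behind a pass: the objective (average number of groups a proxy must join) is evaluated with the same greedy cover a proxy uses in Section 6, and the gain of a move/replicate/unreplicate action is simply the objective before minus after. The real algorithm maintains incremental gain arrays in the page buckets instead of recomputing, and the I_min/I_max bookkeeping is omitted here.

    # Simplified sketch of the gain computation behind the move-based clustering.
    # clusters: {cluster_id: set of pages}; profiles: {proxy_id: set of pages}.
    # Gains are recomputed from scratch here for clarity; the paper's algorithm keeps
    # per-page gain arrays and updates them incrementally after every action.
    import copy

    def groups_needed(profile, clusters):
        # Greedy cover: how many clusters a proxy joins to cover its profile (Sec. 6).
        left, used = set(profile), 0
        while left:
            best = max(clusters.values(), key=lambda c: len(c & left))
            if not (best & left):
                break                          # some profile pages are in no cluster
            left -= best
            used += 1
        return used

    def objective(clusters, profiles):
        # Average number of information groups a proxy has to join.
        return sum(groups_needed(p, clusters) for p in profiles.values()) / len(profiles)

    def apply_action(clusters, kind, page, src=None, dst=None):
        # kind is "move", "replicate", or "unreplicate" (remove the replica in src).
        trial = copy.deepcopy(clusters)
        if kind in ("move", "unreplicate"):
            trial[src].discard(page)
        if kind in ("move", "replicate"):
            trial[dst].add(page)
        return trial

    def gain(clusters, profiles, *action):
        # Reduction of the objective if the action is taken (may be negative).
        return objective(clusters, profiles) - objective(apply_action(clusters, *action), profiles)

    # Toy example with hypothetical pages A-E and three proxies:
    clusters = {1: {"A", "B"}, 2: {"C", "D"}, 3: {"E"}}
    profiles = {"p1": {"A", "C"}, "p2": {"A", "E"}, "p3": {"C", "D"}}
    print(gain(clusters, profiles, "replicate", "C", None, 1))   # replicate C into cluster 1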
Performance measurements of our proposed web page partitioning algorithm are shown in Table 3. Here the centralized computing site S is an UltraSPARC workstation with 200 MB of main memory and a single 366 MHz CPU. Note that the efficiency comes partially from the fact that a new partition is always generated incrementally, by taking the previous partition as the initial partition. This partition process does not significantly impact the performance of other workloads on the central site, and it can be further optimized in the following ways. First, the entire partitioning process can be done offline or run as a background process: if the central site is heavily loaded, it can postpone the computation on the frequently accessed web pages until it is idle or lightly loaded. Second, the gainarray computation can be done in parallel, so multiple machines can be used to share the workload of the partitioning process.

After the page clusters are formed, a server is chosen by S to be the coordinator of each information group. Usually, the coordinator of an information group is the server which owns the most pages in the corresponding page cluster. Finally, the contents of all page clusters and their coordinators are broadcast to all proxies. (An alternative would be for the central site S to determine the information groups a proxy may join and send the proxy only the related information; however, the proxy would then lose the opportunity to join other information groups when its profile changes, for lack of information about the other available web page clusters and their associated information groups.) Each proxy matches its profile against the page clusters and sends message(s) to the coordinator(s) of the page cluster(s) for which it has an affinity, in order to join the corresponding information group(s). Since the contents of all page clusters could be large, we use a Bloom filter [3] [15] to encode the contents of the page clusters. A Bloom filter is a method for representing a set A = {a_1, a_2, ..., a_n} of n keys so as to support membership queries. The idea is to allocate a vector v of m bits, initially all set to 0, and then choose k independent hash functions h_1, h_2, ..., h_k, each with range {1, ..., m}. For each key a ∈ A, the bits h_1(a), h_2(a), ..., h_k(a) of v are set to 1. Given a query key b, we check the bits h_1(b), h_2(b), ..., h_k(b): if any of these bits is 0, then b is not in the set A; otherwise, we conjecture that b is in the set, although there is a small probability that we are wrong. [11] reports that when the number of hash functions is five and ten bits are used to encode an entry, the error (false positive) rate is about 0.9%.
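A minimal Bloom filter sketch matching the description above (an m-bit vector, k hashed positions per key, one-sided error); the salted-SHA-1 hashing used here is our own choice, not necessarily that of [3] or [11].

    # Minimal Bloom filter: k hashed positions in an m-bit vector are set for every
    # inserted key; queries may yield false positives but never false negatives.
    import hashlib

    class BloomFilter:
        def __init__(self, m_bits, k_hashes):
            self.m, self.k = m_bits, k_hashes
            self.bits = bytearray((m_bits + 7) // 8)

        def _positions(self, key):
            for i in range(self.k):
                digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest, "big") % self.m

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    # Encode the locally cached pages of one page cluster with ~10 bits per entry and
    # 5 hash functions -- the setting for which [11] reports roughly 0.9% false positives.
    cached = ["http://www.acm.org/a.html", "http://www.acm.org/b.html"]
    bf = BloomFilter(m_bits=10 * len(cached), k_hashes=5)
    for url in cached:
        bf.add(url)
    print(cached[0] in bf, "http://www.acm.org/other.html" in bf)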

6 INFORMATION GROUP MAINTENANCE

Information groups are formed based on page clusters: each information group is associated with one page cluster, and a proxy may join multiple information groups based on its profile and the contents of these page clusters. The previous section described a periodically executed algorithm that globally forms the page clusters. The manager broadcasts the page clustering information (i.e., the coordinator of each cluster and the content of each cluster) to all participating proxies. Then each proxy, based on its profile, determines which page cluster(s) it has an affinity for, and it may join the information groups for the interesting page clusters. There may be several combinations of information groups a proxy can join; the choice is based on the number of matches between the proxy's profile and the content of a page cluster. We employ a greedy algorithm to determine which information groups a proxy should join. First, the proxy finds the page cluster which contains the maximum number of web pages in its profile and joins the information group for that page cluster. Next, the proxy finds another page cluster which contains the maximum number of web pages in the remainder of its profile and joins the information group for this second page cluster. This process continues until the proxy has joined information groups covering all web pages in its profile.
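This greedy choice can be written down directly; here the cluster contents and the profile are plain sets, whereas in the real system the cluster contents arrive Bloom-filter encoded.

    # Greedy selection of information groups for one proxy: repeatedly join the group
    # whose page cluster covers the most still-uncovered pages of the profile.
    def groups_to_join(profile, page_clusters):
        """profile: set of frequently accessed URLs.
        page_clusters: {cluster_id: set of URLs}.  Returns the chosen cluster ids."""
        uncovered, chosen = set(profile), []
        while uncovered:
            cid, cluster = max(page_clusters.items(), key=lambda kv: len(kv[1] & uncovered))
            if not (cluster & uncovered):
                break                      # remaining pages belong to no cluster
            chosen.append(cid)
            uncovered -= cluster
        return chosen

    # Hypothetical example: the proxy covers its profile with two groups.
    clusters = {"g1": {"a", "b", "c"}, "g2": {"c", "d"}, "g3": {"e", "f"}}
    print(groups_to_join({"a", "b", "c", "d"}, clusters))   # -> ['g1', 'g2']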
Note that the proxy profile may change dynamically to reflect changes in its clients' access affinities. Whenever the profile changes, the proxy may choose to withdraw from some information groups and/or join some other information groups to accommodate the change. It is therefore possible that, at a given time, an information group contains more than I_max or fewer than I_min proxies due to the changes of proxy affinities. In such a case, a local reorganization procedure can be performed on the web pages in the web page clusters associated with the oversized and/or undersized information group(s); due to space limitations, we do not elaborate on this here.

If a proxy wants to join an information group, it sends a message to the coordinator of that information group. The coordinator keeps a record of which proxies are in the information group, and it sends the new proxy the list of members of the group. The new proxy then sends the intersection of its cache contents and the corresponding page cluster to all members of the information group, and in turn all existing proxies in the information group tell the new proxy their cache contents that intersect the corresponding cluster (these messages can be sent via multicast). If a proxy wants to withdraw from an information group, it sends a multicast message to all members (including the coordinator) of the information group to notify them of the change.
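A compact sketch of this join/withdraw flow follows; the class names and in-memory bookkeeping are illustrative only, and the "multicasts" are modelled as direct updates to the peers' state.

    # Sketch of the join/withdraw flow.  The coordinator tracks membership; a joining
    # proxy announces the intersection of its cache with the group's page cluster, and
    # the existing members report theirs back.
    class Coordinator:
        def __init__(self, cluster_pages):
            self.cluster_pages = set(cluster_pages)
            self.members = set()

        def join(self, proxy):
            self.members.add(proxy)
            return set(self.members)           # member list returned to the new proxy

    class Proxy:
        def __init__(self, name, cache):
            self.name, self.cache = name, set(cache)
            self.remote_status = {}            # peer name -> pages it caches from this cluster

        def join_group(self, coordinator):
            members = coordinator.join(self)
            summary = self.cache & coordinator.cluster_pages
            for peer in members - {self}:
                peer.remote_status[self.name] = summary                        # our announcement
                self.remote_status[peer.name] = peer.cache & coordinator.cluster_pages

        def withdraw(self, coordinator):
            coordinator.members.discard(self)
            for peer in coordinator.members:
                peer.remote_status.pop(self.name, None)                        # withdrawal notice

    coord = Coordinator(cluster_pages={"a.html", "b.html"})
    p1, p2 = Proxy("p1", {"a.html", "x.html"}), Proxy("p2", {"b.html"})
    p1.join_group(coord)
    p2.join_group(coord)
    print(p1.remote_status)   # {'p2': {'b.html'}}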

Table 3: Web Page Partition Cost

             β        CPU Time on S (min)   Avg. CPU Time on Each Proxy (min)   Avg. Message Size (KB)
  DEC        0.005%   30                    10                                  122
             0.01%    20                    10                                  47
             0.05%    6                     1                                   3
  ClarkNet   0.005%   42                    7.5                                 146
             0.01%    24                    4                                   51
             0.05%    5                     0.5                                 4
  L          0.005%   38                    5                                   101
             0.01%    20                    3                                   33
             0.05%    4                     0.35                                2

[Figure 5: Cache hit rate versus the information group size, for summary cache and for our scheme with β = 0.005%, 0.01%, and 0.05%, on the DEC, L, and ClarkNet traces.]

Figure 5 shows the average cache hit ratio as a function of the information group size. In our simulations, when the contents of a proxy's cache for a page cluster change by more than 10%, the proxy multicasts the changes to the other proxies in the information group for that cluster. We choose 10% as the threshold because experiments show that the stale hit rate becomes large otherwise [11]. For summary cache, when the overall cache content of a proxy changes by more than 10%, the proxy broadcasts the updates to all proxies; in our proposed scheme the updates are only multicast to the proxies in the associated information group(s). When β decreases, the cache hit ratio increases, because more pages are clustered and proxies exchange information on more web pages, so it is more likely that a proxy trying to retrieve a web page can fetch it from another proxy instead of the server. Moreover, with β = 0.005% and a reasonable information group size (i.e., 10 for the DEC trace and 60 for the L and ClarkNet traces), the cache hit rate of our scheme is similar to that of summary cache, because the web pages accounting for about 60% of the requests are analyzed and clustered. The remaining web pages are not "hot", and requests for these non-clustered pages are therefore likely to result in a cache miss even under summary cache. In addition, for the "hot" pages, collaborative caching among a relatively small number of proxies is sufficient to maintain a high cache hit rate; coordinating a large number of proxies in this process (i.e., the summary cache scenario) would incur large overhead but little improvement in the cache hit rate. We explore this issue further in the next section.

7 WEB PAGE RETRIEVAL

In this section, we first discuss how the contents of the cache at one proxy propagate to other proxies, and then we present the detailed procedure for the actual retrieval of web pages from the server or another proxy.

A proxy may join several information groups. When the cache contents of a proxy for one page cluster have changed by more than a threshold δ (measured as a percentage), the proxy multicasts the changes to all proxies in the information group for that page cluster, encoded with a Bloom filter. This procedure is similar to that used in summary cache by Fan et al. [11], except that in our approach the scope of consideration is each individual page cluster and its corresponding information group: when the cache contents for one page cluster change by more than δ at a proxy, the proxy multicasts the changes (related to that page cluster) only to the other proxies in the associated information group. In addition, this multicast message can be delivered at a lower priority than other packets to reduce its impact on normal Internet traffic (we find empirically that the resulting small delay in updating cache status information has little impact on the overall cache hit ratio). When 1% ≤ δ ≤ 10%, the stale cache hit ratios of summary cache and of our scheme (with β = 0.005%, 0.01%, 0.05%) are very similar, between 9% and 12% for the three traces; a similar result is reported in [11].

In the summary cache scheme, each proxy needs to maintain information about all other proxies' cache contents, whereas in our scheme a proxy only needs to keep information about the collaborating proxies' cache contents. Figure 6 shows the size of the cache status information maintained by summary cache and by our scheme; for both schemes, the pages are encoded with a Bloom filter and each entry consumes 8 bits. It is clear that summary cache requires much more maintenance effort than our caching scheme, especially when the number of proxies is large. The ClarkNet traces have the largest number of proxies, and in summary cache the size of the cache status information of other proxies ranges up to several GB, which could severely impact a proxy's ability to serve pages to its clients. In contrast, as the number of proxies or the cache size on each proxy increases, the size of the cache status information in our proposed scheme does not grow as fast as in summary cache; this is because each proxy only receives the cache status of the other proxies in the same information group(s), and only the cache status of the web pages in the corresponding page cluster(s) is maintained.

Now we explain how a page is retrieved. Assume a client attached to proxy PR_X sends a request for a particular web page to PR_X. If PR_X cannot find the web page in its local cache, it decides from which remote proxy, or from the server, it will fetch the web page. During this process, PR_X first finds which page cluster the web page belongs to by examining the Bloom filters. If the page does not belong to any cluster, then PR_X fetches the page directly from the server. Otherwise, PR_X finds the information group(s) for the cluster(s) (the page may belong to multiple clusters). PR_X then ranks all proxies in the information group(s), together with the server of that particular web page, based on distance; we use the round-trip latency to estimate the distance between two proxies [17] [16].
This does not require extra overhead, because the round-trip time can be obtained when the updates of cache contents are sent out and the acknowledgements of the updates are received. PR_X forwards the request to the "nearest" site (proxy or server) which has the page. If the remote proxy does not have the web page, because of either (1) cache invalidation (a stale hit) or (2) an error of the Bloom filter (a false hit), then PR_X forwards the request to the second "nearest" site (proxy or server). This process continues until the web page is retrieved; it is guaranteed to terminate eventually because the server always has the web page.
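A sketch of this nearest-first retrieval loop; the latency table and the fetch_from callback are placeholders for the measured round-trip times and the actual HTTP fetch, and a None return stands for a stale or false hit.

    # Sketch of nearest-first retrieval: rank the candidate holders of the page (group
    # members whose summaries claim it, plus the origin server) by measured round-trip
    # latency and try them in order; a stale or false hit falls through to the next site.
    def retrieve(page, group_members, server, rtt, fetch_from):
        """rtt: {site: round-trip latency}; fetch_from(site, page) returns the page
        bytes, or None on a stale hit / Bloom filter false hit."""
        candidates = sorted(set(group_members) | {server},
                            key=lambda s: rtt.get(s, float("inf")))
        for site in candidates:
            data = fetch_from(site, page)
            if data is not None:
                return site, data
        raise RuntimeError("unreachable: the origin server always has the page")

    # Example with made-up latencies and an in-memory stand-in for the network:
    rtt = {"proxyA": 0.02, "proxyB": 0.05, "origin": 0.20}
    store = {"proxyB": {"x.html": b"<html>..."}, "origin": {"x.html": b"<html>..."}}
    fetch = lambda site, page: store.get(site, {}).get(page)
    print(retrieve("x.html", ["proxyA", "proxyB"], "origin", rtt, fetch))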

[Figure 6: Size of cache status information (MB, y-axis in log scale) versus the cache size (GB) at each proxy, for summary cache and for our scheme with β = 0.005%, 0.01%, and 0.05%, on the DEC, L, and ClarkNet traces.]

Figures 7 and 8 show the average number of network messages per user request and the average size of the network messages per user request. Here each message processed by a proxy counts as one message; for example, if a multicast message is sent to 20 proxies, then the message is counted 20 times. Our proposed scheme not only generates a much smaller number of network messages per user request, but the messages are also much smaller, especially when the number of proxies is large (the L and ClarkNet traces), because of the cache update information. Figure 9 shows the average latency for a user request. In the DEC traces there are only 16 proxies in total, and our proposed scheme has no significant advantage or disadvantage over summary cache. However, when the total number of proxies is large, our proposed scheme can have a significant benefit over summary cache because of the smaller amount of network traffic seen by each proxy. The average latency of our proposed scheme varies significantly as a function of the information group size m: when m is too small, the cache hit rate is low, but the network traffic is also low because each proxy only needs to notify a small number of proxies; when m is large, the cache hit ratio is high but so is the network traffic. From the experiments, we found that 50 to 70 is the appropriate range of m for the trace data in our tests. Moreover, from Figure 9 we can see that the scalability of our proposed scheme is better than that of summary cache: the response time of the summary cache scheme increases from the L trace to the ClarkNet trace because of the increase in the number of proxies (about fourfold), whereas the response time of our scheme actually decreases because of the utilization of proxy affinities.
8 ESTIMATION OF INFORMATION GROUP SIZE

It is clear that the information group size plays a decisive role in our scheme and is a prerequisite for partitioning the page clusters. In this section, we propose a method to estimate the optimal information group size. As explained in Section 3, besides m, the following parameters also participate in the cost function.

δ: every time a proxy's cache content changes by δ%, the proxy multicasts a description of the changes to the other proxies in the information group; we set δ = 10%.

q: the average number of outgoing requests (from a proxy) which cause this δ% change.

t_s: the average time for a proxy to generate the description of its cache changes and send it out as a multicast message.

t_n: the average network overhead incurred by transmitting this multicast message. Because our scheme requires a much smaller number (and size) of multicast messages for updating cache status changes, and these messages are delivered at lower priority, no significant delay in the delivery of other packets is incurred; we therefore omit this term from the cost function (i.e., t_n = 0).

t_r: the average time for a proxy to receive such a multicast message and update its local information.

Locating Cost: the average time to find whether (and where) an up-to-date copy of the requested web page exists in the information group.

ω: remote cache hit ratio, the probability that an outgoing request from a proxy will be served by another proxy in the information group.

Remote Cost: the average time to retrieve a cached web page from another proxy in the information group.

Server Cost: the average time to serve a request from the content server.

Therefore, we need to obtain the values of these parameters in order to calculate the optimal m. Fortunately, all of them can easily be collected by each proxy and reported to the central site S together with the proxy profile, except for the first time the web page clusters are constructed. At that first time, since no information group exists yet, some parameters have to be estimated from other available information. Each proxy can still collect the values of q, t_s, t_r, and Server Cost, because introducing information groups does not significantly influence these parameters; however, the values of Locating Cost, ω, and Remote Cost depend heavily on the formation of the information groups. We employ some heuristics to estimate them. For each page