Technical papers Web caches

What is a web cache?

In their simplest form, web caches store temporary copies of web objects. They are designed primarily to improve the accessibility and availability of this type of data to end users. Caching is not an alternative to increased connectivity, but instead optimises the usage of available bandwidth. After the initial access or download, schools can access a single locally stored copy of the content rather than repeatedly requesting the same content from the origin server. Content delivery, by contrast, works on the principle of delivering content to the local network before it is required, rather than the on-demand approach of normal caching.

This technical paper focuses on the practical issues surrounding caches; it looks at hardware and software solutions and at advanced features that provide shared services to users over a network.

How will a cache benefit my institution?

Caching minimises the number of times an identical web object is transferred from its host server by retaining copies of requested objects in a database or repository. Requests for previously cached objects result in the cached copy of the object being returned to the user from the local repository rather than from the host server. This results in little or no extra network traffic over the external link and increases the speed of delivery. Caches are limited by the amount of disk space: when a cache is full, older objects are removed and replaced with newer content. Some systems may implement 'persistency' measures, however, to preserve certain types of content at the discretion of the administrator.

Example: A school has a 10Mbps local area network (LAN) and a 128Kbps ISDN connection to the Internet, so the local network is roughly 80 times faster than the Internet connection. Consider a class situation where a suite of computers is trying to download a large graphic, perhaps 256KB in size.
This would take each computer in the suite at least 16 seconds to download across the 128Kbps connection (128Kbps = 16KBps). If a cache is implemented on the local network, the cache computer will download a single copy of the graphic at a maximum speed of 128Kbps, and then pass this on to each computer over the high-speed LAN connection at 10Mbps. Across a 10Mbps connection (10Mbps = 1,250KBps), the transfer would take approximately a fifth of a second. In practice, transfer rates will be lower than these figures once network overhead is allowed for.

How does a web cache work?

The flowchart below illustrates what happens when a user requests a web page. The thicker lines represent the normally higher-speed local connections between the client and the cache, while the thinner lines represent the slower connection speeds over the Internet.

Becta 2004 Valid at September 2004 page 1 of 8
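The request flow described above, together with the download-time example, can be sketched as follows. This is a minimal illustration rather than how any particular product works: the class name `SimpleCache`, the least-recently-used eviction rule, the store-and-forward timing on a miss and the URL are all assumptions made for the example, and the speeds are ideal figures with no network overhead.

```python
from collections import OrderedDict

LINK_KBPS = 128     # Internet link speed from the example, in kilobits per second
LAN_KBPS = 10_000   # LAN speed from the example (10Mbps), in kilobits per second


def transfer_seconds(size_kb: float, link_kbps: float) -> float:
    """Ideal time to move size_kb kilobytes over a link rated in kilobits/s."""
    return size_kb * 8 / link_kbps


class SimpleCache:
    """Toy cache: remembers object sizes by URL, evicting the least recently used."""

    def __init__(self, capacity_kb: float):
        self.capacity_kb = capacity_kb
        self.store = OrderedDict()  # url -> size in KB, least recently used first

    def request(self, url: str, size_kb: float) -> float:
        """Return the ideal delivery time in seconds for one client request."""
        if url in self.store:
            # Hit: the local copy is served over the LAN only.
            self.store.move_to_end(url)
            return transfer_seconds(size_kb, LAN_KBPS)
        # Miss: fetch once over the Internet link, keep a copy, then deliver it.
        self.store[url] = size_kb
        while sum(self.store.values()) > self.capacity_kb and len(self.store) > 1:
            self.store.popitem(last=False)  # cache full: drop the oldest object
        return transfer_seconds(size_kb, LINK_KBPS) + transfer_seconds(size_kb, LAN_KBPS)


cache = SimpleCache(capacity_kb=1024)
times = [cache.request("http://example.org/graphic.png", 256) for _ in range(3)]
print([round(t, 2) for t in times])
```

Run as written, the first request takes about 16.2 seconds (16 seconds over the Internet link plus 0.2 seconds over the LAN), and the two repeat requests take about 0.2 seconds each.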

Where are web caches used?

Caches may be installed in different locations on networks for a variety of reasons:

Local caches are the most common type; they sit on the edge of the LAN just before the Internet connection. All outbound web requests are directed through them in an effort to fulfil web requests locally before passing traffic over the Internet connection.

ISP caches are used on the networks of most Internet Service Providers (ISPs). They provide customers with improved performance and conserve bandwidth on the ISP's own external connections to the Internet.

Reverse caches are used to reduce the workload of content providers' web servers. The cache is positioned between the web server and its Internet connection, so that when a remote user requests a web page, the request must first pass through the cache before reaching the web server. If the cache has a stored copy of the requested item, it delivers it directly rather than passing the request through to the web server.

This document concentrates on local caches, although most of the information applies to all caches. The diagram below shows the different positions that caches can occupy on networks. As a request for information passes from the LAN to the content provider, it passes through several caches, each trying to fulfil the request from its own repository. Sometimes a request will never reach the content provider's host web server, instead being fulfilled by a cache somewhere en route.

What are the advantages and disadvantages of caching?

Advantages:

Fast performance on cached content: if content is already in the cache it is returned more quickly, even for multiple users wanting to access the same content.

Improved user perception and productivity: quicker delivery of content means less waiting time and increased user satisfaction with the performance of the system.

Less bandwidth used: if content is cached locally on the LAN, web requests do not consume Internet connection bandwidth.

User monitoring and logging: if a cache manages all web requests (behaving in some ways like a proxy), a centralised log can be kept of all user access. Care must be taken that any information held is in accordance with appropriate privacy regulations and the institution's policy.

Caching benefits both the single end user and the content providers: ISPs and other users of the same infrastructure also benefit greatly from the reduction in bandwidth usage.

Disadvantages:

Slower performance: if an object is not cached, an extra layer is added to the process, which adds time.

Subscription sites may become confused: some subscription services use IP addresses for authentication, and requests relayed by a cache arrive from the cache's address rather than the user's. The advent of dynamic client bypass technology, which passes the user's original IP address to the host server, coupled with an increase in the use of other methods of authentication by content providers, means this is becoming less common, however.

Additional hardware or expertise may be required: any new system will potentially require extra hardware and software resources, with ongoing support needed after installation.

Dynamically generated content cannot be cached: the results of CGI scripts or certain types of database content are increasingly common on the World Wide Web, but generally cannot be cached.

What is the difference between transparent and non-transparent caches?

Caches can differ according to their so-called transparency, which will affect the degree of configuration required for a network during device installation.

Transparent caches do not require any settings to be changed on individual client machines. Instead, the network router or switch is configured to forward all requests automatically through to the cache. This has the advantage of allowing a cache to be easily introduced and removed without reconfiguring the client computers. However, it can generate confusing error messages if a page is not found, and it can make problems difficult to locate.

Non-transparent caches require the settings on each client computer to be changed to point at the appropriate cache. In this case, error messages will normally show clearly if a problem is with the cache itself. However, should a change of cache server be required, perhaps for maintenance reasons, the clients may have to be reconfigured with the new cache's information.

How well will a cache work in a classroom situation?

Caches can enhance the ways in which the Internet is used in the classroom.
Teachers can pre-load a cache with particular web sites in advance of a lesson, either by simply visiting the required sites with a computer that uses the cache or by having the content pre-positioned into the cache by a management system. For example, if a school buys content from a commercial provider, it might opt to pre-load or copy it to the local cache in advance of using it in lessons. When teachers and pupils want to use this content, they then access it from the LAN rather than from the Internet. This gives fast, high-quality access without delays and without large numbers of students having to share an Internet connection. The worst-case scenario in a cached environment is that the first user to request a page will experience a slightly longer delay than normal.

Some solutions, however, do not have the capability to cache multimedia content such as real-time streaming media files. It is possible for more advanced cache solutions to reduce bandwidth by requesting only one stream from the host web server and then splitting that stream to many computers on the LAN. Multimedia content stored in static files, where the whole file must be downloaded before it can be played, will in most cases be cached as normal.

How do I install a cache?

Installing a web cache on a LAN is relatively straightforward. An additional computer system or dedicated appliance is connected to the LAN, and the clients or router are configured, if required, to access this system. The cache itself is a software program that runs either on its own dedicated hardware or as one of many programs on a shared server. Microsoft's Internet Security and Acceleration (ISA) Server is based on Windows 2000 and can run on its own or on a Windows 2000 server with other software. Similar arrangements are possible with Linux-based systems running software such as Squid. The main alternative would be a discrete hardware-based solution optimised for this specific role. Examples of such a solution include Volera or a Cisco unit.

The appropriateness of each solution depends on a number of factors, including the number of simultaneous users, available bandwidth and available resources. Functionality can also vary: developments in this area are concerned with moving away from just caching static HTML pages towards accelerating the whole web experience.

An institution should be able to function adequately with a single server of reasonable specification (for Linux, a Pentium II with sufficient memory; versions of Squid are available for NT, and the server specification should be increased according to the supplier's instructions). Specialist solutions including multiple servers need only be considered for LEA-wide services and larger. The National JANET Web Cache Service has an article on sizing servers using the Squid proxy on Linux. This service runs approximately 40 servers for the HE and FE community, and requests regularly exceed one million a day.

Example: In a single small or medium-sized institution, a basic web cache system could be easily implemented on a Pentium II processor with 64MB-256MB of memory and 2GB-20GB of hard disk storage using Linux and the Squid caching software. The minimum requirements for the Microsoft Internet Security and Acceleration Server are a Pentium III processor with 256MB of RAM. For larger or more complex installations, it is sensible to consult a network system specialist.
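Once a non-transparent cache is installed, the pre-loading approach described under classroom use amounts to requesting each lesson page once through the cache. The sketch below shows one way this might be scripted with Python's standard `urllib.request` module. The cache address is a hypothetical placeholder (port 3128 is Squid's default), the function names are invented for the example, and whether a page is actually stored depends on the cache's own rules and on the content being cacheable.

```python
import urllib.request

# Hypothetical address of the school's cache; replace with the real one.
CACHE_URL = "http://cache.example.sch.uk:3128"


def build_cache_opener(cache_url: str = CACHE_URL) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP requests via the given cache."""
    proxy = urllib.request.ProxyHandler({"http": cache_url})
    return urllib.request.build_opener(proxy)


def preload(urls, opener, timeout: float = 30.0) -> dict:
    """Request each URL once through the cache so that a copy can be stored.

    Returns a mapping of URL to the number of bytes read, or to an error
    message if the request failed.
    """
    results = {}
    for url in urls:
        try:
            with opener.open(url, timeout=timeout) as response:
                results[url] = len(response.read())
        except OSError as exc:  # URLError and socket errors derive from OSError
            results[url] = f"failed: {exc}"
    return results
```

A teacher or technician could call `preload(lesson_urls, build_cache_opener())` before a lesson, or schedule it out of hours, and check the returned mapping for failures.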
It is possible to connect caches together to improve the efficiency of the service and provide multiple layers. To do this, the onward ISP should be consulted.

What are the costs and cost savings of a cache?

It is difficult to put a precise cost on the benefits of a cache service, as its success will depend on the nature of the users. For example, in a sixth-form environment, where many students are looking at different pages, the benefits are less obvious than in a school environment, where groups or classes access the same material simultaneously. In the latter example, it could be said that cache solutions multiply bandwidth and, it follows, provide a kind of cost saving, as they are providing users with a service equivalent to a higher bandwidth connection.

The costs of implementing a solution are equally variable and will include purchase of hardware and software, installation and maintenance. The cheapest cache solution in terms of capital cost is likely to be a Linux-based solution, which uses free software and can run on reasonably low-specification hardware. The total cost of ownership should be considered with any system: although most server software is reasonably reliable and will run for long periods without any attention, costs of maintenance and support may be a factor. Some vendors provide systems offering caching facilities on a rental basis where, instead of purchasing hardware, an annual fee is payable to cover installation and ongoing support for the service.

Implementing a cache with good management and reporting facilities can identify usage patterns, cache effectiveness and bandwidth consumption. If the administrator uses these reports effectively, they will show how much of the Internet connection's bandwidth is being used and whether the current connection is meeting demand for that site.

What other functions do caches have?

Caches have progressed from being merely software applications that control a store of information to being managed appliances designed specifically for content delivery. Some of the advanced functions are described below:

Stream splitting is when a stream of data from a host server is divided at the cache for transmission to multiple LAN computers. If five users request a 1Mbps stream of video from, say, a BBC web site without a stream-splitting-enabled cache, this would take up 5Mbps of the Internet connection bandwidth. With stream splitting, this would take up only 1Mbps of the Internet connection bandwidth.

Content filtering functions can be integrated with cache software so that a cache can block access to certain web sites depending on their content. Filtering normally adds to the cost of the cache, but it does reduce the bandwidth consumed by not allowing access to inappropriate sites.

WCCP (Web Cache Communication Protocol) is a protocol that transparently routes all web traffic to the local cache before it leaves the LAN. It also provides extra features such as load balancing and multicasting, as well as certain security functions.

Scalability of caches and linking caches together can improve performance greatly. If one cache knows what another cache has in its repository, it can redirect requests to that cache as and when required.

Overload bypass is a feature that allows the cache to pass traffic that it is too busy to deal with straight to web servers, rather than have those requests for information held in a queue at the cache.

As caches increase in intelligence and complexity, they offer increased reporting and management functionality. Logs and reports can be produced for each user, for each web site visited, the time of visit etc.
These reports can be used to assist in the efficient running of a cached environment and to optimise available bandwidth.

Pre-positioning is the downloading of content to the cache appliance before the user requests it. This becomes more crucial when accessing video and rich media clips. The burden is taken off the network during peak usage hours if this download can be scheduled to occur out of hours.

How does a cache differ from a proxy?

A cache server is not the same as a proxy server. Cache servers have a proxy function with regard to requests for certain content from the World Wide Web. When a client passes all its requests for web objects via a cache, this cache is effectively acting as a proxy server. Caching is a common function of proxy servers. Proxy servers perform a number of other functions, too, mainly centred on security and administrative control. Broadly speaking, a proxy server sits between a number of clients and the Internet. Any requests made to the Internet from a LAN computer are forwarded to the proxy server, which will then make the requests itself.

The key differences between a proxy and a cache are:

A proxy server will handle more requests than just those for web content.

A proxy server does not by default cache any data that passes through it.

There are certain security benefits based on the fact that proxy servers hide other computers on the network from the Internet, making it much harder for individual machines to be targeted for attack. The requirement for 'public' IP addresses is also removed, so that any number of computers can share one public address that is configured on the proxy, rather than each computer needing a unique IP address. This has implications for video conferencing and other point-to-point applications, which might require some additional resource or configuration.

What standards are there for caches?

CERN is a standard for application-aware proxy services over HTTP-based client/server communications. A CERN server is slow and not suitable for heavy traffic.

ICP is the Internet Cache Protocol, which exchanges data between caches about the existence of stored information.

WCCP is a Cisco router control protocol that transparently routes TCP port 80 packets to cache appliances and incorporates value-added features such as load balancing, security features and multicasting.

What about storing multiple copies of content in a cache?

Storing multiple copies of the same file on multiple caches can create licensing problems. Although CERN caching has been around since 1993, different content providers have varying opinions on the caching of their material. Caching servers do not encourage or assist in breaking copyright.

Multiple caches may be combined and managed in order to provide a more efficient and scalable content delivery service. The details of this are outside the scope of this information sheet; however, the principles are worth a mention.
Where there are multiple but unco-ordinated caches, each cache operates independently by checking its own repository and then passing requests on towards the actual host server. It is reasonable to assume that the first request for content will be delayed at a number of stages on its route between client and host server, as different caches check their own stores, pass requests on, and make a copy of any information returned.

A fully managed system makes it possible for the LAN cache to query a number of other caches or a management appliance for the requested content. If this content is found on another cache, the download can take place from this location rather than the host server. This works well where multiple sites are linked to one central site that has combined caches, as all the central caches will be queried for the content rather than just one.

What should I be looking for when purchasing caching equipment?

For effective content delivery in schools, it is essential that the content delivery and caching service in a school is compatible with the equipment installed by the LEA or the RBC. Schools should consult Becta's Content Delivery Network Application Profile, which can be found at [http://getconnected.ngfl.gov.uk/index.php?s=network].
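The managed multi-cache lookup described earlier in this section, where a LAN cache asks other caches for content before going to the host server, can be sketched as follows. This is a toy model in the spirit of cache-to-cache queries such as ICP, not an implementation of that protocol; the `Cache` class, its method names and the URLs are assumptions made for the example.

```python
class Cache:
    """Toy cache holding objects in a dictionary keyed by URL."""

    def __init__(self, name, siblings=None):
        self.name = name
        self.store = {}                 # url -> content
        self.siblings = siblings or []  # other caches this one may query

    def has(self, url):
        """Answer a query from another cache without fetching anything."""
        return url in self.store

    def get(self, url, fetch_from_origin):
        """Return (content, source), preferring the nearest available copy."""
        if url in self.store:
            return self.store[url], self.name
        for sibling in self.siblings:
            if sibling.has(url):
                content = sibling.store[url]
                self.store[url] = content  # keep a local copy for next time
                return content, sibling.name
        content = fetch_from_origin(url)   # last resort: the host server
        self.store[url] = content
        return content, "origin"


central = Cache("central")
central.store["http://example.org/a"] = b"A"
school = Cache("school", siblings=[central])

print(school.get("http://example.org/a", lambda url: b"from origin")[1])
print(school.get("http://example.org/a", lambda url: b"from origin")[1])
```

Here the first request is answered from the central cache and the second from the school's own cache; the host server is contacted only for content that neither cache holds.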

Other sources of information

Software, hardware and service providers

Akamai [http://www.akamai.com/] A commercial multimedia cache company which hosts content around the world for content providers.
Equiinet [http://www.equiinet.co.uk/] Producers of integrated gateway devices that include cache functionality.
Microsoft Internet Security and Acceleration Server [http://www.microsoft.com/isaserver/default.asp] This has information on the current Microsoft solution.
Squid [http://www.squid-cache.org/] Cache software.
Cisco [http://www.cisco.com/uk] Provides information on cache and content delivery solutions.
Volera [http://www.volera.com/] Content acceleration and networking solutions.
RM SmartCache/RM SmartTracker [http://www.rm.com/primary/products/product.asp?cref=pd45222] Product specifically for schools. RM SmartTracker is a service that sits on RM SmartCache and allows schools to keep a close eye on individual users' web-browsing activities.

Reports and further reading

Cashing in on Caching [http://www.ariadne.ac.uk/issue4/caching/] Ariadne article.
Report on web caching now archived [http://www.jisc.ac.uk/acn/caching.html] JISC Report.
Web Caching and Content Delivery Resources [http://www.web-caching.com/] Developing institution policies.
Sizing Squid Caches [http://hermes.wwwcache.ja.net/servers/sizingsquid.html] JANET Server sheet.