An analysis of Web Server performance with an accelerator
S. Jha, F. Sabrina, B. Samuels, and M. Hassan
School of Computer Science and Engineering
The University of New South Wales
Sydney, NSW 2052 Australia
{sjha, farizas,

ABSTRACT
One of the most vexing questions facing researchers interested in the World Wide Web is how the performance of Web servers can be improved. In this paper we discuss the results of our investigation of the potential of an httpd accelerator (phhttpd) to improve Web server performance. The paper presents comparative results for running the Web server with and without an accelerator, and measures the effectiveness of the accelerator. As little quantitative study of httpd accelerators is available, this paper provides, through an empirical study, good insight into the potential of the httpd accelerator for improving Web server performance.

1. INTRODUCTION
With the explosive growth in the use of the World Wide Web, improving the performance of the Web has been the subject of much recent research. Server performance has become a critical issue for improving the Quality of Service (QoS) on the World Wide Web, and a good understanding of a Web server's behavior and performance issues is necessary to improve it. The performance of Web servers on traditional Unix systems is limited by the underlying operating system's forking process model. This model is very resource intensive, and Web server accelerators, or httpd accelerators, were introduced to improve performance [2]. On a Unix system, when a request is received the server must switch into kernel mode to get access to the hardware, and the buffer from the device driver in the kernel needs to be copied to the user process. In efficient TCP/IP stack implementations this buffer copy is done either once through the OS or not at all; less efficient systems may copy the buffer through several layers up to the user process. On top of this, the process needs some CPU time to execute the request.
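The per-request cost structure of the forking model can be made concrete with a minimal Python sketch. This is illustrative only: a real server such as Apache pre-forks a pool of workers and speaks HTTP, and the `handle` callable here is a hypothetical stand-in for request parsing and the reply.

```python
import os

def serve_request_forking(handle, request):
    """Toy sketch of the Unix fork-per-request model: the server forks a
    child to service each request while the parent waits to reap it.
    Every request thus pays for a fork, context switches between parent,
    child, and kernel, and kernel-to-user buffer copies."""
    pid = os.fork()
    if pid == 0:                      # child: service the request, then exit
        try:
            handle(request)
        finally:
            os._exit(0)               # never return into the parent's code
    _, status = os.waitpid(pid, 0)    # parent: reap the finished child
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

A pre-forking server keeps a pool of such children around instead of forking per request, but the switching and copying costs per serviced request are the same.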
Recently, various httpd accelerators have been used with Web servers to improve performance. The accelerator acts as a front end for the Web server, initially handling all incoming requests; requests that cannot be handled by the accelerator are forwarded to a backing Web server. When a request is received, the accelerator examines its cache for the file, returning the file immediately if found. If there is a cache miss, it can make a system call to fetch the file, or pass the request on to a backing Web server to service. A benefit of this model is that, by having some agent examine requests, intelligent decisions can be made on how best to service them. The httpd accelerator runs in the kernel as a loadable module, which is very attractive since those requests can be serviced in kernel mode without switching to another process. Provos et al. [3,5] describe how POSIX Real-time signals can improve the scalability of a server. POSIX Real-time signals enable network server applications to respond immediately to network requests. An added benefit of RT signals is that they can be queued in the kernel and delivered to an application one at a time, in order, leaving the application free to pick up I/O completion events when convenient. Phhttpd [3,5,7], an httpd accelerator that uses the POSIX RT signal mechanism, can thus improve the scalability of a Web server. The prior work on the performance of phhttpd [3] was only a brief analysis; this paper presents a comprehensive quantitative measurement of phhttpd's performance. Section 2 describes the experimental setup. Measurement and analysis of the experiments are discussed in Section 3. Conclusions and future work are described in Section 4.
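The accelerator's request path described above can be sketched as follows. This is an illustrative Python sketch, not phhttpd's actual code (phhttpd runs in kernel context with non-blocking I/O and RT signals); the `fetch_file` and `forward_to_backend` callables are hypothetical stand-ins for the system-call fetch and the backing-server hand-off.

```python
# Simplified sketch of an httpd-accelerator front end: serve static
# files from an in-memory cache, fetch on a miss, and hand anything
# non-static to the backing Web server.

class AcceleratorFrontEnd:
    def __init__(self, fetch_file, forward_to_backend):
        self.cache = {}                  # path -> file bytes
        self.fetch_file = fetch_file     # stand-in for a system-call fetch
        self.forward = forward_to_backend

    def handle(self, path, is_static=True):
        if not is_static:
            # Slower dynamic requests go to the backing Web server.
            return self.forward(path)
        if path in self.cache:           # cache hit: reply immediately
            return self.cache[path]
        body = self.fetch_file(path)     # cache miss: fetch and remember
        self.cache[path] = body
        return body
```

The point of the front end is exactly the branch structure above: the agent sees every request first and can decide the cheapest way to service it.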
2. EXPERIMENTAL SETUP
A. Architecture
The setup of the experiments follows the framework outlined in the WAWM project [1]. In Figure 1, the components at the left are located in the local network. The Web server (OZ) was the only machine servicing HTTP requests, and the two clients were generating HTTP requests. The experiments are controlled from another host, acting as the Webmaster, which also serves as a data store for the collection and analysis of data. All hosts in the local network connect via a Mbps switch. Active measurements are taken from the Webmaster as well as from the remote hosts. Active measurements inject a packet into the network and measure the response to those packets; examples include ping and traceroute. Due to resource constraints there is limited access to remote hosts, so the two sites shown on the right in Figure 1 have been used; these sites offer public access to measurement tools running from their site. The Web server performs passive measurements of the network (which gather relatively large amounts of data, recorded at a node or endpoint in the network) using tcpdump v3.4, and monitors the operating system through the vmstat utility. When conducting experiments, simultaneous measurements are performed over different parts of the network. Accurate measurement of end-to-end delay requires synchronized clocks on the client and server. The difficulty of synchronizing clocks in a widely distributed environment led to the development of the Network Time Protocol (NTP) daemon, which keeps clocks synchronized to within tens of milliseconds. To make measurements comparable, all host clocks are synchronized every 30 minutes with a nearby NTP server, ntp.cse.unsw.edu.au, using the ntpdate program. Linux also comes with the vmstat utility, which monitors operating system activity such as user time, system time, free memory, context switches, interrupts, and more, through the /proc file system.
Reading from /proc can incur high overheads, as several system calls might be used and large blocks of data may be copied from the kernel to user space. For our purposes we use vmstat because it is readily available and simple to use. To limit the overheads associated with /proc reads, we sample only at periodic intervals of approximately 10 seconds. We create a shell script on the server to run vmstat at periodic intervals and log the data to a predetermined location for later retrieval by the Webmaster. We also gather process statistics on Apache by referring to a URL on the Web server; a script is created to log this page at periodic intervals.

Figure 1: Logical architecture of experiments

B. Server Configuration
The Web server runs Red Hat Linux v7.0 (Guinness), using Kernel v, on a PC with an 800 MHz Pentium III CPU, 256 MB of RAM, and two Mbps Ethernet ports. Light performance tuning is done on the Linux system by removing unnecessary services and tweaking system parameters. A modified version [11] of Apache v is used as the Web server software; it supports tunneling, allowing it to talk to an httpd accelerator if needed. Notably, we set the pool of initial processes waiting for connections to a moderate size of 20. This value should not be too small, or many extra httpd processes must be added to the existing pool to cope with demand, nor too large, or it consumes a great deal of memory; a busy httpd process can consume on the order of a few megabytes. For taking passive measurements we install tcpdump v3.4, and a shell script is created to execute tcpdump on both Ethernet ports, logging packet information to disk.

C. Client Configuration
Both client machines have the same hardware specifications, with Red Hat Linux v7.0 installed on Pluto and Red Hat Linux v6.2 on Saturn. The PCs were configured with an 800 MHz Pentium III CPU and 128 MB of RAM. The clients do not need any other software besides the workload generator program.
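The periodic sampling scripts amount to only a few lines. The following is a Python stand-in for the shell scripts described above, not the scripts themselves; the 10-second interval matches the text, while the command, log path, and the optional `samples` cut-off are illustrative.

```python
import subprocess
import time

def sample_to_log(cmd, logfile, interval=10.0, samples=None):
    """Run `cmd` every `interval` seconds, appending its output to
    `logfile` for later retrieval by the Webmaster.  `samples=None`
    loops indefinitely, as the script on the server does; a finite
    count is handy for testing."""
    n = 0
    with open(logfile, "a") as log:
        while samples is None or n < samples:
            out = subprocess.run(cmd, capture_output=True, text=True).stdout
            log.write(out if out.endswith("\n") else out + "\n")
            log.flush()              # keep the log current for retrieval
            n += 1
            if samples is None or n < samples:
                time.sleep(interval)

# On the server this would be invoked roughly as:
#   sample_to_log(["vmstat"], "/var/log/vmstat.log", interval=10)
```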
For WebStone, the binary webclient generates load and is distributed at run time. It requires our clients to have the rexec and rsh daemons running in order to allow communication with the Webmaster.

D. Workload Generator
Workload generators are the tools used to generate HTTP requests on the Web server. There are essentially two types of generators: synthetic workload generators and trace-driven generators. Synthetic workload generators are all based on making repeated requests, either as fast as possible or at some predetermined rate. In our
experiments we use the WebBench [6] and WebStone [4] benchmarking tools.

E. Webmaster Configuration
The Web clients are controlled by a single Webmaster, which combines the performance results from all clients into a single summary report. The Webmaster (Neptune) computer runs Red Hat Linux v6.2. The main WebStone program is installed on this computer, and the parameters for the experiments can be configured there. It was also set up to take ping measurements during experiments and to fetch status information from the Web server periodically. These activities are automated with shell scripts, requiring little user intervention. For WebBench experiments the Webmaster also has Windows 98 installed, and the controller program runs on this machine.

F. Phhttpd
Phhttpd [7] is the httpd accelerator used in our experiments; it can run on the same system where the Web server resides. It features a small I/O core and an aggressive content cache. Currently it is limited to servicing static file requests, passing slower dynamic requests to a backing Web server. Networking is done using non-blocking system calls, which allows a single execution context to handle many clients without having to do a process switch.

G. Benchmarking Procedure
The benchmarking process is divided into three phases.
Phase 1: When the benchmark begins, the Webmaster distributes the load-generating program to the Web clients, along with the parameters of the experiment. The Web clients immediately begin generating requests. Scripts that will execute measurements remain in sleep mode.
Phase 2: The system is well into the benchmark and approaches a steady state. Measurements on the system begin.
Phase 3: At the end of the benchmark, the Web clients transfer their individual statistics to the Webmaster, which produces a summary report. It also copies logs from the Web server and remote hosts to a central location for ease of analysis.

3. RESULTS
In this section we report the results obtained using the experimental setup described in Section 2. In addition to the hosts described, we add one host, godzilla.zeta.org.au, located 10 hops from the Web server, to measure the latency in fetching a file from the server. Each of these experiments runs for four minutes: we allow one minute for the system to start and move into a steady state, and then begin recording measurements. For analysis, the data obtained first needs to be reduced. Logs taken from tcpdump are very large, and we use the modified tcptrace [8] to gather information from the packet traces. WebStone already generates its own set of statistics, and we use those in our analysis. We summarize the results using a standard spreadsheet for data gathered from active measurements and vmstat.

A. Performance under Varying Loads
In our initial set of experiments, we measure the performance of the Web server under various client loads while retrieving a KB file. Connection statistics, latencies, and OS statistics under various server loads are shown in Tables 1-3. In Table 1 we observe that increasing the number of clients neither significantly increases the connection rate nor the throughput of the system. This may be due to the fact that, even at 5 client processes, WebStone can still generate traffic near the maximum potential of the system. There are four different page mixes that attempt to model real workloads; the type of page is defined by file size and access frequency, and each page in the mix has a weight that indicates its probability of being accessed. The total amount of data moved is the product of the total number of pages retrieved and the page sizes. Little's Load Factor (LLF), derived from Little's law [10], reflects the number of clients actually being serviced at a time.
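Little's law states that the mean number of requests resident in a system equals the arrival rate (here, the connection rate) times the mean time a request spends in the system, N = X * R. A small worked computation with illustrative numbers, not measurements from our experiments:

```python
def little_load_factor(connection_rate, mean_latency_s):
    """Little's law, N = X * R: the mean number of requests in the
    system equals throughput (connections/sec) times the mean residence
    time (sec).  WebStone reports this as Little's Load Factor (LLF)."""
    return connection_rate * mean_latency_s

# 200 conn/sec with a 50 ms mean connect+response latency implies that
# about 10 clients are actually in service at any instant; if 20 client
# processes were configured, the shortfall would flag an overloaded server.
llf = little_load_factor(200, 0.050)
```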
The calculated value should be close to the number of clients; otherwise it indicates that the server is overloaded and some requests are not being serviced before they time out. Table 2 shows that increasing load increases latency in direct proportion. Connect latency is the time to establish a full TCP connection, and response latency is the time to transfer the file after the connection. As more clients are added to the network there is more contention, hence longer delays. Table 3 shows that increasing the number of clients increases the number of processes waiting for execution time; this is because a separate process handles each client and only one process can ever execute at a time on our Linux system. It also shows that under high load almost 80% of the time is spent in the kernel, hence the lower number of context switches for clients. It is evident that on a Unix system the HTTP servicing process spends significant time in system calls when servicing a high number of concurrent users. Coupled with the fact that many httpd processes are also waiting for execution time, this leads us to conclude that the forking process model is inefficient under higher loads. In the following section we experiment with a Web server accelerator,
phhttpd, and look at the performance of this system, which avoids the forking model.

Table 1: Connection statistics under several loads (columns: File Size (KB), Clients, Pages Transferred, Connection Rate (conn/sec), Server Throughput (Mbits/sec), LLF, Total Errors)

Table 2: Latency statistics under several loads (columns: File Size (KB), Clients, Connect Latency (ms), Response Latency (ms), Local Ping (ms), Remote Ping (ms), Remote Trace (ms), Fetch (s))

Table 3: OS statistics under several loads (columns: File Size (KB), Clients, Processes Waiting (pw/sec), Interrupts (int/sec), Context Switches (cs/sec), %User, %System, %Idle)

B. Web Server Accelerator Performance
In our next set of experiments we look at the performance aspects of the phhttpd Web accelerator. In Figure 2 we see marginal gains in throughput and an increase in connections per second using phhttpd. The operating system statistics in Figure 3 show an 80% decrease in the number of context switches performed per second when Apache is used in conjunction with phhttpd, because the httpd accelerator serves requests from the kernel. This increase in performance allows more requests to come through, hence the increase in interrupts from the network cards. Looking at how the CPU time is divided [Figure 4], phhttpd breezes through the benchmark, leaving the CPU idle 57% of the time; user time is minimal at 8%, while system time is 34%. The Web accelerator looks very promising, providing increased performance along with efficient utilization of resources.

Figure 2: Throughput (Mbits) and connection rate (per sec), Apache standalone vs. Apache+phhttpd
Figure 3: Context switches and interrupts per second, Apache standalone vs. Apache+phhttpd

We use another file set and measure the increase in performance across various numbers of clients. We run the experiments for four minutes, each time varying the number of clients. Results from the experiments for Apache and Apache+phhttpd are shown in Figures 5 and 6. We notice substantial gains in the connection rate and server throughput when the accelerator is used: there is a 54% increase in the number of connections and in bandwidth usage at the clients. This shows a significant improvement in server performance. We also examine the operating system statistics for this experiment. From Table 4, we see that Apache with phhttpd has only 1 process waiting on average, which results in a very large drop in the number of context switches per second. This allows the Web server system to spend more time in the kernel (79.38%) and concentrate on the tasks of responding to network interrupts and sending out files, all of which require system calls.

Figure 4: CPU time
Figure 5: Connections per second
Figure 6: Interrupts per second

On the other hand, standalone Apache has 115 processes waiting and a 1600% increase in the number of context switches executed. The zero idle time and the waiting processes indicate contention for CPU time between processes and system calls. Kernel routines have a higher priority, so less useful work can be done by processes, resulting in degraded performance. We see that a Web server accelerator like phhttpd is very attractive: it has shown significant performance gains and demonstrates how an alternative model can be an advantage on a Unix-based system.

Table 4: Apache standalone vs. Apache+phhttpd (columns: Server, Clients, Processes Waiting (pw/sec), Interrupts (int/sec), Context Switches (cs/sec), %User, %System, %Idle; rows: Apache, Apache+phhttpd)
4. CONCLUSION
In this paper the performance gain from using an httpd accelerator with a Web server has been studied. We have shown that under Unix the forking model is very resource intensive. Experimental results in this paper show that under moderate to high loads the CPU is used to its full capacity, creating contention for CPU time among processes and system calls. This is caused by the fact that the httpd processes constantly need to switch in and out: a context switch to kernel mode is required to read from a network device driver; after parsing the request in user space, a system call is again required to send the requested file onto the network; and yet another context switch is required to service other waiting httpd processes. Experiments were performed with an alternative model using a Web server accelerator, phhttpd. Results show substantial performance gains, higher throughput and service rates with lower resource usage, compared to Apache. Accelerators run in the kernel and save the overheads of switching processes, although there is a trade-off between execution speed and functionality. POSIX Real-time signals are a very efficient mechanism in terms of overhead and also provide good throughput, but there are some limitations. These arise from the fact that the RT signal queue is a limited resource: since each event results in a signal being appended to the signal queue, a few active connections could dominate the signal queue usage or even trigger an overflow. [12] has proposed a scheme called signal-per-fd, an enhancement to the default RT signal implementation in the Linux kernel, which could significantly reduce the complexity of a server implementation, increase its robustness under high load, and potentially increase its throughput. The phhttpd front end is very simple, and is limited to static requests for the time being.
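The signal-queue limitation, and the signal-per-fd remedy of [12], can be illustrated with a toy simulation. This is purely illustrative Python, not the kernel implementation: queuing one signal per I/O event lets a burst from a few busy connections overflow a bounded RT-signal queue, whereas coalescing to at most one pending signal per descriptor bounds queue usage by the number of connections.

```python
def enqueue_events(events, queue_limit, per_fd=False):
    """Simulate a bounded RT-signal queue.  `events` is a sequence of
    file descriptors with ready I/O.  The default queues one signal per
    event; `per_fd=True` models signal-per-fd, collapsing repeats so
    each fd has at most one pending signal.
    Returns (queued_signals, overflowed)."""
    queue = []
    for fd in events:
        if per_fd and fd in queue:
            continue                 # signal already pending: coalesce
        if len(queue) >= queue_limit:
            return queue, True       # queue overflow: events are lost
        queue.append(fd)
    return queue, False

# A burst of 40 events from just two busy connections (fds 4 and 5)
# overflows a 16-entry queue per-event, but not per-fd.
burst = [4, 5] * 20
```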
As is evident from our experiments, it is still worth using with Apache Web servers to serve static files such as images. On a general note, there is potential for Web server accelerators to extend their functionality, as the cost would seem minimal given the major performance gains over a mainstream Web server such as Apache. Smith et al. [9] have evaluated techniques for caching dynamic content, but much work remains to be done. Future work involves a study of dynamic content issues, and that work is in progress.

REFERENCES
[1] P. Barford and M. Crovella. Measuring Web Performance in the Wide Area. Performance Evaluation Review, August.
[2] Eric Levy-Abegnoli, Arun Iyengar, Junehwa Song, and Daniel Dias. Design and Performance of a Web Server Accelerator. In Proceedings of IEEE INFOCOM '99, March 1999.
[3] Neil Provos, Chuck Lever, and Stephen Tweedie. Analyzing the Overload Behavior of a Simple Web Server. CITI Technical Report 00-7, August 2000.
[4] WebStone.
[5] Neil Provos and Chuck Lever. Scalable Network I/O in Linux. In Proceedings of the USENIX Annual Technical Conference, FREENIX Track, June 2000.
[6] WebBench 3.0.
[7] Z. Brown. phhttpd (documentation on phhttpd).
[8] Tcptrace.
[9] Ben Smith, Anurag Acharya, and Tao Yang. Exploiting Result Equivalence in Caching Dynamic Web Content. Department of Computer Science, University of California, Santa Barbara, CA.
[10] D. A. Menasce and V. A. F. Almeida. Capacity Planning and Performance Modeling: From Mainframes to Client-Server Systems. Upper Saddle River, NJ: Prentice Hall, 1994.
[11] Azer Bestavros, Bob Carter, Mark Crovella, Carlos Cunha, Abdelsalam Heddaya, and Suliman Mirdad. Application-Level Document Caching in the Internet. In Proceedings of SDNE '95: The Second International Workshop on Services in Distributed and Networked Environments, June.
[12] Abhishek Chandra and David Mosberger. Scalability of Linux Event-Dispatch Mechanisms. Internet and Mobile Systems Laboratory, HP Laboratories Palo Alto, HPL.
Acknowledgment: The experimental work described in this paper was performed by Bryan Samuels for his minor thesis.
More informationUsing Time Division Multiplexing to support Real-time Networking on Ethernet
Using Time Division Multiplexing to support Real-time Networking on Ethernet Hariprasad Sampathkumar 25 th January 2005 Master s Thesis Defense Committee Dr. Douglas Niehaus, Chair Dr. Jeremiah James,
More informationThe latency of user-to-user, kernel-to-kernel and interrupt-to-interrupt level communication
The latency of user-to-user, kernel-to-kernel and interrupt-to-interrupt level communication John Markus Bjørndalen, Otto J. Anshus, Brian Vinter, Tore Larsen Department of Computer Science University
More informationClient-Server Semantic Binary Database: Design and Development
Client-Server Semantic Binary Database: Design and Development Konstantin Beznosov High Performance Database Research Center Florida International University http://www.cs.fiu.edu/ beznosov December 9,
More informationAppendix B. Standards-Track TCP Evaluation
215 Appendix B Standards-Track TCP Evaluation In this appendix, I present the results of a study of standards-track TCP error recovery and queue management mechanisms. I consider standards-track TCP error
More informationWeb File Transmission by Object Packaging Performance Comparison with HTTP 1.0 and HTTP 1.1 Persistent Connection
Web File Transmission by Performance Comparison with and Hiroshi Fujinoki, Murugesan Sanjay, and Chintan Shah Department of Computer Science Southern Illinois University at Edwardsville Edwardsville, Illinois
More informationEvaluation Strategies. Nick Feamster CS 7260 February 26, 2007
Evaluation Strategies Nick Feamster CS 7260 February 26, 2007 Evaluation Strategies Many ways to evaluate new protocols, systems, implementations Mathematical analysis Simulation (ns, SSFNet, etc.) Emulation
More informationOne Server Per City: Using TCP for Very Large SIP Servers. Kumiko Ono Henning Schulzrinne {kumiko,
One Server Per City: Using TCP for Very Large SIP Servers Kumiko Ono Henning Schulzrinne {kumiko, hgs}@cs.columbia.edu Goal Answer the following question: How does using TCP affect the scalability and
More informationTraffic Characteristics of Bulk Data Transfer using TCP/IP over Gigabit Ethernet
Traffic Characteristics of Bulk Data Transfer using TCP/IP over Gigabit Ethernet Aamir Shaikh and Kenneth J. Christensen Department of Computer Science and Engineering University of South Florida Tampa,
More informationCA Single Sign-On. Performance Test Report R12
CA Single Sign-On Performance Test Report R12 Contents CHAPTER 1: OVERVIEW INTRODUCTION SUMMARY METHODOLOGY GLOSSARY CHAPTER 2: TESTING METHOD TEST ENVIRONMENT DATA MODEL CONNECTION PROCESSING SYSTEM PARAMETERS
More informationProcess- Concept &Process Scheduling OPERATING SYSTEMS
OPERATING SYSTEMS Prescribed Text Book Operating System Principles, Seventh Edition By Abraham Silberschatz, Peter Baer Galvin and Greg Gagne PROCESS MANAGEMENT Current day computer systems allow multiple
More informationApplication Layer Switching: A Deployable Technique for Providing Quality of Service
Application Layer Switching: A Deployable Technique for Providing Quality of Service Raheem Beyah Communications Systems Center School of Electrical and Computer Engineering Georgia Institute of Technology
More informationZilog Real-Time Kernel
An Company Configurable Compilation RZK allows you to specify system parameters at compile time. For example, the number of objects, such as threads and semaphores required, are specez80acclaim! Family
More informationCommercial Real-time Operating Systems An Introduction. Swaminathan Sivasubramanian Dependable Computing & Networking Laboratory
Commercial Real-time Operating Systems An Introduction Swaminathan Sivasubramanian Dependable Computing & Networking Laboratory swamis@iastate.edu Outline Introduction RTOS Issues and functionalities LynxOS
More informationChapter 1. Introduction
Chapter 1 Introduction In a packet-switched network, packets are buffered when they cannot be processed or transmitted at the rate they arrive. There are three main reasons that a router, with generic
More informationOperating- System Structures
Operating- System Structures 2 CHAPTER Practice Exercises 2.1 What is the purpose of system calls? Answer: System calls allow user-level processes to request services of the operating system. 2.2 What
More informationImproving TCP Performance over Wireless Networks using Loss Predictors
Improving TCP Performance over Wireless Networks using Loss Predictors Fabio Martignon Dipartimento Elettronica e Informazione Politecnico di Milano P.zza L. Da Vinci 32, 20133 Milano Email: martignon@elet.polimi.it
More informationComputer-System Organization (cont.)
Computer-System Organization (cont.) Interrupt time line for a single process doing output. Interrupts are an important part of a computer architecture. Each computer design has its own interrupt mechanism,
More informationOperating System Performance and Large Servers 1
Operating System Performance and Large Servers 1 Hyuck Yoo and Keng-Tai Ko Sun Microsystems, Inc. Mountain View, CA 94043 Abstract Servers are an essential part of today's computing environments. High
More informationConfiguring Cisco IOS IP SLAs Operations
CHAPTER 50 This chapter describes how to use Cisco IOS IP Service Level Agreements (SLAs) on the switch. Cisco IP SLAs is a part of Cisco IOS software that allows Cisco customers to analyze IP service
More informationKernel Korner. Analysis of the HTB Queuing Discipline. Yaron Benita. Abstract
1 of 9 6/18/2006 7:41 PM Kernel Korner Analysis of the HTB Queuing Discipline Yaron Benita Abstract Can Linux do Quality of Service in a way that both offers high throughput and does not exceed the defined
More informationChapter 5 - Input / Output
Chapter 5 - Input / Output Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 5 - Input / Output 1 / 90 1 Motivation 2 Principle of I/O Hardware I/O Devices Device Controllers Memory-Mapped
More informationConfiguring Cisco IOS IP SLA Operations
CHAPTER 58 This chapter describes how to use Cisco IOS IP Service Level Agreements (SLA) on the switch. Cisco IP SLA is a part of Cisco IOS software that allows Cisco customers to analyze IP service levels
More informationL41: Lab 2 - Kernel implications of IPC
L41: Lab 2 - Kernel implications of IPC Dr Robert N.M. Watson Michaelmas Term 2015 The goals of this lab are to: Continue to gain experience tracing user-kernel interactions via system calls Explore the
More informationChapter 6. Design and Implementation of the Experiment
Chapter 6. Design and Implementation of the Experiment 6.1. Physical Test Setup The test set-up included two ATM switches (Olicom CrossFire 9100 and 9200), two PCs (450 MHz Pentium III processors with
More informationCS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [THREADS] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Shuffle less/shuffle better Which actions?
More informationSwitched FC-AL: An Arbitrated Loop Attachment for Fibre Channel Switches
Switched FC-AL: An Arbitrated Loop Attachment for Fibre Channel Switches Vishal Sinha sinha@cs.umn.edu Department of Computer Science and Engineering University of Minnesota Minneapolis, MN 55455 7481
More informationPerformance and Scalability with Griddable.io
Performance and Scalability with Griddable.io Executive summary Griddable.io is an industry-leading timeline-consistent synchronized data integration grid across a range of source and target data systems.
More informationCSMA based Medium Access Control for Wireless Sensor Network
CSMA based Medium Access Control for Wireless Sensor Network H. Hoang, Halmstad University Abstract Wireless sensor networks bring many challenges on implementation of Medium Access Control protocols because
More informationThe Journal of Systems and Software
The Journal of Systems and Software 81 (28) 244 258 Contents lists available at ScienceDirect The Journal of Systems and Software journal homepage: www.elsevier.com/locate/jss Design and implementation
More informationPerformance Analysis of a WWW Server
Boston University OpenBU Computer Science http://open.bu.edu CAS: Computer Science: Technical Reports 1996-8-5 Performance Analysis of a WWW Server Almeida, Virgílio Boston University Computer Science
More informationTuning RED for Web Traffic
Tuning RED for Web Traffic Mikkel Christiansen, Kevin Jeffay, David Ott, Donelson Smith UNC, Chapel Hill SIGCOMM 2000, Stockholm subsequently IEEE/ACM Transactions on Networking Vol. 9, No. 3 (June 2001)
More informationThe control of I/O devices is a major concern for OS designers
Lecture Overview I/O devices I/O hardware Interrupts Direct memory access Device dimensions Device drivers Kernel I/O subsystem Operating Systems - June 26, 2001 I/O Device Issues The control of I/O devices
More informationOscillation of RED with 2way TCP bulk data traffic
First Author: Oscillation of RED with way TCP bulk data traffic Thomas Ziegler, Salzburg Polytechnic University; Thomas.Ziegler@fh-sbg.ac.at Second Author: Bernhard Hechenleitner, Salzburg Polytechnic
More informationImpact of bandwidth-delay product and non-responsive flows on the performance of queue management schemes
Impact of bandwidth-delay product and non-responsive flows on the performance of queue management schemes Zhili Zhao Dept. of Elec. Engg., 214 Zachry College Station, TX 77843-3128 A. L. Narasimha Reddy
More informationA WWW Server Benchmark System in IPv6 Environment. Takao Nakayama. Graduate School of Information Science Nara Institute of Science and Technology
A WWW Server Benchmark System in IPv6 Environment Takao Nakayama Graduate School of Information Science Nara Institute of Science and Technology Background(/2) With the spread of IPv6 technology, we can
More informationAn AIO Implementation and its Behaviour
An AIO Implementation and its Behaviour Benjamin C. R. LaHaise Red Hat, Inc. bcrl@redhat.com Abstract Many existing userland network daemons suffer from a performance curve that severely degrades under
More informationThis project must be done in groups of 2 3 people. Your group must partner with one other group (or two if we have add odd number of groups).
1/21/2015 CS 739 Distributed Systems Fall 2014 PmWIki / Project1 PmWIki / Project1 The goal for this project is to implement a small distributed system, to get experience in cooperatively developing a
More informationManaging NFS and KRPC Kernel Configurations in HP-UX 11i v3
Managing NFS and KRPC Kernel Configurations in HP-UX 11i v3 HP Part Number: 762807-003 Published: September 2015 Edition: 2 Copyright 2009, 2015 Hewlett-Packard Development Company, L.P. Legal Notices
More informationAn FPGA-Based Optical IOH Architecture for Embedded System
An FPGA-Based Optical IOH Architecture for Embedded System Saravana.S Assistant Professor, Bharath University, Chennai 600073, India Abstract Data traffic has tremendously increased and is still increasing
More informationProblem Set: Processes
Lecture Notes on Operating Systems Problem Set: Processes 1. Answer yes/no, and provide a brief explanation. (a) Can two processes be concurrently executing the same program executable? (b) Can two running
More informationThe MOSIX Scalable Cluster Computing for Linux. mosix.org
The MOSIX Scalable Cluster Computing for Linux Prof. Amnon Barak Computer Science Hebrew University http://www. mosix.org 1 Presentation overview Part I : Why computing clusters (slide 3-7) Part II : What
More informationWhat Operating Systems Do An operating system is a program hardware that manages the computer provides a basis for application programs acts as an int
Operating Systems Lecture 1 Introduction Agenda: What Operating Systems Do Computer System Components How to view the Operating System Computer-System Operation Interrupt Operation I/O Structure DMA Structure
More informationSystems Architecture II
Systems Architecture II Topics Interfacing I/O Devices to Memory, Processor, and Operating System * Memory-mapped IO and Interrupts in SPIM** *This lecture was derived from material in the text (Chapter
More informationVirtualization, Xen and Denali
Virtualization, Xen and Denali Susmit Shannigrahi November 9, 2011 Susmit Shannigrahi () Virtualization, Xen and Denali November 9, 2011 1 / 70 Introduction Virtualization is the technology to allow two
More information10 MONITORING AND OPTIMIZING
MONITORING AND OPTIMIZING.1 Introduction Objectives.2 Windows XP Task Manager.2.1 Monitor Running Programs.2.2 Monitor Processes.2.3 Monitor System Performance.2.4 Monitor Networking.2.5 Monitor Users.3
More informationLarge-Scale Network Simulation Scalability and an FPGA-based Network Simulator
Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid
More informationWANMon: A Resource Usage Monitoring Tool for Ad Hoc Wireless Networks
WANMon: A Resource Usage Monitoring Tool for Ad Hoc Wireless Networks Don Ngo, Naveed Hussain, Mahbub Hassan School of Computer Science & Engineering The University of New South Wales Sydney, Australia
More informationEvaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades
Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades Evaluation report prepared under contract with Dot Hill August 2015 Executive Summary Solid state
More informationOperating Systems. Lecture 09: Input/Output Management. Elvis C. Foster
Operating Systems 141 Lecture 09: Input/Output Management Despite all the considerations that have discussed so far, the work of an operating system can be summarized in two main activities input/output
More informationPerformance Analysis of iscsi Middleware Optimized for Encryption Processing in a Long-Latency Environment
Performance Analysis of iscsi Middleware Optimized for Encryption Processing in a Long-Latency Environment Kikuko Kamisaka Graduate School of Humanities and Sciences Ochanomizu University -1-1, Otsuka,
More informationThe design and implementation of the NCTUns network simulation engine
Simulation Modelling Practice and Theory 15 (2007) 57 81 www.elsevier.com/locate/simpat The design and implementation of the NCTUns network simulation engine S.Y. Wang *, C.L. Chou, C.C. Lin Department
More informationIBM MQ Appliance HA and DR Performance Report Model: M2001 Version 3.0 September 2018
IBM MQ Appliance HA and DR Performance Report Model: M2001 Version 3.0 September 2018 Sam Massey IBM MQ Performance IBM UK Laboratories Hursley Park Winchester Hampshire 1 Notices Please take Note! Before
More informationWeb File Transmission by Object Packaging Performance Comparison with HTTP 1.0 and HTTP 1.1 Persistent Connection
Web File Transmission by Performance Comparison with HTTP 1. and Hiroshi Fujinoki, Murugesan Sanjay, and Chintan Shah Department of Computer Science Southern Illinois University at Edwardsville Edwardsville,
More informationFile Server Comparison: Executive Summary. Microsoft Windows NT Server 4.0 and Novell NetWare 5. Contents
File Server Comparison: Microsoft Windows NT Server 4.0 and Novell NetWare 5 Contents Executive Summary Updated: October 7, 1998 (PDF version 240 KB) Executive Summary Performance Analysis Price/Performance
More informationScalability of Linux Event-Dispatch Mechanisms
Scalability of Linux Event-Dispatch Mechanisms Abhish ek Chandra, David Mosberger Internet and Mobile Systems Laboratory HP Laboratories Palo Alto HPL-2-174 December 14 th, 2* E-mail: abh ish ek@cs.umass.edu,
More information