Traffic Characteristics of Bulk Data Transfer using TCP/IP over Gigabit Ethernet *

Aamir Shaikh and Kenneth J. Christensen
Department of Computer Science and Engineering
University of South Florida
Tampa, FL 33620
{ashaikh, christen}@csee.usf.edu

Abstract

As network data rates scale up, an understanding of how and why traffic characteristics change is needed. We study the characteristics of bulk data transfer TCP/IP flows in a fully-switched Gigabit Ethernet network. Both disk-to-disk and memory-to-memory transfers are studied. We investigate flow characteristics as a function of link speed, server load, operating system (Linux and WindowsNT), and number of simultaneous clients accessing a server. Using trace-based simulations, we find that there is great variability in the friendliness of a flow to a single-server queue in terms of packet losses. Flows on Gigabit Ethernet tend to cause greater packet losses for a given link utilization and buffer size under certain conditions. We also find some unexplained inefficiencies in some of the flows. We investigate methods of application-level traffic smoothing to improve flow characteristics. A prototype packet-spacing socket send function is developed to reduce packet losses.

1. Introduction

Knowledge of traffic characteristics is needed for capacity planning of networks and to achieve appropriate design points for network devices such as switches and routers. In addition, a careful investigation of traffic can often result in improvements to underlying protocols to achieve greater efficiencies and higher performance. As networks scale up in data rate, an investigation is needed to determine if and how traffic characteristics change, what the root causes of any changes are, and what the resulting effects are. TCP/IP is expected to remain the dominant protocol in high-speed networks. Currently, applications with short-lived flows (e.g., HTTP transfers) are dominant. However, high-speed networks will also be used for bulk data transfer with resulting long-lived flows. For example, high-resolution medical images are typically tens to hundreds of megabytes [2].

In this paper, we address five questions related to traffic characteristics on 100 and 1000-Mbps switched Ethernet networks with TCP/IP as the transport protocol.

1. How do TCP/IP traffic characteristics change as a function of disk or memory access, link speed, server load, and operating system?
2. How are destination packets interleaved as a function of the number of simultaneous client connections at a server?
3. What are the queuing delays and packet losses of TCP/IP flows for a range of queue utilizations?
4. What are the root causes of observed behaviors that significantly affect queuing delay and loss?
5. What can be done to decrease queuing delay and loss of TCP/IP flows?

We seek to gain an understanding of root causes. In a network with overflow losses, TCP flow control mechanisms will affect traffic characteristics. We would like to configure high-speed networks so that losses occur only very rarely; thus, we do not study the effects of TCP flow control mechanisms. Our contributions will help provision networks to better minimize packet loss.

The remainder of this paper is organized as follows. Section 2 is a brief background review of traffic characterization studies related to TCP/IP. Section 3 describes our experiments. Section 4 presents the traffic characterization methods and results. Section 5 presents insights.
In Section 6 we propose and evaluate a timed socket send function for improving flow friendliness (i.e., to reduce queuing delays). Following Section 6 are the summary and list of references.

* This material is based upon work funded by the National Science Foundation under Grant No. 9875177.

2. Background on Traffic Characterization

Network traffic has been characterized at the aggregate level with significant discoveries that packet inter-arrival times are not exponentially distributed, that there exist correlations in source-destination pairs, and that Long Range Dependence (LRD) may be the primary determinant of queuing behavior [4, 10, 12]. The causes of LRD have been largely attributed to heavy-tailed distributions of file sizes [15] and user behavior [3] at the application layer (Web browsing in the case of [3]).

TCP/IP behavior has been extensively studied with the primary motivation of better tuning TCP for higher performance. In [14] it was discovered that a large ATM AAL5 MTU causes deadlocks in TCP data transfers, resulting in dramatic performance degradation to about 1% of the expected throughput and one order of magnitude longer response times. The deadlock in the TCP connection was resolved by a 200-millisecond timer that generated TCP delayed acknowledgments. In [13] the performance of TCP for streams of small messages was investigated. A method of delayed acknowledgments with faster TCP timeout was proposed and implemented, which increased throughput by about 50% for flows of small messages. In [16] a study of multiple TCP/IP implementations (including Unix variants and Microsoft Windows) was conducted. Differences in behavior between the implementations were found, and it was noted that some of the differences could cause significant performance problems.

The performance of TCP/IP over Gigabit Ethernet was studied in [6]. The effects of processor speed, type of adapter, Linux kernel version, and various tuning parameters were studied. The netpipe program [7] was used to measure host-to-host throughput for a range of transfer block sizes. A maximum throughput of 470 Mbps was measured between two PCs with Alteon ACEnic adapters using Alteon Jumbo Frames (9 KB size). For a 1500-byte frame size, a maximum throughput of about 350 Mbps was measured. Linux implementation issues in the area of delayed TCP acknowledgments were shown to have a considerable effect on throughput; recent versions of Linux fix these problems.

The addition of pacing to TCP is studied in [1]. Pacing attempts to evenly space, or smooth, transmissions on the time scale of a round trip time. The pacing rate is determined as the window size divided by the round trip time. Pacing implemented at the sender delays data packets to spread them out. Pacing implemented in the receiver likewise delays ACKs, with the result of spreading out the transmission of data packets from the sender. However, it is shown in [1] that pacing can, in many cases, result in lower throughput and higher delay. Pacing delays the detection of congestion and causes synchronized losses at bottleneck queues. It is not clear, however, whether the decrease in performance is proportional to the amount of pacing, or for which network configurations pacing has the most effect. That is, while a large amount of pacing may not be beneficial, it is possible that a small amount of pacing may reduce packet losses in a controlled manner and thus prevent unnecessary reductions in transmission rates.

3. Experiments to Characterize TCP/IP Flows

This section describes our testbed and experiments for studying gigabit bulk data transfer TCP/IP flows.
3.1 Description of the test bed

Our testbed consists of an Alteon ACEswitch 180 nine-port Gigabit Ethernet switch. Connected to the switch are three Dell Pentium III 700-MHz and three Dell Pentium III 866-MHz PCs. Also connected to the switch is a Dell Pentium II dual-300-MHz PC. The 700 and 866 PCs have 128 MB of RAM and the dual-300 has 64 MB. The 700 and 866 PCs connect to the switch via an Alteon ACEnic Gigabit Ethernet adapter (a full-duplex fiber connection) or via a 3COM 3C905X 10/100 Ethernet adapter (a full-duplex copper twisted-pair connection). Depending on the experiment being conducted, only one of the adapters and connections in each PC is active. The dual-300 PC connects only at 100 Mbps. RedHat Linux 6.1 (kernel version 2.2.12-20) and WindowsNT 4.0 are installed on these machines. Figure 1 shows the testbed in the Information Systems Laboratory at the University of South Florida.

[Figure 1 - Gigabit Ethernet testbed configuration: server (700), trace collector (700), load generator (dual-300), and clients (1x700, 3x866) attached to the Alteon ACEswitch 180]

The software used for our experiments consists of FTP servers for disk-to-disk transfers, benchmark programs for memory-to-memory transfers, a load generator utility, and a trace collection utility:

File transfer software - For WindowsNT we used the FTP server that installs with Microsoft Internet Information Server (IIS) 4.0. For Linux, we used the FTP server that comes in the standard build. The Apache Web server was also installed on Linux.

Memory-to-memory transfer software - For both Windows and Linux we used Netperf [11] for memory-to-memory transfers. Netperf is a longstanding and portable benchmark program for measuring network adapter throughput for Ethernet, ATM, and other LAN technologies.

Load generation utility - To generate load on the server, we used http_load [8]. The http_load utility enables a single client to initiate multiple, parallel HTTP requests to a Web server.

Trace collection utility - For collecting traces we used Windump [19], a Windows version of the standard tcpdump utility [9]. The Windump utility captures packet headers showing timestamps with microsecond accuracy. Packet headers are decoded to show addresses, flags, window sizes, and other TCP and IP protocol information.

For trace collection, port mirroring is used in the switch. The server port was mirrored, and it was confirmed (during the experiments) that no packets were ever lost due to the mirroring. Packet loss statistics are maintained within the switch and are accessible via SNMP or by telneting to the internal switch management program.

3.2 Description of the trace collection experiments

Five data collection experiments are defined. For each experiment, a trace of the traffic is taken using Windump. A minimum of one million packets was collected for each experiment. The raw trace data is available directly from the authors and includes a readme.txt file with detailed experiment descriptions.

Disk-to-disk transfer: Traffic characteristics as a function of link speed for disk-to-disk file transfer. Traffic is traced for a large (200 MB) FTP file transfer from server to client. Server and client are tested at 100 and 1000-Mbps link speeds and for the mixed link speed case of server at 1000 Mbps and client at 100 Mbps.

Memory-to-memory transfer: Traffic characteristics as a function of link speed for memory-to-memory data transfer. Similar to the disk-to-disk experiment, except that Netperf is used to transfer 200 MB from the server to the client with no disk involvement (i.e., memory-to-memory).

Operating system: Traffic characteristics as a function of operating system. The disk-to-disk and memory-to-memory experiments are repeated for WindowsNT and Linux. For the WindowsNT experiments both the server and client were WindowsNT; for Linux, both server and client were Linux.

Server load: Traffic characteristics as a function of an unloaded or loaded server. The disk-to-disk and memory-to-memory experiments are repeated with the server loaded using http_load to achieve a mean server CPU utilization of about 20% and traffic of about 20 to 30 Mbps for the loaded case. The http_load program was configured to request 10 KB and 2 MB files at a rate of 10 files per second for the 100-Mbps server, and 10 KB, 2 MB, and 20 MB files at a rate of 5 files per second for the 1000-Mbps server (both resulting in 20% server CPU utilization as measured by the performance monitor in WindowsNT and the top command in Linux). This experiment is executed for both WindowsNT and Linux.
Multiple clients: Traffic characteristics, in particular packet interleaving, as a function of the number of clients. This experiment repeats the disk-to-disk transfer experiment for one through three clients. Clients either all access the same 200 MB file or access different 200 MB files. This experiment is executed with the server running WindowsNT and the clients running Windows 2000 (we did not have Linux installed on all clients).

4. Characterization Methods and Results

The collected traffic traces are characterized using tools available at [18]. The results lead to new insights and motivate the need to be able to shape TCP flows.

4.1 Characterization methods

All traffic traces are characterized using statistical measures and with a trace-driven queuing simulation. For each collected traffic trace, the packet inter-arrival times for server-to-client flows are described by mean, coefficient of variation (CoV), and peak-to-mean (P/M) ratio (the P/M ratio is shown in [5] to be a good predictor of queuing behavior). We study queuing behavior by feeding the traffic trace into a single-server queuing simulation model. The queuing model, implemented in CSIM18 [17], takes inter-arrival times and packet lengths from a trace file. The service times are generated as packet length divided by a (constant) link speed. The link speed is varied as a function of the mean transfer rate to yield a range of offered load values (e.g., from 50% to 95%). The mean rate of a trace is simply the sum of the bytes of all data packets divided by the total time. We study both queuing delay for an infinite-size buffer and packet loss for a range of finite buffer queues. Buffer capacity in the queuing simulation is in bytes.
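As an illustration of this kind of trace-driven queuing simulation (a minimal sketch, not the CSIM18 model used here; the trace file format and command-line arguments are assumptions for the example), a single FIFO queue with a byte-capacity limit can be replayed as follows:

    /*
     * Minimal trace-driven single-server FIFO queue simulation (a sketch).
     * Each input line gives one packet as:
     *   <inter-arrival time in seconds> <packet length in bytes>
     * The queue drains at a constant link rate; an arriving packet is dropped
     * if it would push the byte backlog above the buffer capacity.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s trace_file link_bps buffer_bytes\n", argv[0]);
            return 1;
        }
        FILE *fp = fopen(argv[1], "r");
        if (fp == NULL) { perror("fopen"); return 1; }

        double link_Bps  = atof(argv[2]) / 8.0;   /* link rate in bytes/second   */
        double buf_bytes = atof(argv[3]);         /* 0 or negative => infinite   */
        double backlog = 0.0;                     /* bytes waiting or in service */
        double delay_sum = 0.0;
        long   arrivals = 0, losses = 0;
        double dt, len;

        while (fscanf(fp, "%lf %lf", &dt, &len) == 2) {
            arrivals++;
            /* Drain the backlog during the inter-arrival gap. */
            backlog -= dt * link_Bps;
            if (backlog < 0.0) backlog = 0.0;
            /* Drop if the packet does not fit in the (finite) buffer. */
            if (buf_bytes > 0.0 && backlog + len > buf_bytes) {
                losses++;
                continue;
            }
            delay_sum += backlog / link_Bps;      /* waiting time of this packet */
            backlog += len;
        }
        fclose(fp);

        printf("packets: %ld  lost: %ld (%.2f%%)  mean queuing delay: %.3f ms\n",
               arrivals, losses, 100.0 * losses / arrivals,
               1000.0 * delay_sum / (arrivals - losses));
        return 0;
    }

Sweeping the offered load as described above then amounts to setting the link rate to (mean trace rate)/rho for rho from 0.50 to 0.95.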

Finally, we alter the traffic traces to better understand the root causes of given behaviors. We spread and smooth the inter-arrival times in the traces:

Spreading - all inter-arrival times greater than a specified value are clipped to that value, and the total amount clipped is then spread evenly over all inter-arrival times. That is, for $N$ inter-arrival times $t_i$ ($i = 1, \ldots, N$) and a clipping value $t_{clip}$, we compute

    $S_{clip} = \sum_{i=1}^{N} \max(t_i - t_{clip},\, 0)$

and then, after clipping, modify all inter-arrival times as

    $t_i \leftarrow \min(t_i, t_{clip}) + S_{clip}/N$.

Smoothing - inter-arrival times are modified using a packet-based leaky bucket model with tokens, with a specifiable token rate (bytes/sec) and token bucket capacity (bytes).

Spreading and smoothing do not change the mean transfer rate (of the traced transfer). Spreading requires knowledge of the traffic beforehand and hence cannot (easily) be implemented in an application, TCP/IP protocol stack, or an adapter. However, it can lead to some useful insights. Smoothing via a leaky bucket can be implemented in a real system.
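For concreteness, the spreading transformation can be sketched as follows (an illustrative sketch assuming the inter-arrival times are held in an in-memory array of seconds; this is not the authors' tool code):

    /*
     * Spreading (a sketch): clip every inter-arrival time larger than t_clip
     * and redistribute the clipped excess evenly over all N inter-arrival
     * times.  The mean inter-arrival time, and hence the mean transfer rate,
     * is unchanged.
     */
    #include <stddef.h>

    void spread(double *t, size_t n, double t_clip)
    {
        double excess = 0.0;
        size_t i;

        for (i = 0; i < n; i++) {
            if (t[i] > t_clip) {
                excess += t[i] - t_clip;   /* accumulate S_clip */
                t[i] = t_clip;             /* clip to the specified value */
            }
        }
        for (i = 0; i < n; i++)
            t[i] += excess / (double)n;    /* spread S_clip evenly */
    }

For the 1% spreading reported later (Table 7), t_clip would presumably be chosen as roughly the 99th percentile of the inter-arrival times, so that only the highest 1% of gaps are clipped.
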
4.2 Characterization results for single client case

    15:03:34.144364 giga1.usf.edu.20 > giga3.usf.edu.1045: . 69505:70953(1448) ack 1 win 32120 <nop,nop,timestamp 434817 447149> (DF) [tos 0x8]
    15:03:34.144378 giga1.usf.edu.20 > giga3.usf.edu.1045: . 70953:72401(1448) ack 1 win 32120 <nop,nop,timestamp 434817 447149> (DF) [tos 0x8]
    15:03:34.144392 giga1.usf.edu.20 > giga3.usf.edu.1045: P 72401:73849(1448) ack 1 win 32120 <nop,nop,timestamp 434817 447149> (DF) [tos 0x8]
    15:03:34.144679 giga1.usf.edu.20 > giga3.usf.edu.1045: . 73849:75297(1448) ack 1 win 32120 <nop,nop,timestamp 434817 447149> (DF) [tos 0x8]
    15:03:34.144686 giga3.usf.edu.1045 > giga1.usf.edu.20: . ack 68057 win 14480 <nop,nop,timestamp 447149 434817> (DF) [tos 0x8]
    15:03:34.144700 giga1.usf.edu.20 > giga3.usf.edu.1045: . 75297:76745(1448) ack 1 win 32120 <nop,nop,timestamp 434817 447149> (DF) [tos 0x8]
    <SNIP SNIP>
    15:03:34.144749 giga3.usf.edu.1045 > giga1.usf.edu.20: . ack 70953 win 18824 <nop,nop,timestamp 447149 434817> (DF) [tos 0x8]

Figure 2 - Snippet of a 1000-1000 Mbps disk-to-disk Windump trace

Figure 2 shows a snippet of a 1000-1000 Mbps Linux disk-to-disk trace. It shows the timestamps, source and destination IP addresses, and packet sequence numbers, along with other control information. This trace shows the transmission between giga1 (server) and giga3 (client). Server giga1 sends data packets of size 1448 bytes and client giga3 transmits the corresponding acknowledgments (ACKs) for these data packets. The last line of the trace shows the ACK for the packet with sequence number 70953. The current window size after this ACK is 18 KB, with a maximum of 32 KB. We filter out the client-to-server ACKs and focus only on the server-to-client data packets.

Figure 3 shows a snapshot of data transfer throughput for a randomly chosen time interval in a trace of a 1000-1000 Mbps memory-to-memory transfer (for both Linux and WindowsNT). The mean transfer rate, shown by the horizontal lines, is 340 Mbps for WindowsNT and 318 Mbps for Linux. The time scale is 400 milliseconds; finer and coarser time scales show details at different levels. The mean rates are calculated for the entire transfer. From the raw traces it was seen that the maximum packet size for WindowsNT transfers is 1460 bytes, whereas for Linux it is 1448 bytes (because of the presence of the timestamp option in the headers). For Linux disk-to-disk transfers, about one out of every 6 packets is 1000 bytes in size. The drop-outs in transfer rate at points 1 and 2 are due to the client (the receiver) advertising a zero window to the server. This is presumably due to the receiver being temporarily overrun. Other drop-outs are caused by the server waiting for the client to send an ACK packet before it resumes transmission. However, some drop-outs, such as at point 3, are not easily explainable and seem to be caused by the sender delaying the transmission of packets by over 300 microseconds even after receiving an ACK packet.

[Figure 3 - Linux and WindowsNT transfer snapshot: transfer rate (Mbps) versus time (sec); drop-out points 1, 2, and 3 are marked]

Table 1 summarizes the characterization for disk-to-disk transfer with the server unloaded. The table shows the mean (in milliseconds), CoV, and P/M ratio of the packet inter-arrival times. The P/M ratio is for a one-millisecond interval; ranging the time interval from 100 microseconds to 10 milliseconds made little difference in the P/M ratio. Table 1 also shows the mean queuing delay in milliseconds (Q Delay) for a simulated single-server queue with infinite buffer size and 90% offered load. All queuing delay results in this paper are from the trace-driven simulation model. Table 2 shows the same results for memory-to-memory transfers. Tables 3 and 4 show the same results as Tables 1 and 2 for a loaded server.

Table 1 - Disk-to-disk without server load

  OS      Link (server-client, Mbps)   Mean (ms)   CoV    P/M    Q Delay (ms)
  Linux   1000-1000                    0.558       10.2   51.1   1243.32
  Linux   100-100                      0.642       9.4    67.2   1181.07
  Linux   1000-100                     0.645       8.5    60.6   1096.78
  WinNT   1000-1000                    0.087       7.0    13.7   5.83
  WinNT   100-100                      0.128       3.1    10.8   0.37
  WinNT   1000-100                     0.129       3.9    12.2   1.81

Table 2 - Memory-to-memory without server load

  OS      Link (server-client, Mbps)   Mean (ms)   CoV    P/M    Q Delay (ms)
  Linux   1000-1000                    0.036       6.4    3.2    0.20
  Linux   100-100                      0.126       2.8    10.7   0.35
  Linux   1000-100                     0.128       3.9    11.2   0.47
  WinNT   1000-1000                    0.034       2.5    3.0    0.30
  WinNT   100-100                      0.123       2.3    10.2   0.34
  WinNT   1000-100                     0.123       2.3    10.2   0.34

Table 3 - Disk-to-disk with server load

  OS      Link (server-client, Mbps)   Mean (ms)   CoV    P/M    Q Delay (ms)
  Linux   1000-1000                    0.684       10.2   68.5   905.89
  Linux   100-100                      0.829       6.6    57.7   405.45
  Linux   1000-100                     0.750       8.8    77.2   633.46
  WinNT   1000-1000                    0.102       8.6    11.8   53.82
  WinNT   100-100                      0.215       2.4    16.0   359.64
  WinNT   1000-100                     0.157       6.9    15.1   101.02

Table 4 - Memory-to-memory with server load

  OS      Link (server-client, Mbps)   Mean (ms)   CoV    P/M    Q Delay (ms)
  Linux   1000-1000                    0.053       15.5   5.3    19.41
  Linux   100-100                      0.129       2.8    10.6   0.37
  Linux   1000-100                     0.138       8.9    10.4   3.81
  WinNT   1000-1000                    0.044       2.4    3.7    0.28
  WinNT   100-100                      0.125       2.2    10.0   0.34
  WinNT   1000-100                     0.123       2.0    9.4    0.32

We make the following general observations:

- Disk-to-disk transfers are much unfriendlier (i.e., have higher queuing delay) than memory-to-memory transfers in all cases.
- For disk-to-disk transfers, queuing delay increases with link rate. However, for memory-to-memory transfers, queuing delay decreases with link rate (except for 1000-1000 Mbps with load, where delay increases dramatically).
- WindowsNT stream behavior is very different from that of Linux for disk-to-disk transfers, but about the same for memory-to-memory transfers.
- Putting load on the server affects disk-to-disk and memory-to-memory transfers in opposite ways. For WindowsNT memory-to-memory transfers load has little effect, but for disk-to-disk transfers load increases stream unfriendliness. For Linux memory-to-memory transfers, load increases unfriendliness, but for disk-to-disk transfers load decreases unfriendliness.

For most cases, the CoV is a good predictor of the queuing delay. But for some cases, CoV is not able to accurately predict the performance of a traffic stream; for these cases the P/M ratio is a better predictor of queuing behavior. The Hurst parameter (H) did not compute correctly for some cases (i.e., it resulted in H estimates of less than 0.50, possibly due to an insufficient number of samples), so we do not consider H as a predictor.
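For reference, the statistics reported in the tables can be computed from a trace along the following lines (a sketch only; the arrays and the exact reading of the P/M ratio - peak byte count in any 1-ms interval divided by the mean byte count per interval, following the spirit of [5] - are assumptions, not the authors' tool code):

    /*
     * Sketch of the per-trace statistics: mean and CoV of the packet
     * inter-arrival times, plus a peak-to-mean ratio of the byte rate over
     * fixed bins (e.g., bin = 1e-3 seconds).  Arrays hold one entry per data
     * packet: arrival time (seconds) and length (bytes).
     */
    #include <math.h>
    #include <stdio.h>

    void trace_stats(const double *arrival, const double *len, long n, double bin)
    {
        double sum = 0.0, sumsq = 0.0, bytes = 0.0;
        long i;

        for (i = 1; i < n; i++) {                 /* inter-arrival statistics */
            double dt = arrival[i] - arrival[i - 1];
            sum += dt;
            sumsq += dt * dt;
        }
        double mean = sum / (n - 1);
        double var  = sumsq / (n - 1) - mean * mean;
        double cov  = sqrt(var) / mean;

        /* Peak-to-mean ratio of the byte count per bin. */
        long nbins = (long)((arrival[n - 1] - arrival[0]) / bin) + 1;
        double peak = 0.0, in_bin = 0.0;
        long cur = 0;
        for (i = 0; i < n; i++) {
            long b = (long)((arrival[i] - arrival[0]) / bin);
            if (b != cur) {
                if (in_bin > peak) peak = in_bin;
                in_bin = 0.0;
                cur = b;
            }
            in_bin += len[i];
            bytes += len[i];
        }
        if (in_bin > peak) peak = in_bin;

        printf("mean = %.3f ms  CoV = %.2f  P/M = %.1f\n",
               1000.0 * mean, cov, peak / (bytes / nbins));
    }
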
4.3 Characterization results for multiple clients

For multiple clients simultaneously accessing a server, we examine the characteristics of both the aggregate traffic stream and of the individual streams. We also characterize the interleaving of packets between the connections. We use the term block to refer to the number of consecutive packets transmitted from the server to one single client before beginning transmission to the next client. Table 5 shows the characteristics of the aggregate traffic stream for the two cases (same file versus different files). Table 6 shows the characteristics of a single client stream, which is filtered from the aggregate stream. It can be seen that:

- Increasing the number of clients simultaneously accessing the same file on the server increases the friendliness of the aggregated traffic stream.
- Increasing the number of clients simultaneously accessing different files on the server decreases friendliness and increases the queuing delay.
- Traffic streams from a single flow are unfriendlier, with a higher queuing delay, as compared to the aggregate stream. This indicates that merging of streams results in improved traffic characteristics and reduced queuing delays.

Table 5 - Aggregate flow, same versus different files

  Case               Mean (ms)   CoV    P/M    Q Delay (ms)
  Same - 1 client    0.110       3.23   10.7   17.37
  Same - 2 clients   0.052       5.16   5.2    1.51
  Same - 3 clients   0.037       5.88   3.7    0.69
  Diff - 1 client    0.110       3.23   10.7   17.37
  Diff - 2 clients   0.203       5.15   23.0   40.98
  Diff - 3 clients   0.254       6.07   26.0   341.11

Table 6 - Single flow, same versus different files

  Case               Mean (ms)   CoV    P/M    Q Delay (ms)
  Same - 1 client    0.110       3.23   10.7   17.37
  Same - 2 clients   0.103       3.92   9.7    1.83
  Same - 3 clients   0.111       4.08   10.5   1.05
  Diff - 1 client    0.110       3.23   10.7   17.37
  Diff - 2 clients   0.408       0.40   43.0   60.67
  Diff - 3 clients   0.731       6.71   64.0   3224.40

When multiple clients are accessing the same file on the server, most of the packets are sent in blocks of 6 packets. When multiple clients access different files, most of the packets are sent in blocks of 45 packets, and increasing the number of clients increases the frequency of occurrence of this 45-packet block size. Figure 4 shows the autocorrelation for an aggregated stream of multiple clients accessing different files on a server. Peaks in the autocorrelation at lags of 6 and 45 correspond to the blocking effects of 6 and 45 packets per block.

[Figure 4 - Autocorrelation of multiple client stream: autocorrelation versus lag (0 to 100) for 1 client, 2 clients (diff file), and 3 clients (diff file)]

4.4 Effects from spreading and smoothing

Table 7 shows the results of 1% spreading (i.e., spreading the highest 1% of inter-arrival times) for disk-to-disk transfers with no server load (Table 1 is the original). Especially notable is the reduction in queuing delay for WindowsNT 1000-1000 Mbps. Figure 5 shows the effects of 1% spreading for a range of offered loads. It can be seen that for both WindowsNT and Linux traffic, the packet loss at a given offered load is reduced by spreading. For WindowsNT, the losses are reduced to almost 0% even for a 95% offered load, as compared to 18% for the original case. For Linux, spreading reduces the losses from 26% (for a 64 KB buffer and 70% offered load) to about 4%. Tables 8 and 9 show the same results for smoothing (for a mean byte rate of 1x and 3x the mean transfer rate, respectively, and a 1460-byte token bucket size in all cases). This motivates us to explore the implementation of smoothing in real systems that use TCP/IP for bulk data transfers.

Table 7 - Disk-to-disk, spreading (1%)

  OS      Link (server-client, Mbps)   Mean (ms)   CoV    P/M    Q Delay (ms)
  Linux   1000-1000                    0.558       2.8    2.8    847.17
  Linux   100-100                      0.642       3.6    4.4    546.44
  Linux   1000-100                     0.645       2.5    3.8    666.38
  WinNT   1000-1000                    0.087       1.9    2.6    0.52
  WinNT   100-100                      0.128       2.2    7.3    0.48
  WinNT   1000-100                     0.129       2.1    5.2    1.43

Table 8 - Disk-to-disk, leaky bucket (rate = 1x)

  OS      Link (server-client, Mbps)   Mean (ms)   CoV    P/M    Q Delay (ms)
  Linux   1000-1000                    0.560       2.7    1.0    0.01
  Linux   100-100                      0.645       3.5    1.1    0.03
  Linux   1000-100                     0.646       1.3    1.0    0.00
  WinNT   1000-1000                    0.087       1.2    1.5    0.00
  WinNT   100-100                      0.129       2.1    1.2    0.00
  WinNT   1000-100                     0.129       1.0    1.1    0.00

Table 9 - Disk-to-disk, leaky bucket (rate = 3x)

  OS      Link (server-client, Mbps)   Mean (ms)   CoV    P/M    Q Delay (ms)
  Linux   1000-1000                    0.558       9.7    3.3    1235.43
  Linux   100-100                      0.642       9.0    5.4    1173.80
  Linux   1000-100                     0.645       8.3    3.1    1091.01
  WinNT   1000-1000                    0.087       6.8    8.6    5.63
  WinNT   100-100                      0.128       2.1    1.2    0.00
  WinNT   1000-100                     0.129       2.7    1.3    0.00

[Figure 5 - Effects of 1% spreading (disk-to-disk, no load): packet loss (%) versus offered load (%) for WindowsNT and Linux, original and with 1% spreading, buffer = 128 KB]
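One way to implement the packet-based leaky-bucket smoothing used for Tables 8 and 9 is sketched below (an illustrative sketch, not the authors' tool code; the array layout and the full-bucket initial condition are assumptions, and packet lengths are assumed not to exceed the bucket size, which holds for the 1460-byte bucket used here):

    /*
     * Leaky-bucket (token bucket) smoothing of a trace: tokens accumulate at
     * `rate` bytes/sec up to `bucket` bytes; a packet is released only when
     * enough tokens are available, and the smoothed inter-arrival times are
     * taken from the release times.  For Tables 8 and 9 the rate is 1x or 3x
     * the trace's mean byte rate and bucket = 1460 bytes.
     */
    #include <stddef.h>

    void smooth(const double *arrival, const double *len, double *release,
                size_t n, double rate, double bucket)
    {
        double tokens = bucket;       /* bucket assumed to start full  */
        double t_prev = arrival[0];   /* time of the previous release  */
        size_t i;

        for (i = 0; i < n; i++) {
            /* Earliest release: the packet's arrival, but not before the
             * previous release (FIFO order). */
            double t = arrival[i] > t_prev ? arrival[i] : t_prev;

            /* Refill tokens for the elapsed time, capped at the bucket size. */
            tokens += (t - t_prev) * rate;
            if (tokens > bucket) tokens = bucket;

            /* Wait until enough tokens have accumulated for this packet. */
            if (tokens < len[i]) {
                t += (len[i] - tokens) / rate;
                tokens = len[i];
            }
            tokens -= len[i];
            release[i] = t;
            t_prev = t;
        }
    }

The difference release[i] - arrival[i] is the added per-packet delay in the sending host, which is the quantity discussed next.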

For the results of Table 8, the average added delay (i.e., the leaky bucket queuing delay) in the sending server host was 1.9 seconds (!) for Linux transfers and 180 milliseconds for WindowsNT transfers. Such delays are clearly infeasible. For the results of Table 9, the average added delay was 3.9 milliseconds for Linux and 0.8 milliseconds for WindowsNT. For the 100-100 Mbps and 1000-100 Mbps WindowsNT cases, the delay is entirely and equally traded off from the network to within the sending host. Trade-offs between delay within the sending host and queuing delay in the network need to be studied.

5. Insights from the Traffic Characterization

The traffic characterization results show that TCP/IP streams, even for a single server-to-client stream, are bursty. The level of burstiness is a function of all of the factors studied, namely link speed, type of transfer, operating system, and server load. The burstiness was quantified by time-series measures of the CoV and P/M ratio. The CoV and P/M ratio range from about 3 to 10, but the queuing delay of a 90% offered load, infinite buffer size queue ranges by three to four orders of magnitude!

The burstiness in any transfer is caused by the variation in packet inter-arrival times. For a 1000-1000 Mbps disk-to-disk transfer it can be seen in the traces that this variation is caused either by the client not sending the ACK back to the server, or by the server receiving the ACK but still not transmitting a packet. A possible explanation as to why the client might not immediately send ACKs back to the server is interrupt coalescing, which is used to reduce CPU utilization. Interrupt coalescing allows the adapter to not generate an interrupt for each arrived packet, but rather to generate an interrupt after a set number of packets have arrived or a timer has expired (i.e., to prevent the driver from forever holding a lone packet). Thus, interrupt coalescing may cause bunching of packets and hence burstiness in a traffic stream. The effects of interrupt coalescing are likely to depend on the rate of the traffic stream. A high-rate stream will fill up the coalescing bucket more quickly than a slower stream. For a disk-to-disk transfer, where the transfer rate is comparatively lower, interrupt coalescing can have a significant effect because sufficient packets do not arrive at the client side for it to quickly generate an interrupt and send an ACK. This might explain why certain 1000-1000 Mbps transfers are unfriendlier than the 1000-100 or 100-100 Mbps transfers.

To test this hypothesis, we modified the interrupt coalescing parameters for the ACEnic driver for Linux, which include the send/receive timers and the number of packets sent/received before generating an interrupt. By default, these parameters are set to maximize throughput. The default values are:

- Transmit coalescing ticks: 400 microseconds
- Receive coalescing ticks: 120 microseconds
- Transmit maximum packets: 40
- Receive maximum packets: 25

By changing the value of the transmit and receive maximum packets to 1, we were able to see a reduction in the burstiness (and a 10% reduction in both CoV and queuing delay for a disk-to-disk transfer). For memory-to-memory transfer, the transfer rate was reduced by 30% because CPU utilization maxed out with the increased interrupt rate. For the second case, when the server receives the ACK but still does not transmit packets, a possible explanation is ongoing disk activity, which takes a comparatively larger amount of time than a memory access.
This can also explain why multiple clients accessing the same file result in smaller queuing delays than clients accessing different files: a single file can be cached in memory, while multiple large files cannot. When multiple clients access a single file from the server, the block size appears to be a function of the TCP window size. Looking at the trace files, it is observed that the clients normally acknowledge with a window size of 8 KB, which is the default window size for this experiment. This results in the server sending a block of 6 packets of 1460 bytes each (equivalent to 8 KB) to one client, waiting for the ACK, and then transmitting another 6 packets to the second client, and so on. For the case when different files are transferred between a server and two clients, most of the transfers are in blocks of 45 packets, which corresponds to a 64 KB disk data block (45 packets of 1460 bytes is approximately 64 KB). From the traces it can be seen that the client advertises a window of 16 KB, with the server sending 16 KB worth of data, receiving the ACK, and then again sending 16 KB worth of data. This process continues until the 64 KB size is reached, at which point the server sends the last packet (in the 64 KB block) with the PUSH flag set and starts its transmission to the next client.

6. Prototyping a Timed Send Socket Call

In this section we propose and implement a mechanism to shape TCP traffic with the goal of reducing burstiness and hence also queuing delays. Traffic shaping is typically implemented in the network adapter using hardware-implemented leaky bucket shapers (e.g., as is done in ATM adapters). Shaping could also potentially be implemented in software within the adapter device driver, in the TCP/IP protocol stack, or in the application itself.

Implementing shaping below the application requires that lower layers have knowledge of connections and their shaping parameters. This would require significant changes to these lower layers. Changing the TCP/IP stack is problematic from a standards and deployment viewpoint. Shaping purely in software is also problematic with regard to the granularity of the timers needed. We investigate shaping at the socket interface level via a timed send() socket call and propose the addition of hardware timers to an Ethernet adapter to better support software-implemented shaping.

To send data in a streams-based socket program, a call to the socket send() function is made. For a datagram (UDP/IP) program, a sendto() call is used. The arguments for the send() function include the socket descriptor, a pointer to the data buffer that needs to be sent, its length, and flags specifying the type of transmission. The send() function blocks until the bytes in the buffer have been successfully accepted by the TCP/IP stack. All socket programs must have a built-in means of handling blocking. Typically, separate threads are used for each TCP session, or for single-threaded applications a select() function can be polled to determine when a send() completes. In order for an application program to implement shaping, it should be able to determine when the last packet was sent and, depending on the time elapsed, block the application to ensure correct spacing between two consecutive packets. One of the arguments of the select() function is a struct timeval, which specifies a timeout with microsecond resolution. Using this function it would seem possible to block an application until a timeout value has passed and then issue the send() for the next packet. In Unix-based systems, however, the kernel timers are driven by the variable jiffies. For most systems, and especially Intel-based architecture systems, the granularity of jiffies is in the tens of milliseconds. For a gigabit link (where a single 1500-byte packet takes 12 microseconds to transmit), we need sub-microsecond granularities for our shaping timer. Thus, using the built-in jiffies will not work. Microsecond granularities are probably best provided by hardware support.

Figure 6 shows the protocol stack with a shim layer at the sockets interface implementing packet spacing. The shim layer implements the socket interface, making existing applications compatible with spacing. Traffic management software - beyond the scope of this work - can set the spacing values. Hardware timers could be implemented and made accessible on the adapter.

[Figure 6 - Shim layer for sockets with shaping: the application calls a sockets shim above TCP (or UDP), IP, the device driver, and the link; management software sets the spacing values, and an Ethernet adapter with microsecond timer support provides the timer set and timer interrupt functions]

We prototyped a timed send sockets call using a simple tuned (i.e., tuned for the particular machine on which it runs) spin loop to generate microsecond delays. We recognize that a spin loop uses CPU resources inefficiently; this is the reason for the hardware timer support described above. Following each send() call, we called a delay(int num_microsec) function to delay num_microsec microseconds. This delay value could also be incorporated directly into the send() call as a new argument. We evaluated our prototype timed send() for the WindowsNT disk-to-disk and memory-to-memory transfers.
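A minimal sketch of such a timed send is shown below. This approximates the prototype just described but is not the authors' code: the prototype ran on WindowsNT with a machine-tuned spin loop, whereas this sketch uses a POSIX gettimeofday()-bounded busy-wait, and the fixed per-packet spacing argument is an assumption for the example.

    /*
     * Sketch of a timed (paced) send: after each send(), busy-wait until a
     * fixed per-packet spacing has elapsed.  The spin loop burns CPU cycles;
     * an adapter-resident hardware timer could replace it.
     */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <stddef.h>

    static void delay_us(long num_microsec)
    {
        struct timeval start, now;
        gettimeofday(&start, NULL);
        do {                                    /* busy-wait */
            gettimeofday(&now, NULL);
        } while ((now.tv_sec - start.tv_sec) * 1000000L +
                 (now.tv_usec - start.tv_usec) < num_microsec);
    }

    /* Send `len` bytes, then pace by `spacing_us` microseconds. */
    ssize_t timed_send(int sock, const void *buf, size_t len, int flags,
                       long spacing_us)
    {
        ssize_t n = send(sock, buf, len, flags);
        if (n >= 0)
            delay_us(spacing_us);
        return n;
    }

The spacing argument plays the role of the delay() value described above; with hardware timer support the busy-wait could be replaced by blocking on a timer interrupt.
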
A simple sockets file transfer program was built that transfers the same large file (disk-to-disk) with and without the timed send() sockets function. The same program was modified to also do a memory-to-memory transfer (by sending the contents of a large array). Traces were collected for both of these cases and the results are shown in Tables 10 and 11 (means of four independent runs). As in previous experiments, the queuing delay is for a 90% offered load to a single-server queue with infinite buffer size. It can be seen that by smoothing we are able to increase the friendliness of the traffic stream and thus achieve a reduction in the queuing delay. Because of the spin loop implementation, the CPU utilization increased from 65% to 90% for a memory-to-memory transfer and from 73% to 75% for a disk-to-disk transfer. An implementation using hardware timers should be able to avoid this problem. Thus, using smoothing we are able to achieve about a 23% reduction in the queuing delay for a disk-to-disk transfer and an order of magnitude reduction for a memory-to-memory transfer, without sacrificing the mean rate of the transfer. This is significant!

Table 10 - WindowsNT disk-to-disk (timed send())

  Case                   Mean (ms)   CoV    P/M   Q Delay (ms)
  Without timed send()   0.092       4.7    8.7   1.95
  With timed send()      0.092       4.7    9.0   1.51

Table 11 - WindowsNT memory-to-memory (timed send())

  Case                   Mean (ms)   CoV    P/M   Q Delay (ms)
  Without timed send()   0.038       1.89   3.2   0.86
  With timed send()      0.036       1.3    3.0   0.09

7. Summary and Future Work

We have studied the traffic characteristics of TCP/IP bulk data transfers over Gigabit and 100-Mbps switched Ethernet, focusing on both disk-to-disk and memory-to-memory transfers. Our study was conducted for both WindowsNT and Linux and for cases of a server being loaded (by other network requests) and unloaded. We have shown that transfers involving disk access are burstier than those involving only memory access. Transfers involving disk access result in larger queuing delays, and thus also larger packet loss rates, on a Gigabit network as compared to a fast (100-Mbps) Ethernet network. We implemented a prototype timed send() socket call that allows us to achieve smoothing at the application layer and significantly improve flow characteristics to reduce queuing delays in the network.

Future work should concentrate on 1) studying the operating system parameters that influence network traffic characteristics, 2) implementing a smoothing routine in a socket shim layer to enable existing applications to take advantage of application-layer smoothing, and 3) studying trade-offs of adding delay in the sending host versus queuing delays in the network. We also need to make queuing delay and loss measurements on real network devices and not just for a simulated queue.

Acknowledgements

The authors acknowledge Zane Reynolds and Hiroshi Fujinoki, students at the University of South Florida, for their helpful comments.

References

[1] A. Aggarwal, S. Savage, and T. Anderson, "Understanding the Performance of TCP Pacing," Proceedings of IEEE INFOCOM, pp. 1157-1165, March 2000.
[2] W. Chimiak, "The Radiology Environment," IEEE Journal on Selected Areas in Communications, Vol. 10, No. 7, pp. 1133-1144, September 1992.
[3] M. Crovella and A. Bestavros, "Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes," IEEE/ACM Transactions on Networking, Vol. 5, No. 6, pp. 835-846, December 1997.
[4] A. Erramilli, O. Narayan, and W. Willinger, "Experimental Queuing Analysis with Long-Range Dependent Packet Traffic," IEEE/ACM Transactions on Networking, Vol. 4, No. 2, pp. 209-223, April 1996.
[5] A. Erramilli and J. Wang, "Monitoring Packet Traffic Levels," Proceedings of IEEE GLOBECOM, pp. 274-280, December 1994.
[6] P. Farrell and H. Ong, "Communication Performance over a Gigabit Ethernet Network," Proceedings of the IEEE International Performance, Computing and Communications Conference, pp. 181-189, February 2000.
[7] G. Helmer and J. Gustafson, "Netpipe: A Network Protocol Independent Performance Evaluator," 2000. URL: http://www.scl.ameslab.gov/netpipe/.
[8] Http_Load - Multiprocessing HTTP Test Client, 2000. URL: http://www.acme.com/software/http_load.
[9] V. Jacobson, C. Leres, and S. McCanne, "The tcpdump Manual Page," Lawrence Berkeley Laboratory, Berkeley, CA, June 1989.
[10] R. Jain and S. Routhier, "Packet Trains - Measurements and a New Model for Computer Network Traffic," IEEE Journal on Selected Areas in Communications, Vol. 4, No. 6, pp. 986-995, September 1986.
[11] R. Jones, "Netperf: A Network Performance Benchmark," Rev. 1.7, Hewlett Packard Co., March 1993.
[12] W. Leland, M. Taqqu, W. Willinger, and D. Wilson, "On the Self-Similar Nature of Ethernet Traffic (Extended Version)," IEEE/ACM Transactions on Networking, Vol. 2, No. 1, pp. 1-15, February 1994.
[13] Linux 2.2.12 TCP Performance Fix for Short Messages. URL: http://www.icase.edu/coral/linuxtcp2.html.
[14] K. Moldeklev and P. Gunningberg, "How a Large ATM MTU Causes Deadlocks in TCP Data Transfers," IEEE/ACM Transactions on Networking, Vol. 3, No. 4, pp. 409-422, August 1995.
[15] K. Park, G. Kim, and M. Crovella, "On the Relationship between File Sizes, Transport Protocols, and Self-Similar Network Traffic," Proceedings of the International Conference on Network Protocols, pp. 171-180, October 1996.
[16] V. Paxson, "Automated Packet Trace Analysis of TCP Implementations," Proceedings of ACM SIGCOMM, pp. 167-179, September 1997.
[17] H. Schwetman, "CSIM18 - The Simulation Engine," Proceedings of the 1996 Winter Simulation Conference, pp. 517-521, December 1996.
[18] Tools Page for Kenneth J. Christensen, 2000. URL: http://www.csee.usf.edu/~christen/tools/toolpage.html.
[19] Windump: Tcpdump for Windows, 2000. URL: http://netgroup-serv.polito.it/windump.