Tracking the Evolution of Web Traffic: 1995-2003

The University of North Carolina at Chapel Hill
Department of Computer Science

11th ACM/IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)
Orlando, October 13th, 2003

Tracking the Evolution of Web Traffic: 1995-2003

Félix Hernández-Campos, Kevin Jeffay, F. Donelson Smith
http://www.cs.unc.edu/research/dirt

Web Traffic Measurement and Analysis at UNC-Chapel Hill

In 1997, the need to populate web traffic generators for experimental networking research motivated a large-scale study of web traffic at UNC with three goals:
- Develop a lightweight methodology
  - Based on passive measurement
  - Easy to keep models up to date
- Replace smaller-scale, quickly aging models
  - Mah, 1995 data set
  - Crovella et al., 1995 data set (revised with 1998 data)
- Characterize the use of the protocol
  - E.g., use of persistent connections

Web Traffic Measurement and Analysis at UNC-Chapel Hill

- Our methodology and first results were published in SIGMETRICS/Performance '01
  - "What TCP/IP Protocol Headers Can Tell Us About the Web"
- The modeling aspect is explored in a series of papers
  - E.g., "Variable Heavy Tails in Internet Traffic" (with J.S. Marron) (Part I: "Understanding Heavy Tails", published in MASCOTS '02)
- In this talk, I will describe our approach and our observations on the evolution of web traffic:
  - Three data sets: 1999, 2001 and 2003
  - Comparisons to Mah and Crovella et al.

Methodology: Study of Web Content Consumers

[Diagram: web clients at the University of North Carolina at Chapel Hill send requests to, and receive responses from, web servers on the Internet]

- We studied a large collection of users (~35,000) as web content consumers
- The only source of data for our study was packet header traces
  - Anonymized IP addresses
  - No headers

Methodology: One-Way Packet Header Traces

[Diagram: a traffic monitor (tcpdump) on the Gigabit Ethernet link between the web clients at the University of North Carolina at Chapel Hill and the web servers on the Internet]

- Only inbound TCP/IP headers are captured
  - Eliminates synchronization and buffering issues on the NIC
  - Reduces trace size
- Trace collection:
  - 2.7 TB of packet headers
  - ~40 billion packets
  - ~16 TB of data transfers

Methodology: Processing Sequence Overview

[Diagram labels: tcpdump; raw TCP/IP headers trace; filter and sort; TCP connections (port 80); connection-level analysis; Req/Rsp exchanges; client-level analysis; client behavior; statistical analysis]

TCP/IP Headers and Request/Response Exchange

[Diagram: segments exchanged between the web client (UNC) and the web server (Internet) for one request and one response. Some segment labels were lost in extraction. Surviving labels, in order: SYN; SYN-; seqno 305 ackno 1; seqno 1 ackno 305; seqno 1461 ackno 305; seqno 2876 ackno 305; seqno 305 ackno 2876]

TCP/IP Headers and Server-to-Client Segments Only

[Diagram: the same request and response, showing only the segments sent by the web server (Internet) to the web client (UNC), with annotations "Ackno" and "Seqno". Surviving labels, in order: SYN-; seqno 1 ackno 1; seqno 1 ackno 305; seqno 1461 ackno 305; seqno 2876 ackno 305]
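The server-to-client numbers above are enough to recover exchange sizes: an advance in ackno means the client sent more request bytes, and the distance seqno travels while that ackno holds is the response. The Python sketch below illustrates that arithmetic only. It is not the authors' tool; the function name `infer_exchanges` and the `(seqno, ackno)` tuple layout are hypothetical, and it assumes in-order segments, no retransmissions or sequence-number wraparound, relative numbering that starts at 1, and a final segment that carries no payload (so its seqno marks the end of the last response).

```python
def infer_exchanges(segments, initial_seq=1, initial_ack=1):
    """Infer (request_bytes, response_bytes) pairs for one TCP connection.

    `segments` is an in-order list of (seqno, ackno) pairs taken from the
    server-to-client direction only (hypothetical layout, see assumptions).
    """
    exchanges = []
    last_ack = initial_ack      # highest client byte acknowledged so far
    resp_start = initial_seq    # seqno where the current response begins
    max_seq = initial_seq       # highest server seqno seen so far
    req_size = None             # size of the request being answered

    for seqno, ackno in segments:
        if ackno > last_ack:
            # The client sent new data: a new request has arrived.
            if req_size is not None:
                # Close the previous exchange at the current seqno.
                exchanges.append((req_size, seqno - resp_start))
            req_size = ackno - last_ack
            resp_start = seqno
            last_ack = ackno
        max_seq = max(max_seq, seqno)

    if req_size is not None:
        # Assumes the last segment has no payload (see lead-in).
        exchanges.append((req_size, max_seq - resp_start))
    return exchanges


# Segment labels from the server-to-client slide above.
single = [(1, 1), (1, 305), (1461, 305), (2876, 305)]
print(infer_exchanges(single))      # [(304, 2875)]

# Segment labels from the persistent-connection example slide.
persistent = [(1, 1), (1, 305), (1461, 305), (2876, 305),
              (2876, 567), (4336, 567), (5796, 567), (6341, 567)]
print(infer_exchanges(persistent))  # [(304, 2875), (262, 3465)]
```

The second result agrees with the 262-byte request and 3465-byte response annotated on the persistent-connection slide.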

Methodology: Request/Response Traces

- Unidirectional TCP/IP header traces are sufficient for capturing application-level behavior

[Diagram: web client and web server. Request: computed. Response: directly observed]

Persistent Connections in Example TCP/IP Headers

[Diagram: server-to-client segments of one connection between the web client (UNC) and the web server (Internet) that carries two exchanges, with annotations "Ackno" and "Seqno"]
- Request 1, Response 1
- Request 2: 262 bytes; Response 2: 3465 bytes
- Surviving segment labels: SYN-; seqno 1 ackno 1; seqno 1 ackno 305; seqno 1461 ackno 305; seqno 2876 ackno 305; seqno 2876 ackno 567; seqno 4336 ackno 567; seqno 5796 ackno 567; seqno 6341 ackno 567

Sizes of Requests: Empirical CDFs and Empirical CCDFs

[Figures: empirical CDF (x-axis: size in bytes) and empirical CCDF (x-axis: no. of bytes; y-axis label fragment: "Complementary") of request sizes. Surviving annotations: 0.82, 0.64, 1999, 2003, 0.3e-5, 1 MB, 0.0003%, 296 requests, 2.7% bytes]
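For readers who want to reproduce this kind of plot, the sketch below computes an empirical CCDF, P[X > x], and the share of objects and of bytes above a threshold, which appears to be the kind of statistic annotated at 1 MB on the request-size figure. The function names `empirical_ccdf` and `tail_share` are hypothetical, and the sample sizes are made up for illustration; they are not values from the traces.

```python
from bisect import bisect_right


def empirical_ccdf(sizes):
    """Return [(x, P[X > x])] for each distinct value x in `sizes`."""
    ordered = sorted(sizes)
    n = len(ordered)
    points = []
    for x in sorted(set(ordered)):
        at_most_x = bisect_right(ordered, x)   # count of samples <= x
        points.append((x, (n - at_most_x) / n))
    return points


def tail_share(sizes, threshold):
    """Fraction of objects, and of total bytes, in objects > threshold."""
    big = [s for s in sizes if s > threshold]
    return len(big) / len(sizes), sum(big) / sum(sizes)


# Illustrative sizes in bytes (made up, not from the UNC traces).
sample = [100, 200, 200, 1000]
print(empirical_ccdf(sample))    # [(100, 0.75), (200, 0.25), (1000, 0.0)]
print(tail_share(sample, 500))   # one object in four, two thirds of the bytes
```

Plotting the CCDF on log-log axes is the usual way to make the tail visible, since the CDF compresses the few very large objects into the last fraction of a percent.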

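Response Sizes

[Figure: distribution of response sizes; x-axis: size in bytes]

Response Sizes

[Figure: distribution of response sizes; x-axis: no. of bytes; y-axis label fragment: "Complementary". Labels: "LogNormal Fits", "Pareto Fits", "Systematic Wobbles"]

Page Identification Heuristic: Two TCP Connections Example

[Diagram: client and server. Page 1 consists of a top-level object followed by embedded objects; a quiet time of more than 1 second separates it from Page 2]

Objects Per Page

[Figure: x-axis: No. of Objects (Exchanges)]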
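The page identification heuristic can be sketched in a few lines. The slide states only that a quiet time of more than 1 second separates pages; how the quiet time is measured is an assumption here (the gap between the end of all earlier activity from a client and the start of its next request). The function name `group_into_pages`, the `(start, end)` tuple layout, and the timestamps are hypothetical.

```python
def group_into_pages(exchanges, quiet_time=1.0):
    """Group one client's exchanges into pages.

    `exchanges` is a list of (start, end) times in seconds, possibly from
    several concurrent TCP connections. A new page starts when the client
    has been idle for more than `quiet_time` seconds.
    """
    pages = []
    current = []
    busy_until = None   # latest end time among exchanges in `current`

    for start, end in sorted(exchanges):
        if current and start - busy_until > quiet_time:
            pages.append(current)
            current = []
            busy_until = None
        current.append((start, end))
        busy_until = end if busy_until is None else max(busy_until, end)

    if current:
        pages.append(current)
    return pages


# Made-up timestamps: a top-level object, two embedded objects fetched
# in parallel, a pause longer than 1 second, then a second page.
client = [(0.0, 0.2), (0.25, 0.6), (0.3, 0.5), (2.0, 2.1), (2.15, 2.4)]
pages = group_into_pages(client)
print([len(p) for p in pages])   # [3, 2] objects per page
```

Tracking the latest end time rather than the previous start time keeps overlapping transfers on parallel connections inside the same page.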

Objects Per Page

[Figure: x-axis: No. of Objects (Exchanges); y-axis label fragment: "Complementary"]

Page Requests Per IP Address

[Figure: x-axis: No. of Page Requests]

Impact of Tracing Interval Length

Questions:
- Can we obtain a sufficiently large sample with a small number of short traces?
- How does the length of the tracing interval affect the overall empirical distribution shapes?
- Should we include in the empirical distributions the data from incomplete TCP connections?

Approach:
- Examine a wide range of trace lengths
  - 4 h, 2 h, 1 h, 30 min, 15 min, 5 min and 90 s
- Construct data sets by sub-sampling the 21 4-hour-long traces collected in 2001
  - E.g., remove the first and last hour of each trace to produce 21 2-hour-long traces

[Figure: x-axis: Response Size in Bytes]
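The sub-sampling step can be illustrated as follows. The slides give only the 2-hour example (drop the first and last hour of a 4-hour trace); applying the same centered trimming to the shorter lengths is an assumption of this sketch, and the function name `central_window` and the timestamp layout are hypothetical.

```python
def central_window(timestamps, trace_start, trace_length, window_length):
    """Keep the timestamps that fall in the centered sub-interval.

    All arguments are in seconds. The window is half-open:
    [trace_start + margin, trace_start + margin + window_length).
    """
    if window_length > trace_length:
        raise ValueError("window cannot be longer than the trace")
    margin = (trace_length - window_length) / 2
    lo = trace_start + margin
    hi = lo + window_length
    return [t for t in timestamps if lo <= t < hi]


HOUR = 3600
# Made-up packet timestamps within a 4-hour trace that starts at t = 0.
packets = [0, 3599, 3600, 7200, 10799, 10800, 14399]

# Remove the first and last hour to obtain a 2-hour trace.
print(central_window(packets, 0, 4 * HOUR, 2 * HOUR))   # [3600, 7200, 10799]
```

Connections that straddle a window boundary end up only partially captured, which is the effect the partially-captured-objects slides examine.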

Impact of Tracing Interval Length

[Figures: two plots; x-axes: "Response Size in Bytes" and "No. of Pages Per Client IP Address"; y-axis label fragment: "Complementary"]

Impact of Partially-Captured Objects

[Figures: two plots; x-axes: "Response Size in Bytes" and "No. of Bytes"; y-axis label fragment: "Complementary"]

Summary and Conclusions

- Web Traffic Characterization
  - New data to populate traffic generators
    - Request sizes
    - Response sizes
    - Use of persistent connections
    - ...
- 1-hour-long traces are sufficient to capture application-level behavior
  - Short traces cut off large objects, which skews the tails of the distributions
- Persistent Connections:
  - ~15% of all the connections
  - 40-50% of all the transferred bytes