600.413 Topics in P2P Networked Systems
Week 4: Measurements
Andreas Terzis
Slides from Stefan Saroiu

Content Delivery is Changing
Thirst for data continues to increase (more data and users)
New types of data have emerged: audio, video
People use new means to exchange this data
The result: the Internet is now seeing a mixture of old and new content delivery systems
  Conventional: Web servers and Web clients
  CDNs: Akamai, Digital Island, Speedera
  P2Ps: Gnutella, Napster, AudioGalaxy
Goals of this Work
Explore and characterize content delivery in the new Internet: World Wide Web, Akamai CDN, and Gnutella P2P
High-level questions:
  What is the impact of these systems on the Internet?
  What are the characteristics of the new delivery systems?

Methodology
Collect anonymized traces at UW border routers
  Extension of existing infrastructure [Wolman & Voelker]
Identify and log HTTP traffic:
  Akamai: ports 80, 8080, 443 and served by an Akamai server
  Web: ports 80, 8080, 443 (excluding Akamai)
  Gnutella: ports 6346, 6347
  port 1214
In this talk, results are based on a 9-day trace
  May 28th to June 6th, 2002
  500 million transactions, 20 TB of HTTP data
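The port-based identification step can be sketched as follows. This is only an illustration under stated assumptions, not the tracing code used in the study: the `served_by_akamai` flag is hypothetical (the slides do not say how Akamai servers are recognized), and port 1214 keeps a neutral label because the slide lists it without a system name.

```python
# Port sets as listed on the Methodology slide.
WEB_PORTS = {80, 8080, 443}
GNUTELLA_PORTS = {6346, 6347}
PORT_1214 = {1214}  # listed on the slide without a system name


def classify(server_port: int, served_by_akamai: bool = False) -> str:
    """Return a coarse traffic label for one logged HTTP transaction.

    `served_by_akamai` is a hypothetical input: how the trace decides
    that a server belongs to Akamai is not described in the slides.
    """
    if server_port in WEB_PORTS:
        # Same ports for both; Akamai is split out of Web by server identity.
        return "akamai" if served_by_akamai else "web"
    if server_port in GNUTELLA_PORTS:
        return "gnutella"
    if server_port in PORT_1214:
        return "port-1214"
    return "other"
```

Checking the Akamai condition inside the Web-port branch mirrors the slide's wording: Web is whatever remains on ports 80, 8080, and 443 once Akamai-served transactions are excluded.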
Question 1: What is the bandwidth impact of the new content delivery systems?

Where has all the bandwidth gone?
[Figure: bandwidth (Mbps, 0 to 800) by day, Wed through the following Thu; series: non-HTTP TCP, P2P, Akamai]
Web = 14% of TCP; P2P = 43% of TCP
P2P now dominates Web in bandwidth consumed!
                    inbound      outbound     inbound    outbound
Bytes transferred   1.51 TB      3.02 TB      1.78 TB    13.6 TB
Unique objects      72,818,997   3,412,647    111,437    166,442
Clients             39,285       1,231,308    4,644      611,005
Servers             403,087      1,463        281,026    3,888

Bandwidth Consumed by UW Servers
[Figure: bandwidth (Mbps, 0 to 250) by day, Wed through the following Fri; labeled series: Gnutella]
P2P has a diurnal cycle, like the Web
P2P peaks later at night
Summary 1: (what is the bandwidth impact?)
P2P traffic is the largest bandwidth consumer
  Three times bigger than Web, at UW
Uploads significantly exceed downloads, at least within UW-like environments

Question 2: What are the bytes carrying?
Specialized Content Delivery Systems
[Figure: Byte Breakdown per Content Delivery System; % bytes (0% to 100%) split into text, images, audio, video, and other; labeled systems: Akamai, Gnutella]
Web = text + images
= video
Akamai = images
Gnutella = video + audio

HTTP Bytes: Now vs. Then
[Figure: Content Types Ordered by Size, May 1999 vs. May 2002; x-axis: % of all bytes (0% to 20%); content types shown: video/x-msvideo, text/plain, MP3, app/octet-stream, GIF, JPG, HASHED, MPG, HTML, AVI; bar labels: 1.7%, 2.1%, 4.8%, 7.1%, 7.5%, 8.8%, 9.9%, 11.3%, 14.6%, 18.0%]
Summary 2: (what are the bytes?)
Systems have become specialized in the type of content they deliver
P2P = video + audio bytes
  It's not the music: a single video is 200 MP3s
Web = text + image bytes

Question 3: How do workload characteristics differ in the new delivery systems, compared to the old ones?
Object Size
[Figure: Object Size CDF; % objects (0% to 100%) vs. object size (KB, log scale, up to 1,000,000); labeled series: Akamai, Gnutella]
P2P objects are 3 orders of magnitude bigger than Web objects

Most bandwidth consuming object
       (inbound)                (inbound)                            (outbound)
Rank   size (MB)   # requests   size (MB)   # clients   # servers   size (MB)   # clients   # servers
1      0.009       1.4 mil.     694.4       20          164         696.9       397         1
Top 5 bandwidth consuming objects
       (inbound)                (inbound)                            (outbound)
Rank   size (MB)   # requests   size (MB)   # clients   # servers   size (MB)   # clients   # servers
1      0.009       1.4 mil.     694.4       20          164         696.9       397         1
2      0.002       3 mil.       702.2       14          91          699.2       1000        4
3      333         21           690.3       22          83          699.1       390         10
4      0.005       1.4 mil.     775.6       16          105         700.9       558         2
5      2.23        1,457        698.1       14          74          634.2       540         1
Few UW clients and servers exchange big, popular objects

Server Availability
[Figure: Fraction of Requests by Server Status Codes; % responses (0% to 100%) split into Success (2xx), Error (5xx), and Other; labeled systems: Akamai, Gnutella; bar labels: 35%, 30%, 65%, 69%, 81%, 89%, 19%, 6%]
Most P2P servers are unresponsive
Summary 3: (how do characteristics differ?)
P2P objects are three orders of magnitude bigger than Web objects
P2P servers are largely unresponsive (but it's not clear whether it matters to users)

Question 4: How is bandwidth use distributed among objects, clients, servers?
Object Popularity at UW
[Figure: Top Bandwidth Consuming Objects; cumulative % bytes (0% to 100%) vs. number of objects (0 to 1,000); labeled series: Gnutella, Akamai]
1,000 objects (out of 111K) responsible for 50% of bytes transferred
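A statement such as "1,000 objects are responsible for 50% of bytes" is a top-k byte share. The sketch below shows one way to compute it from per-object byte counts; the function name and the toy data are made up for illustration and are not taken from the trace.

```python
def top_k_byte_share(bytes_per_object: dict, k: int) -> float:
    """Fraction of all bytes carried by the k most bandwidth-consuming objects.

    `bytes_per_object` maps an object identifier to the total bytes
    transferred for that object (hypothetical input format).
    """
    sizes = sorted(bytes_per_object.values(), reverse=True)
    total = sum(sizes)
    if total == 0:
        return 0.0
    return sum(sizes[:k]) / total


# Toy data, not from the trace: two of four objects carry 80% of the bytes.
toy = {"a": 50, "b": 30, "c": 15, "d": 5}
share = top_k_byte_share(toy, 2)  # 0.8
```

Sweeping `k` from 1 upward and plotting the result gives a cumulative curve of the kind shown on the slide.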
UW Clients
[Figure: Top Bandwidth Consuming UW Clients (as fraction of each system); cumulative % bytes (0% to 100%) vs. number of UW clients (0 to 1,000); labeled series: Gnutella; + Akamai]
Small number of clients account for large portion of traffic
  200 Web+Akamai = 13% of traffic
  200 = 50% of traffic (20% of all incoming HTTP)
UW Servers
[Figure: Top Bandwidth Consuming UW Servers (as fraction of each system); cumulative % bytes (0% to 100%) vs. number of UW servers (0 to 1,000); labeled series: Gnutella]
20 servers supply 80% of Web traffic
334 servers (10%) supply 80% of traffic
Load is not uniformly spread across peers!
Summary 4: (how is bandwidth distributed?)
A very small number of P2P objects, clients, and servers consume an enormous fraction of traffic

Question 5: Could P2P data be cached?
Caching Methodology
Simulation of an ideal cache:
  Infinite storage, bandwidth, CPU, etc.
  All objects are cached forever
  P2P objects are immutable, no dynamic data
Evaluated: both inbound and outbound P2P data cacheability
Objects are big: byte hit rate is more important than object hit rate
Gives an upper bound on potential for caching

P2P Data Cacheability
[Figure: Ideal Byte Hit Rate; byte hit rate (0% to 100%) over time, 28-May to 12-Jul; series: Outbound, Inbound]
Potential for caching is high in both directions
UW inbound cache needs a month to warm up
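The ideal-cache assumptions above (infinite storage, nothing evicted, immutable objects) reduce the simulation to remembering which objects have already been seen. The following is a minimal sketch under those assumptions, not the simulator used for the slides; the `(object_id, size_bytes)` request format and the toy trace are hypothetical.

```python
def ideal_cache_hit_rates(requests):
    """Replay a trace through an infinite cache that never evicts.

    `requests` is an iterable of (object_id, size_bytes) in trace order
    (hypothetical format). Returns (byte_hit_rate, object_hit_rate).
    """
    seen = set()
    hit_bytes = total_bytes = 0
    hits = total = 0
    for object_id, size in requests:
        total += 1
        total_bytes += size
        if object_id in seen:
            # Objects are immutable and cached forever, so any repeat is a hit.
            hits += 1
            hit_bytes += size
        else:
            seen.add(object_id)  # first request is a compulsory miss
    if total == 0:
        return 0.0, 0.0
    byte_rate = hit_bytes / total_bytes if total_bytes else 0.0
    return byte_rate, hits / total


# Toy trace, not from the study: one large object requested twice.
trace = [("v", 600), ("p1", 100), ("v", 600), ("p2", 100), ("p3", 100)]
byte_rate, object_rate = ideal_cache_hit_rates(trace)  # (0.4, 0.2)
```

In the toy trace a single repeated large object gives a byte hit rate twice the object hit rate, which is why byte hit rate is the more informative metric when objects are big. Because every first request is a miss, the result is an upper bound for any real cache replaying the same trace.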
Summary 5: (is it cacheable?)
Large potential for data cacheability in P2P
P2P caches take a long time to warm up

Conclusions
The Internet is being used in a completely different way:
  P2P traffic is the largest bandwidth consumer
  Peers consume significant bandwidth in both directions
  P2P objects are 1,000 times larger than Web objects
  Small number of huge objects are responsible for an enormous fraction of bytes transferred
    300 objects consumed 5.6 TB bandwidth!
  Few P2P peers are causing much of the traffic
  Large potential for caching in P2P
Final Thoughts
The average peer consumes 90x more bandwidth than the average Web client
Adding another 450 peers is equivalent to doubling the entire UW web client population
Is the current architecture of scalable?
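The two figures above can be checked against each other with a line of arithmetic. The sketch assumes that the 39,285 inbound "Clients" entry in the trace summary table is the UW Web client population; the slides do not state this explicitly, so treat the comparison as indicative only.

```python
PEER_TO_WEB_CLIENT_RATIO = 90  # from the slide: average peer vs. average Web client
ADDED_PEERS = 450              # from the slide

# Assumption: the 39,285 "Clients" entry in the trace table counts UW Web clients.
ASSUMED_WEB_CLIENTS = 39_285

# Bandwidth of the added peers, expressed in average-Web-client equivalents.
equivalent_web_clients = ADDED_PEERS * PEER_TO_WEB_CLIENT_RATIO  # 40500

# A ratio near 1 means the added peers roughly match the whole Web population.
ratio = equivalent_web_clients / ASSUMED_WEB_CLIENTS
```

Under that assumption, 450 peers correspond to 40,500 Web-client equivalents, about 3% more than 39,285, which is consistent with the "doubling" claim.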