Profiling Grid Data Transfer Protocols and Servers
George Kola, Tevfik Kosar and Miron Livny, University of Wisconsin-Madison, USA


Motivation
Scientific experiments are generating large amounts of data; education research and commercial videos are not far behind. Data may be generated and stored at multiple sites. How do we efficiently store and process this data?

Application   First Data   Data Volume (TB/yr)   Users
SDSS          1999         10                    100s
LIGO          2002         250                   100s
ATLAS/CMS     2005         5,000                 1000s
WCER          2004         500+                  100s

Source: GriPhyN Proposal, 2000

Motivation
The Grid enables large-scale computation. Problems:
- data-intensive applications have suboptimal performance
- scaling up creates problems: storage servers thrash and crash
Users want to reduce the failure rate and improve throughput.

Profiling Protocols and Servers
Profiling is a first step: it enables us to understand how time is spent and gives valuable insights. It helps:
- computer architects add processor features
- OS designers add OS features
- middleware developers optimize the middleware
- application designers design adaptive applications

Profiling
We (middleware designers) are aiming for automated tuning:
- tune protocol parameters and the concurrency level
- tuning depends on the dynamic state of the network and of the storage server
We are developing low-overhead online analysis; detailed offline + online analysis would enable automated tuning, as sketched below.
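The following is a minimal sketch of the kind of rule such a tuner might apply, assuming the online analysis can estimate link bandwidth, round-trip time and loss rate; the function names, thresholds and stream-count heuristic are illustrative assumptions, not something the talk specifies.

    # Illustrative tuning rules, assuming bandwidth/RTT/loss estimates are
    # produced by the online analysis described above.

    def suggest_tcp_buffer(bandwidth_bps: float, rtt_s: float) -> int:
        # Size the socket buffer to the bandwidth-delay product so a single
        # stream can keep the pipe full.
        return int(bandwidth_bps / 8 * rtt_s)

    def suggest_parallel_streams(loss_rate: float, max_streams: int = 8) -> int:
        # Extra streams mainly help when one stream cannot fill the link
        # (e.g. under loss); cap them to bound the load on the server.
        if loss_rate < 1e-4:
            return 1
        return min(max_streams, max(2, int(0.01 / loss_rate)))

    # Example: a 100 Mbps link with 10 ms RTT -> 125 KB buffer, 1 stream.
    print(suggest_tcp_buffer(100e6, 0.010), suggest_parallel_streams(1e-5))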

Requirements
Profiling:
- should not alter system characteristics
- full-system profile
- low overhead
We used OProfile:
- based on the Digital Continuous Profiling Infrastructure
- kernel profiling
- no instrumentation
- low/tunable overhead
A sketch of driving it around a workload follows.
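A hedged sketch of wrapping a workload with the classic opcontrol/opreport interface of that OProfile generation; the vmlinux path and the example command line are assumptions and the commands need root, so treat this as illustration rather than the exact procedure used for the talk.

    import subprocess

    def profile(cmd, vmlinux="/boot/vmlinux"):
        # Point OProfile at the kernel image so kernel samples are symbolized,
        # start system-wide sampling, run the workload, then stop the daemon.
        subprocess.run(["opcontrol", "--setup", f"--vmlinux={vmlinux}"], check=True)
        subprocess.run(["opcontrol", "--start"], check=True)
        try:
            subprocess.run(cmd, check=True)
        finally:
            subprocess.run(["opcontrol", "--shutdown"], check=True)
        # Summarize collected samples per binary/symbol.
        return subprocess.run(["opreport"], check=True,
                              capture_output=True, text=True).stdout

    # Example (illustrative command line): profile a single GridFTP get.
    # print(profile(["globus-url-copy", "gsiftp://server/file", "file:///tmp/file"]))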

Profiling Setup
Two server machines:
- moderate server: 1660 MHz Athlon XP CPU with 512 MB RAM
- powerful server: dual Pentium 4 Xeon 2.4 GHz CPUs with 1 GB RAM
Client machines were more powerful dual Xeons, to isolate server performance.
100 Mbps network connectivity; Linux kernel 2.4.20, GridFTP server 2.4.3, NeST prerelease.

GridFTP Profile
[Chart: percentage of CPU time spent in Idle, Ethernet Driver, Interrupt Handling, Libc, Globus, OProfile, IDE, File I/O, and Rest of Kernel, for reads from and writes to the GridFTP server]
Read rate = 6.45 MB/s, write rate = 7.83 MB/s => writes to the server are faster than reads from it.
Observations from the profile:
- writes to the network are more expensive than reads => interrupt coalescing
- IDE reads are more expensive than writes
- file system writes are costlier than reads => need to allocate disk blocks
- more overhead for writes because of the higher transfer rate

GridFTP Profile Summary
- Writes to the network are more expensive than reads: interrupt coalescing and DMA would help.
- IDE reads are more expensive than writes: tuning the disk elevator algorithm would help.
- Writing to the file system is costlier than reading (need to allocate disk blocks): a larger block size would help.

NeST Profile
[Chart: percentage of CPU time spent in Idle, Ethernet Driver, Interrupt Handling, Libc, NeST, OProfile, IDE, File I/O, and Rest of Kernel, for reads from and writes to the NeST server]
Read rate = 7.69 MB/s, write rate = 5.5 MB/s.
Observations from the profile:
- similar trend to GridFTP
- more overhead for reads because of the higher transfer rate
- metadata updates (space allocation) make NeST writes more expensive

GridFTP versus NeST
- GridFTP: read rate = 6.45 MB/s, write rate = 7.83 MB/s
- NeST: read rate = 7.69 MB/s, write rate = 5.5 MB/s
GridFTP is 16% slower on reads:
- GridFTP I/O block size is 1 MB (NeST uses 64 KB)
- disk I/O and network I/O are not overlapped (see the sketch below)
NeST is 30% slower on writes:
- metadata updates (space reservation/allocation)
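For reference, a minimal sketch (not GridFTP's or NeST's actual code) of overlapping disk and network I/O on the send path: a reader thread fills a bounded queue while the main thread drains it onto the socket, so the disk fetches the next block while the previous one is on the wire. Block size and queue depth are illustrative.

    import socket
    import threading
    import queue

    BLOCK_SIZE = 1 << 20    # 1 MB blocks, matching the GridFTP block size above
    QUEUE_DEPTH = 4         # small bound limits memory yet still overlaps I/O

    def send_file_overlapped(path: str, sock: socket.socket) -> None:
        blocks = queue.Queue(maxsize=QUEUE_DEPTH)

        def disk_reader():
            with open(path, "rb") as f:
                while True:
                    block = f.read(BLOCK_SIZE)
                    blocks.put(block)       # empty bytes object marks end of file
                    if not block:
                        return

        reader = threading.Thread(target=disk_reader, daemon=True)
        reader.start()
        while True:
            block = blocks.get()            # network send overlaps the next disk read
            if not block:
                break
            sock.sendall(block)
        reader.join()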

Effect of Protocol Parameters
Different tunable parameters:
- I/O block size
- TCP buffer size
- number of parallel streams
- number of concurrent transfers
A sketch of where these knobs act follows the list.
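A hedged illustration of where the four knobs act in a plain TCP sender; GridFTP and NeST expose analogous settings through their own interfaces, and the values below are placeholders rather than measured optima.

    import socket

    IO_BLOCK_SIZE = 256 * 1024     # unit of each disk read and socket send
    TCP_BUFFER_SIZE = 1 << 20      # kernel socket send buffer (SO_SNDBUF)
    PARALLEL_STREAMS = 4           # sockets cooperating on one transfer
    CONCURRENT_TRANSFERS = 2       # independent transfers in flight at once

    def open_stream(host: str, port: int) -> socket.socket:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # TCP buffer size: set before connect so the window can grow to it.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, TCP_BUFFER_SIZE)
        s.connect((host, port))
        return s

    def send_range(sock: socket.socket, path: str, offset: int, length: int) -> None:
        # I/O block size: the granularity of every read()/send() pair.
        with open(path, "rb") as f:
            f.seek(offset)
            remaining = length
            while remaining > 0:
                chunk = f.read(min(IO_BLOCK_SIZE, remaining))
                if not chunk:
                    break
                sock.sendall(chunk)
                remaining -= len(chunk)

    # Parallel streams: run PARALLEL_STREAMS send_range() calls, each on its own
    # socket and byte range; concurrency: CONCURRENT_TRANSFERS such transfers at once.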

Read Transfer Rate [chart]

Server CPU Load on Read [chart]

Write Transfer Rate [chart]

Server CPU Load on Write [chart]

Transfer Rate and CPU Load [chart]

Server CPU Load and L2 DTLB Misses [chart]

L2 DTLB Misses
Parallelism triggers the kernel to use a larger page size => fewer DTLB misses. The worked example below shows why.
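A back-of-the-envelope example of the page-size effect, assuming an illustrative 256 MB of in-flight transfer buffers (the figure is an assumption, not a number from the talk) and the 4 KB versus 4 MB page sizes available on x86 Linux of that era.

    def pages_needed(buffer_bytes: int, page_bytes: int) -> int:
        # Ceiling division: distinct page translations needed to cover the
        # buffer without taking a DTLB miss.
        return -(-buffer_bytes // page_bytes)

    BUFFER = 256 * 1024 * 1024
    print(pages_needed(BUFFER, 4 * 1024))          # 65536 translations with 4 KB pages
    print(pages_needed(BUFFER, 4 * 1024 * 1024))   #     64 translations with 4 MB pages

With DTLBs of that era holding on the order of a few dozen to a few hundred entries, the 4 MB mapping fits entirely while the 4 KB mapping cannot, which matches the lower miss rate observed under parallelism.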

Profiles on the Powerful Server
The next set of graphs was obtained using the powerful (dual Xeon) server.

Parallel Streams versus Concurrency [chart]

Effect of File Size (Local Area) [chart]

Transfer Rate versus Parallelism in a Short-Latency (10 ms) Wide Area [chart]

Server CPU Utilization [chart]

Conclusion
- A full-system profile gives valuable insights.
- A larger I/O block size may lower the transfer rate when network and disk I/O are not overlapped.
- Parallelism may reduce CPU load because it may cause the kernel to use a larger page size; processor and operating-system support for variable-sized pages would be useful.
- Concurrency improves throughput at the cost of increased server load.

Questions?
Contact: kola@cs.wisc.edu
www.cs.wisc.edu/condor/publications.html