Data Transfers in the Grid: Workload Analysis of Globus GridFTP


Nicolas Kourtellis, Lydia Prieto, Gustavo Zarrate, Adriana Iamnitchi (University of South Florida)
Dan Fraser (Argonne National Laboratory)

Objective 1: Quantify volume of transfers
- What is the transfer size distribution?
- What is the volume of activity for the most active hosts?
Objective 2: Understand how tuning capabilities are used
- What are the buffer sizes used during the transfers?
- What is the average bandwidth?
- What is the utilization of functionalities like streams and stripes?
Objective 3: Quantify user base and predict usage trends
- How does the user base evolve over time?
- What are the geographical characteristics of the GridFTP data transfers?

Outline
- Metrics dataset
- Surprises and zoom-in (TeraGrid)
- Lessons and discussions

GridFTP Metrics Dataset

| Field                      | Range of Values       | Comment          |
|----------------------------|-----------------------|------------------|
| Source hostname/host IP    | String/IPnet          | Anonymized       |
| Start time of the transfer | Timestamp             | Accuracy: ms     |
| End time of the transfer   | Timestamp             | Accuracy: ms     |
| TCP Buffer Size            | Integer (Bytes)       |                  |
| Total Number of Bytes      | Integer (Bytes)       |                  |
| Number of Streams          | Integer               |                  |
| Number of Stripes          | Integer               |                  |
| Store or Retrieve          | Integer (three codes) | STOR, RETR, LIST |

Metrics Dataset
Started with ~37.5 million records (Jul 05 - Mar 07)
Cleaning:
- transfer size : ~. million records
- buffer size <: ~ records
- directory listings: ~3.9 million records
- invalid hostnames (e.g., /[B@97e): ~,6 records
- ANL-TeraGrid testing: ~. million records
- duplicate reports: ~6. million records
- self transfers (source=destination): ~5.75 million records
Clean database: ~77. million records (~56.%)
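
The cleaning pass above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Transfer` class and its field names are hypothetical, a destination field is assumed to exist (the self-transfer rule implies one), and the size and buffer thresholds are guesses because the exact comparisons are not legible in the slide. The invalid-hostname and ANL-TeraGrid testing filters are omitted, since their criteria are not given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    # Hypothetical field names; the slides list the fields, not identifiers.
    source: str
    destination: str  # assumed field, implied by "source=destination"
    start_ms: int
    end_ms: int
    buffer_size: int
    num_bytes: int
    op: str  # "STOR", "RETR" or "LIST"


def clean(records):
    """Drop records in the cleaning categories; keep the first copy of duplicates."""
    seen = set()
    kept = []
    for r in records:
        if r.num_bytes <= 0:  # assumed threshold for an invalid transfer size
            continue
        if r.buffer_size < 0:  # assumed threshold for an invalid buffer size
            continue
        if r.op == "LIST":  # directory listings
            continue
        if r.source == r.destination:  # self transfers
            continue
        if r in seen:  # duplicate reports (frozen dataclasses are hashable)
            continue
        seen.add(r)
        kept.append(r)
    return kept
```

Treating a buffer size of zero as valid is a deliberate choice in this sketch, since a later slide describes an OS-controlled buffer setting.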

Surprise #1: Transfer Size Distribution
[Figure: % of total vs. transfer size (Bytes, KB, MB)]
Objective 1: Quantify volume of transfers
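
A transfer size distribution of this kind can be computed with a short binning routine. The sketch below assumes power-of-two bins, which is what the axis ticks of the figure suggest but the slides do not state; the function name `size_histogram` is hypothetical.

```python
from collections import Counter


def size_histogram(sizes):
    """Return {bin lower bound in bytes: percent of total}, power-of-two bins."""
    counts = Counter()
    for n in sizes:
        if n <= 0:
            continue  # non-positive sizes are assumed to be filtered out earlier
        counts[1 << (n.bit_length() - 1)] += 1  # largest power of two <= n
    total = sum(counts.values())
    return {low: 100.0 * c / total for low, c in sorted(counts.items())}
```

For example, sizes of 1024 and 1500 bytes both fall into the bin that starts at 1024 bytes (1 KB).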

Zoom-in: TeraGrid
- Are these results representative for production grids?
- GridFTP testing for deployment and learning
- Identify transfers from TeraGrid and analyze the dataset.

Transfer Size Distribution (TG)
[Figure: % of total vs. transfer size (Bytes, KB, MB), TeraGrid transfers only]
Objective 1: Quantify volume of transfers

[Figure: transfer size distributions (% of total vs. transfer size in Bytes, KB, MB) for all transfers (All) and for TeraGrid (TG)]

Why Such Small Transfers?
- Many old versions of GridFTP (i.e., before v3.9.5) are still in use. These versions do not include trace reporting capabilities.
- Other data transfer protocols and implementations are used.
- Users have turned off the reporting capability.
- Some of the logs are inevitably lost due to the UDP-based reporting mechanism.
- The low transfer volumes could suggest a shift towards data-aware job scheduling (?)

Server to Server Transfers
[Figure: share of # transfers and of volume (%) for InterDomain, InterIP, and SelfTransfers]
High reporting of Self Transfers (more than /3)
Objective 1: Quantify volume of transfers

Top 6 Active Hosts (all)
[Figure: volume transferred (TB) and number of transfers per host]
- Top 6 hosts' traffic adds up to ~% of total volume
- Next hosts (IPs) transferred s of TB
Objective 1: Quantify volume of transfers

Number of Transfers & Volume (TG)
[Figure: number of transfers per month (thousands) and total volume per month (TB), by month from Aug-05 to Mar-07]
Objective 1: Quantify volume of transfers

Average Transfer Size & Total Volume (TG)
[Figure: total volume (TB) and average transfer size (GB) per TeraGrid site]
Objective 1: Quantify volume of transfers

Daily Workload (TG)
[Figure: average volume per day (TB) and average number of transfers per day, Monday through Sunday]
- Average volume transferred per day: ~.6TB
- GridFTP doesn't get weekends free!
Objective 1: Quantify volume of transfers
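
A per-weekday average like the one plotted here can be derived from the start timestamps and byte counts. The following is a minimal sketch, assuming millisecond Unix timestamps interpreted in UTC; the function name and input layout are hypothetical, and the slides do not say how the authors aggregated the data.

```python
from collections import defaultdict
from datetime import datetime, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def average_daily_volume(transfers):
    """transfers: iterable of (start_ms, num_bytes) pairs.

    Returns {weekday name: mean bytes per calendar day}. Calendar days with
    no transfers at all are not counted as zero-volume days.
    """
    per_date = defaultdict(int)
    for start_ms, num_bytes in transfers:
        day = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc).date()
        per_date[day] += num_bytes  # total volume for this calendar day

    per_weekday = defaultdict(list)
    for day, total in per_date.items():
        per_weekday[WEEKDAYS[day.weekday()]].append(total)

    return {name: sum(v) / len(v) for name, v in per_weekday.items()}
```

Grouping by day of the month instead of weekday only requires replacing the weekday key with `day.day`.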

Monthly Workload (TG)
[Figure: average volume transferred per day (TB) and average number of transfers per day, by day of the month]
- ~5, transfers per day
- ~TB per day of total volume
- Lowest around .5TB per day
- Peaks due to particular days
Objective 1: Quantify volume of transfers


Surprise #2: Usage of Streams and Stripes
Unreliably reported (from Globus team).
Reliable observations:
- At least ~% of the transfers used streams (number suggested by ANL's website)
- At least ~% of the transfers used a different value, larger than one.
- Maximum number of streams reported: (!!)
Objective 2: Understand how tuning capabilities are used

Buffer Size Distribution
[Figure: % of total vs. buffer size (KB, MB)]
- 6% from the original table: OS-controlled ( bytes)
- Most commonly used: 6 KB KB
- Largest buffer size: -GB (9 records)
Objective 2: Understand how tuning capabilities are used

Average Bandwidth Distribution
[Figure: % of total vs. average bandwidth (bps, Kbps, Mbps)]
- Peak: -56Mbps, ~7.7 million records
- Most common (5%): Mbps Gbps
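
The dataset records the total bytes and the start and end times of each transfer, so a per-transfer average bandwidth can be derived from those three fields. The sketch below assumes that definition (bits transferred divided by elapsed time); the slides do not confirm that the authors computed it this way, and the function name is hypothetical.

```python
def average_bandwidth_bps(num_bytes, start_ms, end_ms):
    """Average bandwidth in bits per second, or None if the duration is not positive."""
    duration_s = (end_ms - start_ms) / 1000.0  # timestamps have ms accuracy
    if duration_s <= 0:
        return None  # zero or negative duration: bandwidth is undefined
    return 8 * num_bytes / duration_s
```

With millisecond timestamps, very short transfers can report a zero duration, which is why the function returns `None` instead of dividing by zero.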

Average Bandwidth Distribution (TG)
[Figure: % of total vs. average bandwidth (Kbps, Mbps, Gbps), TeraGrid transfers only]
Compared to the total dataset: the region of Mbps to Gbps includes more than 5% of the transfers (5% for the whole dataset)


Surprise #3: Geographic Distribution
[Figure: share of # transfers and of volume (%) by country: AT, GB, ES, CH, AU, DE, IT, JP, TW, CA]
- USA: 7.% or ~5. million transfers and .9% or ~.7 PB
- Canada+Taiwan+Japan+Spain: ~M transfers and 36TB
- 9 different countries and 6 different cities (7 cities from USA)
Objective 3: Quantify user base and predict usage trends

User and Domain Evolution (all)
[Figure: # IPs and # Domains per month, Jul-05 to Mar-07, with Linear Fit (# Domains); fit annotation: y = .3x + 67.7, R = .97]
- Continuing increase of user and organization population
- Forecasts: 67 new IPs and new domains per month
Objective 3: Quantify user base and predict usage trends
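
The trend line on this slide is a linear fit of a monthly count against time. A minimal sketch of an ordinary least-squares fit is shown below. It assumes the fit statistic on the slide is the coefficient of determination, which the transcript does not make clear, and the function name is hypothetical. The slope is the forecast number of new IPs or domains per month.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r_squared). Requires at least two distinct
    x values and non-constant y values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination: 1 - residual variation / total variation.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared
```

Here `xs` would be the month index (0 for the first month of the trace) and `ys` the number of distinct IPs or domains observed in that month.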

Summary of Results
- Many transfers in the range of KBs to s MB (peak in 6MB-3MB).
  - Relevant for setting up realistic simulations.
  - Previous work assumes different, larger file sizes.
- Bandwidth measured in previous work is confirmed by our workload analysis.
- Tuning parameters:
  - Users tend not to set the buffer size explicitly (6%), leaving it to the OS.
  - The unexpectedly small transfers do not justify tuning GridFTP parameters (stripes and streams).
- The usage of Globus GridFTP is growing over time in terms of IPs (users), domains (organizations), and volume transferred.
- Missing some of the big players.