Profiling Grid Data Transfer Protocols and Servers
George Kola, Tevfik Kosar and Miron Livny, University of Wisconsin-Madison, USA


Motivation
Scientific experiments are generating large amounts of data; education research and commercial videos are not far behind. Data may be generated and stored at multiple sites. How do we efficiently store and process this data?

Application   First Data   Data Volume (TB/yr)   Users
SDSS          1999         10                    100s
LIGO          2002         250                   100s
ATLAS/CMS     2005         5,000                 1000s
WCER          2004         500+                  100s

Source: GriPhyN Proposal, 2000

Motivation
The Grid enables large-scale computation. Problems:
- data-intensive applications have suboptimal performance
- scaling up creates problems: storage servers thrash and crash
Users want to reduce the failure rate and improve throughput.

Profiling Protocols and Servers
Profiling is a first step: it enables us to understand how time is spent and gives valuable insights. It helps:
- computer architects add processor features
- OS designers add OS features
- middleware developers optimize the middleware
- application designers design adaptive applications

Profiling
We (middleware designers) are aiming for automated tuning:
- tune protocol parameters and the concurrency level
- tuning depends on the dynamic state of the network and of the storage server
We are developing low-overhead online analysis; detailed offline + online analysis would enable automated tuning, as sketched below.
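The following is a minimal sketch of the kind of rule such a tuner might apply, assuming the online analysis can estimate link bandwidth, round-trip time and loss rate; the function names, thresholds and stream-count heuristic are illustrative assumptions, not something the talk specifies.

    # Illustrative tuning rules, assuming bandwidth/RTT/loss estimates are
    # produced by the online analysis described above.

    def suggest_tcp_buffer(bandwidth_bps: float, rtt_s: float) -> int:
        # Size the socket buffer to the bandwidth-delay product so a single
        # stream can keep the pipe full.
        return int(bandwidth_bps / 8 * rtt_s)

    def suggest_parallel_streams(loss_rate: float, max_streams: int = 8) -> int:
        # Extra streams mainly help when one stream cannot fill the link
        # (e.g. under loss); cap them to bound the load on the server.
        if loss_rate < 1e-4:
            return 1
        return min(max_streams, max(2, int(0.01 / loss_rate)))

    # Example: a 100 Mbps link with 10 ms RTT -> 125 KB buffer, 1 stream.
    print(suggest_tcp_buffer(100e6, 0.010), suggest_parallel_streams(1e-5))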

Requirements
Profiling:
- should not alter system characteristics
- full-system profile
- low overhead
We used OProfile:
- based on the Digital Continuous Profiling Infrastructure
- kernel profiling
- no instrumentation
- low/tunable overhead
A sketch of driving it around a workload follows.
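A hedged sketch of wrapping a workload with the classic opcontrol/opreport interface of that OProfile generation; the vmlinux path and the example command line are assumptions and the commands need root, so treat this as illustration rather than the exact procedure used for the talk.

    import subprocess

    def profile(cmd, vmlinux="/boot/vmlinux"):
        # Point OProfile at the kernel image so kernel samples are symbolized,
        # start system-wide sampling, run the workload, then stop the daemon.
        subprocess.run(["opcontrol", "--setup", f"--vmlinux={vmlinux}"], check=True)
        subprocess.run(["opcontrol", "--start"], check=True)
        try:
            subprocess.run(cmd, check=True)
        finally:
            subprocess.run(["opcontrol", "--shutdown"], check=True)
        # Summarize collected samples per binary/symbol.
        return subprocess.run(["opreport"], check=True,
                              capture_output=True, text=True).stdout

    # Example (illustrative command line): profile a single GridFTP get.
    # print(profile(["globus-url-copy", "gsiftp://server/file", "file:///tmp/file"]))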

Profiling Setup
Two server machines:
- moderate server: 1660 MHz Athlon XP CPU with 512 MB RAM
- powerful server: dual Pentium 4 Xeon 2.4 GHz CPUs with 1 GB RAM
Client machines were more powerful dual Xeons, to isolate server performance.
100 Mbps network connectivity; Linux kernel 2.4.20, GridFTP server 2.4.3, NeST prerelease.

GridFTP Profile
[Chart: percentage of CPU time spent in Idle, Ethernet Driver, Interrupt Handling, Libc, Globus, OProfile, IDE, File I/O, and Rest of Kernel, for reads from and writes to the GridFTP server]
Read rate = 6.45 MB/s, write rate = 7.83 MB/s => writes to the server are faster than reads from it.
Observations from the profile:
- writes to the network are more expensive than reads => interrupt coalescing
- IDE reads are more expensive than writes
- file system writes are costlier than reads => need to allocate disk blocks
- more overhead for writes because of the higher transfer rate

GridFTP Profile Summary
- Writes to the network are more expensive than reads: interrupt coalescing and DMA would help.
- IDE reads are more expensive than writes: tuning the disk elevator algorithm would help.
- Writing to the file system is costlier than reading (need to allocate disk blocks): a larger block size would help.

NeST Profile
[Chart: percentage of CPU time spent in Idle, Ethernet Driver, Interrupt Handling, Libc, NeST, OProfile, IDE, File I/O, and Rest of Kernel, for reads from and writes to the NeST server]
Read rate = 7.69 MB/s, write rate = 5.5 MB/s.
Observations from the profile:
- similar trend to GridFTP
- more overhead for reads because of the higher transfer rate
- metadata updates (space allocation) make NeST writes more expensive

GridFTP versus NeST
- GridFTP: read rate = 6.45 MB/s, write rate = 7.83 MB/s
- NeST: read rate = 7.69 MB/s, write rate = 5.5 MB/s
GridFTP is 16% slower on reads:
- GridFTP I/O block size is 1 MB (NeST uses 64 KB)
- disk I/O and network I/O are not overlapped (see the sketch below)
NeST is 30% slower on writes:
- metadata updates (space reservation/allocation)
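For reference, a minimal sketch (not GridFTP's or NeST's actual code) of overlapping disk and network I/O on the send path: a reader thread fills a bounded queue while the main thread drains it onto the socket, so the disk fetches the next block while the previous one is on the wire. Block size and queue depth are illustrative.

    import socket
    import threading
    import queue

    BLOCK_SIZE = 1 << 20    # 1 MB blocks, matching the GridFTP block size above
    QUEUE_DEPTH = 4         # small bound limits memory yet still overlaps I/O

    def send_file_overlapped(path: str, sock: socket.socket) -> None:
        blocks = queue.Queue(maxsize=QUEUE_DEPTH)

        def disk_reader():
            with open(path, "rb") as f:
                while True:
                    block = f.read(BLOCK_SIZE)
                    blocks.put(block)       # empty bytes object marks end of file
                    if not block:
                        return

        reader = threading.Thread(target=disk_reader, daemon=True)
        reader.start()
        while True:
            block = blocks.get()            # network send overlaps the next disk read
            if not block:
                break
            sock.sendall(block)
        reader.join()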

Effect of Protocol Parameters
Different tunable parameters:
- I/O block size
- TCP buffer size
- number of parallel streams
- number of concurrent transfers
A sketch of where these knobs act follows the list.
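A hedged illustration of where the four knobs act in a plain TCP sender; GridFTP and NeST expose analogous settings through their own interfaces, and the values below are placeholders rather than measured optima.

    import socket

    IO_BLOCK_SIZE = 256 * 1024     # unit of each disk read and socket send
    TCP_BUFFER_SIZE = 1 << 20      # kernel socket send buffer (SO_SNDBUF)
    PARALLEL_STREAMS = 4           # sockets cooperating on one transfer
    CONCURRENT_TRANSFERS = 2       # independent transfers in flight at once

    def open_stream(host: str, port: int) -> socket.socket:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # TCP buffer size: set before connect so the window can grow to it.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, TCP_BUFFER_SIZE)
        s.connect((host, port))
        return s

    def send_range(sock: socket.socket, path: str, offset: int, length: int) -> None:
        # I/O block size: the granularity of every read()/send() pair.
        with open(path, "rb") as f:
            f.seek(offset)
            remaining = length
            while remaining > 0:
                chunk = f.read(min(IO_BLOCK_SIZE, remaining))
                if not chunk:
                    break
                sock.sendall(chunk)
                remaining -= len(chunk)

    # Parallel streams: run PARALLEL_STREAMS send_range() calls, each on its own
    # socket and byte range; concurrency: CONCURRENT_TRANSFERS such transfers at once.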

Read Transfer Rate [chart]

Server CPU Load on Read [chart]

Write Transfer Rate [chart]

Server CPU Load on Write [chart]

Transfer Rate and CPU Load [chart]

Server CPU Load and L2 DTLB Misses [chart]

L2 DTLB Misses
Parallelism triggers the kernel to use a larger page size => fewer DTLB misses. The worked example below shows why.
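A back-of-the-envelope example of the page-size effect, assuming an illustrative 256 MB of in-flight transfer buffers (the figure is an assumption, not a number from the talk) and the 4 KB versus 4 MB page sizes available on x86 Linux of that era.

    def pages_needed(buffer_bytes: int, page_bytes: int) -> int:
        # Ceiling division: distinct page translations needed to cover the
        # buffer without taking a DTLB miss.
        return -(-buffer_bytes // page_bytes)

    BUFFER = 256 * 1024 * 1024
    print(pages_needed(BUFFER, 4 * 1024))          # 65536 translations with 4 KB pages
    print(pages_needed(BUFFER, 4 * 1024 * 1024))   #     64 translations with 4 MB pages

With DTLBs of that era holding on the order of a few dozen to a few hundred entries, the 4 MB mapping fits entirely while the 4 KB mapping cannot, which matches the lower miss rate observed under parallelism.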

Profiles on the Powerful Server
The next set of graphs was obtained using the powerful (dual Xeon) server.

Parallel Streams versus Concurrency [chart]

Effect of File Size (Local Area) [chart]

Transfer Rate versus Parallelism in a Short-Latency (10 ms) Wide Area [chart]

Server CPU Utilization [chart]

Conclusion
- A full-system profile gives valuable insights.
- A larger I/O block size may lower the transfer rate when network and disk I/O are not overlapped.
- Parallelism may reduce CPU load because it may cause the kernel to use a larger page size; processor and operating-system support for variable-sized pages would be useful.
- Concurrency improves throughput at the cost of increased server load.

Questions?
Contact: kola@cs.wisc.edu
www.cs.wisc.edu/condor/publications.html