Data Staging: Moving large amounts of data around, and moving it close to compute resources


Digital Preservation Advanced Practitioner Course, Glasgow, July 19th 2013. c.cacciari@cineca.it

Outline: definition; starting point; moving data around; size and performance; the user; data selection; transfer options; numbers; registered data.

Data movement: staging or replication? Safe replication, based on PIDs, serves data preservation and access optimization; data staging is dynamic replication to an HPC workspace for processing.

Staging

Moving large amounts of data around. Data sets can be large both in terms of the number of objects and the size of a single object.

All data are gray in data staging: the data types are irrelevant, except when they can affect the size or the number of the files.

When should I care about data types? When you can compress them: data transfer efficiency is crucial in data staging.
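One practical consequence: when files compress well, or when a data set consists of many small files, packing it into a single compressed archive before staging reduces both the volume and the per-file overhead. A minimal sketch using Python's standard library (the paths are illustrative placeholders):

```python
# Minimal sketch: bundle a directory of many small files into one
# compressed archive before staging, so the transfer moves a single
# large object instead of thousands of tiny ones. Paths are placeholders.
import tarfile

with tarfile.open("dataset.tar.gz", "w:gz") as tar:
    tar.add("dataset/", arcname="dataset")
```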

Time to copy 1 TB:
- 10 Mb/s network: 300 hrs (12.5 days)
- 100 Mb/s network: 30 hrs
- 1 Gb/s network: 3 hrs (are your disks fast enough?)
- 10 Gb/s network: 20 minutes (you need really fast disks and a fast file system)
Compare these speeds to a USB 2.0 portable disk: 60 MB/s (480 Mb/s) peak, 20 MB/s (160 Mb/s) reported in practice, i.e. 15-40 hours to load 1 terabyte. (Parallel I/O and management of large scientific data)
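These figures can be reproduced with a back-of-the-envelope calculation. The sketch below computes the ideal line-rate times, which come out somewhat lower than the slide's numbers because the slide allows for protocol and disk overhead:

```python
# Ideal transfer time for 1 TB at various link speeds (line rate only,
# no protocol or disk overhead, which is why real transfers take longer).
TB_BITS = 8 * 10**12  # 1 TB = 10^12 bytes = 8 * 10^12 bits

for label, rate_bps in [("10 Mb/s", 10e6), ("100 Mb/s", 100e6),
                        ("1 Gb/s", 1e9), ("10 Gb/s", 10e9)]:
    hours = TB_BITS / rate_bps / 3600
    print(f"{label:>8}: {hours:7.1f} hours")
```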

Data throughput and transfer times: see the table at http://fasterdata.es.net. (Parallel I/O and management of large scientific data)

Data staging is just in time: you do not plan it, although you can sometimes pre-stage. There are techniques to improve the efficiency.

Faster than light. TCP tuning refers to the proper configuration of the buffers that back TCP windowing. Pipelining (of commands) speeds up transfers of lots of tiny files by stuffing multiple commands into each session back-to-back, without waiting for the first command's response.
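The buffer size that TCP tuning aims for is the bandwidth-delay product of the path. A minimal sketch of the arithmetic, assuming the 10 Gb/s, 53 ms RTT path used in the sample results later in this talk:

```python
# Bandwidth-delay product: how much data must be "in flight" (and hence
# fit in the TCP window / socket buffers) to keep a long fat pipe full.
# Assumed path: 10 Gb/s capacity, 53 ms round-trip time.
capacity_bps = 10e9
rtt_s = 0.053

bdp_bytes = capacity_bps * rtt_s / 8
print(f"BDP = {bdp_bytes / 2**20:.0f} MiB")               # about 63 MiB

# For comparison, a fixed 64 KB window caps a single stream at roughly:
window_bytes = 64 * 1024
max_rate_mbps = window_bytes * 8 / rtt_s / 1e6
print(f"64 KB window -> about {max_rate_mbps:.0f} Mb/s")  # about 10 Mb/s
```

That 10 Mb/s ceiling is the same order of magnitude quoted for SCP/SFTP with small OpenSSH windows a few slides later.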

Parallelizing: on wide-area links, using multiple TCP streams in parallel (even between the same source and destination) can improve aggregate bandwidth over using a single TCP stream.
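A minimal sketch of the parallel-streams idea, assuming an HTTP server that honours Range requests and a placeholder URL; GridFTP applies the same principle internally when you request multiple parallel streams:

```python
# Download one large object over several TCP streams by splitting it into
# byte ranges and fetching the ranges concurrently. URL and stream count
# are illustrative assumptions.
import concurrent.futures
import requests

URL = "https://data.example.org/big-dataset.tar"   # hypothetical source
STREAMS = 4

size = int(requests.head(URL, allow_redirects=True).headers["Content-Length"])
chunk = size // STREAMS

def fetch(i):
    start = i * chunk
    end = size - 1 if i == STREAMS - 1 else start + chunk - 1
    resp = requests.get(URL, headers={"Range": f"bytes={start}-{end}"})
    return start, resp.content

with concurrent.futures.ThreadPoolExecutor(max_workers=STREAMS) as pool:
    parts = list(pool.map(fetch, range(STREAMS)))

with open("big-dataset.tar", "wb") as out:
    for start, data in sorted(parts):
        out.seek(start)
        out.write(data)
```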

Striping: data may be striped or interleaved across multiple servers.

Parallelism and TCP tuning are the keys. It is much easier to achieve a given performance level with four parallel connections than with one, and good TCP tuning can drastically improve performance. The interaction with latency is critical: wide-area data transfers have much higher latency than LAN transfers, yet many tools and protocols assume a LAN (examples: SCP/SFTP, the HPSS mover protocol).

Use the right tool:
- SCP/SFTP: 10 Mb/s (standard Unix file copy tools; fixed 1 MB TCP window in OpenSSH, only 64 KB in OpenSSH versions < 4.7)
- FTP: 400-500 Mb/s (assumes TCP buffer autotuning)
- Parallel-stream FTP: 800-900 Mb/s

But efficiency is not all: ease of use, high reliability, third-party transfer, and the possibility to control or limit the transfer throughput to avoid flooding the network also matter. Which has priority? Different scenarios, different needs.

...and moving it close to compute resources

Who will move the data? Any user, whether part of a community or a citizen scientist. In most cases a domain expert, such as a biologist, a linguist or a seismologist, not an expert in data transfer software.

The wizard gap: as Ethernet speeds have increased, a widening gap has opened between the bandwidth a novice can actually use and the bandwidth achieved by users who have had their network tuned by an expert (a "wizard"). [Diagram c/o Matt Mathis, http://www.psc.edu/%7emathis/papers/]

Data selection

How are the data sets selected by the domain experts? Through their local tools, where local means on the community side, or through remote services, generic or specific to each community.

Move the data

EUDAT transfer options:
- Globus Online: third-party transfer via a web portal or the Globus client; server-to-server
- iRODS: specific client; server-to-server
- HTTP options: work in progress

GridFTP third-party transfer: the control channels go from the client to the GridFTP servers at the two sites, while the data channels run directly between the servers.

Globus Online service: http://www.globusonline.org
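The slides show Globus Online through its web portal; as an illustration, the same third-party, server-to-server transfer can also be submitted programmatically with the Globus Python SDK. A minimal sketch, in which the endpoint UUIDs, paths, and the OAuth access token are placeholders obtained elsewhere:

```python
# Minimal sketch: submit a third-party transfer via the Globus Transfer
# service. Endpoint IDs, paths, and the access token are placeholders.
import globus_sdk

TRANSFER_TOKEN = "..."                      # from a Globus OAuth flow
SRC_ENDPOINT = "source-endpoint-uuid"       # e.g. a community data store
DST_ENDPOINT = "dest-endpoint-uuid"         # e.g. an HPC scratch endpoint

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN))

tdata = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT,
                                label="stage to HPC workspace",
                                sync_level="checksum")
tdata.add_item("/community/dataset/", "/scratch/user/dataset/",
               recursive=True)

task = tc.submit_transfer(tdata)
print("submitted transfer task:", task["task_id"])
```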

(Diagram) The staging questions: how to transfer some number of TB or PB from an EUDAT site (A, B or C) to HPC site X; which tool to use (GridFTP, scp, iRODS, Globus Online, BitTorrent?); how to get access to all sites. Links range from a 1 Gb/s WAN to a 10 Gb/s WAN and a 10 Gb/s OPN; data at the EUDAT sites is identified by PIDs.

Stage-out. Data transport protocols: GridFTP, HTTP, SCP, FTP, RSYNC, OPeNDAP? Which tool to manage the transfers: iRODS, Globus Online, FTS, RSYNC, others? (Diagram: third-party transfers over the WAN or a dedicated network move PID-identified data from the EUDAT long-term archive at service provider site A to the HPC space at HPC site X. EUDAT User Forum, Barcelona, 7-8 March 2012)

Stage-out/stage-in. (Diagram: a workflow framework with stages 1, 2, ..., #, moving PID-identified data between the EUDAT long-term archive at service provider site A and a data mining cluster. EUDAT User Forum, Barcelona, 7-8 March 2012)

Multi-step staging. (Diagram: EUDAT centres such as EPCC, CINECA and BSC alongside community centres for VPH, ENES, CLARIN, LifeWatch and EPOS.)

Local staging targets: 1. single step

Sample data transfer results (RTT = 53 ms, network capacity = 10 Gb/s):
- scp: 140 Mb/s
- HPN-patched scp: 1.2 Gb/s
- FTP: 1.4 Gb/s
- GridFTP, 4 streams: 5.4 Gb/s
- GridFTP, 8 streams: 6.6 Gb/s

iRODS is a data management system which integrates a rule engine. The rules can be triggered automatically by specific events (for example, a data set being moved to a particular directory), invoked remotely via the iRODS command-line client, or integrated into applications through the iRODS Java (Jargon) and Python (PyRods) APIs.
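As an illustration of the application-integration route, the sketch below uses the python-irodsclient package (a present-day alternative to the PyRods bindings named on the slide); the host, zone, credentials, and paths are placeholders, and any event-triggered rule would run on the iRODS server once the object lands in the watched collection:

```python
# Minimal sketch: push a result file into an iRODS staging collection.
# Connection details and paths are placeholders; a server-side rule can
# then fire automatically when the new data object appears.
from irods.session import iRODSSession

with iRODSSession(host="irods.example.org", port=1247,
                  user="alice", password="secret",
                  zone="exampleZone") as session:
    session.collections.create("/exampleZone/home/alice/staging")
    session.data_objects.put("results.h5",
                             "/exampleZone/home/alice/staging/results.h5")
```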

Metadata! "Managing Large Datasets with iRODS: a Performance Analysis", Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 647-654.

https://openwiki.uninett.no/norstore:irods:workshop2012

Data staging and registered data. (Diagram: PID-identified data moves between data centre stores and HPC while remaining within the domain of registered data.)

Conclusions. The way data are moved has to be flexible enough to accommodate different transfer protocols and different access mechanisms. Flexibility also means that the transfer tools can be used as they are, with default parameters, for average performance, but also fine-tuned by experts for faster transfers. No solution fits all, so different services are provided.

eudat-info@postit.csc.fi

Thank you!