File Transfer: Basics and Best Practices. Joon Kim. Ph.D. PICSciE. Research Computing 09/07/2018

Similar documents
Data Staging: Moving large amounts of data around, and moving it close to compute resources

Data Staging: Moving large amounts of data around, and moving it close to compute resources

Climate Data Management using Globus

Graham vs legacy systems

Globus Online and HPSS. KEK, Tsukuba Japan October 16 20, 2017 Guangwei Che

ECPE / COMP 177 Fall Some slides from Kurose and Ross, Computer Networking, 5 th Edition

COMPUTE CANADA GLOBUS PORTAL

USER MANUAL. VIA IT Deployment Guide for Firmware 2.3 MODEL: P/N: Rev 7.

Enabling a SuperFacility with Software Defined Networking

Feedback on BeeGFS. A Parallel File System for High Performance Computing

Peer-to-Peer Systems. Internet Computing Workshop Tom Chothia

Globus Research Data Management: Campus Deployment and Configuration. Steve Tuecke Vas Vasiliadis

SSH Bulk Transfer Performance. Allan Jude --

Parallel Storage Systems for Large-Scale Machines

Triton file systems - an introduction. slide 1 of 28

Setting up your own private VPN using a cheap VPS server

Beyond File Transfer. Steve Tuecke NCAR September 5, 2018

Isilon Performance. Name

Computing at MIT: Basics

Data Movement and Storage. 04/07/09 1

Ftp Command Line Manual Windows Example Port 22

Report Exec Dispatch System Specifications

Engagement With Scientific Facilities

Requirements and Dependencies

Introduction to CARC. To provide high performance computing to academic researchers.

How to Use a Supercomputer - A Boot Camp

Data Transfer on a Container. Nadya Williams University of California San Diego

XSEDE New User Training. Ritu Arora November 14, 2014

Today s Agenda. Today s Agenda 9/8/17. Networking and Messaging

Winscp Error Code 8 Not Enough Storage

Introduction. Kevin Miles. Paul Henderson. Rick Stillings. Essex Scales. Director of Research Support. Systems Engineer.

Efficient HTTP based I/O on very large datasets for high performance computing with the Libdavix library

Zadara Enterprise Storage in

Introduction to BioHPC Lab

CS 326: Operating Systems. Networking. Lecture 17

Introduction to BioHPC

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras

PODShell: Simplifying HPC in the Cloud Workflow

The Internet. Session 3 INST 301 Introduction to Information Science

Ftp Get Command Line Linux Proxy Settings Via

Lecture 2: Internet Structure

Network Architecture

Perform Backup and Restore

SFTP CONNECTIVITY STANDARDS Connectivity Standards Representing Bloomberg s Requirements for SFTP Connectivity.

CS321: Computer Networks FTP, TELNET, SSH

Build your own NAS with OpenMediaVault

Introduction to High-Performance Computing (HPC)

AN OVERVIEW OF COMPUTING RESOURCES WITHIN MATHS AND UON

Zhengyang Liu! Oct 25, Supported by NSF Grant OCI

Operating Systems. 16. Networking. Paul Krzyzanowski. Rutgers University. Spring /6/ Paul Krzyzanowski

IT INFRASTRUCTURE PROJECT PHASE II INSTRUCTIONS

Graphical Access to IU's Supercomputers with Karst Desktop Beta

World Skills Competition. Trade 39: IT PC and Network Support. Day 2 Competition

Enabling High Performance Bulk Data Transfers With SSH

Comet Virtualization Code & Design Sprint

Silicon House. Phone: / / / Enquiry: Visit:

Andrew Pollack Northern Collaborative Technologies

Data Management 1. Grid data management. Different sources of data. Sensors Analytic equipment Measurement tools and devices

Zhengyang Liu University of Virginia. Oct 29, 2012

Introduction to BioHPC New User Training

New User Tutorial. OSU High Performance Computing Center

Introduction. Architecture Overview

Introduction to BioHPC

Data transfer over the wide area network with a large round trip time

Ultra high-speed transmission technology for wide area data movement

Introduction to BioHPC New User Training

Winscp Cannot Create Remote File Error Code 4

3.2 COMMUNICATION AND INTERNET TECHNOLOGIES

HPC Workshop. Nov. 9, 2018 James Coyle, PhD Dir. Of High Perf. Computing

Ftp Command Line Commands Linux Example Windows Put

Automatically Encapsulating HPC Best Practices Into Data Transfers

Disconnect Network Drive Windows 7 Network Connection Does Not Exist

Lab Exercise Protocol Layers

This section discusses the protocols available for volumes on Nasuni Filers.

Steven Carter. Network Lead, NCCS Oak Ridge National Laboratory OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY 1

Automated Installation Guide for CentOS (PHP 7.x)

Protected Environment at CHPC. Sean Igo Center for High Performance Computing September 11, 2014

Bitnami Apache Solr for Huawei Enterprise Cloud

CSE 333 Lecture C++ final details, networks

Services: Monitoring and Logging. 9/16/2018 IST346: Info Tech Management & Administration 1

FACILITY REQUIREMENTS GUIDE VMware Virtual Data Center (VDC/vClass) Cloud Lab Environment Revision 18 Feb 24th, 2017

CSDA UNIT I. Introduction to the LAB environment. Practical classes Lab 0. Computer Engineering Degree Computer Engineering.

VMware AirWatch Content Gateway for Linux. VMware Workspace ONE UEM 1811 Unified Access Gateway

Parallel File Systems Compared

Onboarding Guide. ipointsolutions.net (800)

High Availability Overview Paper

CS 111. Operating Systems Peter Reiher

FEPS. SSH Access with Two-Factor Authentication. RSA Key-pairs

ECE 650 Systems Programming & Engineering. Spring 2018

Beyond Petascale. Roger Haskin Manager, Parallel File Systems IBM Almaden Research Center

CSE 461 Module 10. Introduction to the Transport Layer

COURSE OUTLINE: A+ COMPREHENSIVE

Magellan Project. Jeff Broughton NERSC Systems Department Head October 7, 2009

root.smart Power NET Bandwidth Usage CPU Usage

CT ac2-1n-10g LANforge WiFIRE a/b/g/n/ac 4x4 MU-MIMO 3 radio WiFi Traffic

Box Competitive Sheet January 2014

Data Management. Parallel Filesystems. Dr David Henty HPC Training and Support

New User Seminar: Part 2 (best practices)

WHITE PAPER: BEST PRACTICES. Sizing and Scalability Recommendations for Symantec Endpoint Protection. Symantec Enterprise Security Solutions Group

Hardware and Software Requirements for Server Applications. IVS Enterprise Server Version 12.5+

Transcription:

File Transfer: Basics and Best Practices Joon Kim. Ph.D. PICSciE Research Computing Workshop @Chemistry 09/07/2018

Our goal today Learn about data transfer basics Pick the right tool for your job Know what to expect Overview of widely used tools Learn about RC s resources Globus and Data Transfer Nodes Q&A 2

Why do we care? Time Without good practice, you will waste time and effort 1. Start data transfer using SCP at 10pm. Usually takes 10 hours. 2. At 2am, there was a brief 1-minute network outage. Transfer job aborted. 3. Arrive 8am in the morning. See the damage. Start again, which will take 10 hours. 4. Lost a day of work. Effort 3

Why do we care? Without good practice, you will waste time and effort 1. Start data transfer using SCP at 10pm. Usually takes 10 hours. Is that really the best? Time Effort 4

We want you to Focus on your research, not on transferring data X Time X Effort 5

Use case 1 I have data at Argonne National Lab that I want to process & analyze at Princeton HPC clusters Argonne s Chemical Sciences and Engineering (CSE) division Princeton HPC Clusters 6

Use case 2 I have data on my workstation/laptop that I want to process & analyze at Princeton HPC clusters workstation/laptop Princeton HPC Clusters 7

Data Transfer Basics 8

Data transfer: Overview Three key elements Endpoints Network Transfer tool Source SCP SFTP rsync over ssh 1/10/100 Gbps Destination FTP rsync These will determine how you transfer data and how fast it will be 9

Why is my data transfer slow? Where are the bottlenecks? scp ftp scp ftp Source Destination 10

1. Endpoints Operating system Determines your tool availability Resources CPU, RAM: higher the better Disk I/O Disk type (SATA, SSD), configuration (RAID), file system (ext4, GPFS) Decent server with SATA, ext4, RAID, good transfer tool: ~ 4 Gb/s (500 MB/s) You will likely get less than 1 Gb/s (125 MB/s) with your laptop, most desktops, and un-optimized servers 11

Example Data Transfer Node (DTN) You need a dedicated Data Transfer Node (DTN) to get more than 1 Gb/s performance (ref: http://fasterdata.es.net/) 12

2. Network Network bandwidth Use wired connection when available Get a good network card (NIC) 10-100 Mbps Congestion (heavily dependent on time of day) Distance and latency Things along the way Wired: 1, 10, 40 Gbps Wireless: < 100 Mbps Things you don t have much control over Routers, switches, firewalls, NAT, security devices, 13

3. Transfer tools What transfer tool do you use? 14

Secure Copy (SCP) Secure Copy (SCP) Uses SSH for authentication and data transfer (TCP port 22) If you can SSH into a place, SCP mostly just work. Unix-based systems (including Mac OS X): Should have it by default Windows: WinSCP (https://winscp.net/eng/download.php) 15

Secure Copy (SCP) Secure Copy (SCP) Uses SSH for authentication and data transfer (TCP port 22) Unix-based systems (including Mac OS X): Should have it by default Windows: WinSCP (https://winscp.net/eng/download.php) 16

rsync rsync (rsync over SSH) Sync files and directories between two endpoints. (e.g., backups, only transfer new files) Careful with --delete option (this *mirrors* directories) Unix-based systems (including Mac OS X): Should have it by default Windows: via Cygwin or DeltaCopy (never tried myself) 17

File Transfer Protocol (FTP) Secure File Transfer Protocol ( S FTP) Need FTP server running on the receiving end. SFTP is more secure. Use if you want encrypted transfer. Unix-based systems (including Mac OS X): Should have it by default Windows: FileZilla (https://filezilla-project.org) 18

File Transfer Protocol (FTP) Secure File Transfer Protocol ( S FTP) Widely used for file transfers SFTP is more secure. Use it if available. Unix-based systems (including Mac OS X): Should have it by default Windows: FileZilla (https://filezilla-project.org) 19

Browser click, wget, curl (HTTP/HTTPS) Web browser click wget (in terminal) $ wget http://downloads.sourceforge.net/gnuwin32/wget-1.11.4-1-src.zip curl (in terminal) $ curl -o data_src.zip http://downloads.sourceforge.net/gnuwin32/wget-1.11.4-1-src.zip 20

These tools are okay, but not always Great compatibility. Widely available. Small datasets. Quick transfers. (< 10 mins) Large bulk data transfers. Transfers on unreliable connections and hosts. 21

Transfer tools: scp vs. GridFTP Downloading 500 GB data 8 hours 1.5 hours 10 minutes (ref: http://fasterdata.es.net/data-transfer-tools/) 22

Tools that might perform better Globus/GridFTP (later) BBCP (http://www.slac.stanford.edu/~abh/bbcp/) Mac OS X, Linux-based systems. SSH-based access control Both endpoints need the tool installed $ bbcp -V -s 16 /local/path/largefile.tar remotesystem:/remote/path/largefile.tar More info http://www.nersc.gov/users/storage-and-file-systems/transferring-data/bbcp/ Fast Data Transfer (FDT) Java-based tool from Caltech & CERN (http://monalisa.cern.ch/fdt/) Can theoretically run in any Operating System, including Windows Need server-side running in server mode $ java -jar./fdt.jar -ss 1M -P 10 -c remotehost.domain.uci.edu ~/file.633m -d /userdata/hjm 23

Other tools Aria2c (https://aria2.github.io) Faster http/https, ftp, sftp, BitTorrent, and Metalink download tool (x4 faster) Windows, Mac, Linux, Android App http: $./aria2c -x4 -k1m http://foo.com/foo.zip LFTP (http://lftp.tech) Faster download (get) speed (2-5x) for ftp, http, sftp, fish, torrent. Upload (put) speed is same. Compatible with normal FTP, HTTP servers. Mac OS X, Linux-based systems. (apt-get install lftp; yum install lftp; brew install lftp) ftp: $ lftp ftp://speedtest.tele2.net http: $ lftp -e 'pget -n 5 foo.zip' http://foo.com/ HPN-patched SCP/SSH (https://www.psc.edu/hpn-ssh) Need both ends to be patched. 24

How to select a transfer tool Transfer takes less than 10 mins Not that frequent Other than that, your transfer job is a noticeable chunk in your workflow SCP (WinSCP) FTP (FileZilla) rsync wget/curl Globus (GridFTP) BBCP FDT Patched SCP LFTP Aria2c 25

(extra) Transfer settings: Encryption Tool Encrypted Control Encrypted Data FTP HTTP (even password-based access) BBCP BBFTP Globus/GridFTP SCP SFTP rsync over SSH Globus/GridFTP with encryption-on HTTPS Data encryption provides best security, but negatively impacts transfer speed (10-50% slower) 26

Summary and Best Practices If transfer takes ~10 mins and not that frequent, don t sweat much SCP, FTP, SFTP, rsync Used wired instead of wireless for large transfers Know the limitation of your endpoints Encrypt transfer for privacy-sensitive data SCP, SFTP, rsync over SSH, HTTPS, Globus with encryption Know this will slow down transfer speed Ask for help Your department IT staff About using RC resources: cses@princeton.edu 27

What we have at Princeton Physics DTN Tigress DTN Lewis-Sigler DTN PNI DTN CS DTN 10 Gbps (Mar. 2016 -) ESnet Globus (GridFTP) Data Transfer Nodes (DTNs) High-end servers 10 Gbps connections Tuned and optimized. RAID configured Good transfer tools (e.g., Globus) Supported Internet2 10 Gbps (Nov. 2015 -) 28

Tigress DTN Internet 10 Gbps GPFS-based /tigress Scratch disk spaces (/della/scratch/gpfs, /perseus/scra tch/gpfs, /tiger/scratch/gpfs2) Infiniband Princeton Tigress DTN Globus endpoint Infiniband Tools SCP, SFTP, rsync, wget, curl BBCP, Globus/GridFTP Princeton HPC cluster nodes + scratch spaces Infiniband /tigress 29 29

What is Globus? https://www.globus.org Fast, reliable data transfer and management service Uses GridFTP underneath Main advantages Fast transfer speed (multi-stream) Convenient to use: Fire-and-Forget 30

Transfer tools: scp vs. GridFTP Downloading 500 GB data 8 hours 1.5 hours 10 minutes (ref: http://fasterdata.es.net/data-transfer-tools/) 31

Easy to use Use a web browser to request a transfer job 1. Select from and to Fire-and-Forget Get email notification when transfer is complete (or was unsuccessful) 2. Select file or directory 3. Click 4. Wait optional step 32

How it works 1. Go to https://www.globus.org. Pick two endpoints. A 3. Data B 2. Submit transfer request at the Globus website. 3. Dataset is transferred between two endpoints. Your machine s web browser is a remote control But, your machine can be an endpoint too (more later) 4. Get notification when transfer is done. 1. Select endpoints 2. Request transfer Globus.org 4. Get notification 33

Globus has good coverage Lots of Universities/colleges DOE national labs ESnet at CERN, ANL, LBNL, LLNL, LANL, ORNL, and PNNL National computing facilities NERSC, NCSA, SDSC Federal agencies NIH, USDA, NASA/JPL, USGS Over 50,000 registered endpoints at over 500 institutions worldwide 34

Your peers use it 35

Use case 1 Globus.org Internet ANL (Argonne) DTN Globus endpoint Infiniband Princeton Tigress DTN Globus endpoint Infiniband Infiniband CERN file server Princeton HPC cluster nodes /tigress 36

Use case 2 Globus.org Internet Infiniband Princeton Tigress DTN Globus endpoint Infiniband Infiniband My Laptop Princeton HPC cluster nodes /tigress 37

Additional information about Globus Share file or directory with other Globus users https://docs.globus.org/how-to/share-files/ Command Line Interface: https://docs.globus.org/cli/ Python SDK https://globus-sdk-python.readthedocs.io/en/stable/ More information about Globus: https://docs.globus.org/how-to/ 38

Research Computing Mini Course Data Transfer Basics and Globus Transfer Tool Tutorial (Fall, 2018) https://researchcomputing.princeton.edu/education/training 39

Summary and Takeaways Small and quick transfers: basic transfer tools (scp) Large (and small) transfers: Data Transfer Nodes (DTNs) and Globus when possible Large transfers, but Globus is unavailable: Better tools aforementioned Know your environment and limitations Endpoints, network, transfer tool Speak up and reach out to us 40

Q&A joonk@princeton.edu Computational Science and Engineering Support: cses@princeton.edu Research Computing Website: https://researchcomputing.princeton.edu Help sessions: Lewis Science Library 347 Tuesday 10-11am Thursday 2-3pm (visualization emphasis) 41