Internet data transfer record between CERN and California
Sylvain Ravot (Caltech), Paolo Moroni (CERN)

Summary
- Internet2 Land Speed Record Contest
- New LSR
- DataTAG project and network configuration
- Establishing an LSR: hardware and tuning
- Conclusions
- Acknowledgements

Internet2 LSR Contest (I)
From http://lsr.internet2.edu: "A minimum of 100 megabytes must be transferred a minimum terrestrial distance of 100 kilometers with a minimum of two router hops in each direction between the source node and the destination node across one or more operational and production-oriented high-performance research and education networks. The contest unit of measurement is [ ] bitmeters/second."

Internet2 LSR Contest (II)
"Instances of all hardware units and software modules used to transfer contest data on the source node, the destination node, the links, and the routers must be offered for commercial sale or as open source software to all U.S. members of the Internet community by their respective vendors or developers prior to or immediately after winning the contest."
- Award classes: single or multiple TCP streams, on top of IPv4 or IPv6
- Generally available networks and equipment, as opposed to lab prototypes

Former LSR: TCP/IPv4, single stream
- By NIKHEF, Caltech and SLAC
- Established on November 19th, 2002
- 10,978 km of network: Geneva-Amsterdam-Chicago-Sunnyvale
- 0.93 Gb/s
- 10,136.15 terabit-meters/second

Current LSR: TCP/IPv4, single stream
- By Caltech, CERN, LANL and SLAC, within the DataTAG project framework
- Established on February 27th-28th, 2003 by Sylvain Ravot (Caltech) using IPERF
- 23,888 terabit-meters/second
- 10,037 km of network: Geneva-Chicago-Sunnyvale (a shorter distance than the former LSR)
- 2.38 Gb/s (sustained: a terabyte of data moved in an hour)
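The contest arithmetic is simply throughput times terrestrial distance. A minimal Python sketch, using only the figures quoted on the two slides above (the slightly lower official 2002 figure suggests the measured average rate there was a little below 0.93 Gb/s):

    # Internet2 LSR metric: throughput (bit/s) * distance (m), in bit-meters/second.
    records = {
        "Nov 2002, Geneva-Amsterdam-Chicago-Sunnyvale": (0.93e9, 10978e3),
        "Feb 2003, Geneva-Chicago-Sunnyvale": (2.38e9, 10037e3),
    }
    for name, (bps, metres) in records.items():
        print(f"{name}: {bps * metres / 1e12:,.0f} terabit-meters/second")

The second line prints 23,888 terabit-meters/second, matching the figure claimed for the new record.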

DataTAG project
- Full project title: Research and technological development for a transatlantic GRID
- IST project (EU funded), supported by the NSF and the DoE (Caltech): http://www.datatag.org
- Partners: PPARC (UK), INRIA (FR), University of Amsterdam (NL), INFN (IT) and CERN (CH)
- Researchers also from Caltech, Los Alamos, SLAC and Canada
- Test-bed kernel: a transatlantic STM-16 between Geneva (CERN) and Chicago (StarLight), with interconnected workstations on each side
- Test-bed extensions provided by GEANT, SURFnet, VTHD and other partners, in Europe and North America

DataTAG as a test-bed for the LSR
- Research on TCP is part of the DataTAG programme
- The Geneva-Chicago link was the main environment for the LSR
- Network extensions made available: the Chicago-Sunnyvale STM-64, a router at Sunnyvale, and 10 GbE interfaces on the DataTAG PCs

LSR network configuration (diagram)
End-to-end path: PC with 10 GbE interface (Intel loan) at CERN, Geneva (Switzerland) - Cisco 7609 (DataTAG) - DataTAG network STM-16 (T-Systems) - StarLight, Chicago (Illinois, USA), with Cisco 7606 (DataTAG) and Juniper T640 (TeraGrid) - STM-64 (Level3 loan) - Level3 PoP, Sunnyvale (California, USA), with Cisco 12406 (Cisco loan) - PC with 10 GbE interface (Intel loan)

Establishing an LSR: hardware (I)
- No LSR without good hardware
- A lot of bandwidth: a minimum of 2.5 Gb/s on the whole path (thanks to Level3 for the STM-64 on loan between Chicago and Sunnyvale)
- Powerful routers (Cisco 7600 and GSR, Juniper T640)
- Powerful Linux PCs on both sides
- Intel 10 GbE interfaces

Establishing an LSR: hardware (II)
Linux PC at CERN:
- Dual Intel Xeon processors, 2.40 GHz with 512 KB L2 cache
- SuperMicro P4DP8-G2 motherboard, Intel E7500 chipset
- 2 GB RAM, PC2100 ECC registered DDR
- Hard drive: 1 x 140 GB Maxtor ATA-133
- On-board Intel 82546EB dual-port Gigabit Ethernet controller
- 4U rack-mounted server
Linux PC at Sunnyvale:
- Dual Intel Xeon processors, 2.40 GHz with 512 KB L2 cache
- SuperMicro P4DPE-G2 motherboard
- 2 GB RAM, PC2100 ECC registered DDR
- 2 x 3ware 7500-8 8-port RAID controllers
- 16 Western Digital IDE disk drives for RAID and 1 for the system
- 2 x Intel 82550 Fast Ethernet
- 2 x SysKonnect SK-9843 SK-NET GE SX Gigabit Ethernet cards
- 4U rack-mounted server: 480 W to run, 600 W to spin up

Establishing an LSR: hardware (III)
- Intel 10 GbE interfaces: Intel PRO/10GbE LR
- Not yet commercially available when the LSR was set, but announced as commercially available shortly afterwards

Establishing an LSR: standard tuning
- MTU set to 9000 bytes
- TCP window size increased from the Linux default of 64 KB: essential over long distances (see the sketch below)
- But a standard Linux kernel (2.4.20)
- Standard tuning is not enough for an LSR: why?
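Why the window size matters so much: TCP can keep at most one window of unacknowledged data in flight per round trip, so throughput is bounded by window/RTT. A minimal sketch; the 180 ms round-trip time for Geneva-Sunnyvale is an assumed value, not a figure from the slides:

    RTT = 0.180  # seconds, assumed round-trip time Geneva-Sunnyvale
    for label, window_bytes in [("64 KB Linux default", 64 * 1024),
                                ("128 MB tuned maximum", 128388607)]:
        ceiling = window_bytes * 8 / RTT / 1e6  # Mb/s
        print(f"{label}: throughput ceiling {ceiling:,.1f} Mb/s")

With the default window the path can never exceed about 3 Mb/s, whatever the link capacity; the tuned maximum lifts the ceiling well above the 2.5 Gb/s of the STM-16.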

TCP WAN problems
- Responsiveness to a packet loss (the time needed to recover the lost throughput) grows with the square of the RTT: R = C * RTT^2 / (2 * MSS), where C is the link capacity and MSS is the maximum segment size
- This makes it very difficult to exploit the full capacity over a long-distance WAN: not a real problem for standard traffic on a shared link, but a serious penalty for an LSR (worked example below)
- Slow start mode is too slow using the default parameters: they are good for standard traffic, but not for an LSR
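A worked instance of the responsiveness formula, sketched in Python. C and the MSS follow the slides (STM-16 capacity, 9000-byte MTU, hence roughly 8960 bytes of segment payload); the 180 ms RTT is again an assumed value:

    C = 2.5e9       # link capacity in bit/s (STM-16)
    RTT = 0.180     # round-trip time in seconds (assumed)
    MSS = 8960 * 8  # maximum segment size in bits (9000-byte MTU minus headers)
    R = C * RTT ** 2 / (2 * MSS)  # time to regrow half a window, at +1 MSS per RTT
    print(f"Recovery time after a single loss: {R:.0f} s (about {R / 60:.0f} minutes)")

This gives roughly 565 seconds, consistent with the next slide's observation that regaining 120 Mb/s takes more than 6 minutes.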

Example: recovering from a packet loss
[Plot: TCP throughput CERN-Chicago over the 622 Mb/s link; throughput (Mb/s) vs. time (s), showing a recovery ramp of more than 6 minutes]
- The time needed to increase the throughput by 120 Mb/s is longer than 6 minutes for a connection between Chicago and CERN
- A packet loss is therefore a disaster for the overall throughput

Example: slow start vs. congestion avoidance
[Plot: congestion window over the life of a connection, showing the exponential slow-start phase up to SSTHRESH followed by the linear congestion-avoidance phase; one curve is the cwnd average of the last 10 samples, the other the cwnd average over the life of the connection to that point]

Establishing an LSR: what goes wrong
- Behaviour of the TCP stack: only when C is very small does the recovery time R stay low enough for any terrestrial RTT; modern, fast WAN links are therefore bad for TCP performance
- TCP keeps increasing its window size until something breaks (packet loss, congestion, ...); then it restarts from half of the previous value until it breaks again. This successive-approximation process takes very long over long distances and degrades performance

Establishing an LSR: further tuning
- Knowing the available bandwidth a priori, prevent TCP from trying ever larger windows by restricting the amount of buffer space it may use: without the buffers, it won't try to use larger windows, and packet losses can be avoided
- The product C * RTT (the bandwidth-delay product) yields the optimal TCP window size for a link of capacity C
- So, allocate just enough buffer space to let TCP squeeze the maximum performance from the existing bandwidth, and nothing more (worked example below)
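A sketch of that sizing rule for the STM-16 used here, with the same assumed 180 ms RTT:

    C_bps = 2.5e9            # STM-16 capacity in bit/s
    RTT_s = 0.180            # assumed round-trip time in seconds
    bdp = C_bps * RTT_s / 8  # bandwidth-delay product in bytes
    print(f"Optimal TCP window: about {bdp / 2**20:.0f} MB")

This gives about 54 MB; the 128388607-byte maxima on the next slide are roughly twice that figure, leaving headroom for the bookkeeping overhead that Linux charges against socket buffers.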

Further tuning: Linux implementation
Tuning the TCP buffers (numbers for STM-16; each tcp_rmem/tcp_wmem triple is min, default, max in bytes):
  echo 4096 87380 128388607 > /proc/sys/net/ipv4/tcp_rmem
  echo 4096 65530 128388607 > /proc/sys/net/ipv4/tcp_wmem
  echo 128388607 > /proc/sys/net/core/wmem_max
  echo 128388607 > /proc/sys/net/core/rmem_max
Tuning the network device buffers:
  /sbin/ifconfig eth1 txqueuelen 10000
  /sbin/ifconfig eth1 mtu 9000
Both on the sender and the receiver.
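The same tuning can be collected into one script. A minimal Python sketch, to be run as root on both hosts; "eth1" is the interface name from the slide, so substitute your own:

    import subprocess

    # TCP buffer limits (STM-16 numbers from the slide): min, default, max bytes.
    settings = {
        "/proc/sys/net/ipv4/tcp_rmem": "4096 87380 128388607",
        "/proc/sys/net/ipv4/tcp_wmem": "4096 65530 128388607",
        "/proc/sys/net/core/wmem_max": "128388607",
        "/proc/sys/net/core/rmem_max": "128388607",
    }
    for path, value in settings.items():
        with open(path, "w") as f:
            f.write(value + "\n")

    # Device buffers: long transmit queue and 9000-byte jumbo frames, as above.
    for args in (["txqueuelen", "10000"], ["mtu", "9000"]):
        subprocess.run(["/sbin/ifconfig", "eth1", *args], check=True)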

Even further tuning: Linux implementation
- The transition from TCP slow start mode to congestion avoidance mode is another performance penalty on the sender side for the LSR
- On Linux (sender side only):
  sysctl -w net.ipv4.route.flush=1
- This prevents TCP from using any previously cached window value, i.e. it speeds up slow start and reaches congestion avoidance mode at exponential speed (otherwise the growth of the congestion window would start from half of some previously cached value)

IPERF parameters
On the sender:
  iperf -c 192.91.239.213 -i 5 -P 3 -w 40M -t 180
(-c: client mode towards the given receiver; -i 5: report every 5 seconds; -P 3: three parallel streams; -w 40M: 40 MB socket window; -t 180: run for 180 seconds)
On the receiver:
  iperf-1.6.5 -s -w 128M
(-s: server mode; -w 128M: 128 MB socket window)
IPERF is available at http://dast.nlanr.net

Conclusions: how useful in practice?
- The LSR result cannot be immediately translated into practical general-purpose recommendations, because it relies on:
  - some a priori knowledge (the physical link speed)
  - dedicated bandwidth
  - ad hoc TCP tuning: good for an LSR, not for general-purpose traffic
- Nevertheless, work on a more modern TCP stack is ongoing: the new LSR demonstrates that fast WAN TCP is possible in practice, by tweaking TCP a bit

Conclusions: sustained throughput
- The LSR definition makes only very limited provision for requiring sustained throughput (100 megabytes is not much)
- However, the achieved LSR shows that high sustained throughput is in principle possible with TCP over long distances
- Former results could sustain the throughput only for 40-60 seconds, before some TCP feedback mechanism kicked in

Other remarks
- The bottleneck for things like the LSR is now in the end hosts: no non-trivial tuning was needed on the network where the LSR was established
- Incidentally, although single-stream, the new LSR was also good enough to establish the new LSR for multiple IPv4 streams
- No TCP packet was lost during the LSR trial window
- Details of the new record are not yet published on http://lsr.internet2.edu

Acknowledgements: people
- Sylvain Ravot (working for Caltech at CERN)
- Wu-Chun Feng (LANL)
- Les Cottrell (SLAC)

Acknowledgements: industrial partners

Acknowledgements: organisations