Experiences with 40G/100G Applications

Brian L. Tierney, ESnet. Internet2 Global Summit, April 2014

Outline
- Review of packet loss
- Overview of SC13 high-bandwidth demos
- ESnet's 100G testbed
- Sample of results
- Challenges

The Punch Line
- Getting stable flows above 10G is hard! Significant tuning is required.
- Devices are designed for 1000s of small flows, not 10s of large flows (packet loss is typical).

Remember: Avoiding Packet Loss is Key

A small amount of packet loss makes a huge difference in TCP performance. (Chart: measured and theoretical throughput versus path length, from local/LAN through metro, regional, continental, and international distances, comparing measured TCP Reno, measured HTCP, theoretical TCP Reno, and a measured no-loss baseline.) With loss, high performance beyond metro distances is essentially impossible.
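As a rough back-of-the-envelope illustration of why the curves fall off so quickly with distance (standard loss-based TCP analysis, not taken from the slide), the Mathis et al. bound limits a TCP Reno flow to roughly MSS / (RTT * sqrt(p)), where p is the packet loss probability. With a 1460-byte MSS and a loss rate of just 1 packet in 100,000, that works out to roughly 4 Gbps at a 1 ms LAN RTT but only about 40 Mbps at a 100 ms continental RTT, which is why even tiny loss rates dominate performance on long paths.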

SC13 Hero Experiments (thanks to Azher Mughal at Caltech and Bill Fink at NASA Goddard for slide material)

ESnet 100G Testbed and SC13 (network diagram by Eli Dart, ESnet, 11/14/2013): the ESnet5 backbone connecting the ESnet 100G testbed at NERSC and Starlight to SC13 SCinet, with demo paths including the NRL northern and southern paths (20G each), the NASA production and testbed paths (50G each), an OpenFlow/SDN demo ANA path (100G), and Caltech demo paths over ANA (100G), to FNAL (60G), and to the NERSC testbed (100G). Also shown: KIT, CERN, AMS, BNL, FNAL, MANLAN, MAX, the Starlight fabric, the Open Science Data Cloud, and GENI/exoGENI racks.

WAN Network Layout

Caltech Terabit Demo: 7 x 100G links, 8 x 40G links

Caltech Demo SC13 Results
- DE-KIT: 75Gbps disk to disk (servers at KIT, two servers at SC13)
- BNL over ESnet: 80G over two pairs of hosts, memory to memory
- NERSC to SC13 over ESnet: lots of packet loss at first; after removing a Mellanox switch from the path, the path was clean. Consistent 90Gbps reading from 2 SSD hosts sending to a single host in the booth.
- FNAL over ESnet: lots of packet loss; TCP maxed out around 5Gbps, but UDP could do 15G per flow. Used 'tc' to pace TCP, and then at least single-stream TCP behaved well up to 8G. Multiple streams were still a problem, which indicates something in the path with buffers that are too small, but the issue was never identified.
- Pasadena over Internet2: 80G reading from the disks and writing on the servers (disk-to-memory transfer). The link was lossy in the other direction.
- CERN over ESnet: about 75Gbps memory to memory; disks about 40Gbps.

Recent Caltech testing to CERN over ANA link

NASA Goddard SC13 Demo

ESnet's 100G Testbed

ESnet 100G Testbed (diagram showing the NERSC test hosts, the StarLight test hosts, and MAN LAN)

ESnet 100G Testbed detail (diagram): a dedicated 100G network connects the NERSC and StarLight testbed hosts through Alcatel-Lucent SR7750 100G routers, with 100G links to the aofa-cr5 and star-cr5 core routers on the ESnet production network, the MANLAN switch at AofA (100G toward Europe via ANA), the NERSC site router, and exoGENI racks at each end.
VLANs: 4012 (all hosts); 4020 (loop from NERSC to Chicago and back, all NERSC hosts)
Host NICs:
- nersc-diskpt-1: 4x10G Myricom, 1x10G Mellanox
- nersc-diskpt-2: 4x10G Myricom
- nersc-diskpt-3: 4x10G Myricom, 1x10G Mellanox
- nersc-diskpt-6: 2x40G Mellanox, 1x40G Chelsio
- nersc-diskpt-7: 2x40G Mellanox, 1x40G Chelsio
- nersc-ssdpt-1: 2x40G Mellanox
- nersc-ssdpt-2: 4x40G Mellanox
- star-mempt-1: 4x10G Myricom, 1x10G Mellanox
- star-mempt-2: 4x10G Myricom
- star-mempt-3: 4x10G Myricom, 1x10G Mellanox

100G Testbed Capabilities
This testbed is designed to support research in high-performance data transfer protocols and tools. Capabilities:
- Bare metal access to very high performance hosts: up to 100Gbps memory to memory, and 70Gbps disk to disk
- Each project gets its own disk image with root access: custom kernels, custom network protocols, etc.
- Only one experiment runs on the testbed at a time, scheduled via a shared Google calendar

Testbed Access
- The proposal process to gain access is described here: http://www.es.net/randd/100g-testbed/proposal-process/
- The testbed is available to anyone: DOE researchers, other government agencies, industry
- Must submit a short proposal to ESnet (2 pages)
- Review criteria: project readiness; could the experiment easily be done elsewhere?

Extreme TCP

40G Lessons Learned
- Single flow (TCP or UDP): 20-25 Gbps, CPU limited
- Multiple flows: easily fill a 40G NIC
- Tuning for 40G is not just 4x the tuning for 10G
- Some of the conventional wisdom for 10G networking is not true at 40Gbps; e.g., parallel streams are more likely to hurt than help
- Sandy Bridge architectures require extra tuning as well
- Details at http://fasterdata.es.net/science-dmz/dtn/tuning/ (a sketch of typical settings follows below)
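As an illustration of the kind of host tuning involved, the fragment below is a minimal sketch in the spirit of the fasterdata.es.net DTN tuning guidance; the specific buffer sizes, the htcp choice, and the interface name eth2 are illustrative assumptions and should be checked against the current guide for a given host and NIC.

# /etc/sysctl.conf fragment for a 40G data transfer node (illustrative values)
# Allow large socket buffers so TCP can fill a large bandwidth-delay product
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
# min / default / max autotuned TCP buffer sizes
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
# Use a high-speed congestion control algorithm (kernel module must be available)
net.ipv4.tcp_congestion_control = htcp
# Apply with "sysctl -p", then raise the transmit queue length on the interface:
#   ip link set dev eth2 txqueuelen 10000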

New Test Tool: iperf3 (https://github.com/esnet/iperf)
iperf3 is a new implementation from scratch, with the goal of a smaller, simpler code base and a library version of the functionality that can be used in other programs. Some new features in iperf3 include:
- reports the number of TCP packets that were retransmitted and the congestion window
- reports the average CPU utilization of the client and server (-V flag)
- support for zero-copy TCP (-Z flag)
- JSON output format (-J flag)
- omit flag: ignore the first N seconds in the results
More information: http://fasterdata.es.net/performance-testing/network-troubleshooting-tools/iperf-and-iperf3/
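A usage sketch combining the flags above (not from the slides; the server hostname is a placeholder):

# on the server
iperf3 -s
# on the client: 30-second TCP test with verbose output including CPU utilization (-V),
# zero-copy send (-Z), JSON output (-J), and the first 3 seconds omitted from the results (-O 3)
iperf3 -c dtn-server.example.org -t 30 -O 3 -Z -V -J > results.json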

Sample iperf3 output

iperf3 -c 10.28.0.43
Connecting to host 10.28.0.43, port 5201
[  4] local 10.28.0.11 port 53389 connected to 10.28.0.43 port 5201
[ ID] Interval         Transfer     Bandwidth       Retr  Cwnd
[  4]  0.00-1.09  sec  2.50 MBytes  19.2 Mbits/sec     0   446 KBytes
[  4]  1.09-2.09  sec  18.8 MBytes   157 Mbits/sec     0  3.46 MBytes
[  4]  2.09-3.09  sec   325 MBytes  2.73 Gbits/sec   154  74.9 MBytes
[  4]  3.09-4.09  sec   430 MBytes  3.60 Gbits/sec  1356  39.1 MBytes
[  4]  4.09-5.09  sec   419 MBytes  3.52 Gbits/sec     0  40.6 MBytes
[  4]  5.09-6.09  sec   434 MBytes  3.64 Gbits/sec     0  41.9 MBytes
[  4]  6.09-7.09  sec   454 MBytes  3.80 Gbits/sec     0  42.9 MBytes
[  4]  7.09-8.09  sec   444 MBytes  3.73 Gbits/sec     0  43.8 MBytes
[  4]  8.09-9.09  sec   477 MBytes  4.00 Gbits/sec     0  44.5 MBytes
[  4]  9.09-10.09 sec   468 MBytes  3.93 Gbits/sec     0  45.0 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval         Transfer     Bandwidth       Retr
[  4]  0.00-10.09 sec  3.39 GBytes  2.89 Gbits/sec  1510   sender
[  4]  0.00-10.09 sec  3.32 GBytes  2.83 Gbits/sec          receiver

Sample results: TCP single vs. parallel streams, 40G to 40G

1 stream: iperf3 -c 192.168.102.9
[ ID] Interval        Transfer     Bandwidth       Retransmits
[  4]  0.00-1.00 sec  3.19 GBytes  27.4 Gbits/sec  0
[  4]  1.00-2.00 sec  3.35 GBytes  28.8 Gbits/sec  0
[  4]  2.00-3.00 sec  3.35 GBytes  28.8 Gbits/sec  0
[  4]  3.00-4.00 sec  3.35 GBytes  28.8 Gbits/sec  0
[  4]  4.00-5.00 sec  3.35 GBytes  28.8 Gbits/sec  0

2 streams: iperf3 -c 192.168.102.9 -P2
[ ID] Interval        Transfer     Bandwidth       Retransmits
[  4]  0.00-1.00 sec  1.37 GBytes  11.8 Gbits/sec  7
[  6]  0.00-1.00 sec  1.38 GBytes  11.8 Gbits/sec  11
[SUM]  0.00-1.00 sec  2.75 GBytes  23.6 Gbits/sec  18
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]  9.00-10.00 sec  1.43 GBytes  12.3 Gbits/sec  4
[  6]  9.00-10.00 sec  1.43 GBytes  12.3 Gbits/sec  6
[SUM]  9.00-10.00 sec  2.86 GBytes  24.6 Gbits/sec  10
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval         Transfer     Bandwidth       Retransmits
[  4]  0.00-10.00 sec  13.8 GBytes  11.9 Gbits/sec  78
[  6]  0.00-10.00 sec  13.8 GBytes  11.9 Gbits/sec  95
[SUM]  0.00-10.00 sec  27.6 GBytes  23.7 Gbits/sec  173

40G sender to 10G receiver: parallel streams decrease throughput even more

Single stream: iperf3 -c 10.12.1.128 -w 128M
[ ID] Interval       Transfer     Bandwidth       Retr
[  4]  0.00-1.05 sec   238 MBytes  1.90 Gbits/sec    0
[  4]  1.05-2.05 sec   404 MBytes  3.38 Gbits/sec    0
[  4]  2.05-3.05 sec  1.06 GBytes  9.10 Gbits/sec  404
[  4]  3.05-4.05 sec  1.08 GBytes  9.30 Gbits/sec    0
[  4]  4.05-5.05 sec  1.14 GBytes  9.78 Gbits/sec    0
[  4]  5.05-6.05 sec  1.15 GBytes  9.89 Gbits/sec    0

2 parallel streams: iperf3 -c 10.12.1.128 -P2
[ ID] Interval        Transfer     Bandwidth       Retr
[  4]  0.00-10.05 sec  2.69 GBytes  2.30 Gbits/sec  311
[  6]  0.00-10.05 sec  2.69 GBytes  2.30 Gbits/sec  613
[SUM]  0.00-10.05 sec  5.38 GBytes  4.60 Gbits/sec  924

Intel Sandy/Ivy Bridge

Sample results: TCP on Intel Sandy Bridge motherboards. 30% improvement using the right core!

nuttcp -i 192.168.2.32
2435.5625 MB / 1.00 sec = 20429.9371 Mbps 0 retrans
2445.1875 MB / 1.00 sec = 20511.4323 Mbps 0 retrans
2443.8750 MB / 1.00 sec = 20501.2424 Mbps 0 retrans
2447.4375 MB / 1.00 sec = 20531.1276 Mbps 0 retrans
2449.1250 MB / 1.00 sec = 20544.7085 Mbps 0 retrans

nuttcp -i1 -xc 2/2 192.168.2.32
3634.8750 MB / 1.00 sec = 30491.2671 Mbps 0 retrans
3723.8125 MB / 1.00 sec = 31237.6346 Mbps 0 retrans
3724.7500 MB / 1.00 sec = 31245.5301 Mbps 0 retrans
3721.7500 MB / 1.00 sec = 31219.8335 Mbps 0 retrans
3723.7500 MB / 1.00 sec = 31237.6413 Mbps 0 retrans

nuttcp: http://lcp.nrl.navy.mil/nuttcp/beta/nuttcp-7.2.1.c

Sample results: TCP on Intel Sandy Bridge motherboards, fast host to slower host

nuttcp -i1 192.168.2.31
410.7500 MB / 1.00 sec = 3445.5139 Mbps 0 retrans
339.5625 MB / 1.00 sec = 2848.4966 Mbps 0 retrans
354.5625 MB / 1.00 sec = 2974.2888 Mbps 350 retrans
326.3125 MB / 1.00 sec = 2737.3022 Mbps 0 retrans
377.7500 MB / 1.00 sec = 3168.8220 Mbps 179 retrans

Reverse direction: nuttcp -i1 192.168.2.31
2091.0625 MB / 1.00 sec = 17540.8230 Mbps 0 retrans
2106.7500 MB / 1.00 sec = 17672.0814 Mbps 0 retrans
2103.6250 MB / 1.00 sec = 17647.0326 Mbps 0 retrans
2086.7500 MB / 1.00 sec = 17504.7702 Mbps 0 retrans

Speed Mismatch Issues
- More and more often we are seeing problems sending from a faster host to a slower host
- This can look like a network problem (lots of TCP retransmits)
- The network is rarely the bottleneck anymore for many sites
- This may be true for: a 10G host to a 1G host; a 10G host to a 2G circuit; a 40G host to a 10G host; any fast host to a slower host
- Pacing at the application level does not help
- The Linux tc command does help, but only up to speeds of about 8Gbps (see the sketch below)
- Will this just mask problems with under-buffered switches?
- http://fasterdata.es.net/host-tuning/packet-pacing/
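A sketch of how such pacing can be applied (not from the slides; the interface name eth2 and the 8gbit rate are placeholder assumptions):

# pace all flows leaving eth2 at roughly 8 Gbps using the fq qdisc (Linux 3.12 or later)
tc qdisc add dev eth2 root fq maxrate 8gbit
# alternative for older kernels: a token bucket filter on the same interface
# tc qdisc add dev eth2 root tbf rate 8gbit burst 32mb latency 400ms
# inspect the qdisc and its statistics, and remove it when done
tc -s qdisc show dev eth2
tc qdisc del dev eth2 root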

10G to 1G tests, packets dropped

But 10G to 1G can work just fine too.

Compare tcpdumps: kans-pt1.es.net (10G) to eqx-chi-pt1.es.net (1G)

Compare tcpdumps: kans-pt1.es.net (10G) to uct2-net4.uchicago.edu (1G)

Compare tcpdumps: kans-pt1.es.net (10G) to uct2-net4.uchicago.edu (1G) with pacing

Summary
- At the high end, network flows are often much less than the available network capacity
- Filling that capacity requires zero packet loss
- It also requires considerable host tuning
- Work at the annual Supercomputing conference and on the ESnet 100G testbed is teaching us a lot about issues at the high end

More Information
- http://fasterdata.es.net
- http://fasterdata.es.net/host-tuning/interrupt-binding/
- http://fasterdata.es.net/science-dmz/dtn/tuning/
- http://www.es.net/testbed/
- email: BLTierney@es.net

Extra Slides

Intel Sandy/Ivy Bridge

Single flow 40G Results

Tool                 Protocol       Gbps   Send CPU   Recv CPU
netperf              TCP            17.9   100%       87%
netperf              TCP-sendfile   39.5   34%        94%
netperf              UDP            34.7   100%       95%
xfer_test (IU tool)  TCP            22     100%       91%
xfer_test (IU tool)  TCP-splice     39.5   43%        91%
xfer_test (IU tool)  RoCE           39.2   2%         1%
GridFTP              TCP            13.3   100%       94%
GridFTP              RoCE           13     100%       150%

Interrupt Affinity
- Interrupts are triggered by I/O cards (storage, network). High performance means lots of interrupts per second.
- Interrupt handlers are executed on a core.
- Depending on the scheduler, either core 0 gets all the interrupts, or interrupts are dispatched in a round-robin fashion among the cores. Both are bad for performance:
  - Core 0 gets all interrupts: with very fast I/O, that core is overwhelmed and becomes a bottleneck.
  - Round-robin dispatch: very likely the core that executes the interrupt handler will not have the code in its L1 cache, and two different I/O channels may end up on the same core.

A simple solution: interrupt binding
- Each interrupt is statically bound to a given core (e.g., network -> core 1, disk -> core 2); a minimal binding sketch follows below.
- This works well, but can become a headache and does not fully solve the problem: one very fast card can still overwhelm its core.
- You also need to bind the application to the same cores for best optimization. What about multi-threaded applications, where we want one thread per core?
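A minimal sketch of static interrupt binding (the interface name eth2, the IRQ numbers, and the core choices are placeholder assumptions; irqbalance typically has to be stopped first so it does not rewrite the affinity masks):

# stop irqbalance so manual affinity settings persist
service irqbalance stop
# find the IRQ numbers used by the NIC (assumed here to be eth2)
grep eth2 /proc/interrupts
# bind NIC IRQ 110 to core 1 and a disk controller IRQ 66 to core 2
# (the value is a hexadecimal CPU bitmask: 2 = core 1, 4 = core 2)
echo 2 > /proc/irq/110/smp_affinity
echo 4 > /proc/irq/66/smp_affinity
# run the transfer tool on the same core as the NIC interrupts
taskset -c 1 iperf3 -s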

Interrupt Binding Considerations
- On a multi-processor host, your process might run on one processor while your I/O interrupts are handled on another. When this happens there is a huge performance hit: a single 10G NIC can saturate the QPI link that connects the two processors.
- You may want to consider getting a single-CPU system, and avoid dual-processor motherboards if you only need 3 PCIe slots.
- If you need to optimize for a small number of very fast (> 6Gbps) flows, you will need to manage IRQ bindings by hand. If you are optimizing for many 500Mbps-1Gbps flows, this is less of an issue.
- You may be better off doing 4x10GE instead of 1x40GE, as you will have more control mapping IRQs to processors.
- The impact of this is even greater with Sandy/Ivy Bridge hosts, as the PCIe slots are connected directly to a processor (see the NUMA check sketch below).
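On Sandy/Ivy Bridge systems it helps to confirm which socket the NIC's PCIe slot is attached to and run the application there; a sketch, again with eth2 as a placeholder interface name and node 1 as an assumed answer:

# which NUMA node (CPU socket) owns the NIC's PCIe slot? (-1 means single-socket or unknown)
cat /sys/class/net/eth2/device/numa_node
# list the cores and memory attached to each node
numactl --hardware
# run the transfer tool with CPU and memory bound to the NIC's node
numactl --cpunodebind=1 --membind=1 iperf3 -s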

Experience With 100G Equipment
ESnet experiences:
- Advanced Networking Initiative (ANI)
- ESnet5 production 100G network
- Helping other people debug their gear
Important takeaways:
- R&E requirements are outside the design spec for most gear. This results in platform limitations that sometimes can't be fixed. You need to be able to identify those limitations before you buy.
- R&E requirements are outside the test scenarios for most vendors. Bugs show up when an R&E workload is applied. You need to be able to troubleshoot those scenarios.

Platform Limitations
We have seen significant limitations in 100G equipment from all vendors with a major presence in R&E:
- 100G single flow not supported (channelized forwarding plane)
- Unexplained limitations; sometimes even the senior sales engineers don't know!
- Non-determinism in the forwarding plane; performance depends on the features used (i.e., it is config-dependent)
- Packet loss that doesn't show up in counters anywhere; if you can't find it, nobody will tell you about it (vendors don't know or won't say)
- Watch how you write your procurements
- Second-generation equipment seems to be much better, and vendors have been responsive in rolling new code to fix problems

They Don't Test For This Stuff
- Most sales engineers and support engineers don't have access to 100G test equipment: it's expensive, and setting up scenarios is time-consuming.
- The R&E traffic profile is different from their standard model. IMIX (Internet Mix) traffic is the normal test profile: aggregate web browsers, email, YouTube, Netflix, etc., with a large flow count and low per-flow bandwidth.
- This is to be expected; that's where the market is.
- R&E shops are the ones that get the testing done for the R&E profile. SCinet at the Supercomputing conference provides huge value. But, in the end, it's up to us, the R&E community.

New Technology, New Bugs
- Bugs happen: data integrity problems (traffic forwarded, but with altered data payload), packet loss, interface wedges, optics flaps.
- Monitoring systems are indispensable.
- Finding and fixing issues is sometimes hard. A rough guess: the difficulty exponent is the number of degrees of freedom (vendors/platforms, administrative domains, time zones).
- Takeaway: don't skimp on test gear (at least maintain your perfSONAR boxes).