Achieving 98 Gbps of cross-country TCP traffic using 2.5 hosts, 10 x 10G NICs, and 10 TCP streams

Achieving 98 Gbps of Cross-Country TCP Traffic Using 2.5 Hosts, 10 x 10G NICs, and 10 TCP Streams
Eric Pouyoul, Brian Tierney
ESnet
January 25, 2012

ANI 100G Testbed / ANI Middleware Testbed (topology diagram)
- Two sites, NERSC and ANL, joined by the ANI 100G network; each site also has 10G connectivity to ESnet.
- Each site has a site router (nersc-mr2 at NERSC), a 1GE management switch (nersc-asw1 / nersc-c2940 and anl-asw1 / anl-c2940), an application host (nersc-app, anl-app), and an ANI 100G router connected at 100G.
- Test hosts attach to their site's ANI 100G router over 4 x 10GE multimode links (eth2-5):
  - nersc-diskpt-1: 2 x (2x10G Myricom), 1 x (4x10G HotLava)
  - nersc-diskpt-2: 1 x (2x10G Myricom), 1 x (2x10G Chelsio), 1 x (6x10G HotLava)
  - nersc-diskpt-3: 1 x (2x10G Myricom), 1 x (2x10G Mellanox), 1 x (6x10G HotLava)
  - anl-mempt-1: 2 x (2x10G Myricom)
  - anl-mempt-2: 2 x (2x10G Myricom)
  - anl-mempt-3: 1 x (2x10G Myricom), 1 x (2x10G Mellanox)
Note: the ANI 100G routers and the 100G wave are available until summer 2012; testbed resources after that are subject to funding availability. (Diagram updated December 11, 2011.)

Test Configuration Details
48.6 ms RTT, 97.9 Gbps aggregate TCP throughput* with 10 TCP streams

Per-Stream Results

Hosts Must Be Tuned
- BIOS
- Firmware
- Device drivers
- NIC/TCP
- Process affinity

BIOS Tuning: Free the Hardware
- Hyperthreading: disable it; we want real cores.
- CPU frequency scaling: disable it, along with all energy-saving features; we want full power all the time (a sysfs sketch follows below).
- Check the memory bus speed (force it to the maximum).
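Frequency scaling can also be pinned from the operating system side. A minimal sketch, assuming a Linux host that exposes the cpufreq sysfs interface (governor names and paths vary with kernel and driver):

    # keep every core at full clock by selecting the "performance" governor
    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > "$g"
    done
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor    # verify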

NIC/TCP Tuning
- We are using Myricom 10G NICs.
- Download the latest driver/firmware from the vendor site; the driver version shipped in RHEL/CentOS is fairly old.
- Enable MSI-X.
- Increase txqueuelen: /sbin/ifconfig eth2 txqueuelen 10000
- Increase interrupt coalescence: /usr/sbin/ethtool -C eth2 rx-usecs 100
- Standard TCP tuning (collected in the sketch below):
  net.core.rmem_max = 67108864
  net.core.wmem_max = 67108864
  net.core.netdev_max_backlog = 250000
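Collected for convenience, a sketch of applying the settings above from a shell; the interface name eth2 matches the slides, and the values are the ones shown here rather than universal recommendations:

    /sbin/ifconfig eth2 txqueuelen 10000           # longer transmit queue
    /usr/sbin/ethtool -C eth2 rx-usecs 100         # coalesce receive interrupts
    sysctl -w net.core.rmem_max=67108864           # max socket receive buffer (64 MB)
    sysctl -w net.core.wmem_max=67108864           # max socket send buffer (64 MB)
    sysctl -w net.core.netdev_max_backlog=250000   # backlog of packets awaiting protocol processing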

100G = It's full of frames!
Problem: interrupts are very expensive. Even with jumbo frames and driver optimization, there are still too many interrupts.
Solution:
- Turn off Linux irqbalance (chkconfig irqbalance off).
- Use /proc/interrupts to get the list of interrupts.
- Dedicate an entire processor core to each 10G interface.
- Use /proc/irq/<irq-number>/smp_affinity to bind rx/tx queues to a specific core (see the sketch below).
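A sketch of that binding; the IRQ number 98 and the core-4 mask are illustrative values that would be read from /proc/interrupts on the host in question:

    chkconfig irqbalance off               # keep irqbalance from starting at boot
    service irqbalance stop                # stop the running daemon (SysV-style init assumed)
    grep eth2 /proc/interrupts             # list this NIC's rx/tx queue IRQ numbers
    echo 10 > /proc/irq/98/smp_affinity    # hex CPU mask: 0x10 = core 4 (illustrative IRQ and mask)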

Application Tuning
nuttcp from NASA provides a command-line option to bind to a processor core.
http://fasterdata.es.net/fasterdata/network-troubleshooting-tools/nuttcp/
Server command:
nuttcp -S -P5200 -p5201 & nuttcp -S -P5300 -p5301 & nuttcp -S -P5400 -p5401 & nuttcp -S -P5500 -p5501
Client command (one server/client pair is expanded below):
nuttcp -xc2/2 -I1 -P5200 -p5201 -i1 -T30 -w50m 192.168.2.11 & nuttcp -xc3/3 -I2 -P5300 -p5301 -i1 -T30 -w50m 192.168.3.11 & nuttcp -xc4/4 -I3 -P5400 -p5401 -i1 -T30 -w50m 192.168.4.11 & nuttcp -xc5/5 -I4 -P5500 -p5501 -i1 -T30 -w50m 192.168.5.11 &
Note: hand-tune the TCP window to 50 MB.
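For readability, one of the four server/client pairs above with its flags spelled out; the comments are our reading of the nuttcp options used on the slide:

    # receiver (server) side
    nuttcp -S -P5200 -p5201 &    # -S server mode, -P control port, -p data port
    # sender (client) side
    nuttcp -xc2/2 -I1 -P5200 -p5201 -i1 -T30 -w50m 192.168.2.11 &
    #   -xc2/2   pin the nuttcp client/server processing to core 2
    #   -I1      label this stream "1" in the output
    #   -i1 -T30 report every second, run for 30 seconds
    #   -w50m    hand-tuned 50 MB TCP window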

More Information
ANI 100G Testbed: http://sites.google.com/a/lbl.gov/ani-testbed/
  email: ani-testbed-proposal@es.net, BLTierney@es.net
  The next round of proposals is due April 1, 2012.
Host tuning: http://fasterdata.es.net/fasterdata/host-tuning/
Even more information: lomax@es.net

Extra Slides

PCI Optimization: MSI-X
- An extension to MSI (Message Signaled Interrupts).
- Increases the number of interrupt vectors per card.
- Associates rx/tx queues with a given core.
- Allows the thread that runs the program and the asynchronous events it receives (incoming network packets, asynchronous I/O, ...) to be handled on the same core, maximizing L1 cache hits.
- Requires chipset, card, and operating system support.
- Optimized for Linux kernels > 2.6.26.
- This is a major optimization: on a system with 4 x 10G Ethernet, the performance gain can be up to 20% (a quick check for MSI-X support appears below).
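A quick check that MSI-X is present and enabled on a given card; the PCI address 06:00.0 is only a placeholder for wherever the Myricom NIC happens to sit:

    lspci -vv -s 06:00.0 | grep -i 'msi-x'   # look for "MSI-X: Enable+" and the vector count
    grep eth2 /proc/interrupts               # with MSI-X, each rx/tx queue gets its own IRQ line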

Testbed Access
The proposal process for gaining access is described at: https://sites.google.com/a/lbl.gov/ani-testbed/
The testbed is available to anyone: DOE researchers, other government agencies, and industry.
Applicants must submit a short proposal to the testbed review committee, which is made up of members from the R&E community and industry.
The goal is to accept roughly five proposals per 6-month review cycle.
The next round of proposals is due April 1, 2012.