
10GE network tests with UDP
Janusz Szuba, European XFEL

Outline
- Overview of the initial DAQ architecture
- Slice test hardware specification
- Initial networking test results
- DAQ software UDP tests
- Summary

Initial DAQ architecture (this talk focuses on the train builder to PC layer network)
- Multiple layers with well-defined APIs, to allow insertion of new technology layers
- Multiple slices for partitioning and scaling; camera sizes will increase, and slice replication will be one solution
- Data reduction and rejection enforced in all layers; early data size reduction and data rejection are needed to reduce storage resources

Slice test hardware setup
[Figure: TB prototype, PC and compute nodes, storage servers, 10 Gbps switch]

Hardware specification
The final width of a slice is 16 inputs, from the TB to 16 PCL servers, covering a 1 Mpxl detector. Specification of the first half of the slice (hardware, function, count, spec):
- PowerEdge R610, PC layer, 8: 3 GHz, 2-socket, 6-core Intel X5675, 96 GB, Intel X520-DA2 dual 10GbE NIC
- PowerEdge R610, metadata server, 2: 3 GHz, 2-socket, 6-core Intel X5675, 96 GB, Intel X520-DA2 dual 10GbE NIC
- PowerEdge C6145, computing cluster, 2: 2.6 GHz, 4-socket, 16-core AMD 6282SE, 192 GB, Intel X520-DA2 dual 10GbE NIC
- Cisco 5596UP, 10GbE switch, 1: 48 ports (extendible to 96), wire speed, SFP+
- PowerEdge C410x, PCIe chassis, 1: 8 HIC inputs, 16 x16 PCIe sleds
- NVIDIA M2075, computing, 4: PCIe x16 GPGPU
- IBM x3650m4, 9 TB servers (fast SAS), 6: 2.6 GHz, 2-socket, 8-core Intel E5-2670 Sandy Bridge, 128 GB, Mellanox ConnectX-3 VPI (PCIe Gen3, QSFP with adapter to SFP+) dual NIC, 14x 900 GB 10 krpm 6 Gbps SAS, RAID6
- IBM x3650m4, servers for extension chassis, 2: 2.6 GHz, 2-socket, 8-core Intel E5-2670 Sandy Bridge, 128 GB, Mellanox ConnectX-3 VPI (PCIe Gen3, QSFP with adapter to SFP+) dual NIC
- IBM EXP2512, 28 TB extension chassis, 2: 3 TB 7.2 krpm 6 Gbps NL SAS drives, RAID6, connected through 2 mini-SAS connectors
- Demonstrator TB, 2: board and ATCA crate

Initial network tests with system tools
Two hosts (1 and 2), each with a dual-port 10 Gbps NIC, connected via the switch.
Tools:
- netperf, e.g.: netperf -H 192.168.142.55 -t tcp_stream -l 60 -T 3,4 -- -s 256K -S 256K
- mpstat for CPU and interrupt usage
Settings:
- Socket buffer size (128 kB to 4 MB)
- Adjusted kernel parameters (defaults in comments):
  net.core.rmem_max = 16777216          # 131071
  net.core.wmem_max = 16777216          # 131071
  net.core.rmem_default = 10000000      # 129024
  net.core.wmem_default = 10000000      # 129024
  net.core.netdev_max_backlog = 300000  # 1000
  txqueuelen = 10000                    # 1000
- MTU 1500 and 9000 (jumbo frames)
- CPU affinity
Two modes: unidirectional stream (PCL node 1 -> switch -> PCL node 2) and bidirectional stream (PCL node 1 <-> switch <-> PCL node 2).
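As an illustration of the socket buffer setting above, a minimal C sketch (not the actual test code) of how an application can request a larger UDP receive buffer and check what the kernel really granted; the request is capped by net.core.rmem_max, which is why the sysctl limits above matter:

/* Minimal sketch (not the actual DAQ test code): request a large UDP
 * receive buffer and check what the kernel really granted. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0) { perror("socket"); return 1; }

    int requested = 4 * 1024 * 1024;   /* 4 MB, upper end of the tested range */
    if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0)
        perror("setsockopt SO_RCVBUF");

    int granted = 0;
    socklen_t len = sizeof(granted);
    getsockopt(s, SOL_SOCKET, SO_RCVBUF, &granted, &len);

    /* Linux clamps the request to net.core.rmem_max and reports roughly
     * twice the usable size because of bookkeeping overhead. */
    printf("requested %d bytes, kernel granted %d bytes\n", requested, granted);

    close(s);
    return 0;
}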

TCP and UDP throughput tests
- Both TCP and UDP reach wire-speed bandwidth
- More tuning is required to reduce UDP packet loss
- Further tests were performed with the DAQ software under development

Software architecture overview
Two applications are under development: a train builder emulator and a PCL node application.
- Data-driven model (push): start the PC nodes, which wait for data; start the UDP feeders, which wait for the timer signal; then start the timing system.
- Communication model: 1-to-1 and N-to-N.
[Diagram: train builder emulator (timing system, UDP feeders 1..m) sending to PC-layer nodes 1..n]
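To make the push model concrete, here is a hedged C sketch of a feeder main loop under the assumptions above: the feeder waits for a periodic timer tick (standing in for the timing-system signal) and then hands one train to a send routine. The send_train() helper and the 100 ms cycle are illustrative placeholders, not part of the original software.

/* Sketch of the data-driven (push) feeder loop. send_train() is a
 * hypothetical helper; the timing-system signal is emulated here with
 * a periodic CLOCK_MONOTONIC timer. */
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static void send_train(uint64_t train_id)
{
    /* placeholder: chunk the train into UDP packets and send them to the
     * paired PC-layer node (see the packet-format sketch below) */
    printf("sending train %llu\n", (unsigned long long)train_id);
}

int main(void)
{
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (uint64_t train_id = 0; ; ++train_id) {
        /* wait for the next "timer signal" (assumed 100 ms cycle) */
        next.tv_nsec += 100 * 1000 * 1000;
        if (next.tv_nsec >= 1000000000L) { next.tv_sec++; next.tv_nsec -= 1000000000L; }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

        send_train(train_id);   /* push one train to the receiver */
    }
    return 0;
}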

TB data format
- Train structure (configurable): header, images and descriptors, detector data, trailer
- Implementation: plain vanilla UDP; the train is divided into chunks of 8 kB (configurable)
- Packet structure: data (8192 bytes) followed by a frame number (4 bytes), SoF/EoF flags (1 byte) and a packet number (3 bytes); SoF and EoF mark the first and last packet of a train, and packets are numbered sequentially (000, 001, 002, 003, ...)
[Diagram: train layout (header, images data, images descriptors, trailer) and packet layout]
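A hedged C sketch of what such a packet layout could look like; the field names, trailer placement, flag encoding and 3-byte packet counter encoding are assumptions based on the slide, not the actual TB implementation.

/* Sketch of the per-packet metadata described above (assumed layout):
 * 8192 bytes of train data followed by 8 bytes of metadata with
 * frame number (4 B), SoF/EoF flags (1 B) and packet number (3 B). */
#include <stdint.h>
#include <string.h>

#define PACKET_DATA_SIZE 8192   /* configurable chunk size from the slide */

#define FLAG_SOF 0x01           /* first packet of a train (assumed encoding) */
#define FLAG_EOF 0x02           /* last packet of a train  (assumed encoding) */

struct tb_packet {
    uint8_t  data[PACKET_DATA_SIZE];
    uint32_t frame;             /* # Frame, 4 bytes */
    uint8_t  flags;             /* SoF / EoF, 1 byte */
    uint8_t  packet_no[3];      /* # Packet, 3 bytes, big-endian (assumed) */
} __attribute__((packed));

/* Split a train buffer into packets; returns the number of packets filled.
 * Purely illustrative: a real sender would write each packet to the socket
 * instead of materialising them all in memory. */
static size_t chunk_train(const uint8_t *train, size_t train_len,
                          uint32_t frame, struct tb_packet *out, size_t max_out)
{
    size_t n = (train_len + PACKET_DATA_SIZE - 1) / PACKET_DATA_SIZE;
    if (n > max_out)
        return 0;

    for (size_t i = 0; i < n; i++) {
        size_t off = i * PACKET_DATA_SIZE;
        size_t len = train_len - off < PACKET_DATA_SIZE ? train_len - off
                                                        : PACKET_DATA_SIZE;
        memset(&out[i], 0, sizeof(out[i]));
        memcpy(out[i].data, train + off, len);
        out[i].frame = frame;
        out[i].flags = (uint8_t)((i == 0 ? FLAG_SOF : 0) | (i == n - 1 ? FLAG_EOF : 0));
        out[i].packet_no[0] = (uint8_t)(i >> 16);   /* 24-bit packet counter */
        out[i].packet_no[1] = (uint8_t)(i >> 8);
        out[i].packet_no[2] = (uint8_t)i;
    }
    return n;
}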

Tuning summary I
Driver:
- Newer driver with NAPI enabled: ixgbe-3.9.15
- InterruptThrottleRate=0,0 (disabled; default 16000,16000)
- LRO (Large Receive Offload) = 1 (default)
- DCA (Direct Cache Access) = 1 (default)
Linux kernel (socket buffers):
- net.core.rmem_max = 1342177280 (default 131071)
- net.core.wmem_max = 516777216 (default 131071)
- net.core.rmem_default = 10000000 (default 129024)
- net.core.wmem_default = 10000000 (default 129024)
- net.core.netdev_max_backlog = 416384 (default 1000)
- net.core.optmem_max = 524288 (default 20480)
- net.ipv4.udp_mem = 11416320 15221760 22832640 (default 262144 327680 393216)
- net.core.netdev_budget = 1024 (replaces netdev_max_backlog for the NAPI driver)

Tuning summary II
NIC:
- Jumbo frames: MTU = 9000
- Ring buffer RX and TX = 4096 (default 512)
- rx-usecs = 1 (default 1)
- tx-frames-irq = 4096 (default 512)
PCI bus:
- MMRBC (maximum memory read byte count) = 4096 bytes (default 512)
Linux services:
- The iptables, irqbalance and cpuspeed services are stopped
- IRQ affinity: all interrupts are handled by core 0
- Hyper-threading: no visible effect (ON or OFF)

Test setup
Application settings:
- Core 0 is used exclusively for IRQ processing
- Sender thread on core 7
- Receiver thread on core 2 (an even-numbered core other than 0); some processing is done on the received packets
- A separate thread compiles statistics and writes them to a log file, running in parallel on core 5 (rate reported every 1 s)
- Run as root: necessary to set the scheduling policy and thread priority
[Diagram: thread placement across NUMA node 0 and NUMA node 1: packet steering (soft IRQ) thread, UDP receiver thread, stat/control thread and UDP sender thread relative to the dual-port NIC]
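A hedged C sketch of how a thread can be pinned to a specific core and given a real-time priority, which is the kind of setting that requires root as noted above; the core number and SCHED_FIFO priority are illustrative, not taken from the actual DAQ code.

/* Sketch: pin the calling thread to one core and switch it to SCHED_FIFO.
 * Requires root (or CAP_SYS_NICE) for the real-time priority, as noted
 * in the slide. Compile with -pthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static int pin_and_prioritise(int core, int rt_priority)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);

    /* bind the current thread to the chosen core (e.g. core 2 for the
     * UDP receiver, core 7 for the sender, core 5 for statistics) */
    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0) {
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
        return -1;
    }

    /* give the thread a real-time scheduling policy and priority */
    struct sched_param sp = { .sched_priority = rt_priority };
    err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (err != 0) {
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
        return -1;
    }
    return 0;
}

int main(void)
{
    if (pin_and_prioritise(2, 50) == 0)   /* receiver thread on core 2 (example values) */
        puts("thread pinned to core 2 with SCHED_FIFO priority 50");
    return 0;
}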

Test results
- Sender/receiver pairs run in parallel on all machines [diagram: node pairs 1-5, 2-6, 3-7, 4-8]
- UDP packet size 8192 bytes (configurable)
- Run lengths (number of packets / total time): 10^6 (12.21 s; 6.6 s without the timing profile), 10^7 to ~2.5x10^9 (several runs without the timing profile), 5x10^9 (16 h 37 m)
- Emulating a realistic timing profile from the TB: each node sends one train (~131030 packets), then waits 0.7 s
- Packet loss: a few tens to hundreds of packets may be lost per run
  - When? Loss is observed only at the beginning of a run (warm-up phase?)
  - Packet loss rate < 10^-9 (excluding the warm-up phase)
  - Where? Varies: all machines, a few machines, or no machine

CPU usage
[Figure: CPU usage (%) over time (arbitrary units) for the UDP reader thread on core 2, split into idle, user and system time]

Summary
- Tuning is required to achieve full 10GE bandwidth
- Additional kernel, driver, system and application tuning is required to reduce UDP packet loss