Network Measurement for 100Gbps Links Using Multicore Processors
Xiaoban Wu, Dr. Peilong Li, Dr. Yongyi Ran, Prof. Yan Luo
Department of Electrical and Computer Engineering, University of Massachusetts Lowell
http://acanets.uml.edu

Introduction: Network Measurement

Network measurement at 100 Gbps:
- Dedicated hardware, e.g. SRAM and TCAM:
  - Inflexible to run-time changes of measurement functions
  - Memory requirement is prohibitively large
- Programmable box on an x86 platform:
  - Run-time programmable
  - Rich memory/compute resources

Network measurement requirements on an x86 platform:
- High data volume: streaming algorithms such as sketches
- Low packet delay (< 27 ns per packet): cache/memory optimization
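
The per-packet delay budget follows directly from line-rate arithmetic (a back-of-envelope figure; the 20 bytes are the Ethernet preamble plus inter-frame gap):

    t_{\mathrm{pkt}} = \frac{8\,(L + 20\ \mathrm{B})}{100\ \mathrm{Gbit/s}},
    \qquad
    L = 64\ \mathrm{B} \;\Rightarrow\; t_{\mathrm{pkt}} \approx 6.7\ \mathrm{ns}

The 27 ns budget quoted above corresponds to frames of roughly 320 bytes; minimum-size 64-byte frames leave an even tighter budget.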

Motivation
- Design a high-performance 100 Gbps measurement platform with scalable performance on multicore x86 machines.
- Design sketch-based streaming measurement functions on a multicore architecture with advanced I/O and memory management techniques.
- Conduct in-depth performance analysis at both the system level and the micro-architecture level.
- Propose practical implementation solutions for high-speed network measurement on the x86 architecture.

Background: The Intel Data Plane Development Kit (DPDK)
- Kernel-space bypass: avoids context switching, reduces memory copies
- Multicore framework support
- Hugepages: efficient TLB translation
- Ring buffers: lockless design
- Poll-mode driver (PMD): avoids interrupts (see the RX-loop sketch below)
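
To make the poll-mode receive path concrete, below is a minimal RX loop (a sketch against the DPDK ethdev API; it assumes the port and a mempool are already configured, and the port_id width varies across DPDK releases):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    /* Busy-poll one RX queue: no interrupts, no kernel involvement. */
    static void rx_loop(uint16_t port_id)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            /* Pull up to BURST_SIZE packets straight from the NIC ring. */
            uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

            for (uint16_t i = 0; i < nb_rx; i++) {
                /* ... measurement function runs here ... */
                rte_pktmbuf_free(bufs[i]);
            }
        }
    }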

Background: Streaming Algorithms and Sketches

Count-Min Sketch (CMS)
- Data structure: a matrix with several rows; each entry is a counter
- Pros: ensures a certain accuracy bound
- Cons: cannot work alone in distributed measurement

Reversible Sketch (RS)
- Data structure: a matrix with several rows; each entry consists of a counter and several sets
- Pros: ensures a certain accuracy bound; can work in distributed measurement
- Cons: high memory consumption

Simple Hash Table (SHT)
- Data structure: a matrix with only one row; each entry consists of a counter and a 5-tuple
- Pros: can work in distributed measurement; low memory consumption
- Cons: accuracy is not guaranteed
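
For concreteness, a Count-Min Sketch update touches one counter per row and a query takes the minimum across rows (a minimal C sketch; CMS_ROWS, CMS_WIDTH and hash_row() are illustrative, not the paper's parameters):

    #include <stddef.h>
    #include <stdint.h>

    #define CMS_ROWS  4
    #define CMS_WIDTH 65536

    static uint32_t cms[CMS_ROWS][CMS_WIDTH];

    /* hash_row(): hypothetical per-row hash function (e.g., a seeded hash). */
    uint32_t hash_row(int row, const void *key, size_t len);

    /* Update: increment one hashed counter in every row. */
    static void cms_update(const void *flow_key, size_t len, uint32_t count)
    {
        for (int r = 0; r < CMS_ROWS; r++)
            cms[r][hash_row(r, flow_key, len) % CMS_WIDTH] += count;
    }

    /* Query: the estimate is the minimum over the rows (hence "count-min").
     * Collisions can only inflate counters, so the estimate never
     * under-counts; this is the accuracy bound mentioned above. */
    static uint32_t cms_query(const void *flow_key, size_t len)
    {
        uint32_t est = UINT32_MAX;
        for (int r = 0; r < CMS_ROWS; r++) {
            uint32_t c = cms[r][hash_row(r, flow_key, len) % CMS_WIDTH];
            if (c < est)
                est = c;
        }
        return est;
    }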

Measurement Design: The Design Options

Based on the RX mode:
- Non-RSS design (master-slave):
  - SYNC: slave cores work synchronously
  - ASYNC: slave cores work asynchronously
- RSS design: requires the Receive Side Scaling feature of the NIC

Based on the sketch data structure:
- Shared: one central data space shared across all cores
- Separate: each core maintains its own data space

Measurement Design: SYNC Design

Master core:
- Receives a batch of packets
- Partitions the packets
- Assigns slave cores
- Waits for all slaves to finish

Each slave core:
- Fetches its data and processes it
- Sets the completion flag (see the sketch below)
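
One plausible shape for the completion-flag handshake, using C11 atomics (an illustrative sketch only; the paper does not spell out this code):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_SLAVES 8

    /* One flag per slave: set by the slave, cleared by the master. */
    static atomic_bool done[MAX_SLAVES];

    /* Master: after partitioning a batch, spin until every slave finishes. */
    static void wait_for_slaves(int nb_slaves)
    {
        for (int s = 0; s < nb_slaves; s++) {
            while (!atomic_load_explicit(&done[s], memory_order_acquire))
                ; /* busy-wait; slaves run on dedicated cores */
            atomic_store_explicit(&done[s], false, memory_order_relaxed);
        }
    }

    /* Slave: after processing its partition, signal completion. */
    static void slave_finish(int slave_id)
    {
        atomic_store_explicit(&done[slave_id], true, memory_order_release);
    }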

Measurement Design: ASYNC Design

Master core:
- Receives a batch of packets
- Partitions the packets
- Sends each non-empty partition to the dedicated ring buffer of its slave core (see the sketch below)

Each slave core:
- Polls its ring buffer
- Processes packets when present
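
The asynchronous hand-off maps naturally onto DPDK's lockless rings (a sketch; ring creation is omitted, and the burst enqueue/dequeue signatures shown are those of DPDK 17.05 and later):

    #include <rte_ring.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    /* Master side: push one partition of the batch onto a slave's ring. */
    static void dispatch(struct rte_ring *slave_ring,
                         struct rte_mbuf **pkts, unsigned int n)
    {
        /* Lockless enqueue; packets that do not fit are dropped here. */
        unsigned int sent =
            rte_ring_enqueue_burst(slave_ring, (void **)pkts, n, NULL);
        for (unsigned int i = sent; i < n; i++)
            rte_pktmbuf_free(pkts[i]);
    }

    /* Slave side: poll the ring and process whatever is available. */
    static void slave_loop(struct rte_ring *my_ring)
    {
        struct rte_mbuf *pkts[BURST_SIZE];

        for (;;) {
            unsigned int n = rte_ring_dequeue_burst(my_ring, (void **)pkts,
                                                    BURST_SIZE, NULL);
            for (unsigned int i = 0; i < n; i++) {
                /* ... sketch update on pkts[i] ... */
                rte_pktmbuf_free(pkts[i]);
            }
        }
    }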

Measurement Design: RSS Design
- Maintains N DPDK RX queues, where N matches the core count (see the sketch below)
- RSS in the NIC distributes packets across the RX queues
- Each core receives and processes packets from its own RX queue in parallel
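
Enabling RSS amounts to port configuration plus one RX queue per core (a sketch; the ETH_* flag names below are from pre-21.11 DPDK releases, which have since gained an RTE_ prefix):

    #include <rte_ethdev.h>

    /* Let the NIC hash packet headers and spread flows across RX queues. */
    static const struct rte_eth_conf port_conf = {
        .rxmode = { .mq_mode = ETH_MQ_RX_RSS },
        .rx_adv_conf = {
            .rss_conf = {
                .rss_key = NULL,   /* use the driver's default hash key */
                .rss_hf  = ETH_RSS_IP | ETH_RSS_TCP | ETH_RSS_UDP,
            },
        },
    };

    /* Then: rte_eth_dev_configure(port, N, N, &port_conf), followed by one
     * rte_eth_rx_queue_setup() per queue, each polled by its own core. */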

Measurement Design: Separate and Shared Design
- Separate design: each core has its own sketch
- Shared design: a central sketch memory space shared across all cores

Data-structure operations for sketches (details in the paper; a counter sketch follows below):
- Update a counter concurrently
- Update a set concurrently
- Update a critical section concurrently
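
The counter case illustrates the shared-vs-separate trade-off: the shared design needs atomic read-modify-write operations and suffers cache-line contention, while the separate design uses plain per-core increments (an illustrative C11 sketch, not the paper's code; array sizes are arbitrary):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Shared design: one counter array; every update is an atomic RMW,
     * and cores contend on the same cache lines. */
    static _Atomic uint32_t shared_counters[65536];

    static inline void shared_update(uint32_t idx)
    {
        atomic_fetch_add_explicit(&shared_counters[idx], 1,
                                  memory_order_relaxed);
    }

    /* Separate design: a private array per core; plain increments with no
     * contention, merged only at query time. */
    static uint32_t separate_counters[8][65536];   /* [core][bucket] */

    static inline void separate_update(unsigned int core, uint32_t idx)
    {
        separate_counters[core][idx]++;
    }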

Evaluation: Experiment Platform and Methodology
- CPU: two 6-core Xeon E5-2643 per server @ 3.4 GHz
- L1 cache: 32 KB instruction, 32 KB data
- L2 cache: 256 KB
- LLC: 20 MB
- Memory: 16 GB
- NIC: Mellanox ConnectX-4 EDR 100GbE, attached to socket #1
- Host OS: Ubuntu 16.04
- DPDK: version 16.04
- Setup: two servers, one TX (traffic generation) and one RX (measurement)

Evaluation: Packet Generator and Packet Size Study
- Packet generator: modified pktgen-dpdk with random L2/L3/L4 headers and random-sized payloads.
- 64-byte packets show both the highest packet rate and the highest packet drop rate.

Evaluation: DPDK Mempool Size
- How should the DPDK mempool size be chosen?
- A 16K-entry mempool gives the best performance (see the sketch below).
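
The mempool is sized once at initialization (a sketch using the 16K figure measured here; the cache size is illustrative):

    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    static struct rte_mempool *create_pool(void)
    {
        return rte_pktmbuf_pool_create(
            "MBUF_POOL",
            16384,                      /* 16K mbufs: the measured sweet spot */
            256,                        /* per-lcore object cache */
            0,                          /* no private data area */
            RTE_MBUF_DEFAULT_BUF_SIZE,  /* data room per mbuf */
            rte_socket_id());           /* allocate on the local NUMA node */
    }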

Evaluation: NUMA and Cores
Study of the number of cores and the NUMA effect:
- Packet drop rate: changes by up to 50% from 2 cores to 6 cores.
- NUMA effect: up to 50% difference in packet latency (see the check below).
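
The NUMA effect can be caught programmatically by comparing the polling core's socket with the NIC's socket (a sketch using standard ethdev/lcore calls):

    #include <stdio.h>
    #include <rte_ethdev.h>
    #include <rte_lcore.h>

    /* Warn when an lcore polls a NIC on a remote NUMA node: every descriptor
     * and mbuf access then crosses the inter-socket link. */
    static void check_numa(uint16_t port_id)
    {
        int nic_socket  = rte_eth_dev_socket_id(port_id);
        int core_socket = rte_socket_id();

        if (nic_socket >= 0 && nic_socket != core_socket)
            printf("warning: core on socket %d polls NIC on socket %d\n",
                   core_socket, nic_socket);
    }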

Evaluation: Memory and Cache
How does the size of the sketches affect the cache?
- Separate data structures that outgrow the per-core cache hardly affect performance, for two reasons:
  - Cache accesses are dominated by the mempool rather than the sketches
  - Processing time contributes only around 40% of the overall packet delay

Evaluation: Shared/Separate Design (RSS)

Design              Packet Drop Rate (PDR)   Packet Processing Time (PPT, s)
RSS-Separate-SHT    1.66E-01                 3.80E-08
RSS-Shared-SHT      2.02E-01                 4.95E-08
RSS-Separate-CMS    3.86E-01                 8.29E-08
RSS-Shared-CMS      9.44E-01                 1.48E-06
RSS-Separate-RS     9.61E-01                 2.14E-06
RSS-Shared-RS       9.91E-01                 9.15E-06

- The separate design always has the lower PDR (up to 60% lower in the CMS case).
- The separate design also has the lower packet processing time (a 17.8x speedup in the CMS case).

Evaluation: Shared/Separate Design (SYNC/ASYNC)

Design               Packet Drop Rate (PDR)   Packet Processing Time (PPT, s)
SYNC-Separate-SHT    8.06E-01                 7.90E-08
SYNC-Shared-SHT      8.04E-01                 1.04E-07
ASYNC-Separate-SHT   8.03E-01                 6.66E-08
ASYNC-Shared-SHT     8.02E-01                 9.72E-08
SYNC-Separate-RS     9.93E-01                 7.83E-06
SYNC-Shared-RS       9.92E-01                 7.70E-06
ASYNC-Separate-RS    9.93E-01                 7.19E-06
ASYNC-Shared-RS      9.92E-01                 8.54E-06

- Shared vs. separate has little impact on the PDR here.
- PPT, however, is almost always better with the separate design (the only exception: synchronous RS).

Evaluation: Which Design to Choose?
Compare three performance metrics:
- Packet drop rate
- Packet processing time
- Packet delay

Evaluation: Which Sketch to Choose?

Which sketch data structure?
- No accuracy bound required: choose SHT.
- Accuracy bound required: CMS (low PDR and delay) or RS (high measurement accuracy and space efficiency).

Which parallel design?
- RSS-capable NIC: use RSS (especially with the separate data structure).
- Non-RSS NIC: opt for the ASYNC design.

Shared or separate?
- The separate data structure gives better performance.

Conclusion
- We explore various software/hardware techniques for packet receiving and processing for network measurement.
- We propose parallel designs of measurement functions using sketch-based streaming techniques.
- We conduct in-depth performance analysis of the proposed design options.
- We provide insights on optimal practical implementations in a 100 Gbps network measurement environment.

That's It
Questions, comments, thoughts?