ntop Users Group Meeting PF_RING Tutorial Alfredo Cardigliano <cardigliano@ntop.org>

Overview Introduction Installation Configuration Tuning Use cases

PF_RING Open source packet processing framework for Linux. Originally (2003) designed to accelerate packet capture on commodity hardware, using patched drivers and in-kernel filtering. Today it supports almost all Intel adapters with kernel-bypass zero-copy drivers and almost all FPGA-based capture adapters.

PF_RING's Main Features PF_RING consists of: a kernel module (pf_ring.ko), a userspace library (libpfring), userspace modules implementing multi-vendor support, and a patched libpcap for legacy applications. [Diagram: a PCAP app and a PF_RING app sit on libpfring and its modules in userspace, on top of pf_ring.ko and the NIC in the kernel; legacy applications go through pcap-over-PF_RING.]
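
To give a feel for how an application consumes packets through libpfring, here is a minimal capture-loop sketch (error handling and most options omitted; the interface name eth1 and the 1536-byte snaplen are arbitrary examples):

  #include <pfring.h>

  int main(void) {
    u_char *pkt;
    struct pfring_pkthdr hdr;

    /* Open the interface through PF_RING (use "zc:eth1" for the ZC driver) */
    pfring *ring = pfring_open("eth1", 1536 /* snaplen */, PF_RING_PROMISC);
    if (ring == NULL) return 1;

    pfring_enable_ring(ring);

    /* Blocking receive: hdr is filled with capture length and timestamp */
    while (pfring_recv(ring, &pkt, 0, &hdr, 1 /* wait for packet */) > 0) {
      /* process hdr.caplen bytes starting at pkt */
    }

    pfring_close(ring);
    return 0;
  }

The same code works on top of standard drivers, ZC drivers or the FPGA modules: only the interface name changes.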

Standard Drivers Standard kernel drivers, NAPI polling. One copy by the NIC into kernel buffers (DMA), plus one copy by the PF_RING kernel module into memory-mapped memory shared with the application. [Diagram: NIC -> pf_ring.ko in the kernel -> mmap()'ed ring read in userspace by the app through the PF_RING library.]

PF_RING ZC Drivers Userspace drivers for Intel cards; the kernel is bypassed. A single copy is done by the NIC into userspace memory (DMA); packets are then read directly by the application in zero-copy. [Diagram: the NIC DMAs packets straight into userspace memory, where the app opens zc:eth1 through PF_RING ZC; the kernel is not involved in the data path.]

PF_RING ZC API PF_RING ZC is not just a zero-copy driver: it provides a flexible API for creating full zero-copy processing patterns using three simple building blocks: Pool, Queue and Worker. Pool: a resource of DMA buffers. Queue: either a hardware device queue or a software SPSC (single-producer/single-consumer) queue. Worker: an execution unit able to aggregate traffic from M ingress queues and distribute it to N generic egress queues using custom functions. A minimal receive sketch is shown below.
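
As an illustration of how an application consumes packets from a Queue, here is a hedged sketch of a single-queue zero-copy receive loop. The cluster (zc) and device queue (inzq) are created as in the zbalance code two slides below; the calls shown (pfring_zc_get_packet_handle, pfring_zc_recv_pkt, pfring_zc_pkt_buff_data) are the standard ZC primitives, though exact signatures may vary across PF_RING versions.

  /* zc (cluster) and inzq (device queue) are created as in the zbalance code below */
  pfring_zc_pkt_buff *buf = pfring_zc_get_packet_handle(zc);

  while (pfring_zc_recv_pkt(inzq, &buf, 1 /* wait for packet */) > 0) {
    u_char *data = pfring_zc_pkt_buff_data(buf, inzq);  /* packet bytes, zero-copy */
    /* process buf->len bytes starting at data */
  }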

PF_RING ZC API - zbalance Example [Diagram: inside a Linux host, a ZC worker thread on Core 0 aggregates traffic from two 1/10G ports and distributes it in software to three consumer threads pinned to Cores 1-3, each served by its own pool and queue.]

PF_RING ZC API - zbalance code Code for aggregation and load-balancing using ZC:

  zc = pfring_zc_create_cluster(id, MTU, MAX_BUFFERS, NULL);

  for (i = 0; i < num_devices; i++)
    inzq[i] = pfring_zc_open_device(zc, devices[i], rx_only);

  for (i = 0; i < num_slaves; i++)
    outzq[i] = pfring_zc_create_queue(zc, QUEUE_LEN);

  zw = pfring_zc_run_balancer(inzq, outzq, num_devices, num_slaves,
                              NULL, NULL, !wait_for_packet, core_id);

FPGA Support Currently PF_RING natively supports the following vendors (1/10/40/100 Gbit); PF_RING-based applications transparently select the module by means of the interface name. Example:
  pfcount -i eth1        [Vanilla Linux adapter]
  pfcount -i zc:eth1     [Intel ZC drivers]
  pfcount -i nt:1        [Napatech]
  pfcount -i myri:1      [Myricom]
  pfcount -i exanic:0    [Exablaze]

Many modules, single API. [Diagram: applications open eth0, zc:eth1, nt:0, myri:0, stack:eth2 or timeline:/storage through the same libpfring API; libpfring dispatches to the pf_ring, ZC, Napatech (Napatech lib), Myricom (SNF lib), Stack and n2disk (PCAP index) modules, which sit on top of the in-kernel pf_ring.ko ring buffer (packet copy) with standard NAPI drivers, the zero-copy (DMA) ZC drivers on Intel NICs, or the Napatech/Myricom FPGA cards.]

Overview Introduction Installation Configuration Tuning Use cases

Installation Two options for installing PF_RING: Source Code (GitHub) or Packages, available as Stable or Dev (a.k.a. nightly builds).

Installation - Source Code Download:
  # git clone https://github.com/ntop/pf_ring.git
Installation:
  # cd PF_RING/kernel
  # make && make install
  # cd ../userland
  # make && make install
ZC drivers installation (optional):
  # cd PF_RING/drivers/intel/<model>/<model>-<version>-zc/src
  # make && make install
Support for FPGAs (Napatech, Myricom, etc.) is automatically enabled if the vendor drivers are installed.

Installation - Packages CentOS/Debian/Ubuntu stable/devel repositories at http://packages.ntop.org. Installation (Ubuntu 16.04 example):
  # wget http://apt.ntop.org/16.04/all/apt-ntop.deb
  # dpkg -i apt-ntop.deb
  # apt-get clean all
  # apt-get update
  # apt-get install pfring
ZC drivers installation (optional):
  # apt-get install pfring-drivers-zc-dkms
Support for FPGAs (Napatech, Myricom, etc.) is already included.

Overview Introduction Installation Configuration Tuning Use cases

Loading PF_RING If you compiled from source code:
  # cd PF_RING/kernel
  # insmod ./pf_ring.ko
If you are using packages:
  # tree /etc/pf_ring/
  /etc/pf_ring/
  |-- pf_ring.conf
  `-- pf_ring.start
  # /etc/init.d/pf_ring start

Loading ZC Drivers ZC drivers are available for almost all Intel cards based on e1000e, igb, ixgbe, i40e, fm10k. ZC needs hugepages for memory allocation; the pf_ring init script takes care of reserving them. A ZC interface acts as a standard interface (e.g. you can set an IP on ethX) until you open it using the zc: prefix (e.g. zc:ethX).

Loading ZC Drivers If you compiled from source code:
  # cd PF_RING/drivers/intel/<model>/<model>-<version>-zc/src
  # ./load_driver.sh
In essence the script reserves hugepages, loads the dependencies and loads the module with:
  # insmod <model>.ko RSS=1,1 [other options]
You can check that the ZC driver is actually running with:
  # cat /proc/net/pf_ring/dev/eth1/info | grep ZC
  Polling Mode: ZC/NAPI

Loading ZC Drivers If you are using packages (ixgbe driver in this example):
  # tree /etc/pf_ring/
  /etc/pf_ring/
  |-- hugepages.conf
  |-- pf_ring.conf
  |-- pf_ring.start
  `-- zc
      `-- ixgbe
          |-- ixgbe.conf
          `-- ixgbe.start
Where:
  # cat /etc/pf_ring/hugepages.conf
  node=0 hugepagenumber=1024
  # cat /etc/pf_ring/zc/ixgbe/ixgbe.conf
  RSS=1,1

RSS RSS (Receive Side Scaling) distributes the load across the specified number of RX queues based on a hash function which is IP-based (or IP/port-based in the case of TCP). [Diagram: the network card hashes incoming packets to Queues 0-3, each served by a different CPU core (0-3).]

RSS Set the number of RSS queues using the RSS insmod option or ethtool:
  # ethtool --set-channels eth1 combined 4
  # cat /proc/net/pf_ring/dev/eth1/info | grep Queues
  TX Queues: 4
  RX Queues: 4
In order to open a specific interface queue, you have to specify the queue ID using the "@<ID>" suffix:
  # tcpdump -i zc:eth1@0
Note: when using ZC, zc:eth1 is the same as zc:eth1@0! This happens because ZC is a kernel-bypass technology: there is no abstraction (queue aggregation) provided by the kernel.

Indirection Table The destination queue is selected in combination with an indirection table: queue = indirection_table[rss_hash(packet)]. It is possible to configure the indirection table using ethtool by simply applying weights to each RX queue, as shown in the sketch and ethtool output below.
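
Conceptually (this is an illustrative sketch, not driver code), with the 128-entry table printed by ethtool -x below, the queue lookup boils down to indexing the table with the low bits of the RSS hash:

  #include <stdint.h>

  /* Illustrative only: assumes a 128-entry indirection table as reported by
     "ethtool -x" on this NIC; real hardware performs this lookup internally. */
  static unsigned int pick_rx_queue(uint32_t rss_hash, const uint8_t table[128]) {
    return table[rss_hash & 127];  /* low 7 bits of the hash index the table */
  }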

Indirection Table
  # ethtool --set-channels eth1 combined 4
  # ethtool -x eth1
  RX flow hash indirection table for eth1 with 4 RX ring(s):
      0:   0  1  2  3  0  1  2  3
      8:   0  1  2  3  0  1  2  3
    ...  (all 128 entries repeat the 0 1 2 3 pattern)
    120:   0  1  2  3  0  1  2  3
(Row labels are the hash bucket index; the values are the destination queue IDs.)

Indirection Table
  # ethtool -X eth1 weight 1 0 0 0
  # ethtool -x eth1
  RX flow hash indirection table for eth1 with 4 RX ring(s):
      0:   0  0  0  0  0  0  0  0
      8:   0  0  0  0  0  0  0  0
    ...  (all 128 entries are now 0: every hash bucket maps to queue 0)
    120:   0  0  0  0  0  0  0  0
(Row labels are the hash bucket index; the values are the destination queue IDs.)

Overview Introduction Installation Configuration Tuning Use cases

Xeon Architecture [Diagram: dual-socket Xeon system; each CPU has its own local memory and PCIe lanes, and the two sockets are interconnected.]

QPI QPI (QuickPath Interconnect) is the bus that interconnects the nodes of a NUMA system. QPI is used for moving data between nodes when accessing remote memory or PCIe devices. It also carries cache-coherency traffic.

Memory Each CPU has its local memory directly attached. Accessing remote memory is slow, as data flows through the QPI, which has lower bandwidth and adds latency (hundreds of nanoseconds). Example: an E5-2687W v4 has a 9.6 GT/s QPI link, while its DDR4-2400 RAM delivers 76.8 GB/s; an 8.0 GT/s QPI link provides roughly 32 GB/s of bandwidth.
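
As a sanity check on the ~32 GB/s figure (assuming the usual QPI data width of 16 bits, i.e. 2 bytes per transfer, per direction):

  \[ 8.0\ \text{GT/s} \times 2\ \text{B/transfer} \times 2\ \text{directions} = 32\ \text{GB/s} \]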

PCIe Each node has its own dedicated PCIe lanes. Plug the network card (and the RAID controller) into the right slot: check the motherboard manual to see which slots are attached to which CPU.

Memory Channels Multi-channel memory increases the data transfer rate between the memory and the memory controller. You can use n2membenchmark as a benchmark tool. Check how many channels your CPU supports and use at least as many memory modules as the number of channels (check with dmidecode).

CPU Cores Pinning a process/thread to a core is important to isolate processing and improve performance. In most cases dedicating a physical core to each thread (pay attention to hyper-threading) is the best choice for optimal performance. A minimal pinning sketch is shown below.
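
For applications that do their own pinning (rather than relying on taskset or the --cpu-affinity options shown next), a minimal sketch using the standard Linux affinity API might look like this; the core number is just an example:

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>

  /* Pin the calling thread to a single core (e.g. pin_to_core(1)). */
  static int pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
  }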

Core Affinity All our applications natively support CPU pinning, e.g.:
  # nprobe -h | grep affinity
  [--cpu-affinity|-4] <CPU/Core Id>   Binds this process to the specified CPU/Core
When not supported, you can use external tools:
  # taskset -c 1 tcpdump -i eth1

NUMA Affinity You can check your NUMA-related hardware configuration with:
  # lstopo
  # numactl --hardware
When CPU pinning is configured, the application usually allocates memory on the correct NUMA node; if this is not the case, you can use external tools:
  # numactl --membind=0 --cpunodebind=0 tcpdump -i zc:eth1
You can check your QPI bandwidth with:
  # numactl --membind=0 --cpunodebind=1 n2membenchmark

PF_RING ZC Driver NUMA Affinity PF_RING ZC drivers allocate data structures (RX/TX rings) in memory, so setting the NUMA affinity is important. You can do that at insmod time:
  # insmod <model>.ko RSS=1,1,1,1 numa_cpu_affinity=0,0,8,8
Or if you are using packages:
  # cat /etc/pf_ring/zc/ixgbe/ixgbe.conf
  RSS=1,1,1,1 numa_cpu_affinity=0,0,8,8

Traffic Recording - Wrong Configuration [Diagram: 1. NIC to memory, 2. processing in memory, 3. memory to storage, with the data path crossing the QPI between the two NUMA nodes.]

Traffic Recording - Correct Configuration [Diagram: NIC, processing cores, memory and storage controller all on the same NUMA node, so the data path never crosses the QPI.]

Overview Introduction Installation Configuration Tuning Use cases

RSS Load Balancing [Diagram: the network card's RSS distributes traffic to four queues, each read through PF_RING ZC by a consumer thread pinned to its own core (0-3).]

RSS: When it can be used Flow-based traffic analysis (multi-threaded or multi-process) and all the applications where a divide-and-conquer strategy is applicable. Examples: nprobe (NetFlow probe), nProbe Cento, Suricata, Bro.

RSS: nprobe Example nprobe instances with 4 RSS queues:
  # nprobe -i zc:eth1@0 --cpu-affinity 0 [other options]
  # nprobe -i zc:eth1@1 --cpu-affinity 1 [other options]
  # nprobe -i zc:eth1@2 --cpu-affinity 2 [other options]
  # nprobe -i zc:eth1@3 --cpu-affinity 3 [other options]

RSS: Bro Example Bro node.cfg example with 8 RSS queues:
  [worker-1]
  type=worker
  host=10.0.0.1
  interface=zc:eth1        (this is expanded into zc:eth1@0 .. zc:eth1@7)
  lb_method=pf_ring
  lb_procs=8
  pin_cpus=0,1,2,3,4,5,6,7

RSS: When it can NOT be used Applications where packet order has to be preserved (also across flows), especially if there is no hardware timestamping. For example, in n2disk (traffic recording) we have to keep the original order for the packets dumped to disk.

ZC Load Balancing (zbalance_ipc) [Diagram: a zbalance_ipc worker on Core 0 aggregates traffic from two 1/10G ports and distributes it in software to consumer App A and to two threads of consumer App B, each attached via PF_RING ZC and pinned to its own core (1-3).]

ZC Load Balancing: When it is useful When RSS is not available or not flexible enough (with ZC you can build your own distribution function/hash); when you need to send the same traffic to multiple applications (fan-out) while staying zero-copy; when you need to aggregate traffic from multiple ports and then distribute it.

ZC Load Balancing - Example zbalance_ipc is an example of a multi-process load-balancing application:
  # zbalance_ipc -i zc:eth1,zc:eth2 -c 99 -n 1,2 -m 1 -g 0
    -i  ingress interfaces
    -c  ZC cluster ID
    -n  egress queues
    -m  hash type
    -g  CPU core
Consumer applications example:
  # taskset -c 1 tcpdump -i zc:99@0
  # nprobe -i zc:99@1 --cpu-affinity 2 [other options]
  # nprobe -i zc:99@2 --cpu-affinity 3 [other options]

ZC Load Balancing and Bro Bro node.cfg example with 8 ZC queues:
  [worker-1]
  type=worker
  host=10.0.0.1
  interface=zc:99          (this is expanded into zc:99@0 .. zc:99@7)
  lb_method=pf_ring
  lb_procs=8
  pin_cpus=0,1,2,3,4,5,6,7

Other processing patterns Using the ZC API you can create any multi-threaded or multi-process processing pattern. Pipeline example: [Diagram: a packet dispatcher (Core 0) receives from a 1/10G port and hands packets, via ZC queues, to App A (e.g. DDoS mitigation) on Core 1, then to App B (e.g. IPS) on Core 2, and finally to a packet forwarder (Core 3) that transmits on the second 1/10G port.] A sketch of the dispatcher stage follows.
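
To make the pattern concrete, here is a rough sketch of the dispatcher loop. Setup is as in the earlier ZC sketch, with an extra SPSC queue created via pfring_zc_create_queue(zc, QUEUE_LEN) and read by App A; QUEUE_LEN and the queue names are placeholders.

  /* in = device queue opened on zc:eth1, out = SPSC queue towards App A */
  while (pfring_zc_recv_pkt(in, &buf, 1 /* wait */) > 0) {
    /* the same buffer handle is handed to the next stage: no packet copy */
    pfring_zc_send_pkt(out, &buf, 0 /* flush and return-value handling omitted */);
  }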

ZC & Virtualisation: PCI Passthrough Any hypervisor is supported: KVM, VMware (DirectPath I/O), Xen, etc. [Diagram: the physical NIC is passed through to the VM, where the app opens zc:eth1 with PF_RING ZC and receives packets via zero-copy DMA, bypassing the guest kernel.]

ZC & Virtualisation: Host to VM (KVM)
  (Host) $ zpipeline_ipc -i zc:eth2,0 -o zc:eth3,1 -n 2 -c 99 -r 0 -t 2 -Q /tmp/qmp0
  (VM)   $ zbounce_ipc -c 99 -i 0 -o 1 -u
[Diagram: on the host, packet forwarders on Core 0 and Core 3 move traffic between the two 1/10G ports and ZC queues; inside the KVM guest, the app (e.g. an IPS) bounces packets between those queues via PF_RING ZC.]

Stack Injection ZC is a kernel-bypass technology: what if we want to forward some traffic to the Linux stack? The stack:eth1 interface lets an application exchange packets with the kernel network stack using pfring_send()/pfring_recv(), while the fast path keeps reading zc:eth1 in zero-copy. [Diagram: the app captures from zc:eth1 via 0-copy DMA and injects selected packets into the network stack through stack:eth1.]
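
A hedged sketch of the idea (assuming the standard libpfring calls; filtering logic and error handling omitted): packets arrive on the ZC interface and the ones of interest are injected into the kernel stack through a second, stack: handle.

  #include <pfring.h>

  pfring *fast  = pfring_open("zc:eth1", 1536, PF_RING_PROMISC);  /* kernel-bypass path */
  pfring *stack = pfring_open("stack:eth1", 1536, 0);             /* injection path     */
  pfring_enable_ring(fast);
  pfring_enable_ring(stack);

  u_char *pkt;
  struct pfring_pkthdr hdr;
  while (pfring_recv(fast, &pkt, 0, &hdr, 1) > 0) {
    /* decide here which packets the kernel should see (e.g. ARP, DNS) */
    pfring_send(stack, (char *) pkt, hdr.caplen, 0 /* flush flag */);
  }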

Thank you!