BPF Hardware Offload Deep Dive

Jakub Kicinski, Netronome Systems, Inc., 2018

BPF Sandbox
- As a goal of the BPF IR, JITing it to most RISC cores should be very easy
- The BPF VM provides a simple and well-understood execution environment
- Designed by Linux kernel-minded architects, making sure no implementation details leak into the definition of the VM and its ABIs
- Unlike higher-level languages, BPF is an intermediate representation (IR) which provides binary compatibility; it is a mechanism
- BPF is extensible through helpers and maps, allowing us to make use of special HW features (when the gain justifies the effort)

BPF Ecosystem
- Kernel infrastructure keeps improving, including the verifier/analyzer and JIT compilers for all common host architectures (like ARM or x86) and some common embedded architectures
- The Linux kernel community is very active in extending performance and improving the BPF feature set, with AF_XDP being a most recent example
- Android APF targets smaller processors in mobile handsets, filtering wake-ups from remote processors (most likely network interfaces) to improve battery life

BPF Universe
[Slide diagram: a user-space program, the loaded BPF bytecode (register loads, arithmetic and conditional jumps), helper calls, and data-storage maps of various types (array, device map, LPM, perf event, hash, socket).]

BPF Universe
What a device can do with the program and maps:
- Translate the program code into the device's native machine code
  - use advanced instructions
  - optimize instruction scheduling
  - optimize I/O
- Provide device-specific implementations of the helpers
- Use hardware accelerators for maps
  - richer memory architectures
  - algorithmic lookup engines
  - TCAMs
- Filter packets directly in the NIC
- Handle advanced switching/routing
- Application-specific packet reception policies
[Same diagram as the previous slide: user-space program, BPF bytecode, helpers, data-storage maps.]

- Optimized for standard server-based cloud data centers
- Based on the Netronome Network Flow Processor 4xxx line
- Low-profile, half-length PCIe form factor for all versions
- Memory: 2 GB DRAM
- <25 W power, typically 15-20 W

SoC Architecture - Conceptual Components
- ASIC-based packet processors, crypto engines, etc.
- Flow processing cores distributed into islands of 12 (up to 7 islands)
- Multi-threaded transactional memory engines and accelerators optimized for network processing
- Up to 100 Gbps
- 4x PCIe Gen 3 x8
- 14 Tbps distributed fabric, a crucial foundation for many-core architectures

NFP SoC Architecture
[Block-diagram slides showing the NFP SoC architecture, with the locations of BPF programs and BPF maps highlighted on the chip.]

NFP SoC Packet Flow
[A sequence of diagram slides stepping through packet flow across the NFP SoC.]

Memory Architecture - Latencies
(x50 BPF workers; 4 threads per 800 MHz core; 6 islands per chip)
- GPRs / xfer regs: 1 cycle
- LMEM (1 KB): 1-3 cycles
- CLS (64 KB): 20-50 cycles
- CTM (256 KB): 50-100 cycles
- IMEM (4 MB): 150-250 cycles
- DRAM (2+ GB): 150-500 cycles

Memory Architecture
[Sequence diagram: threads on a Flow Processing Core issue CPP reads/writes against Cluster Target Memory and yield; a dispatcher pushes and pulls the values back to the requesting threads while other threads keep running.]
The multithreaded transactional memory architecture hides latency.

Kernel Offload - BPF Offload Memory Mapping
[Diagram: the driver maps BPF resources onto the NFP memory hierarchy: the 10 BPF registers (64-bit, with 32-bit subregisters), the 512-byte stack, and maps of varying sizes are placed across GPRs, LMEM (1 KB), CLS (64 KB), CTM (256 KB), IMEM (4 MB) and DRAM (2+ GB); x50 BPF workers, 4 threads per 800 MHz core, 6 islands per chip.]

Programming Model
- The program is written in the standard manner
- LLVM compiles it as normal
- iproute/tc/libbpf loads the program, requesting offload
- nfp_bpf_jit.c converts the BPF bytecode to NFP machine code (and we mean the actual machine code :))
- The translation reuses a significant amount of verifier infrastructure
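As a concrete illustration (not from the slides), a minimal sketch of the kind of restricted-C XDP program that goes through this flow might look as follows. The file name, function name and section names are hypothetical, and section-naming conventions vary between loaders; the build command in the comment is the usual clang invocation for the bpf target.

/* xdp_drop_udp.c - hypothetical example.
 * Build (typical): clang -O2 -target bpf -c xdp_drop_udp.c -o xdp_drop_udp.o
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>

/* Drop IPv4/UDP packets, pass everything else. */
__attribute__((section("xdp"), used))
int xdp_drop_udp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr  *iph = data + sizeof(*eth);

    /* Bounds check keeps the verifier (host or offload) happy;
     * it covers both the Ethernet and the IPv4 header. */
    if ((void *)(iph + 1) > data_end)
        return XDP_PASS;

    /* ETH_P_IP (0x0800) in network byte order, checked byte-wise to
     * stay endian-neutral without pulling in byte-order helpers. */
    unsigned char *proto = (unsigned char *)&eth->h_proto;
    if (proto[0] != 0x08 || proto[1] != 0x00)
        return XDP_PASS;

    return iph->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
}

char _license[] __attribute__((section("license"), used)) = "GPL";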

BPF Object Creation (maps)
1. Get map file descriptors:
   a. For existing maps, get access to a file descriptor:
      i. from bpffs (pinned map): open a pseudo file
      ii. by ID: use the BPF_MAP_GET_FD_BY_ID bpf syscall command
   b. For new maps, use the BPF_MAP_CREATE bpf syscall command:

union bpf_attr {
    struct { /* anonymous struct used by BPF_MAP_CREATE command */
        __u32 map_type;            /* one of enum bpf_map_type */
        __u32 key_size;            /* size of key in bytes */
        __u32 value_size;          /* size of value in bytes */
        __u32 max_entries;         /* max number of entries in a map */
        __u32 map_flags;           /* BPF_MAP_CREATE related flags defined above */
        __u32 inner_map_fd;        /* fd pointing to the inner map */
        __u32 numa_node;           /* numa node (effective only if BPF_F_NUMA_NODE is set) */
        char  map_name[BPF_OBJ_NAME_LEN];
        __u32 map_ifindex;         /* ifindex of netdev to create on */
        __u32 btf_fd;              /* fd pointing to a BTF type data */
        __u32 btf_key_type_id;     /* BTF type_id of the key */
        __u32 btf_value_type_id;   /* BTF type_id of the value */
    };
    ...
};
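A minimal sketch (not from the slides) of creating a device-resident map by filling map_ifindex before BPF_MAP_CREATE; the helper name, map parameters and the interface name "ens4np0" are illustrative assumptions.

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <net/if.h>
#include <linux/bpf.h>

/* Thin wrapper around the bpf(2) syscall. */
static int sys_bpf(int cmd, union bpf_attr *attr)
{
    return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/* Create an array map on the given netdev; returns the map fd or -1. */
int create_offloaded_array(const char *ifname)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_ARRAY;
    attr.key_size    = sizeof(__u32);
    attr.value_size  = sizeof(__u64);
    attr.max_entries = 256;
    attr.map_ifindex = if_nametoindex(ifname);  /* request offload to this netdev */

    return sys_bpf(BPF_MAP_CREATE, &attr);
}

Usage is e.g. create_offloaded_array("ens4np0"); leaving map_ifindex at 0 creates an ordinary host map through the exact same path.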

BPF Object Creation (programs)
1. Get the program instructions;
2. Perform relocations (replace map references with map file descriptors/IDs);
3. Use BPF_PROG_LOAD to load the program;

union bpf_attr {
    struct { /* anonymous struct used by BPF_PROG_LOAD command */
        __u32 prog_type;           /* one of enum bpf_prog_type */
        __u32 insn_cnt;
        __aligned_u64 insns;
        __aligned_u64 license;
        __u32 log_level;           /* verbosity level of verifier */
        __u32 log_size;            /* size of user buffer */
        __aligned_u64 log_buf;     /* user supplied buffer */
        __u32 kern_version;        /* checked when prog_type=kprobe */
        __u32 prog_flags;
        char  prog_name[BPF_OBJ_NAME_LEN];
        __u32 prog_ifindex;        /* ifindex of netdev to prep for */
        /* For some prog types expected attach type must be known at
         * load time to verify attach type specific parts of prog
         * (context accesses, allowed helpers, etc).
         */
        __u32 expected_attach_type;
    };
    ...
};
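As a companion sketch (not from the slides), loading a program for offload with the raw bpf(2) syscall by setting prog_ifindex; the function name is hypothetical, and insns/insn_cnt would normally come out of the ELF object produced by LLVM.

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Load an XDP program bound to the given ifindex; returns prog fd or -1. */
int load_offloaded_prog(const struct bpf_insn *insns, __u32 insn_cnt,
                        int ifindex, char *log, __u32 log_sz)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_type    = BPF_PROG_TYPE_XDP;
    attr.insns        = (__u64)(unsigned long)insns;
    attr.insn_cnt     = insn_cnt;
    attr.license      = (__u64)(unsigned long)"GPL";
    attr.log_buf      = (__u64)(unsigned long)log;
    attr.log_size     = log_sz;
    attr.log_level    = 1;
    attr.prog_ifindex = ifindex;   /* request offload to this netdev */

    return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}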

BPF Object Creation (libbpf)
With libbpf, use the extended attributes to set the ifindex:

struct bpf_prog_load_attr {
    const char *file;
    enum bpf_prog_type prog_type;
    enum bpf_attach_type expected_attach_type;
    int ifindex;
};

int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
                        struct bpf_object **pobj, int *prog_fd);

Normal kernel BPF ABIs are used; opt in to offload by setting the ifindex.
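A short usage sketch (not from the slides) of the libbpf path above; the object file name and interface name are assumptions, and the header paths reflect a typical installed libbpf.

#include <net/if.h>
#include <linux/bpf.h>
#include <bpf/libbpf.h>

/* Load an object file and request offload to "ens4np0"; returns prog fd or -1. */
int load_for_offload(void)
{
    struct bpf_prog_load_attr attr = {
        .file      = "xdp_drop_udp.o",
        .prog_type = BPF_PROG_TYPE_XDP,
        .ifindex   = if_nametoindex("ens4np0"),  /* opt in to offload */
    };
    struct bpf_object *obj;
    int prog_fd;

    if (bpf_prog_load_xattr(&attr, &obj, &prog_fd))
        return -1;
    return prog_fd;
}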

Map Offload
[Diagram: map-related BPF syscall commands (BPF_MAP_CREATE, BPF_MAP_LOOKUP_ELEM, BPF_MAP_UPDATE_ELEM, BPF_MAP_DELETE_ELEM, BPF_MAP_GET_NEXT_KEY) enter kernel/bpf/syscall.c; when the map is offloaded, kernel/bpf/offload.c takes the Linux network config lock and calls into the driver (drivers/net/ethernet/netronome/nfp/bpf/) through the netdevice ops (ndo_bpf) for create/free and through struct bpf_map_dev_ops for lookup, update, delete and get_next_key; the driver forwards these to the device's control message handler.]

Map Offload
- Maps reside entirely in device memory
- Programs running on the host do not have access to offloaded maps, and vice versa (because the host cannot efficiently access device memory)
- The user space API remains unchanged
[Diagram: the kernel with XDP, cls_bpf and bpfilter programs and host maps; the NFP connected over PCIe, holding the offloaded program and offloaded maps in front of the Ethernet ports.]
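To illustrate the unchanged user-space API, a small sketch (not from the slides) using libbpf's syscall wrappers: the same calls work whether the fd refers to a host map or an offloaded one, since the kernel forwards operations on offloaded maps to the device driver. The function name is hypothetical.

#include <bpf/bpf.h>
#include <linux/types.h>

/* Read-modify-write entry 0 of a map through the ordinary syscall API. */
int bump_counter(int offloaded_map_fd)
{
    __u32 key = 0;
    __u64 val = 0;

    if (bpf_map_lookup_elem(offloaded_map_fd, &key, &val))
        return -1;
    val++;
    return bpf_map_update_elem(offloaded_map_fd, &key, &val, BPF_ANY);
}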

Map Offload
Each map in the kernel has a set of ops associated with it:

/* map is generic key/value storage optionally accessible by ebpf programs */
struct bpf_map_ops {
    /* funcs callable from userspace (via syscall) */
    int (*map_alloc_check)(union bpf_attr *attr);
    struct bpf_map *(*map_alloc)(union bpf_attr *attr);
    void (*map_release)(struct bpf_map *map, struct file *map_file);
    void (*map_free)(struct bpf_map *map);
    int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);

    /* funcs callable from userspace and from ebpf programs */
    void *(*map_lookup_elem)(struct bpf_map *map, void *key);
    int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 flags);
    int (*map_delete_elem)(struct bpf_map *map, void *key);
};

- Each map type (array, hash, LRU, LPM, etc.) has its own set of ops which implement the map-specific logic
- If map_ifindex is set, the ops are pointed at an empty set of offload ops (bpf_map_offload_ops) regardless of the type
- Only calls from user space will then be allowed

Program Offload
- The kernel verifier performs verification and some of the common JIT steps for the host architecture
- For offload these steps cause loss of context information and are incompatible with the target
- The device translator is therefore allowed to access the loaded program as-is:
  - IDs/offsets not translated: structure field offsets, functions, map IDs
  - no prologue/epilogue injected
  - no optimizations made
- For offloaded programs the verifier skips the extra host-centric rewrites.

Program Offload Lifecycle
[Diagram: the BPF syscall path (kernel/bpf/syscall.c, kernel/bpf/verifier.c, kernel/bpf/core.c) drives kernel/bpf/offload.c, which calls the NFP driver through the netdevice ops (ndo_bpf):]
- bpf_prog_offload_init(): allocate data structures tracking the association with the offload device
- bpf_prog_offload_verifier_prep() -> BPF_OFFLOAD_VERIFIER_PREP: allocate and construct driver-specific program data structures; nfp_verify_insn() serves as the per-instruction verification callback, performing extra device-specific checks and gathering context information
- bpf_prog_offload_translate() -> BPF_OFFLOAD_TRANSLATE: run optimizations and machine code generation
- bpf_prog_offload_destroy() -> BPF_OFFLOAD_DESTROY: free all data structures and the machine code image

Program Offload
- After the program has been loaded into the kernel, the subsystem-specific handling remains unchanged
- For network programs, the offloaded program can be attached to device ingress via XDP (BPF_PROG_TYPE_XDP) or cls_bpf (BPF_PROG_TYPE_SCHED_CLS)
- The program can be attached to any of the ports of the device for which it was loaded
- Actually loading the program into device memory only happens when it is attached
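A sketch (not from the slides) of the XDP attach step, assuming libbpf's bpf_set_link_xdp_fd() helper is available; XDP_FLAGS_HW_MODE selects the hardware attach point, and the function name here is illustrative. The rough iproute2 equivalent is "ip link set dev <if> xdpoffload obj <file>".

#include <net/if.h>
#include <linux/if_link.h>   /* XDP_FLAGS_HW_MODE */
#include <bpf/libbpf.h>      /* bpf_set_link_xdp_fd() */

/* Attach an already loaded, device-bound program to a port of that device. */
int attach_offloaded(const char *ifname, int prog_fd)
{
    int ifindex = if_nametoindex(ifname);

    if (!ifindex)
        return -1;
    return bpf_set_link_xdp_fd(ifindex, prog_fd, XDP_FLAGS_HW_MODE);
}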

BPF Offload - Summary
- The BPF VM/sandbox is well suited to a heterogeneous processing engine
- BPF offload allows loading a BPF program onto a device instead of the host CPU
- All user space tooling and ABIs remain unchanged; no vendor-specific APIs or SDKs
- BPF offload is part of the upstream Linux kernel (a recent kernel is required)
- BPF programs loaded onto the device can take advantage of HW accelerators such as HW memory lookup engines
- Try it out today on standard NFP server cards (academic pricing available on open-nfp.org)
- Reach out with BPF-related questions:
  https://help.netronome.com/a/forums/
  https://groups.google.com/forum/#!forum/open-nfp
  xdp-newbies@vger.kernel.org
- Register for the next webinar in this series!