Networking at the Speed of Light
Dror Goldenberg, VP Software Architecture
MaRS Workshop, April 2017
Cloud: The Software Defined Data Center
- Resource virtualization: vCPU, vMem, vStorage (SDS), vAccelerator, vNetwork (SDN)
- Efficient services: VMs, containers, microservices, scale-out architectures
- DevOps automation
- Tenant isolation & security
- Visibility and telemetry
© 2017 Mellanox Technologies
Networking at the Heart of the Datacenter
- Workloads: HPC, Cloud, Web 2.0, Database, Storage, Big Data, Artificial Intelligence
- The intelligent network connects compute, storage, GPUs, and FPGAs
Feeds and Speeds: One Generation Ahead
- Mellanox roadmap of data speed: 10G, 20G, 40G, 56G, 100G, 200/400G across 2000-2020
- Enabling the use of data
200 Gigabit Ethernet: the per-packet time budget
- 1518B packets: 16M packets per second, 62ns per packet
- 64B packets: 298M packets per second, 3.3ns per packet
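The slide's numbers fall out of simple wire-rate arithmetic. A minimal sketch, assuming standard Ethernet per-frame overhead (7B preamble + 1B SFD + 12B inter-frame gap; the 1518B/64B frame sizes already include the 4B FCS):

```python
# Packet-rate arithmetic behind the 200GbE budget numbers.
LINK_BPS = 200e9          # 200 Gigabit Ethernet
OVERHEAD = 7 + 1 + 12     # extra bytes on the wire per frame

def packets_per_second(frame_bytes, link_bps=LINK_BPS):
    wire_bits = (frame_bytes + OVERHEAD) * 8
    return link_bps / wire_bits

for size in (1518, 64):
    pps = packets_per_second(size)
    print(f"{size}B: {pps/1e6:.0f} Mpps, {1e9/pps:.2f} ns/packet")
```

This reproduces the slide's ~16M pps / ~62ns and ~298M pps / ~3.3ns figures to within rounding.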
Network Design Tradeoffs
A spectrum from extreme packet rate to full programmability:
- Switch: extreme performance (PPS), rigid
- NIC (accelerator model): partially programmable
- Smart NIC / FPGA (co-processor model): partially programmable
- NPU: programmable, network-optimized
- x86: moderate performance (PPS), fully programmable
Switch Hardware Design Challenges
- Achieving line rate: the ability to process any packet size at line rate; falling short causes a long tail for applications, data loss, etc.
- Fairness: one shared buffer vs multiple buffers; getting this wrong causes unequal bandwidth per ingress port
- (Charts: bandwidth in % and Gb/s vs packet size, 64B-9216B, comparing Tomahawk and Spectrum)
200 Gigabit Ethernet: the CPU perspective
Software budget per packet, in operations:

  Operation          1518B packet   64B packet
  L3 cache access    5.4            0.295
  Spin lock/unlock   3.8            0.21
  Syscall            1.45           0.08
  Memory access      0.85           0.05
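The table above is the packet time budget divided by per-operation cost. A sketch of the arithmetic, where the operation latencies are illustrative assumptions (roughly what the slide's bars imply, not measured values):

```python
# How many of each CPU operation fit in one packet's time budget at 200GbE.
# Per-operation latencies below are assumptions for illustration.
OP_LATENCY_NS = {
    "L3 cache access": 11.5,
    "spin lock/unlock": 16.3,
    "syscall": 42.8,
    "memory access": 73.0,
}

def ops_per_packet(budget_ns):
    return {op: budget_ns / lat for op, lat in OP_LATENCY_NS.items()}

for frame, budget_ns in (("1518B", 61.5), ("64B", 3.36)):
    print(frame, {op: round(n, 2) for op, n in ops_per_packet(budget_ns).items()})
```

With these assumed latencies the result matches the slide: about 5.4 L3 accesses but only 1.45 syscalls fit in a 1518B packet's budget, and almost nothing fits at 64B.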
200 Gigabit Storage Trends
- Storage access latency spans orders of magnitude: HDD → SAS SSD → NVMe → NVDIMM
- Storage at 200GbE: 6M I/Os per second (4KB I/Os), 167ns per I/O
- Storage I/O software budget, in operations per I/O: 14.45 L3 cache accesses, 10.175 spin lock/unlock, 3.9 syscalls, 2.35 memory accesses
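The storage-side budget is the same arithmetic applied to 4KB I/Os. A minimal sketch, ignoring protocol overhead (which is why the raw result comes out slightly under the slide's 167ns):

```python
# 200GbE moving 4KB I/Os: how many IOPS, and how long per I/O.
LINK_BPS = 200e9
IO_BYTES = 4096

iops = LINK_BPS / (IO_BYTES * 8)   # ~6.1M I/O per second
ns_per_io = 1e9 / iops             # ~164 ns per I/O (raw payload only)
print(f"{iops/1e6:.1f}M IOPS, {ns_per_io:.0f} ns per I/O")
```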
Building a Balanced Element for the Scale-Out Storage System
- Network I/O should match disk I/O
Addressing the Speed Challenges
- Hardware offloads
- New protocols
- New APIs
- New programming models
Hardware Offloads
- Scale to multiple CPU cores; reduce CPU cycles per operation; improve CPU efficiency (NUMA, affinity, etc.)
- Examples: RDMA, stateless offloads, kernel bypass, SR-IOV, T10/DIF offload, erasure code offload, collective offloads
Single Root I/O Virtualization (SR-IOV):
- Direct assignment of devices to VMs via virtual functions (VFs), versus software vNICs through the hypervisor vswitch on a legacy NIC
- Improves VM I/O efficiency and avoids hypercalls
- Virtual switch offloads through the NIC eSwitch (PF/VF)
- Enables advanced protocols: ASAP2 overlay network offloads
Ethernet Performance Optimizations: Driver Features

  Feature                          Description
  Checksum, TSO, RSS, VXLAN        Standard NIC offloads
  Cache line alignment             Write integral cache lines over PCIe; partial cache line writes waste up to 40% of the throughput
  LRO HW offload                   Aggregate received TCP segments into a larger packet; reduces CPU RX overhead
  Striding RX queue                A single descriptor handles multiple packets in the same virtually contiguous buffer; optimizes memory footprint and RX CPU utilization (replenishing WQEs and buffers)
  CQE compression                  Merge several completion descriptors (CQEs) into a single CQE; optimizes PCIe bandwidth and CPU utilization
  CQE-based interrupt moderation   Completion event moderation timed from the last completion; obtains optimal latency and bandwidth
  DIM                              Dynamically tuned interrupt moderation algorithm (per core): latency-optimized under light load, throughput-optimized under full load
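The DIM row describes a feedback loop. A toy sketch of the idea, not the actual mlx5 driver algorithm: per measurement period, compare the observed packet rate with the previous period and step the coalescing level up (rising load, favor throughput) or down (falling load, favor latency). The level table and thresholds are hypothetical:

```python
# Illustrative dynamic interrupt moderation (DIM) loop.
LEVELS_USEC = [0, 8, 16, 64, 128, 256]   # hypothetical coalescing timeouts

class Dim:
    def __init__(self):
        self.level = 0
        self.prev_pps = 0

    def update(self, pps):
        if pps > self.prev_pps * 1.1:        # load rising -> coalesce more
            self.level = min(self.level + 1, len(LEVELS_USEC) - 1)
        elif pps < self.prev_pps * 0.9:      # load falling -> favor latency
            self.level = max(self.level - 1, 0)
        self.prev_pps = pps
        return LEVELS_USEC[self.level]

dim = Dim()
for rate in (1e6, 5e6, 20e6, 20e6, 2e6):
    print(dim.update(rate))                  # 8, 16, 64, 64, 16
```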
ASAP2: Accelerated Switching And Packet Processing
- SDN data path acceleration: virtual switch hardware offload
- Improved performance while maintaining the same management model
OVS over DPDK vs. ASAP2: Performance

  Test              ASAP2 Direct   OVS over DPDK   Benefit
  1 flow, VLAN      33M PPS        7.6M PPS        4.3x
  60K flows, VXLAN  16.4M PPS      1.9M PPS        8.6x

- Zero CPU utilization on the hypervisor, compared to 4 dedicated cores with OVS over DPDK
- Same CPU load on the VM
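What both the software and hardware datapaths implement is a per-flow match/action cache: the first packet of a flow takes a slow path through full classification, and later packets hit the cached action. ASAP2 moves this cache into the NIC eSwitch. A toy sketch with illustrative names:

```python
# Minimal exact-match flow cache, as used by an OVS datapath or a NIC eswitch.
class FlowCache:
    def __init__(self):
        self.cache = {}
        self.misses = 0

    def lookup(self, five_tuple):
        action = self.cache.get(five_tuple)
        if action is None:
            self.misses += 1                 # slow path: full classification
            action = "forward"               # placeholder decision
            self.cache[five_tuple] = action  # install for later packets
        return action

fc = FlowCache()
pkts = [("10.0.0.1", "10.0.0.2", 6, 1234, 80)] * 3 + \
       [("10.0.0.3", "10.0.0.2", 6, 999, 443)]
for p in pkts:
    fc.lookup(p)
print(fc.misses)  # 2 distinct flows -> 2 slow-path hits
```

The 60K-flow VXLAN case stresses exactly this cache, which is why the hardware offload's benefit grows from 4.3x to 8.6x there.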
Clouded Datacenter Security
- Perimeter security
- Crypto
- DDoS prevention
- Stateful firewalls
- DPI and microsegmentation
- Anomaly detection
Virtualized servers attach VMs to VFs on a secure Mellanox NIC eSwitch, with a safe channel into the secure network core.
IPsec Acceleration
- Inline crypto for IPsec: transparent acceleration of IPsec workloads
- Same management model (software compatible)
- Line rate: 38Gb/s (Innova) vs 13Gb/s (software): ~3x speedup at 1/4 the CPU demand
- Stack: security applications (iproute2, libreswan, strongswan, etc.) → Linux kernel IP XFRM & offload API → Mellanox IPsec kernel module → Mellanox NIC driver → Mellanox FPGA accelerator; DPDK applications go through the Mellanox poll mode driver
- Charts (CPU% per 1GbE on an IP gateway, Tx and Rx on a 40G link, 1-4 IPsec streams) show ~4x CPU utilization decrease with hardware IPsec vs IPsec in the CPU
TLS/SSL Acceleration
- Inline crypto for TLS, integrated with kernel TLS (kTLS)
- Offloads the symmetric crypto (AES-GCM); the software stack is unchanged (even simplified)
- 1.6x bandwidth and 40% lower CPU with TLS acceleration
- Data path: TLS records (plaintext byte stream) → TCP segments of plaintext TLS records → NIC → TCP segments of ciphertext TLS records → network
New Protocols
- Reduce protocol stack overheads; add new functionality and operations
- Examples: RDMA over Converged Ethernet (RoCE), NVMe over Fabrics, remote CUDA, overlay networks
NVMe over Fabrics (block storage protocol):
- Lightweight protocol stack that bypasses storage stack layers: app → virtual filesystem → block layer → NVMe over Fabrics → RDMA verbs, versus the iSCSI/iSER path through the SCSI mid layer
- Native mapping onto storage devices; native multi-queueing
- Low latency using RDMA (<10us additional latency vs native)
- More HW offloads
Breaking the Latency Barrier
- Measurement: ping-pong test between hosts A and B (post, poll); latency = T_roundtrip / 2
- Chart: ConnectX family latency (ns) over InfiniBand and Ethernet (RoCE), improving from ConnectX-4 to ConnectX-5 to ConnectX-5 Ex
- Can we improve 2x?
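The ping-pong methodology reduces to halving a roundtrip time, usually aggregated over many iterations to reject scheduling noise. A minimal sketch with made-up sample values:

```python
# Half-roundtrip latency estimation, as in the slide's ping-pong test.
from statistics import median

def one_way_latency_ns(roundtrip_samples_ns):
    # Median (or minimum) over many iterations rejects outliers from
    # scheduler preemption and interrupts.
    return median(roundtrip_samples_ns) / 2

samples = [1620, 1600, 1710, 1590, 1605]   # illustrative roundtrip times, ns
print(one_way_latency_ns(samples))         # -> 802.5
```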
Optimize Higher-Level Operations: 2x Latency Improvement
- Data reduction (all-reduce) is commonly used in scientific applications and deep learning
- The switch aggregates data from the hosts and returns the aggregated result, instead of the hosts reducing it themselves
- Charts: 128-node all-reduce latency (usec) for 8B-128B messages, host-based vs SHARP, including the pipelined variant; SHARP improves latency by roughly 2x
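A toy model of the in-network aggregation idea: each switch level reduces its children's contributions before forwarding a single value up the tree, so hosts never exchange raw data with each other. This illustrates the concept only, not the SHARP protocol itself:

```python
# Toy switch-tree reduction: each pass models one level of switches, each
# summing the contributions of up to `radix` children.
def tree_allreduce(values, radix=4):
    level = list(values)
    while len(level) > 1:
        level = [sum(level[i:i + radix]) for i in range(0, len(level), radix)]
    return level[0]                # the root holds the aggregated result

hosts = list(range(1, 129))        # 128 hosts contribute values 1..128
print(tree_allreduce(hosts))       # -> 8256
```

With radix 4, 128 hosts need only log4(128) ≈ 4 switch levels, which is why the latency gain is largely independent of node count.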
SHArP is a Component of the Intelligent Interconnect
- CPU-centric: limited to main CPU usage, which results in performance limitations; must wait for the data, creating performance bottlenecks
- Data-centric: work on the data as it moves; creating synergies enables higher performance and scale
New APIs
- Reduce protocol stack overheads; enable user-mode protocol stacks
- Add new functionality and operations; enable offloading
- Examples: DPDK, UCX, RDMA (RoCE and InfiniBand), GPUDirect
RDMA properties: kernel bypass, message semantics, zero copy, protocol offload, batching of operations, polling and interrupts
Stack: RDMA-aware app (or app over ULP/middleware) → RDMA interface & infrastructure → RDMA verbs provider
DPDK: Efficient User-Mode Access for Ethernet
- Control path through user verbs (libibverbs + libmlx5) into the kernel mlx5 verbs provider (mlx5_ib / mlx5_core); the mlx5 PMD data path goes directly to the ConnectX NIC via the PRM interface, with HW memory protection
- New APIs, open source, NFV-optimized; user mode enables extreme packet rates
- Optimizations: minimal CPU cycles per packet, batching, RSS, cache-line-aware data structures and descriptors, vectored operations, TSO, flow classification, VLAN, etc.
- Chart: ConnectX-5 Ex 100GbE frame rate on 15 cores: 131.74 Mpps at 64B, 84.18 at 128B, then line rate from 256B up (45.23, 23.44, 11.97, 9.61, 8.12 Mpps at 256-1518B)
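The chart's "line rate" ceiling follows from the same wire-rate arithmetic as before, here at 100GbE. A quick sketch (assuming 20B of preamble/SFD/IFG overhead per frame) showing why everything from 256B up simply saturates the link:

```python
# Theoretical line-rate frame rates for 100GbE at the slide's frame sizes.
def line_rate_mpps(frame_bytes, link_bps=100e9, overhead=20):
    return link_bps / ((frame_bytes + overhead) * 8) / 1e6

for size in (64, 128, 256, 512, 1024, 1280, 1518):
    print(f"{size}B: {line_rate_mpps(size):.2f} Mpps")
```

At 64B the theoretical line rate is ~148.8 Mpps, so the measured 131.74 Mpps is about 89% of line rate; at 256B and above the measured numbers equal the theoretical ceiling.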
New Programming Models: NPU
- General-purpose CPU (x86): ~3GHz, 16 cores, tens of threads, huge caches; extreme single-thread performance, minimal parallelism
- NPU: ~1GHz, 256 cores, thousands of threads, small caches; low single-thread performance, massive parallelism
NPU Programming 101
General-purpose CPU:
- Targets any application; tens of threads of general-purpose code
- A limited number of large, heavy apps; a flow per core
- Standard memory subsystem with huge caches
NPU:
- Targets L2-L7 network applications, scalable to millions of flows
- Thousands of threads with massive parallelism; custom code with a network-oriented ISA, accelerators, and more
- 100M lightweight, data-driven applications (i.e., flows); a flow spans many cores
- HW sprays RX packets across many cores, with TX reorder
- Network-optimized memory system: small caches; extreme threading and high memory bandwidth compensate for memory access latency
Example NPU performance: 500Gb/s stateful firewall; 400Gb/s deep packet inspection (15x vs x86)
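The "HW RX packet spray" model relies on a stable hash of the flow's 5-tuple: one flow's packets always land on the same core (preserving order), while distinct flows spread across all cores. A minimal RSS-style sketch; the hash choice and core count are illustrative:

```python
# Stable 5-tuple hashing to spread flows over cores (RSS-style spray).
from zlib import crc32

def pick_core(five_tuple, n_cores=256):
    key = repr(five_tuple).encode()
    return crc32(key) % n_cores     # deterministic: same flow -> same core

flow = ("10.0.0.1", "10.0.0.2", 6, 1234, 80)
cores = {pick_core(flow) for _ in range(100)}
print(len(cores))                   # -> 1 (every packet of the flow agrees)
```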
Summary
- I/O performance must keep up with compute and storage development
- CPUs require accelerators and co-processors for efficient network and storage processing
- Software-defined infrastructure provides flexibility, which can be hardware accelerated
- Adding intelligence to the network enables efficient application scaling
Thank You!