Networking at the Speed of Light

Networking at the Speed of Light. Dror Goldenberg, VP Software Architecture, Mellanox Technologies. MaRS Workshop, April 2017.

Cloud: the software-defined data center. Resource virtualization and efficient services: VMs, containers, microservices, scale-out architectures, DevOps automation, tenant isolation and security, visibility and telemetry. Every resource is virtualized: vMem, vCPU, vStorage (SDS), vAccelerator, vNetwork (SDN).

Networking at the heart of the data center: an intelligent network connects compute, storage, GPUs, and FPGAs across HPC, cloud, Web 2.0, database, storage, big data, and artificial intelligence workloads.

Feeds and speeds, one generation ahead. The Mellanox data-speed roadmap runs from 10G through 20G, 40G, 56G, and 100G to 200/400G over roughly 2000 to 2020, enabling the use of data.

200 Gigabit Ethernet: with 1518B packets, about 16M packets per second, or 62 ns per packet; with 64B packets, about 298M packets per second, or 3.3 ns per packet.
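
The packet-rate arithmetic behind these figures is simple enough to check. A minimal sketch in C, assuming the usual 20 bytes of per-frame overhead (preamble, start-of-frame delimiter, inter-frame gap):

```c
/* Sketch: per-packet time budget at a given line rate.
 * Assumes 20 bytes of Ethernet overhead per frame
 * (preamble + SFD + inter-frame gap), which is how the
 * 16M / 298M pps figures on the slide work out. */
#include <stdio.h>

static void budget(double gbps, unsigned frame_bytes)
{
    const unsigned overhead = 20;                 /* preamble + SFD + IFG */
    double bits_on_wire = (frame_bytes + overhead) * 8.0;
    double pps = gbps * 1e9 / bits_on_wire;       /* packets per second */
    double ns_per_pkt = 1e9 / pps;                /* time budget per packet */
    printf("%4uB @ %.0f Gb/s: %6.1f Mpps, %5.1f ns/packet\n",
           frame_bytes, gbps, pps / 1e6, ns_per_pkt);
}

int main(void)
{
    budget(200.0, 1518);   /* ~16 Mpps,  ~62 ns  */
    budget(200.0, 64);     /* ~298 Mpps, ~3.3 ns */
    return 0;
}
```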

Network design tradeoffs: a spectrum from extreme packet-per-second performance to full programmability. Switches sit at the rigid, extreme-performance end; the accelerator model (NIC plus FPGA) and the co-processor model (smart NIC with an NPU) are partially programmable; x86 is fully programmable but delivers moderate packets per second.

Switch hardware design challenges (Tomahawk vs. Spectrum). Achieving line rate: the ability to process any packet size at line rate; falling short means a long tail for applications, data loss, etc. Fairness: one shared buffer vs. multiple buffers; the impact is unequal bandwidth per ingress port. [Charts: bandwidth in % and in Gb/s vs. packet size from 64B to 9216B for Tomahawk and Spectrum.]

200 Gigabit Ethernet from the CPU's perspective: how many operations fit in the per-packet software budget. For a single 1518B packet (about 62 ns), roughly 5.4 L3 cache accesses, 3.8 spin lock/unlock pairs, 1.45 syscalls, or 0.85 memory accesses. For a single 64B packet (about 3.3 ns), roughly 0.3 L3 cache accesses, 0.2 spin lock/unlock pairs, 0.08 syscalls, or 0.05 memory accesses.

Storage trends at 200GbE: about 6M I/Os per second with 4KB I/Os, i.e. roughly 167 ns per I/O. Storage access latency keeps dropping by orders of magnitude from HDD and SAS SSD to NVMe and NVDIMM. The per-I/O software budget is correspondingly tight: roughly 14.45 L3 cache accesses, 10.2 spin lock/unlock pairs, 3.9 syscalls, or 2.35 memory accesses.
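
The same back-of-the-envelope calculation applies to storage: at 200 Gb/s of 4KB payloads, the per-I/O budget lands near the slide's figure. A minimal sketch (payload only, ignoring protocol framing):

```c
/* Sketch: I/O rate and per-I/O time budget at 200 Gb/s with 4KB I/Os. */
#include <stdio.h>

int main(void)
{
    double gbps = 200.0;
    double io_bytes = 4096.0;                       /* 4KB I/O */
    double iops = gbps * 1e9 / (io_bytes * 8.0);    /* ~6M I/Os per second */
    printf("%.1fM IOPS, %.0f ns per I/O\n", iops / 1e6, 1e9 / iops);
    return 0;
}
```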

Building a balanced element for a scale-out storage system: network I/O should match disk I/O.

Addressing the speed challenges: hardware offloads, new protocols, new APIs, and new programming models.

Hardware offloads: scale across multiple CPU cores, reduce CPU cycles per operation, and improve CPU efficiency (NUMA, affinity, etc.). Examples: RDMA, stateless offloads, kernel bypass, SR-IOV, T10/DIF offload, erasure-code offload, collective offloads. Single Root I/O Virtualization (SR-IOV) directly assigns device functions to VMs, improving VM I/O efficiency and avoiding hypercalls: instead of each VM's software vNIC passing through the hypervisor vswitch, each VM gets a virtual function (VF) on the NIC's embedded switch (eswitch), alongside the physical function (PF). This also enables virtual-switch offloads, advanced protocols, and ASAP2 overlay-network offloads.
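
On Linux, SR-IOV virtual functions are typically created by writing the desired VF count to the device's sriov_numvfs sysfs attribute. A minimal sketch, assuming a hypothetical interface name enp3s0f0, root privileges, and SR-IOV enabled in the NIC firmware:

```c
/* Sketch: enable N SR-IOV virtual functions via sysfs.
 * The interface name is an assumption; requires root and a NIC/driver
 * with SR-IOV enabled. To change an existing VF count, write 0 first. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *path = "/sys/class/net/enp3s0f0/device/sriov_numvfs";
    int num_vfs = 4;

    FILE *f = fopen(path, "w");
    if (!f) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    fprintf(f, "%d\n", num_vfs);   /* kernel creates the VFs on write */
    fclose(f);
    printf("requested %d VFs via %s\n", num_vfs, path);
    return 0;
}
```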

Ethernet performance optimizations: driver features and what they do. Checksum, TSO, RSS, VXLAN: standard NIC offloads. Cache-line alignment: write integral cache lines over PCIe; partial cache-line writes waste up to 40% of the throughput. LRO hardware offload: aggregate received TCP segments into a larger packet to reduce RX CPU overhead. Striding RX queue: a single descriptor handles multiple packets in the same virtually contiguous buffer, optimizing memory footprint and RX CPU utilization (replenishing WQEs and buffers). CQE compression: merge several completion descriptors (CQEs) into a single CQE, optimizing PCIe bandwidth and CPU utilization. CQE-based interrupt moderation: completion-event moderation based on a timer from the last completion, balancing latency and bandwidth. DIM: a dynamically tuned, per-core interrupt-moderation algorithm, latency-optimized at low load and throughput-optimized at full load.
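
Several of these offloads can be inspected from user space through the ethtool interface. A minimal sketch using the legacy ethtool ioctl to read the GRO and RX-checksum state; the interface name is an assumption, and production tooling would normally use ethtool(8) or the netlink API instead:

```c
/* Sketch: query the state of a couple of standard offloads (GRO,
 * RX checksum) through the legacy ethtool ioctl. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <unistd.h>

static int get_flag(int fd, const char *ifname, __u32 cmd, __u32 *out)
{
    struct ethtool_value eval = { .cmd = cmd };
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = (void *)&eval;
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
        return -1;
    *out = eval.data;
    return 0;
}

int main(void)
{
    const char *ifname = "enp3s0f0";   /* assumed interface name */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    __u32 gro = 0, rxcsum = 0;

    if (fd < 0) { perror("socket"); return 1; }
    if (get_flag(fd, ifname, ETHTOOL_GGRO, &gro) == 0)
        printf("GRO: %s\n", gro ? "on" : "off");
    if (get_flag(fd, ifname, ETHTOOL_GRXCSUM, &rxcsum) == 0)
        printf("RX checksum offload: %s\n", rxcsum ? "on" : "off");
    close(fd);
    return 0;
}
```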

ASAP2 (Accelerated Steering and Packet Processing): SDN data-path acceleration and virtual-switch hardware offload, improving performance while keeping the same management model.

OVS over DPDK vs. ASAP2 performance. Test with 1 flow over VLAN: 33M PPS with ASAP2 Direct vs. 7.6M PPS with OVS over DPDK, a 4.3x benefit. Test with 60K flows over VXLAN: 16.4M PPS vs. 1.9M PPS, an 8.6x benefit. Zero CPU utilization on the hypervisor with the offload, compared with 4 dedicated cores for OVS over DPDK, at the same CPU load on the VM.

Security moves from the perimeter into the cloud data center: crypto, DDoS prevention, stateful firewalls, DPI and micro-segmentation, anomaly detection. [Diagram: virtualized servers whose VMs sit on VFs of secure Mellanox NICs with eswitches, connected over safe channels to a secure core network.]

IPsec acceleration: inline crypto for IPsec transparently accelerates IPsec workloads with the same management (software compatible). Line rate of 38 Gb/s with the Innova adapter vs. 13 Gb/s in software, about a 3x speedup at a quarter of the CPU demand. The stack spans security applications (iproute2, libreswan, strongSwan, etc.) over either the kernel path (Linux IP/XFRM offload API, the Mellanox IPsec kernel module, and the NIC driver) or the DPDK path (Mellanox poll-mode driver), on top of the Mellanox FPGA accelerator and NIC. [Charts: CPU% per 1GbE on an IP gateway, Tx and Rx on a 40G link, for 1 to 4 IPsec streams, comparing no IPsec, hardware IPsec, and IPsec on the CPU.] With the Mellanox hardware path, CPU utilization drops by roughly 4x compared with IPsec on the CPU.

TLS/SSL acceleration: inline crypto for TLS integrates with kernel TLS (kTLS) and offloads the symmetric crypto (AES-GCM), leaving the software stack unchanged (even simplified). Result: about 1.6x the bandwidth and 40% lower CPU with TLS acceleration. On the data path, the application writes a plaintext byte stream, kTLS produces TLS records, TCP segments the plaintext records, and the NIC encrypts them so that ciphertext TLS records go out on the network.
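
Kernel TLS is enabled per socket by attaching the TLS upper-layer protocol and handing the kernel the negotiated session keys; with a capable NIC and driver, record encryption is then offloaded inline. A minimal TX-side sketch, assuming a connected TCP socket and AES-128-GCM key material already produced by a userspace handshake (the key fields here are zeroed placeholders):

```c
/* Sketch: enable kernel TLS (TX) on an existing, connected TCP socket.
 * Assumes a TLS 1.2 handshake was done in userspace and produced the
 * AES-128-GCM key, IV, salt and record sequence number (placeholders
 * here, not usable values). */
#include <string.h>
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/tcp.h>
#include <linux/tls.h>

#ifndef SOL_TLS
#define SOL_TLS 282   /* value from the kernel's socket level definitions */
#endif

int enable_ktls_tx(int sock)
{
    struct tls12_crypto_info_aes_gcm_128 ci;

    memset(&ci, 0, sizeof(ci));
    ci.info.version = TLS_1_2_VERSION;
    ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
    /* memcpy(ci.key,     ..., TLS_CIPHER_AES_GCM_128_KEY_SIZE);     */
    /* memcpy(ci.iv,      ..., TLS_CIPHER_AES_GCM_128_IV_SIZE);      */
    /* memcpy(ci.salt,    ..., TLS_CIPHER_AES_GCM_128_SALT_SIZE);    */
    /* memcpy(ci.rec_seq, ..., TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE); */

    /* Attach the TLS ULP, then install the TX crypto state. */
    if (setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")) < 0) {
        perror("TCP_ULP tls");
        return -1;
    }
    if (setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci)) < 0) {
        perror("TLS_TX");
        return -1;
    }
    /* From here, plain send()/write() emits TLS records; if the NIC
     * supports inline TLS offload, the driver encrypts them in hardware. */
    return 0;
}
```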

New protocols reduce protocol-stack overheads and add new functionality and operations. Examples: RDMA over Converged Ethernet (RoCE), NVMe over Fabrics, remote CUDA, overlay networks. NVMe over Fabrics is a block storage protocol with a lightweight stack: instead of application, virtual filesystem, block layer, SCSI mid-layer, and iSCSI/iSER over RDMA verbs, the path is application, virtual filesystem, block layer, and NVMe over Fabrics directly over RDMA verbs. It bypasses storage-stack layers, maps natively onto storage devices, supports native multi-queueing, achieves low latency using RDMA (under 10 µs of additional latency vs. native), and enables more hardware offloads.

Breaking the latency barrier. Latency is measured with a ping-pong test between two hosts, A and B: one side posts a send, the other polls and replies, and latency = T_roundtrip / 2. [Chart: ConnectX family latency in ns, from around 900 ns down, for ConnectX-4, ConnectX-5, and ConnectX-5 Ex over InfiniBand and Ethernet (RoCE).] Can we improve by another 2x?
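
The ping-pong methodology itself is simple: timestamp before sending, timestamp after the matching reply, and halve the round trip. A minimal UDP sketch of the client side, with a placeholder server address; RDMA latency benchmarks apply the same T_roundtrip/2 logic over verbs rather than sockets:

```c
/* Sketch: measure one-way latency as half the round-trip time of a
 * small UDP ping-pong against an echo server. Address and port are
 * assumptions for illustration. */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void)
{
    struct sockaddr_in srv = { .sin_family = AF_INET,
                               .sin_port = htons(7) };   /* echo port, assumed */
    char buf[64] = "ping";
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    int iters = 10000;
    double t0, t1, best = 1e18;

    if (s < 0) { perror("socket"); return 1; }
    inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);       /* placeholder address */
    for (int i = 0; i < iters; i++) {
        t0 = now_ns();
        sendto(s, buf, sizeof(buf), 0, (struct sockaddr *)&srv, sizeof(srv));
        recv(s, buf, sizeof(buf), 0);                     /* wait for the echo */
        t1 = now_ns();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    printf("best RTT %.0f ns, one-way latency ~%.0f ns\n", best, best / 2);
    close(s);
    return 0;
}
```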

Optimizing a higher-level operation for a 2x latency improvement: all-reduce, a data-reduction operation commonly used in scientific applications and deep learning. Instead of hosts exchanging and aggregating the data themselves, the switches aggregate data as it moves through the network and deliver the aggregated result to the hosts. [Charts: 128-node all-reduce and pipelined all-reduce latency in µs for 8B to 128B messages, host-based vs. SHARP, with the improvement factor.]
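
In MPI terms this is an all-reduce; SHARP moves the reduction from the hosts into the switches while the application call stays the same. A minimal example using standard MPI (whether the reduction is offloaded depends on the fabric and MPI library, not on this code):

```c
/* Sketch: a standard all-reduce. With a SHARP-capable fabric and MPI,
 * the sum can be computed in the switches instead of on the hosts. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = (double)rank;                 /* each rank contributes one value */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum across ranks = %f\n", global);
    MPI_Finalize();
    return 0;
}
```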

SHArP is a component of the intelligent interconnect, part of the move from CPU-centric to data-centric designs. CPU-centric: limited to main-CPU usage, which results in performance limitations; the CPU must wait for the data, creating performance bottlenecks. Data-centric: work on the data as it moves and create synergies, enabling higher performance and scale.

New APIs reduce protocol-stack overheads, enable user-mode protocol stacks, add new functionality and operations, and enable offloading. Examples: DPDK, UCX, RDMA (RoCE and InfiniBand), GPUDirect. An RDMA-aware application, or an application behind ULP/middleware, uses the RDMA interface and infrastructure and an RDMA verbs provider, gaining kernel bypass, message semantics, zero copy, protocol offload, operation batching, and both polling and interrupts.
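
The entry point to the verbs API is enumerating and opening an RDMA device; protection domains, memory registration, and queue pairs then hang off the returned context. A minimal sketch using libibverbs (link with -libverbs):

```c
/* Sketch: enumerate RDMA devices and open the first one with libibverbs.
 * This is just the first step of a verbs application; completion queues,
 * memory registration and queue pairs would follow. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num, i;
    struct ibv_device **devs = ibv_get_device_list(&num);
    struct ibv_context *ctx;

    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    for (i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));

    ctx = ibv_open_device(devs[0]);       /* open the first device */
    if (!ctx) {
        fprintf(stderr, "ibv_open_device failed\n");
    } else {
        struct ibv_device_attr attr;
        if (ibv_query_device(ctx, &attr) == 0)
            printf("max QPs: %d, max CQEs: %d\n", attr.max_qp, attr.max_cqe);
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}
```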

DPDK: efficient user-mode access for Ethernet. New APIs, open source, NFV-optimized; user mode enables extreme packet rates. The control path goes through the kernel (kernel verbs, the mlx5 verbs provider mlx5_ib, and mlx5_core), while the data path uses the DPDK mlx5 PMD with user verbs (libibverbs + libmlx5) and a direct PRM interface to the ConnectX NIC, with hardware memory protection. Optimizations include minimal CPU cycles per packet, batching, RSS, cache-line-aware data structures and descriptors, vectored operations, TSO, flow classification, VLAN handling, etc. [Chart: ConnectX-5 Ex 100GbE frame rate with 15 cores vs. line rate, from 131.74 Mpps at 64B frames down to 8.12 Mpps at 1518B.]
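
A DPDK application begins by initializing the EAL, which probes the NICs through their poll-mode drivers (here the mlx5 PMD); packet I/O then runs entirely in user space via rx/tx burst calls. A minimal startup sketch, assuming a DPDK development environment:

```c
/* Sketch: DPDK startup. Initializes the EAL and reports the Ethernet
 * ports the poll-mode drivers (e.g. mlx5) have probed. A real app would
 * then configure queues and run an rte_eth_rx_burst()/tx_burst() loop. */
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int ret = rte_eal_init(argc, argv);   /* parses EAL args: cores, PCI devices, ... */
    if (ret < 0) {
        fprintf(stderr, "EAL init failed\n");
        return 1;
    }

    unsigned nb_ports = rte_eth_dev_count_avail();
    printf("%u Ethernet port(s) available to DPDK\n", nb_ports);

    uint16_t port;
    RTE_ETH_FOREACH_DEV(port) {
        struct rte_ether_addr mac;
        rte_eth_macaddr_get(port, &mac);
        printf("port %u MAC %02x:%02x:%02x:%02x:%02x:%02x\n", port,
               mac.addr_bytes[0], mac.addr_bytes[1], mac.addr_bytes[2],
               mac.addr_bytes[3], mac.addr_bytes[4], mac.addr_bytes[5]);
    }

    rte_eal_cleanup();
    return 0;
}
```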

New programming models: the NPU. A general-purpose CPU (x86) offers extreme single-thread performance but minimal parallelism: around 3 GHz, 16 cores, tens of threads, huge caches. An NPU trades low single-thread performance for massive parallelism: around 1 GHz, 256 cores, thousands of threads, small caches.

NPU programming 101. A general-purpose CPU targets any application: tens of threads, general-purpose code, a limited number of large, heavy apps, a flow per core, a standard memory subsystem with huge caches. An NPU targets L2-L7 network applications and scales to millions of flows: thousands of threads with massive parallelism, custom network-oriented code (ISA, accelerators, and more), on the order of 100M lightweight data-driven applications (i.e., flows), a flow spread across many cores with hardware RX packet spraying and TX reorder, and a network-optimized memory system where small caches, extreme threading, and high memory bandwidth compensate for memory-access latency. Example NPU performance: a 500 Gb/s stateful firewall, or 400 Gb/s deep packet inspection (15x vs. x86).

Summary: I/O performance must keep up with compute and storage development. CPUs require accelerators and co-processors for efficient network and storage processing. Software-defined infrastructure provides flexibility that can be hardware-accelerated. Adding intelligence to the network enables efficient application scaling.

Thank You!