FlexNIC: Rethinking Network DMA


Antoine Kaufmann, Simon Peter, Tom Anderson, Arvind Krishnamurthy
University of Washington
HotOS 2015

Networks: Fast and Growing Faster
[Figure: Ethernet bandwidth (bits/s, 100 M to 1 T) versus year (1995 to 2020), with points for 100 MbE, 1 GbE, 10 GbE, 40 GbE, 100 GbE, and 400 GbE]
- 5ns inter-arrival time for 64B packets at 100Gbps
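The 5ns figure follows from simple arithmetic, sketched below. The function name is illustrative, and the optional 20B of preamble and inter-frame gap is standard Ethernet framing overhead that the slide does not include in its number.

```python
def inter_arrival_ns(frame_bytes: int, link_gbps: float, overhead_bytes: int = 0) -> float:
    """Time between back-to-back frames on a saturated link, in nanoseconds."""
    bits = (frame_bytes + overhead_bytes) * 8
    # bits divided by Gbit/s gives nanoseconds directly.
    return bits / link_gbps


# 64B packets at 100Gbps, as on the slide: roughly 5ns.
payload_only = inter_arrival_ns(64, 100)

# Assumption beyond the slide: add 8B preamble + 12B inter-frame gap.
with_framing = inter_arrival_ns(64, 100, overhead_bytes=20)

print(f"payload only: {payload_only:.2f} ns, with framing: {with_framing:.2f} ns")
```

Either way, the per-packet budget at 100Gbps is a handful of nanoseconds.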

Software needs to catch up
- Many cloud apps dominated by packet processing
  - Key-value stores, graph analytics, load balancers
- Redis on Arrakis with kernel-bypass: 4µs
  - Still a long way to go to 5ns
- 5ns is not a lot of time
  - Cache access: 15ns for L3, 40ns if dirty in other L1
- Careful memory system use is paramount
  - Ideal: Data always in L1/L2 cache

NIC & SW are not well integrated
- Poor cache locality, extra synchronization
  - NIC steers packets to cores by connection
  - Application locality may not match connection
- Wasted CPU cycles
  - Packet parsing repeated in software
- High memory system pressure
  - Packet formatted for network, not SW access
- High packet processing overhead

Our Proposal: FlexNIC
- Flexible NIC DMA interface
  - Applications insert packet matching rules
  - Rules control DMA actions
- Use multi-stage match+action (M+A) processing
  - Similar to that found in next-generation SDN switches
- M+A is both efficient and a flexible abstraction
  - Packet steering based on app-defined match
  - App-level packet validation
  - Customized packet transformations: add/remove/modify header fields
  - Can be stateful
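To make the multi-stage match+action idea concrete, here is a minimal software sketch. It is not the FlexNIC interface: `Rule`, `run_pipeline`, the dict-based packet representation, and the field names (`proto`, `key`, `udp_hdr`, `queue`) are all hypothetical, chosen only to show one validation stage, one steering stage, and one transformation stage.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Optional

Packet = dict  # hypothetical parsed packet: field name -> value


@dataclass
class Rule:
    match: Callable[[Packet], bool]
    action: Callable[[Packet], Optional[Packet]]  # returning None drops the packet


def run_pipeline(stages: list[list[Rule]], pkt: Packet) -> Optional[Packet]:
    """Apply the first matching rule of each stage, stage by stage."""
    for stage in stages:
        for rule in stage:
            if rule.match(pkt):
                pkt = rule.action(pkt)
                if pkt is None:
                    return None  # dropped by a validation rule
                break  # at most one rule fires per stage
    return pkt


NUM_QUEUES = 2  # assumed number of receive queues


def is_memcached(p: Packet) -> bool:
    return p.get("proto") == "memcached"


stages = [
    # Stage 1, validation: drop memcached requests with an empty key.
    [Rule(lambda p: is_memcached(p) and not p.get("key"), lambda p: None)],
    # Stage 2, steering: pick a queue from an app-defined field (the key).
    [Rule(is_memcached, lambda p: {**p, "queue": zlib.crc32(p["key"]) % NUM_QUEUES})],
    # Stage 3, transformation: strip a header the application never reads.
    [Rule(lambda p: "udp_hdr" in p,
          lambda p: {k: v for k, v in p.items() if k != "udp_hdr"})],
]

out = run_pipeline(stages, {"proto": "memcached", "key": b"user:42", "udp_hdr": b"..."})
print(out)
```

A stateful rule would additionally read and update per-rule state in its action; that is omitted here to keep the sketch short.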

Example: Key-Value Store
Receive-Side Scaling
[Figure: Clients 1 to 3 send requests through the NIC to Core 1 and Core 2, which share a hash table with entries 1 to 8. Core 1 is labeled 3,4,7 and Core 2 is labeled 1,4,7,8.]
- Lock contention
- Poor cache utilization

FlexNIC Steering
[Figure: Clients 1 to 3 send requests through the NIC to Core 1 and Core 2 over a hash table with entries 1 to 8. Core 1 is labeled 1,3,4 and Core 2 is labeled 7,8.]
- Key-based steering
  - Match: IF protocol = memcached
  - Action: PUT packet TO hash(key)
- No locks needed
- Better cache utilization
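The difference between the two steering policies can be sketched in a few lines. This is an illustration only: real RSS hardware typically uses a Toeplitz hash over the connection tuple, whereas this sketch substitutes CRC32 for both policies, and the function names, addresses, and two-core setup are assumptions.

```python
import zlib

NUM_CORES = 2  # assumed core count, as in the slide's figure


def rss_core(src_ip: str, src_port: int) -> int:
    """Connection-based steering: the core depends on who sent the packet."""
    return zlib.crc32(f"{src_ip}:{src_port}".encode()) % NUM_CORES


def key_core(key: bytes) -> int:
    """Key-based steering: the core depends only on the requested key."""
    return zlib.crc32(key) % NUM_CORES


# With key-based steering each core can own a private partition of the table.
tables = [dict() for _ in range(NUM_CORES)]


def handle_set(key: bytes, value: bytes) -> int:
    core = key_core(key)
    tables[core][key] = value  # only `core` ever touches this dict, so no lock
    return core


# Three different clients write the same key.
clients = [("10.0.0.1", 5000), ("10.0.0.2", 5001), ("10.0.0.3", 5002)]
for ip, port in clients:
    owner = handle_set(b"k1", b"v")
    print(f"client {ip}:{port} rss_core={rss_core(ip, port)} key_core={owner}")
```

Under connection-based steering, requests for the same key can land on different cores depending on the client, which is why the shared table needs locking; under key-based steering the same key always reaches the same core.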

Traditional NIC DMA
[Figure: packet buffers and a descriptor queue shared between the NIC and software]
- Many PCIe round-trips
- Key-Value Store: Copy items to item log

FlexNIC Key-Value Store
Custom DMA interface
[Figure: an item log holding Item 1 and Item 2, and an event queue holding entries S2 and G1]
- Event S2: SET, Client ID, Item Pointer
- Event G1: GET, Client ID, Hash, Key
- Combine steering, validation, and transformation
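A software emulation of this interface might look like the sketch below. Only the event fields (SET: client ID and item pointer; GET: client ID, hash, and key) come from the slide. The `CoreState` class, the `nic_deliver` and `read_item` helpers, the item record layout (2-byte key length, 4-byte value length, then key and value), and the use of CRC32 as the hash are assumptions made for illustration.

```python
import zlib
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CoreState:
    item_log: bytearray = field(default_factory=bytearray)  # append-only item storage
    events: deque = field(default_factory=deque)            # per-core event queue


def nic_deliver(core: CoreState, op: str, client_id: int,
                key: bytes, value: bytes = b"") -> None:
    """Emulates what the NIC would do on receiving a request."""
    if op == "SET":
        ptr = len(core.item_log)
        # Hypothetical record layout: key length, value length, key, value.
        core.item_log += (len(key).to_bytes(2, "little")
                          + len(value).to_bytes(4, "little") + key + value)
        core.events.append(("SET", client_id, ptr))
    elif op == "GET":
        core.events.append(("GET", client_id, zlib.crc32(key), key))
    else:
        raise ValueError(f"unknown operation: {op}")


def read_item(log: bytearray, ptr: int) -> tuple[bytes, bytes]:
    """Software side: follow an item pointer from a SET event into the log."""
    klen = int.from_bytes(log[ptr:ptr + 2], "little")
    vlen = int.from_bytes(log[ptr + 2:ptr + 6], "little")
    start = ptr + 6
    return bytes(log[start:start + klen]), bytes(log[start + klen:start + klen + vlen])


core = CoreState()
nic_deliver(core, "SET", 7, b"k", b"v")
nic_deliver(core, "GET", 9, b"k")
print(list(core.events), read_item(core.item_log, 0))
```

The point of the design is that the item already sits in the log when software sees the SET event, so the copy step from the traditional path disappears.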

Preliminary Evaluation
- Lightweight key-value store implementation
  - 4 core Sandy Bridge 2.2GHz
- Receive-side scaling: 1110 cycles/req
- Key-based steering: 690 cycles/req (38% speedup)
  - Emulate using RSS and IPv6 header
- FlexNIC KVS: 450 cycles/req (60% speedup)
  - Emulate in software on dedicated core
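The percentages above can be reproduced from the cycle counts, reading "speedup" as the reduction in cycles per request relative to receive-side scaling. The helper names are illustrative, and the nanosecond figures are derived here from the stated 2.2GHz clock rather than reported on the slide.

```python
CLOCK_GHZ = 2.2  # from the slide: Sandy Bridge at 2.2GHz

cycles_per_req = {
    "receive-side scaling": 1110,
    "key-based steering": 690,
    "FlexNIC KVS": 450,
}


def reduction(baseline: float, new: float) -> float:
    """Fractional reduction in cycles per request relative to the baseline."""
    return 1 - new / baseline


def cycles_to_ns(cycles: float, ghz: float = CLOCK_GHZ) -> float:
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles / ghz


baseline = cycles_per_req["receive-side scaling"]
for name, cycles in cycles_per_req.items():
    print(f"{name}: {cycles} cycles, {cycles_to_ns(cycles):.0f} ns, "
          f"{reduction(baseline, cycles):.1%} fewer cycles than RSS")
```

Key-based steering comes out at about 37.8% fewer cycles and the FlexNIC KVS at about 59.5%, which the slide rounds to 38% and 60%.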

Summary
- Networks are becoming faster
  - Server applications need to keep up
- FlexNIC eliminates processing inefficiencies
  - Application control over where packets are processed
  - Efficient steering/validation/transformation on NIC
- Promising preliminary performance evaluation
  - Reduce request processing time by 60%
- Next step: evaluate more use-cases, delivery directly to cache

Backup

Further Use-case: Improving RDMA
- RDMA requires shared memory to be pinned
  - Problematic for virtualization
  - Usually mapped, but need to gracefully handle if not
  - Idea: Add rule to divert access to slow-path
- Data structure consistency with RDMA is hard
  - Often need to use messages
  - Idea: Implement data structure operations (e.g. log append, hash table insert)