Xen Network I/O Performance Analysis and Opportunities for Improvement

J. Renato Santos, G. (John) Janakiraman, Yoshio Turner
HP Labs
Xen Summit, April 17-18, 2007
© 2003 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Motivation
CPU cost of a TCP connection at 1 Gbps (xen-unstable, 3/16/2007; PV Linux guest; x86 32-bit)
[Chart: CPU utilization (%) for xen vs. linux, RX and TX]
- Network I/O has a high CPU cost
  - TX: ~350% of the Linux cost
  - RX: ~310% of the Linux cost

Outline
- Performance analysis for the network I/O RX path (netfront/netback)
- Network I/O RX optimizations
- Network I/O TX optimization

Performance Analysis for the Network I/O RX Path

Experimental Setup
- Machines: HP ProLiant DL580 (client and server); P4 Xeon 2.8 GHz, 4 CPUs (MT disabled), 64 GB RAM, 512 KB L2, 2 MB L3
- NIC: Intel e1000 (1 Gbps)
- Network configuration: single switch connecting client and server
- Server configuration: xen-unstable (c.s. 14415, March 16, 2007), default xen0/xenU configs; single guest (512 MB), dom0 also with 512 MB
- Benchmark: simple UDP micro-benchmark (1 Gbps, 1500-byte packets)

CPU Profile for the Network RX Path
[Chart: CPU utilization (%) for linux vs. xen, broken down into app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
- The cost of the data copy is significant in both Linux and Xen
- Xen pays for an additional data copy
- The Xen guest kernel alone uses more CPU than Linux
- Most of the Xen code cost is in dom0

Xen Code Cost
[Chart: CPU utilization (%) for xen0 and xenU, broken down into other, mm, interrupt, event, time, schedule, hypercall, domain_page, grant_table]
- Major Xen overheads: grant table and dom0 page map/unmap
- Copy grant: several expensive atomic operations (LOCK instruction prefix), as sketched below:
  - Atomic cmpxchg operation to update the status field (grant in use)
  - Increment/decrement of grant usage counters (multiple spinlock operations)
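
To make the grant-table cost concrete, the following is a minimal sketch, not the actual Xen source, of the kind of LOCK-prefixed cmpxchg the hypervisor performs on the shared grant entry for every copy; the usage-counter and spinlock work mentioned above is omitted. GTF_permit_access and GTF_reading are the public grant-table flag names, while the function name and loop structure are illustrative.

    /* Simplified sketch (not the actual Xen source) of why activating a grant
     * for a copy is expensive: the hypervisor must atomically set the "in use"
     * bit in the shared grant entry with a LOCK-prefixed cmpxchg and retry on
     * races; pinning the page and bumping usage counters under a spinlock is
     * additional work not shown here. */
    static int sketch_activate_grant_for_read(volatile uint16_t *gflags)
    {
        uint16_t old, new;

        do {
            old = *gflags;
            if (!(old & GTF_permit_access))
                return -1;               /* grant not (or no longer) enabled */
            new = old | GTF_reading;     /* mark the entry as being read     */
            /* LOCK CMPXCHG serialises this cache line on every packet copy. */
        } while (cmpxchg(gflags, old, new) != old);

        return 0;
    }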

Dom0 Kernel Cost
[Chart: CPU utilization (%) for linux and dom0, broken down into other, dma+swiotlb, hypercall, interrupt, syscall, schedule, mm, tcp, bridge, network, netback, driver]
- Bridge/network is a large component of the dom0 cost (Xen Summit 2006)
  - Can be reduced if the netfilter bridge config option is disabled
- New Xen code: netback, hypercall, swiotlb
- Higher interrupt overhead in Xen: extra code in evtchn.c
- Additional high-cost functions in Xen (accounted in "other"): spin_unlock_irqrestore(), spin_trylock()

Bridge Netfilter Cost
[Chart: CPU utilization (%) for linux, nf_br, and no nf_br, broken down into other, dma+swiotlb, hypercall, interrupt, syscall, schedule, mm, tcp, bridge, network, netback, driver]
- What do we need to do to disable bridge netfilter by default?
- Should we add a netfilter hook in netback? (see the sketch below)
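
One possible shape of that idea is sketched below: netback registers its own bridge-layer netfilter hook so the global bridge netfilter pass can stay disabled. nf_register_hook(), PF_BRIDGE, NF_BR_PRE_ROUTING and NF_BR_PRI_FIRST are the real 2.6-era kernel interfaces, but the hook body and the netback_filter_skb() helper are hypothetical, and the hook prototype changed in later kernels.

    #include <linux/netfilter.h>
    #include <linux/netfilter_bridge.h>
    #include <linux/skbuff.h>

    /* Placeholder for whatever per-vif policy dom0 wants to enforce on guest
     * traffic; accepting everything keeps the sketch self-contained. */
    static bool netback_filter_skb(struct sk_buff *skb)
    {
        return true;
    }

    /* Hypothetical netfilter hook owned by netback, so that the bridge-wide
     * netfilter pass (CONFIG_BRIDGE_NETFILTER) could default to off.
     * The 2.6-era hook signature is assumed; it changed in later kernels. */
    static unsigned int
    netback_nf_hook(unsigned int hooknum, struct sk_buff **pskb,
                    const struct net_device *in, const struct net_device *out,
                    int (*okfn)(struct sk_buff *))
    {
        return netback_filter_skb(*pskb) ? NF_ACCEPT : NF_DROP;
    }

    static struct nf_hook_ops netback_nf_ops = {
        .hook     = netback_nf_hook,
        .pf       = PF_BRIDGE,
        .hooknum  = NF_BR_PRE_ROUTING,
        .priority = NF_BR_PRI_FIRST,
    };

    /* registered once at netback init: nf_register_hook(&netback_nf_ops); */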

Overhead in the Guest
[Chart: CPU utilization (%) for linux and domU, broken down into other, grant_table, hypercall, interrupt, syscall, schedule, mm, tcp, network, driver]
- Netfront: 5 times more expensive than the e1000 driver in Linux
- Memory ops (mm): 2 times more expensive in Xen (?)
- Grant table: high cost of the atomic cmpxchg operation to revoke grant access
- "Other": spin_unlock_irqrestore(), spin_trylock() (same as dom0)

Source of Netfront Cost
[Chart: CPU utilization (%) for linux and domU, broken down into other, grant_table, hypercall, interrupt, syscall, schedule, mm, tcp, network, headcopy, driver]
- Netback copies packet data into netfront page fragments
- Netfront copies the first 200 bytes of each packet from the fragment into the main socket buffer data area
- The large netfront cost is due to this extra data copy

Opportunities for Improvement on the RX Path

Reduce RX Head Copy Size
[Chart: CPU utilization (%) for linux, headcopy 200 bytes, and headcopy 14 bytes, broken down into other, grant_table, hypercall, interrupt, syscall, schedule, mm, tcp, network, headcopy, driver]
- No need to have all headers in the main SKB data area
- Copy only the Ethernet header (14 bytes); the network stack copies more data as needed (sketch below)
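
A minimal sketch of the reduced head copy, assuming netfront has already grant-copied the packet into a page fragment attached to the skb. The function name is illustrative; pskb_may_pull(), eth_type_trans() and netif_receive_skb() are the standard kernel interfaces that pull data into the linear area on demand.

    #include <linux/errno.h>
    #include <linux/etherdevice.h>   /* eth_type_trans()   */
    #include <linux/if_ether.h>      /* ETH_HLEN == 14     */
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Sketch of the reduced head copy on the netfront RX path: instead of
     * memcpy()ing ~200 bytes of every packet out of the grant-copied fragment,
     * only guarantee the 14-byte Ethernet header in the linear area and let
     * the stack pull deeper headers as it parses them. */
    static int netfront_finish_rx_skb(struct sk_buff *skb, struct net_device *dev)
    {
        if (unlikely(!pskb_may_pull(skb, ETH_HLEN)))   /* 14 bytes, not ~200 */
            return -EINVAL;

        skb->protocol = eth_type_trans(skb, dev);      /* reads linear header */
        netif_receive_skb(skb);                        /* hand off to the stack */
        return 0;
    }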

Move the Grant Data Copy onto the Guest CPU
[Chart: CPU utilization (%) for "copy in dom0" vs. "copy in guest", broken down into app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
- Dom0 grants access to the data and the guest copies it using a copy grant (sketch below)
- The cost of the 2nd copy to the user buffer is reduced, as the data is already in the guest CPU cache (assumes the cache is not evicted because the user process delays its read)
- Additional benefit: improves dom0 (driver domain) scalability, as more work is done on the guest side
- The data copy is more expensive in the guest (alignment problem)
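
A rough sketch of the guest-side copy, assuming the backend is dom0 and has passed a grant reference plus offset for the packet over the RX ring. GNTTABOP_copy and struct gnttab_copy are the existing Xen interfaces; the function name, the ring bookkeeping and the header paths (which differ between kernel trees) are illustrative.

    #include <linux/errno.h>
    #include <xen/interface/grant_table.h>
    /* HYPERVISOR_grant_table_op() and pfn_to_mfn() come from the tree's
     * hypercall/maddr headers; their location varies between kernel versions. */

    /* The guest pulls the packet out of a dom0-granted page with a copy grant,
     * instead of dom0 pushing it into a guest-granted page. */
    static int netfront_copy_rx_packet(grant_ref_t backend_ref, uint16_t src_off,
                                       struct page *dst_page, uint16_t dst_off,
                                       uint16_t len)
    {
        struct gnttab_copy op = {
            .source.u.ref  = backend_ref,      /* grant issued by dom0         */
            .source.domid  = 0,                /* backend assumed to be dom0   */
            .source.offset = src_off,
            .dest.u.gmfn   = pfn_to_mfn(page_to_pfn(dst_page)),
            .dest.domid    = DOMID_SELF,
            .dest.offset   = dst_off,
            .len           = len,
            .flags         = GNTCOPY_source_gref, /* only the source is a gref */
        };

        if (HYPERVISOR_grant_table_op(GNTTABOP_copy, &op, 1))
            return -EFAULT;
        return op.status == GNTST_okay ? 0 : -EIO;
    }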

Grant Copy Alignment Problem
[Diagram: packet starts at a half-word offset; "copy in guest" (netback SKB to netfront SKB via grant copy) vs. "copy in dom0" (netback SKB to netfront fragment page via grant copy)]
- The copy is expensive when the destination does not start at a word boundary
- Fix: also copy 2 prefix bytes, so that source and destination are now both aligned (sketch below)
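
A small sketch of that fix applied to the copy request, assuming both buffers reserve 2 bytes of headroom and sit 2 bytes past a 4-byte word boundary (the usual NET_IP_ALIGN-style layout so the IP header is aligned). The names are illustrative and reuse struct gnttab_copy from the sketch above.

    #define PKT_PREFIX_PAD 2   /* packet data sits 2 bytes past a word boundary */

    /* Widen the grant copy by 2 prefix bytes so that both the source and the
     * destination start on a word boundary; the two extra bytes in front of
     * the packet are simply ignored afterwards. */
    static void align_grant_copy(struct gnttab_copy *op)
    {
        if ((op->source.offset & 3) == PKT_PREFIX_PAD &&
            (op->dest.offset   & 3) == PKT_PREFIX_PAD) {
            op->source.offset -= PKT_PREFIX_PAD;
            op->dest.offset   -= PKT_PREFIX_PAD;
            op->len           += PKT_PREFIX_PAD;
        }
    }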

Fixing Grant Copy Alignment
[Chart: CPU utilization (%) for copy in dom0, copy in guest, and copy in guest (aligned), broken down into app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
- The grant copy in the guest becomes more efficient than in current Xen
- Grant copy in dom0: the destination is word aligned but the source is not
  - Can also be improved by copying additional prefix data: either 2 or (2+16) bytes

Fixing Alignment in Current Xen
[Chart: CPU utilization (%) for original, word align, buffer align, and copy in guest, broken down into app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
- Source alignment reduces the copy cost
- Source and destination at the same buffer offset perform better; reason (?): maybe because they fall at the same offset in the cache?
- The copy in dom0 is still more expensive than the copy in the guest: different cache behavior

Copy in the Guest Has Better Cache Locality
[Charts: "grant copy in guest" vs. "grant copy in dom0"; cycles, L3 misses, and L2 misses, broken down into app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
- The dom0 copy has more L2 cache misses than the guest copy, i.e. lower cache locality
- The guest posts multiple pages on the I/O ring; all pages in the ring must be used before the same page can be reused
- For the guest copy, pages are allocated on demand and reused more often, improving cache locality

Possible Grant Optimizations
- Define a new "simple copy grant" (hypothetical sketch below):
  - Allow only one copy operation at a time
    - No need to keep grant usage counters (removes locking)
  - Avoid the cost of atomic cmpxchg operations
    - Separate fields used for enabling the grant and for usage status
  - Avoid incrementing/decrementing page reference counters
    - Use an RCU scheme for page deallocation (lazy deallocation)
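
The sketch below gives one possible shape for the hypervisor-side check of such a simple copy grant. Nothing here is existing Xen code; the GTF_* and GNTST_* names are borrowed from the current ABI purely for readability.

    /* Hypothetical "simple copy grant" check in the hypervisor, as proposed
     * above: the grant is only ever used for single, short-lived copies, so
     * there is no in-use status to update (no LOCK cmpxchg) and no page
     * reference counts to bump; page freeing would instead be deferred with an
     * RCU-style grace period.  Nothing here is real Xen code. */
    static int simple_copy_grant_check(const grant_entry_t *ent, domid_t copier)
    {
        uint16_t flags = ent->flags;         /* single plain read, no atomics */

        if (!(flags & GTF_permit_access))
            return GNTST_general_error;      /* grant not enabled             */
        if (ent->domid != copier)
            return GNTST_permission_denied;  /* granted to a different domain */

        return GNTST_okay;                   /* caller may do exactly one copy */
    }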

Potential Savings from Grant Modifications
[Chart: CPU utilization (%) for original, no status update, no src page pin, and no dst page pin, broken down into other, mm, time, schedule, hypercall, domain_page, grant_table]
- Results are optimistic: the grant modifications still need to be implemented
- Results are based on simply eliminating the corresponding current operations

Coalescing Netfront RX Interrupts
[Chart: CPU utilization (%) for no batch, 8 pkts, 16 pkts, 32 pkts, and 64 pkts, broken down into app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
- The NIC (e1000) already coalesces HW interrupts (~10 packets/interrupt)
- Batching packets can provide additional benefit: ~10% for 32 packets (sketch below)
- But it adds extra latency, so a dynamic coalescing scheme should be beneficial
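
A sketch of what the batching could look like on the netback side, assuming the decision is made where RX responses are queued. The struct and function names are illustrative; notify_remote_via_irq() is the usual way a Xen backend kicks the frontend, and a flush timer plus a dynamically adapted batch size (both omitted) would bound the added latency.

    /* Sketch of RX event batching in netback: raise the guest's event channel
     * only once a batch of responses has been queued. */
    struct vif_rx_coalesce {
        unsigned int pending;   /* responses queued since the last notify    */
        unsigned int batch;     /* e.g. 32, per the measurements above       */
        int          irq;       /* event-channel irq bound to the frontend   */
    };

    static void netback_rx_response_queued(struct vif_rx_coalesce *c)
    {
        if (++c->pending < c->batch)
            return;                          /* hold the interrupt back      */

        c->pending = 0;
        notify_remote_via_irq(c->irq);       /* one event for the whole batch */
    }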

Coalescing Effect on Xen Cost
[Chart: Xen CPU utilization (%) in guest context for no batch, 8 pkts, 16 pkts, 32 pkts, and 64 pkts, broken down into other, mm, time, schedule, hypercall, domain_page, grant_table]
- Except for grant and domain page map/unmap, all other Xen costs are amortized by larger batches
- An additional reason to optimize grants

Combining All RX Optimizations
[Chart: CPU utilization (%) for linux, current, and optimized, broken down into app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
- The CPU cost of RX network I/O can be significantly reduced: from ~250% to ~70% overhead (compared to Linux)
- The largest improvement comes from moving the grant copy onto the guest CPU

Optimization for the TX Path

Lazy Page Mapping on TX
- Dom0 only needs to access packet headers
  - No need to map guest pages holding packet payload: the NIC accesses that memory directly through DMA
- Avoid mapping the guest page on packet TX (hypothetical sketch below):
  - Copy packet headers using the I/O ring
  - A modified grant operation returns the machine address (for DMA) but does not map the page in dom0
  - Provide a page fault handler for the cases in which dom0 does need to access the payload (packets destined to dom0/domU; netfilter rules)
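
An entirely hypothetical sketch of the modified grant operation described above; the operation, its number and its layout are invented for illustration and do not exist in Xen, which today would use GNTTABOP_map_grant_ref at this point.

    /* Invented grant operation: return only the machine address of a granted
     * payload page for DMA, without mapping it into dom0.  A page fault
     * handler (not shown) would map the page on the rare occasions dom0
     * touches the payload itself. */
    #define GNTTABOP_query_maddr  42         /* invented op number, illustration only */

    struct gnttab_query_maddr {              /* invented op layout                    */
        grant_ref_t ref;                     /* guest's grant for the payload page    */
        domid_t     domid;                   /* the granting guest                    */
        uint64_t    maddr;                   /* OUT: machine address for NIC DMA      */
        int16_t     status;                  /* OUT: GNTST_*                          */
    };

    static int netback_tx_payload_maddr(grant_ref_t ref, domid_t guest,
                                        uint64_t *maddr)
    {
        struct gnttab_query_maddr op = { .ref = ref, .domid = guest };

        if (HYPERVISOR_grant_table_op(GNTTABOP_query_maddr, &op, 1) ||
            op.status != GNTST_okay)
            return -EIO;                     /* fall back to a real map, or drop      */

        *maddr = op.maddr;                   /* handed straight to the NIC DMA ring   */
        return 0;
    }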

Benefit of Lazy TX Page Mapping
[Charts: CPU utilization (%) for "1 Gbps TCP TX (64 KB msg)" and "1 Gbps TCP RX", original vs. TX optimization, broken down into app, usercopy, kernelU, xenU, grantcopy, kernel0, xen0]
- Performance improvement from the TX optimization: ~10% for large TX; ~8% for TCP RX, due to ACKs
- Some additional improvement may be possible with the grant optimizations

Questions?