Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet
Pilar González-Férez and Angelos Bilas
31st International Conference on Massive Storage Systems and Technology (MSST 2015)
June 5th, Santa Clara, California

Motivation
Small I/O requests are limited by host I/O path overhead.
- Small I/O requests are important for a large number of workloads: most files are 4kB or smaller, and metadata requests are typically small and make up about 50% of the I/Os.
- Storage devices, especially HDDs, have traditionally dominated all I/O overheads, so techniques for improving throughput have focused on large requests.
- NVM technologies exhibit performance similar to DRAM: the device overhead of small I/Os is a few microseconds, so the bottleneck shifts from the device to the host I/O stack.
- Most systems today use some form of networked storage, and Ethernet NICs have improved significantly in latency as well.
Our goal: reduce host-level overhead for small I/O requests.

Tyche Semantics
Tyche is an in-house network storage protocol over raw Ethernet that achieves high throughput without any hardware support [MSST14].
- Connection-oriented protocol over raw Ethernet that can be deployed in existing infrastructures
- Creates and presents a local view of a remote storage device (NBD-like)
- Supports any existing file system
- Transparent bundling of multiple NICs (tested with up to 6)
- Reliable delivery
- Provides RDMA-like operations without any hardware support
- Copy reduction via page remapping in the I/O path
- NUMA affinity management
- Storage-specific packet processing
- Pre-allocation of memory
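Tyche itself runs inside the Linux kernel. As a rough illustration of what "a storage protocol carried directly in raw Ethernet frames, without hardware support" means at the lowest level, here is a minimal user-space sketch that sends one frame with an experimental EtherType through an AF_PACKET socket. The interface name, destination MAC, and payload layout are placeholders; this is not Tyche's code.

```c
/* Minimal sketch: send one raw Ethernet frame with a custom EtherType.
 * Requires CAP_NET_RAW. All addresses and the header layout are placeholders. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/ethernet.h>
#include <net/if.h>

#define ETH_P_EXAMPLE 0x88B5   /* IEEE "local experimental" EtherType */

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_EXAMPLE));
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_ll addr = {
        .sll_family   = AF_PACKET,
        .sll_protocol = htons(ETH_P_EXAMPLE),
        .sll_ifindex  = if_nametoindex("eth0"),   /* NIC name: placeholder */
        .sll_halen    = ETH_ALEN,
        .sll_addr     = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 }, /* target MAC */
    };

    unsigned char frame[ETH_ZLEN] = { 0 };       /* minimum-size frame, zero padded */
    struct ether_header *eh = (struct ether_header *)frame;
    memcpy(eh->ether_dhost, addr.sll_addr, ETH_ALEN);
    /* source MAC left zero for brevity; a real sender fills in the local NIC's */
    eh->ether_type = htons(ETH_P_EXAMPLE);
    /* the payload would carry the request header: opcode, LBA, length, tags... */

    if (sendto(fd, frame, sizeof(frame), 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("sendto");
    close(fd);
    return 0;
}
```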

Tyche Design
Tyche achieves up to 6.7GB/s by using six 10Gbit/s NICs without hardware support [MSST14].
[Diagram: send path at the initiator (VFS / file system, Tyche block layer, Tyche network layer, Ethernet driver, tx_ring) and receive path at the target (Ethernet driver, rx_ring and not_ring, Tyche network layer, Tyche block layer, block storage device), each with kernel-space buffers for requests and data and multiple physical NICs at the network layer.]

However...
Tyche still exhibits low throughput and low network link utilization for small I/O requests.
[Plot: link utilization (%) vs. number of outstanding I/O requests (8-128), FIO read requests, for 4kB-128kB request sizes and NBD with 4kB requests.]
Tyche provides up to 5x the link utilization of NBD, but achieves only up to 56% link utilization for 4kB requests, versus up to 90% for 8kB requests.

Therefore, our goal is...
We analyze the host CPU overheads in the networked I/O path and find that small I/O requests are limited by:
- Context switch overhead
- Network packet processing
We reduce overheads and increase throughput for small I/O requests:
- At low I/O concurrency: reduce context switches
- At high I/O concurrency: dynamic batching of requests

1 Motivation
2 Overhead Analysis
3 Reducing Context Switches
4 Adaptive Batching
5 Conclusions

Experimental Testbed
Two nodes, each with a 4-core Intel Xeon E5520 @2.7GHz
- Initiator: 12GB DDR-III DRAM
- Target: 48GB DDR-III DRAM, of which 36GB is used as a ramdisk
One Myri10ge card per node, connected back to back
CentOS 6.3; Tyche implemented in Linux kernel 2.6.32
Benchmark: FIO, direct I/O, 4kB requests
Queue depth: single thread with 1 outstanding request

End-to-end Overhead Analysis
[Diagram: the I/O request path from the application through the initiator (block layer, network layer, Ethernet driver, NIC tx/rx) across the network to the target (NIC rx/tx, Ethernet driver, network layer, block layer, ramdisk), annotated with the overhead components listed below.]
Overheads measured:
- Total
- Tyche send path: Ty-IS, Ty-TS
- Tyche receive path: Ty-IR, Ty-TR
- Context switches: CS-WQ, CS-Rec, CS-IRQ
- Ramdisk
Overheads computed:
- In/Out kernel
- Link+NIC

4kB Read Requests at Low Concurrency (queue depth = 1)
Total: 73.69µs, Throughput: 52.50MB/s
[Diagram: the end-to-end I/O path annotated with the measured overhead components.]
Overhead breakdown (µs; % of total):
- I/O kernel: 13.19µs (17.9%)
- Tyche processing (20.0% combined): Ty-IS 2.75µs, Ty-TR 3.00µs, Ty-TS 4.00µs, Ty-IR 5.00µs
- Context switches (27.3% combined): CS-WQ 4.00µs, CS-Rec 8.00µs, CS-IRQ 8.15µs
- Ramdisk: 1.00µs (1.4%)
- Link+NIC: 24.60µs (33.4%)

4kB Write Requests at Low Concurrency (queue depth = 1)
Total: 73.80µs, Throughput: 52.50MB/s
[Diagram: the end-to-end I/O path annotated with the measured overhead components.]
Overhead breakdown (µs; % of total):
- I/O kernel: 12.80µs (17.3%)
- Tyche processing (20.3% combined): Ty-IS 4.75µs, Ty-TR 5.00µs, Ty-TS 3.00µs, Ty-IR 2.25µs
- Context switches (27.3% combined): CS-WQ 4.00µs, CS-Rec 8.00µs, CS-IRQ 8.13µs
- Ramdisk: 1.00µs (1.4%)
- Link+NIC: 24.87µs (33.7%)

3 Reducing Context Switches

Overhead of Context Switches
Each context switch costs about 4µs, and context switches account for up to 27.5% of the total overhead.
[Diagram: the I/O path with the context-switch points (CS-Out, CS-WQ, CS-Rec, CS-IRQ) highlighted.]
Two threads doing nothing but context-switching between each other:
- Same NUMA node: 2.5µs
- Different NUMA node: 5µs
The overhead comes from saving and restoring processor registers, pipeline flushes, cache effects, etc. [Jongmin et al., Transactions on Storage 11(2), 2015]
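The 2.5µs/5µs figures come from a microbenchmark in which two threads only switch between each other. A minimal sketch of that kind of measurement, assuming a pipe-based ping-pong and placeholder core numbers (pin the partner thread to a core on the same or on a different NUMA node to compare):

```c
/* Ping-pong between two pinned threads: every round trip forces two context
 * switches (plus pipe overhead, so the result is an upper bound). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ROUNDS 100000

static int ping[2], pong[2];

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *partner(void *arg)
{
    char c;
    pin_to_cpu((int)(long)arg);          /* pick same or different NUMA node */
    for (int i = 0; i < ROUNDS; i++) {
        read(ping[0], &c, 1);            /* block until the main thread writes */
        write(pong[1], &c, 1);           /* wake the main thread again         */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec a, b;
    char c = 'x';

    pipe(ping);
    pipe(pong);
    pin_to_cpu(0);
    pthread_create(&t, NULL, partner, (void *)1L);   /* core 1: placeholder */

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ROUNDS; i++) {
        write(ping[1], &c, 1);
        read(pong[0], &c, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double us = (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
    printf("%.2f us per context switch (upper bound)\n", us / (2.0 * ROUNDS));
    return 0;
}
```

Built with gcc -pthread; comparing core pairs within and across sockets on the testbed above would show the same-node versus cross-node gap, modulo pipe overhead.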

Avoiding the Context Switch on the Receive Path
A single thread runs both the network-layer tasks and the block-layer tasks.
[Diagram: the I/O path with the receive-side hand-off, and its context switch, removed.]
- Shared structures to communicate between the two layers are no longer needed
- Overhead of the receive path is significantly reduced
- Total overhead is reduced by 18%
- Throughput increases by 22%
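Conceptually, the change replaces a hand-off to a separate block-layer worker with an inline call from the receive thread. A schematic, runnable sketch under assumed names and a hypothetical request layout (not Tyche's code):

```c
/* Receive-path change: the receive thread calls the block layer directly,
 * removing one context switch and the shared hand-off queue per request. */
#include <stdio.h>

struct io_request {
    unsigned long lba;
    unsigned int  len;
    int           is_write;
};

/* Stand-in for issuing the block I/O (a ramdisk in the paper's testbed). */
static void block_layer_submit(struct io_request *req)
{
    printf("submit %s lba=%lu len=%u\n",
           req->is_write ? "write" : "read", req->lba, req->len);
}

/* Before: enqueue for a worker thread, which wakes up and submits later.
 * Shown only as a comment, since the point is that it disappears:
 *
 *   void net_rx_handler_old(struct io_request *req)
 *   {
 *       workqueue_enqueue(req);   // worker later calls block_layer_submit()
 *   }
 */

/* After: the single receive thread runs network- and block-layer work inline. */
static void net_rx_handler(struct io_request *req)
{
    block_layer_submit(req);    /* no wakeup, no shared hand-off queue */
}

int main(void)
{
    struct io_request req = { .lba = 42, .len = 4096, .is_write = 0 };
    net_rx_handler(&req);       /* pretend a 4kB read request just arrived */
    return 0;
}
```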

Avoiding the Context Switch on the Target Send Path
No work queue is used to send the completion back from the target.
[Diagram: the target send path with the work-queue hand-off removed.]
- No synchronization with the work-queue threads is needed
- Overhead of the target send path is also reduced
- Total overhead is reduced by 22%
- Throughput increases by 45%

Summary
For 4kB requests at low concurrency:
- Total overhead per I/O request is reduced by 27%
- Throughput improves by 45%
- Context switch overhead is reduced by up to 60%
- Tyche processing is reduced by 56% for reads and 61% for writes

4 Adaptive Batching

Adaptive Batching
Problem: under high concurrency, Tyche still cannot achieve high throughput for small I/O requests.
[Plot: link utilization (%) vs. number of outstanding I/O requests (8-128), FIO read requests, for 4kB-128kB request sizes and NBD with 4kB requests.]
Approach: batch several requests into a single request message. Batching reduces:
- The number of messages
- Message processing
- The number of context switches

Batching Request Messages
A batch request message includes several I/O requests or I/O completions.
[Diagram: applications enqueue I/O requests into a batch queue at the Tyche block layer; a batcher thread builds batch messages and places them on the Tyche network layer's tx ring and the NIC tx ring.]
Initiator:
- Has a batch queue and a batching thread
- Applications enqueue I/O requests into the batch queue
- The thread dequeues these requests and inserts them into a batch message
Target:
- Issues a regular I/O request per request in the batch message
- Batches completions as well
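A minimal user-space sketch of this initiator-side structure, with assumed names, a fixed queue size, and a fixed batch limit (Tyche chooses the batch size dynamically, as described two slides below):

```c
/* Application threads enqueue requests; a batcher thread drains the queue
 * into a single "batch message" (printed here instead of transmitted). */
#include <pthread.h>
#include <stdio.h>

#define MAX_BATCH 64

struct io_request { unsigned long lba; unsigned int len; };

static struct io_request queue[1024];
static int head, tail, done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  more = PTHREAD_COND_INITIALIZER;

/* Called by application threads (the Tyche block layer in the real system). */
static void enqueue_request(struct io_request req)
{
    pthread_mutex_lock(&lock);
    queue[tail++ % 1024] = req;
    pthread_cond_signal(&more);
    pthread_mutex_unlock(&lock);
}

/* Batcher thread: drain whatever is queued (up to MAX_BATCH) into one message. */
static void *batcher(void *arg)
{
    (void)arg;
    for (;;) {
        struct io_request batch[MAX_BATCH];
        int n = 0;

        pthread_mutex_lock(&lock);
        while (head == tail && !done)
            pthread_cond_wait(&more, &lock);
        while (head != tail && n < MAX_BATCH)
            batch[n++] = queue[head++ % 1024];
        int finished = done && head == tail;
        pthread_mutex_unlock(&lock);

        if (n > 0)   /* stand-in for building and transmitting a batch message */
            printf("sending batch message with %d requests\n", n);
        if (finished)
            return NULL;
    }
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, batcher, NULL);

    for (unsigned long i = 0; i < 256; i++)
        enqueue_request((struct io_request){ .lba = i, .len = 4096 });

    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_signal(&more);
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return 0;
}
```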

Batching Data Messages
The data of a 4kB request is normally sent as a single data packet in a 4kB jumbo Ethernet frame.
We batch data messages as well:
- Data for multiple 4kB request messages is sent together in a single data message
- We use data packets of 8kB, carried in 8kB jumbo Ethernet frames

Dynamic Batching
Problem: should we send a batch message now, or wait for more I/O requests?
Proposal:
- We define N as the number of requests to put in a message
- We calculate N using feedback from the achieved (measured) throughput and the available concurrency (current queue depth)
- N is recalculated continuously (every second)
- We send a message every N requests or when a timeout occurs
- To avoid local minima, we artificially increase/decrease N if it stays constant for some time
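The slide gives the inputs to this computation (measured throughput, current queue depth, a once-per-second recomputation, and a perturbation to escape local minima) but not the exact rule. The following is a sketch of one plausible hill-climbing controller under those constraints; the concrete policy is an assumption, not Tyche's actual algorithm.

```c
/* Hypothetical once-per-second recomputation of the batch size N. */
#include <stdio.h>

static int N = 1;                  /* requests per batch message            */
static int direction = 1;          /* last adjustment: +1 grow, -1 shrink   */
static double prev_tput;           /* throughput measured last second       */
static int unchanged_secs;         /* seconds since N last changed          */

/* Called once per second with the measured throughput (e.g. MB/s) and the
 * current queue depth (available concurrency). */
void recompute_batch_size(double tput, int queue_depth)
{
    int old = N;

    if (tput < prev_tput)
        direction = -direction;    /* the last change hurt: reverse it      */
    N += direction;

    if (N < 1) N = 1;              /* never wait for more requests than     */
    if (N > queue_depth)           /* can be outstanding at once            */
        N = queue_depth;

    if (N == old) {
        /* Avoid getting stuck in a local minimum: force a change eventually. */
        if (++unchanged_secs >= 3) {
            N = (N > 1) ? N - 1 : N + 1;
            unchanged_secs = 0;
        }
    } else {
        unchanged_secs = 0;
    }

    prev_tput = tput;
}

int main(void)
{
    /* Toy trace: throughput rises while larger batches still help. */
    double tput[] = { 100, 140, 180, 210, 200, 205, 205, 205 };
    for (int s = 0; s < 8; s++) {
        recompute_batch_size(tput[s], 32);
        printf("second %d: N = %d\n", s, N);
    }
    return 0;
}
```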

Batching Requests/Responses (no data)
Dynamic batching (Tyche-Batch) improves link utilization by up to 49.5% and achieves up to 80.7% link utilization.
[Plots: link utilization (%) vs. number of outstanding I/O requests (32-512) for read and write requests, comparing no batching (NoB), dynamic batching (DyB) and fixed batch sizes (B-2, B-8, B-64). FIO, 4kB requests, 1 to 128 threads with 4 outstanding I/Os per thread.]

Batching Data + Requests/Responses
Dynamic batching improves link utilization by up to 56.9% and achieves up to 88.0% link utilization.
[Plots: link utilization (%) vs. number of outstanding I/O requests (32-512) for read and write requests, comparing NoB, DyB, B-2, B-8 and B-64. FIO, 4kB requests, 1 to 128 threads with 4 outstanding I/Os per thread.]

Batching Requests/Responses (no data): Batch Level
- Reads: the batch level increases as the number of outstanding requests increases
- Writes: the batch level is kept constant up to more than 256 outstanding requests
[Plots: batch level vs. number of outstanding I/O requests (32-512) for read and write requests, comparing DyB, B-2, B-8 and B-64. FIO, 4kB requests, 1 to 128 threads with 4 outstanding I/Os per thread.]

Mixed Request Sizes
32 threads issue 4kB requests; every 60s an additional thread starts issuing 128kB requests.
Batching achieves up to 91.3% of link utilization.
[Plots: link utilization (%) over time (0-360s) for read and write requests, comparing DyB and NoB.]

Conclusions
We analyze the overhead of Tyche [MSST14]:
- CPU overhead is up to 65% of total overhead (including OS, etc.)
- The I/O protocol is up to 47% of total overhead (excluding OS, etc.)
- At low concurrency, we reduce I/O protocol overhead by up to 61% by reducing context switches
- At high concurrency, we improve link utilization by up to 57% via adaptive batching
We achieve:
- 14µs overhead per I/O request, about 7µs on each of the initiator and target (excluding the network link)
- 91% of the theoretical maximum link utilization: 287K out of 315K IOPS on 10GigE
- No hardware support required

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet
Pilar González-Férez (pilar@ditec.um.es) and Angelos Bilas (bilas@ics.forth.gr)