SPDK Blobstore: A Look Inside the NVM Optimized Allocator


Paul Luse, Principal Engineer, Intel
Vishal Verma, Performance Engineer, Intel

Outline
- Storage Performance Development Kit: What, Why, How?
- Blobstore Overview: Motivation, Design
- A Blobstore Example: SPDK Key Elements, Hello Blob Walkthrough
- RocksDB Performance
- Summary

What? Storage Performance Development Kit
- Software building blocks
- Open source, BSD licensed
- Userspace and polled mode
- http://spdk.io

SPDK Architecture
[Architecture diagram, as of the Q4 '17 release. Storage Protocols: NVMe-oF* Target, iSCSI Target, vhost-scsi Target, vhost-blk Target. Storage Services: Block Device Abstraction (BDEV), Logical Volumes, BlobFS, Blobstore. Core Drivers: NVMe* PCIe Driver, NVMe-oF* Initiator, Intel QuickData Technology Driver. Integration: RocksDB, Ceph RBD, Linux Async IO, fio. Plus an Application Framework.]

Why? Efficiency & Performance
- Up to 10X MORE IOPS/core for NVMe-oF* vs. Linux kernel
- Up to 8X MORE IOPS/core for NVMe vs. Linux kernel
- Up to 50% BETTER tail latency for some RocksDB workloads
- More EFFICIENT use of development resources

* Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

How? SPDK Community
- GitHub: https://github.com/spdk/spdk
- Trello: https://trello.com/spdk
- GerritHub: https://review.gerrithub.io/#/q/project:spdk/spdk+status:open
- IRC: we're on #spdk (https://freenode.net/)
- Home page: http://www.spdk.io/

1st SPDK Hackathon!! Nov 6-8, 2017, Phoenix

Blobstore - Motivation
SPDK came on the scene and enabled applications that consume blocks to realize incredible gains. But what about applications that don't consume blocks?

Blobstore Design - Design Goals
- Minimalistic, for targeted storage use cases like RocksDB and Logical Volumes
- Deliver only the basics to enable another class of application
- Design for fast storage media

Blobstore Design - High Level
- Applications interact with chunks of data called blobs: mutable arrays of pages of data, accessible via ID
- Asynchronous: no blocking, queuing or waiting
- Fully parallel: no locks in the I/O path

Blobstore Design - Layout
A blob is an array of pages organized as an ordered list of clusters. Example mapping from the slide: pages 0-255 live in cluster 507, pages 256-511 in cluster 48, pages 512-767 in cluster 800, and pages 768-1023 in cluster 78; the clusters in turn map onto the device's LBA range (LBA 0 to LBA N).
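To make that page-to-cluster translation concrete, here is an illustrative sketch. The structure and helper below are hypothetical, not SPDK's internal representation; it only assumes fixed-size pages grouped into fixed-size clusters, as the slide describes.

```cpp
#include <cstdint>

// Hypothetical, simplified view of a blob's layout (not SPDK's actual
// internal types): an ordered list of cluster indices.
struct blob_map {
    const uint64_t *clusters;   // ordered cluster list, e.g. {507, 48, 800, 78}
    uint64_t pages_per_cluster; // e.g. 256 (1MB clusters of 4KB pages)
    uint64_t lbas_per_page;     // e.g. 8 (4KB page on 512B sectors)
};

// Resolve a blob-relative page number to an on-disk LBA.
static uint64_t blob_page_to_lba(const blob_map &m, uint64_t page) {
    uint64_t slot   = page / m.pages_per_cluster; // which entry in the list
    uint64_t offset = page % m.pages_per_cluster; // page within that cluster
    uint64_t disk_page = m.clusters[slot] * m.pages_per_cluster + offset;
    return disk_page * m.lbas_per_page;
}
```

With the example mapping above, blob page 260 falls in list slot 1 (cluster 48) at page offset 4; no per-page table is needed because allocation happens in whole clusters.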

Blobstore Design - Metadata
- Stored in pages in a reserved region of the disk
- Not shared between blobs
- One blob may have multiple metadata pages
[Diagram: metadata region holding pages for Blob 1, Blob 6, Blob 1, Blob 9, followed by the data region.]

Blobstore Design - API
- Open, read, write, close, delete, resize, sync
- Asynchronous, callback driven
- Read/write in pages, allocate in clusters
- Data is direct; metadata is cached
- Minimal support for xattrs
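As a small illustration of the metadata model (cached in memory, persisted on sync), here is a sketch using the xattr calls from spdk/blob.h. It assumes an already-open blob; the attribute name is made up for the example, and exact signatures may vary by SPDK release.

```cpp
#include "spdk/blob.h"
#include <cstring>

// Attach a small piece of named metadata to an open blob. The xattr
// lives in the blob's cached metadata until spdk_blob_sync_md() (or a
// close) writes the metadata pages out to the reserved region.
static void tag_blob(spdk_blob *blob, spdk_blob_op_complete cb, void *cb_arg) {
    const char *value = "hello_tag";                          // hypothetical value
    spdk_blob_set_xattr(blob, "name", value, std::strlen(value) + 1);
    spdk_blob_sync_md(blob, cb, cb_arg);                      // persist metadata pages
}
```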

Hello Blob Walkthrough
[Slides 14-23: a step-by-step code walkthrough; the code screenshots were not captured in this transcription.]
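The walkthrough slides step through SPDK's hello_blob example (examples/blob/hello_world/hello_blob.c in the SPDK tree). Since the code itself was not captured, below is a condensed sketch of the same callback chain: initialize the blobstore, create and open a blob, resize it, sync metadata, then write. Function names follow the public spdk/blob.h API, which has shifted across releases (the 2017-era API used spdk_bs_md_* names); app-framework setup, bs_dev creation, error handling, and the read-back/cleanup stages are elided.

```cpp
#include "spdk/blob.h"
#include "spdk/env.h"
#include <cstring>

struct hello_ctx {
    spdk_blob_store *bs = nullptr;
    spdk_blob *blob = nullptr;
    spdk_blob_id blobid = 0;
    spdk_io_channel *channel = nullptr;
    void *payload = nullptr;
    uint64_t page_size = 0;
};

static void write_complete(void *arg, int bserrno) {
    // One page is now on media. The full example reads it back, then
    // closes the blob and unloads the blobstore, each via a callback.
    (void)arg; (void)bserrno;
}

static void sync_complete(void *arg, int bserrno) {
    auto *ctx = static_cast<hello_ctx *>(arg);
    // Data-path I/O needs an I/O channel and DMA-safe memory.
    ctx->channel = spdk_bs_alloc_io_channel(ctx->bs);
    ctx->payload = spdk_malloc(ctx->page_size, 0x1000, nullptr,
                               SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);
    std::memset(ctx->payload, 0x5a, ctx->page_size);
    // Write one page at page offset 0: reads and writes are in pages.
    spdk_blob_io_write(ctx->blob, ctx->channel, ctx->payload,
                       0, 1, write_complete, ctx);
}

static void resize_complete(void *arg, int bserrno) {
    auto *ctx = static_cast<hello_ctx *>(arg);
    // Size changes live only in memory until metadata is synced.
    spdk_blob_sync_md(ctx->blob, sync_complete, ctx);
}

static void open_complete(void *arg, spdk_blob *blob, int bserrno) {
    auto *ctx = static_cast<hello_ctx *>(arg);
    ctx->blob = blob;
    // New blobs own zero clusters; allocation happens in clusters.
    spdk_blob_resize(blob, 1, resize_complete, ctx);
}

static void create_complete(void *arg, spdk_blob_id blobid, int bserrno) {
    auto *ctx = static_cast<hello_ctx *>(arg);
    ctx->blobid = blobid;
    spdk_bs_open_blob(ctx->bs, blobid, open_complete, ctx);
}

static void bs_init_complete(void *arg, spdk_blob_store *bs, int bserrno) {
    auto *ctx = static_cast<hello_ctx *>(arg);
    ctx->bs = bs;
    ctx->page_size = spdk_bs_get_page_size(bs);
    spdk_bs_create_blob(bs, create_complete, ctx);
}

// Entry point (run from the SPDK app framework, with a bs_dev built on
// top of a bdev): spdk_bs_init(bs_dev, nullptr, bs_init_complete, ctx);
```

Note how every step completes through a callback: nothing blocks, which is exactly the asynchronous, lock-free design described on the earlier slides.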

SPDK and RocksDB
Stack: RocksDB → SPDK Env (backend) → BlobFS → Blobstore → Bdev → NVMe Driver
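SPDK's RocksDB fork supplies a custom rocksdb::Env whose file operations land on BlobFS instead of a kernel filesystem. A minimal sketch of the wiring is below; a NewSpdkEnv() factory exists in the fork's env layer, but the parameter list shown here (base env, SPDK config, bdev name, cache size), the config filename, and the bdev name are assumptions and may differ by version.

```cpp
// Sketch: open a RocksDB instance on BlobFS via the SPDK Env backend.
#include "rocksdb/db.h"
#include "rocksdb/env.h"

// Provided by SPDK's RocksDB fork; this prototype is an assumption.
rocksdb::Env *NewSpdkEnv(rocksdb::Env *base, const std::string &conf,
                         const std::string &bdev, uint64_t cache_mb);

int main() {
    rocksdb::Options options;
    // "spdk.conf" and "Nvme0n1" are hypothetical; 30GB matches the
    // cache size used in the benchmark configuration in this deck.
    options.env = NewSpdkEnv(rocksdb::Env::Default(), "spdk.conf",
                             "Nvme0n1", 30 * 1024);
    options.create_if_missing = true;

    rocksdb::DB *db;
    rocksdb::Status s = rocksdb::DB::Open(options, "/mnt/rocksdb", &db);
    if (!s.ok()) return 1;

    db->Put(rocksdb::WriteOptions(), "key", "value");
    delete db;
    return 0;
}
```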

Performance Comparisons (SPDK vs. Linux Kernel)

System, Dataset & Workload Configuration

System configuration:
- Processor: Intel(R) Xeon(R) CPU E5-2618L v4 @ 2.20GHz
- Total physical CPUs (HT disabled): 20
- Total memory: 64GB
- SSD: 1x Intel Optane SSD DC P4800X 375GB

OS configuration:
- Distro: Ubuntu 16.04.1 LTS
- Kernel: 4.12.4 (built), x86_64

Dataset:
- 242GB (250 million uncompressed records); key_size: 16 bytes, value_size: 1000 bytes
- Dataset size kept higher (4:1) than main memory size; fills ~70% of disk capacity

SPDK tuning parameters:
- Cache size: 30GB
- Cache_buffer_size: 64KB

Linux kernel 4.12.0 tuning parameters:
- Page cache size: 30GB (using cgroups)
- Bytes per sync: 64KB
- XFS filesystem, agcount=32, mounted with discard

Benchmark:
- Db_bench, part of RocksDB: https://github.com/spdk/rocksdb
- Test duration: 5 minutes; 3 runs for each test; compression: none
- Workloads: ReadWrite, Fillsync, Overwrite

Performance & Latency - Workload #1: ReadWrite (90% reads / 10% writes)

Metric                              Linux Kernel   SPDK     Delta
Ops/sec (higher is better)          3772           10571    2.8x
CPU utilization % (lower is better) 42.3           20.4     20%
Avg. latency, us (lower is better)  1060           378      65%
p99.99 latency, us (lower is better) 36333         17089    53%

- Up to 2.8x performance improvement
- Up to 53% improvement in tail latency
- Up to 20% improvement in CPU utilization

Performance & Latency - Workload #2: Fillsync (writes values in random key order in sync mode)

Metric                              Linux Kernel   SPDK     Delta
Ops/sec (higher is better)          7334           31452    4.2x
CPU utilization % (lower is better) 3.17           7.31     4%
Avg. latency, us (lower is better)  136            31       77%
p99.99 latency, us (lower is better) 1106          1027     7%

- Up to 4.2x performance improvement
- Up to 77% improvement in average latency

Performance & Latency - Workload #3: Overwrite (updates values in random key order in async mode)

Metric                              Linux Kernel   SPDK     Delta
Ops/sec (higher is better)          175884         199345   1.1x
CPU utilization % (lower is better) 19             19       -
Avg. latency, us (lower is better)  5.6            5        12%
p99.99 latency, us (lower is better) 1193          1184     -

- Workload is comprised of large block I/Os to the disk
- Compaction & flush activity happens in the background, so there is not much potential for performance improvement

Summary
- SPDK allows storage applications to take full advantage of today's fastest CPUs and SSDs
- New features and functions are always coming
- SPDK is an open source community that's growing strong!
- For more info: http://www.spdk.io/ or catch us on IRC!

Backup

Benchmark Configuration
Db_bench: from RocksDB v5.4.5

Workload    db_bench parameters
Overwrite   overwrite, threads=1, disable_wal=1
ReadWrite   readwhilewriting, threads=4, disable_wal=1
Insert      fillseq, threads=1, disable_wal=1
Fillsync    fillsync, threads=1, disable_wal=0, sync=1

Common parameters: --disable_seek_compaction=1, --mmap_read=0, --statistics=1, --histogram=1, --key_size=16, --value_size=1000, --cache_size=10737418240, --block_size=4096, --bloom_bits=10, --open_files=500000, --verify_checksum=1, --db=/mnt/rocksdb, --sync=0, --compression_type=none, --stats_interval=1000000, --compression_ratio=1, --disable_data_sync=0, --target_file_size_base=67108864, --max_write_buffer_number=3, --max_bytes_for_level_multiplier=10, --max_background_compactions=10, --num_levels=6, --delete_obsolete_files_period_micros=3000000, --max_grandparent_overlap_factor=10, --stats_per_interval=1, --max_bytes_for_level_base=10485760, --stats_dump_period_sec=60

The Problem: Software is becoming the bottleneck

Device              I/O Performance   Latency
HDD                 <500 IO/s         >2ms
SATA NAND SSD       >25,000 IO/s      <100µs
NVMe* NAND SSD      >400,000 IO/s     <100µs
Intel Optane SSD    >500,000 IO/s     <10µs

The Opportunity: SPDK unlocks new media potential

SPDK Updates: 17.07 Release (Aug 2017)
32 unique contributors!
- Userspace vhost-blk target: vhost-scsi target extended to also support vhost-blk
- GPT partition table support: exports partitions as SPDK bdevs
- Build system improvements: added configure script, which simplifies build-time configuration
- Improvements to existing features
- API cleanup and simplification