Ben Walker, Data Center Group, Intel Corporation


Notices and Disclaimers
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
© 2017 Intel Corporation.

Agenda
- Introduction
- Use Cases
- Design
- Benchmarks

Lots of applications want to use SPDK, but they aren't designed to use a block device directly.

What does a filesystem do?
- Partitions
- Permissions
- Caching
- RAID
- Access Times
- Sparse Allocation
- Byte Granularity
- Snapshots
- Directories
- Checksums
- TRIM
- I/O Scheduling

What can SPDK do to help?
Let's build some new components!

What sort of application benefits from SPDK?
- Lots of I/O
- Latency sensitive
- SAN? Database? Cache?

We picked two use cases:
- RocksDB
- Dynamic Block Allocation

RocksDB
- Log-structured merge tree
- Written in C++, open source
- Pluggable storage backend
- Broadly adopted
- Makes minimal use of XFS
- Directory structure
- I/O pattern
- Minimal caching needs
- Recommends XFS
- No other file system features required!

Glossary of Terms
- File: array of bytes; mutable, resizable; string name
- Object: array of bytes; immutable, replaceable; string name
- Page: 4 KiB

Design Goals
- Simple and efficient
- Design for fast storage media
- Support file & object-like semantics

[Figure: layer stack, top to bottom: BlobFS, Blobstore, BDEV]

Blobstore Basics
- The user interacts with chunks of data called blobs
- Blob: array of pages; mutable, resizable; ID
- Asynchronous: no blocking, queueing, or waiting
- Fully parallel operations: no locks in I/O path

[Callout: "I'm very efficient"]

Blobstore Space Allocation
[Figure: Cluster 0 spans Page 0 through Page 255; the pages map onto LBAs on the disk.]

Blobstore Design
Blob: an array of pages, implemented as an ordered list of clusters.

Index   Page offsets   Cluster
0       0-255          Cluster 905
1       256-511        Cluster 52
2       512-767        Cluster 87
3       768-1023       Cluster 455

[Figure: the clusters are laid out on the disk between LBA 0 and LBA N.]
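The lookup implied by this layout can be sketched in a few lines of Python. This is a minimal illustration, not SPDK code: the names PAGES_PER_CLUSTER, blob_clusters, and locate_page are hypothetical, and the figure of 256 pages per cluster is read off the page-offset ranges on the slide.

```python
# Illustrative sketch only; all names are hypothetical, not the SPDK API.
PAGES_PER_CLUSTER = 256  # from the slide: page offsets 0-255 per cluster

# Ordered cluster list of the example blob shown on the slide.
blob_clusters = [905, 52, 87, 455]


def locate_page(clusters, page_offset):
    """Return (cluster number, page index within that cluster)."""
    index, page_in_cluster = divmod(page_offset, PAGES_PER_CLUSTER)
    if page_offset < 0 or index >= len(clusters):
        raise IndexError("page offset is outside the blob")
    return clusters[index], page_in_cluster


print(locate_page(blob_clusters, 300))   # (52, 44)
print(locate_page(blob_clusters, 1023))  # (455, 255)
```

Because the cluster list is ordered, a page lookup is one division and one list index, regardless of where the clusters sit on disk.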

Blobstore Sample I/O
Blobs are read/written by specifying a relative page offset and a page count.

Example: Blob Write(Offset 254, 6 pages) covers page offsets 254 through 259 and is split at the cluster boundary:
- Index 0, Cluster 905: Disk Write(Offset 232583, 2 LBAs), covering LBA 232583 and LBA 232584
- Index 1, Cluster 52: Disk Write(Offset 13312, 4 LBAs), covering LBA 13312 through LBA 13315
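The split at the cluster boundary can be sketched as follows. This is a hedged illustration rather than the Blobstore implementation: split_io and the constants are hypothetical names, and the sketch only works out which cluster each run of pages lands in, not the on-disk LBA.

```python
# Illustrative sketch only; all names are hypothetical, not the SPDK API.
PAGES_PER_CLUSTER = 256  # from the slides: page offsets 0-255 per cluster
blob_clusters = [905, 52, 87, 455]  # example blob from the slides


def split_io(clusters, page_offset, page_count):
    """Split a blob-relative page range into per-cluster runs.

    Returns a list of (cluster number, first page in cluster, pages in run).
    """
    runs = []
    offset, remaining = page_offset, page_count
    while remaining > 0:
        index, page_in_cluster = divmod(offset, PAGES_PER_CLUSTER)
        # A run ends at the cluster boundary or when the request is done.
        run = min(remaining, PAGES_PER_CLUSTER - page_in_cluster)
        runs.append((clusters[index], page_in_cluster, run))
        offset += run
        remaining -= run
    return runs


print(split_io(blob_clusters, 254, 6))  # [(905, 254, 2), (52, 0, 4)]
```

The run lengths of 2 and 4 pages line up with the 2-LBA and 4-LBA disk writes in the example, assuming one LBA per page; the disk offset 13312 = 52 x 256 is consistent with that assumption.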

Blobstore Metadata
- Metadata is stored in pages in a reserved region
- Metadata pages are not shared between blobs
- A blob may have multiple pages of metadata

[Figure: metadata region on the SSD: Page 0 (Blob 1), Page 1 (Blob 2), Page 2 (Blob 3), Page 3 (Blob 1), Page 4 (Blob 4)]

Blobstore API
- open, close, read, write, sync, resize
- Asynchronous, callback-driven
- Read/write in units of pages, space allocation in clusters
- Data is direct
- Metadata is cached
- Minimal support for xattrs
- Independent of BlobFS
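To make "asynchronous, callback-driven" and "read/write in units of pages" concrete, here is a toy in-memory model. It is only a sketch of the calling style: ToyBlob, its methods, and the status code 0 are hypothetical and do not correspond to real SPDK functions or signatures.

```python
from collections import deque

PAGE_SIZE = 4096  # 4 KiB page, per the glossary


class ToyBlob:
    """Hypothetical in-memory stand-in for a blob; not the SPDK API."""

    def __init__(self, num_pages):
        self.pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        self._completions = deque()

    def write(self, page_offset, data, cb):
        # I/O is in whole pages; the call returns without waiting.
        assert len(data) % PAGE_SIZE == 0
        for i in range(len(data) // PAGE_SIZE):
            chunk = data[i * PAGE_SIZE:(i + 1) * PAGE_SIZE]
            self.pages[page_offset + i] = chunk
        self._completions.append(lambda: cb(0))

    def read(self, page_offset, page_count, cb):
        data = b"".join(self.pages[page_offset:page_offset + page_count])
        self._completions.append(lambda: cb(0, data))

    def poll(self):
        # Deliver pending completion callbacks; returns how many ran.
        delivered = 0
        while self._completions:
            self._completions.popleft()()
            delivered += 1
        return delivered


results = []
blob = ToyBlob(num_pages=4)
blob.write(1, b"x" * PAGE_SIZE, lambda status: results.append(("write", status)))
blob.read(1, 1, lambda status, data: results.append(("read", status, data[:1])))
blob.poll()
print(results)  # [('write', 0), ('read', 0, b'x')]
```

The caller never blocks: each operation returns immediately, and the result arrives later through the callback it supplied.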

BlobFS Design
- Layered on Blobstore
- User interacts with files
- Data can be cached
- Synchronous API*

* Asynchronous API possible

[Figure: Core 0, Core 1, and Core 2 with an async I/O thread; application threads call open()/write() and open()/read(); the async I/O thread drives the I/O device.]

BlobFS Caching
- Not a general purpose page cache
- Read ahead
- Sequential write buffering
- All other access bypasses cache

[Figure: Core 0, Core 1, and Core 2 with an async I/O thread; application threads issue open() followed by repeated write() calls, and open() followed by repeated read() calls; I/O device.]
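The write side of this policy can be sketched as below. This is an assumption-laden toy, not BlobFS code: ToyFileCache, flush_threshold, and the device_writes log are hypothetical, and read-ahead is not modeled.

```python
class ToyFileCache:
    """Hypothetical sketch of the write policy on the slide; not BlobFS code."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # assumed tunable, in bytes
        self.buffer = bytearray()
        self.buffer_start = 0    # file offset of the first buffered byte
        self.device_writes = []  # (offset, length) log standing in for real I/O

    def write(self, offset, data):
        next_sequential = self.buffer_start + len(self.buffer)
        if offset == next_sequential:
            # Sequential write: buffer it and flush in larger chunks.
            self.buffer += data
            if len(self.buffer) >= self.flush_threshold:
                self.flush()
        else:
            # Any other access bypasses the cache and goes to the device.
            self.flush()
            self.device_writes.append((offset, len(data)))
            self.buffer_start = offset + len(data)

    def flush(self):
        if self.buffer:
            self.device_writes.append((self.buffer_start, len(self.buffer)))
            self.buffer_start += len(self.buffer)
            self.buffer.clear()


cache = ToyFileCache(flush_threshold=8)
cache.write(0, b"abcd")   # sequential, buffered
cache.write(4, b"efgh")   # sequential, reaches threshold and flushes
cache.write(100, b"zz")   # not sequential, bypasses the buffer
print(cache.device_writes)  # [(0, 8), (100, 2)]
```

Two small sequential writes reach the device as one larger write, while the out-of-order write is passed through unchanged.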

Benchmark: db_bench Read/Write Latency
[Figure: two panels plotting latency (ms) against percentile latency (50, 75, 99, 99.9) for Kernel and SPDK; the y-axes span 0 to 2 ms and 0 to 20 ms.]
System Configuration: 2x Intel Xeon E5-2699v3, Intel Speed Step enabled, Intel Turbo Boost Technology enabled, 8x 8GB DDR4 2133 MT/s, 1 DIMM per channel, Fedora* Linux 25, Linux kernel 4.10.8, Intel P3700 NVMe SSD (800GB), FW 8DV101H0, SPDK 17.03, DPDK 17.02, RocksDB 5.1.2

Benchmark: db_bench Read/Write Throughput
[Figure: bar chart of transactions per second (y-axis 0 to 30000) for Kernel and SPDK.]
System Configuration: same as on the previous slide.

Next Steps
- Major API clarifications
- More & better benchmarking
- Use blobstore as a dynamic partitioner (bdev)
- BlobFS caching strategy is RocksDB-centric
- Asynchronous BlobFS API
- Sparse allocation of blobs
- More open source application integration?

SPDK Blobstore vs. Kernel: Latency
[Figure: two bar-chart panels of db_bench 99.99th percentile latency (µs), lower is better, comparing Kernel (256KB sync) with Blobstore (20GB Cache + Readahead). One panel covers Insert, Randread, and Overwrite (y-axis 0 to 7000); the other covers Readwrite (y-axis 0 to 140000). Data labels on the chart: 44%, 372%, 28%, 21%.]
SPDK Blobstore reduces tail latency by 3.7X.
System Configuration: 2x Intel Xeon E5-2699v3, Intel Speed Step enabled, Intel Turbo Boost Technology enabled, 8x 8GB DDR4 2133 MT/s, 1 DIMM per channel, Fedora* Linux 25, Linux kernel 4.10.8, Intel P3700 NVMe SSD (800GB), FW 8DV101H0, SPDK 17.03, DPDK 17.03, RocksDB 5.1.2

SPDK Blobstore vs. Kernel: Transactions Per Second
[Figure: bar chart of db_bench key transactions (keys per second, y-axis 0 to 1200000), higher is better, for Insert, Randread, Overwrite, and Readwrite. Data labels on the chart: 85%, 8%, 4%, ~0%.]
System Configuration: same as on the previous slide.