Ben Walker, Data Center Group, Intel Corporation
Notices and Disclaimers
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © 2017 Intel Corporation.
Agenda
- Introduction
- Use Cases
- Design
- Benchmarks
Lots of applications want to use SPDK, but they aren't designed to use the block device directly.
What does a filesystem do?
- Partitions
- Permissions
- Caching
- RAID
- Access Times
- Sparse Allocation
- Byte Granularity
- Snapshots
- Directories
- Checksums
- TRIM
- I/O Scheduling
What can SPDK do to help? Let's build some new components!
What sort of application benefits from SPDK?
- Lots of I/O
- Latency sensitive
- SAN? Database? Cache?
We picked two use cases: RocksDB and dynamic block allocation.
RocksDB
- Log-structured merge tree
- Written in C++, open source
- Pluggable storage backend
- Broadly adopted
- Makes minimal use of XFS: simple directory structure, predictable I/O pattern, minimal caching needs
- Recommends XFS, but requires no other filesystem features!
Glossary of Terms
- File: array of bytes; mutable, resizable; string name
- Object: array of bytes; immutable, replaceable; string name
- Page: 4 KiB
Design Goals
- Simple and efficient
- Designed for fast storage media
- Support file- and object-like semantics
Layering: BlobFS on top of Blobstore on top of a BDEV.
Blobstore Basics
- The user interacts with chunks of data called blobs: an array of pages; mutable, resizable; identified by an ID
- Asynchronous: no blocking, queueing, or waiting
- Fully parallel operations: no locks in the I/O path
Blobstore Space Allocation
[Diagram: the disk is carved into clusters of 256 pages; Cluster 0 holds Pages 0-255, each page mapping onto the device's LBAs 0-255.]
Blobstore Design
A blob is an array of pages, implemented as an ordered list of clusters that may live anywhere between LBA 0 and LBA N on the device:

  Blob cluster   Page offsets   Physical cluster
  0              0-255          905
  1              256-511        52
  2              512-767        87
  3              768-1023       455
Blobstore Sample I/O
Blobs are read and written by specifying a relative page offset and a page count. A request that crosses a cluster boundary is split into one disk I/O per cluster. Example from the diagram: Write(page offset 254, 6 pages) becomes:
- Pages 254-255 in cluster 905: Disk Write(offset 232583, 2 LBAs)
- Pages 256-259 in cluster 52: Disk Write(offset 13312, 4 LBAs)
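The translation above can be sketched in a few lines of C. This is illustrative only, not SPDK source: it assumes the slide's geometry of 256 pages per cluster and one LBA per page, and ignores any metadata-region offset the real on-disk layout may add (which is why only the second disk write from the example is reproduced exactly).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry from the slides: 256 pages per cluster, 1 page = 1 LBA. */
#define PAGES_PER_CLUSTER 256u

struct io_segment {
    uint64_t lba;       /* starting device LBA */
    uint64_t lba_count; /* number of LBAs in this segment */
};

/* Split a blob-relative I/O (page_offset, page_count) into per-cluster
 * device segments using the blob's ordered cluster list. Returns the
 * number of segments written into out[], or 0 on out-of-range I/O. */
static size_t
blob_io_to_segments(const uint64_t *clusters, size_t cluster_count,
                    uint64_t page_offset, uint64_t page_count,
                    struct io_segment *out, size_t max_segments)
{
    size_t n = 0;

    while (page_count > 0) {
        uint64_t ci = page_offset / PAGES_PER_CLUSTER;
        uint64_t within = page_offset % PAGES_PER_CLUSTER;
        uint64_t run = PAGES_PER_CLUSTER - within; /* pages left in cluster */

        if (ci >= cluster_count || n >= max_segments)
            return 0;
        if (run > page_count)
            run = page_count;

        out[n].lba = clusters[ci] * PAGES_PER_CLUSTER + within;
        out[n].lba_count = run;
        n++;

        page_offset += run;
        page_count -= run;
    }
    return n;
}
```

With the cluster list {905, 52, 87, 455} from the design slide, a write of 6 pages at page offset 254 splits into a 2-page segment in cluster 905 followed by a 4-page segment at LBA 13312 in cluster 52, matching the second disk write shown in the example.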
Blobstore Metadata
- Metadata is stored in pages in a reserved region of the SSD
- Metadata pages are not shared between blobs
- A blob may have multiple pages of metadata
[Diagram: a metadata region whose pages each belong to a single blob, e.g. Page 0 → Blob 1, Page 1 → Blob 2, Page 2 → Blob 3, Page 3 → Blob 1, Page 4 → Blob 4]
Blobstore API
- open, close, read, write, sync, resize
- Asynchronous, callback-driven
- Read/write in units of pages; space allocation in clusters
- Data is direct; metadata is cached
- Minimal support for xattrs
- Independent of BlobFS
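The "asynchronous, callback-driven" point is the heart of the API: every operation takes a completion callback plus a caller-owned context pointer, and nothing blocks. The toy model below shows the shape of that contract; the type and function names are made up for illustration (the real SPDK headers use different names and signatures), and completion here is immediate rather than driven by a polling I/O thread.

```c
#include <assert.h>
#include <stdint.h>

/* Callback-driven completion, modeled on Blobstore's style: the callback
 * receives the caller's context and an error code (0 == success). */
typedef void (*blob_op_complete)(void *cb_arg, int bserrno);

struct fake_blob {
    uint64_t num_pages;
};

/* Toy "resize": completes inline. In the real library the callback fires
 * later, from the thread that polls the device for completions. */
static void
blob_resize_async(struct fake_blob *blob, uint64_t num_pages,
                  blob_op_complete cb_fn, void *cb_arg)
{
    blob->num_pages = num_pages;
    cb_fn(cb_arg, 0);
}

/* Caller-side context: the callback records the result; no locks, no
 * blocking, matching the "no waiting in the I/O path" design point. */
struct resize_ctx {
    int done;
    int rc;
};

static void
resize_done(void *cb_arg, int bserrno)
{
    struct resize_ctx *ctx = cb_arg;
    ctx->rc = bserrno;
    ctx->done = 1;
}
```

A caller initializes a `resize_ctx`, passes `resize_done` and the context to the operation, and continues doing other work until `done` flips; chaining the next operation from inside the callback is the idiomatic pattern.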
BlobFS Design
- Layered on Blobstore
- User interacts with files
- Data can be cached
- Synchronous API (an asynchronous API is also possible)
[Diagram: application threads on Cores 0-2 call open()/read()/write(); requests are serviced by an async I/O thread in front of the I/O device]
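The synchronous BlobFS API can sit on top of the asynchronous layer with a simple pattern: the calling thread blocks on a condition variable, and the async I/O thread's completion callback signals it. A minimal sketch, assuming POSIX threads (all names here are illustrative, not BlobFS internals):

```c
#include <pthread.h>

/* One in-flight synchronous request: the caller waits, the async I/O
 * thread's completion callback wakes it up. */
struct sync_req {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int done;
    int rc;
};

static void
sync_req_init(struct sync_req *req)
{
    pthread_mutex_init(&req->lock, NULL);
    pthread_cond_init(&req->cond, NULL);
    req->done = 0;
    req->rc = 0;
}

/* Completion callback: runs on the async I/O thread. */
static void
sync_req_complete(void *cb_arg, int rc)
{
    struct sync_req *req = cb_arg;

    pthread_mutex_lock(&req->lock);
    req->rc = rc;
    req->done = 1;
    pthread_cond_signal(&req->cond);
    pthread_mutex_unlock(&req->lock);
}

/* Caller side: block until the callback fires, then return its status. */
static int
sync_req_wait(struct sync_req *req)
{
    pthread_mutex_lock(&req->lock);
    while (!req->done)
        pthread_cond_wait(&req->cond, &req->lock);
    pthread_mutex_unlock(&req->lock);
    return req->rc;
}
```

A synchronous `read()` would initialize a `sync_req`, submit the asynchronous operation with `sync_req_complete` as the callback, and return `sync_req_wait()`; this keeps the I/O path itself lock-free while confining the blocking to the caller's thread.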
BlobFS Caching
- Not a general-purpose page cache
- Read-ahead
- Sequential write buffering
- All other access bypasses the cache
[Diagram: open()/read()/write() calls from Cores 0-2 flow through the async I/O thread to the I/O device]
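Sequential write buffering can be sketched as a small state machine: a write that continues exactly where the previous one ended is appended to a buffer, and anything else flushes the buffer and takes the direct path. The buffer size, names, and flush policy below are invented for illustration, not taken from BlobFS.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sequential write buffer; not BlobFS source. */
#define WBUF_SIZE 4096u

struct write_buffer {
    uint64_t expected_next; /* next sequential file offset */
    uint32_t used;          /* valid bytes in buf[] */
    uint32_t flushes;       /* how many buffered writes hit the device */
    uint8_t buf[WBUF_SIZE];
};

static void
wbuf_flush(struct write_buffer *wb)
{
    if (wb->used > 0) {
        /* A real implementation would issue the Blobstore write here. */
        wb->flushes++;
        wb->used = 0;
    }
}

/* Returns 1 if the write was buffered, 0 if it bypassed the cache. */
static int
wbuf_write(struct write_buffer *wb, uint64_t offset,
           const void *data, uint32_t len)
{
    if (offset != wb->expected_next || len > WBUF_SIZE) {
        wbuf_flush(wb);
        return 0; /* non-sequential or oversized: direct I/O path */
    }
    if (wb->used + len > WBUF_SIZE)
        wbuf_flush(wb); /* buffer full: push it out, keep appending */

    memcpy(wb->buf + wb->used, data, len);
    wb->used += len;
    wb->expected_next = offset + len;
    return 1;
}
```

This matches the slide's policy in spirit: only the sequential stream benefits from buffering, and everything else falls through to direct I/O, which keeps the cache simple enough for a log-structured workload like RocksDB.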
Benchmark: db_bench Read/Write Latency
[Charts: read and write latency (ms) at the 50th, 75th, 99th, and 99.9th percentiles, Kernel vs. SPDK]
System Configuration: 2x Intel Xeon E5-2699v3, Intel SpeedStep enabled, Intel Turbo Boost Technology enabled, 8x 8GB DDR4 2133 MT/s (1 DIMM per channel), Fedora* Linux 25, Linux kernel 4.10.8, Intel P3700 NVMe SSD (800GB), FW 8DV101H0, SPDK 17.03, DPDK 17.02, RocksDB 5.1.2
Benchmark: db_bench Read/Write Throughput
[Chart: transactions per second, Kernel vs. SPDK]
System configuration as on the previous slide.
Next Steps
- Major API clarifications
- More & better benchmarking
- Use Blobstore as a dynamic partitioner (bdev)
- BlobFS caching strategy is RocksDB-centric
- Asynchronous BlobFS API
- Sparse allocation of blobs
- More open source application integration?
SPDK Blobstore vs. Kernel: Latency
[Charts: db_bench 99.99th percentile latency (µs), lower is better; Kernel (256KB sync) vs. Blobstore (20GB cache + read-ahead). Insert, Randread, and Overwrite improve by 44%, 28%, and 21% (in the order shown); Readwrite improves by 372%.]
SPDK Blobstore reduces tail latency by 3.7X.
System configuration as on the earlier benchmark slides, with DPDK 17.03.
SPDK Blobstore vs. Kernel: Transactions Per Second
[Chart: db_bench key transactions per second, higher is better, for Insert, Randread, Overwrite, and Readwrite; gains of 85%, 8%, 4%, and ~0% across the four workloads]
System configuration as on the earlier benchmark slides, with DPDK 17.03.