Low-Overhead Flash Disaggregation via NVMe-over-Fabrics
Vijay Balakrishnan, Memory Solutions Lab, Samsung Semiconductor, Inc.
August 2017
DISCLAIMER
This presentation and/or accompanying oral statements by Samsung representatives (collectively, the "Presentation") is intended to provide information concerning the SSD and memory industry and Samsung Electronics Co., Ltd. and certain affiliates (collectively, "Samsung"). While Samsung strives to provide information that is accurate and up-to-date, this Presentation may nonetheless contain inaccuracies or omissions. As a consequence, Samsung does not in any way guarantee the accuracy or completeness of the information provided in this Presentation.
This Presentation may include forward-looking statements, including, but not limited to, statements about any matter that is not a historical fact; statements regarding Samsung's intentions, beliefs, or current expectations concerning, among other things, market prospects, technological developments, growth, strategies, and the industry in which Samsung operates; and statements regarding products or features that are still in development. By their nature, forward-looking statements involve risks and uncertainties, because they relate to events and depend on circumstances that may or may not occur in the future. Samsung cautions you that forward-looking statements are not guarantees of future performance, and that the actual developments of Samsung, the market, or the industry in which Samsung operates may differ materially from those made or suggested by the forward-looking statements in this Presentation. In addition, even if such forward-looking statements are shown to be accurate, those developments may not be indicative of developments in future periods.
NVMe SSD
- NVMe: a high-performance, scalable interface for PCIe SSDs
- High performance through parallelization: a large number of deep submission/completion queues
- NVMe SSDs deliver high IOPS and bandwidth: 1M IOPS and 6 GB/s from a single device, 5x more than a SAS SSD and 20x more than a SATA SSD
- Industry standard supported by all major players
NVMe-over-Fabrics (NVMe-oF)
- A protocol interface to NVMe that enables operation over other interconnects (e.g., Ethernet, InfiniBand, Fibre Channel)
- Shares the same base architecture and NVMe host software as NVMe
- Parallelism: extends the multiple queue-pair design of NVMe over the network
- Enables NVMe scale-out and low-latency (<10µs latency target) operation on data center fabrics over RDMA transports: Ethernet (iWARP/RoCE), InfiniBand, Intel Omni-Path
- Avoids unnecessary protocol translations
[Diagram: NVMe host software on the server, over an RDMA fabric, to an NVMe-oF bridge or native NVMe-oF drive in the storage system]
More details @ FMS talks by Rob Davis (Mellanox) and Ilker Celebi (Samsung)
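As a sketch of how the NVMe host software extends over a fabric, the Linux kernel initiator can discover and connect to a remote subsystem with nvme-cli. This is a configuration procedure, not a definitive recipe: the address, port, and subsystem NQN below are hypothetical, and the commands assume an RDMA-capable NIC and a kernel with NVMe-oF support (4.8+).

```shell
# Load the NVMe-oF RDMA host driver (hypothetical setup; requires an
# RDMA-capable NIC such as one supporting RoCE or iWARP).
modprobe nvme-rdma

# Discover the NVMe subsystems exported by a target at a placeholder address.
nvme discover -t rdma -a 192.168.1.10 -s 4420

# Connect to a discovered subsystem; the NQN here is a made-up example.
nvme connect -t rdma -n nqn.2017-08.com.example:nvme-subsys0 \
             -a 192.168.1.10 -s 4420
```

After `nvme connect`, the remote namespace appears as an ordinary local block device (e.g., /dev/nvme0n1), which is what lets unmodified applications run against disaggregated flash.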
NVMe-oF Use Case Scenarios
1. Software-Defined Storage (SDS)
2. Hyper-Converged
3. Disaggregated JBOF as a direct-attached JBOF / SAS DAS replacement (THIS TALK)
[Diagram: servers reaching NVMe-oF bridges and PCIe-attached drives over an NVMe-oF network in each scenario]
Resource Utilization
- NVMe flash is underutilized: compute saturates before IOPS
- Need to enable sharing of SSDs, and to scale CPU independently
- Capacity is also underutilized as single-drive densities grow
[Chart: resource utilization with TPC-C/MySQL, system CPU vs. NVMe bandwidth]
Requirement: scale and manage IOPS, bandwidth, and CPU independently
Storage Disaggregation
- Separates compute and storage onto different nodes: high-speed networks (25/50/100Gb), hardware-accelerated low-overhead protocols (RoCE, iWARP), high-density flash
- Enables independent resource scaling: allows flexible infrastructure tuning for dynamic loads, reduces resource underutilization, and improves cost-efficiency by eliminating waste
- Remote access introduces overhead: additional interconnect latencies, plus network/protocol processing that affects both storage and compute nodes
NVMe SSD Disaggregation
- Applications layer: clients (TPC-C, DB-Bench) connecting over TCP/IP
- Data management layer: NVMe-oF initiators on the hosts / datastore servers
- Storage layer: NVMe-oF target servers (NVMe-oF bridges to PCIe-attached drives)
NVMe disaggregation is more challenging:
- At ~90μs device latency, network/protocol latencies are more pronounced
- At ~1M IOPS per device, protocol overhead taxes the CPU and degrades performance
Performance Analysis
- FIO: synthetic test to establish performance limits
- RocksDB: representing KV stores and NoSQL databases
- MySQL: representing traditional RDBMSs
FIO Methodology
Three configurations:
1. Baseline: local (direct-attached) storage
2. Remote storage with NVMe-oF over RoCEv2
3. Remote storage with iSCSI, tuned by applying best known methods
Hardware setup:
- 3 host servers (a.k.a. compute nodes, or datastore servers): Dell PowerEdge R730, 2x Intel Xeon E5-2680 v4 @ 2.40GHz (HT enabled, Turbo disabled, 56 CPU threads total), 128GB DDR4 RAM, 25Gb NIC, Ubuntu 16.04 with kernel 4.9.13
- 1 target server (a.k.a. storage server): Dell PowerEdge R930, 4x Intel Xeon E7-8890 v4 @ 2.20GHz (HT disabled, Turbo disabled, 96 CPU threads total), 512GB DDR4 RAM, 100Gb NIC, Ubuntu 16.04 with kernel 4.9.13, NVMe-oF target version 2.1.11
- Samsung PM1725 NVMe SSDs (8 drives max): random 750/120 KIOPS read/write, sequential 3000/2000 MB/s read/write
- Network: ConnectX-4 100Gb Ethernet NICs with RoCE support, 100Gb top-of-rack switch
- FIO per host: QD = 32, jobs = 16
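The per-host FIO load might look like the following invocation. This is a sketch: the device path is a placeholder, and the flags simply mirror the stated parameters (QD = 32, 16 jobs, 4K random I/O, here with an 80/20 read/write mix).

```shell
# Hypothetical fio job matching the per-host parameters from the methodology:
# 4K random I/O, 80% reads, queue depth 32, 16 parallel jobs.
# /dev/nvme0n1 is a placeholder for the device under test.
fio --name=randrw80 --filename=/dev/nvme0n1 \
    --rw=randrw --rwmixread=80 --bs=4k \
    --ioengine=libaio --direct=1 \
    --iodepth=32 --numjobs=16 \
    --runtime=120 --time_based --group_reporting
```

The same job can be pointed at the local PM1725 (DAS), the NVMe-oF block device, or the iSCSI LUN, which is how the three configurations are compared on equal terms.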
FIO Maximum Throughput
- NVMe-oF throughput is the same as DAS
[Chart: IOPS for 4K random read/write mixes (100/0, 80/20, 50/50, 20/80, 0/100); DAS vs. NVMe-oF vs. iSCSI]
FIO Host CPU Overhead
- CPU processing overhead is minimal
[Chart: host CPU utilization (%) for 4K random read/write mixes (100/0, 80/20, 50/50, 20/80, 0/100); DAS vs. NVMe-oF vs. iSCSI]
FIO Target Server Overhead
- CPU processing requirements are low: 90% of DAS read-only throughput with 1/12th of the target cores
- NVMe-oF requires less CPU than iSCSI to deliver consistent performance
[Chart: IOPS and target CPU utilization (%) for 100/0 and 80/20 read/write mixes; DAS vs. NVMe-oF and iSCSI at 32, 16, and 8 cores]
FIO Latency Under Load
- NVMe-oF latencies are similar to DAS for all practical loads, for both average and tail latency
[Chart: 4K random read latency (µs) vs. IOPS load, with a zoomed-in view; DAS, NVMe-oF, and iSCSI average and 95th-percentile latencies]
RocksDB
- Persistent key-value store optimized for flash
- Used as the data store in many applications: MongoDB, MySQL, Redis-on-Flash, etc.
- Benchmark: db_bench with 800B and 10KB objects, 80%/20% read/write mix
Setup:
- 3x host/initiator: Dell PowerEdge R730, 2x Intel Xeon E5-2680 v4 @ 2.40GHz (power management, Turbo, HT disabled), 128GB DDR4-2400 (4x 32GB), Mellanox ConnectX-4 25GbE NIC, Ubuntu 16.04 with kernel 4.9.13
- Storage server: Dell PowerEdge R930, 4x Intel Xeon E7-8890 v4 @ 2.20GHz (power management, Turbo disabled), 512GB DDR4-2400 (16x 32GB), 2x Mellanox ConnectX-4 100GbE NIC, Ubuntu 16.04 with kernel 4.9.13, PM1725 drives
- 100GbE top-of-rack switch; hosts attached at 25Gb, storage at 100Gb
RocksDB Throughput
- NVMe-oF performance on par with DAS: 2% throughput difference
[Charts: RocksDB performance (ops/s) for 800B and 10KB object sizes, DAS vs. NVMe-oF vs. iSCSI; disk bandwidth over time on the target]
RocksDB Latency
- NVMe-oF performance on par with DAS: average latency increases by 11% (507μs to 568μs), tail latency by 2% (99th percentile: 3.6ms to 3.7ms)
- 10% CPU utilization overhead on the host
[Chart: read latency CDF, DAS vs. NVMe-oF]
MySQL and TPC-C Setup
- MySQL version 5
- TPC-C: 500-warehouse setup with 150 connections, driven by tpcc-mysql
[Diagram: tpcc-mysql client connecting to a MySQL server running in Docker; the InnoDB data directory and log directory sit on filesystems backed by NVMe-oF block devices, served by an NVMe-oF target from NVMe SSDs]
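The split between the InnoDB data directory and log directory could be expressed in a my.cnf fragment like the one below. This is a sketch: the mount points are hypothetical names for filesystems backed by separate NVMe-oF block devices.

```ini
[mysqld]
# Data files on one NVMe-oF-backed filesystem (hypothetical mount point)...
datadir = /mnt/nvmeof-data/mysql
# ...redo logs on another, so data and log bandwidth can be scaled separately
innodb_log_group_home_dir = /mnt/nvmeof-log/mysql
# Bypass the page cache so I/O goes straight to the remote NVMe device
innodb_flush_method = O_DIRECT
```

Putting the data and log directories on distinct remote devices is what lets the later storage analysis report data-disk and log-disk bandwidth separately.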
Disaggregated Storage Setup
- Topology: 2 clients drive 10 MySQL servers over 10GbE; each MySQL server connects to storage over 25GbE; the target server has 2x 100GbE links to a 100GbE top-of-rack switch and serves the PM1725 drives
- Clients & hosts: Dell PowerEdge R730, 2x Intel Xeon E5-2699 v4 @ 2.20GHz (HT enabled, Turbo enabled, 88 CPU threads total), 128GB DDR4 RAM, Ubuntu 16.04 with kernel 4.9.13; clients use a 10GbE NIC, hosts use 1x Mellanox ConnectX-4 25GbE NIC
- Target server: Supermicro, 2x Intel Xeon Gold 6154 @ 3.00GHz (Skylake, 18 cores each; HT enabled, Turbo enabled, 72 CPU threads total), 384GB DDR4 RAM, Ubuntu 16.04 with kernel 4.9.13, 2x Mellanox ConnectX-4 100GbE NIC
MySQL TPC-C Performance
- NVMe-oF delivers scalable performance
[Charts: TpmC vs. number of MySQL/TPC-C instances (DAS projected, NVMe-oF, iSCSI); NVMe-oF host and target CPU utilization (%) vs. number of instances, at 2 instances per host]
MySQL/TPC-C: Storage Analysis
- Fewer drives: efficient utilization of NVMe SSDs
- Scalable performance
[Charts: TpmC vs. number of PM1725 drives, DAS projected vs. NVMe-oF, with 2- and 4-instance points annotated; data-disk read/write bandwidth and log-disk write bandwidth vs. number of MySQL/TPC-C instances]
MySQL Sysbench Performance
- NVMe-oF delivers scalable performance with fewer drives (4 data disks + 2 log disks, vs. 8 data disks + 8 log disks)
- Low target CPU utilization
[Charts: throughput normalized to DAS vs. number of MySQL/Sysbench instances (DAS projected; NVMe-oF at 45R/55W, 65R/35W, and 80R/20W mixes); total target CPU% vs. number of instances]
MySQL/Sysbench: Storage Analysis
- Efficient utilization of disk bandwidth
- Scale the number of disks as required by the application
[Charts: data-disk read and write bandwidth (MB/s) vs. number of MySQL/Sysbench instances, for the 45R/55W, 65R/35W, and 80R/20W mixes]
NVMe-oF Ecosystem Maturing
- Applications layer: clients (TPC-C, DB-Bench) over TCP/IP
- Data management layer: NVMe-oF initiators on hosts / datastore servers
- Storage layer: NVMe-oF target servers (NVMe-oF bridges to PCIe-attached drives)
- Building blocks in place: drivers, operating systems, NVMe SSDs, high-speed networks, RDMA-enabled hardware
Conclusions
- NVMe-oF reduces remote storage overhead to a bare minimum: low processing overhead on both host and target
- Applications (host) get the same performance
- The storage server (target) can support more drives with fewer cores
- NVMe SSDs + NVMe-oF enable an efficient disaggregation architecture for flash
Thanks
http://www.nvmexpress.org/
vijay.bala@samsung.com