Low-Overhead Flash Disaggregation via NVMe-over-Fabrics


Low-Overhead Flash Disaggregation via NVMe-over-Fabrics
Vijay Balakrishnan, Memory Solutions Lab, Samsung Semiconductor, Inc.
August 2017

DISCLAIMER: This presentation and/or accompanying oral statements by Samsung representatives (collectively, the "Presentation") is intended to provide information concerning the SSD and memory industry and Samsung Electronics Co., Ltd. and certain affiliates (collectively, "Samsung"). While Samsung strives to provide information that is accurate and up-to-date, this Presentation may nonetheless contain inaccuracies or omissions. As a consequence, Samsung does not in any way guarantee the accuracy or completeness of the information provided in this Presentation. This Presentation may include forward-looking statements, including, but not limited to, statements about any matter that is not a historical fact; statements regarding Samsung's intentions, beliefs, or current expectations concerning, among other things, market prospects, technological developments, growth, strategies, and the industry in which Samsung operates; and statements regarding products or features that are still in development. By their nature, forward-looking statements involve risks and uncertainties, because they relate to events and depend on circumstances that may or may not occur in the future. Samsung cautions you that forward-looking statements are not guarantees of future performance and that the actual developments of Samsung, the market, or the industry in which Samsung operates may differ materially from those made or suggested by the forward-looking statements in this Presentation. In addition, even if such forward-looking statements are shown to be accurate, those developments may not be indicative of developments in future periods.

NVMe SSD
- NVMe: a high-performance, scalable interface for PCIe SSDs
- High performance through parallelization: a large number of deep submission/completion queues
- NVMe SSDs deliver high IOPS and bandwidth: 1M IOPS and 6 GB/s from a single device, 5x more than SAS SSDs and 20x more than SATA SSDs
- An industry standard supported by all major players

NVMe-over-Fabrics (NVMe-oF)
- A protocol interface to NVMe that enables operation over other interconnects (e.g., Ethernet, InfiniBand, Fibre Channel)
- Shares the same base architecture and NVMe host software as NVMe
- Parallelism: extends the multiple queue-pair design of NVMe over the network
- Enables NVMe scale-out and low-latency (<10µs latency target) operation on data center fabrics
- Avoids unnecessary protocol translations
[Diagram: NVMe host software in the server, bridged over an RDMA fabric (Ethernet iWARP/RoCE, InfiniBand, Intel OmniPath) to an NVMe-oF bridge in the storage system in front of PCIe NVMe drives, alongside native NVMe-oF drives]
More details in the FMS talks by Rob Davis (Mellanox) and Ilker Celebi (Samsung)
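On a Linux initiator, the attach flow implied by the bridge diagram can be sketched with nvme-cli. The transport address, port, and subsystem NQN below are made-up placeholders (the deck does not give them), and the commands are composed and printed rather than executed, so the sketch reads safely without RDMA hardware:

```shell
# Hypothetical fabric parameters -- placeholders, not from the deck.
TRADDR=192.168.100.8      # target's RDMA-capable (RoCE) interface
TRSVCID=4420              # standard NVMe-oF service port
SUBNQN=nqn.2017-08.com.example:pm1725-0

# Discover what the target exports, then attach a remote namespace.
# Printed for illustration; run the strings directly on a real host.
DISCOVER="nvme discover -t rdma -a $TRADDR -s $TRSVCID"
CONNECT="nvme connect -t rdma -a $TRADDR -s $TRSVCID -n $SUBNQN"
echo "$DISCOVER"
echo "$CONNECT"
# After connect, the remote drive appears as a local /dev/nvmeXnY block
# device, which is why FIO/RocksDB/MySQL can run against it unchanged.
```

The key property for the rest of the deck is that the initiator-side block device is indistinguishable from a local NVMe namespace to applications.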

NVMe-oF Use Case Scenarios
[Diagram: servers reaching NVMe-oF bridges and drives over an NVMe-oF network in several topologies]
1. Software-Defined Storage (SDS)
2. Hyper-Converged Storage
3. Disaggregated JBOF / Direct-Attached JBOF (SAS DAS replacement) - the focus of this talk

Resource Utilization
- NVMe flash is underutilized: compute saturates before IOPS
- Need to enable sharing for SSDs and to scale CPU independently
- Capacity is also underutilized: single-drive densities are growing
[Chart: resource utilization (0-100%) with TPC-C/MySQL, system CPU vs. NVMe bandwidth]
Requirement: scale and manage IOPS, bandwidth, and CPU independently

Storage Disaggregation
- Separates compute and storage onto different nodes, enabled by:
  - High-speed networks (25/50/100Gb)
  - Hardware-accelerated, low-overhead protocols (RoCE, iWARP)
  - High-density flash
- Enables independent resource scaling:
  - Allows flexible infrastructure tuning for dynamic loads
  - Reduces resource underutilization
  - Improves cost-efficiency by eliminating waste
- Remote access introduces overhead:
  - Additional interconnect latencies
  - Network/protocol processing affects both storage and compute nodes

NVMe SSD Disaggregation
[Diagram: applications layer (clients running TPC-C and DB-Bench) connects over TCP/IP to the data management layer (NVMe-oF initiators on host/datastore servers), which connects over NVMe-oF to the storage layer (target servers with PCIe-bridged NVMe SSDs)]
- NVMe disaggregation is more challenging than disk disaggregation: at ~90μs device latency, network/protocol latencies are more pronounced
- At ~1M IOPS per device, protocol overheads tax the CPU and degrade performance
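On the storage-layer side, the Linux in-kernel NVMe-oF target can export a local drive through configfs. This is a generic sketch of that recipe, not necessarily the target stack used in the study (a later slide cites "NVMe-oF target version 2.1.11" without naming it); the NQN, device path, and address are placeholders, and a `run` helper prints each step instead of executing it, since the real steps need root and RDMA hardware:

```shell
# Print-only helper so the recipe is readable anywhere; on a real
# target, execute each printed line (as root) instead of echoing it.
run() { echo "+ $*"; }

NQN=nqn.2017-08.com.example:pm1725-0   # placeholder subsystem name
CFS=/sys/kernel/config/nvmet           # nvmet configfs mount point

run modprobe nvmet nvmet-rdma
run mkdir -p $CFS/subsystems/$NQN
run "echo 1 > $CFS/subsystems/$NQN/attr_allow_any_host"
run mkdir -p $CFS/subsystems/$NQN/namespaces/1
run "echo -n /dev/nvme0n1 > $CFS/subsystems/$NQN/namespaces/1/device_path"
run "echo 1 > $CFS/subsystems/$NQN/namespaces/1/enable"
run mkdir -p $CFS/ports/1
run "echo rdma > $CFS/ports/1/addr_trtype"        # RoCE/iWARP transport
run "echo ipv4 > $CFS/ports/1/addr_adrfam"
run "echo 192.168.100.8 > $CFS/ports/1/addr_traddr"
run "echo 4420 > $CFS/ports/1/addr_trsvcid"
run ln -s $CFS/subsystems/$NQN $CFS/ports/1/subsystems/$NQN
```

Binding the subsystem symlink into the port is what makes the namespace discoverable by initiators on the fabric.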

Performance Analysis
- FIO: synthetic tests to establish performance limits
- RocksDB: representing KV stores and NoSQL databases
- MySQL: representing traditional RDBMSs

FIO Methodology
Three configurations:
1. Baseline: local (direct-attached) storage
2. Remote storage with NVMe-oF over RoCEv2
3. Remote storage with iSCSI, tuned by applying best-known methods

Hardware setup:
- 3 host servers (a.k.a. compute nodes or datastore servers): Dell PowerEdge R730, 2x Intel Xeon E5-2680 v4 @ 2.40GHz (HT enabled, Turbo disabled, 56 CPU threads total), 128GB DDR4 RAM, 25Gb NIC, Ubuntu 16.04 with kernel 4.9.13
- 1 target server (a.k.a. storage server), NVMe-oF target version 2.1.11: Dell PowerEdge R930, 4x Intel Xeon E7-8890 v4 @ 2.20GHz (HT disabled, Turbo disabled, 96 CPU threads total), 512GB DDR4 RAM, 100Gb NIC, Ubuntu 16.04 with kernel 4.9.13
- Samsung PM1725 NVMe SSDs (8 drives max): random 750/120 KIOPS read/write, sequential 3000/2000 MB/s read/write
- Network: ConnectX-4 100Gb Ethernet NICs with RoCE support, 100Gb top-of-rack switch
- FIO per host: QD = 32, jobs = 16
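The read/write mix sweep used on the following throughput and CPU slides can be reproduced with one fio invocation per mix point. The device path is a placeholder for a DAS or NVMe-oF-attached namespace, QD=32 and 16 jobs follow the slide, and the loop prints each command instead of running it (the exact engine and runtime are assumptions, not stated in the deck):

```shell
DEV=/dev/nvme0n1   # placeholder: local or NVMe-oF-attached namespace
for MIX in 100 80 50 20 0; do   # % reads in the 4K random mix
  CMD="fio --name=randrw-$MIX --filename=$DEV --ioengine=libaio \
--direct=1 --rw=randrw --rwmixread=$MIX --bs=4k \
--iodepth=32 --numjobs=16 --group_reporting \
--time_based --runtime=60"
  echo "$CMD"      # printed for illustration; run on a real host
done
```

Running the same sweep against the local device, the NVMe-oF namespace, and the iSCSI LUN gives the three curves compared on the next slides.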

FIO Maximum Throughput
NVMe-oF throughput is the same as DAS.
[Chart: 4K random IOPS (up to ~2.5M) across the 100/0, 80/20, 50/50, 20/80, and 0/100 read/write mixes for DAS, NVMe-oF, and iSCSI]

FIO Host CPU Overhead
CPU processing overhead is minimal.
[Chart: host CPU utilization (0-50%) across the 100/0 through 0/100 4K random read/write mixes for DAS, NVMe-oF, and iSCSI]

FIO Target Server Overhead
- CPU processing requirements are low: 90% of DAS read-only throughput with 1/12th of the target cores
- NVMe-oF requires less CPU than iSCSI to deliver consistent performance
[Chart: IOPS and target CPU utilization for the 100/0 and 80/20 read/write mixes, comparing DAS against NVMe-oF and iSCSI with 8, 16, and 32 target cores]

FIO Latency Under Load
NVMe-oF latencies are similar to DAS for all practical loads, both average and tail.
[Chart: 4K random read latency (µs) vs. load (up to ~2.5M IOPS): DAS and NVMe-oF average and 95th-percentile latency and iSCSI average latency, with a zoomed-in view up to 1,200µs]

RocksDB
- Persistent key-value store optimized for flash
- Used as the data store in many applications: MongoDB, MySQL, Redis-on-Flash, etc.
- Benchmark: db_bench with 800B and 10KB objects, 80%/20% read/write mix
Setup:
- Hosts/initiators (x3): Dell PowerEdge R730, 2x Intel Xeon E5-2680 v4 @ 2.40GHz (power management, Turbo, and HT disabled), 128GB DDR4-2400 (4x 32GB), Mellanox ConnectX-4 25GbE NIC, Ubuntu 16.04 with kernel 4.9.13
- Storage server: Dell PowerEdge R930, 4x Intel Xeon E7-8890 v4 @ 2.20GHz (power management and Turbo disabled), 512GB DDR4-2400 (16x 32GB), 2x Mellanox ConnectX-4 100GbE NIC, Ubuntu 16.04 with kernel 4.9.13, PM1725 drives
- 100GbE top-of-rack switch
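db_bench ships with RocksDB and its stock flags can approximate the mix described here; the exact workload definition used in the study is not given, so treat this as an assumed approximation in which the 80/20 mix maps to the `readrandomwriterandom` benchmark with `--readwritepercent=80`, and the mount path, key count, thread count, and duration are placeholders:

```shell
VALUE_SIZE=800               # bytes; the deck also tests 10KB objects
DB_DIR=/mnt/nvmeof/rocksdb   # placeholder mount on the remote namespace
CMD="db_bench --db=$DB_DIR --benchmarks=readrandomwriterandom \
--readwritepercent=80 --value_size=$VALUE_SIZE \
--num=100000000 --threads=32 --duration=600"
echo "$CMD"                  # printed for illustration
```

Re-running with `VALUE_SIZE=10240` covers the 10KB object case; pointing `DB_DIR` at local, NVMe-oF, and iSCSI mounts yields the comparison on the next slides.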

RocksDB Throughput
NVMe-oF performance is on par with DAS: ~2% throughput difference.
[Charts: RocksDB operations/sec for 800B and 10KB object sizes under DAS, NVMe-oF, and iSCSI; disk bandwidth over time on the target]

RocksDB Latency
- NVMe-oF performance on par with DAS (~2% throughput difference)
- Average latency increases by 11% (507μs to 568μs); 99th-percentile latency increases by 2% (3.6ms to 3.7ms)
- 10% CPU utilization overhead on the host
[Chart: read latency CDF (percentage vs. latency in µs), DAS vs. NVMe-oF]

MySQL and TPC-C Setup
- MySQL version 5
- TPC-C (tpcc-mysql): 500-warehouse setup with 150 connections
[Diagram: tpcc-mysql client talks over ODBC to a Dockerized SQL server; InnoDB sits on a filesystem whose data and log directories live on NVMe-oF namespaces, exported by the NVMe-oF target from NVMe SSDs]
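tpcc-mysql drives a setup like this with two binaries, a loader and a runner. The 500 warehouses and 150 connections come from the slide; the host name, credentials, warm-up, and run length are placeholders, and the commands are printed rather than executed:

```shell
WAREHOUSES=500        # from the slide
CONNECTIONS=150       # from the slide
HOST=mysql-server-1   # placeholder MySQL server

# Load the dataset once, then run the benchmark (60s ramp, 1h measured).
LOAD="tpcc_load -h $HOST -d tpcc -u root -p secret -w $WAREHOUSES"
RUN="tpcc_start -h $HOST -d tpcc -u root -p secret \
-w $WAREHOUSES -c $CONNECTIONS -r 60 -l 3600"
echo "$LOAD"
echo "$RUN"           # printed for illustration
```

Because the MySQL data and log directories sit on NVMe-oF namespaces, the benchmark itself needs no awareness of the disaggregation.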

Disaggregated Storage Setup
[Diagram: 2 clients drive 10 MySQL servers over 10Gb links; each MySQL server connects over 25Gb Ethernet through a 100GbE top-of-rack switch to the storage server, which attaches the PM1725 drives via 2x 100Gb links]
- Clients and hosts: Dell PowerEdge R730, 2x Intel Xeon E5-2699 v4 @ 2.20GHz (HT enabled, Turbo enabled, 88 CPU threads total), 128GB DDR4 RAM, Ubuntu 16.04 with kernel 4.9.13; clients use a 10GbE NIC, hosts 1x Mellanox ConnectX-4 25GbE NIC
- Target server: Supermicro, 2x Intel Xeon Gold 6154 @ 3.00GHz (Skylake, 18 cores each; HT enabled, Turbo enabled, 72 CPU threads total), 384GB DDR4 RAM, Ubuntu 16.04 with kernel 4.9.13, 2x Mellanox ConnectX-4 100GbE NIC

MySQL TPC-C Performance
NVMe-oF delivers scalable performance.
[Charts: TpmC (up to ~3M) vs. number of MySQL/TPC-C instances for DAS (projected), NVMe-oF, and iSCSI; host and target CPU utilization vs. instance count, at 2 instances per host]

MySQL/TPC-C: Storage Analysis
- Fewer drives: efficient utilization of NVMe SSDs
- Scalable performance
[Charts: TpmC vs. number of PM1725 drives at 2 and 4 instances, DAS (projected) vs. NVMe-oF; data-disk read/write bandwidth and log-disk write bandwidth vs. number of MySQL/TPC-C instances]

MySQL Sysbench Performance
- NVMe-oF delivers scalable performance with fewer drives
- Low target CPU utilization
[Charts: throughput normalized to DAS vs. number of MySQL/Sysbench instances for the 45R, 65R, and 80R NVMe-oF mixes against projected DAS TPS, annotated "8 Data Disk / 8 Log Disk" and "4 Data Disk / 2 Log Disk"; total target CPU% stays below ~35% across the 45R55W, 65R35W, and 80R20W mixes]
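The Sysbench runs can be sketched with the stock `oltp_read_write` Lua workload shipped with sysbench 1.0; its default OLTP mix is not exactly the 45/65/80% read points on the chart, so this is an assumed approximation rather than the study's exact configuration, and the host, credentials, table sizing, and timings are placeholders:

```shell
HOST=mysql-server-1   # placeholder MySQL server
CMD="sysbench oltp_read_write --mysql-host=$HOST --mysql-user=sbtest \
--mysql-password=secret --tables=16 --table-size=10000000 \
--threads=64 --time=600 --report-interval=10 run"
echo "$CMD"           # printed for illustration
```

Sweeping the instance count while watching the target's CPU and per-disk bandwidth is what produces the scaling curves on this and the next slide.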

MySQL/Sysbench: Storage Analysis
- Efficient utilization of disk bandwidth
- Scale the number of disks as the application requires
[Charts: data-disk read bandwidth (up to ~7000 MB/s) and write bandwidth (up to ~3500 MB/s) vs. number of MySQL/Sysbench instances for the 45R55W, 65R35W, and 80R20W mixes]

NVMe-oF Ecosystem Maturing
[Diagram: the earlier stack view (clients running TPC-C/DB-Bench, TCP/IP to NVMe-oF initiators on host/datastore servers, NVMe-oF to target servers with PCIe-attached NVMe SSDs) overlaid on its enablers: drivers, operating systems, NVMe SSDs, high-speed networks, and RDMA-enabled hardware]

Conclusions
- NVMe-oF reduces remote storage overhead to a bare minimum
- Low processing overhead on both host and target: applications (host) get the same performance, and the storage server (target) can support more drives with fewer cores
- NVMe SSD + NVMe-oF enables an efficient disaggregation architecture for flash

Thanks
http://www.nvmexpress.org/
vijay.bala@samsung.com