A New Key-Value Data Store For Heterogeneous Storage Architecture

A New Key-Value Data Store For Heterogeneous Storage Architecture
brien.porter@intel.com, wanyuan.yang@intel.com, yuan.zhou@intel.com, jian.zhang@intel.com
Intel APAC R&D Ltd.

Agenda
- Introduction
- Background and Motivation
- Hybrid Layout Key-Value Data Store (HLKVDS) Design and Implementation
- Performance Overview
- Summary & Next Step

Introduction

Introduction
The Intel Big Data Technologies Team is part of Intel's Software and Services Group, tasked with delivering optimized big data and open source solutions on Intel platforms.
- Open source: Spark*, Hadoop*, OpenStack*, Ceph*, etc., working closely with the community and customers
- Technology and innovation oriented: real-time, in-memory, complex analytics; structured and unstructured data; agility, multi-tenancy, scalability, and elasticity
- Bridging advanced research and real-world applications
*Other names and brands may be claimed as the property of others. Intel and Intel logos are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

Background and Motivation

The Storage Hierarchy of Tomorrow
There is a memory-storage gap in the traditional storage hierarchy; 3D XPoint technology breaks the memory/storage barrier.
- Memory-like non-volatile memory (persistent memory, PM): appears as memory to the application, which loads/stores data directly
- Disk-like non-volatile memory: appears as a disk drive to the application and is accessed with read/write
[Diagram: storage hierarchy from fastest to slowest - CPU, DRAM, Persistent Memory, Intel Optane storage (3D XPoint based NVMe SSD), NAND SSD, hard disk drives]

NVM Library
The NVM Library helps developers build applications on PM directly.
- Open sourced at http://pmem.io
- Provides transactional access methods: libpmemobj, libpmemblk, libpmemlog
[Picture source] https://www.snia.org/sites/default/orig/nvm2014_ppt/rudoff-overview-of-the-nvm-programming-model.pdf
Persistent memory programming will be easier with the NVM Library. A minimal usage sketch follows below.
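
For illustration only (not part of the slides), the C++ sketch below appends one record to a persistent-memory pool using libpmemlog, one of the NVM Library interfaces listed above. The pool path and the record contents are hypothetical.

```cpp
// Minimal sketch of appending a record to a persistent-memory log with
// libpmemlog (part of the NVM Library / PMDK). Pool path is illustrative.
#include <libpmemlog.h>
#include <cstdio>

int main() {
    const char *path = "/mnt/pmem/hlkvds.log";               // hypothetical PM mount point
    PMEMlogpool *plp = pmemlog_create(path, PMEMLOG_MIN_POOL, 0666);
    if (plp == nullptr)
        plp = pmemlog_open(path);                             // pool may already exist
    if (plp == nullptr) {
        perror("pmemlog_create/open");
        return 1;
    }

    const char record[] = "key=foo,value=bar";
    if (pmemlog_append(plp, record, sizeof(record)) != 0)     // durable append to PM
        perror("pmemlog_append");

    pmemlog_close(plp);
    return 0;
}
```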

Motivation
- New storage technology is emerging: 3D XPoint technology in hardware and a new programming model in software
- Persistent memory will be a popular and powerful storage medium in future storage solutions
- There are challenges in today's key-value store solutions for cloud storage that need to be resolved
- There is a need to re-design the storage engine and re-think how best to leverage persistent memory in cloud storage

Key-Value Store Engine Usage Model in Cloud Storage
- Mission-critical metadata store: as a key-value metastore for any distributed storage system, starting with Ceph, a distributed object, block, and file storage platform (http://ceph.com)
- Hybrid storage engine: as a cache or storage engine for a distributed storage system, starting with HDCS, a hyper-converged cache solution for the cloud (https://github.com/intel-bigdata/hdcs)
- As a high-performance key-value database

Key-Value Store Engine as Metadata Store in Ceph BlueStore
- Ceph is one of the most popular open source storage solutions. Ceph contains two types of daemons: the Monitor provides extremely reliable and durable storage of cluster membership, configuration, and state; the OSD provides the actual data storage.
- BlueStore is the newest OSD store backend: it consumes raw block device(s), uses a key/value database (RocksDB) for metadata, and writes data directly to the block device.
- Challenges in the Ceph key-value store: RocksDB compaction brings unstable performance, so a persistent-memory-aware key-value store is needed.
[Picture source] https://www.slideshare.net/sageweil1/bluestore-a-new-faster-storage-backend-for-ceph
Intel does not control or audit third-party information or the web sites referenced in this document. You should visit the referenced web sites and confirm whether the referenced data are accurate.

Key-Value Store Engine in Hyper-Converged Clouds
Hyper-converged Distributed Cache Store (HDCS, https://github.com/intel-bigdata/hdcs):
- Provides caching for different APIs: block, object, and file
- Supports various cloud storage backends: Ceph, GlusterFS, Swift
- Advanced services: de-duplication, compression, QoS, snapshots
- Connects to different backends: private cloud as a warm tier, public cloud as a cold tier
Unique I/O characteristics in hyper-converged workloads:
- High concurrency, typically fixed-length requests (e.g., 4K)
- Advanced data services like de-duplication, compression, and data tiering
- Extremely high performance and low latency, which may eliminate the dependency on a local file system
[Diagram: behind the HDCS interface to cloud storage, HLKVDS handles write caching on PM and read caching on SSD, with QoS, compression, and dedup services and the public cloud holding cold data]

Hybrid Layout Key-Value Data Store (HLKVDS) Design and Implementation

Design Goal
- A simple and general solution
- Designed for the cloud, with a usage scenario of large, concurrent, fixed-length random write/read requests
- A focus on high performance: a hybrid storage engine that supports both persistent memory and SSD
[Diagram: in the traditional stack, the storage service puts fixed-length data and metadata into a key-value DB on a file system over SSD/HDD; with HLKVDS, the storage service stores them directly on PM and SSD]

HLKVDS Architecture Overview
- A key-value store engine with Put, Delete, Get, and Iterator APIs (a sketch of such an interface follows below)
- The hybrid storage engine supports various hardware: DRAM, persistent memory, SSD
- Provides rich features optimized for persistent memory and SSD devices: space provisioning and recycling, data tiering, cache semantics, and a data layout optimized for each medium
[Diagram: reads and writes flow through in-RAM structures (hash table index, write buffer, read cache, state table) to a main store area on PM and a meta area plus main store area on SSD]
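
The slide names Put, Delete, Get, and Iterator as the engine's APIs. Below is a hedged C++ sketch of what such an interface could look like; all names and signatures are illustrative, not the actual HLKVDS headers.

```cpp
#include <memory>
#include <string>

// Illustrative interface only; the real HLKVDS headers may differ.
class Iterator {
public:
    virtual ~Iterator() = default;
    virtual bool Valid() const = 0;
    virtual void Next() = 0;
    virtual std::string Key() const = 0;
    virtual std::string Value() const = 0;
};

class KVStore {
public:
    virtual ~KVStore() = default;
    // Insert or overwrite a key-value pair (4K fixed-length values in the cloud use case).
    virtual bool Put(const std::string& key, const std::string& value) = 0;
    // Deletion is modeled as inserting the key with a null value (see the write workflow).
    virtual bool Delete(const std::string& key) = 0;
    // Look up a value; served from the write buffer, read cache, PM, or SSD.
    virtual bool Get(const std::string& key, std::string* value) = 0;
    virtual std::unique_ptr<Iterator> NewIterator() = 0;
};
```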

HLKVDS Design Details: In-Memory Hash Index Table
- HKey: RMD160(Key) -> Digest, a fixed-length entry in memory. Note: RIPEMD-160 is used to generate the key digest; the probability of collision is infinitesimally small in practice, even for billions of keys.
- HValue: Hash(Digest) -> HashEntry, with separate chaining via linked lists
- Write amplification: the hash entry is written twice on the device, sacrificing space to increase random read performance
On-device layout: each data header holds the digest (20 bytes), key size (4 bytes), data size (4 bytes), data offset (4 bytes), and next header offset (4 bytes). Each in-RAM hash entry holds this header plus a physical address, namely a 1-byte location (PM or SSD) and an 8-byte physical offset, along with a cache pointer. A struct sketch follows below.
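
The field sizes above translate into a struct layout roughly like the following C++ sketch; the field and type names are ours, only the sizes come from the slide.

```cpp
#include <cstdint>

// Illustrative layout matching the sizes on the slide; names are assumptions.
#pragma pack(push, 1)
struct DataHeader {              // persisted in front of each KV pair
    uint8_t  digest[20];         // RIPEMD-160 digest of the key (20 bytes)
    uint32_t key_size;           // 4 bytes
    uint32_t data_size;          // 4 bytes
    uint32_t data_offset;        // 4 bytes: position of the value in the segment
    uint32_t next_header_offset; // 4 bytes: next data header in the same segment
};

struct PhysicalAddress {
    uint8_t  location;           // 1 byte: PM or SSD
    uint64_t offset;             // 8 bytes: offset on that device
};
#pragma pack(pop)

struct HashEntry {               // fixed-length in-memory index entry
    DataHeader      header;
    PhysicalAddress addr;
    HashEntry*      next;        // separate chaining on hash collisions
    void*           cache_ptr;   // pointer into the read cache, if resident
};
```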

HLKVDS Design Details: Data Layout
- Semantic abstraction of data placement: the data layout is optimized to an SSD-friendly I/O pattern
- Fixed-length values are stored page-aligned
[Diagram: a segment starts with a segment header (timestamp, checksum, nkeys); data headers and inlined KV pairs are packed forward from headpos_, while page-aligned values are packed backward from tailpos_. Each data header (digest 20 bytes, key size 4 bytes, data size 4 bytes, data offset 4 bytes, next header offset 4 bytes) stores the relative positions of the key, value, and next data header within the segment.]
A sketch of this segment layout follows below.
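
One plausible reading of that layout is sketched in C++ below, reusing the DataHeader struct from the index sketch above; the packing rules, segment-header size, and method names are assumptions, not the actual HLKVDS code.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch: headers and inlined keys grow forward from headpos_, while
// page-aligned values grow backward from tailpos_.
class Segment {
public:
    explicit Segment(size_t size = 128 * 1024)             // 128 KB, as in the test setup
        : buf_(size), headpos_(kSegHeaderSize), tailpos_(size) {}

    bool Append(DataHeader hdr, const void* key, size_t key_len,
                const void* value, size_t value_len) {
        size_t aligned = (value_len + kPageSize - 1) & ~(kPageSize - 1);
        if (aligned > tailpos_ ||
            headpos_ + sizeof(hdr) + key_len > tailpos_ - aligned)
            return false;                                   // segment full: seal it and start a new one
        tailpos_ -= aligned;
        std::memcpy(&buf_[tailpos_], value, value_len);     // page-aligned value, packed from the tail
        hdr.data_offset = static_cast<uint32_t>(tailpos_);  // header records where its value lives
        hdr.next_header_offset =
            static_cast<uint32_t>(headpos_ + sizeof(hdr) + key_len);
        std::memcpy(&buf_[headpos_], &hdr, sizeof(hdr));    // header + inlined key, packed from the head
        std::memcpy(&buf_[headpos_ + sizeof(hdr)], key, key_len);
        headpos_ += sizeof(hdr) + key_len;
        ++nkeys_;
        return true;
    }

private:
    static constexpr size_t kPageSize = 4096;
    static constexpr size_t kSegHeaderSize = 64;            // timestamp, checksum, nkeys, ... (assumed size)
    std::vector<uint8_t> buf_;
    size_t headpos_, tailpos_;
    uint32_t nkeys_ = 0;
};
```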

HLKVDS Design Details: Space Provisioning
- A state table maintains the state of every segment in RAM; the PM and SSD state tables are persisted to SSD
- UseState: the state of the segment (free, used, or reserved)
- FreeSize: how much free space remained in the segment when it was written to the device
- DeathSize: how much invalid key-value data the segment holds (when a key-value pair is deleted or updated, the copy stored in the original segment becomes invalid)
A struct sketch of this per-segment state follows below.
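
A hedged C++ sketch of this per-segment bookkeeping; only the three fields come from the slide, all names and types are assumptions.

```cpp
#include <cstdint>
#include <vector>

enum class UseState : uint8_t { kFree, kUsed, kReserved };

struct SegmentState {
    UseState use_state  = UseState::kFree;
    uint32_t free_size  = 0;   // free bytes left when the segment was written to the device
    uint32_t death_size = 0;   // bytes belonging to deleted/updated (invalid) KV pairs
};

class SegmentStateTable {
public:
    explicit SegmentStateTable(size_t nsegments) : states_(nsegments) {}

    // Called when a KV pair in `seg_id` is overwritten or deleted, so the
    // garbage collector can later rank segments by reclaimable space.
    void AddDeath(uint32_t seg_id, uint32_t bytes) {
        states_[seg_id].death_size += bytes;
    }

    const SegmentState& Get(uint32_t seg_id) const { return states_[seg_id]; }

private:
    std::vector<SegmentState> states_;  // one entry per segment, persisted to the SST areas on SSD
};
```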

HLKVDS Architecture: Write Workflow
- Step 1: aggregate incoming KVSlices (key, digest, value) in the write buffer until a segment is full or a timeout fires (timeout & immediacy)
- Step 2: write the segment to persistent memory with log appending; data is later tiered to SSD by space recycling
- Steps 3-7: update the in-memory metadata (state table, read cache, hash table index, segment stats)
- A delete operation is equivalent to inserting a KVSlice with a null value (step 5)
[Diagram: writes flow through the write buffer in RAM into segments in the PM main store area; the meta area on SSD holds the superblock, index, PM SST, and SSD SST]
The sketch below shows the buffer-and-flush aggregation step.
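
The sketch below condenses the aggregation step (buffer KVSlices, flush a segment on "full or timeout") into self-contained C++; the thresholds and names are assumptions, and the PM log append and metadata updates are left to the caller.

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

struct KVSlice {
    std::string digest;   // RIPEMD-160 digest of the key
    std::string value;    // empty value acts as a delete marker
};

class WriteBuffer {
public:
    WriteBuffer(size_t segment_bytes, std::chrono::milliseconds timeout)
        : cap_(segment_bytes), timeout_(timeout),
          last_flush_(std::chrono::steady_clock::now()) {}

    // Returns true when the caller should seal the buffer into a segment
    // and log-append that segment to persistent memory (step 2).
    bool Add(KVSlice slice) {
        bytes_ += slice.digest.size() + slice.value.size();
        slices_.push_back(std::move(slice));
        bool full    = bytes_ >= cap_;
        bool expired = std::chrono::steady_clock::now() - last_flush_ >= timeout_;
        return full || expired;
    }

    std::vector<KVSlice> Seal() {                 // hand the batch to the segment writer
        last_flush_ = std::chrono::steady_clock::now();
        bytes_ = 0;
        return std::move(slices_);
    }

private:
    size_t cap_;
    std::chrono::milliseconds timeout_;
    std::chrono::steady_clock::time_point last_flush_;
    std::vector<KVSlice> slices_;
    size_t bytes_ = 0;
};
```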

HLKVDS Architecture: Read Workflow
- The value is obtained by a hash calculation (digest the key, look up the hash table index) and a read from the device
- Reads are accelerated with a key-value LRU read cache
[Diagram: a read consults the hash table index and write buffer in RAM, then the LRU read cache, and finally the main store areas on PM or SSD]
The sketch below illustrates this lookup path.
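
A hedged, self-contained C++ sketch of this lookup order; the digest and device-read routines are stubs standing in for the real code, and the cache is simplified to a plain map rather than a true LRU.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

struct Address { bool on_pm; uint64_t offset; uint32_t size; };

std::string Ripemd160(const std::string& key) { return key; }                       // stub: real code uses RIPEMD-160
std::string ReadFromDevice(const Address& a)  { return std::string(a.size, 'v'); }  // stub: real code reads PM/SSD

class ReadPath {
public:
    std::optional<std::string> Get(const std::string& key) {
        std::string digest = Ripemd160(key);             // steps 1-2: digest the key, look up the index
        auto it = index_.find(digest);
        if (it == index_.end())
            return std::nullopt;                         // key does not exist

        auto cached = cache_.find(digest);               // step 3: key-value read cache
        if (cached != cache_.end())
            return cached->second;

        std::string value = ReadFromDevice(it->second);  // step 4: read the value from PM or SSD
        cache_[digest] = value;                          // step 5: warm the cache for later reads
        return value;
    }

private:
    std::unordered_map<std::string, Address> index_;       // digest -> physical address
    std::unordered_map<std::string, std::string> cache_;   // simplified stand-in for the LRU cache
};
```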

HLKVDS Architecture: Space Recycling
- Data on persistent memory is tiered to SSD based on a threshold
- Data on SSD undergoes garbage collection
[Diagram: segments are aggregated from the PM main store area and written to the SSD main store area, where the garbage collector reclaims invalid space; the state tables in the meta area track segment usage]
The sketch below shows a threshold-driven pick of PM segments to tier.
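
A speculative C++ sketch of the threshold check and victim choice for PM-to-SSD tiering; the watermark value and all names are assumptions, not taken from the HLKVDS source.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr size_t kSegmentSize = 128 * 1024;   // matches the segment size used in the tests

struct PmSegment { uint32_t id; uint64_t timestamp; };

// Once PM usage crosses a watermark, pick the oldest PM segments (earliest
// timestamps, since PM is written as a log) to migrate to SSD.
std::vector<PmSegment> PickSegmentsToTier(std::vector<PmSegment> pm_segments,
                                          size_t used_bytes, size_t capacity_bytes,
                                          double watermark = 0.75) {
    std::vector<PmSegment> victims;
    if (static_cast<double>(used_bytes) / capacity_bytes < watermark)
        return victims;                                        // below the threshold: nothing to do

    std::sort(pm_segments.begin(), pm_segments.end(),          // oldest (coldest) data first
              [](const PmSegment& a, const PmSegment& b) { return a.timestamp < b.timestamp; });

    size_t to_free = used_bytes - static_cast<size_t>(watermark * capacity_bytes);
    for (size_t i = 0, freed = 0; i < pm_segments.size() && freed < to_free;
         ++i, freed += kSegmentSize)
        victims.push_back(pm_segments[i]);                     // these segments are rewritten onto SSD
    return victims;
}
```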

HLKVDS Architecture: Space Recycling and SSD Garbage Collection
- Semantic abstraction of space recycling: the pick strategy is based on timestamp for persistent memory and on valid space capacity for SSD
- Always pick the segments that are most worth recycling, sort them, and merge the surviving valid KV pairs into new segments
The sketch below shows the SSD-side victim selection.
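
A hedged C++ sketch of the SSD-side victim selection, ranking segments by reclaimable (dead plus unused) space; field names follow the state-table sketch above and are assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct SsdSegment {
    uint32_t id;
    uint32_t death_size;   // bytes of deleted/overwritten KV pairs
    uint32_t free_size;    // unused bytes at the end of the segment
};

std::vector<uint32_t> PickGcVictims(std::vector<SsdSegment> segments, size_t max_victims) {
    // Most reclaimable space first (dead data plus unused tail space).
    std::sort(segments.begin(), segments.end(),
              [](const SsdSegment& a, const SsdSegment& b) {
                  return a.death_size + a.free_size > b.death_size + b.free_size;
              });
    std::vector<uint32_t> victims;
    for (const auto& seg : segments) {
        if (victims.size() >= max_victims) break;
        if (seg.death_size + seg.free_size == 0) break;   // remaining segments are fully live
        victims.push_back(seg.id);   // valid pairs in these segments are merged into new segments
    }
    return victims;
}
```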

HLKVDS Design Details: Fast Recovery
- The superblock sets a flag (flag_clean_shutdown) when the store is shut down cleanly
- The hash table index is written to the device periodically, together with the last timestamp
- On recovery, the system loads the last written hash table index checkpoint, then scans segments whose timestamp is newer than the last saved timestamp and inserts their key-value pairs into the restored hash table index
The sketch below outlines this recovery flow.
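
A hedged C++ sketch of this recovery flow with minimal stand-in types; the on-device formats and names are assumptions.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct SuperBlock { bool clean_shutdown; uint64_t last_checkpoint_ts; };
struct KVRecord   { std::string digest; uint64_t address; };
struct Segment    { uint64_t timestamp; std::vector<KVRecord> records; };

using HashIndex = std::map<std::string, uint64_t>;   // digest -> physical address

HashIndex Recover(const SuperBlock& sb,
                  HashIndex checkpoint,              // hash table index persisted periodically
                  const std::vector<Segment>& segments) {
    if (sb.clean_shutdown)
        return checkpoint;                           // checkpoint is already complete

    // Dirty shutdown: replay only the segments written after the checkpoint.
    for (const Segment& seg : segments) {
        if (seg.timestamp <= sb.last_checkpoint_ts)
            continue;
        for (const KVRecord& rec : seg.records)
            checkpoint[rec.digest] = rec.address;    // re-insert the recovered KV pairs
    }
    return checkpoint;
}
```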

Hybrid Layout Key-Value Data Store (HLKVDS) Performance

Test Configurations & Methodology
Test configuration:
- Each key is 10 bytes, each value is 4096 bytes
- Different concurrency levels (1-512) to simulate cloud storage scenarios
Benchmarks (a simulation of db_bench*; refer to the backup for details):
- Write test: insert 1 million key-value pairs at different concurrency levels
- Read test: insert 10 million key-value pairs to warm up first, then get 1 million key-value pairs at different concurrency levels
Test environment:
- CPU: 2 x Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80 GHz (10 cores each, HT enabled, 40 vCPUs total)
- Memory: 64 GB
- OS: Ubuntu 14.04
- Device: Intel SSD DC P3600 (2.0 TB)
- Misc: RocksDB (v5.2) with ext4 filesystem; HLKVDS segment_size = 128 KB
A sketch of the concurrent fixed-length write workload follows below.
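
A hedged C++ sketch of what such a db_bench-style write test can look like, using the illustrative KVStore interface sketched earlier; it is not the actual benchmark tool used for the published results.

```cpp
#include <atomic>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// `concurrency` client threads insert fixed-length pairs (10-byte key,
// 4096-byte value) until `total_pairs` have been written.
void RunWriteTest(KVStore& store, int concurrency, long total_pairs = 1'000'000) {
    std::atomic<long> next{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < concurrency; ++t) {
        workers.emplace_back([&] {
            const std::string value(4096, 'v');                 // fixed 4K value
            for (long i = next.fetch_add(1); i < total_pairs; i = next.fetch_add(1)) {
                char key[11];
                std::snprintf(key, sizeof(key), "%010ld", i);   // fixed 10-byte key
                store.Put(std::string(key, 10), value);
            }
        });
    }
    for (auto& w : workers) w.join();   // IOPS and latency would be measured around Put()
}
```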

Write Performance
Intel SSD DC P3600 specification: sequential read 2600 MB/s, sequential write 1700 MB/s, random read 450,000 IOPS, random write 56,000 IOPS
4K write performance:
- Best latency: 90K IOPS throughput, 0.25 ms average latency
- Peak IOPS: 413K IOPS throughput, 1.2 ms average latency
Best latency at concurrency = 32: ~1.5X performance advantage over RocksDB with latency < 0.5 ms
Peak IOPS at concurrency = 512: ~6.5X performance advantage over RocksDB with latency < 1.5 ms

Read Performance
Intel SSD DC P3600 specification: sequential read 2600 MB/s, sequential write 1700 MB/s, random read 450,000 IOPS, random write 56,000 IOPS
4K read performance:
- Best latency: 8.5K IOPS throughput, 0.11 ms average latency
- Peak IOPS: 450K IOPS throughput, 0.25 ms average latency
Simulated cloud storage scenarios with a large number of random reads. At concurrency = 128, ~4X read performance advantage over RocksDB with latency < 0.25 ms.

Summary & Next Step

Summary & Next Step
Summary:
- HLKVDS is a new key-value data store designed for hybrid hardware environments combining DRAM, PM, and SSD. It delivers good performance in cloud storage scenarios with large numbers of concurrent fixed-length random write/read requests.
Next steps:
- Performance evaluation as the Ceph BlueStore metadata store
- Performance evaluation on PM
Call to action:
- Go to https://github.com/intel-bigdata/hlkvds to download
- Share your feedback with us at wanyuan.yang@intel.com or brien.porter@intel.com

Legal Notices
Copyright 2017 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Intel Inside, and 3D XPoint are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Backup

Performance Test
- Implemented the prototype with one tier and an aggregate-write-with-timeout strategy
- Implemented a benchmark tool with reference to db_bench [1]
- Simulated cloud storage scenarios: data accessed at different concurrency levels, 4K fixed-length values
- Performance comparison with RocksDB
[1] https://github.com/facebook/rocksdb/wiki/performance-benchmarks