HashKV: Enabling Efficient Updates in KV Storage via Hashing

HashKV: Enabling Efficient Updates in KV Storage via Hashing
Helen H. W. Chan, Yongkun Li, Patrick P. C. Lee, Yinlong Xu
The Chinese University of Hong Kong / University of Science and Technology of China
USENIX ATC 2018

Background
- Update-intensive workloads are common in key-value (KV) stores:
  - Online transaction processing (OLTP)
  - Enterprise servers
  - Yahoo's workloads are shifting from reads to writes [*]
- Log-structured merge (LSM) tree:
  - Transforms random writes into sequential writes
  - Supports efficient range scans
  - Limitation: high read and write amplification during compaction

[*] Sears et al., "bLSM: A General Purpose Log Structured Merge Tree", SIGMOD 2012

LSM-tree in LevelDB
[Figure: MemTable and Immutable MemTable in memory; flushed to sorted SSTables (KV pairs + metadata) on disk, organized into levels L0, L1, ..., Lk]
Compaction:
1. Read SSTables
2. Merge and sort by keys
3. Split into new SSTables
=> High I/O amplification!
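As a rough illustration of the compaction steps above, the C++ sketch below merges two sorted SSTables (newer version wins per key) and splits the result into new tables. It is illustrative only, not LevelDB's code; `SSTable`, `Compact`, and `max_entries_per_table` are assumed names.

```cpp
// Illustrative compaction merge: rewriting whole sorted tables like this is
// what causes the read/write amplification noted above.
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using SSTable = std::map<std::string, std::string>;  // sorted key -> value

std::vector<SSTable> Compact(const SSTable& newer, const SSTable& older,
                             size_t max_entries_per_table) {
    // Step 1+2: read both tables and merge by key, keeping the newer version.
    SSTable merged = older;
    for (const auto& kv : newer) merged[kv.first] = kv.second;

    // Step 3: split the merged run into new fixed-size SSTables.
    std::vector<SSTable> output(1);
    for (const auto& kv : merged) {
        if (output.back().size() == max_entries_per_table) output.emplace_back();
        output.back().insert(kv);
    }
    return output;  // new SSTables written to the next level
}
```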

KV Separation [*]
- Store values separately to reduce the LSM-tree size:
  - LSM-tree: keys and metadata for indexing
  - vlog: circular log for KV pairs
[Figure: KV pairs appended to the vlog between the log head and the log tail]

[*] Lu et al., "WiscKey: Separating Keys from Values in SSD-Conscious Storage", FAST 2016
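The following sketch illustrates the KV-separation write path described above, with an in-memory string standing in for the on-disk vlog. It is a simplified illustration of the idea, not WiscKey's actual code; `ValuePointer`, `lsm_index`, and `PutSeparated` are assumed names.

```cpp
#include <cstdint>
#include <map>
#include <string>

struct ValuePointer { uint64_t offset; uint32_t size; };

std::string vlog;                               // stand-in for the on-disk circular log
std::map<std::string, ValuePointer> lsm_index;  // LSM-tree keeps key -> value location only

void PutSeparated(const std::string& key, const std::string& value) {
    // The vlog stores the full KV pair (key + value) so GC can later check
    // validity against the LSM-tree; the LSM-tree stays small because it
    // holds only keys and value locations.
    vlog += key;
    uint64_t value_offset = vlog.size();
    vlog += value;
    lsm_index[key] = {value_offset, static_cast<uint32_t>(value.size())};
}
```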

KV Separation
- Does KV separation solve all problems?
- High garbage collection (GC) overhead in vlog management:
  - More severe if the reserved space is limited
  - Update-intensive workloads aggravate GC overhead
  - GC needs to query the LSM-tree to check if KV pairs are valid
  - High write amplification of the vlog once the reserved space is filled
[Figure: vlog behavior with reserved space not yet filled vs. filled reserved space]

Our Contributions
- HashKV, an efficient KV store for update-intensive workloads:
  - Extends KV separation with hash-based data grouping for value storage
  - Mitigates GC overhead with smaller I/O amplification and without LSM-tree queries
- Three extensions that adapt to workload characteristics:
  - E1: Dynamic reserved space allocation
  - E2: Hotness awareness
  - E3: Selective KV separation
- Extensive prototype experiments:
  - 4.6x throughput and 53.4% less write traffic over a circular log

Hash-based Data Grouping
- Hash values into fixed-size partitions by keys:
  - Partition isolation: all value updates of the same key must go to the same partition in a log-structured manner
  - Deterministic grouping: instantly locate the partition of a given key
- Allows flexible and lightweight GC:
  - Localizes all updates of a key in the same partition
[Figure: KV pairs dispatched to the LSM-tree (for indexing) and to hashed partitions in the value store]
- What if a partition is full? (see the sketch below)
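A minimal sketch of the deterministic grouping described above, assuming `std::hash` as the hash function and in-memory vectors standing in for on-disk partitions. `ValueStore`, `ValueLocation`, and `Put` are hypothetical names, not HashKV's actual interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct ValueLocation {
    uint32_t partition_id;   // which partition holds the value
    uint64_t offset;         // offset of the value within the partition's log
};

class ValueStore {
public:
    explicit ValueStore(size_t num_partitions) : partitions_(num_partitions) {}

    // All updates of the same key always map to the same partition, so GC can
    // treat each partition independently and without LSM-tree queries.
    ValueLocation Put(const std::string& key, const std::string& value) {
        uint32_t pid = std::hash<std::string>{}(key) % partitions_.size();
        auto& log = partitions_[pid];
        uint64_t offset = log.size();
        log.push_back(value);            // log-structured append
        return {pid, offset};            // kept in the LSM-tree as metadata
    }

private:
    std::vector<std::vector<std::string>> partitions_;  // stand-in for on-disk partitions
};
```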

E1: Dynamic Reserved Space Allocation
- Layout:
  - Logical address space: main segments (e.g., 64 MiB)
  - Reserved space: log segments (e.g., 1 MiB)
  - Segment group: 1 main segment + multiple log segments
- In-memory segment table tracks all segment groups:
  - Checkpointed for fault tolerance
[Figure: main segments in the logical address space and log segments in the reserved space, organized into segment groups alongside the LSM-tree]
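A sketch of the segment layout just described, using a simple vector-backed table for illustration. `SegmentGroup`, `SegmentTable`, and the field names are hypothetical; the 64 MiB and 1 MiB sizes are the examples from the slide.

```cpp
#include <cstdint>
#include <vector>

constexpr uint64_t kMainSegmentSize = 64ULL << 20;  // e.g., 64 MiB
constexpr uint64_t kLogSegmentSize  = 1ULL  << 20;  // e.g., 1 MiB

struct SegmentGroup {
    uint64_t main_segment;               // starting address of the main segment
    std::vector<uint64_t> log_segments;  // log segments allocated on demand from the reserved space
    uint64_t write_pos;                  // current end-of-log position within the group
};

// In-memory segment table: one entry per group, indexed by the partition id
// produced by hashing the key. Checkpointed to storage for fault tolerance.
using SegmentTable = std::vector<SegmentGroup>;
```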

Group-Based Garbage Collection
- Select a segment group for GC:
  - e.g., the one with the largest amount of writes
  - Likely to have many invalid KV pairs to reclaim free space
- Identify all valid KV pairs in the selected group:
  - Since each group stores updates in a log-structured manner, the latest version of each key must reside toward the end of the group
  - No LSM-tree queries required
- Write all valid KV pairs to new segments
- Update the LSM-tree (see the sketch below)
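The GC logic above can be sketched as a backward scan over a group's log: the first occurrence of a key seen from the tail is its latest (valid) version, so no LSM-tree queries are needed. The types below (`Record`, `CollectValid`) are hypothetical stand-ins, not HashKV's code.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

struct Record { std::string key, value; };

// Returns the valid (latest) KV pairs of one segment group.
std::vector<Record> CollectValid(const std::vector<Record>& group_log) {
    std::unordered_set<std::string> seen;
    std::vector<Record> valid;
    // Scan from the end of the log: the first time we see a key, it is the
    // newest version; any earlier occurrence of that key is stale.
    for (auto it = group_log.rbegin(); it != group_log.rend(); ++it) {
        if (seen.insert(it->key).second) {
            valid.push_back(*it);
        }
    }
    return valid;  // rewritten to new segments; the LSM-tree is then updated
}
```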

E2: Hotness Awareness
- Problem: a mix of hot and cold KV pairs
  - Unnecessary rewrites for cold KV pairs
- Tagging:
  - Add a tag in the metadata to indicate the presence of a cold value
  - Cold values are stored separately (hot-cold value separation)
  - GC rewrites small tags instead of values
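A minimal sketch of the tagging idea, assuming a per-value metadata record: a cold value is relocated to the cold data log and the group keeps only a small tag pointing to it, so GC rewrites the tag rather than the value. The layout and names (`ValueMeta`, `cold_offset`) are hypothetical, not HashKV's on-disk format.

```cpp
#include <cstdint>
#include <string>

struct ValueMeta {
    bool is_cold = false;      // tag: value lives in the cold data log
    uint64_t cold_offset = 0;  // location in the cold data log if is_cold is set
    std::string value;         // inlined value for hot KV pairs
};
```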

E3: Selective KV Separation
- KV separation for small values incurs extra costs to access both the LSM-tree and the value store
- Selective approach (sketched below):
  - Large values: KV separation
  - Small values: stored entirely in the LSM-tree
- Open issue: how to distinguish between small and large values?
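A sketch of the selective dispatch described above, with in-memory maps standing in for the LSM-tree and the value store. The 256-byte threshold is an assumption (the slide leaves the cutoff as an open issue), and all names here are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>

constexpr size_t kValueSizeThreshold = 256;  // assumed small/large cutoff

std::map<std::string, std::string> lsm_tree;     // stand-in for the LSM-tree
std::map<std::string, std::string> value_store;  // stand-in for the hashed value store

void PutSelective(const std::string& key, const std::string& value) {
    if (value.size() <= kValueSizeThreshold) {
        // Small value: keep the whole KV pair in the LSM-tree so reads need
        // only one lookup.
        lsm_tree[key] = value;
    } else {
        // Large value: KV separation; the value goes to the value store and
        // the LSM-tree keeps a pointer-like entry to it.
        value_store[key] = value;
        lsm_tree[key] = "@value_store:" + key;  // location metadata (illustrative)
    }
}
```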

Other Issues
- Range scans: leverage read-ahead (via posix_fadvise) for speedup (see the sketch below)
- Metadata journaling: crash consistency for both write and GC operations
- Implementation:
  - Multi-threading for writes and GC
  - Batched writes for KV pairs in the same segment group
  - Built on SSDs
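The range-scan speedup relies on read-ahead hints; the sketch below issues a `posix_fadvise(POSIX_FADV_WILLNEED)` hint for the region a scan is about to read. `posix_fadvise` is a standard POSIX call, but the wrapper, path, and offsets here are illustrative only.

```cpp
#include <fcntl.h>
#include <unistd.h>

// Ask the kernel to prefetch the value-store region a range scan is about to
// read, so the scan overlaps prefetch I/O with processing.
bool PrefetchRange(const char* path, off_t offset, off_t length) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    int rc = posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
    close(fd);
    return rc == 0;
}
```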

Putting It All Together: HashKV Architecture
[Figure: In memory, a write cache (meta, key), the LSM-tree's MemTable (meta, key, value), and the segment table tracking each group's end position and segments. In persistent storage, the value store's main segments and reserved-space log segments organized into segment groups, together with the write journal, GC journal, cold data log, and the LSM-tree.]

Experiments
- Testbed backed by an SSD RAID array
- KV stores:
  - LevelDB, RocksDB, HyperLevelDB, PebblesDB (default parameters)
  - vlog (circular log) and HashKV: built on LevelDB for KV separation
- Workloads:
  - 40 GiB for main segments + 12 GiB (30%) reserved space for log segments
  - Load: 40 GiB of 1-KiB KV pairs (Phase P0)
  - Update: 40 GiB of updates for three phases (Phases P1, P2, P3)
    - P1: reserved space gradually filled up
    - P2 & P3: reserved space fully filled (stabilized performance)

Update Performance of HashKV
[Figure: throughput, write size, and KV store size across phases]
- Compared to LevelDB, RocksDB, and vlog: 6.3-7.9x, 1.3-1.4x, and 3.7-4.6x throughput, respectively
- 49.6-71.5% lower write size
- Much lower KV store size than HyperLevelDB and PebblesDB

Impact of Reserved Space
[Figure: throughput and latency breakdown (V = vlog; H = HashKV)]
- HashKV's throughput increases with the reserved space size
- vlog has high LSM-tree query overhead (80% of latency)

Range Scans
- HashKV maintains high range scan performance

Optimization Features
- Hotness awareness
- Selective KV separation
- Higher throughput and smaller write size with the optimization features enabled

Conclusions
- HashKV: hash-based data grouping for efficient updates
  - Dynamic reserved space allocation
  - Hotness awareness via tagging
  - Selective KV separation
- More evaluation results and analysis in the paper and technical report
- Source code: http://adslab.cse.cuhk.edu.hk/software/hashkv