LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data

LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data
Xingbo Wu, Yuehai Xu, Song Jiang, Zili Shao
The Hong Kong Polytechnic University

The Challenge on Today's Key-Value Stores
Trends in workloads:
- Larger single-store capacity: multi-TB SSDs, flash arrays of over 100 TB.
- Smaller key-value items: in a Facebook pool, 99% of the items are 68 B.
The result is a large metadata set on a single node.

Consequences of a Large Metadata Set
- Less caching space for hot items: a low hit ratio compromises system throughput.
- Long warm-up time: it may take tens of minutes to read all metadata into memory.
- High read cost for out-of-core metadata: it is expensive to read multiple pages to serve a single GET.
LevelDB has managed to reduce the metadata size.

LevelDB Reduces Metadata Size with SSTables
To construct an SSTable:
- Sort the data into a list of 4-KB blocks.
- Build a memory-efficient block index.
- Generate Bloom filters to avoid unnecessary reads.
How do we support insertions on an immutable SSTable?
[Figure: an SSTable of sorted 4-KB blocks with a block index and Bloom filters; a GET for a present key (20) is directed to a single block, while a GET for an absent key (22) is rejected by the Bloom filter without a read.]
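To make the "avoid unnecessary reads" point concrete, here is a toy Bloom filter in C. The sizes, hash function, and parameters are ours for illustration only and are not LevelDB's actual filter format (which budgets roughly 10 bits per key).

```c
/* Toy Bloom filter: a few bits per key are enough to skip most reads
   for absent keys. Illustrative only, not LevelDB's filter format. */
#include <stdbool.h>
#include <stdint.h>

#define BF_BITS   (1u << 16)   /* toy filter size */
#define BF_HASHES 7

/* Simple seeded FNV-1a hash used to derive BF_HASHES probe positions. */
static uint64_t fnv1a(const char *s, uint32_t seed)
{
    uint64_t h = 14695981039346656037ull ^ seed;
    for (; *s; s++) { h ^= (uint8_t)*s; h *= 1099511628211ull; }
    return h;
}

struct bloom { uint8_t bits[BF_BITS / 8]; };

void bloom_add(struct bloom *b, const char *key)
{
    for (uint32_t i = 0; i < BF_HASHES; i++) {
        uint64_t h = fnv1a(key, i) % BF_BITS;
        b->bits[h / 8] |= (uint8_t)(1u << (h % 8));
    }
}

/* May return a false positive, but never a false negative: whenever it
   answers "no", the SSTable's data blocks need not be read at all. */
bool bloom_may_contain(const struct bloom *b, const char *key)
{
    for (uint32_t i = 0; i < BF_HASHES; i++) {
        uint64_t h = fnv1a(key, i) % BF_BITS;
        if (!(b->bits[h / 8] & (1u << (h % 8))))
            return false;
    }
    return true;
}
```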

Reorganizing Data Across Levels: the LSM-tree (Log-Structured Merge-tree)
- New items are first accumulated in a MemTable.
- Each filled MemTable is converted to an SSTable at Level 0.
- LevelDB conducts compaction to merge the SSTables into lower levels.
- A store can grow exponentially to several TBs with only a few levels.
[Figure: a MemTable above SSTables at Level 0 and Level 1 covering progressively narrower key ranges; compaction merges overlapping tables and is very expensive.]

A Closer Look at Compaction
Steps in a compaction:
1. Read the overlapping SSTables into memory.
2. Merge-sort the data in memory to form a list of new SSTables.
3. Write the new SSTables to disk to replace the old ones.
In one compaction (1:10 level size ratio):
- Read 11 tables and write 11 tables,
- but add only 1 table's worth of data to the lower level.
This leads to about 45x write amplification (WA) for a 5-level store.
[Figure: one Level-N SSTable (keys 70-90) merged with the overlapping Level-(N+1) SSTables.]
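A back-of-the-envelope reconstruction of the 45x figure, assuming (as on the slide) a 1:10 size ratio between adjacent levels: each byte is written once when its MemTable is flushed, and is then rewritten about r + 1 = 11 times at each of the four level crossings, because promoting one table's worth of data rewrites roughly r overlapping tables in the level below.

$$
\text{WA} \;\approx\; \underbrace{1}_{\text{MemTable flush}} \;+\; \underbrace{(L-1)\,(r+1)}_{\text{level crossings}} \;=\; 1 + 4 \times 11 \;=\; 45
\qquad (r = 10,\ L = 5).
$$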

Compaction Can Be Very Expensive
The workload: PUT 2 billion items with random keys (~250 GB), each with a 16-byte key and a 100-byte value.
PUT throughput drops to 18K QPS (about 2 MB/s).

Metadata I/O Is Inefficient
Facts about LevelDB's metadata:
- Block index: ~12 bytes per block.
- Bloom filter: ~10 bits per key.
How large is the metadata in a 10-TB store of 100-byte items?
- 155 GB of metadata: 30 GB of block index + 125 GB of Bloom filters.
- A 4-GB memory holds only 25% of the metadata even for a 1-TB store.
[Figure: measured read throughput drops from roughly 50% to roughly 10% of raw SSD IOPS once metadata is out of core.]
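The 155 GB figure follows from the per-block and per-key costs above; the arithmetic below is ours, assuming 4-KB blocks and decimal units.

$$
\begin{aligned}
\text{Bloom filters:}\quad & \frac{10\ \text{TB}}{100\ \text{B/item}} \times 10\ \text{bits/key} \;=\; 10^{11}\ \text{items} \times 1.25\ \text{B} \;=\; 125\ \text{GB},\\
\text{Block index:}\quad & \frac{10\ \text{TB}}{4\ \text{KB/block}} \times 12\ \text{B/block} \;\approx\; 2.5 \times 10^{9} \times 12\ \text{B} \;\approx\; 30\ \text{GB},
\end{aligned}
$$

for a total of roughly 155 GB.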

Our Solution: LSM-trie
- Build an ultra-large store for small data.
- Use a trie structure to improve compaction efficiency.
- Cluster Bloom filters to efficiently read out-of-core metadata.

Organizing Tables in a Trie (Prefix Tree)
- Items are located in the trie according to their hashed key.
- Each trie node contains a pile of immutable tables.
- The nodes at the same depth form a conceptual level.
How does the LSM-trie help with efficient compaction?
[Figure: a trie with nodes at Levels 0 and 1 indexed by hash prefixes 00, 01, 10, 11.]

Efficient Compaction in the LSM-trie
Compaction steps:
1. Read the tables from a parent node into memory.
2. Assign the items to new tables according to their hash prefixes.
3. Write the new tables into the node's children.
For one compaction (1:8 fan-out):
- Read 8 tables and write 8 tables,
- and all 8 tables are added to the lower level.
Only ~5x WA for a 5-level LSM-trie; tables grow linearly in number at each node.
[Figure: items from a parent node are scattered by hash prefix into child nodes (#00, #01, #10, #11 in the illustration).]
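A minimal sketch of the hash-prefix partitioning that replaces merge-sorting in an LSM-trie compaction. The names and types are hypothetical (not the lsm-trie-release code); the 3 prefix bits per level correspond to the 1:8 fan-out in the slide text, while the figure's 2-bit prefixes are just a smaller illustration.

```c
/* Sketch: partition a parent node's items into child tables by hash prefix. */
#include <stdint.h>
#include <stddef.h>

#define FANOUT_BITS 3                 /* 1:8 fan-out => 3 prefix bits per level */
#define FANOUT      (1u << FANOUT_BITS)

struct item { uint64_t keyhash; /* hash of the user key; value omitted */ };

/* Which child of a node at depth `level` should hold this item?
   Take the next 3 bits of the key hash, most-significant bits first.
   (Trie levels are shallow, so the shift stays within 64 bits.) */
unsigned child_index(uint64_t keyhash, unsigned level)
{
    unsigned shift = 64 - FANOUT_BITS * (level + 1);
    return (unsigned)((keyhash >> shift) & (FANOUT - 1));
}

/* Compaction: scatter the parent's items into per-child output tables.
   No sorting or merging is needed; the hash prefix alone decides placement,
   so every byte read is written exactly once to the next level. */
void partition_items(const struct item *items, size_t n, unsigned level,
                     void (*emit)(unsigned child, const struct item *it))
{
    for (size_t i = 0; i < n; i++)
        emit(child_index(items[i].keyhash, level), &items[i]);
}
```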

Introducing HTable*
An HTable is an immutable hash table of key-value items:
- Each bucket has 4 KB of space by default.
- Some buckets have overflowed items; the overflowed items are migrated to other buckets.
[Figure: an HTable with buckets 0 to 3, some over- and some under-filled.]
*It's not the HTable in HBase.

Selecting Items for Migration
- Sort the items in a bucket according to their key's hash.
- Migrate the items above the watermark (the HashMark).
- Record the HashMark and the corresponding bucket IDs as a small entry: 2-byte source ID, 2-byte target ID, 4-byte HashMark.
[Figure: a bucket's item hashes (0xef, 0xa0, 0x9a, 0x6d, 0x56, 0x33) sorted in descending order; the items above the HashMark (0xa0) are moved to another bucket.]
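One way the HashMark could be picked for an overloaded bucket is sketched below, assuming the bucket's items are already sorted by key hash in descending order. The types, field sizes, and the UINT32_MAX sentinel are our illustration, not the code in lsm-trie-release.

```c
/* Sketch: pick a HashMark so that the items kept in a 4-KB bucket fit. */
#include <stdint.h>
#include <stddef.h>

#define BUCKET_BYTES 4096u

struct bucket_item { uint32_t hash32; uint32_t size; /* bytes on disk */ };

struct migration {           /* the 8-byte metadata entry from the slide */
    uint16_t source_id;
    uint16_t target_id;
    uint32_t hashmark;       /* items with hash32 >= hashmark are migrated */
};

/* Walk from the low-hash end, keeping items until the 4-KB budget is spent;
   the first item that no longer fits defines the HashMark, and it plus every
   higher-hash item is migrated to the target bucket. */
uint32_t choose_hashmark(const struct bucket_item *items, size_t n)
{
    uint32_t used = 0;
    for (size_t i = n; i > 0; i--) {      /* items[n-1] has the smallest hash */
        if (used + items[i - 1].size > BUCKET_BYTES)
            return items[i - 1].hash32;
        used += items[i - 1].size;
    }
    return UINT32_MAX;                    /* sentinel: everything fits, no migration */
}
```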

Caching HashMarks for Efficient GETs
- Only cache the HashMarks of the most overloaded buckets.
- This gives 1.01 amortized reads per GET.
- A 1-TB store needs only ~400 MB of in-memory HashMarks.
[Figure: example with bucket 1 (HashMark 0x95, target ID 7) and bucket 7; the items owned by bucket 1 with hashes at or above 0x95 are found in bucket 7, and only one item in the example triggers a second read.]
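A matching GET-side sketch, again with hypothetical structures: a cached HashMark entry redirects a lookup to the target bucket whenever the key's hash is at or above the mark, and an uncached entry only risks one extra read, consistent with the ~1.01 amortized reads per GET quoted on the slide.

```c
/* Sketch of GET-side bucket resolution with cached HashMarks
   (illustrative structures; the real store indexes this per HTable). */
#include <stdint.h>
#include <stddef.h>

struct hashmark_entry {      /* cached only for the most overloaded buckets */
    uint16_t source_id;      /* bucket the item nominally hashes to */
    uint16_t target_id;      /* bucket its overflow was migrated to */
    uint32_t hashmark;       /* items with hash32 >= hashmark were moved */
};

/* Decide which bucket to read first for a key that nominally maps to `home`.
   A missing cache entry just means we read the home bucket and may pay a
   rare second read. */
uint16_t resolve_bucket(const struct hashmark_entry *cache, size_t n,
                        uint16_t home, uint32_t hash32)
{
    for (size_t i = 0; i < n; i++) {
        if (cache[i].source_id == home && hash32 >= cache[i].hashmark)
            return cache[i].target_id;   /* item was migrated at build time */
    }
    return home;
}
```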

Most Metadata Is in the Last Level
- The first four levels contain 1.9 GB of Bloom filters (BFs).
- The last level may contain over one hundred GB of BFs.
- We explicitly cache the BFs for Levels 0 to 3; the BFs at the last level are managed differently.
Data size distribution across the levels: 256 MB, 2 GB, 16 GB, 128 GB, N TB; the first four levels together hold about 147 GB of data.
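The two numbers on this slide are consistent with the earlier estimates; the arithmetic below is ours, assuming ~100-byte items and ~10 bits of Bloom filter per key.

$$
256\ \text{MB} + 2\ \text{GB} + 16\ \text{GB} + 128\ \text{GB} \approx 147\ \text{GB},
\qquad
\frac{147\ \text{GB}}{100\ \text{B/item}} \times 10\ \text{bits/key} \approx 1.8\ \text{GB} \approx 1.9\ \text{GB of BFs}.
$$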

Clustering Bloom Filters for Efficient Reads
- Each hash(key) indicates a column of 4-KB buckets.
- We collect all the BFs of a column to form a BloomCluster.
- Each GET then requires only one SSD read for all of its out-of-core BFs.
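A sketch of the read path this enables: because all the Bloom filters a key might need in the last level sit contiguously in its column's BloomCluster, a single pread covers them. The layout constants and function below are assumptions for illustration, not the actual on-disk format.

```c
/* Sketch of a BloomCluster read: the Bloom filters for every table in a
   key's bucket column are stored contiguously, so one SSD read fetches
   them all. Layout and sizes are illustrative assumptions. */
#include <stdint.h>
#include <unistd.h>

#define CLUSTER_BYTES 4096u          /* one BloomCluster per bucket column */
#define NUM_COLUMNS   (1u << 20)     /* assumed number of columns */

/* One read covers the out-of-core Bloom filters needed by this GET; the
   caller then tests the per-table filters in memory to pick the candidate
   HTables for the (at most one) data read that follows. */
ssize_t read_bloomcluster(int meta_fd, uint64_t keyhash,
                          uint8_t cluster[CLUSTER_BYTES])
{
    uint32_t column = (uint32_t)(keyhash % NUM_COLUMNS);
    off_t offset = (off_t)column * CLUSTER_BYTES;
    return pread(meta_fd, cluster, CLUSTER_BYTES, offset);
}
```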

Exploiting Full SSD Performance
- Use an additional small SSD to host the BloomClusters, e.g., a 10-TB SSD for data plus a 128-GB SSD for metadata.
- Plenty of memory space is left for the data cache.
[Figure: GET path: a few GBs of in-memory metadata, a 1st read for the BloomCluster, and a 2nd read for the item; 52K IOPS SSD, ~50K QPS.]

Performance Evaluation of PUT Expected high TP on SSD High TP lasts longer on two SSDs Consistent Throughput (TP) on HDD: 2x-3x higher than the others TP dropped due to static wear-leveling in SSD 17

Write Amplification Comparison
LSM-trie maintains a consistent 5x WA ratio.

Read Performance with 4 GB of Memory
Only one SSD is used; the measured throughput reaches ~50% of raw SSD IOPS.
*No data cache is used for LSM-trie.

Read Performance with 4 GB of Memory (continued)
LSM-trie gains 96% of raw SSD IOPS with an additional SSD.
*No data cache is used for LSM-trie.

Summary
- LSM-trie is designed to manage a large set of small data items.
- It reduces write amplification by an order of magnitude.
- It delivers high throughput even with out-of-core metadata.
The LSM-trie source code can be downloaded at: https://github.com/wuxb45/lsm-trie-release

Thank you! Q & A