PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees

PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees Pandian Raju 1, Rohan Kadekodi 1, Vijay Chidambaram 1,2, Ittai Abraham 2 1 The University of Texas at Austin 2 VMware Research

What is a key-value store? Store any arbitrary value for a given key. Keys: 123, 124. Values: { "name": "John Doe", "age": 25 }, { "name": "Ross Gel", "age": 28 }. Insertions: put(key, value) Point lookups: get(key) Range Queries: get_range(key1, key2) 6
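
The slides define the store only through these three operations. As a toy illustration of that interface, here is a minimal in-memory sketch in Python (class and method names are illustrative, not PebblesDB's actual API):

import bisect

class KVStore:
    """Toy key-value store illustrating the put/get/get_range interface."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._data = {}    # key -> value

    def put(self, key, value):
        # Insertions: store (or overwrite) the value for a key.
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        # Point lookups: return the value for a key, or None if absent.
        return self._data.get(key)

    def get_range(self, key1, key2):
        # Range queries: all (key, value) pairs with key1 <= key <= key2.
        lo = bisect.bisect_left(self._keys, key1)
        hi = bisect.bisect_right(self._keys, key2)
        return [(k, self._data[k]) for k in self._keys[lo:hi]]

store = KVStore()
store.put(123, {"name": "John Doe", "age": 25})
store.put(124, {"name": "Ross Gel", "age": 28})
print(store.get(123))             # {'name': 'John Doe', 'age': 25}
print(store.get_range(123, 124))  # both records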

Key-Value Stores - widely used Google's BigTable powers Search, Analytics, Maps and Gmail. Facebook's RocksDB is used as the storage engine in production systems of many companies. 7

Write-optimized data structures Log Structured Merge Tree (LSM) is a write-optimized data structure used in key-value stores. Provides high write throughput with good read throughput, but suffers from high write amplification. Write amplification - ratio of the amount of write IO to the amount of user data. Example: a client sends 10 GB of user data to the KV-store; if the total write I/O is 200 GB, the write amplification is 20. 9

Write amplification in LSM based KV stores Inserted 500M key-value pairs (key: 16 bytes, value: 128 bytes); total user data: ~45 GB. Measured write IO: RocksDB 1868 GB (42x), LevelDB 1222 GB (27x), PebblesDB 756 GB (17x), versus 45 GB of user data. 11

Why is write amplification bad? It reduces write throughput, and flash devices wear out after a limited number of write cycles (an Intel SSD DC P4600 can last ~5 years assuming ~5 TB of writes per day). At ~42x write amplification, RocksDB can ingest only ~500 GB of user data per day if such an SSD is to last even 1.25 years. Data source: https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-1-6tb-2-5inch-3d1.html 12
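
A quick back-of-the-envelope check of the numbers on this and the previous slide (the endurance rating is the slide's assumption, taken from the linked data sheet):

# Write amplification = total device write IO / user data written.
user_data_gb = 45            # previous slide: ~45 GB of user data inserted
rocksdb_io_gb = 1868         # measured write IO for RocksDB
print(rocksdb_io_gb / user_data_gb)    # ~41.5, i.e. the "42x" on the slide

# SSD lifetime: the drive is rated for ~5 TB of writes/day over ~5 years.
rated_tb_per_day, rated_years = 5, 5
write_amp = 42
user_data_tb_per_day = 0.5             # ~500 GB of user data per day
device_tb_per_day = user_data_tb_per_day * write_amp          # 21 TB/day
print(rated_years * rated_tb_per_day / device_tb_per_day)     # ~1.2 years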

PebblesDB High-performance write-optimized key-value store. Built using a new data structure, the Fragmented Log-Structured Merge Tree. Achieves 3-6.7x higher write throughput and 2.4-3x lower write amplification compared to RocksDB. Achieves the highest write throughput and the lowest write amplification when used as a backend store for MongoDB. 13

Outline Log-Structured Merge Tree (LSM) Fragmented Log-Structured Merge Tree (FLSM) Building PebblesDB using FLSM Evaluation Conclusion 14

Log Structured Merge Tree (LSM) In-memory File 1 Memory Storage Data is stored both in memory and storage 16

Log Structured Merge Tree (LSM) Write (key, value) In-memory File 1 Memory Storage Writes go directly to memory 17

Log Structured Merge Tree (LSM) In-memory File 2 File 1 Memory Storage In-memory data is periodically written as files to storage (sequential I/O) 18

Log Structured Merge Tree (LSM) In-memory Memory Level 0 Storage Level 1 Level n Files on storage are logically arranged in different levels 19

Log Structured Merge Tree (LSM) In-memory Memory Level 0 Storage Level 1 Level n Compaction pushes data to higher numbered levels 20

Log Structured Merge Tree (LSM) Level 0 Level 1 In-memory 1. 12 15. 19 25. 75 79. 99 Memory Storage Level n Files are sorted and have non-overlapping key ranges; within a level, search uses binary search 21

Log Structured Merge Tree (LSM) Limit on number of level 0 files Level 0 In-memory 2. 57 23. 78 Memory Storage Level 1 Level n Level 0 can have files with overlapping (but sorted) key ranges 22
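
A minimal sketch of the lookup rule these level invariants imply: level 0 files may overlap, so all of them are probed, while deeper levels hold sorted, non-overlapping files, so a binary search over file boundaries selects at most one file per level. (The file representation below is made up for illustration.)

import bisect

# Each sstable is modelled as (min_key, max_key, {key: value}).
def find_in_file(sstable, key):
    lo, hi, data = sstable
    return data.get(key) if lo <= key <= hi else None

def lsm_get(levels, key):
    # Level 0: possibly overlapping files, newest first, so check all of them.
    for sstable in levels[0]:
        value = find_in_file(sstable, key)
        if value is not None:
            return value
    # Deeper levels: sorted, non-overlapping files, so binary-search the file
    # boundaries and probe at most one file per level.
    for level in levels[1:]:
        starts = [f[0] for f in level]
        i = bisect.bisect_right(starts, key) - 1
        if i >= 0:
            value = find_in_file(level[i], key)
            if value is not None:
                return value
    return None

levels = [
    [(2, 57, {30: "a"}), (23, 78, {40: "b"})],                        # level 0
    [(1, 12, {5: "c"}), (15, 19, {17: "d"}), (25, 75, {40: "old"})],  # level 1
]
print(lsm_get(levels, 40))  # "b": the newer level-0 value shadows level 1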

Write amplification: Illustration Level 1 re-write counter: 1 In-memory 58. 68 Memory Level 0 2. 37 23. 48 Storage Level 1 1. 12 15. 25 39. 62 77. 95 Level n Max files in level 0 is configured to be 2 23

Write amplification: Illustration Level 1 re-write counter: 1 In-memory Memory Level 0 2. 37 23. 48 58. 68 Storage Level 1 1. 12 15. 25 39. 62 77. 95 Level n Level 0 has 3 files (> 2), which triggers a compaction 24

Write amplification: Illustration Level 1 re-write counter: 1 In-memory Memory Level 0 2. 37 23. 48 58. 68 Storage Level 1 1. 12 15. 25 39. 62 77. 95 Level n * Files are immutable * Sorted non-overlapping files 25

Write amplification: Illustration Level 1 re-write counter: 1 In-memory Memory Level 0 2. 37 23. 48 58. 68 Storage Level 1 1. 12 15. 25 39. 62 77. 95 Level n Set of overlapping files between levels 0 and 1 26

Write amplification: Illustration 1. 23 24. 46 47. 68 In-memory Level 1 re-write counter: 1 → 2 Memory Level 0 2. 37 23. 48 58. 68 Storage Level 1 1. 12 15. 25 39. 62 77. 95 Level n Compacting level 0 with level 1 29

Write amplification: Illustration In-memory Level 1 re-write counter: 2 Memory Level 0 Storage Level 1 1. 23 24. 46 47. 68 77. 95 Level n Level 0 is compacted 30

Write amplification: Illustration Level 1 re-write counter: 2 In-memory 10. 33 17. 53 1. 121 Memory Level 0 Storage Level 1 1. 23 24. 46 47. 68 77. 95 Level n Data is being flushed as level 0 files after some write operations 31

Write amplification: Illustration Level 1 re-write counter: 2 Memory Level 0 10. 33 17. 53 1. 121 Storage Level 1 1. 23 24. 46 47. 68 77. 95 Level n Compacting level 0 with level 1 32

Write amplification: Illustration 1. 30 31. 60 62. 90 92. 121 Level 1 re-write counter: 2 → 3 Memory Level 0 Storage Level 1 Level n Compacting level 0 with level 1 33

Write amplification: Illustration Level 1 re-write counter: 3 Memory Level 0 Storage Level 1 1. 30 31. 60 62. 90 92. 121 Level n Existing data is re-written to the same level (1) 3 times 34

Root cause of write amplification Rewriting data to the same level multiple times, in order to maintain sorted, non-overlapping files in each level 35
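
A sketch of why classic LSM compaction rewrites data: merging level-0 files into level 1 forces every overlapping level-1 file to be read, merged, and written out again, even though that data was already sorted on disk. (The helper name and the list-of-keys file representation are illustrative.)

def compact_into_level1(level0_files, level1_files, max_file_keys=4):
    """Merge level-0 files into level 1, rewriting overlapping level-1 files."""
    lo = min(f[0] for f in level0_files)
    hi = max(f[-1] for f in level0_files)
    overlapping = [f for f in level1_files if f[0] <= hi and f[-1] >= lo]
    untouched = [f for f in level1_files if f not in overlapping]

    # All overlapping data (old and new) is merged, re-sorted and re-written.
    merged = sorted({k for f in level0_files + overlapping for k in f})
    rewritten_files = [merged[i:i + max_file_keys]
                       for i in range(0, len(merged), max_file_keys)]
    keys_rewritten = sum(len(f) for f in overlapping)  # old data written again
    return sorted(untouched + rewritten_files), keys_rewritten

# Files are sorted lists of keys; level 1 is sorted and non-overlapping.
level0 = [[2, 30, 37], [23, 48], [58, 68]]
level1 = [[1, 12], [15, 25], [39, 62], [77, 95]]
new_level1, keys_rewritten = compact_into_level1(level0, level1)
print(new_level1)       # the merged, non-overlapping level-1 files
print(keys_rewritten)   # 6 keys that already lived in level 1 were written again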

Outline Log-Structured Merge Tree (LSM) Fragmented Log-Structured Merge Tree (FLSM) Building PebblesDB using FLSM Evaluation Conclusion 36

Naïve approach to reduce write amplification Just append the file to the end of the next level. This results in many (possibly all) overlapping files within a level. Level i 1. 89 5. 65 6. 91 8. 95 9. 99 1. 102 1. 271 (all files have overlapping key ranges) This affects read performance 37

Partially sorted levels Hybrid between all non-overlapping files and all overlapping files. Inspired by the skip list data structure. Concrete boundaries (guards) group together overlapping files. 13 35 70 Level i 1. 12 13. 34 18. 31 40. 47 45. 56 42. 65 72. 87 (files of the same color can have overlapping key ranges) 38
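
A sketch of the search-space reduction a partially sorted level provides: guards partition the key space, files inside one guard may overlap, and a binary search over the guard boundaries narrows a lookup to the files of a single guard. (The layout mirrors the slide's example; the data structures are illustrative.)

import bisect

# A partially sorted level: guard boundaries plus the files grouped under each
# guard (files under the same guard may have overlapping key ranges).
guards = [13, 35, 70]
groups = {
    None: [(1, 12)],                       # sentinel group: keys below the first guard
    13:   [(13, 34), (18, 31)],            # overlapping files are allowed here
    35:   [(40, 47), (45, 56), (42, 65)],
    70:   [(72, 87)],
}

def candidate_files(key):
    """Return only the files (key ranges) that could contain `key`."""
    i = bisect.bisect_right(guards, key) - 1
    guard = guards[i] if i >= 0 else None
    return groups[guard]

print(candidate_files(45))  # the three (overlapping) files under guard 35
print(candidate_files(5))   # only the sentinel file (1, 12)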

Fragmented Log-Structured Merge Tree Novel modification of the LSM data structure. Uses guards to maintain partially sorted levels. Writes data only once per level in most cases. 39

FLSM structure Level 0 Level 1 In-memory 2. 37 23. 48 15 70 1. 12 15. 59 77. 87 82. 95 Memory Storage 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 Note how files are logically grouped within guards 40

FLSM structure Level 0 Level 1 In-memory 2. 37 23. 48 15 70 1. 12 15. 59 77. 87 82. 95 Memory Storage 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 Guards get more fine-grained deeper into the tree 41

How does FLSM reduce write amplification? 42

How does FLSM reduce write amplification? Level 0 Level 1 In-memory 30. 68 2. 37 23. 48 15 70 1. 12 15. 59 77. 87 82. 95 Memory Storage 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 Max files in level 0 is configured to be 2 43

How does FLSM reduce write amplification? 15 2. 14 15. 68 In-memory Memory Level 0 2. 37 23. 48 30. 68 Storage 15 70 Level 1 1. 12 15. 59 77. 87 82. 95 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 Compacting level 0 44

How does FLSM reduce write amplification? 15 2. 14 15. 68 In-memory Memory Level 0 Storage 15 70 Level 1 1. 12 15. 59 77. 87 82. 95 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 Fragmented files are just appended to next level 45

How does FLSM reduce write amplification? 15. 68 In-memory Memory Level 0 Storage 15 70 Level 1 2. 14 1. 12 15. 68 15. 59 77. 87 82. 95 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 Guard 15 in Level 1 is to be compacted 46

How does FLSM reduce write amplification? 40 15. 39 40. 68 In-memory Memory Level 0 Storage 15 70 Level 1 2. 14 1. 12 77. 87 82. 95 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 Files are combined, sorted and fragmented 47

How does FLSM reduce write amplification? 40 15. 39 40. 68 In-memory Memory Level 0 Storage 15 70 Level 1 2. 14 1. 12 77. 87 82. 95 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 Fragmented files are just appended to next level 48

How does FLSM reduce write amplification? FLSM doesn't re-write data to the same level in most cases. How does FLSM maintain read performance? FLSM maintains partially sorted levels to efficiently reduce the search space. 49
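
A sketch of the compaction rule the preceding slides walk through: the compacted data is sorted, fragmented at the next level's guard boundaries, and the fragments are appended to the matching guards, so the files already sitting in that level are not rewritten in the common case. (The representation is illustrative, not PebblesDB's on-disk format.)

import bisect

def fragment_by_guards(sorted_keys, guards):
    """Split one sorted run of keys at the guard boundaries of the next level."""
    fragments = {None: []}            # None = sentinel group before the first guard
    fragments.update({g: [] for g in guards})
    for key in sorted_keys:
        i = bisect.bisect_right(guards, key) - 1
        fragments[guards[i] if i >= 0 else None].append(key)
    return {g: frag for g, frag in fragments.items() if frag}

def flsm_compact(level0_files, next_level, guards):
    """Append fragments to the next level's guards; nothing there is rewritten."""
    merged = sorted({k for f in level0_files for k in f})
    for guard, fragment in fragment_by_guards(merged, guards).items():
        next_level.setdefault(guard, []).append(fragment)
    return next_level

guards = [15, 70]
level1 = {None: [[1, 12]], 15: [[15, 59]], 70: [[77, 87], [82, 95]]}
level0 = [[2, 37], [23, 48], [30, 68]]
print(flsm_compact(level0, level1, guards))
# Keys below guard 15 land in the sentinel group, keys 15..68 become a new file
# under guard 15, and the files already present under each guard are untouched.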

Selecting Guards Guards are chosen randomly and dynamically, dependent on the distribution of data; guard keys are picked from across the keyspace (1 to 1e+9 in the illustration). 50
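
The deck says only that guards are chosen randomly and dynamically from the data. One skip-list-style way to make such choices (a sketch of the general idea, not necessarily PebblesDB's exact scheme) is to hash each key and count trailing zero bits, with deeper levels accepting more keys so that their guards are finer grained:

import hashlib

def trailing_zero_bits(key) -> int:
    """Hash the key and count trailing zero bits (skip-list style coin flips)."""
    h = int.from_bytes(hashlib.sha256(str(key).encode()).digest()[:8], "big")
    return (h & -h).bit_length() - 1 if h else 64

def is_guard(key, level: int, num_levels: int = 3, base_bits: int = 4) -> bool:
    # Deeper levels require fewer trailing zero bits, so they select more
    # guards; a guard of level i is automatically a guard of all deeper levels.
    required = base_bits + (num_levels - level)
    return trailing_zero_bits(key) >= required

keys = range(100_000)
for level in (1, 2, 3):
    print(f"level {level}: {sum(is_guard(k, level) for k in keys)} guards")
# Roughly twice as many guards per level going down, chosen purely from the
# observed keys, so guard density follows the data distribution.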

Operations: Write Write (key, value) Put(1, "abc") Level 0 In-memory 2. 37 23. 48 Memory Storage Level 1 15 70 1. 12 15. 59 77. 87 82. 95 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 FLSM structure 54

Operations: Get In-memory Get(23) Memory Level 0 Level 1 2. 37 23. 48 15 70 1. 12 15. 59 77. 87 82. 95 Storage 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 FLSM structure 55

Operations: Get In-memory Get(23) Memory Level 0 Level 1 2. 37 23. 48 15 70 1. 12 15. 59 77. 87 82. 95 Storage 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 Search level by level starting from memory 56

Operations: Get In-memory Get(23) Memory Level 0 Level 1 2. 37 23. 48 15 70 1. 12 15. 59 77. 87 82. 95 Storage 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 All level 0 files need to be searched 57

Operations: Get In-memory Get(23) Memory Level 0 Level 1 2. 37 23. 48 15 70 1. 12 15. 59 77. 87 82. 95 Storage 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 Level 1: File under guard 15 is searched 58

Operations: Get In-memory Get(23) Memory Level 0 Level 1 2. 37 23. 48 15 70 1. 12 15. 59 77. 87 82. 95 Storage 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 Level 2: Both the files under guard 15 are searched 59
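
Putting the Get walkthrough together, a sketch of the full lookup path: the memtable first, then every level-0 file, then exactly one guard per deeper level but every file inside that guard. (The data layout follows the slide's example; this is not PebblesDB code.)

import bisect

def flsm_get(memtable, level0, levels, key):
    """levels: list of (guards, groups), where groups maps a guard to a list of
    (min_key, max_key, data) files; files within one guard may overlap."""
    if key in memtable:                          # 1. in-memory table first
        return memtable[key]
    for lo, hi, data in level0:                  # 2. every level-0 file
        if lo <= key <= hi and key in data:
            return data[key]
    for guards, groups in levels:                # 3. one guard per deeper level...
        i = bisect.bisect_right(guards, key) - 1
        guard = guards[i] if i >= 0 else None
        for lo, hi, data in groups.get(guard, []):   # ...but every file in it
            if lo <= key <= hi and key in data:
                return data[key]
    return None

memtable = {99: "in-memory"}
level0 = [(2, 37, {23: "L0"}), (23, 48, {30: "L0b"})]
level1 = ([15, 70], {None: [(1, 12, {})], 15: [(15, 59, {23: "L1"})],
                     70: [(77, 87, {}), (82, 95, {})]})
print(flsm_get(memtable, level0, [level1], 23))   # "L0": found in a level-0 file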

High write throughput in FLSM When compaction from memory to level 0 is stalled, writes to memory are also stalled. Write (key, value) In-memory Memory Level 0 2. 98 23. 48 1. 37 18. 48 Storage If the rate of insertion is higher than the rate of compaction, write throughput depends on the rate of compaction. FLSM has faster compaction because of less I/O, and hence higher write throughput. 61

Challenges in FLSM Every read/range query operation needs to examine multiple files per level For example, if every guard has 5 files, read latency is increased by 5x (assuming no cache hits) Trade-off between write I/O and read performance 62

Outline Log-Structured Merge Tree (LSM) Fragmented Log-Structured Merge Tree (FLSM) Building PebblesDB using FLSM Evaluation Conclusion 63

PebblesDB Built by modifying HyperLevelDB (±9100 LOC) to use FLSM. HyperLevelDB is built over LevelDB and provides improved parallelism and compaction. API compatible with LevelDB, but not with RocksDB. 64

Optimizations in PebblesDB Challenge (get/range query): Multiple files in a guard Get() performance is improved using file level bloom filter 65

Optimizations in PebblesDB Challenge (get/range query): Multiple files in a guard Get() performance is improved using file level bloom filter Is key 25 present? Bloom filter Definitely not Possibly yes 66

Optimizations in PebblesDB Challenge (get/range query): Multiple files in a guard Get() performance is improved using file level bloom filter 15 70 Level 1 1. 12 15. 39 77. 97 82. 95 Bloom Filter Bloom Filter Bloom Filter Bloom Filter Maintained in-memory 67

Optimizations in PebblesDB Challenge (get/range query): Multiple files in a guard Get() performance is improved using file level bloom filter 15 70 Level 1 1. 12 15. 39 77. 97 82. 95 Bloom Filter Bloom Filter Bloom Filter Bloom Filter Maintained in-memory PebblesDB reads same number of files as any LSM based store 68
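
A sketch of the file-level bloom filter optimization: each sstable keeps a small in-memory filter, and a get() opens only the files in the guard whose filters answer "possibly yes". (The bloom filter below is a toy, not PebblesDB's implementation.)

import hashlib

class BloomFilter:
    """Toy bloom filter: answers 'definitely not present' or 'possibly present'."""
    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def may_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))

# One filter per sstable in a guard; only files whose filter says
# "possibly yes" are actually read from storage.
files_in_guard = [({77: "x", 97: "y"}, BloomFilter()),
                  ({82: "z", 95: "w"}, BloomFilter())]
for data, bf in files_in_guard:
    for k in data:
        bf.add(k)

def get_in_guard(key):
    for data, bf in files_in_guard:
        if bf.may_contain(key):        # skip files that definitely lack the key
            if key in data:
                return data[key]
    return None

print(get_in_guard(97))   # "y" -- with high probability only one file is read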

Optimizations in PebblesDB Challenge (get/range query): Multiple files in a guard Get() performance is improved using file level bloom filter Range query performance is improved using parallel threads and better compaction 69

Outline Log-Structured Merge Tree (LSM) Fragmented Log-Structured Merge Tree (FLSM) Building PebblesDB using FLSM Evaluation Conclusion 70

Evaluation Micro-benchmarks Real world workloads - YCSB Crash recovery Small dataset Low memory CPU and memory usage NoSQL applications Aged file system 71

Real world workloads - YCSB Yahoo! Cloud Serving Benchmark - Industry standard macro-benchmark. Insertions: 50M, Operations: 10M, key size: 16 bytes and value size: 1 KB. [Chart: throughput ratio with respect to HyperLevelDB; annotated results: Load A 35.08 Kops/s, Run A 25.8 Kops/s, Run B 33.98 Kops/s, Run C 22.41 Kops/s, Run D 57.87 Kops/s, Load E 34.06 Kops/s, Run E 5.8 Kops/s, Run F 32.09 Kops/s, Total IO 952.93 GB.] Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes 74

NoSQL stores - MongoDB YCSB on MongoDB, a widely used NoSQL store. Inserted 20M key-value pairs with 1 KB value size and ran 10M operations. [Chart: throughput ratio with respect to WiredTiger; annotated results: Load A 20.73 Kops/s, Run A 9.95 Kops/s, Run B 15.52 Kops/s, Run C 19.69 Kops/s, Run D 23.53 Kops/s, Load E 20.68 Kops/s, Run E 0.65 Kops/s, Run F 9.78 Kops/s, Total IO 426.33 GB.] Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes 80

NoSQL stores - MongoDB PebblesDB combines the low write IO of WiredTiger with the high performance of RocksDB. 84

Outline Log-Structured Merge Tree (LSM) Fragmented Log-Structured Merge Tree (FLSM) Building PebblesDB using FLSM Evaluation Conclusion 85

Conclusion PebblesDB: a key-value store built on Fragmented Log-Structured Merge Trees. Increases write throughput and reduces write IO at the same time. Obtains 6x the write throughput of RocksDB. As key-value stores become more widely used, there have been several attempts to optimize them; PebblesDB combines algorithmic innovation (the FLSM data structure) with careful systems building. 86

https://github.com/utsaslab/pebblesdb

Backup slides 89

Operations: Seek Seek(target): Returns the smallest key in the database which is >= target Used for range queries (for example, return all entries between 5 and 18) Level 0 1, 2, 100, 1000 Level 1 1, 5, 10, 2000 Level 2 5, 300, 500 Get(1) 90

Operations: Seek Seek(target): Returns the smallest key in the database which is >= target Used for range queries (for example, return all entries between 5 and 18) Level 0 1, 2, 100, 1000 Level 1 1, 5, 10, 2000 Level 2 5, 300, 500 Seek(200) 91

Operations: Seek In-memory Seek(23) Memory Level 0 Level 1 2. 37 23. 48 15 70 1. 12 15. 59 77. 87 82. 95 Storage 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 FLSM structure 93

Operations: Seek In-memory Seek(23) Memory Level 0 Level 1 2. 37 23. 48 15 70 1. 12 15. 59 77. 87 82. 95 Storage 15 40 70 95 Level 2 2. 8 16. 32 15. 23 45. 65 70. 90 96. 99 All levels and memtable need to be searched 94
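
A sketch of why seek() is more expensive than get(): the smallest key >= target may live in the memtable, in any level-0 file, or under the matching guard or any later guard of every level, so candidates from all of them have to be merged. (heapq.merge stands in for the merging iterators a real store uses; the layout is illustrative.)

import bisect, heapq

def flsm_seek(memtable, level0, levels, target):
    """Return the smallest key >= target across the whole FLSM tree."""
    candidate_runs = [sorted(memtable)] + [sorted(f) for f in level0]
    for guards, groups in levels:
        i = bisect.bisect_right(guards, target) - 1
        # The guard holding `target` and every later guard may contain keys
        # >= target; guards before it hold only smaller keys and are skipped.
        eligible = guards[i:] if i >= 0 else [None] + guards
        for g in eligible:
            candidate_runs += [sorted(f) for f in groups.get(g, [])]
    merged = heapq.merge(*candidate_runs)
    return next((k for k in merged if k >= target), None)

memtable = {3, 250}
level0 = [[1, 2, 100, 1000]]
level1 = ([15, 70], {None: [[1, 5, 10]], 15: [[15, 23]], 70: [[300, 500, 2000]]})
print(flsm_seek(memtable, level0, [level1], 200))   # 250, found in the memtable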

Optimizations in PebblesDB Challenge with reads: Multiple sstable reads per level Optimized using sstable level bloom filters Bloom filter: determine if an element is in a set Is key 25 present? Bloom filter Definitely not Possibly yes 95

Optimizations in PebblesDB Challenge with reads: Multiple sstable reads per level Optimized using sstable level bloom filters Bloom filter: determine if an element is in a set 15 70 Level 1 1. 12 15. 39 77. 97 82. 95 Bloom Filter Bloom Filter Bloom Filter Bloom Filter Maintained in-memory Get(97) True 96

Optimizations in PebblesDB Challenge with reads: Multiple sstable reads per level Optimized using sstable level bloom filters Bloom filter: determine if an element is in a set 15 70 Level 1 1. 12 15. 39 77. 97 82. 95 Bloom Filter Bloom Filter Bloom Filter Bloom Filter Get(97) False True 97

Optimizations in PebblesDB Challenge with reads: Multiple sstable reads per level Optimized using sstable level bloom filters Bloom filter: determine if an element is in a set 15 70 Level 1 1. 12 15. 39 77. 97 82. 95 Bloom Filter Bloom Filter Bloom Filter Bloom Filter PebblesDB reads at most one file per guard with high probability 98

Optimizations in PebblesDB Challenge with seeks: Multiple sstable reads per level Parallel seeks: Parallel threads to seek() on files in a guard Thread 1 Thread 2 15 70 Level 1 1. 12 15. 39 77. 97 82. 95 Seek(85) 99
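
A sketch of the parallel-seek optimization: the files within one guard are seeked concurrently, one task per file, and the smallest per-file result wins. (The thread pool here only illustrates the idea; it is not PebblesDB's actual code.)

import bisect
from concurrent.futures import ThreadPoolExecutor

def seek_in_file(sorted_keys, target):
    """Smallest key >= target inside one sstable, or None if there is none."""
    i = bisect.bisect_left(sorted_keys, target)
    return sorted_keys[i] if i < len(sorted_keys) else None

def parallel_seek_in_guard(files, target):
    # One seek per file in the guard, run concurrently; keep the smallest hit.
    with ThreadPoolExecutor(max_workers=len(files)) as pool:
        results = list(pool.map(lambda f: seek_in_file(f, target), files))
    hits = [r for r in results if r is not None]
    return min(hits) if hits else None

guard_70_files = [[77, 80, 97], [82, 86, 95]]        # two sstables under guard 70
print(parallel_seek_in_guard(guard_70_files, 85))    # 86: smallest key >= 85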

Optimizations in PebblesDB Challenge with seeks: multiple sstable reads per level. Parallel seeks: parallel threads to seek() on files in a guard. Seek-based compaction: triggers compaction for a level during a seek-heavy workload, to reduce the average number of sstables per guard and the number of active levels. Seek-based compaction increases write I/O as a trade-off to improve seek performance. 100

Tuning PebblesDB PebblesDB characteristics (the increase in write throughput, the decrease in write amplification, and the overhead of read/seek operations) all depend on one parameter, maxfilesperguard (default 2 in PebblesDB). Setting this to a very high value favors write throughput; setting it to a very low value favors read throughput. 101

Horizontal compaction For the last two levels in PebblesDB, files are compacted within the same level, with some optimizations to prevent a huge increase in write IO. 102

Experimental setup Intel Xeon 2.8 GHz processor 16 GB RAM Running Ubuntu 16.04 LTS with the Linux 4.4 kernel Software RAID0 over 2 Intel 750 SSDs (1.2 TB each) Datasets in experiments 3x bigger than DRAM size 103

Write amplification Inserted different numbers of keys, with key size 16 bytes and value size 128 bytes. [Chart: write IO ratio with respect to PebblesDB for 10M, 100M and 500M keys inserted; PebblesDB's absolute write IO: 7.2 GB, 100.7 GB and 756 GB respectively.] 104

Micro-benchmarks Used the db_bench tool that ships with LevelDB. Inserted 50M key-value pairs with key size 16 bytes and value size 1 KB; number of read/seek operations: 10M. [Chart: throughput ratio with respect to HyperLevelDB; annotated throughput: Seq-Writes 239.05 Kops/s, Random-Writes 11.72 Kops/s, Reads 6.89 Kops/s, Range-Queries 7.5 Kops/s, Deletes 126.2 Kops/s.] 106

Multi-threaded micro-benchmarks Writes: 4 threads each writing 10M. Reads: 4 threads each reading 10M. Mixed: 2 threads writing and 2 threads reading (each 10M). [Chart: throughput ratio with respect to HyperLevelDB; annotated throughput: Writes 44.4 Kops/s, Reads 40.2 Kops/s, Mixed 38.8 Kops/s.] 107

Small cached dataset Insert 1M key-value pairs with 16-byte keys and 1 KB values; the total data set (~1 GB) fits within memory. PebblesDB-1: configured with a maximum of one file per guard. [Chart: throughput ratio with respect to HyperLevelDB; annotated throughput: Writes 45.25 Kops/s, Reads 205.76 Kops/s, Range-Queries 205.34 Kops/s.] 108

Small key-value pairs Inserted 300M key-value pairs with 16-byte keys and 128-byte values. [Chart: throughput ratio with respect to HyperLevelDB; annotated throughput: Writes 44.48 Kops/s, Reads 6.34 Kops/s, Range-Queries 6.31 Kops/s.] 109

Aged FS and KV store File system aging: fill up 89% of the file system. KV store aging: insert 50M, delete 20M and update 20M key-value pairs in random order. [Chart: throughput ratio with respect to HyperLevelDB; annotated throughput: Writes 17.37 Kops/s, Reads 5.65 Kops/s, Range-Queries 6.29 Kops/s.] 110

Low memory micro-benchmark 100M key-value pairs with 1 KB values (~65 GB data set); DRAM was limited to 4 GB. [Chart: throughput ratio with respect to HyperLevelDB; annotated throughput: Writes 27.78 Kops/s, Reads 2.86 Kops/s, Range-Queries 4.37 Kops/s.] 111

Impact of empty guards Inserted 20M key-value pairs (0 to 20M) in random order with value size 512 bytes, then incrementally inserted new sets of 20M keys after deleting the older keys. Around 9000 empty guards existed at the start of the last iteration. Read latency did not increase with the increase in empty guards. 112

NoSQL stores - HyperDex HyperDex: a distributed key-value store from Cornell. Inserted 20M key-value pairs with 1 KB value size and ran 10M operations. [Chart: throughput ratio with respect to HyperLevelDB; annotated results: Load A 22.08 Kops/s, Run A 21.85 Kops/s, Run B 31.17 Kops/s, Run C 32.75 Kops/s, Run D 38.02 Kops/s, Load E 7.62 Kops/s, Run E 0.37 Kops/s, Run F 19.11 Kops/s, Total IO 1349.5 GB.] Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes 113

CPU usage Median CPU usage while inserting 30M keys and reading 10M keys. PebblesDB: ~171%. Other key-value stores: 98-110%. The higher usage is due to aggressive compaction and the CPU cost of merging multiple files in a guard. 114

Memory usage 100M records (16-byte keys, 1 KB values): ~106 GB data set, ~300 MB of memory, i.e. 0.3% of the data set size. Worst case, 100M records (16-byte keys, 16-byte values): ~3.2 GB data set, with memory at 9% of the data set size. 115

Bloom filter calculation cost ~1.2 sec per GB of sstable; for 3200 files totaling 52 GB, about 62 seconds. 116

Impact of different optimizations Sstable-level bloom filters improve read performance by 63%. PebblesDB without the optimizations for seek: 66%. 117

Thank you! Questions? 118