RocksDB Embedded Key-Value Store for Flash and RAM

Dhruba Borthakur. February 2018. Presented at Dropbox.

Dhruba Borthakur: Who Am I?
- University of Wisconsin-Madison alumnus
- Developer of AFS: Andrew File System (mid-1990s)
- Developer of Veritas File System (late 1990s)
- Founding Engineer for Hadoop File System (mid-2000s)
- Founding Engineer of RocksDB (early 2010s)
- Co-founder of Rockset (a stealth-mode startup)

A Client-Server Architecture with Disks
- Application server to database server: network roundtrip = 50 microseconds
- Database server to locally attached disks: disk access = 10 milliseconds

Client-Server Architecture with Fast Storage
- Application server to database server: network roundtrip = 50 microseconds
- Database server storage: SSD access = 100 microseconds, RAM access = 100 nanoseconds
- Latency is dominated by the network

Architecture of an Embedded Database
[Diagram labels: Application Server; Network roundtrip = 50 microseconds; Database Server; SSD = 100 microseconds; RAM = 100 nanoseconds]
- Storage is attached directly to the application servers
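The latency figures on the three architecture slides can be compared with a little arithmetic. The sketch below is a simplified model, not a measurement: it assumes each request costs exactly one network roundtrip plus one storage access, and the function name network_share is made up for this example.

```python
# Latency figures quoted on the slides, expressed in microseconds.
NETWORK_ROUNDTRIP_US = 50.0
DISK_ACCESS_US = 10_000.0   # 10 milliseconds
SSD_ACCESS_US = 100.0       # 100 microseconds
RAM_ACCESS_US = 0.1         # 100 nanoseconds


def network_share(storage_us, network_us=NETWORK_ROUNDTRIP_US):
    """Fraction of one client-server request spent on the network hop.

    Assumption: one roundtrip plus one storage access per request.
    """
    return network_us / (network_us + storage_us)


for name, storage_us in [("disk", DISK_ACCESS_US),
                         ("SSD", SSD_ACCESS_US),
                         ("RAM", RAM_ACCESS_US)]:
    print(f"{name}: network is {network_share(storage_us):.1%} of the request")
```

Under this model the network hop is a rounding error with disks, about a third of the request with SSD, and almost the entire request with RAM, which is the case for embedding the store in the application server.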

RocksDB is born!
- Key-value persistent store
- Embedded
- Optimized for fast storage
- Server workloads

What is it not?
- Not distributed
- No failover
- Not highly available: if the machine dies, you lose your data
- Focus on single-node performance

RocksDB API
- Keys and values are byte arrays
- Data are stored sorted by key
- Update operations: Put/Delete/Merge
- Queries: Get/Iterator
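To make the shape of this API concrete, here is a toy in-memory stand-in written in plain Python. It is not RocksDB and not its real binding: the class ToySortedStore and its method names are hypothetical, chosen only to mirror Put, Delete, Get, and Iterator over byte-string keys kept in sorted order. Merge is left out of this sketch.

```python
import bisect


class ToySortedStore:
    """Toy in-memory stand-in for a sorted key-value store (not RocksDB itself)."""

    def __init__(self):
        self._keys = []     # byte-string keys, kept sorted
        self._values = {}   # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def delete(self, key: bytes) -> None:
        if key in self._values:
            del self._values[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def get(self, key: bytes):
        return self._values.get(key)

    def iterate(self, start: bytes = b""):
        """Yield (key, value) pairs in key order, from the first key >= start."""
        first = bisect.bisect_left(self._keys, start)
        for key in self._keys[first:]:
            yield key, self._values[key]


store = ToySortedStore()
store.put(b"banana", b"2")
store.put(b"apple", b"1")
store.put(b"cherry", b"3")
store.delete(b"banana")
print(list(store.iterate()))
```

Keys come back in byte order regardless of insertion order, which is what makes range scans cheap.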

Log Structured Merge Architecture
[Diagram labels: Write Request from Application; Scan Request from Application; Read-Write data in RAM; Read-Only data in SSD or disk; Transaction log; Periodic Compaction]

RocksDB Write Path
[Diagram labels: Write Request; Active MemTable; Switch; ReadOnly MemTable; log files; Flush; Compaction]
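The write path in the diagram (log, active memtable, switch, flush, compaction) can be sketched as a toy model. This is a deliberately simplified illustration with hypothetical names (ToyLSM, memtable_limit): it flushes synchronously inside put, whereas a real engine would do the flush and compaction in the background.

```python
class ToyLSM:
    """Toy model of a log-structured write path (illustrative only)."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.wal = []         # stands in for the transaction log
        self.active = {}      # active memtable
        self.sst_files = []   # oldest first; each file is a sorted list of (key, value)

    def put(self, key, value):
        self.wal.append((key, value))   # 1. append to the log
        self.active[key] = value        # 2. insert into the active memtable
        if len(self.active) >= self.memtable_limit:
            self._switch_and_flush()

    def _switch_and_flush(self):
        # Switch: the full memtable becomes read-only and a fresh one takes over.
        read_only, self.active = self.active, {}
        # Flush: write the read-only memtable out as one sorted file.
        self.sst_files.append(sorted(read_only.items()))
        # The flushed entries no longer need the log for recovery.
        self.wal.clear()

    def compact(self):
        """Merge all files into one, keeping the newest value for each key."""
        merged = {}
        for sst in self.sst_files:   # oldest to newest, so newer values win
            merged.update(sst)
        self.sst_files = [sorted(merged.items())] if merged else []


db = ToyLSM(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 10), ("c", 3), ("d", 4)]:
    db.put(key, value)
db.compact()
print(db.sst_files, db.active)
```

After the five writes and one compaction, a single sorted file holds the newest value of a, b, and c, while d is still only in the active memtable and the log.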

RocksDB -- Reads
- Data could be in memory or on disk
- Consult multiple files to find the latest instance of the key
- Use bloom filters to reduce IO
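The three points above can be sketched as a lookup that checks the memtable first, then walks the files from newest to oldest, and uses a per-file Bloom filter to skip files that cannot contain the key. Everything here is a toy with hypothetical names (ToyBloom, make_file, get); the filter sizes and the SHA-256 based hashing are arbitrary choices for the example, not what RocksDB uses.

```python
import hashlib


class ToyBloom:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0   # a Python int used as a bit array

    def _positions(self, key: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: bytes) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def make_file(entries: dict):
    bloom = ToyBloom()
    for key in entries:
        bloom.add(key)
    return {"bloom": bloom, "data": entries}


def get(key, memtable, files_newest_first, stats):
    if key in memtable:                         # newest data lives in memory
        return memtable[key]
    for f in files_newest_first:                # then newest file to oldest
        if not f["bloom"].might_contain(key):
            stats["skipped"] += 1               # filter says "definitely absent"
            continue
        stats["read"] += 1                      # this models one file read
        if key in f["data"]:
            return f["data"][key]
    return None


memtable = {b"k3": b"v3"}
files = [make_file({b"k1": b"new"}), make_file({b"k1": b"old", b"k2": b"v2"})]
stats = {"skipped": 0, "read": 0}
value = get(b"k1", memtable, files, stats)
print(value, stats)
```

The lookup for k1 stops at the newest file, so the older value in the second file is never read.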

RocksDB Read Path
[Diagram labels: Read Request Get(k); Memory: Active MemTable, ReadOnly MemTable; Persistent Storage: log files, Blooms; Flush; Compaction]

RocksDB Architecture
[Diagram labels: Write Request; Read Request; Memory: Active MemTable, Switch, ReadOnly MemTable, ReadOnly BlockCache; Persistent Storage: log files; Flush; Compaction]

RocksDB: Open & Pluggable
[Diagram labels: Get or Scan Request from Application; Write Request from Application; Transaction log]
- Customizable WAL
- Blooms
- Pluggable compaction
- Pluggable memtable format in RAM
- Pluggable data format on storage

Customizable WALogging
[Diagram: a write request issues PutLogData("I came from Mars") and Put(k1, v1); the Active MemTable holds k1/v1; the log holds "I came from Mars" and k1/v1]

SST Files
- Static Sorted Table
- All keys are sorted
- Block-based format: data on spinning disks and SSD
- Plain table format: data in RAM

Read uses Bloom Filters
[Diagram labels: Read Request; Memory: Blooms, Active MemTable, ReadOnly MemTable; Persistent Storage: log files, Blooms; Flush; Compaction]

Pluggable Memtable Formats
[Diagram labels: Write Request; Read Request; Memory: Unsorted MemTable, Switch, ReadOnly MemTable; Persistent Storage: log files; Sort, Flush; Compaction]
- Configure an unsorted memtable for bulk imports

Column Families
[Diagram labels: Write to CF1; Write to CF2; MemTables of CF1; MemTables of CF2; shared log; Persistent Storage]
- Atomic writes to multiple keys across multiple column families
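One way to picture an atomic write across column families is a batch that is recorded as a single entry in the shared log and only then applied to each family's memtable. The sketch below is a toy under that assumption; ToyColumnFamilyDB and its write method are hypothetical and do not reproduce RocksDB's own batch API.

```python
class ToyColumnFamilyDB:
    """Toy model: per-column-family memtables sharing one log."""

    def __init__(self, column_families):
        self.shared_log = []   # one log shared by every column family
        self.memtables = {name: {} for name in column_families}

    def write(self, batch):
        """Apply a list of (column_family, key, value) updates as one unit."""
        # Validate first, so a bad entry cannot leave the batch half-applied.
        for cf, _key, _value in batch:
            if cf not in self.memtables:
                raise KeyError(f"unknown column family: {cf}")
        # One log record covers the whole batch, whatever families it touches.
        self.shared_log.append(list(batch))
        for cf, key, value in batch:
            self.memtables[cf][key] = value


db = ToyColumnFamilyDB(["CF1", "CF2"])
db.write([("CF1", b"user:1", b"alice"), ("CF2", b"email:alice", b"user:1")])
print(db.memtables)
```

Because the batch is one log record, replaying the log after a crash would restore either both updates or neither.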

Write Ahead Log (WAL) Configuration
[Diagram labels: Write Request; Memory: MemTable; Persistent Storage: log]
- DisableWAL = true: reduces write amplification
- sync = false: process restart does not lose data
- sync = true: machine reboot does not lose any data

WAL Recovery Modes
- The WAL is processed during database Open
- Options:
  - recover all data from WAL
  - recover all except the last WAL record
  - recover up to the first corrupted record
  - recover all valid records
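The four options can be illustrated on a toy log whose records carry a CRC32 checksum. This is one reading of the slide, not RocksDB's recovery code: the mode strings below are made up for the example, and treating "recover all data" as "fail on any corruption" is an assumption.

```python
import zlib


def make_record(payload: bytes):
    return (payload, zlib.crc32(payload))


def is_valid(record) -> bool:
    payload, checksum = record
    return zlib.crc32(payload) == checksum


def recover(log, mode):
    """Return the payloads recovered from `log` under a hypothetical mode name."""
    valid = [is_valid(r) for r in log]
    if mode == "all":
        # Assumed meaning: every record must be intact, otherwise refuse to open.
        if not all(valid):
            raise ValueError("corrupted WAL record")
        return [payload for payload, _ in log]
    if mode == "all_except_last":
        # Only the final record may be damaged (for example a torn last write).
        if not all(valid[:-1]):
            raise ValueError("corruption before the last record")
        return [payload for (payload, _), ok in zip(log, valid) if ok]
    if mode == "up_to_first_corruption":
        recovered = []
        for (payload, _), ok in zip(log, valid):
            if not ok:
                break
            recovered.append(payload)
        return recovered
    if mode == "all_valid":
        # Skip damaged records wherever they are and keep the rest.
        return [payload for (payload, _), ok in zip(log, valid) if ok]
    raise ValueError(f"unknown mode: {mode}")


middle_damaged = [make_record(b"a"), (b"b", 0), make_record(b"c")]
tail_damaged = [make_record(b"a"), make_record(b"b"), (b"c", 0)]
print(recover(middle_damaged, "up_to_first_corruption"))
print(recover(middle_damaged, "all_valid"))
print(recover(tail_damaged, "all_except_last"))
```

The trade-off is between consistency and data retained: stopping at the first corruption keeps a consistent prefix, while keeping all valid records salvages more data at the cost of possible gaps.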

Block Cache
- Used only for reads
- Adjacent keys are delta-encoded
- Sharded n ways to avoid lock contention
- Configure: index and filter blocks in cache; compressed or uncompressed
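Delta-encoding adjacent keys works because sorted neighbours usually share a long prefix, so each key can be stored as "bytes shared with the previous key" plus the remaining suffix. The functions below are a minimal sketch of that idea with hypothetical names; a real block format would add details such as periodic full keys for random access, which this example omits.

```python
def delta_encode(sorted_keys):
    """Encode each key as (shared_prefix_length, suffix) relative to the previous key."""
    encoded, prev = [], b""
    for key in sorted_keys:
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def delta_decode(encoded):
    """Rebuild the original keys from (shared_prefix_length, suffix) pairs."""
    keys, prev = [], b""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


keys = [b"user:1000", b"user:1001", b"user:1002", b"user:2000"]
encoded = delta_encode(keys)
print(encoded)
```

In this example only the first key is stored in full; the next two need a single suffix byte each.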

Block Cache: Pluggable
- Pluggable, supply your own code
- LRU Cache, ClockCache
- Shared by multiple DBs within the same process
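A sharded LRU cache splits the key space across several independent LRU lists, each with its own lock, so threads touching different shards do not contend. The sketch below shows the idea only; LRUShard and ShardedLRUCache are hypothetical classes, and hashing keys with CRC32 is an arbitrary choice for the example.

```python
import threading
import zlib
from collections import OrderedDict


class LRUShard:
    """One shard: an LRU list guarded by its own lock."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lock = threading.Lock()
        self.entries = OrderedDict()   # least recently used entry first

    def get(self, key):
        with self.lock:
            if key not in self.entries:
                return None
            self.entries.move_to_end(key)   # mark as most recently used
            return self.entries[key]

    def put(self, key, value):
        with self.lock:
            self.entries[key] = value
            self.entries.move_to_end(key)
            while len(self.entries) > self.capacity:
                self.entries.popitem(last=False)   # evict the least recently used


class ShardedLRUCache:
    def __init__(self, num_shards=4, capacity_per_shard=2):
        self.shards = [LRUShard(capacity_per_shard) for _ in range(num_shards)]

    def _shard(self, key: bytes) -> LRUShard:
        return self.shards[zlib.crc32(key) % len(self.shards)]

    def get(self, key: bytes):
        return self._shard(key).get(key)

    def put(self, key: bytes, value) -> None:
        self._shard(key).put(key, value)


cache = ShardedLRUCache(num_shards=4, capacity_per_shard=2)
cache.put(b"block-1", "data-1")
print(cache.get(b"block-1"))
```

Each operation takes only the lock of the shard that owns the key, which is the point of sharding "n ways".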

Compaction Filter
- Invoked when compacting two or more data files
- Drop keys or modify values
- C++ or Lua
- Useful to implement higher-level functionality
- Example: time-to-live for each individual key
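The time-to-live use case can be sketched as a callback that compaction consults for every surviving key. The code below is a toy: compact and make_ttl_filter are hypothetical helpers, and storing the write timestamp inside the value as a (written_at, payload) tuple is an assumption made for the example.

```python
def compact(files_oldest_first, keep):
    """Merge files (newer values win) and drop every entry the filter rejects."""
    merged = {}
    for f in files_oldest_first:
        merged.update(f)
    return {key: value for key, value in sorted(merged.items()) if keep(key, value)}


def make_ttl_filter(now, ttl_seconds):
    """Build a filter that keeps only entries younger than ttl_seconds."""
    def keep(key, value):
        written_at, _payload = value
        return now - written_at < ttl_seconds
    return keep


older = {b"a": (100, b"x"), b"b": (900, b"y")}
newer = {b"a": (950, b"z"), b"c": (200, b"w")}
result = compact([older, newer], make_ttl_filter(now=1000, ttl_seconds=300))
print(result)
```

Key c is 800 seconds old and is dropped during the merge; no separate deletion pass or read-before-delete is needed.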

Merge Records
- User writes a Merge Record to the DB
- Specifies a MergeOperator
- Invoked by Compaction and Get
- Avoids read-modify-write
- Examples: counters, Redis lists
- AssociativeMerge and GenericMerge
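A counter shows why merge records avoid read-modify-write: the writer appends an operand without reading the current value, and the operator is applied later, on Get or during compaction. The sketch below is a toy with hypothetical names (ToyMergeStore, add_counter) and is not the RocksDB MergeOperator interface.

```python
class ToyMergeStore:
    """Toy store that defers merging until a read or a compaction."""

    def __init__(self, merge_operator):
        self.merge_operator = merge_operator
        self.base = {}       # key -> fully merged value
        self.operands = {}   # key -> pending merge operands, oldest first

    def put(self, key, value):
        self.base[key] = value
        self.operands.pop(key, None)   # a Put supersedes earlier operands

    def merge(self, key, operand):
        # Write-only: the current value is not read here.
        self.operands.setdefault(key, []).append(operand)

    def get(self, key):
        value = self.base.get(key)
        for operand in self.operands.get(key, []):
            value = self.merge_operator(value, operand)
        return value

    def compact(self):
        """Fold pending operands into the base value for every key."""
        for key in list(self.operands):
            self.base[key] = self.get(key)
            del self.operands[key]


def add_counter(existing, operand):
    """Associative merge operator for an integer counter."""
    return (existing or 0) + operand


store = ToyMergeStore(add_counter)
for _ in range(3):
    store.merge("hits", 1)
print(store.get("hits"))
store.compact()
```

Three increments cost three blind writes; the additions happen once, when the value is read or compacted.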

Add external file
- Used to bulk-import data from Hadoop/S3
- Add an SST file to RocksDB
- All keys are added atomically
- Add as most recent or as oldest

Compression Options on Storage
- Compression per block
- Pluggable, supply your own code
- snappy, zlib, lz4, zstd
- Dictionary per file
- Dictionary size configurable
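Per-block compression with a shared dictionary can be tried out with Python's standard zlib module, which accepts a preset dictionary through the zdict argument. This is only an illustration of the idea, not RocksDB's storage code: the helper names, the sample blocks, and the hand-written dictionary are all assumptions for the example.

```python
import zlib


def compress_block(block: bytes, dictionary: bytes = b"") -> bytes:
    """Compress one block on its own, optionally primed with a shared dictionary."""
    if dictionary:
        compressor = zlib.compressobj(zdict=dictionary)
    else:
        compressor = zlib.compressobj()
    return compressor.compress(block) + compressor.flush()


def decompress_block(data: bytes, dictionary: bytes = b"") -> bytes:
    """Decompress one block; the same dictionary must be supplied again."""
    if dictionary:
        decompressor = zlib.decompressobj(zdict=dictionary)
    else:
        decompressor = zlib.decompressobj()
    return decompressor.decompress(data) + decompressor.flush()


# Hypothetical blocks of one file, plus a dictionary holding their common substrings.
blocks = [b'{"user_id": 1, "country": "US"}', b'{"user_id": 2, "country": "DE"}']
dictionary = b'{"user_id": , "country": ""}'

compressed = [compress_block(block, dictionary) for block in blocks]
restored = [decompress_block(data, dictionary) for data in compressed]
print(restored == blocks)
```

Compressing each block separately keeps random reads cheap, since only the block being read has to be decompressed; the dictionary is a way to recover some of the redundancy that small independent blocks would otherwise miss.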

Optimize for short range scans
- Prefix scans
- Range scans within the same key prefix
- Blooms created for prefix
- Reduces read amplification
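A prefix scan over sorted keys is a seek to the first key at or after the prefix followed by a walk until the prefix stops matching; a per-file record of which prefixes are present lets the scan skip files entirely. The sketch below uses a plain Python set as a stand-in for the prefix Bloom filter, and PREFIX_LEN, build_file, and scan are hypothetical names. It assumes every scan prefix is at least PREFIX_LEN bytes long.

```python
import bisect

PREFIX_LEN = 5   # hypothetical fixed prefix length used for the per-file filter


def prefix_scan(sorted_keys, prefix: bytes):
    """Return the keys starting with `prefix`, using one seek and a short walk."""
    start = bisect.bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        result.append(key)
    return result


def build_file(keys):
    keys = sorted(keys)
    # A set stands in for the prefix Bloom filter of this file.
    return {"keys": keys, "prefixes": {key[:PREFIX_LEN] for key in keys}}


def scan(files, prefix: bytes):
    """Scan every file that may hold the prefix; return (keys, files_read)."""
    found, files_read = [], 0
    for f in files:
        if prefix[:PREFIX_LEN] not in f["prefixes"]:
            continue   # the filter rules this file out without reading it
        files_read += 1
        found.extend(prefix_scan(f["keys"], prefix))
    return sorted(found), files_read


files = [build_file([b"user1:a", b"user1:b"]), build_file([b"user2:a"])]
print(scan(files, b"user1"))
```

Only one of the two files is read for the user1 scan, which is the reduction in read amplification the slide refers to.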

RocksDB Usage Explosion
- Development started in May 2012
- Open sourced in Nov 2013
- The benefits of open source:
  - Adoption by LinkedIn (feed), Yahoo (Sherpa)
  - Ported to Windows by Microsoft (Bing)
  - Apache Samza, bitcoin, Red Hat Ceph
  - Ported to iOS and Android
  - MySQL and MongoDB storage engine

MongoDB: RocksDB storage engine
- Reduces a 5 TB MongoDB instance to 285 GB on MongoRocks (experimental result in 2014)

MySQL: RocksDB storage engine
[Bar chart "DB Size Comparison": DB Size (Relative), InnoDB vs. RocksDB]
- LinkBench: open source benchmark for Facebook's workload
- Reduces MySQL flash storage space by 50% for LinkBench

MySQL: RocksDB storage engine
[Bar chart titled "DB Size Comparison"; y-axis: Bytes Written (Relative); bars: InnoDB, RocksDB]
- Reduces write amplification by 50% for LinkBench

SPARROW Theorem
- New way to measure DB performance on fast storage
- Space Amplification (SPA)
- Read Amplification (RA)
- Write Amplification (WA)
- SPARROW theorem states:
  - RA is inversely related to WA
  - WA is inversely related to SPA
- http://rocksdb.blogspot.com/

RocksDB features for MySQL support
- Optimistic transactions
- Pessimistic transactions

RocksDB-Cloud
- Optimized for cloud applications
- AWS, Google, Azure
- Provides durability
- Locally attached SSD for performance
- AWS S3 for durability
- n times more cost-effective than EBS, n >= 2

ARCHITECTURE OF A CLOUD APPLICATION
[Diagram labels: tail data from distributed log storage (writes); Cloud Application with RocksDB-Cloud; queries (reads); memtable cache; block cache; flush file to local SSD; flush file to cloud storage; persistent read cache on SSD; Cloud Storage]

ROCKSDB-CLOUD: ZERO COPY CLONES
[Diagram labels: Server with RocksDB-Cloud, tailing data from distributed log storage, write/read on Cloud Bucket A; Cloned Server with RocksDB-Cloud, tailing data from distributed log storage, read on Cloud Bucket A, write/read on Cloud Bucket B; queries]
- Queries served by either server
- Instantaneous clone creation
- Both machines run at their own speeds
- True masterless configuration

PORTABILITY ACROSS CLOUD VENDORS: SEAMLESS COPY AMONG S3, AZURE, GOOGLE
[Diagram labels: RocksDB-Cloud apps reading and writing across AWS S3, Azure, Google Cloud]
- App on Azure can access AWS S3 storage
- App on Google Cloud can access Azure storage
- Same API on all cloud platforms
- https://github.com/rockset/rocksdb-cloud

LOW ADOPTION COST: COMPATIBILITY WITH ROCKSDB
- Pure open source
- API compatible with stock RocksDB
- Data format compatible with stock RocksDB
- License compatible with stock RocksDB
- https://github.com/rockset/rocksdb-cloud

Questions?