RocksDB Embedded Key-Value Store for Flash and RAM


RocksDB: Embedded Key-Value Store for Flash and RAM
Dhruba Borthakur, February 2018. Presented at Dropbox.

Dhruba Borthakur: Who Am I?
- University of Wisconsin-Madison alumnus
- Developer of AFS, the Andrew File System (mid-1990s)
- Developer of the Veritas File System (late 1990s)
- Founding engineer of the Hadoop File System (mid-2000s)
- Founding engineer of RocksDB (early 2010s)
- Co-founder of Rockset (a stealth-mode startup)

A Client-Server Architecture with Disks
(Diagram labels: application server, database server, locally attached disks.)
- Network round trip = 50 microseconds
- Disk access = 10 milliseconds

Client-Server Architecture with Fast Storage
(Diagram labels: application server, database server, SSD, RAM.)
- Network round trip = 50 microseconds
- SSD access = 100 microseconds
- RAM access = 100 nanoseconds
- Latency is dominated by the network
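The latency figures on these two slides can be turned into a quick back-of-the-envelope calculation. The sketch below only uses the numbers quoted above; the function name `network_share` is made up for illustration. It shows the network round trip going from a negligible share of a request with disks (about 0.5%) to about a third with SSD and almost all of it with RAM.

```python
# Latency figures quoted on the slides, in microseconds.
NETWORK_US = 50.0
STORAGE_US = {"disk": 10_000.0, "ssd": 100.0, "ram": 0.1}


def network_share(storage_us, network_us=NETWORK_US):
    """Fraction of total request latency spent on the network round trip."""
    return network_us / (network_us + storage_us)


for medium, latency in STORAGE_US.items():
    print(f"{medium}: {network_share(latency):.1%} of latency is network")
```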

Architecture of an Embedded Database
(Diagram labels: application server, database server, SSD, RAM.)
- Network round trip = 50 microseconds
- SSD access = 100 microseconds
- RAM access = 100 nanoseconds
- Storage is attached directly to the application servers

RocksDB is born!
- Persistent key-value store
- Embedded
- Optimized for fast storage
- Server workloads

What is it not?
- Not distributed
- No failover
- Not highly available: if the machine dies, you lose your data
- Focus on single-node performance

RocksDB API
- Keys and values are byte arrays
- Data are stored sorted by key
- Update operations: Put/Delete/Merge
- Queries: Get/Iterator
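To make the shape of this API concrete, here is a minimal in-memory sketch. It is a toy stand-in, not the RocksDB bindings: the class name `ToyKV` and its method names are assumptions chosen to mirror Put, Delete, Get and Iterator, and Merge is left out.

```python
import bisect


class ToyKV:
    """In-memory stand-in for an ordered byte-string key-value store."""

    def __init__(self):
        self._keys = []   # keys kept in sorted order
        self._vals = {}   # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def delete(self, key: bytes) -> None:
        if key in self._vals:
            del self._vals[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def get(self, key: bytes):
        return self._vals.get(key)

    def iterate(self, start: bytes = b""):
        """Yield (key, value) pairs in key order, beginning at `start`."""
        i = bisect.bisect_left(self._keys, start)
        for key in self._keys[i:]:
            yield key, self._vals[key]


db = ToyKV()
db.put(b"b", b"2")
db.put(b"a", b"1")
db.put(b"c", b"3")
db.delete(b"b")
print(list(db.iterate()))  # keys come back in sorted order
```

Because keys are kept sorted, a range scan is just an iteration that starts at a given key.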

Log Structured Merge Architecture
(Diagram labels: scan request from application, write request from application, read-write data in RAM, read-only data on SSD or disk, transaction log, periodic compaction.)

RocksDB Write Path
(Diagram labels: write request, active MemTable, switch, read-only MemTable, logs, flush, compaction.)
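The write path named in the diagram can be sketched as a toy model. Everything here is an illustrative assumption (the class `ToyLSMWriter`, the size-based switch, plain Python lists standing in for files), and log truncation after a flush is omitted; it only shows the order of steps: append to the log, update the active memtable, switch it to read-only when full, flush it as a sorted file, and later compact files.

```python
class ToyLSMWriter:
    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.log = []            # stand-in for the transaction log
        self.active = {}         # active (read-write) memtable
        self.readonly = []       # memtables waiting to be flushed
        self.sorted_files = []   # flushed, immutable sorted runs; newest last

    def put(self, key, value):
        self.log.append((key, value))           # 1. append to the log first
        self.active[key] = value                # 2. then update the memtable
        if len(self.active) >= self.memtable_limit:
            self.readonly.append(self.active)   # 3. switch memtables
            self.active = {}

    def flush(self):
        """Write each read-only memtable out as one sorted file."""
        while self.readonly:
            memtable = self.readonly.pop(0)
            self.sorted_files.append(sorted(memtable.items()))

    def compact(self):
        """Merge all sorted files into one; newer values win."""
        merged = {}
        for run in self.sorted_files:           # oldest to newest
            merged.update(run)
        self.sorted_files = [sorted(merged.items())] if merged else []


w = ToyLSMWriter(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    w.put(k, v)
w.flush()
files_before = [list(run) for run in w.sorted_files]
w.compact()
```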

RocksDB Reads
- Data could be in memory or on disk
- Consult multiple files to find the latest instance of the key
- Use Bloom filters to reduce I/O
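A rough sketch of such a lookup follows. The names `ToyBloom`, `ToyFile` and `lookup` are hypothetical, the filter parameters are arbitrary, and delete markers are ignored; the point is only that memtables are checked first, files are checked newest-first, and a per-file Bloom filter lets the reader skip files that cannot contain the key.

```python
import hashlib


class ToyBloom:
    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0

    def _positions(self, key: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits |= 1 << p

    def may_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> p) & 1 for p in self._positions(key))


class ToyFile:
    def __init__(self, entries: dict):
        self.entries = dict(entries)
        self.bloom = ToyBloom()
        for key in self.entries:
            self.bloom.add(key)
        self.reads = 0   # counts simulated storage reads

    def get(self, key):
        self.reads += 1
        return self.entries.get(key)


def lookup(key, memtables, files):
    """Check memtables newest-first, then files newest-first."""
    for table in memtables:
        if key in table:
            return table[key]
    for f in files:
        if f.bloom.may_contain(key):   # skip files that cannot hold the key
            value = f.get(key)
            if value is not None:
                return value
    return None


newest = ToyFile({b"k1": b"new"})
oldest = ToyFile({b"k1": b"old", b"k2": b"x"})
first = lookup(b"k1", [{}], [newest, oldest])
```

Here the lookup for `k1` stops at the newest file, so the older file is never read.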

RocksDB Read Path
(Diagram labels: read request Get(k); memory: active MemTable, read-only MemTable; persistent storage: logs, Blooms; flush, compaction.)

RocksDB Architecture
(Diagram labels: write request, read request; memory: active MemTable, switch, read-only MemTable, read-only BlockCache; persistent storage: logs; flush, compaction.)

RocksDB: Open & Pluggable
(Diagram labels: get or scan request from application, write request from application, transaction log.)
- Customizable WAL
- Blooms
- Pluggable compaction
- Pluggable memtable format in RAM
- Pluggable data format on storage

Customizable WAL Logging
(Diagram: a write request issues PutLogData("I came from Mars") and Put(k1, v1). The active MemTable holds k1 -> v1; the log holds "I came from Mars" and k1/v1.)

SST Files: Static Sorted Table
- All keys are sorted
- Block-based format: for data on spinning disks and SSD
- Plain table format: for data in RAM

Read uses Bloom Filters
(Diagram labels: read request; memory: Blooms, active MemTable, read-only MemTable; persistent storage: logs, Blooms; flush, compaction.)

Pluggable Memtable Formats
(Diagram labels: write request, read request; memory: unsorted MemTable, switch, read-only MemTable; persistent storage: logs; sort, flush, compaction.)
- Configure an unsorted memtable for bulk imports

Column Families
(Diagram labels: write to CF1, MemTables CF1; write to CF2, MemTables CF2; shared log; persistent storage.)
- Atomic writes to multiple keys across multiple column families
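One way to picture this slide is a set of per-family memtables that share a single log, with a batch written as one log record. The sketch below is a toy under that assumption; `ToyMultiCF` and its `write` method are invented names, not RocksDB calls.

```python
class ToyMultiCF:
    def __init__(self, names):
        self.shared_log = []                      # one log for all families
        self.memtables = {n: {} for n in names}   # one memtable per family

    def write(self, batch):
        """Apply a list of (family, key, value) updates all-or-nothing."""
        for family, _, _ in batch:
            if family not in self.memtables:      # validate before any change
                raise KeyError(f"unknown column family: {family}")
        self.shared_log.append(list(batch))       # a single log record
        for family, key, value in batch:
            self.memtables[family][key] = value


cf_db = ToyMultiCF(["CF1", "CF2"])
cf_db.write([("CF1", "k", "1"), ("CF2", "k", "2")])
```

Since the whole batch is one log record, replaying the log would restore either all of its updates or none of them.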

Write Ahead Log (WAL) Configuration
(Diagram labels: write request; memory: MemTable; persistent storage: log.)
- DisableWAL = true: reduces write amplification
- sync = false: a process restart does not lose data
- sync = true: a machine reboot does not lose any data

WAL Recovery Modes
- The WAL is processed during database Open
- Options:
  - recover all data from the WAL
  - recover all except the last WAL record
  - recover up to the first corrupted record
  - recover all valid records

Block Cache
- Used only for reads
- Adjacent keys are delta-encoded
- Sharded n ways to avoid lock contention
- Configurable: index and filter blocks in cache, compressed or uncompressed
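The idea of sharding a cache to reduce lock contention can be sketched as follows. This is an illustrative toy, not RocksDB's cache: `ToyShardedLRU`, the CRC-based shard choice and the per-shard capacity are all assumptions. Each shard has its own lock and its own LRU order, so threads touching different shards do not wait on each other.

```python
import threading
import zlib
from collections import OrderedDict


class ToyShardedLRU:
    def __init__(self, num_shards=4, capacity_per_shard=2):
        self.capacity = capacity_per_shard
        self.shards = [OrderedDict() for _ in range(num_shards)]
        self.locks = [threading.Lock() for _ in range(num_shards)]

    def _shard(self, key: bytes) -> int:
        return zlib.crc32(key) % len(self.shards)

    def get(self, key: bytes):
        i = self._shard(key)
        with self.locks[i]:                 # lock only this shard
            shard = self.shards[i]
            if key not in shard:
                return None
            shard.move_to_end(key)          # mark as most recently used
            return shard[key]

    def put(self, key: bytes, block: bytes) -> None:
        i = self._shard(key)
        with self.locks[i]:
            shard = self.shards[i]
            shard[key] = block
            shard.move_to_end(key)
            if len(shard) > self.capacity:
                shard.popitem(last=False)   # evict least recently used


cache = ToyShardedLRU(num_shards=1, capacity_per_shard=2)
cache.put(b"a", b"A")
cache.put(b"b", b"B")
cache.get(b"a")          # "a" is now more recent than "b"
cache.put(b"c", b"C")    # evicts "b"
```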

Block Cache: Pluggable
- Pluggable: supply your own code
- LRU cache, ClockCache
- Shared by multiple DBs within the same process

Compaction Filter
- Invoked when compacting two or more data files
- Can drop keys or modify values
- Written in C++ or Lua
- Useful for implementing higher-level functionality
- Example: time-to-live for each individual key
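A per-key time-to-live filter of the kind mentioned here might look like the sketch below. It is a toy: files are plain dicts, values are assumed to be `(written_at, payload)` tuples, and the names `compact` and `ttl_filter` are made up. The callback is consulted for every surviving entry during the merge and can drop it by returning `None`.

```python
def compact(files, keep):
    """Merge files (oldest first); `keep(key, value)` returns the value to
    retain, or None to drop the entry."""
    merged = {}
    for entries in files:
        merged.update(entries)          # newer files overwrite older ones
    out = {}
    for key in sorted(merged):
        kept = keep(key, merged[key])
        if kept is not None:
            out[key] = kept
    return out


def ttl_filter(now, ttl_seconds):
    """Build a filter that drops entries older than `ttl_seconds`."""
    def keep(key, value):
        written_at, _payload = value
        return value if now - written_at < ttl_seconds else None
    return keep


old_file = {"a": (0, "x"), "b": (90, "y"), "c": (10, "w")}
new_file = {"a": (95, "z")}
result = compact([old_file, new_file], ttl_filter(now=100, ttl_seconds=20))
```

Expiry happens as a side effect of compaction, so no separate scan is needed to remove stale keys.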

Merge Records
- User writes a Merge record to the DB
- Specifies a MergeOperator
- Invoked by compaction and Get
- Avoids read-modify-write cycles
- Examples: counters, Redis lists
- Associative merge and generic merge
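The counter case can be illustrated with a small toy. `ToyMergeDB` and `add_counter` are hypothetical names and the storage model is simplified to two dicts; what it shows is that a merge write only appends an operand, and the operator is applied later, either on a read or during compaction.

```python
class ToyMergeDB:
    def __init__(self, merge_operator):
        self.merge_operator = merge_operator
        self.base = {}       # key -> fully merged value
        self.pending = {}    # key -> list of unapplied merge operands

    def merge(self, key, operand):
        """Record an operand without reading the current value."""
        self.pending.setdefault(key, []).append(operand)

    def get(self, key):
        value = self.base.get(key)
        for operand in self.pending.get(key, []):
            value = self.merge_operator(value, operand)
        return value

    def compact(self):
        """Fold pending operands into the base value."""
        for key in list(self.pending):
            self.base[key] = self.get(key)
            del self.pending[key]


def add_counter(existing, operand):
    """Associative merge: integer addition, treating a missing value as 0."""
    return (existing or 0) + operand


counters = ToyMergeDB(add_counter)
for _ in range(3):
    counters.merge("hits", 1)
hits_before_compaction = counters.get("hits")
pending_before_compaction = len(counters.pending["hits"])
counters.compact()
```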

Add External File
- Used to bulk-import data from Hadoop/S3
- Add an external file to RocksDB
- All keys are added atomically
- Add as the most recent or as the oldest data

Compression Options on Storage
- Compression per block
- Pluggable: supply your own code
- snappy, zlib, lz4, zstd
- Dictionary per file
- Dictionary size is configurable
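Per-block compression with a shared dictionary can be tried out with Python's standard `zlib` module. This is only a sketch of the general idea, not RocksDB's file format: the helper names and the sample dictionary are assumptions. Each block is compressed on its own, so any single block can be decompressed without reading its neighbours, while the dictionary supplies strings common to the whole file.

```python
import zlib


def compress_block(block: bytes, dictionary: bytes) -> bytes:
    """Compress one block independently, primed with a shared dictionary."""
    c = zlib.compressobj(zdict=dictionary)
    return c.compress(block) + c.flush()


def decompress_block(data: bytes, dictionary: bytes) -> bytes:
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(data) + d.flush()


# Hypothetical per-file dictionary holding strings common to many blocks.
file_dictionary = b"user:profile:email:"
blocks = [b"user:1:profile:email:a@example.com",
          b"user:2:profile:email:b@example.com"]
compressed = [compress_block(b, file_dictionary) for b in blocks]
second_block = decompress_block(compressed[1], file_dictionary)
```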

Optimize for Short Range Scans
- Prefix scans: range scans within the same key prefix
- Blooms are created for the prefix
- Reduces read amplification
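The effect of a prefix filter on a short range scan can be sketched like this. The function names are invented, and an exact set of prefixes stands in for a prefix Bloom filter (in practice such a filter would be built when the file is written, not at query time). Files whose filter rules out the prefix are skipped entirely, which is where the saving in read amplification comes from.

```python
import bisect


def prefix_scan(sorted_keys, prefix: bytes):
    """Return the keys that start with `prefix` from a sorted list."""
    start = bisect.bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break                       # sorted order: no later key can match
        result.append(key)
    return result


def scan_files(files, prefix: bytes, prefix_len: int):
    """Scan only files whose prefix filter admits the requested prefix."""
    hits, files_read = [], 0
    for keys in files:
        prefixes = {k[:prefix_len] for k in keys}   # stand-in for prefix Bloom
        if prefix[:prefix_len] not in prefixes:
            continue                                # file skipped, no read
        files_read += 1
        hits.extend(prefix_scan(keys, prefix))
    return sorted(hits), files_read


files = [[b"user1:a", b"user1:b", b"user2:a"], [b"user3:a"]]
hits, files_read = scan_files(files, b"user1:", prefix_len=6)
```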

RocksDB Usage Explosion
- Development started in May 2012
- Open sourced in Nov 2013
- The benefits of open source:
  - Adoption by LinkedIn (feed), Yahoo (Sherpa)
  - Ported to Windows by Microsoft (Bing)
  - Apache Samza, bitcoin, Red Hat Ceph
  - Ported to iOS and Android
  - MySQL and MongoDB storage engine

MongoDB: RocksDB storage engine
- Reduces a 5 TB MongoDB instance to 285 GB on MongoRocks (experimental result in 2014)

MySQL: RocksDB storage engine
(Chart: DB Size Comparison; y-axis: DB size (relative); bars: InnoDB, RocksDB.)
- LinkBench: open-source benchmark for Facebook's workload
- Reduces MySQL flash storage space by 50% for LinkBench

MySQL: RocksDB storage engine
(Chart: DB Size Comparison; y-axis: bytes written (relative); bars: InnoDB, RocksDB.)
- Reduces write amplification by 50% for LinkBench

SPARROW Theorem
- A new way to measure DB performance on fast storage
- Space Amplification (SPA)
- Read Amplification (RA)
- Write Amplification (WA)
- The SPARROW theorem states:
  - RA is inversely related to WA
  - WA is inversely related to SPA
- http://rocksdb.blogspot.com/

RocksDB Features for MySQL Support
- Optimistic transactions
- Pessimistic transactions

RocksDB-Cloud
- Optimized for cloud applications: AWS, Google, Azure
- Provides durability
- Locally attached SSD for performance
- AWS S3 for durability
- n times more cost-effective than EBS, n >= 2

ARCHITECTURE OF A CLOUD APPLICATION
(Diagram labels: tail data from distributed log storage; cloud application with RocksDB-Cloud; writes, reads, queries; memtable cache, block cache; flush file to local SSD; flush file to cloud storage; persistent read cache on SSD; cloud storage.)

ROCKSDB-CLOUD: ZERO COPY CLONES
(Diagram labels: server and cloned server, each running RocksDB-Cloud and tailing data from distributed log storage; Cloud Bucket A and Cloud Bucket B; read and write arrows between servers and buckets.)
- Queries are served by either server
- Instantaneous clone creation
- Both machines run at their own speeds
- True masterless configuration

PORTABILITY ACROSS CLOUD VENDORS: SEAMLESS COPY AMONG S3, AZURE, GOOGLE
(Diagram labels: RocksDB-Cloud apps reading from and writing to AWS S3, Azure and Google Cloud.)
- An app on Azure can access AWS S3 storage
- An app on Google Cloud can access Azure storage
- Same API on all cloud platforms
- https://github.com/rockset/rocksdb-cloud

LOW ADOPTION COST: COMPATIBILITY WITH ROCKSDB
- Pure open source
- API compatible with stock RocksDB
- Data format compatible with stock RocksDB
- License compatible with stock RocksDB
- https://github.com/rockset/rocksdb-cloud

Questions?