MongoDB Storage Engine with RocksDB LSM Tree. Denis Protivenskii, Software Engineer, Percona


Contents
- What is MongoRocks?
- RocksDB overview
- MongoDB contracts for storage engines
- The most problematic operation

What is MongoRocks?


RocksDB overview

RocksDB for the user
Key-value storage:
- Get(k) -> v
- Put(k, v)
- Delete(k)
- Merge...
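
The four user-facing operations can be illustrated with a toy in-memory model. This is only a sketch, not the RocksDB API: the class `ToyKV` and the helper `append_with_comma` are hypothetical names, and the merge is applied eagerly here, whereas RocksDB stores merge operands and combines them later through a user-supplied merge operator.

```python
from typing import Callable, Dict, Optional

MergeOperator = Callable[[Optional[bytes], bytes], bytes]


class ToyKV:
    """In-memory stand-in for a key-value store with Get/Put/Delete/Merge."""

    def __init__(self, merge_operator: MergeOperator) -> None:
        self._data: Dict[bytes, bytes] = {}
        self._merge_operator = merge_operator

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def delete(self, key: bytes) -> None:
        self._data.pop(key, None)

    def merge(self, key: bytes, operand: bytes) -> None:
        # Read-modify-write folded into a single call (applied eagerly in this toy).
        self._data[key] = self._merge_operator(self._data.get(key), operand)


def append_with_comma(existing: Optional[bytes], operand: bytes) -> bytes:
    """Hypothetical merge operator: append the operand to a comma-separated list."""
    return operand if existing is None else existing + b"," + operand


db = ToyKV(append_with_comma)
db.put(b"k", b"a")
db.merge(b"k", b"b")
```

After these calls `db.get(b"k")` holds the merged value, and `db.delete(b"k")` makes the key unreadable again.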

Level organization

Write-ahead log

Every next level is several times larger than the previous one

Keys are ordered within a level

Compaction starts when a level grows too large

The next level may not have room for the compacted data

Compaction may run recursively

Files in levels are immutable
- Compaction creates new files; old ones are deleted once they are no longer in use
- Files are written sequentially to disk, which speeds up I/O
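
The cascading behaviour described on these slides can be sketched with a toy model in which every level is one sorted list of keys. The capacities (`BASE_CAPACITY`, `MULTIPLIER`) and the function names are assumptions made for illustration; real RocksDB compacts selected files rather than whole levels and sizes levels in bytes.

```python
from typing import List

BASE_CAPACITY = 2  # assumed capacity of level 0, counted in keys
MULTIPLIER = 2     # assumed growth factor between neighbouring levels


def capacity(level: int) -> int:
    """Every next level is MULTIPLIER times larger than the previous one."""
    return BASE_CAPACITY * MULTIPLIER ** level


def compact(levels: List[List[int]], level: int) -> None:
    """Merge an oversized level into the next one, cascading if that overflows too."""
    if len(levels[level]) <= capacity(level):
        return
    if level + 1 == len(levels):
        levels.append([])
    # Build a new sorted run instead of modifying the old ones in place.
    merged = sorted(set(levels[level]) | set(levels[level + 1]))
    levels[level] = []
    levels[level + 1] = merged
    compact(levels, level + 1)  # the next level may not fit either


def insert(levels: List[List[int]], key: int) -> None:
    levels[0] = sorted(set(levels[0]) | {key})
    compact(levels, 0)


levels: List[List[int]] = [[]]
for k in range(1, 7):
    insert(levels, k)
```

With these small capacities, inserting the sixth key overflows level 0, which in turn overflows level 1, so the data ends up in level 2.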

MongoDB + RocksDB

Data organization in MongoDB
- Containers for data and indexes receive unique string identifiers (ident)
- Each element has a unique id within its container

Data organization in RocksDB

How to represent MongoDB's data structure in a flat store like RocksDB?

Data organization in MongoRocks
<ident + id> for every container's element
[Diagram: key space split by container: coll1, ind1_1, ind1_2, coll2, indn_m]
- ident is longer than 20 characters, an extra cost for every data element
- ident is that long because WiredTiger and mmapv1 use it as a file name

How to save on ident length properly?

Data organization in MongoRocks
- A hash of ident is a bad choice, as short hashes may collide
- Use an auto-increment counter (called the prefix) and a map from ident to prefix

Data organization in MongoRocks
<prefix + id> for every container's element
[Diagram: key space split by prefix: prefix_0, prefix_1, prefix_2, prefix_3, prefix_n]
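
A minimal sketch of the counter-plus-map idea follows. The 4-byte prefix and 8-byte id widths, the class `PrefixAllocator` and the function `make_key` are assumptions for illustration, not the MongoRocks encoding; a real engine would also have to persist the map.

```python
import struct
from typing import Dict


class PrefixAllocator:
    """Hands out short numeric prefixes for long ident strings."""

    def __init__(self) -> None:
        self._next = 0
        self._by_ident: Dict[str, int] = {}

    def prefix_for(self, ident: str) -> int:
        # A counter can never collide, unlike a short hash of the ident.
        if ident not in self._by_ident:
            self._by_ident[ident] = self._next
            self._next += 1
        return self._by_ident[ident]


def make_key(prefix: int, record_id: int) -> bytes:
    # Big-endian packing keeps byte-wise order equal to numeric order.
    return struct.pack(">IQ", prefix, record_id)


alloc = PrefixAllocator()
coll_prefix = alloc.prefix_for("collection-7--1234567890123456789")
index_prefix = alloc.prefix_for("index-8--1234567890123456789")
```

Every key now carries 4 bytes of prefix instead of an ident of more than 20 characters, and all keys of one container stay adjacent in the sorted key space.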

Index format in MongoRocks
K = <prefix + value + order + id (loc)>
(slide annotation: comes from MongoDB)
V = <typeof value>

How to find the id if it is part of the key?

Index format in MongoRocks
- The storage must support lower_bound / upper_bound search
- This allows positioning on the closest key and decoding it
- RocksDB has iterators for this purpose
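
The lookup can be sketched with a sorted list and `bisect` standing in for a RocksDB iterator. The key layout below is a simplification and an assumption: fixed-width unsigned integers for prefix, value and id, ascending order only, and no separate order component. The names `index_key` and `find_record_id` are hypothetical.

```python
import bisect
import struct
from typing import List, Optional


def index_key(prefix: int, value: int, record_id: int) -> bytes:
    """Simplified index key: <prefix + value + id>, all big-endian."""
    return struct.pack(">IQQ", prefix, value, record_id)


def find_record_id(sorted_keys: List[bytes], prefix: int, value: int) -> Optional[int]:
    """Return the smallest record id indexed under (prefix, value), or None."""
    probe = struct.pack(">IQ", prefix, value)     # the key without its id part
    pos = bisect.bisect_left(sorted_keys, probe)  # lower_bound
    if pos == len(sorted_keys):
        return None
    found = sorted_keys[pos]
    if not found.startswith(probe):
        return None                               # closest key belongs to another value
    return struct.unpack(">Q", found[len(probe):])[0]  # decode the id from the key tail


keys = sorted([index_key(1, 10, 7), index_key(1, 20, 3), index_key(2, 10, 9)])
```

Because the probe is a strict byte prefix of every matching key, the lower bound lands on the first entry for that value, and the id is decoded from the remaining bytes.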

The most problematic operation

Deleting data in MongoRocks
- Deleting an element (document, index entry) just puts a delete operation D into the LSM-tree
- As a result, the tree fills up with garbage (stale data and delete operations), which slows down iteration

The solution!

Deleting data in MongoRocks
- Ask for the iterator's statistics after iteration
- If too much data was skipped, run compaction for this range
- The range is always a prefix
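
The heuristic can be sketched as follows. This is a toy under stated assumptions: entries are a flat sorted list in which a deleted key appears only as a tombstone, the threshold `SKIP_THRESHOLD` is invented, and the function names are hypothetical rather than RocksDB or MongoRocks calls.

```python
from typing import List, Optional, Tuple

Entry = Tuple[bytes, Optional[bytes]]

TOMBSTONE = None      # stands in for the delete operation D
SKIP_THRESHOLD = 0.5  # assumed: compact when more than half of the entries were skipped


def scan_prefix(entries: List[Entry], prefix: bytes) -> Tuple[List[Entry], int]:
    """Iterate over one prefix; return the live items and the number of skipped entries."""
    live: List[Entry] = []
    skipped = 0
    for key, value in entries:
        if not key.startswith(prefix):
            continue
        if value is TOMBSTONE:
            skipped += 1  # the iterator had to step over garbage
        else:
            live.append((key, value))
    return live, skipped


def needs_compaction(live_count: int, skipped: int) -> bool:
    total = live_count + skipped
    return total > 0 and skipped / total > SKIP_THRESHOLD


def compact_prefix(entries: List[Entry], prefix: bytes) -> List[Entry]:
    """Rewrite the data without the tombstones that fall under the prefix."""
    return [(k, v) for k, v in entries
            if not (k.startswith(prefix) and v is TOMBSTONE)]


entries: List[Entry] = [(b"a1", b"x"), (b"a2", None), (b"a3", None),
                        (b"a4", None), (b"b1", b"y")]
live, skipped = scan_prefix(entries, b"a")
if needs_compaction(len(live), skipped):
    entries = compact_prefix(entries, b"a")
```

The decision is taken only after a scan has already paid the cost, so no extra pass over the data is needed to detect a polluted range.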

This was the easier part of the problem though...

Deleting collections in MongoRocks
- Need to iterate over all data and indexes of the collection and delete every item
- A lot of garbage is created
- This makes no sense compared to engines that simply drop files on disk

Compaction filters

Deleting collections in MongoRocks
- Create a filter with the prefixes of dropped containers
- Start compaction for the prefix
- Compaction calls the filter for every item, and the filter decides whether the item is deleted

Deleting collections in MongoRocks
So that compaction can be rerun after a crash, a marker for the dropped prefix is persisted and kept until the compaction finishes
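
The filter mechanism can be sketched in a few lines. Everything here is a toy model and an assumption: `DroppedPrefixFilter`, its method names and the fixed 4-byte prefix are hypothetical, the "persisted" marker is just an in-memory set, and `compact` stands in for a real RocksDB compaction that would invoke a compaction filter per key.

```python
import struct
from typing import Dict, Set

PREFIX_LEN = 4  # assumed fixed-width big-endian prefix at the start of every key


class DroppedPrefixFilter:
    """Tells the compaction which keys belong to dropped containers."""

    def __init__(self) -> None:
        # A real engine would persist these markers so that they survive a crash.
        self.dropped: Set[bytes] = set()

    def mark_dropped(self, prefix: int) -> None:
        self.dropped.add(struct.pack(">I", prefix))

    def should_remove(self, key: bytes) -> bool:
        return key[:PREFIX_LEN] in self.dropped

    def compaction_finished(self, prefix: int) -> None:
        # The marker is kept until the compaction for this prefix is done.
        self.dropped.discard(struct.pack(">I", prefix))


def compact(data: Dict[bytes, bytes], flt: DroppedPrefixFilter) -> Dict[bytes, bytes]:
    """Toy compaction: the filter is consulted for every item."""
    return {k: v for k, v in data.items() if not flt.should_remove(k)}


data = {struct.pack(">IQ", 1, 1): b"a", struct.pack(">IQ", 2, 1): b"b"}
flt = DroppedPrefixFilter()
flt.mark_dropped(1)
data = compact(data, flt)
flt.compaction_finished(1)
```

No tombstones are written for the dropped container: its items simply fail to survive the next compaction of that prefix.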

It can be even better

Deleting collections in MongoRocks
[Diagram label: "Fully contains range to drop"]
- DeleteFilesInRange deletes files whose keys lie entirely within the requested range
- Requires care: files are deleted immediately, even if some keys are still in use (by snapshots)
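
The containment check behind this idea can be sketched as below. This is not the RocksDB implementation: `SstFile` and `split_by_range` are hypothetical names, and treating the range as inclusive on both ends is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SstFile:
    """Toy description of an immutable file by its smallest and largest key."""
    name: str
    smallest: bytes
    largest: bytes


def split_by_range(files: List[SstFile], begin: bytes,
                   end: bytes) -> Tuple[List[SstFile], List[SstFile]]:
    """Partition files into (droppable, kept) for the inclusive range [begin, end]."""
    droppable: List[SstFile] = []
    kept: List[SstFile] = []
    for f in files:
        if begin <= f.smallest and f.largest <= end:
            droppable.append(f)  # every key of the file lies inside the range
        else:
            kept.append(f)       # partial overlap: left for a regular compaction
    return droppable, kept


files = [SstFile("f1", b"a", b"c"), SstFile("f2", b"d", b"f"), SstFile("f3", b"f", b"k")]
droppable, kept = split_by_range(files, b"d", b"g")
```

Files that only partly overlap the range are kept, so the keys they hold inside the range still have to be removed by compaction.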

What's missing

Deleting collections in MongoRocks
- MongoDB doesn't notify the storage engine about the logical drop of a collection or a database
- WiredTiger and mmapv1 don't need this, as they delete files on disk
- This forces the engine to compact every prefix separately

oplog

Capped collections in MongoRocks
- MongoDB has a specific collection type built as a circular buffer
- It was developed solely for the oplog (the replication log)
- The oplog is pretty large (5% of disk size, no more than 50 GB by default)
- Because of lots of overwrites, the oplog is polluted with garbage, which affects the performance of the whole storage

Capped collections in MongoRocks
- Separate code monitors the oplog size and the number of tombstones in it
- Oplog compaction gets higher priority (in the queue of compaction operations)
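
One way to picture the prioritized queue is a heap keyed by priority. This is a sketch only: the class `CompactionQueue`, the two priority constants and the FIFO tie-breaking are assumptions, not the MongoRocks scheduler.

```python
import heapq
import itertools
from typing import List, Tuple

OPLOG_PRIORITY = 0    # assumed: a smaller number is served first
DEFAULT_PRIORITY = 1


class CompactionQueue:
    """Queue of pending compactions in which oplog jobs jump ahead."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._order = itertools.count()  # tie-breaker: FIFO within one priority

    def schedule(self, prefix_name: str, is_oplog: bool = False) -> None:
        priority = OPLOG_PRIORITY if is_oplog else DEFAULT_PRIORITY
        heapq.heappush(self._heap, (priority, next(self._order), prefix_name))

    def next_job(self) -> str:
        return heapq.heappop(self._heap)[2]


queue = CompactionQueue()
queue.schedule("coll1")
queue.schedule("ind1_1")
queue.schedule("oplog", is_oplog=True)
first = queue.next_job()
```

Although the oplog job was scheduled last, it is the first one handed out; the remaining jobs keep their arrival order.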

Radical solution

Column families in MongoRocks
- A classic storage engine has one B-tree per container (data or index)
- MongoRocks has one LSM-tree for all containers

More LSM-trees!

Column families in MongoRocks
- RocksDB supports a set of LSM-trees (column families) with a shared WAL to provide transactional logic
- First developed for MySQL (MyRocks project)
- MongoRocks should have a separate LSM-tree for the oplog, maybe even a separate LSM-tree for every prefix
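
The arrangement of several trees behind one log can be pictured with a toy model. It is an illustration under assumptions, not the RocksDB column family API: `ToyColumnFamilies` is a hypothetical class, each tree is a plain dictionary, and the shared WAL is a list of batches.

```python
from typing import Dict, Iterable, List, Tuple

Write = Tuple[str, bytes, bytes]  # (family name, key, value)


class ToyColumnFamilies:
    """Several independent trees that share a single write-ahead log."""

    def __init__(self, names: Iterable[str]) -> None:
        self.wal: List[List[Write]] = []  # one shared log
        self.trees: Dict[str, Dict[bytes, bytes]] = {n: {} for n in names}

    def write_batch(self, batch: List[Write]) -> None:
        # Validate first so that a bad batch is rejected as a whole.
        for family, _, _ in batch:
            if family not in self.trees:
                raise KeyError(family)
        self.wal.append(list(batch))      # log the whole batch once ...
        for family, key, value in batch:  # ... then apply it to each tree
            self.trees[family][key] = value


cf = ToyColumnFamilies(["default", "oplog"])
cf.write_batch([("default", b"k", b"v"), ("oplog", b"ts1", b"op")])
```

A document write and its oplog entry go through one log record, while the oplog entries live in their own tree and do not mix with the other containers' keys.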

Conclusion

- MongoDB's storage engine contracts still include details that do not apply to MongoRocks
- It is good to have keys ordered within the storage
- The problem of deleting keys can be addressed with several different optimizations
- The idea of multiple LSM-trees is a step forward


Questions?

Thank you!