MongoDB Storage Engine with RocksDB LSM Tree
Denis Protivenskii, Software Engineer, Percona
Contents
- What is MongoRocks?
- RocksDB overview
- MongoDB contracts for storage engines
- The most problematic operation
What is MongoRocks?
RocksDB overview
RocksDB for the user
Key-value storage:
- Get(k) → v
- Put(k, v)
- Delete(k)
- Merge...
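The four-call surface above can be sketched as a toy key-value store; `ToyKV` and its dict backing are illustrative stand-ins, not the real RocksDB bindings, and the list-append merge is just one possible merge operator:

```python
# Toy model of the RocksDB user-facing API: Get/Put/Delete plus Merge.
# A plain dict stands in for the LSM tree; this is a sketch only.

class ToyKV:
    def __init__(self):
        self._data = {}

    def put(self, k, v):
        self._data[k] = v

    def get(self, k):
        return self._data.get(k)      # None if the key is absent

    def delete(self, k):
        self._data.pop(k, None)

    def merge(self, k, v):
        # Merge applies an incremental update without a read-modify-write
        # round trip; here the operator appends to a list.
        self._data.setdefault(k, []).append(v)
```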
Level organization
- Write-ahead log
- Each next level is several times larger than the previous one
- Keys are ordered within a level
- Compaction starts when a level grows too large
- The next level may not fit either
- So compaction may run recursively
Files in levels are immutable
- Compaction creates new files, and old ones get deleted when no longer in use
- Files are written sequentially to disk, which speeds up I/O
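The level mechanics above can be modeled in a few lines; the growth factor and level-0 budget below are made-up toy numbers, and dicts stand in for the immutable sorted files:

```python
# Toy LSM levels: a write lands in level 0; when a level exceeds its
# budget, it is merged ("compacted") into the next level, which may in
# turn overflow, so compaction recurses. GROWTH and L0_BUDGET are
# illustrative assumptions, not RocksDB defaults.

GROWTH = 4       # each next level is several times larger
L0_BUDGET = 2    # max entries in level 0 (toy number)

def compact(levels, i):
    """Merge level i into level i+1, recursing if i+1 overflows too."""
    if i + 1 == len(levels):
        levels.append({})
    levels[i + 1].update(levels[i])   # newer values override older ones
    levels[i] = {}
    if len(levels[i + 1]) > L0_BUDGET * GROWTH ** (i + 1):
        compact(levels, i + 1)

def put(levels, k, v):
    levels[0][k] = v
    if len(levels[0]) > L0_BUDGET:
        compact(levels, 0)

def get(levels, k):
    for level in levels:              # search newest level first
        if k in level:
            return level[k]
    return None
```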
MongoDB + RocksDB
Data organization in MongoDB
- Containers for data and indexes receive unique string identifiers (idents)
- Elements themselves must have a unique id inside a container
Data organization in RocksDB
How to represent MongoDB's data structure in a plain key-value storage like RocksDB?
Data organization in MongoRocks
<ident + id> for every container's element (coll1, ind1_1, ind1_2, coll2, ..., indn_m)
Data organization in MongoRocks
- ident is over 20 symbols long: an extra cost for every data element
- Such ident length comes from its use as a filename by WiredTiger and mmapv1
How to save on ident length properly? 29
Data organization in MongoRocks
- A hash of the ident is a bad choice, as short hashes may collide
- Instead: an auto-increment counter (a named prefix) and a map of ident → prefix
Data organization in MongoRocks
<prefix + id> for every container's element (prefix_0, prefix_1, prefix_2, prefix_3, ..., prefix_n)
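A minimal sketch of the ident → prefix scheme, assuming a 4-byte fixed-width prefix (the actual width used by MongoRocks may differ):

```python
# Toy ident -> prefix map: an auto-increment counter hands out short
# fixed-width prefixes, so a long container ident is stored once in the
# map instead of being repeated in every key.
import struct

class PrefixMap:
    def __init__(self):
        self._next = 0
        self._map = {}                      # ident -> prefix bytes

    def prefix(self, ident):
        if ident not in self._map:
            self._map[ident] = struct.pack(">I", self._next)
            self._next += 1
        return self._map[ident]

def make_key(prefixes, ident, element_id):
    # <prefix + id>: all elements of one container share its prefix,
    # so they are adjacent in the key order.
    return prefixes.prefix(ident) + element_id
```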
Index format in MongoRocks
K = <prefix + value + order + id (loc)>, where value comes from MongoDB
V = <typeof value>
How to search for an id if it constitutes part of the key?
Index format in MongoRocks
- The storage should support lower_bound / upper_bound search operations
- They allow positioning on the closest value and decoding it
- RocksDB has iterators for this purpose
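The lower_bound seek over ordered keys can be sketched with the standard library's bisect; `scan_prefix` is a hypothetical helper showing how one seek plus forward iteration recovers every index entry for a value, even though the document id is baked into the key:

```python
# Sketch of iterator positioning over sorted keys: a lower_bound seek
# lands on the first key >= target (like an iterator Seek), and forward
# iteration then yields all keys sharing the searched prefix.
import bisect

def lower_bound(sorted_keys, target):
    """Index of the first key >= target."""
    return bisect.bisect_left(sorted_keys, target)

def scan_prefix(sorted_keys, prefix):
    """Yield all keys starting with prefix: one seek + iteration."""
    i = lower_bound(sorted_keys, prefix)
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        yield sorted_keys[i]
        i += 1
```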
The most problematic operation
Deleting data in MongoRocks
- Deleting an element (a document or an index entry) is just putting a delete operation (a tombstone) into the LSM tree
- As a result, the tree fills with garbage of old data and delete ops, which slows down iteration
The solution!
Deleting data in MongoRocks
- Ask for the iterator's statistics after iteration
- If too much data was skipped, run compaction for this range
- The range is always a prefix
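A sketch of that skip-statistics heuristic; the 50% threshold, the sentinel tombstone, and the function names are assumptions for illustration, not the MongoRocks internals:

```python
# Toy tombstone heuristic: the scan counts how many delete markers it
# had to skip; if the skipped ratio is too high, a compaction request
# for the whole prefix range is queued.

SKIP_RATIO_THRESHOLD = 0.5   # assumed threshold
TOMBSTONE = object()         # sentinel standing in for a delete marker

def scan(store, prefix):
    """Iterate live entries under prefix; return (results, skipped)."""
    results, skipped = [], 0
    for k in sorted(store):
        if not k.startswith(prefix):
            continue
        if store[k] is TOMBSTONE:
            skipped += 1          # garbage the iterator stepped over
        else:
            results.append((k, store[k]))
    return results, skipped

def maybe_schedule_compaction(results, skipped, queue, prefix):
    total = len(results) + skipped
    if total and skipped / total > SKIP_RATIO_THRESHOLD:
        queue.append(prefix)      # compact the whole prefix range
```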
This was the easier part of the problem, though...
Deleting collections in MongoRocks
- Need to iterate over all data and indexes of the collection and delete every item
- A lot of garbage gets created
- Doesn't compare well to engines that just drop files on disk
Compaction filters
Deleting collections in MongoRocks
- Create a filter with the prefixes of dropped containers
- Start compaction for the prefix
- Compaction calls the filter for every item and decides whether it should be deleted
Deleting collections in MongoRocks
To rerun the compaction after a crash, a marker about the dropped prefix is persisted, and it's kept until the compaction is finished
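The compaction-filter idea can be sketched as a predicate applied while files are rewritten; the function names here are hypothetical:

```python
# Toy compaction filter for dropped containers: compaction asks the
# filter about every key it rewrites, and keys whose prefix belongs to
# a dropped collection are simply not copied into the new files.

def make_drop_filter(dropped_prefixes):
    """Build a keep-predicate from the set of dropped prefixes."""
    def keep(key):
        return not any(key.startswith(p) for p in dropped_prefixes)
    return keep

def compact_with_filter(entries, keep):
    """Rewrite entries into new 'files', dropping filtered-out keys."""
    return {k: v for k, v in entries.items() if keep(k)}
```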
It can be even better
Deleting collections in MongoRocks
- DeleteFilesInRange allows deleting files whose keys fall fully within the requested range
- Requires care, as it deletes files immediately, even if some keys are still in use (by snapshots)
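The "fully contained" condition is the crux of DeleteFilesInRange; a sketch, with files modeled as (smallest_key, largest_key) pairs and boundary files left for normal filtered compaction:

```python
# Toy DeleteFilesInRange: a file can be dropped wholesale only if its
# entire key range [smallest, largest] lies inside the requested range;
# files straddling the boundary survive and must be compacted instead.

def delete_files_in_range(files, lo, hi):
    """files: list of (smallest_key, largest_key). Return survivors."""
    return [(s, l) for (s, l) in files
            if not (lo <= s and l <= hi)]   # keep unless fully contained
```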
What's missing
Deleting collections in MongoRocks
- MongoDB doesn't send notifications about the logical drop of a collection or a database
- WiredTiger and mmapv1 don't need them, as they simply delete files on disk
- This forces MongoRocks to compact every prefix on its own
oplog
Capped collections in MongoRocks
MongoDB has a specific collection type built as a circular buffer, developed primarily for the oplog: the replication log
Capped collections in MongoRocks
- The oplog is pretty large (5% of disk size, but not more than 50 GB by default)
- Because of lots of overwrites, the oplog gets polluted with garbage, which affects the performance of the whole storage
Capped collections in MongoRocks
- Separate code monitors the oplog size and the number of tombstones in it
- Oplog compaction gets higher priority in the queue of compaction operations
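The prioritized compaction queue can be sketched with a heap; the numeric priority values are assumptions:

```python
# Toy prioritized compaction queue: oplog compaction requests jump
# ahead of ordinary ones. Lower priority value = served first.
import heapq

OPLOG_PRIORITY, NORMAL_PRIORITY = 0, 1

def schedule(queue, prefix, is_oplog=False):
    prio = OPLOG_PRIORITY if is_oplog else NORMAL_PRIORITY
    heapq.heappush(queue, (prio, prefix))

def next_compaction(queue):
    """Pop the highest-priority pending compaction request."""
    return heapq.heappop(queue)[1]
```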
Radical solution
Column families in MongoRocks
- A classic storage engine has one B-tree per container (data or index)
- MongoRocks has one LSM tree for all containers
More LSM trees!
Column families in MongoRocks
- RocksDB supports a set of LSM trees (column families) with a shared WAL to provide transactional logic
- First developed for MySQL (the MyRocks project)
Column families in MongoRocks
- MongoRocks should have a separate LSM tree for the oplog, maybe even a separate LSM tree for every prefix
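Column families with a shared WAL can be sketched as independent trees behind one log; this toy replays the WAL on recovery to show why the shared log keeps writes across families consistent (family names and the class shape are illustrative):

```python
# Toy column families: several independent LSM trees (plain dicts here)
# share a single write-ahead log. Every write is logged first, then
# applied; recovery replays the shared WAL to rebuild all families.

class ColumnFamilyDB:
    def __init__(self, families):
        self.wal = []                             # shared write-ahead log
        self.cf = {name: {} for name in families}

    def put(self, family, k, v):
        self.wal.append((family, k, v))           # log first...
        self.cf[family][k] = v                    # ...then apply

    def recover(self):
        """Rebuild every family by replaying the shared WAL."""
        self.cf = {name: {} for name in self.cf}
        for family, k, v in self.wal:
            self.cf[family][k] = v
```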
Conclusion
- MongoDB's storage-engine contracts still contain details specific to other engines that don't apply to MongoRocks
- It's good to keep keys in a storage ordered somehow
- The problem of deleting keys may be solved using different optimizations
- The idea of multiple LSM trees is a step forward
Thank You, Sponsors!
SAVE THE DATE! April 23-25, 2018, Santa Clara Convention Center. CALL FOR PAPERS OPENING SOON! www.perconalive.com
Questions?
Thank you!