TokuDB vs RocksDB. What to choose between two write-optimized DB engines supported by Percona. George O. Lorch III, Vlad Lesin

What to compare? Amplification: write amplification, read amplification, space amplification. Feature set.

How to compare? Theoretical review: fractal trees vs LSM. Implementation: what is persistently stored and how? What is cached and how? Read/write/space tradeoff options. Benchmarks.

Fractal tree: B-tree. (Diagram: a B-tree with block size B and fanout B; the height of the tree is log_B(N/B).)

Fractal tree: B-tree with node buffers. (Diagram: each internal node has k children and a buffer of size B; incoming writes are stored in node buffers as messages; the height of the tree is log_k(N/B).)

Fractal tree: messages pushdown. (Sequence of diagrams: messages accumulate in the root buffer; when a buffer fills up, its messages are flushed down into the buffers of the relevant children, and eventually reach the leaves.)

Fractal tree: amplification for both point and range queries. Write amplification: worst case O(k * log_k(N/B)); skewed writes O(log_k(N/B)). Read amplification: cold cache O(log_k(N/B)); warm cache 1. Here k is the fanout, B the block size, N the data size.
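
A sketch of where the worst-case write bound comes from (my addition, using the slide's definitions of k, B and N): a message passes through O(log_k(N/B)) buffers on its way down, and flushing a buffer rewrites a node of size B while moving at least B/k messages to one child, so the amortized write cost per message per level is O(k):

    \[
    \mathrm{WA}_{\text{FT}}
      = O\!\Big(\underbrace{\tfrac{B}{B/k}}_{\text{per level}} \cdot \underbrace{\log_k \tfrac{N}{B}}_{\text{levels}}\Big)
      = O\!\big(k \log_k \tfrac{N}{B}\big).
    \]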

LSM: Log Structured Merge tree. A collection of sorted runs of data; each run contains kv-pairs sorted by key. A run can be managed as a single file or as a collection of smaller files. Lookups use binary search within a run.

LSM: example. Consists of a collection of runs; many versions of a row can exist in different runs; binary search is performed in each run; compaction maintains good read performance.

LSM: leveled. Each level is limited by size. For example, if 10^(i-1) MB < sizeof(L_i) <= 10^i MB, then L1 is between 1 MB and 10 MB, L2 is between 10 MB and 100 MB, L3 is between 100 MB and 1 GB, and so on. When the size of a level reaches its limit, the level is merged into the next level.
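
A back-of-the-envelope count of the number of levels (my addition, not from the slides): with growth factor k and a first level of target size B_1, data of total size N fits once the largest level reaches N, i.e.

    \[
    n \;\gtrsim\; 1 + \log_k\frac{N}{B_1}.
    \]

For k = 10, B_1 = 512 MB and N ~ 400 GB this gives 1 + log_10(800) ~ 3.9, i.e. four leveled runs, or five levels counting L0 - consistent with the "5 levels for 400 GB table size" quoted in the benchmark setup later.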

LSM: compaction example. (Diagram: runs from L1 and L2 are merged into L3 using a merge sort of the sorted runs; newer versions of updated keys replace older ones during the merge.)

LSM: write amplification. There are Θ(log_k(N/B)) levels, where k is the growth factor, N the data size and B the block size. On average, each block is re-merged back into the same level about k/2 times. Example for L1 receiving B bytes at a time from L0: L0: B, L1: empty -> L0: empty, L1: B -> L0: B, L1: B -> L0: empty, L1: 2B -> ... -> L0: empty, L1: kB. The first B bytes are re-merged back into L1 k-1 times, the second B bytes k-2 times, and so on; the last B bytes are re-merged 0 times. The average number of re-merges is the sum of these counts divided by k; the sum is an arithmetic progression equal to roughly k^2/2, so the average is about k/2. The total write amplification is therefore Θ(k * log_k(N/B)).
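
The slide's arithmetic, written out as formulas:

    \[
    \sum_{i=0}^{k-1} i = \frac{k(k-1)}{2} \approx \frac{k^2}{2},
    \qquad
    \text{average re-merges} \approx \frac{k}{2},
    \qquad
    \mathrm{WA}_{\text{LSM}} = \Theta\!\Big(\frac{k}{2}\,\log_k\frac{N}{B}\Big) = \Theta\!\big(k \log_k \tfrac{N}{B}\big).
    \]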

LSM: read amplification. For range queries with a cold cache, summing the binary-search read amplification over all runs gives O((log(N/B))^2 / log(k)). For range queries with a warm cache, the first several runs are cached, so a range read costs several IOs. For point queries with bloom filters, 1 IO in most cases.
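
How the cold-cache range bound adds up (my reading of the slide): there are about log_k(N/B) runs, and a binary search within one run costs at most O(log(N/B)) block reads, so

    \[
    \underbrace{\frac{\log(N/B)}{\log k}}_{\text{runs}} \cdot O\!\big(\log \tfrac{N}{B}\big)
    \;=\;
    O\!\Big(\frac{(\log (N/B))^2}{\log k}\Big).
    \]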

FT vs LSM: summary. Asymptotically, FT and LSM are the same for write amplification. FT is better than LSM for range reads on a cold cache, but the same on a warm cache. FT is the same as LSM for point reads: one or a few IOs per read.

TokuDB vs MyRocks: implementation. How is data stored on disk? What is cached, and how?

TokuDB vs MyRocks: storage files. TokuDB: log files + block files; a single fractal tree per block file; one file per index. MyRocks: WAL files + SST files; all tables and indexes are in the same keyspace(s); column families.

TokuDB: block file. Each block is a tree node; blocks can have different sizes and can be compressed. The file holds two sets of blocks, used and free, and the sets can intersect. A block translation table (BTT) maps block numbers to offsets and sizes; two headers (each with its LSN and BTT info) live at offsets 0 and 4K. Reallocating variable-size blocks leads to fragmentation. (Diagram: header 0 / header 1, the BTTs they reference, and individual blocks such as #42 and #7 with their offsets and sizes.)
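
One way to peek at this mapping from SQL, assuming your Percona Server build ships the TokuDB information_schema plugins (table names as I recall them - treat them as an assumption and check SHOW TABLES in information_schema first):

    -- Which on-disk files back which dictionaries (indexes).
    SELECT * FROM information_schema.TokuDB_file_map LIMIT 10;

    -- Block-translation-table view: block number -> offset and size inside the block file.
    SELECT * FROM information_schema.TokuDB_fractal_tree_block_map LIMIT 10;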

MyRocks: block-based SST file format. kv-pairs are ordered and partitioned into data blocks. The meta-index contains references to the meta blocks. The index block contains one entry per data block, holding that block's last key. There is a bloom filter per block or per file, and a compression dictionary for the whole SST file. (File layout: data block 1 ... data block n, meta blocks such as filter, stats and compression dictionary, meta-index block, index block, footer.)
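
MyRocks offers similar introspection through information_schema; the sketch below assumes the ROCKSDB_DDL and ROCKSDB_INDEX_FILE_MAP tables present in recent MyRocks builds (verify the names against your version):

    -- Which column family and index number each MySQL table/index maps to.
    SELECT * FROM information_schema.ROCKSDB_DDL LIMIT 10;

    -- Which SST files currently hold data for each index.
    SELECT * FROM information_schema.ROCKSDB_INDEX_FILE_MAP LIMIT 10;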

TokuDB vs MyRocks: persistent storage. TokuDB: two sets of blocks that can intersect, but on heavy random updates the intersection is small and free blocks are constantly reallocated; one file per index (DROP INDEX is just a file removal); big blocks are compressed. RocksDB: no gaps as in TokuDB, but on heavy random updates there can be several values for the same key in different levels (merged away on compaction); all indexes are in the same namespace(s); data blocks are compressed and the compression dictionary is shared.

TokuDB: caching. Blocks are cached in the cache table; leaf nodes can be read partially. All modifications happen in memory; dirty pages are flushed during checkpoint or eviction. A checkpoint periodically marks all nodes as pending; if a client modifies a pending node, the node is cloned, so checkpoints can cause extra IO and memory usage. Eviction uses a clock algorithm. (Diagram: in-memory cache table with clean and dirty blocks, backed by the cache file and block files on disk.)
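
To see how big the cache table is and how it behaves, the usual knob is the tokudb_cache_size variable together with the Tokudb_cachetable_* status counters (names as I remember them from Percona Server - an assumption to verify on your build):

    -- Cache table budget (bytes).
    SHOW GLOBAL VARIABLES LIKE 'tokudb_cache_size';

    -- Current size, misses, evictions, etc. of the cache table.
    SHOW GLOBAL STATUS LIKE 'Tokudb_cachetable%';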

TokuDB: checkpoint performance (chart).

TokuDB: checkpoints and the clone-storm effect (charts).

MyRocks: caching. Block cache: sharded, LRU or clock. Indexes and block-file bloom filters are cached separately by default; they do not count against the memory allocated to the block cache and can be big memory contributors. MemTables can be considered write buffers: the more space given to MemTables, the lower the write amplification; several kinds of memtable are supported. (Diagram: memory holds the MemTables, bloom filters, indexes and the block cache; disk holds the WAL and SST files.)
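
The relevant MyRocks knobs, to the best of my knowledge (names are an assumption; check them against your MyRocks version): rocksdb_block_cache_size for the block cache, rocksdb_block_based_table_options (e.g. cache_index_and_filter_blocks=1) to force indexes and filters into the block cache, and the write_buffer_size / max_write_buffer_number column-family options for MemTables.

    -- Inspect the current cache- and memtable-related settings.
    SHOW GLOBAL VARIABLES LIKE 'rocksdb_block_cache_size';
    SHOW GLOBAL VARIABLES LIKE 'rocksdb_block_based_table_options';
    SHOW GLOBAL VARIABLES LIKE 'rocksdb_default_cf_options';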

TokuDB vs MyRocks: benchmarking. Different data structures for persistent storage. Both engines have settings for the read/write/space tradeoff. Influence of the MySQL infrastructure. Different feature sets.

TokuDB vs MyRocks: benchmarking. Hard to compare, because each engine has its own settings for tuning the read/write tradeoff. TokuDB: block size, fanout, etc. MyRocks: block size, base level size, etc.

MySQL transaction coordinator issues. Having more than one transactional storage engine installed requires a transaction coordinator (TC) for every transaction, and the binlog counts as a transactional storage engine when enabled. MySQL provides two functioning coordinators, TC_BINLOG and TC_MMAP. When binary logging is enabled, TC_BINLOG is always used; otherwise TC_MMAP is used whenever more than one transactional engine is installed. TC_MMAP is not optimized and does not have as good a group-commit implementation as TC_BINLOG, so when TC_MMAP is used there is a dramatic increase in fsyncs. Options being considered to improve this in Percona Server: add a session variable that tells whether a TC is needed for the next transaction and return an SQL error if the transaction tries to engage more than one transactional storage engine; or remove TC_MMAP and enhance TC_BINLOG to allow use without actual binary logs, i.e. keep the TC logic but disable the binlog logic.
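
A minimal sketch of the situation described above (hypothetical table names; assumes both engines are installed, binary logging is disabled, and your MyRocks version allows mixing engines in one transaction at all): a single transaction that touches a TokuDB table and a MyRocks table forces two-phase commit through a transaction coordinator.

    CREATE TABLE t_toku  (id INT PRIMARY KEY, v INT) ENGINE = TokuDB;
    CREATE TABLE t_rocks (id INT PRIMARY KEY, v INT) ENGINE = ROCKSDB;

    BEGIN;
    INSERT INTO t_toku  VALUES (1, 1);  -- first transactional engine in this transaction
    INSERT INTO t_rocks VALUES (1, 1);  -- second engine: a transaction coordinator is now required
    COMMIT;                             -- with binlog off, commit goes through TC_MMAP (extra fsyncs)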

TokuDB vs MyRocks: benchmarking. Table:
    CREATE TABLE sbtest1 (
      id INTEGER NOT NULL,
      k INTEGER DEFAULT '0' NOT NULL,
      c CHAR(120) DEFAULT '' NOT NULL,
      pad CHAR(60) DEFAULT '' NOT NULL,
      PRIMARY KEY (id)
    ) ENGINE = ...

TokuDB vs MyRocks: benchmarking. Point reads: SELECT id, k, c FROM sbtest1 WHERE id IN (?) with 10 random points. Range reads: SELECT COUNT(id) FROM sbtest1 WHERE id BETWEEN ? AND ? OR ..., with 10 ranges joined by OR and an interval of 300. Point updates: UPDATE sbtest1 SET c = ? WHERE id = ?. Range updates: UPDATE sbtest1 SET c = ? WHERE id BETWEEN ? AND ?, with an interval of 100. Recovery logs are not synchronized on each commit.
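
Filled-in examples of what those parameterized statements look like (illustrative values only, not from the slides; the benchmark used 10 ranges, only two are shown here):

    -- Point reads: 10 random ids.
    SELECT id, k, c FROM sbtest1
     WHERE id IN (15, 97, 1024, 5000, 77, 31337, 42, 9, 256, 888);

    -- Range reads: several BETWEEN ranges of width 300, joined with OR.
    SELECT COUNT(id) FROM sbtest1
     WHERE id BETWEEN 1000 AND 1300
        OR id BETWEEN 50000 AND 50300;

    -- Point update.
    UPDATE sbtest1 SET c = 'new-value' WHERE id = 1024;

    -- Range update: interval of 100 ids.
    UPDATE sbtest1 SET c = 'new-value' WHERE id BETWEEN 2000 AND 2100;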

TokuDB vs MyRocks: benchmarking, first case. NVMe, ~400 GB table size, 10 GB DB cache, 10 GB FS cache; 48 HW threads, 40 sysbench threads. TokuDB: block size 4 MB, fanout 16. MyRocks: block size 8 KB, first level size 512 MB (5 levels for the 400 GB table size).
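
For reference, roughly how these settings map to server variables; the exact names below (tokudb_block_size, tokudb_fanout, rocksdb_block_size, and max_bytes_for_level_base inside rocksdb_default_cf_options) are my assumption of the knobs the authors tuned, so double-check them for your version:

    -- TokuDB side of the first benchmark case.
    SHOW GLOBAL VARIABLES LIKE 'tokudb_block_size';   -- expected: 4194304 (4 MB)
    SHOW GLOBAL VARIABLES LIKE 'tokudb_fanout';       -- expected: 16

    -- MyRocks side: 8 KB blocks, 512 MB base level.
    SHOW GLOBAL VARIABLES LIKE 'rocksdb_block_size';          -- expected: 8192
    SHOW GLOBAL VARIABLES LIKE 'rocksdb_default_cf_options';  -- expected to include max_bytes_for_level_base=536870912 if set as described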

TokuDB vs MyRocks: TPS, first case (chart).

TokuDB vs MyRocks: benchmarking, second case. NVMe, ~100 GB DB size, 4 GB DB cache, 4 GB FS cache; 48 HW threads, 40 sysbench threads. TokuDB: block size 1 MB, fanout 128. MyRocks: block size 8 KB, first level size 32 MB (5 levels, the same number as for the 400 GB table size).

TokuDB vs MyRocks: TPS, second case (chart).

TokuDB vs MyRocks: TPS summary. Updates cause reads, so the TokuDB fanout affects read performance. The smaller the TokuDB block size, the more effective the caching and the fewer IOs. TokuDB range reads look suspicious. MyRocks: since we preserved the same number of levels and the same cache/persistent-storage balance, TPS is almost the same as in the previous test.

MyRocks: range reads, IO and CPU (charts).

TokuDB: range reads, IO and CPU (charts).

TokuDB: range reads performance (chart). PFS is coming soon...

TokuDB vs MyRocks: counting write amplification. /proc/diskstats shows the number of blocks written; sysbench shows the number of executed transactions. WA = K * (blocks written / number of executed transactions), where K = (block size / row size) + binlog writes. We count relative values: taking MyRocks WA as 100%, the constant K cancels out.
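
The same accounting written as formulas (my rendering of the slide's method); because the comparison is relative and K is identical for both engines, only the ratio of blocks written per transaction matters:

    \[
    \mathrm{WA} = K \cdot \frac{\text{blocks written}}{\text{transactions}},
    \qquad
    K = \frac{\text{block size}}{\text{row size}} + \text{binlog writes},
    \qquad
    \frac{\mathrm{WA}_{\text{TokuDB}}}{\mathrm{WA}_{\text{MyRocks}}}
    = \frac{(\text{blocks written}/\text{trx})_{\text{TokuDB}}}{(\text{blocks written}/\text{trx})_{\text{MyRocks}}}.
    \]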

TokuDB vs MyRocks: write amplification (charts).

TokuDB vs MyRocks: what about several tables? NVMe, ~100 GB DB size, 4 GB DB cache, 4 GB FS cache; 48 HW threads, 40 sysbench threads. TokuDB: block size 1 MB, fanout 128. MyRocks: block size 8 KB, first level size 32 MB. 20 tables.

TokuDB vs MyRocks: TPS, 20 tables (chart).

TokuDB vs MyRocks: benchmarking, third case. NVMe, ~200 GB compressed table size, 10 GB DB cache, 10 GB FS cache; 48 HW threads, 40 sysbench threads. TokuDB: block size 1 MB, fanout 128. MyRocks: block size 8 KB, first level size 512 MB (5 levels).

MyRocks: TPS of range updates (chart).

MyRocks: IO of range updates (chart).

TokuDB: TPS of range updates (chart).

TokuDB: IO of range updates (chart).

TokuDB vs MyRocks: benchmarking, space. For all the benchmarks: the initial space used by the TokuDB and MyRocks tables is approximately the same; TokuDB space doubles after heavy updates; MyRocks space doubles too, but after compaction the allocated space shrinks back to the initial size + ~10%.

TokuDB vs MyRocks: benchmarking, what else to try? Play with smaller TokuDB block sizes. Play with different numbers of levels, block sizes, etc. in RocksDB. Test on rotating disks. Test a time-series load. Test reads after updates/deletes. Test inserts. Test a mixed load. Test secondary-index performance.

TokuDB vs MyRocks: features. TokuDB: clustered indexes, online index creation, gap locks. MyRocks: different settings for the MemTable, column families, per-level compression settings.

DATABASE PERFORMANCE MATTERS