MyRocks Engineering Features and Enhancements
Manuel Ung
Facebook, Inc.
Dublin, Ireland
Sept 25-27, 2017

Agenda
- Bulk load
- Time to live (TTL)
- Debugging deadlocks
- Persistent auto-increment values
- Improved transactions
Bulk Load
Sorted Bulk Load
[Figure: in RocksDB, usual writes pass through memtables before reaching SST data files, while bulk loads t1 and t2 each write SST files directly.]
- SET ROCKSDB_BULK_LOAD = 1; to enable.
- Uses the RocksDB feature SST FileWriter.
- Bypasses the memtable; writes go directly to SST files.
- Keys must be added in ascending or descending order (no secondary keys).

Fast Secondary Key Creation
[Figure: in RocksDB, ALTER TABLE ADD INDEX reads the primary key SST files and produces the secondary key SST files by way of a tmpfile.]
- Integrate SST FileWriter into ALTER TABLE ADD INDEX.
- Disable secondary keys during the initial table load, then add them back afterwards.

Unsorted Bulk Load
[Figure: INSERT INTO t... feeds RocksDB; primary key and secondary key data each pass through a tmpfile before being written to SST files.]
- SET ROCKSDB_BULK_LOAD_ALLOW_UNSORTED = 1;
- No need to drop secondary keys.
- INSERTs can occur out of primary key order.
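The difference between the sorted and unsorted modes can be pictured with a toy model. This is a minimal sketch, not MyRocks or RocksDB code: `ToySstWriter`, `sorted_bulk_load` and `unsorted_bulk_load` are hypothetical names, the writer accepts ascending keys only, and the sketch assumes the tmpfile stage behaves like an external sort that builds sorted runs and merges them.

```python
import heapq


class ToySstWriter:
    """Hypothetical stand-in for an SST file writer: ascending keys only."""

    def __init__(self):
        self.entries = []

    def put(self, key, value):
        # Reject any key that does not sort after the previously written key.
        if self.entries and key <= self.entries[-1][0]:
            raise ValueError(f"key {key!r} is not greater than the previous key")
        self.entries.append((key, value))


def sorted_bulk_load(rows):
    """Write rows straight to the writer; raises if rows arrive out of order."""
    writer = ToySstWriter()
    for key, value in rows:
        writer.put(key, value)
    return writer.entries


def unsorted_bulk_load(rows, run_size=2):
    """Build small sorted runs (the assumed tmpfile stage), then merge them."""
    runs = [sorted(rows[i:i + run_size]) for i in range(0, len(rows), run_size)]
    writer = ToySstWriter()
    for key, value in heapq.merge(*runs):
        writer.put(key, value)
    return writer.entries


rows = [(3, "c"), (1, "a"), (4, "d"), (2, "b")]
print(unsorted_bulk_load(rows))
```

Passing the same out-of-order `rows` to `sorted_bulk_load` raises `ValueError`, which mirrors the ordering requirement of the sorted mode.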
Time to Live (TTL)
Time to Live (TTL)
- Some workloads have datasets that should expire after some time.
- One solution: add a create-time column and issue deletes through a daily job.
  - Requires CPU for processing the delete query.
  - Adds delete markers, slowing down scans.
- With RocksDB, we can leverage the compaction filter for this.
- The compaction filter is already used for dropping tables.
  - Respond immediately to a request to drop a table.
  - Actual data is removed when compaction occurs.

DDL Syntax
Implicit timestamp:
    CREATE TABLE t1 (a INT, b INT, c INT, PRIMARY KEY (a))
    ENGINE=ROCKSDB COMMENT "ttl_duration=3600;";
Explicit timestamp:
    CREATE TABLE t2 (a INT, b INT, c INT, ts BIGINT UNSIGNED NOT NULL, PRIMARY KEY (a))
    ENGINE=ROCKSDB COMMENT "ttl_duration=3600;ttl_col=ts;";

Row Format
    INSERT INTO t1 (a, b, c) VALUES (1,10,20);
    Stored row: t1-pk | 1 | TTL-now | 10 | 20
    INSERT INTO t2 (a, b, c, ts) VALUES (3,30,35, 1490000000);
    Stored row: t2-pk | 3 | 1490000000 | 30 | 35 | 1490000000
- A TTL field holding the create time is added to each table row.
- Implicit timestamp uses the row insertion time.
- Explicit timestamp uses the value from the column specified by ttl_col.
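The row layout above can be mimicked with a toy encoder. This is an illustrative sketch only and not the actual MyRocks storage format: the 8-byte big-endian widths, `encode_value` and `decode_ttl` are all assumptions made for the example.

```python
import struct
import time

TTL_BYTES = 8  # assumed width of the TTL field in this sketch


def encode_value(columns, ttl_ts=None):
    """Prefix the column values with a TTL timestamp.

    With ttl_ts=None the current time is used (implicit timestamp);
    otherwise the caller passes the ttl_col value (explicit timestamp).
    """
    if ttl_ts is None:
        ttl_ts = int(time.time())
    packed_columns = b"".join(struct.pack(">q", c) for c in columns)
    return struct.pack(">Q", ttl_ts) + packed_columns


def decode_ttl(value):
    """Read the TTL timestamp back from the front of the value."""
    return struct.unpack(">Q", value[:TTL_BYTES])[0]


# Explicit timestamp, as for t2: the ts column doubles as the TTL field.
value = encode_value([30, 35, 1490000000], ttl_ts=1490000000)
print(decode_ttl(value))
```

Keeping the timestamp at a fixed position lets a filter read it without decoding the rest of the row, which is the property this sketch is meant to show.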
Read Filtering
- Rows might disappear during a transaction if they expire while the transaction is active. This is a problem for repeatable read.
- Remove only rows that expired before the oldest snapshot.
- Filter rows on read based on snapshot creation time.
Read Filtering: Repeatable Read
[Figure: timeline from 0 to 4000 with ttl_duration = 1000, showing Transaction 1, Transaction 2 and compaction.]
Events, in order:
1. INSERT INTO t VALUES (1)
2. INSERT INTO t VALUES (2)
3. BEGIN; SELECT * FROM t
4. Compaction removes row 1 and keeps row 2.
5. SELECT * FROM t
- The SELECTs see row 2 only, because row 1 is filtered out of the result set:
  timestamp(row 1) < timestamp(current) - ttl_duration
- Compaction keeps row 2 despite it being expired already:
  timestamp(row 2) >= timestamp(oldest snapshot) - ttl_duration

TTL with Secondary Keys
Read filtering makes secondary keys with TTL possible.
Implicit timestamp:
    CREATE TABLE t1 (a INT, b INT, c INT, PRIMARY KEY (a), KEY(b))
    ENGINE=ROCKSDB COMMENT "ttl_duration=3600;";
Explicit timestamp:
    CREATE TABLE t2 (a INT, b INT, c INT, ts BIGINT UNSIGNED NOT NULL, PRIMARY KEY (a), KEY(b))
    ENGINE=ROCKSDB COMMENT "ttl_duration=3600;ttl_col=ts;";
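The two inequalities on the repeatable-read slide can be written as a pair of predicates. This is a minimal sketch with hypothetical function names; the concrete timestamps are invented for illustration and are not read off the slide's timeline.

```python
def hidden_on_read(row_ts, read_ts, ttl_duration):
    """Read filter: hide a row whose TTL had elapsed at the read/snapshot time."""
    return row_ts < read_ts - ttl_duration


def compaction_may_drop(row_ts, oldest_snapshot_ts, ttl_duration):
    """Compaction filter: drop a row only if it expired before the oldest snapshot."""
    return row_ts < oldest_snapshot_ts - ttl_duration


TTL = 1000
ROW1_TS, ROW2_TS = 0, 1500      # hypothetical insertion times
SNAPSHOT_TS = 2000              # hypothetical: transaction takes its snapshot
COMPACTION_TS = 3000            # hypothetical: compaction runs later

# Row 1 had already expired when the snapshot was taken, so reads skip it
# and compaction is free to remove it.
print(hidden_on_read(ROW1_TS, SNAPSHOT_TS, TTL))
print(compaction_may_drop(ROW1_TS, SNAPSHOT_TS, TTL))

# Row 2 is expired by wall-clock time at compaction, but the open snapshot
# can still see it, so compaction must keep it.
print(hidden_on_read(ROW2_TS, COMPACTION_TS, TTL))
print(compaction_may_drop(ROW2_TS, SNAPSHOT_TS, TTL))
```

Using the oldest snapshot time instead of the current time in the compaction predicate is what keeps a repeatable-read transaction from losing rows mid-flight.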
Debugging Deadlocks
Snapshot Conflicts vs Deadlocks
Both snapshot conflicts and deadlocks return ER_LOCK_DEADLOCK.
- Snapshot conflicts
  - Happen during REPEATABLE READ when multiple transactions modify the same row.
  - Error: "Deadlock found when trying to get lock; try restarting transaction (snapshot conflict)"
- Deadlocks
  - Happen when multiple transactions lock rows in different orders.
  - Error: "Deadlock found when trying to get lock; try restarting transaction"
- Get the most recent deadlocks from SHOW ENGINE ROCKSDB TRANSACTION STATUS;
- The number of deadlocks stored is controlled by rocksdb_max_latest_deadlocks.
Latest Detected Deadlocks
Transaction 1:
    BEGIN;
    SELECT * FROM t WHERE i = 1 FOR UPDATE;
    SELECT * FROM t WHERE i = 2 FOR UPDATE;   (deadlock)
Transaction 2:
    BEGIN;
    SELECT * FROM t WHERE i = 2 FOR UPDATE;
    SELECT * FROM t WHERE i = 1 FOR UPDATE;

    mysql> SHOW ENGINE ROCKSDB TRANSACTION STATUS;
    ----------LATEST DETECTED DEADLOCKS----------
    *** DEADLOCK PATH
    =========================================
    TRANSACTION ID: 2
    COLUMN FAMILY NAME: default
    WAITING KEY: 0000010580000001
    LOCK TYPE: EXCLUSIVE
    INDEX NAME: PRIMARY
    TABLE NAME: test.t
    ---------------WAITING FOR---------------
    TRANSACTION ID: 1
    COLUMN FAMILY NAME: default
    WAITING KEY: 0000010580000002
    LOCK TYPE: EXCLUSIVE
    INDEX NAME: PRIMARY
    TABLE NAME: test.t
    ---------------WAITING FOR---------------
    TRANSACTION ID: 2
    COLUMN FAMILY NAME: default
    WAITING KEY: 0000010580000001
    LOCK TYPE: EXCLUSIVE
    INDEX NAME: PRIMARY
    TABLE NAME: test.t
    --------TRANSACTION ID: 1 GOT DEADLOCK---------
    -----------------------------------------
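The deadlock path in the output above is a cycle in a wait-for graph. The sketch below shows one simple way to find such a cycle; it is not the RocksDB detector, `find_deadlock_path` is a hypothetical name, and it assumes each transaction waits on at most one other transaction.

```python
def find_deadlock_path(waits_for, start):
    """Follow wait-for edges from `start`.

    Returns the cycle as a list that begins and ends with `start`,
    or None if the chain ends or loops without coming back to `start`.
    """
    path = []
    seen = set()
    txn = start
    while txn in waits_for and txn not in seen:
        seen.add(txn)
        path.append(txn)
        txn = waits_for[txn]
    if path and txn == start:
        return path + [start]
    return None


# Transaction 2 waits for transaction 1, which waits for transaction 2.
waits_for = {2: 1, 1: 2}
print(find_deadlock_path(waits_for, 2))
```

For the example graph the returned path is 2, 1, 2, the same order as the WAITING FOR chain in the status output.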
Persistent Auto-increment Values
Auto-increment Values
- Auto-increment values are not persisted (both InnoDB and RocksDB).
- InnoDB behavior fixed in MySQL 8.0.
- RocksDB fixed by storing the maximum ID in the data dictionary.

    STATEMENT                                            LAST_INSERT_ID
    CREATE TABLE t (i int AUTO_INCREMENT PRIMARY KEY);
    INSERT INTO t VALUES (NULL);                         1
    INSERT INTO t VALUES (NULL);                         2
    INSERT INTO t VALUES (NULL);                         3
    DELETE FROM t;
    # Restart server
    INSERT INTO t VALUES (NULL);                         1

Data Dictionary
    Key:   0x9 | INDEX_ID
    Value: VERSION | AUTO_INC ID
- The maximum auto-increment ID is stored in the data dictionary.
- Keyed by the primary key index ID of the table containing the auto-increment column.
- Makes use of the RocksDB merge operator feature.
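The restart behavior in the table can be reproduced with a toy model. This is a sketch under stated assumptions, not engine code: `ToyTable` is hypothetical, and it assumes that without persistence the counter is rebuilt on restart from the highest row still present.

```python
class ToyTable:
    """Toy auto-increment table with an optional persisted maximum ID."""

    def __init__(self, persist_max=False):
        self.rows = set()
        self.persist_max = persist_max
        self.persisted_max = 0  # survives restart (stand-in for the data dictionary)
        self.next_id = 1        # in-memory counter, lost on restart

    def insert(self):
        new_id = self.next_id
        self.rows.add(new_id)
        self.next_id += 1
        if self.persist_max:
            self.persisted_max = max(self.persisted_max, new_id)
        return new_id

    def delete_all(self):
        self.rows.clear()

    def restart(self):
        # Rebuild the counter from what is still on disk.
        highest = max(self.rows, default=0)
        if self.persist_max:
            highest = max(highest, self.persisted_max)
        self.next_id = highest + 1


for persist in (False, True):
    t = ToyTable(persist_max=persist)
    ids = [t.insert() for _ in range(3)]
    t.delete_all()
    t.restart()
    print(persist, ids, t.insert())
```

Without persistence the insert after the restart reuses ID 1, as in the table; with the persisted maximum it continues from 4.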
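Merge Operator
[Figure: three transactions each insert a row and record the new ID with a merge on the index ID; the merge operands accumulate in the memtable.]
    tx1: INSERT INTO t VALUES (NULL);   PUT(1), MERGE(IDX_ID) : 1, COMMIT
    tx2: INSERT INTO t VALUES (NULL);   PUT(2), MERGE(IDX_ID) : 2, COMMIT
    tx3: INSERT INTO t VALUES (NULL);   PUT(3), MERGE(IDX_ID) : 3, COMMIT
    Memtable: MERGE(IDX_ID) : 2, MERGE(IDX_ID) : 3, MERGE(IDX_ID) : 1

Merge Operator
[Figure: GET(IDX_ID) applies the merge operator (MO) across the operands in the memtable.]
    Memtable: MERGE(IDX_ID) : 2, MERGE(IDX_ID) : 3, MERGE(IDX_ID) : 1
    MO(2, 3) = 3
    MO(3, 1) = 3
    GET(IDX_ID) returns VALUE : 3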
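The merge steps above behave like a running maximum over the recorded operands. The sketch below models that in plain Python; `ToyStore` and `max_merge` are hypothetical names and this is not the RocksDB merge operator API.

```python
from functools import reduce


def max_merge(existing, operand):
    """Merge function: keep the larger of the existing value and the new operand."""
    return operand if existing is None else max(existing, operand)


class ToyStore:
    """Toy key-value store that records merge operands and resolves them on read."""

    def __init__(self):
        self.operands = {}

    def merge(self, key, operand):
        # Writers only append an operand; no read-modify-write is needed.
        self.operands.setdefault(key, []).append(operand)

    def get(self, key):
        ops = self.operands.get(key)
        if not ops:
            return None
        return reduce(max_merge, ops, None)


store = ToyStore()
for auto_inc_id in (2, 3, 1):  # operand order as shown in the memtable
    store.merge("IDX_ID", auto_inc_id)
print(store.get("IDX_ID"))
```

Because the maximum does not depend on the order in which operands arrive, concurrent transactions can record their IDs in any commit order and the read still resolves to 3.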
Improved Transactions
Problems
- Low throughput
- Commit stalls
- Memory footprint

Low Throughput
- Separate queues for prepare and commit
  - Decreases queue latency for commits
- FlushWAL
  - Avoids fwrite syscall latency in the commit path
  - http://rocksdb.org/blog/2017/08/25/flushwal.html
[Figure: Linkbench transactions per second, before and after, at 8, 16, 32 and 64 threads.]

Commit Stalls
- Move the memtable write from commit to prepare.
  - Less work done at commit time.
  - Higher throughput.
  - Large transactions won't stall the server.
- Work in progress.

Memory Footprint
- Move the memtable write from prepare to put.
  - Uncommitted data will be written into the database without needing to be buffered in memory.
- Work in progress.
Additional Information
GitHub
- https://github.com/facebook/mysql-5.6
- Currently based on 5.6.35
- We welcome feedback and contributions!
Q&A