Closing the Performance Gap Between Volatile and Persistent K-V Stores

Closing the Performance Gap Between Volatile and Persistent K-V Stores Yihe Huang, Harvard University Matej Pavlovic, EPFL Virendra Marathe, Oracle Labs Margo Seltzer, Oracle Labs Tim Harris, Oracle Labs Steve Byan, Oracle Labs ATC 2018, 07/13/2018, Boston MA, USA 1

Key-Value Stores: Persistent or Fast? Persistent Slow or Volatile Fast 2

Key-Value Stores: Persistent or Fast? Persistent Slow or Volatile Fast What about both? 2

Non-Volatile Memory Faster than HDD/SSD Byte-addressable 3

Non-Volatile Memory Faster than HDD/SSD Slower than DRAM Byte-addressable Tricky to manage consistently 3

Non-Volatile Memory Faster than HDD/SSD Slower than DRAM Byte-addressable Tricky to manage consistently DRAM K-V stores don t easily translate to NVM 3

Outline Cross-Referencing Logs: Getting the best of DRAM and NVM Bullet: Fast and persistent key-value store 4

Outline Cross-Referencing Logs: Getting the best of DRAM and NVM Bullet: Fast and persistent key-value store 5

Caching Volatile store (frontend) caches Persistent store (backend) 6

Caching Volatile store (frontend) caches Persistent store (backend) NVM 7

Caching Volatile store (frontend) How? caches Persistent store (backend) NVM 7

Reading: Standard Volatile store (frontend) Lookup on cache miss Persistent store (backend) NVM 8

Writing: Logs Volatile store (frontend) 1. Persistently log op. 2. Update frontend 3. Return to client 4. Update backend in background Persistent store (backend) NVM 9

Writing: How? Logs Volatile store (frontend) 1. Persistently log op. 2. Update frontend 3. Return to client 4. Update backend in background Persistent store (backend) NVM 9

Writing: How? Logs Volatile store (frontend) 1. Persistently log op. 2. Update frontend 3. Return to client 4. Update backend in background Persistent store (backend) NVM Challenges: persistence, low latency, scalability, consistency 9

Cross-Referencing Logs Persistence: Place logs in NVM Low latency Few NVM accesses per log append Scalability Multiple logs (one per thread) Consistency Cross-log references 10

Cross-Referencing Logs Log entry: Key OP Applied Prev append consume 11

Cross-Referencing Logs Log entry: Key OP Applied Prev Key 1, OP: put A append consume 11

Cross-Referencing Logs Log entry: Key OP Applied Prev Key 1, OP: put A append Key 2, OP: put B consume 11

Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume 11

Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 3, OP: delete 11

Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete 11

Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete 12

Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 13

Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 14

Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 15

Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 16

Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B Applied, Pr: consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 17

Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Applied, Pr: Key 1, OP: put A append Key 2, OP: put B Applied, Pr: consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 18

Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Applied, Pr: Key 1, OP: put A append Key 2, OP: put B Applied, Pr: consume Key 2, OP: put C Applied, Pr: Key 3, OP: delete Applied, Pr: 19

Outline Cross-Referencing Logs: Getting the best of DRAM and NVM Bullet: Fast and persistent key-value store 20

Outline Cross-Referencing Logs: Getting the best of DRAM and NVM Bullet: Fast and persistent key-value store Basic design Optimizations 20

Outline Cross-Referencing Logs: Getting the best of DRAM and NVM Bullet: Fast and persistent key-value store Basic design Optimizations 21

Bullet K-V Store Volatile store (frontend) caches Persistent store (backend) NVM 22

Bullet K-V Store Volatile store (frontend) caches Persistent store (backend) NVM 23

Bullet K-V Store Volatile store (frontend) Persistent store (backend) caches 24

Experimental Setup 16 CPU cores, 512 GB DRAM Intel s NVM emulation platform Zipfian distribution of keys (YCSB) Measuring 99%ile latency Comparing against HiKV 25

Raw Hash Table Performance 26

Bullet K-V Store Volatile store (frontend) Persistent store (backend) caches 27

Bullet K-V Store Volatile store (frontend) Persistent store (backend) X-Ref Logs 28

Bullet K-V Store Volatile store (frontend) Writers Persistent store (backend) X-Ref Logs 29

Bullet K-V Store Volatile store (frontend) Writers Gleaners Persistent store (backend) X-Ref Logs 29

Latency in ns Bullet Performance 800 700 600 500 400 300 200 100 0 Read-only volatile phash 0 5000000 10000000 15000000 20000000 25000000 Throughput in ops/sec 30

Latency in ns Bullet Performance 800 700 600 500 400 300 200 100 0 Read-only volatile phash hikv-ht 0 5000000 10000000 15000000 20000000 25000000 Throughput in ops/sec 31

Latency in ns Bullet Performance 800 700 600 500 400 300 200 100 0 Read-only volatile phash hikv-ht bullet-st 0 5000000 10000000 15000000 20000000 25000000 Throughput in ops/sec 32

Latency in ns Bullet Performance 1800 1600 1400 1200 1000 800 600 400 200 0 15% writes volatile phash hikv-ht bullet-st 0 2000000 4000000 6000000 8000000 10000000 12000000 14000000 Throughput in ops/sec 33

Outline Cross-Referencing Logs: Getting the best of DRAM and NVM Bullet: Fast and persistent key-value store Basic design Optimizations 34

Optimization: Dynamic Worker Threads Volatile store (frontend) Writers Gleaners Persistent store (backend) X-Ref Logs 35

Optimization: Dynamic Worker Threads Volatile store (frontend) Writers Gleaners Persistent store (backend) X-Ref Logs 36

Read-Heavy Workload Volatile store (frontend) Writers Gleaners Persistent store (backend) X-Ref Logs 37

Write-Heavy Workload Volatile store (frontend) Writers Gleaners Persistent store (backend) X-Ref Logs 38

Optimization: Overwriting Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 39

Optimization: Overwriting Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 40

Optimization: Overwriting Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 41

Optimization: Overwriting Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 42

Optimization: Overwriting Log entry: Key OP Applied Prev append IGNORE Key 2, OP: delete Key 1, OP: put A IGNORE Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 43

Optimization: Overwriting Volatile store (frontend) Writers Gleaners Persistent store (backend) X-Ref Logs 44

Latency in ns Bullet Performance 800 700 600 500 400 300 200 100 0 Read-only volatile phash hikv-ht bullet-st 0 5000000 10000000 15000000 20000000 25000000 Throughput in ops/sec 45

Latency in ns Bullet Performance 800 700 600 500 400 300 200 100 0 Read-only volatile phash hikv-ht bullet-st bullet-dyn-ovr 0 5000000 10000000 15000000 20000000 25000000 Throughput in ops/sec 46

Latency in ns Bullet Performance 1800 1600 1400 1200 1000 800 600 400 200 0 15% writes volatile phash hikv-ht bullet-st bullet-dyn-ovr 0 2000000 4000000 6000000 8000000 10000000 12000000 14000000 Throughput in ops/sec 47

Latency in ns Bullet Performance 1000 2% writes volatile phash hikv-ht bullet-st bullet-dyn-ovr 800 600 400 200 0 0 5000000 10000000 15000000 20000000 25000000 Throughput in ops/sec 48

Summary DRAM cache for NVM DRAM frontend, NVM backend Novel caching scheme: Cross-referencing logs 49

Summary DRAM cache for NVM DRAM frontend, NVM backend Novel caching scheme: Cross-referencing logs Bullet K-V store Almost identical hash tables in DRAM and NVM DRAM performance for reads Substantial latency improvements for writes 49

Backup Slides 50

Optimization Summary Dynamic worker threads Overwrites Non-blocking reads (see paper) Backend access off critical path (see paper) 51

Bullet Performance Read-only 52

Bullet Performance 2% writes 53

Bullet Performance 15% writes 54

Bullet Performance 50% writes 55

Bullet Performance (end-to-end) Read-only 56

Bullet Performance (end-to-end) 2% writes 57

Bullet Performance (end-to-end) 15% writes 58

Bullet Performance (end-to-end) 50% writes 59

Latency Distribution Reads 60

Latency Distribution Writes 61

Log Size Senzitivity 62

Experimental Setup 16 CPU cores 512 GB DRAM NVM simulated using DRAM 384 GB (out of 512GB) 300ns load (2x DRAM) 100ns persist barrier Same throughput DRAM cache fits all data Zipfian distribution of keys (YCSB) 50M K/V pairs, 16 B keys, 100 B values Measuring 99%ile latency 63

Dynamic Worker Thread Switching 64

Cross-Referencing Logs Persistence: Circular buffers in NVM Low latency Fast appends (3 persist barriers) Scalability Multiple logs, stall only when all full Coarse-grained synchronization Epoch-based space reclamation Consistency Cross-log references 65

Cross-Referencing Logs Summary Persistent circular buffers (in NVM) Cross-log pointers for consistency Coarse-grained synchronization Fast appends Epoch-based space reclamation 66

Latency in ns 800 700 600 500 400 300 200 100 0 Read-only volatile phash hikv-ht bullet-st bullet-dyn bullet-dyn-ovr 0 5000000 10000000 15000000 20000000 25000000 Throughput in ops/sec 67