Closing the Performance Gap Between Volatile and Persistent K-V Stores Yihe Huang, Harvard University Matej Pavlovic, EPFL Virendra Marathe, Oracle Labs Margo Seltzer, Oracle Labs Tim Harris, Oracle Labs Steve Byan, Oracle Labs ATC 2018, 07/13/2018, Boston MA, USA 1
Key-Value Stores: Persistent or Fast? Persistent Slow or Volatile Fast 2
Key-Value Stores: Persistent or Fast? Persistent Slow or Volatile Fast What about both? 2
Non-Volatile Memory Faster than HDD/SSD Byte-addressable 3
Non-Volatile Memory Faster than HDD/SSD Slower than DRAM Byte-addressable Tricky to manage consistently 3
Non-Volatile Memory Faster than HDD/SSD Slower than DRAM Byte-addressable Tricky to manage consistently DRAM K-V stores don t easily translate to NVM 3
Outline Cross-Referencing Logs: Getting the best of DRAM and NVM Bullet: Fast and persistent key-value store 4
Outline Cross-Referencing Logs: Getting the best of DRAM and NVM Bullet: Fast and persistent key-value store 5
Caching Volatile store (frontend) caches Persistent store (backend) 6
Caching Volatile store (frontend) caches Persistent store (backend) NVM 7
Caching Volatile store (frontend) How? caches Persistent store (backend) NVM 7
Reading: Standard Volatile store (frontend) Lookup on cache miss Persistent store (backend) NVM 8
Writing: Logs Volatile store (frontend) 1. Persistently log op. 2. Update frontend 3. Return to client 4. Update backend in background Persistent store (backend) NVM 9
Writing: How? Logs Volatile store (frontend) 1. Persistently log op. 2. Update frontend 3. Return to client 4. Update backend in background Persistent store (backend) NVM 9
Writing: How? Logs Volatile store (frontend) 1. Persistently log op. 2. Update frontend 3. Return to client 4. Update backend in background Persistent store (backend) NVM Challenges: persistence, low latency, scalability, consistency 9
Cross-Referencing Logs Persistence: Place logs in NVM Low latency Few NVM accesses per log append Scalability Multiple logs (one per thread) Consistency Cross-log references 10
Cross-Referencing Logs Persistence: Place logs in NVM Low latency Few NVM accesses per log append Scalability Multiple logs (one per thread) Consistency Cross-log references 10
Cross-Referencing Logs Log entry: Key OP Applied Prev append consume 11
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 1, OP: put A append consume 11
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 1, OP: put A append Key 2, OP: put B consume 11
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume 11
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume 11
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 3, OP: delete 11
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete 11
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete 11
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete 12
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 13
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 14
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 15
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 16
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B Applied, Pr: consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 17
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Applied, Pr: Key 1, OP: put A append Key 2, OP: put B Applied, Pr: consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 18
Cross-Referencing Logs Log entry: Key OP Applied Prev Key 2, OP: delete Applied, Pr: Key 1, OP: put A append Key 2, OP: put B Applied, Pr: consume Key 2, OP: put C Applied, Pr: Key 3, OP: delete Applied, Pr: 19
Outline Cross-Referencing Logs: Getting the best of DRAM and NVM Bullet: Fast and persistent key-value store 20
Outline Cross-Referencing Logs: Getting the best of DRAM and NVM Bullet: Fast and persistent key-value store Basic design Optimizations 20
Outline Cross-Referencing Logs: Getting the best of DRAM and NVM Bullet: Fast and persistent key-value store Basic design Optimizations 21
Bullet K-V Store Volatile store (frontend) caches Persistent store (backend) NVM 22
Bullet K-V Store Volatile store (frontend) caches Persistent store (backend) NVM 23
Bullet K-V Store Volatile store (frontend) Persistent store (backend) caches 24
Experimental Setup 16 CPU cores, 512 GB DRAM Intel s NVM emulation platform Zipfian distribution of keys (YCSB) Measuring 99%ile latency Comparing against HiKV 25
Raw Hash Table Performance 26
Bullet K-V Store Volatile store (frontend) Persistent store (backend) caches 27
Bullet K-V Store Volatile store (frontend) Persistent store (backend) X-Ref Logs 28
Bullet K-V Store Volatile store (frontend) Writers Persistent store (backend) X-Ref Logs 29
Bullet K-V Store Volatile store (frontend) Writers Gleaners Persistent store (backend) X-Ref Logs 29
Latency in ns Bullet Performance 800 700 600 500 400 300 200 100 0 Read-only volatile phash 0 5000000 10000000 15000000 20000000 25000000 Throughput in ops/sec 30
Latency in ns Bullet Performance 800 700 600 500 400 300 200 100 0 Read-only volatile phash hikv-ht 0 5000000 10000000 15000000 20000000 25000000 Throughput in ops/sec 31
Latency in ns Bullet Performance 800 700 600 500 400 300 200 100 0 Read-only volatile phash hikv-ht bullet-st 0 5000000 10000000 15000000 20000000 25000000 Throughput in ops/sec 32
Latency in ns Bullet Performance 1800 1600 1400 1200 1000 800 600 400 200 0 15% writes volatile phash hikv-ht bullet-st 0 2000000 4000000 6000000 8000000 10000000 12000000 14000000 Throughput in ops/sec 33
Outline Cross-Referencing Logs: Getting the best of DRAM and NVM Bullet: Fast and persistent key-value store Basic design Optimizations 34
Optimization: Dynamic Worker Threads Volatile store (frontend) Writers Gleaners Persistent store (backend) X-Ref Logs 35
Optimization: Dynamic Worker Threads Volatile store (frontend) Writers Gleaners Persistent store (backend) X-Ref Logs 36
Read-Heavy Workload Volatile store (frontend) Writers Gleaners Persistent store (backend) X-Ref Logs 37
Write-Heavy Workload Volatile store (frontend) Writers Gleaners Persistent store (backend) X-Ref Logs 38
Optimization: Overwriting Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 39
Optimization: Overwriting Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 40
Optimization: Overwriting Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 41
Optimization: Overwriting Log entry: Key OP Applied Prev Key 2, OP: delete Key 1, OP: put A append Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 42
Optimization: Overwriting Log entry: Key OP Applied Prev append IGNORE Key 2, OP: delete Key 1, OP: put A IGNORE Key 2, OP: put B consume Key 2, OP: put C Key 3, OP: delete Applied, Pr: 43
Optimization: Overwriting Volatile store (frontend) Writers Gleaners Persistent store (backend) X-Ref Logs 44
Latency in ns Bullet Performance 800 700 600 500 400 300 200 100 0 Read-only volatile phash hikv-ht bullet-st 0 5000000 10000000 15000000 20000000 25000000 Throughput in ops/sec 45
Latency in ns Bullet Performance 800 700 600 500 400 300 200 100 0 Read-only volatile phash hikv-ht bullet-st bullet-dyn-ovr 0 5000000 10000000 15000000 20000000 25000000 Throughput in ops/sec 46
Latency in ns Bullet Performance 1800 1600 1400 1200 1000 800 600 400 200 0 15% writes volatile phash hikv-ht bullet-st bullet-dyn-ovr 0 2000000 4000000 6000000 8000000 10000000 12000000 14000000 Throughput in ops/sec 47
Latency in ns Bullet Performance 1000 2% writes volatile phash hikv-ht bullet-st bullet-dyn-ovr 800 600 400 200 0 0 5000000 10000000 15000000 20000000 25000000 Throughput in ops/sec 48
Summary DRAM cache for NVM DRAM frontend, NVM backend Novel caching scheme: Cross-referencing logs 49
Summary DRAM cache for NVM DRAM frontend, NVM backend Novel caching scheme: Cross-referencing logs Bullet K-V store Almost identical hash tables in DRAM and NVM DRAM performance for reads Substantial latency improvements for writes 49
Summary DRAM cache for NVM DRAM frontend, NVM backend Novel caching scheme: Cross-referencing logs Bullet K-V store Almost identical hash tables in DRAM and NVM DRAM performance for reads Substantial latency improvements for writes More details in the paper! 49
Backup Slides 50
Optimization Summary Dynamic worker threads Overwrites Non-blocking reads (see paper) Backend access off critical path (see paper) 51
Bullet Performance Read-only 52
Bullet Performance 2% writes 53
Bullet Performance 15% writes 54
Bullet Performance 50% writes 55
Bullet Performance (end-to-end) Read-only 56
Bullet Performance (end-to-end) 2% writes 57
Bullet Performance (end-to-end) 15% writes 58
Bullet Performance (end-to-end) 50% writes 59
Latency Distribution Reads 60
Latency Distribution Writes 61
Log Size Senzitivity 62
Experimental Setup 16 CPU cores 512 GB DRAM NVM simulated using DRAM 384 GB (out of 512GB) 300ns load (2x DRAM) 100ns persist barrier Same throughput DRAM cache fits all data Zipfian distribution of keys (YCSB) 50M K/V pairs, 16 B keys, 100 B values Measuring 99%ile latency 63
Dynamic Worker Thread Switching 64
Cross-Referencing Logs Persistence: Circular buffers in NVM Low latency Fast appends (3 persist barriers) Scalability Multiple logs, stall only when all full Coarse-grained synchronization Epoch-based space reclamation Consistency Cross-log references 65
Cross-Referencing Logs Summary Persistent circular buffers (in NVM) Cross-log pointers for consistency Coarse-grained synchronization Fast appends Epoch-based space reclamation 66
Latency in ns 800 700 600 500 400 300 200 100 0 Read-only volatile phash hikv-ht bullet-st bullet-dyn bullet-dyn-ovr 0 5000000 10000000 15000000 20000000 25000000 Throughput in ops/sec 67