BzTree: A High-Performance Latch-free Range Index for Non-Volatile Memory
  JOY ARULRAJ, JUSTIN LEVANDOSKI, UMAR FAROOQ MINHAS, PER-AKE LARSON
  Microsoft Research
NON-VOLATILE MEMORY [NVM]
  [Figure: DRAM, NVM, and SSD positioned along two axes: performance (fast to slow) and durability (volatile to non-volatile).]
DEVICE CHARACTERISTICS
  CHARACTERISTIC         DRAM    NVM    SSD
  Device Latency         x       0x     000x
  Byte-Addressability
  Durability
  High Capacity
BWTREE: LATCH-FREE B+TREE
  [Figure: B+tree nodes updated with the CPU's single-word compare-and-swap instruction.]
BZTREE: NVM-CENTRIC LATCH-FREE B+TREE
  [Figure: a latch-free B+tree residing in non-volatile memory.]
BWTREE INDEX | BZTREE INDEX | EXPERIMENTAL RESULTS
BWTREE: SSD-CENTRIC ARCHITECTURE
  [Figure: a mapping table (page ID to address) in front of the index; buffer pool in DRAM; log-structured store on SSD.]
BWTREE: LATCH-FREE ALGORITHMS
  [Figure: delta records (a DELETE and an INSERT) chained in front of node P; the mapping-table entry (page ID to address) is updated with the CPU's single-word compare-and-swap instruction.]
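The delta-record idea on this slide can be sketched in a few lines. This is a minimal illustration, not the BwTree implementation: the names `AtomicRef`, `Delta`, `BaseNode`, `prepend_delta`, and `read_keys` are hypothetical, the key values are made up for the example, and a lock stands in for the hardware compare-and-swap because Python does not expose one.

```python
import threading


class AtomicRef:
    """Stand-in for one machine word updated by single-word compare-and-swap.

    Assumption: a lock emulates the atomicity that the CPU instruction provides.
    """

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False


class BaseNode:
    def __init__(self, keys):
        self.keys = list(keys)


class Delta:
    """Describes one change and points at the older state of the page."""

    def __init__(self, op, key, next_record):
        self.op = op              # "INSERT" or "DELETE"
        self.key = key
        self.next = next_record   # older delta, or the base node


def prepend_delta(mapping_table, page_id, op, key):
    """Install a delta in front of the page, retrying if the CAS loses a race."""
    slot = mapping_table[page_id]
    while True:
        head = slot.load()
        if slot.compare_and_swap(head, Delta(op, key, head)):
            return


def read_keys(mapping_table, page_id):
    """Replay the delta chain, oldest first, on top of the base node."""
    record = mapping_table[page_id].load()
    chain = []
    while isinstance(record, Delta):
        chain.append(record)
        record = record.next
    keys = set(record.keys)
    for delta in reversed(chain):
        if delta.op == "INSERT":
            keys.add(delta.key)
        else:
            keys.discard(delta.key)
    return sorted(keys)


# Illustrative page "P" with two keys, then one delete and one insert.
mapping_table = {"P": AtomicRef(BaseNode([1, 2]))}
prepend_delta(mapping_table, "P", "DELETE", 2)
prepend_delta(mapping_table, "P", "INSERT", 3)
print(read_keys(mapping_table, "P"))  # prints [1, 3]
```

The base node is never modified in place; every update is a new record plus one pointer swap in the mapping table, which is why a single-word compare-and-swap suffices for simple updates.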
BWTREE: LOGGING & RECOVERY PROTOCOL
  [Figure: a transaction (BEGIN TRANSACTION; Update Stock by Stock ID; COMMIT TRANSACTION) passing through the buffer pool in DRAM to the index and the log on SSD.]
  INDEX-SPECIFIC LOGGING & RECOVERY
BWTREE: RECAP
  Delivers high performance on a DRAM + SSD system
    SSD-centric architecture
    Latch-free algorithms
    Logging & recovery protocol
  Limitations
    NVM invalidates the key design assumptions of BwTree
    Challenging to design & extend such latch-free data structures
PROBLEM #1: ALGORITHMIC COMPLEXITY
  [Figure: splitting node A into A and B takes several steps, each performed with the CPU's single-word compare-and-swap instruction.]
  LATCH-FREEDOM: INTERMEDIATE STATES
PROBLEM #2: PROTOCOL COMPLEXITY
  [Figure: buffer pool, index, and log on NVM.]
  DURABILITY & ATOMICITY: INDEX-SPECIFIC LOGGING & RECOVERY
PROBLEM #3: ARCHITECTURAL COMPLEXITY
  [Figure: mapping table (page ID to address), buffer pool, and index on NVM.]
  MAPPING TABLE: LOCATION VIRTUALIZATION
HOW CAN WE SIMPLIFY LATCH-FREE PROGRAMMING ON NON-VOLATILE MEMORY?
BZTREE: OVERVIEW
  NVM-centric design
    Based on a new NVM-centric software primitive
    Provides the same guarantees as the disk-centric BwTree
  BzTree supersedes BwTree (skipped BxTree and ByTree)
    Because we think that it is the last index you will ever need!
  Key techniques
    Adopt a simpler NVM-centric architecture
    Reduce complexity using the software primitive
NVM-CENTRIC SOFTWARE PRIMITIVE
  HARDWARE PRIMITIVE (DRAM): VOLATILE SINGLE-WORD COMPARE-AND-SWAP
  SOFTWARE PRIMITIVE (NVM): PERSISTENT MULTI-WORD COMPARE-AND-SWAP
  EASY LOCK-FREE INDEXING IN NON-VOLATILE MEMORY, ICDE 2018
BZTREE: NVM-CENTRIC ARCHITECTURE
  [Figure: components shown are the L1 cache, L2 cache, buffer pool, persistent multi-word CAS, NVM, index, and log, with an example transaction: BEGIN TRANSACTION; Update Stock by Stock ID; COMMIT TRANSACTION.]
BZTREE: DURABILITY & ATOMICITY
  OPERATION TABLE (PERSISTENT MULTI-WORD CAS), NVM
  LOCATION   EXPECTED OLD VALUE    NEW VALUE            FLUSHED
  0x100      OLD CHILD POINTER     NEW CHILD POINTER
  0x200      OLD NODE STATUS       NEW NODE STATUS
  0x300      OLD PARENT POINTER    NEW PARENT POINTER   0
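Each row of the operation table is a (location, expected old value, new value) triple, and the primitive promises that either all rows are applied or none are. The sketch below shows only that all-or-nothing contract. It is not the persistent multi-word CAS algorithm itself: the function name `pmwcas`, the dictionary standing in for NVM, and the global lock are all assumptions made for illustration, whereas the real primitive is lock-free and also flushes its writes to NVM.

```python
import threading

# Assumption: one lock emulates atomicity; the real primitive is lock-free.
_memory_lock = threading.Lock()


def pmwcas(nvm, entries):
    """Apply every (location, expected, new) entry, or none of them.

    `nvm` is a dict standing in for word-addressable non-volatile memory.
    Returns True on success, False if any word did not hold its expected value.
    """
    with _memory_lock:
        for location, expected, _new in entries:
            if nvm[location] != expected:
                return False          # one word changed: touch nothing
        for location, _expected, new in entries:
            nvm[location] = new       # a real implementation would also flush here
        return True


nvm = {
    0x100: "old child pointer",
    0x200: "old node status",
    0x300: "old parent pointer",
}

# The three rows of the operation table, installed as one atomic step.
split = [
    (0x100, "old child pointer", "new child pointer"),
    (0x200, "old node status", "new node status"),
    (0x300, "old parent pointer", "new parent pointer"),
]

first = pmwcas(nvm, split)    # all three words match, so all three are swapped
second = pmwcas(nvm, split)   # expected values are now stale, so nothing changes
print(first, second)          # prints True False
```

Because the three words change together, other threads see either the state before the split or the state after it, never a mixture.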
SOLUTION #1: ALGORITHMIC COMPLEXITY
  [Figure: splitting node A into A and B with a persistent multi-word CAS.]
  EXPONENTIALLY FEWER INTERMEDIATE STATES
SOLUTION #2: PROTOCOL COMPLEXITY
  PERSISTENT MULTI-WORD CAS (INDEX ON NVM)
  LOCATION   OLD VALUE             NEW VALUE            FLUSHED
  0x100      OLD CHILD POINTER     NEW CHILD POINTER
  0x200      OLD NODE STATUS       NEW NODE STATUS
  0x300      OLD PARENT POINTER    NEW PARENT POINTER   0
  DURABILITY & ATOMICITY: NO INDEX-SPECIFIC PROTOCOL
SOLUTION #3: ARCHITECTURAL COMPLEXITY
  NO MAPPING TABLE
  NO DELTA RECORDS & INDIRECTION OVERHEAD
  NO LOG-STRUCTURED INDEX STORE
  [Figure: the index stored directly in NVM.]
EVALUATION
  Index data structures: BzTree vs. BwTree index
    Code complexity
    Runtime performance
    Recovery time
  Benchmark: Yahoo! Cloud Serving Benchmark (YCSB)
    Read-mostly & balanced workloads
  Storage device
    Emulated non-volatile memory
CODE COMPLEXITY (Lower is Better)
  CODE COMPLEXITY METRIC   BWTREE   BZTREE
  CYCLOMATIC COMPLEXITY    2        7        (2x)
  LINES OF CODE            750      200      (4x)
  FEWER INTERMEDIATE STATES
  NO INDEX-SPECIFIC LOGGING PROTOCOL
RUNTIME PERFORMANCE
  [Figure: bar chart comparing the disk-centric BwTree with the NVM-centric BzTree on the read-mostly and balanced workloads. Y-axis: throughput (M operations/sec), higher is better. Bar labels: 27M, 45M, 2x, 7M, 3M, 4x.]
  In addition to simplifying programming, BzTree also delivers better performance.
RECOVERY TIME (Lower is Better)
  BzTree: no recovery logic
    Recovery is entirely handled by the software primitive
    Rolls back operations that were in progress during the crash
  RECOVERY TIME: BWTREE ~5000 us, BZTREE 45 us (30x)
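The slide states that recovery is handled by the software primitive, which rolls back operations that were in progress at the crash. The sketch below shows one way that can work, under stated assumptions: each multi-word CAS leaves behind a persistent descriptor listing its old and new values plus a status, and recovery scans those descriptors. The names `Descriptor`, `recover`, `UNDECIDED`, and `SUCCEEDED` are hypothetical, and completing already-successful operations is an assumption of this sketch rather than something stated on the slide.

```python
UNDECIDED, SUCCEEDED = "UNDECIDED", "SUCCEEDED"


class Descriptor:
    """Hypothetical persistent record of one multi-word CAS."""

    def __init__(self, entries):
        self.entries = entries      # list of (location, old_value, new_value)
        self.status = UNDECIDED


def recover(nvm, descriptors):
    """Make every recorded operation all-or-nothing after a crash.

    Undecided operations are rolled back to their old values; operations that
    had already succeeded are completed with their new values (an assumption).
    Returns the number of operations rolled back.
    """
    rolled_back = 0
    for descriptor in descriptors:
        committed = descriptor.status == SUCCEEDED
        for location, old_value, new_value in descriptor.entries:
            nvm[location] = new_value if committed else old_value
        if not committed:
            rolled_back += 1
    return rolled_back


# Crash in the middle of a split: only the first of three words was written.
nvm = {
    0x100: "new child pointer",
    0x200: "old node status",
    0x300: "old parent pointer",
}
in_flight = Descriptor([
    (0x100, "old child pointer", "new child pointer"),
    (0x200, "old node status", "new node status"),
    (0x300, "old parent pointer", "new parent pointer"),
])

count = recover(nvm, [in_flight])
print(count, nvm[0x100])  # prints 1 old child pointer
```

Nothing in `recover` knows about tree nodes or splits, which is the point the slide makes: the index needs no recovery logic of its own.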
CONCLUSION
  NVM invalidates design assumptions in data structures
  Presented the design of an NVM-centric latch-free B+tree
  Importance of tailoring data structures for NVM
    Development cost
    Performance
    Recovery time