BzTree: A High-Performance Latch-free Range Index for Non-Volatile Memory

Joy Arulraj, Justin Levandoski, Umar Farooq Minhas, Per-Åke Larson
Microsoft Research

NON-VOLATILE MEMORY [NVM]
[Figure: DRAM, NVM, and SSD placed on two axes: performance (fast to slow) and durability (volatile to non-volatile).]

DEVICE CHARACTERISTICS
Characteristic         DRAM    NVM     SSD
Device latency         1x      10x     1000x
Byte-addressability    yes     yes     no
Durability             no      yes     yes
High capacity          no      yes     yes

BWTREE: LATCH-FREE B+TREE
[Figure: a B+tree whose nodes are updated with the CPU's single-word compare-and-swap instruction.]

BZTREE: NVM-CENTRIC LATCH-FREE B+TREE
[Figure: a latch-free B+tree stored in non-volatile memory.]

BWTREE INDEX | BZTREE INDEX | EXPERIMENTAL RESULTS

BWTREE: SSD-CENTRIC ARCHITECTURE
[Figure: a mapping table (page ID to address), the index, and the buffer pool in DRAM; a log-structured store on SSD.]

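BWTREE: LATCH-FREE ALGORITHMS
[Figure: the mapping table maps a page ID to an address. Delta records (DELETE 2, INSERT 3) are prepended to node P, which holds [1, 2], by updating the mapping-table entry with the CPU's single-word compare-and-swap instruction.]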
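
The delta-record scheme on this slide can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: `MappingTable`, `Delta`, `BaseNode`, `apply_delta`, and `lookup` are hypothetical names, and the `cas` method models the semantics of a single-word compare-and-swap in plain Python rather than issuing a real atomic instruction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseNode:
    keys: tuple  # keys stored in the consolidated node


@dataclass(frozen=True)
class Delta:
    op: str       # "insert" or "delete"
    key: int
    next: object  # the older Delta or the BaseNode this record sits on top of


class MappingTable:
    """Hypothetical model of a page-ID -> address table."""

    def __init__(self):
        self._slots = {}

    def install(self, page_id, node):
        self._slots[page_id] = node

    def read(self, page_id):
        return self._slots[page_id]

    def cas(self, page_id, expected, new):
        # Models single-word CAS: swap only if the slot still holds `expected`.
        if self._slots[page_id] is expected:
            self._slots[page_id] = new
            return True
        return False


def apply_delta(table, page_id, op, key):
    # Prepend a delta record; retry if another update won the race.
    while True:
        head = table.read(page_id)
        delta = Delta(op, key, head)
        if table.cas(page_id, head, delta):
            return delta


def lookup(table, page_id, key):
    # Walk the delta chain from newest to oldest, then fall back to the base node.
    node = table.read(page_id)
    while isinstance(node, Delta):
        if node.key == key:
            return node.op == "insert"
        node = node.next
    return key in node.keys


table = MappingTable()
table.install(0, BaseNode(keys=(1, 2)))  # node P holding [1, 2]
apply_delta(table, 0, "delete", 2)
apply_delta(table, 0, "insert", 3)
```

The base node is never modified in place; every update changes exactly one mapping-table word, which is what lets a single-word compare-and-swap suffice for simple inserts and deletes.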

BWTREE: LOGGING & RECOVERY PROTOCOL
[Figure: the buffer pool in DRAM; the index and the log on SSD. Example transaction: BEGIN TRANSACTION; Update Stock by Stock ID; COMMIT TRANSACTION.]
Index-specific logging & recovery

BWTREE: RECAP
Delivers high performance on a DRAM + SSD system
- SSD-centric architecture
- Latch-free algorithms
- Logging & recovery protocol
Limitations
- NVM invalidates the key design assumptions of BwTree
- Challenging to design & extend such latch-free data structures

PROBLEM #1: ALGORITHMIC COMPLEXITY
[Figure: splitting node AB into nodes A and B with the CPU's single-word compare-and-swap instruction, shown as a sequence of numbered steps.]
Latch-freedom exposes the intermediate states of the split.

PROBLEM #2: PROTOCOL COMPLEXITY
[Figure: the buffer pool, with the index and the log on NVM.]
Durability & atomicity rely on index-specific logging & recovery.

PROBLEM #3: ARCHITECTURAL COMPLEXITY
[Figure: a mapping table (page ID to address) used for location virtualization, the buffer pool, and the index on NVM.]

HOW CAN WE SIMPLIFY LATCH-FREE PROGRAMMING ON NON-VOLATILE MEMORY?

BZTREE: OVERVIEW
NVM-centric design
- Based on a new NVM-centric software primitive
- Provides the same guarantees as the disk-centric BwTree
BzTree supersedes BwTree (skipped BxTree and ByTree)
- Because we think that it is the last index you will ever need!
Key techniques
- Adopt a simpler NVM-centric architecture
- Reduce complexity using the software primitive

NVM-CENTRIC SOFTWARE PRIMITIVE
Hardware primitive (DRAM): volatile single-word compare-and-swap
Software primitive (NVM): persistent multi-word compare-and-swap
Reference: Easy Lock-Free Indexing in Non-Volatile Memory, ICDE 2018

BZTREE: NVM-CENTRIC ARCHITECTURE
[Figure: L1 cache, L2 cache, buffer pool, and NVM holding the index and the log, with persistent multi-word CAS as the update primitive. Example transaction: BEGIN TRANSACTION; Update Stock by Stock ID; COMMIT TRANSACTION.]

BZTREE: DURABILITY & ATOMICITY
Persistent multi-word CAS operation table, kept in NVM (a FLUSHED flag is tracked per entry):
Location    Expected old value     New value
0x100       old child pointer      new child pointer
0x200       old node status        new node status
0x300       old parent pointer     new parent pointer
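
The all-or-nothing behavior implied by the operation table can be modeled in a few lines. This is a minimal single-threaded sketch, not the primitive's actual implementation: `Descriptor`, `pmwcas`, the status names, and the dictionary standing in for NVM are all hypothetical, and cache-line flushing and concurrent helping are left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

UNDECIDED, SUCCEEDED, FAILED = "undecided", "succeeded", "failed"


@dataclass
class Descriptor:
    # One (location, expected old value, new value) triple per target word.
    entries: List[Tuple[int, object, object]]
    status: str = UNDECIDED


def pmwcas(memory: Dict[int, object], desc: Descriptor) -> bool:
    """Model of a multi-word CAS: every word changes, or none does."""
    installed = []
    # Phase 1: claim each target word by parking the descriptor in it,
    # but only if the word still holds the expected old value.
    for addr, expected, _new in desc.entries:
        if memory[addr] != expected:
            desc.status = FAILED
            break
        memory[addr] = desc
        installed.append(addr)
    else:
        desc.status = SUCCEEDED
    # Phase 2: replace the parked descriptor with the new value on success,
    # or restore the old value on failure.
    for addr, expected, new in desc.entries:
        if addr in installed:
            memory[addr] = new if desc.status == SUCCEEDED else expected
    return desc.status == SUCCEEDED


# Illustrative addresses and values only.
nvm = {0x100: "old child", 0x200: "old status", 0x300: "old parent"}
split = Descriptor([
    (0x100, "old child", "new child"),
    (0x200, "old status", "new status"),
    (0x300, "old parent", "new parent"),
])
ok = pmwcas(nvm, split)
```

Because the three words change together, readers never observe a state in which, for example, the child pointer is new but the parent pointer is still old.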

SOLUTION #1: ALGORITHMIC COMPLEXITY
[Figure: splitting node AB into nodes A and B with a persistent multi-word CAS.]
Exponentially fewer intermediate states

SOLUTION #2: PROTOCOL COMPLEXITY
[Figure: the persistent multi-word CAS operation table (locations 0x100, 0x200, 0x300 with old and new child pointer, node status, and parent pointer) in NVM, next to the index.]
No index-specific protocol is needed for durability & atomicity.

SOLUTION #3: ARCHITECTURAL COMPLEXITY
- No mapping table
- No delta records & indirection overhead
- No log-structured index store
The index resides directly in NVM.

EVALUATION
Index data structures: BzTree vs. BwTree index
- Code complexity
- Runtime performance
- Recovery time
Benchmark: Yahoo Cloud Serving Benchmark
- Read-mostly & balanced workloads
Storage device
- Emulated non-volatile memory

CODE COMPLEXITY (lower is better)
Metric                   BwTree    BzTree    Reduction
Cyclomatic complexity    2         7         2x
Lines of code            750       200       4x
Reasons: fewer intermediate states; no index-specific logging protocol.

RUNTIME PERFORMANCE
In addition to simplifying programming, BzTree also delivers better performance.
[Figure: bar chart of throughput (M operations/sec, higher is better) for the disk-centric BwTree and the NVM-centric BzTree. Bar labels as transcribed: read-mostly workload 27M and 45M (2x); balanced workload 7M and 3M (4x).]

RECOVERY TIME (lower is better)
BzTree: no recovery logic
- Recovery is entirely handled by the software primitive
- Rolls back operations that were in progress during the crash
Recovery time: BwTree ~5000 us, BzTree 45 us (30x)
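
The idea that recovery lives entirely in the primitive can be sketched as a scan over leftover operation descriptors. This is a hedged single-threaded model: `Descriptor`, `recover`, the descriptor pool, and the dictionary standing in for NVM are hypothetical. The roll-back branch follows the slide; the roll-forward branch for operations that had already reached a succeeded status is an assumption about the primitive and is not stated on the slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

UNDECIDED, SUCCEEDED = "undecided", "succeeded"


@dataclass(eq=False)
class Descriptor:
    # One (location, expected old value, new value) triple per target word.
    entries: List[Tuple[int, str, str]]
    status: str = UNDECIDED


def recover(memory: Dict[int, object], descriptor_pool: List[Descriptor]) -> None:
    """Index-agnostic recovery: resolve every word still claimed by a descriptor."""
    for desc in descriptor_pool:
        for addr, expected, new in desc.entries:
            if memory[addr] is desc:
                # In-progress operation: restore the old value.
                # Already-decided success (assumed case): finish the update.
                memory[addr] = new if desc.status == SUCCEEDED else expected
    descriptor_pool.clear()


# Crash scenario: the operation had claimed 0x100 but not yet 0x200.
in_flight = Descriptor([
    (0x100, "old child", "new child"),
    (0x200, "old status", "new status"),
])
nvm = {0x100: in_flight, 0x200: "old status"}
pool = [in_flight]
recover(nvm, pool)
```

Nothing in `recover` knows about tree nodes or splits, which mirrors the slide's point that the index itself carries no recovery logic.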

CONCLUSION
- NVM invalidates design assumptions in data structures
- Presented the design of an NVM-centric latch-free B+tree
- Importance of tailoring data structures for NVM
Dimensions compared: development cost, performance, recovery time