Deukyeon Hwang UNIST. Wook-Hee Kim UNIST. Beomseok Nam UNIST. Hanyang Univ.

Similar documents
Endurable Transient Inconsistency in Byte- Addressable Persistent B+-Tree

WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems

NV-Tree Reducing Consistency Cost for NVM-based Single Level Systems

Dalí: A Periodically Persistent Hash Map

SLM-DB: Single-Level Key-Value Store with Persistent Memory

Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory

CSE 530A ACID. Washington University Fall 2013

Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory

Write-Optimized Dynamic Hashing for Persistent Memory

STORAGE LATENCY x. RAMAC 350 (600 ms) NAND SSD (60 us)

BzTree: A High-Performance Latch-free Range Index for Non-Volatile Memory

System Software for Persistent Memory

Failure-Atomic Slotted Paging for Persistent Memory

Blurred Persistence in Transactional Persistent Memory

Main Memory and the CPU Cache

NVWAL: Exploiting NVRAM in Write-Ahead Logging

SAY-Go: Towards Transparent and Seamless Storage-As-You-Go with Persistent Memory

Lecture 18: Reliable Storage

Hardware Support for NVM Programming

Heckaton. SQL Server's Memory Optimized OLTP Engine

High Performance Transactions in Deuteronomy

Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems

Main-Memory Databases 1 / 25

Google File System. Arun Sundaram Operating Systems

Soft Updates Made Simple and Fast on Non-volatile Memory

File Systems: Consistency Issues

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability

Module 4: Index Structures Lecture 13: Index structure. The Lecture Contains: Index structure. Binary search tree (BST) B-tree. B+-tree.

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

CSE506: Operating Systems CSE 506: Operating Systems

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

Loose-Ordering Consistency for Persistent Memory

TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa

Transaction Management & Concurrency Control. CS 377: Database Systems

CSE 544 Principles of Database Management Systems

Operating Systems. File Systems. Thomas Ropars.

NVthreads: Practical Persistence for Multi-threaded Applications

Foster B-Trees. Lucas Lersch. M. Sc. Caetano Sauer Advisor

Log-Free Concurrent Data Structures

TRANSACTION PROPERTIES

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

No compromises: distributed transactions with consistency, availability, and performance

JOURNALING techniques have been widely used in modern

CS5460: Operating Systems

CHAPTER 3 RECOVERY & CONCURRENCY ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI

Caching and reliability

CS848 Paper Presentation Building a Database on S3. Brantner, Florescu, Graf, Kossmann, Kraska SIGMOD 2008

COS 318: Operating Systems. NSF, Snapshot, Dedup and Review

The transaction. Defining properties of transactions. Failures in complex systems propagate. Concurrency Control, Locking, and Recovery

Percona Live September 21-23, 2015 Mövenpick Hotel Amsterdam

Announcements. Persistence: Crash Consistency

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class. Today s Class. Faloutsos/Pavlo CMU /615

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

NPTEL Course Jan K. Gopinath Indian Institute of Science

CLOUD-SCALE FILE SYSTEMS

The Google File System

HashKV: Enabling Efficient Updates in KV Storage via Hashing

The Google File System

User Perspective. Module III: System Perspective. Module III: Topics Covered. Module III Overview of Storage Structures, QP, and TM

SoftWrAP: A Lightweight Framework for Transactional Support of Storage Class Memory

CSE 530A. B+ Trees. Washington University Fall 2013

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

Main Points. File System Reliability (part 2) Last Time: File System Reliability. Reliability Approach #1: Careful Ordering 11/26/12

COS 318: Operating Systems. Journaling, NFS and WAFL

Non-Volatile Memory Through Customized Key-Value Stores

TRANSACTION MANAGEMENT

Database Security: Transactions, Access Control, and SQL Injection

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Big and Fast. Anti-Caching in OLTP Systems. Justin DeBrabant

CA485 Ray Walshe Google File System

arxiv: v1 [cs.dc] 3 Jan 2019

! Design constraints. " Component failures are the norm. " Files are huge by traditional standards. ! POSIX-like

C 1. Recap. CSE 486/586 Distributed Systems Distributed File Systems. Traditional Distributed File Systems. Local File Systems.

Outline. Failure Types

Windows Persistent Memory Support

Transactions & Update Correctness. April 11, 2018

The Google File System

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores

EECS 482 Introduction to Operating Systems

Introduction. Storage Failure Recovery Logging Undo Logging Redo Logging ARIES

FAULT TOLERANT SYSTEMS

Push-button verification of Files Systems via Crash Refinement

Lecture 21: Logging Schemes /645 Database Systems (Fall 2017) Carnegie Mellon University Prof. Andy Pavlo

CSC 261/461 Database Systems Lecture 20. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals

A transaction is a sequence of one or more processing steps. It refers to database objects such as tables, views, joins and so forth.

Crash Consistency: FSCK and Journaling. Dongkun Shin, SKKU

Goal of Concurrency Control. Concurrency Control. Example. Solution 1. Solution 2. Solution 3

Lectures 8 & 9. Lectures 7 & 8: Transactions

Lock Tuning. Concurrency Control Goals. Trade-off between correctness and performance. Correctness goals. Performance goals.

The Google File System

HDF5 Single Writer/Multiple Reader Feature Design and Semantics

Advanced Databases. Transactions. Nikolaus Augsten. FB Computerwissenschaften Universität Salzburg

GFS: The Google File System. Dr. Yingwu Zhu

Advanced file systems: LFS and Soft Updates. Ken Birman (based on slides by Ben Atkin)

Fine-grained Metadata Journaling on NVM

Concurrent Preliminaries

Database System Concepts

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Transcription:

Deukyeon Hwang UNIST Wook-Hee Kim UNIST Youjip Won Hanyang Univ. Beomseok Nam UNIST

Fast but Asymmetric Access Latency Non-Volatility Byte-Addressability Large Capacity

CPU Caches (Volatile) Persistent Memory (Non-Volatile) 10 20 30 40 cache line

CPU Caches (Volatile) Persistent Memory (Non-Volatile) 40 10 20 30 40 cache line

CPU Caches (Volatile) Persistent Memory (Non-Volatile) 30 30 40 10 20 30 40 cache line

CPU Caches (Volatile) Persistent Memory (Non-Volatile) 30 30 40 10 20 30 40 cache line FLUSH

CPU Caches (Volatile) Persistent Memory (Non-Volatile) 40 10 20 30 30 40 cache line

CPU Caches (Volatile) Persistent Memory (Non-Volatile) LOST 40! 10 20 30 30 40 cache line

Inserting 25 into a node (0) 10 20 30 40 (1) 10 20 30 40 40 Partially updated tree node is inconsistent à (2) 10 20 30 30 40 Append-Only Update (3) 10 20 25 30 40

Node Split Node A 10 20 30 P1 P2 P3 ʌ Node B 40 60 P4 P6 ʌ Logging à Selective Persistence (Internal node in DRAM)

Append-Only Unsorted keys Selective Persistence Internal node à DRAM Internal nodes have to be reconstructed from leaf nodes after failures Logging for leaf nodes Previous solutions NV-Tree [FAST 15] wb+-tree [VLDB 15] FP-Tree [SIGMOD 16] Append-Only leaf update + Selective Persistence Append-Only node update + bitmap/slot array metadata Append-Only leaf update + fingerprints + Selective Persistence

Append-Only (Unsorted keys) Selective Persistence (DRAM + PM) Failure-Atomic ShifT (FAST) Failure-Atomic In-place Rebalancing (FAIR) Lock-Free Search

Modern processors reorder instructions to utilize the memory bandwidth Memory ordering in x86 and ARM x86 ARM stores-after-stores Y N stores-after-loads N N loads-after-stores N N loads-after-loads N N Inst. w/ dependency Y Y x86 guarantees Total Store Ordering (TSO) Dependent instructions are not reordered

Pointers in B+-Tree store unique memory addresses 8-byte pointer can be atomically updated Read transactions detect transient inconsistency between duplicate pointers transient inconsistency In-flight state partially updated by a write transaction 10 20 30 40 40 P1 P2 P3 P4 P5 P5

10 20 30 40 P1 P2 P3 P4 P5 P5 mfence(); TSO 10 20 30 40 40 P1 P2 P3 P4 P5 P5

Insert (25, P6) into a node using FAST 10 20 30 40 g P1 P2 P3 P4 P5 ʌ g ʌ g: Garbage ʌ: Null Read transactions can succeed in finding a key even if a system crashes in any step

Insert (25, P6) into a node using FAST 10 20 30 40 g g P1 P2 P3 P4 P5 P5 ʌ

Insert (25, P6) into a node using FAST 10 20 30 40 40 g P1 P2 P3 P4 P5 P5 ʌ

Insert (25, P6) into a node using FAST 10 20 30 40 40 g P1 P2 P3 P4 P5 P5 ʌ

Insert (25, P6) into a node using FAST read transaction 10 20 30 40 40 g P1 P2 P3 P4 P5 P5 ʌ Key 40 between duplicate pointers is ignored!

Insert (25, P6) into a node using FAST 10 20 30 40 40 g P1 P2 P3 P4 P4 P5 ʌ Shifting P4 invalidates the left 40

Insert (25, P6) into a node using FAST 10 20 30 30 40 g P1 P2 P3 P4 P4 P5 ʌ

Insert (25, P6) into a node using FAST 10 20 30 30 40 g P1 P2 P3 P3 P4 P5 ʌ

Insert (25, P6) into a node using FAST 10 20 25 30 40 g P1 P2 P3 P3 P4 P5 ʌ

Insert (25, P6) into a node using FAST 10 20 25 30 40 g P1 P2 P3 P6 P4 P5 ʌ Storing P6 validates 25

It is necessary to call clflush at the boundary of cache line 10 20 30 40 g g P1 P2 P3 P4 P5 ʌ ʌ Cache Line 1 Cache Line 2 10 20 30 30 40 g P1 P2 P3 P3 P4 P5 Cache Line 1 Cache Line 2 ʌ mfence() clflush( Cache Line 2 ) mfence()

Let s avoid expensive logging by making read transactions be aware of rebalancing operations B link -Tree 10 20 30 40 70 80 90

FAIR split a node Node A 10 20 30 40 60 P1 P2 P3 P4 P6 ʌ Node B 40 60 P4 P6 ʌ A read transaction can detect transient inconsistency if keys are out of order

FAIR split a node Node A 10 20 30 P1 P2 P3 ʌ Node B 40 60 P4 P6 ʌ Setting NULL pointer validates Node B. Node A and Node B are virtually a single node

FAIR split a node Node A 10 20 30 P1 P2 P3 ʌ Node B 40 60 P4 P6 ʌ Migrated keys can be accessed via sibling pointer

FAIR split a node Node A 10 20 30 Node B 40 50 60 P1 P2 P3 ʌ P4 P5 P6 ʌ

Insert a key into the parent node using FAST after FAIR split root Node R 10 70 70 C2 C3 C3 Node A Node B Node C 10 20 30 40 50 60 70 80 90

Insert a key into the parent node using FAST after FAIR split root Node R 10 70 70 C2 C2 C3 Node A Node B Node C 10 20 30 40 50 60 70 80 90 Node B can be accessed from Node A

Insert a key into the parent node using FAST after FAIR split Ø Searching the key 50 from the root after a system crash root Node R 10 70 70 C2 C2 C3 key accessed by read transaction Node A Node B Node C 10 20 30 40 50 60 70 80 90 Node B can be accessed from Node A

Insert a key into the parent node using FAST after FAIR split Ø Searching the key 50 from the root after a system crash root Node R 10 70 70 C2 C2 C3 key accessed by read transaction Node A Node B Node C 10 20 30 40 50 60 70 80 90 Node B can be accessed from Node A

Insert a key into the parent node using FAST after FAIR split Ø Searching the key 50 from the root after a system crash root Node R 10 70 70 C2 C2 C3 key accessed by read transaction Node A Node B Node C 10 20 30 40 50 60 70 80 90 Node B can be accessed from Node A

Insert a key into the parent node using FAST after FAIR split Ø Searching the key 50 from the root after a system crash root Node R 10 70 70 C2 C2 C3 key accessed by read transaction Node A Node B Node C 10 20 30 40 50 60 70 80 90 Node B can be accessed from Node A

Insert a key into the parent node using FAST after FAIR split root Node R 10 40 70 C2 C4 C3 Node A Node B Node C 10 20 30 40 50 60 70 80 90 FAST inserting makes Node B visible atomically

Read transactions can tolerate any inconsistency caused by write transactions à à Read transactions can access the transient inconsistent tree node being modified by a write transaction Lock-Free Search

[Example 1] Searching 30 while inserting (15, P6) Read transaction Write transaction read à 10 20 30 40 g g shift à P1 P2 P3 P4 P5 ʌ ʌ

[Example 1] Searching 30 while inserting (15, P6) Read transaction Write transaction read à 10 20 30 40 g g shift à P1 P2 P3 P4 P5 P5 ʌ

[Example 1] Searching 30 while inserting (15, P6) Read transaction Write transaction read à 10 20 30 40 40 g shift à P1 P2 P3 P4 P5 P5 ʌ

[Example 1] Searching 30 while inserting (15, P6) Read transaction Write transaction read à 10 20 30 40 40 g shift à P1 P2 P3 P4 P4 P5 ʌ

[Example 1] Searching 30 while inserting (15, P6) Read transaction Write transaction read à 10 20 30 30 40 g shift à P1 P2 P3 P4 P4 P5 ʌ

[Example 1] Searching 30 while inserting (15, P6) Read transaction Write transaction read à 10 20 30 30 40 g shift à P1 P2 P3 P3 P4 P5 ʌ

[Example 1] Searching 30 while inserting (15, P6) Read transaction Write transaction read à 10 20 20 30 40 g shift à P1 P2 P3 P3 P4 P5 ʌ

[Example 1] Searching 30 while inserting (15, P6) Read transaction Write transaction read à 10 20 20 30 40 g shift à P1 P2 P2 P3 P4 P5 ʌ

[Example 1] Searching 30 while inserting (15, P6) read à FOUND! Read transaction Write transaction 10 20 20 30 40 g shift à P1 P2 P2 P3 P4 P5 ʌ

[Example 2] Searching 30 while deleting (20, P2) Read transaction Write transaction read à 10 20 30 40 g P1 P2 P3 P4 P5 ʌ g ʌ ß shift

[Example 2] Searching 30 while deleting (20, P2) Read transaction Write transaction read à 10 20 30 40 g P1 P3 P3 P4 P5 ʌ g ʌ ß shift

[Example 2] Searching 30 while deleting (20, P2) Read transaction Write transaction read à 10 30 30 40 g P1 P3 P3 P4 P5 ʌ g ʌ ß shift

[Example 2] Searching 30 while deleting (20, P2) Read transaction Write transaction read à 10 30 30 40 g P1 P3 P3 P4 P5 ʌ g ʌ ß shift

[Example 2] Searching 30 while deleting (20, P2) Read transaction Write transaction read à 10 30 30 40 g P1 P3 P3 P4 P5 ʌ g ʌ ß shift

[Example 2] Searching 30 while deleting (20, P2) Read transaction Write transaction read à 10 30 30 40 g P1 P3 P3 P4 P5 ʌ g ʌ ß shift

[Example 2] Searching 30 while deleting (20, P2) Read transaction Write transaction read à 10 30 30 40 g P1 P3 P4 P4 P5 ʌ g ʌ ß shift

[Example 2] Searching 30 while deleting (20, P2) Read transaction Write transaction read à 10 30 40 40 g P1 P3 P4 P4 P5 ʌ g ʌ ß shift

[Example 2] Searching 30 while deleting (20, P2) Read transaction Write transaction read à 10 30 40 40 g P1 P3 P4 P5 P5 ʌ g ʌ ß shift

[Example 2] Searching 30 while deleting (20, P2) Read transaction Write transaction read à 10 30 40 40 g P1 P3 P4 P5 P5 ʌ g ʌ ß shift

[Example 2] Searching 30 while deleting (20, P2) read à 30 NOT FOUND Read transaction Write transaction 10 30 40 40 g P1 P3 P4 P5 P5 ʌ g ʌ ß shift The read transaction cannot find the key 30 due to shift operation

Direction flag: Even Number Insertion shifts to the right. Search must scan from Left to Right Odd Number Deletion shifts to the left. Search must scan from Right to Left Search 40 read à counter 2 10 20 30 40 g g P1 P2 P3 P4 P5 ʌ ʌ Insert 25 shift à

Direction flag: Even Number Insertion shifts to the right. Search must scan from Left to Right Odd Number Deletion shifts to the left. Search must scan from Right to Left Search 40 ß read counter 3 10 20 30 40 g g P1 P2 P3 P4 P5 ʌ ʌ Delete 25 ß shift

Direction flag: Even Number Insertion shifts to the right. Search must scan from Left to Right Odd Number Deletion shifts to the left. Search must scan from Right to Left Search 40 read à counter 32 10 20 30 40 g g P1 P2 P3 P4 P5 ʌ ʌ Delete 25 ß shift The read transaction has to check the counter once again to make sure the counter has not changed. Otherwise, search the node again.

Transaction A Transaction B BEGIN INSERT 10 SUSPENDED BEGIN SEARCH 10(FOUND) COMMIT WAKE UP ABORT Dirty reads problem The ordering of Transaction A and Transaction B cannot be determined

Isolation Level Highest Serializable Repeatable reads Read committed Lowest Read uncommitted Our Lock-Free Search supports low isolation level

Lock Contention High 10 50 70 90 Root...... Lock-Free Search... Low 1 10 13 40... 99 150 160 Leaf For higher isolation level, read lock is necessary for leaf nodes

Xeon Haswell-Ex E7-4809 v3 processors 2.0 GHz, 16 vcpus with hyper-threading enabled, and 20 MB L3 cache Total Store Ordering (TSO) is guaranteed g++ 4.8.2 with -O3 PM latency Read latency A DRAM-based PM latency emulator, Quartz Write latency Injecting delay

Sorted keys, cache locality, and memory level parallelism à up to 20X speed up

FAST+FAIRà FP-Tree à wb+-tree à WORT à Skiplist

clflush: I/O time Search: Tree traversal time Node Update: Computation time WORT, FAST+FAIR, FP-Tree à FAST+Logging à wb+-tree à Skiplist FAST+Logging uses logging instead of FAIR when splitting a node

New Order Paymen t Order Status Delivery Stock Level W1 34% 43% 5% 4% 14% W2 27% 43% 15% 4% 11% W3 20% 43% 25% 4% 8% W4 13% 43% 35% 4% 5% Specification of TPCC workloads More Range Queries FAST+FAIR consistently outperforms other indexes because of its good insertion performance and superior range query performance

(a) 50M Search (b) 50M Insertion (c) 200M Search / 50M Insertion / 12.5M Deletion Lock-free search with FAST+FAIR shows high scalability and performance FAST+FAIR+LeafLock shows comparable scalability and provides high concurrency level

We designed a byte addressable persistent B+-Tree that stores keys in order avoids expensive logging FAST and FAIR always transform B+-Trees into consistent/transient inconsistent B+-Trees Lock-Free search By tolerating transient inconsistency

Deukyeon Hwang UNIST Wook-Hee Kim UNIST Youjip Won Hanyang Univ. Beomseok Nam UNIST

To guarantee the order of instructions, the dmb instruction is used for FAST+FAIR Although there is an overhead by dmb, FAST+FAIR is less affected by latency