NVthreads: Practical Persistence for Multi-threaded Applications


Terry Hsu*, Purdue University
Helge Brügner*, TU München
Indrajit Roy*, Google Inc.
Kimberly Keeton, Hewlett Packard Labs
Patrick Eugster, TU Darmstadt and Purdue University
* Work was done at Hewlett Packard Labs.

NVMW 2018. NVthreads was published in EuroSys 2017.
This work was supported by Hewlett Packard Labs, NSF TC-1117065, NSF TWC-1421910, and ERC FP7-617805.

What is non-volatile memory (NVM)?

Key features: persistence, good performance, byte addressability
- Persistence: retains data without power
- Good performance: outperforms the traditional filesystem interface
- Byte addressability: allows pure memory operations

Programming interfaces for NVM

NVM-aware filesystems: BPFS, PMFS, PMEM
- Pro: provide good performance
- Con: require applications to use file-system interfaces and may need hardware modifications

Durable transactions and heaps: NV-Heaps, Mnemosyne
- Pro: allow fine-grained NVM access
- Con: force programs to use transactions and require non-trivial effort to retrofit transactions into lock-based programs

Problem: Can we provide a simpler programming interface?

NVM-aware apps programming

The slide builds up one example in four steps: appending an element to a list that lives in NVM. The pseudocode below is the final build, with the manual logging and flushing steps in angle brackets.

    # Add element to the tail of list
    pthread_lock(&m);
    malloc(&e, sizeof(*e));
    <save old value of e->value>
    <flush log entry to NVM>
    e->value = 5;
    <save old value of e->next>
    <flush log entry to NVM>
    e->next = NULL;
    <save old value of head->next>
    <flush log entry to NVM>
    head->next = e;    // crash
    <save old value of tail>
    <flush log entry to NVM>
    tail = e;
    pthread_unlock(&m);

[Figure: NVM holding head, the new element e with value 5 and next = NULL, and tail.]

Challenges:
1. Data consistency: a crash after head->next = e but before tail = e leaves the list in NVM half-updated.
2. Programmability: the programmer has to save the old value before every store.
3. Volatile caches: every log entry has to be flushed from the cache to NVM.
4. Performance: all of this cache flushing is expensive.
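The manual pattern on this slide can be sketched in plain C. This is a minimal, single-threaded illustration and not code from the talk: `persist()` is a hypothetical placeholder for a cache-line flush plus fence, the undo log is an ordinary array standing in for an NVM-resident log, locking is left out, and only the two pointer stores that make the node reachable are logged (the slide also logs the new node's own fields).

```c
#include <stddef.h>
#include <stdlib.h>

struct node { int value; struct node *next; };
struct list { struct node *head, *tail; };

/* One undo record: where a pointer lived and what it held before the store. */
struct undo_entry { struct node **addr; struct node *old; };

#define UNDO_CAP 8
static struct undo_entry undo_log[UNDO_CAP]; /* stand-in for a log kept in NVM */
static size_t undo_len;

/* Hypothetical placeholder for "flush to NVM"; does nothing in this sketch. */
static void persist(const void *p, size_t n) { (void)p; (void)n; }

/* Save the old value, flush the log entry, then perform the store. */
static void log_and_set(struct node **addr, struct node *new_val) {
    undo_log[undo_len].addr = addr;
    undo_log[undo_len].old = *addr;
    persist(&undo_log[undo_len], sizeof undo_log[0]);
    undo_len++;
    *addr = new_val;
    persist(addr, sizeof *addr);
}

/* Append a node. If crash_midway is nonzero, return after the first logged
 * store to mimic a crash. Returns 0 when done, 1 when "crashed", -1 on
 * allocation failure. */
int list_append(struct list *l, int value, int crash_midway) {
    struct node *e = malloc(sizeof *e);
    if (e == NULL) return -1;
    e->value = value;
    e->next = NULL;
    persist(e, sizeof *e);
    undo_len = 0;
    log_and_set(l->tail ? &l->tail->next : &l->head, e);
    if (crash_midway) return 1;      /* list is now half-updated */
    log_and_set(&l->tail, e);
    undo_len = 0;                    /* commit: the undo log is discarded */
    return 0;
}

/* Recovery: undo the logged stores in reverse order. */
void list_recover(void) {
    while (undo_len > 0) {
        undo_len--;
        *undo_log[undo_len].addr = undo_log[undo_len].old;
        persist(undo_log[undo_len].addr, sizeof(struct node *));
    }
}
```

Calling `list_append(&l, 2, 1)` and then `list_recover()` leaves the list exactly as it was before the call (the abandoned node is simply leaked in this sketch). Even in this small form, every store needs its own logging call, which is the programmability cost the slide points at.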

Challenges of using NVM

- Data consistency: ensure data consistency even after a crash
- Volatile caches: manage data movement from volatile caches to NVM
- Programmability: avoid extensive program modifications
- Performance: minimize runtime overhead

Proposal: NVthreads, a programming model and runtime that adds persistence to multi-threaded C/C++ programs

Goals of NVthreads

- Make existing lock-based C/C++ applications crash tolerant
- Minimize porting effort
  - Drop-in replacement for the pthreads library
  - No need for transactions
- Advantages of NVthreads
  - Good performance
  - Easier to develop NVM-aware applications

Key ideas

- Use synchronization points to infer consistent regions (cf. Atlas [OOPSLA '14])
  - Does not require applications to use transactions
- Execute a multithreaded program as a multi-process program (cf. DThreads [SOSP '11])
  - Process memory buffers uncommitted writes
- Track data modifications at page granularity
  - Amortizes logging overhead compared with fine-grained tracking

Using NVthreads

Ease of use:

    bash$ gcc foo.c -o foo.out -rdynamic libnvthread.so -ldl

Modifications:
- Add recovery code and specify persistent allocations
  - Allocate data in NVM: nvmalloc()
  - Recover data in NVM: nvrecover()
- Link to the NVthreads library

System stack, from user space down to hardware:
- Unmodified C/C++ application (user space)
- NVthreads library (user space): multi-process execution, intercepting synchronization, tracking data, maintaining the log
- Operating system (kernel space): memory allocation and file system interface for both DRAM and NVM
- Hardware
  - DRAM: volatile main memory, e.g., stacks
  - NVM: persistent regions, e.g., a linked list on the heap

NVthreads: programming model

    void main(){
        if( crashed() ){
            // Application-specific recovery code. The programmer needs to add it.
            int *c = (int*) nvmalloc(sizeof(int), "c");
            *c = nvrecover(c, sizeof(int), "c");
        }
        else{ // normal execution
            int *c = (int*) nvmalloc(sizeof(int), "c");
            ... // thread creation
            m.lock()
            *c = *c+1;
            ...
            m.unlock()
        }
    }

- Locks mark the boundary of a durable code section.
- The crashed() branch is application-specific recovery code that the programmer needs to add.
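The idea of a named persistent allocation that is reattached after a restart can be tried without NVM or the NVthreads library. The sketch below is an assumption-laden stand-in, not the NVthreads API: `demo_named_region()` and `demo_increment()` are hypothetical names, the "NVM" is an ordinary file mapped with POSIX `mmap(MAP_SHARED)`, and there is no locking or logging, so it shows only the allocate-by-name and recover-by-name pattern from the slide.

```c
#define _DEFAULT_SOURCE 1
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not the NVthreads API): map `size` bytes of the file
 * at `path`, creating it zero-filled if absent. *recovered is set to 1 when
 * the file already existed, i.e. when earlier state is being reattached. */
void *demo_named_region(const char *path, size_t size, int *recovered) {
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    *recovered = 0;
    if (fd < 0 && errno == EEXIST) {
        fd = open(path, O_RDWR);
        *recovered = 1;
    }
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)size) != 0) { close(fd); return NULL; }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}

/* Mirrors the slide's counter: reattach or create the region, increment it,
 * force it to stable storage, and return the new value (-1 on error). */
int demo_increment(const char *path) {
    int recovered;
    int *c = demo_named_region(path, sizeof *c, &recovered);
    if (c == NULL) return -1;
    *c = *c + 1;                      /* a freshly created region starts at 0 */
    int value = *c;
    msync(c, sizeof *c, MS_SYNC);
    munmap(c, sizeof *c);
    return value;
}
```

Each call to `demo_increment("/tmp/demo_counter.bin")` behaves like one run of the program: the first run creates the region, later runs find the previous value and continue from it. What this stand-in does not give is atomicity of a multi-store critical section, which is the part NVthreads adds.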

Example: linked list

NVthreads guarantees that the linked list is atomically appended w.r.t. failures.

    # L is a persistent list
    Start threads {T1, T2, T3}

    # Add element to the tail of list
    pthread_lock(&m);
    nvmalloc(&e, sizeof(*e));
    e->val = localval;
    tail->next = e;
    e->prev = tail;    // crash!
    tail = e;
    pthread_unlock(&m)

[Figure: timeline of T1, T2, and T3, each running a critical section that adds one element (e1, e2, e3), followed by a recovery phase that executes the redo operations. The state of the list data structure L in NVM is shown as L={}, then L={e1}, then L={e1, e2}.]

Implementing atomic durability

- Convert threads to processes (cf. DThreads [SOSP '11])
  - Each process works on private memory, so no undo log is needed
  - [Figure: a shared address space becomes disjoint address spaces]
- At synchronization points, propagate private updates and execute processes sequentially
- Track dirty pages and log them to NVM for recovery
  - Apply the redo log in the event of a crash
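The effect of running a thread as a process with private memory can be seen with standard POSIX calls. The following is a small sketch, not NVthreads code: the function name is hypothetical, and it only demonstrates that a write made by a forked child to a `MAP_PRIVATE` mapping stays in the child's copy until something explicitly propagates it.

```c
#define _DEFAULT_SOURCE 1
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: the parent stores 1, a child process overwrites its own
 * copy with 42 and exits. Returns the value the parent still sees, or -1 on
 * error. */
int parent_view_after_child_write(void) {
    int *x = mmap(NULL, sizeof *x, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (x == MAP_FAILED) return -1;
    *x = 1;

    pid_t pid = fork();
    if (pid < 0) { munmap(x, sizeof *x); return -1; }
    if (pid == 0) {
        *x = 42;      /* copy-on-write: only the child's page changes */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) { munmap(x, sizeof *x); return -1; }
    int seen = *x;    /* the child's uncommitted write never reached us */
    munmap(x, sizeof *x);
    return seen;
}
```

Because the child's store never touches the parent's copy, a crash in the middle of the child's work leaves nothing to roll back, which is why the slide can say that no undo log is needed.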

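From threads to processes

[Figure: timeline of T1 and T2 across a parallel phase, a critical section, and a second parallel phase. Each process is either executing or waiting.]
- T1: start tracking dirty pages, write the NVM log, stop tracking, merge into the shared state, pass the token
- T2: wait for the token, then go through the same steps: track dirty pages, write the NVM log, stop, merge into the shared state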
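The slides do not say how dirty pages are detected. One common user-space technique, shown here purely as an assumption, is to write-protect the region with `mprotect()` and record the page in a fault handler on the first store. All names below are hypothetical, the sketch targets Linux-like systems, and calling `mprotect()` from a signal handler is not guaranteed by POSIX even though it works in practice there.

```c
#define _DEFAULT_SOURCE 1
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 4

static unsigned char *region;                 /* the tracked memory */
static size_t page_bytes;
static volatile sig_atomic_t dirty[NPAGES];   /* one flag per page */

/* First write to a protected page lands here: mark it dirty, then unprotect
 * it so the faulting store is retried and succeeds. */
static void on_write_fault(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;
    if (addr < base || addr >= base + NPAGES * page_bytes)
        _exit(1);                             /* a genuine, unrelated fault */
    size_t page = (addr - base) / page_bytes;
    dirty[page] = 1;
    mprotect(region + page * page_bytes, page_bytes, PROT_READ | PROT_WRITE);
}

/* Map NPAGES read-only pages and install the handler. Returns 0 on success. */
int tracking_begin(void) {
    struct sigaction sa;
    page_bytes = (size_t)sysconf(_SC_PAGESIZE);
    region = mmap(NULL, NPAGES * page_bytes, PROT_READ,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) return -1;
    return sigaction(SIGBUS, &sa, NULL);      /* some systems raise SIGBUS */
}

void tracked_store(size_t offset, unsigned char value) {
    ((volatile unsigned char *)region)[offset] = value;
}

unsigned char tracked_load(size_t offset) { return region[offset]; }
int page_is_dirty(size_t page) { return dirty[page]; }
size_t tracked_page_bytes(void) { return page_bytes; }
```

After `tracking_begin()`, a call such as `tracked_store(2 * tracked_page_bytes() + 5, 7)` faults once, sets the flag for page 2, and completes. Later stores to the same page run at full speed, so the cost is one fault per dirtied page rather than one hook per store.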

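Redo logging

[Figure: T1's pages during the parallel phase and the critical section, each marked as a clean page or a dirtied page, next to the shared state, the redo log, and NVM.]
Steps shown for T1:
1. Log dirty pages to the redo log
2. Merge updated bytes into the shared state
3. Write back to NVM
4. sync()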
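A compact way to see how a byte-level merge and a page-level redo log fit together is the sketch below. It is an illustration under several assumptions rather than the NVthreads implementation: the "twin" (a snapshot of the page taken before the process wrote to it) is borrowed from DThreads-style diffing, the log and the NVM image are plain arrays, all names are hypothetical, and the sketch logs the page after merging, whereas the slide lists logging first.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096
#define LOG_CAP 4

struct redo_entry { size_t page_no; unsigned char bytes[PAGE_BYTES]; };
struct redo_log { struct redo_entry entry[LOG_CAP]; size_t len; int committed; };

/* Copy into `shared` only the bytes this process changed, found by comparing
 * its private page against the twin snapshot. Bytes that other processes
 * committed in the meantime are left alone. */
void merge_updated_bytes(unsigned char *shared, const unsigned char *twin,
                         const unsigned char *priv) {
    for (size_t i = 0; i < PAGE_BYTES; i++)
        if (priv[i] != twin[i])
            shared[i] = priv[i];
}

/* Append a whole page to the redo log. Returns 0, or -1 if the log is full. */
int redo_log_page(struct redo_log *log, size_t page_no,
                  const unsigned char *page) {
    if (log->len == LOG_CAP) return -1;
    log->entry[log->len].page_no = page_no;
    memcpy(log->entry[log->len].bytes, page, PAGE_BYTES);
    log->len++;
    return 0;
}

/* Mark the logged pages as complete; only committed logs are replayed. */
void redo_commit(struct redo_log *log) { log->committed = 1; }

/* Recovery: replay a committed log onto the NVM image and discard an
 * uncommitted one. Returns the number of pages replayed. */
size_t redo_recover(struct redo_log *log, unsigned char (*nvm)[PAGE_BYTES]) {
    size_t replayed = 0;
    if (log->committed)
        for (; replayed < log->len; replayed++)
            memcpy(nvm[log->entry[replayed].page_no],
                   log->entry[replayed].bytes, PAGE_BYTES);
    log->len = 0;
    log->committed = 0;
    return replayed;
}
```

A crash before `redo_commit()` leaves the NVM image untouched, and a crash after it can be repaired by replaying whole pages. Neither case needs old values, which matches the redo-only recovery described in the talk.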

Tracking data dependencies

[Figure: initially X=Y=0. T1 sets X=1 and calls cond_wait(); T2 sets Y=X and calls cond_signal(), so T2's update depends on T1's. NVM holds Log1, Log2, and Log3.]
NVthreads maintains metadata for memory pages per lockset to track data dependencies.

Evaluation

Environment
- Ubuntu 14.04 (Linux 3.16.7)
- Two Intel Xeon X5650 processors (12 cores at 2.67 GHz)
- 198 GB RAM and 600 GB SSD

Applications
- PARSEC benchmarks, Phoenix benchmarks, PageRank, K-means

NVM emulator
- Linux tmpfs on DRAM emulating nvmfs (provided by Hewlett Packard Labs)
- Injected a 1000 ns delay into each 4 KB page write via the RDTSCP instruction
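The delay injection can be approximated with a short busy-wait around a page copy. This sketch swaps in the portable `clock_gettime(CLOCK_MONOTONIC)` for the RDTSCP-based timing named on the slide, and the function names are hypothetical. It is meant only to show the shape of such an emulator hook, not to reproduce the one used in the evaluation.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <string.h>
#include <time.h>

#define PAGE_BYTES 4096

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Copy one page, then spin for at least delay_ns to mimic slower NVM writes.
 * Returns the time actually spent spinning, in nanoseconds. */
uint64_t emulated_page_write(void *dst, const void *src, uint64_t delay_ns) {
    memcpy(dst, src, PAGE_BYTES);
    uint64_t start = now_ns();
    uint64_t t;
    while ((t = now_ns()) - start < delay_ns) {
        /* busy-wait: sleeping would be too coarse at this scale */
    }
    return t - start;
}
```

A call such as `emulated_page_write(dst, src, 1000)` corresponds to the 1000 ns per 4 KB page setting. Spinning rather than sleeping keeps the added latency close to the requested value, at the cost of occupying the core.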

Performance vs pthreads

Phoenix and PARSEC benchmarks, no recovery protocol.
- 9 out of 14 applications: NVthreads incurs less than 20% overhead vs pthreads
- Remaining 5 applications: 4x to 7x slowdown vs pthreads

[Figure: bar chart of slowdown (x), axis from 0 to 16, comparing Pthreads, Dthreads, NVthreads (nvmfs 1000ns), and Atlas on histogram, kmeans, linear regression, matrix multiply, pca, reverse index, string match, word count, blackscholes, canneal, dedup, ferret, streamcluster, and swaptions.]

Performance vs Atlas [OOPSLA '14]

- 10 out of 12 applications: NVthreads is 7% to 100x faster vs Atlas
- Remaining 2 applications: 7% to 2x slower vs Atlas

[Figure: the same slowdown (x) bar chart, axis from 0 to 16, comparing Pthreads, Dthreads, NVthreads (nvmfs 1000ns), and Atlas on the 14 benchmarks. Two bars run off the axis and are labeled 101.96 and 46.92.]

Is coarse-grained tracking a good fit?

- 9 out of 14 applications touch more than 55% of each page
- It is worthwhile to track data at page granularity in these apps

[Figure: bar chart of the % of each page modified, axis from 0 to 100, per application. X-axis labels, with the number printed next to each name: linear regression (25), string match (37), histogram (44), blackscholes (89), swaptions (483), matrix multiply (4K), kmeans (10K), pca (11K), word count (12K), ferret (150K), streamcluster (180K), dedup (2.3M), reverse index (2.7M), canneal (7.4M).]
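The metric on this slide, the share of a page that was modified, can be computed by comparing a page against a snapshot taken before the writes. How the authors measured it is not stated, so the helper below is only one plausible definition, with a hypothetical name. A byte overwritten with the same value counts as unmodified.

```c
#include <stddef.h>

/* Percentage of bytes that differ between a page snapshot taken before the
 * writes and the page afterwards. Returns 0.0 for an empty page. */
double page_modified_percent(const unsigned char *before,
                             const unsigned char *after, size_t page_bytes) {
    size_t changed = 0;
    for (size_t i = 0; i < page_bytes; i++)
        if (before[i] != after[i])
            changed++;
    return page_bytes ? 100.0 * (double)changed / (double)page_bytes : 0.0;
}
```

With a 4096-byte page in which 1024 bytes changed, the function returns 25.0. A page-granularity log writes the full page either way, so the higher this percentage, the less of each logged page is wasted.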

NVthreads is faster than fine-grained tracking

- Microbenchmark: 4 threads randomly modify parts of 1000 memory pages
- Mnemosyne [ASPLOS '11] and Atlas [OOPSLA '14] use word-level tracking
- NVthreads is 3x to 30x faster than fine-grained tracking

[Figure: slowdown over pthreads (x), axis from 0 to 250, against the percentage of each page modified (5%, 10%, 25%, 50%, 75%, 100%) for NVthreads (nvm-1000ns), Atlas (no-clflush), Mnemosyne, and Atlas.]

Benefits of recovery (K-means)

- We made K-means crash at synthetic program points, recover, and continue until convergence at about the 160th iteration
- NVthreads K-means provides up to 1.9x speedup vs pthreads
- NVthreads requires only 4 SLOC changes to make K-means crash tolerant

[Figure: speedup over pthreads (x), axis from 0 to 2, for Pthreads and NVthreads (nvm=1000ns), with input sizes 1M, 10M, 20M, and 30M at each crash point: iteration 10, 50, 75, and 150.]

Summary

- NVthreads allows programmers to easily leverage NVM with just a few lines of source code changes
- Recovery requires only a redo log because multi-process execution buffers private updates
- Coarse-grained page-level tracking amortizes logging overheads
- The NVthreads prototype is publicly available at: https://github.com/hewlettpackard/nvthreads