NVthreads: Practical Persistence for Multi-threaded Applications

Terry Hsu*, Purdue University
Helge Brügner*, TU München
Indrajit Roy*, Google Inc.
Kimberly Keeton, Hewlett Packard Labs
Patrick Eugster, TU Darmstadt and Purdue University
* Work was done at Hewlett Packard Labs.

NVMW 2018. NVthreads was published in EuroSys 2017.
This work was supported by Hewlett Packard Labs, NSF TC-1117065, NSF TWC-1421910, and ERC FP7-617805.

What is non-volatile memory (NVM)?
Key features: persistence, good performance, byte addressability
- Persistence: retain data without power
- Good performance: outperform the traditional filesystem interface
- Byte addressability: allow for pure memory operations

Programming interfaces for NVM
NVM-aware filesystems: BPFS, PMFS, PMEM
- Pro: provide good performance
- Con: require applications to use file-system interfaces and may need hardware modifications
Durable transactions and heaps: NV-Heaps, Mnemosyne
- Pro: allow fine-grained NVM access
- Con: force programs to use transactions and require non-trivial effort to retrofit transactions into lock-based programs
Problem: Can we provide a simpler programming interface?

NVM-aware apps programming

    // Add element to the tail of list
    pthread_mutex_lock(&m);
    e = malloc(sizeof(*e));
    e->value = 5;
    e->next = NULL;
    head->next = e;   // crash
    tail = e;
    pthread_mutex_unlock(&m);

Challenge 1: data consistency. A crash right after head->next = e means tail = e never executes, so the list in NVM is left inconsistent.

[Diagram: list in NVM. head (value 1) points to e (value 5), whose next is NULL; tail still refers to head.]

NVM-aware apps programming

    // Add element to the tail of list
    pthread_mutex_lock(&m);
    e = malloc(sizeof(*e));
    // <save old value of e->value>
    // <flush log entry to NVM>
    e->value = 5;
    // <save old value of e->next>
    // <flush log entry to NVM>
    e->next = NULL;
    // <save old value of head->next>
    // <flush log entry to NVM>
    head->next = e;
    // <save old value of tail>
    // <flush log entry to NVM>
    tail = e;
    pthread_mutex_unlock(&m);

Challenges:
1. Data consistency
2. Programmability
3. Volatile caches
4. Performance

[Diagram: cache flushing moves log entries from the volatile cache to NVM. In NVM, head (value 1) points to e (value 5), e->next is NULL, and tail refers to e.]
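The "save old value, flush log entry, then store" pattern on this slide can be sketched as a tiny undo log. This is only an illustration of the hand-written approach the slide argues against, not NVthreads code. All names here (undo_entry, logged_store, undo_commit, undo_rollback, persist) are hypothetical, pointers are stored as uintptr_t slots for brevity, and persist() is a stand-in that issues only a compiler fence; on real hardware it would be a cache-line flush followed by a store fence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical undo-log entry: the address written and its previous value. */
struct undo_entry {
    uintptr_t *addr;
    uintptr_t  old;
};

#define UNDO_CAP 16
struct undo_entry undo_log[UNDO_CAP];  /* would live in NVM */
size_t undo_len = 0;                   /* would live in NVM */

/* Stand-in for "flush to NVM" (assumption): only a compiler fence here. */
void persist(const void *p, size_t n) {
    (void)p;
    (void)n;
    atomic_signal_fence(memory_order_seq_cst);
}

/* Save the old value, flush the log entry, and only then do the store. */
void logged_store(uintptr_t *addr, uintptr_t val) {
    assert(undo_len < UNDO_CAP);
    undo_log[undo_len].addr = addr;
    undo_log[undo_len].old  = *addr;
    persist(&undo_log[undo_len], sizeof undo_log[undo_len]);
    undo_len++;
    persist(&undo_len, sizeof undo_len);
    *addr = val;
}

/* End of the critical section: the updates stand, so discard the log. */
void undo_commit(void) {
    undo_len = 0;
    persist(&undo_len, sizeof undo_len);
}

/* Recovery after a crash: restore old values, newest entry first. */
void undo_rollback(void) {
    while (undo_len > 0) {
        undo_len--;
        *undo_log[undo_len].addr = undo_log[undo_len].old;
    }
}
```

Every store in the critical section needs its own log entry and flush, which is where the programmability and performance challenges on the slide come from.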

Challenges of using NVM
- Data consistency: ensure data consistency even after a crash
- Volatile caches: manage data movement from volatile caches to NVM
- Programmability: avoid extensive program modifications
- Performance: minimize runtime overhead
Proposal: NVthreads, a programming model and runtime that adds persistence to multi-threaded C/C++ programs

Goals of NVthreads
Make existing lock-based C/C++ applications crash tolerant
Minimize porting effort
- Drop-in replacement for the pthreads library
- No need for transactions
Advantages of NVthreads
- Good performance
- Easier to develop NVM-aware applications

Key ideas
Use synchronization points to infer consistent regions (cf. Atlas [OOPSLA 14])
- Does not require applications to use transactions
Execute a multithreaded program as a multi-process program (cf. DThreads [SOSP 11])
- Process memory buffers uncommitted writes
Track data modifications at page granularity
- Amortizes logging overhead vs. fine-grained tracking
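One way to picture the first key idea is a lock-depth counter: a consistent region opens when a thread acquires its first lock and closes when it releases its last one. The sketch below models only that counter. The names region_tracker, on_lock_acquire and on_lock_release are hypothetical, and treating the outermost critical section as the region boundary is an assumption made for illustration, not a statement of how the NVthreads runtime is written.

```c
#include <assert.h>

/* Hypothetical per-thread state for inferring durable regions from locks. */
struct region_tracker {
    int depth;              /* number of locks currently held */
    int regions_committed;  /* number of outermost regions closed so far */
};

/* Call on every lock acquire. Returns 1 if this acquire opens a new region. */
int on_lock_acquire(struct region_tracker *t) {
    return t->depth++ == 0;
}

/* Call on every lock release. Returns 1 if this release closes the region,
 * i.e. the point at which buffered updates would be made durable. */
int on_lock_release(struct region_tracker *t) {
    assert(t->depth > 0);
    if (--t->depth == 0) {
        t->regions_committed++;
        return 1;
    }
    return 0;
}
```

With nested locks, only the outermost release reports a commit point, so the application needs no explicit transaction markers.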

Using NVthreads
Ease of use:

    bash$ gcc foo.c -o foo.out -rdynamic libnvthread.so -ldl

Software stack:
- User space: unmodified C/C++ application
  Modifications: add recovery code, specify persistent allocations
  - Allocate data in NVM: nvmalloc()
  - Recover data in NVM: nvrecover()
- User space: NVthreads library (link to NVthreads library)
  Multi-process, intercepting synchronization, tracking data, maintaining log
- Kernel space: operating system
  Memory allocation and file system interface for both DRAM and NVM
- Hardware: DRAM (volatile main memory, e.g., stacks) and NVM (persistent regions, e.g., linked list on heap)

NVthreads: programming model

    void main(){
      if( crashed() ){
        // Application-specific recovery code. Programmer needs to add.
        int *c = (int*) nvmalloc(sizeof(int), "c");
        *c = nvrecover(c, sizeof(int), "c");
      }
      else{ // normal execution
        int *c = (int*) nvmalloc(sizeof(int), "c");
        ... // thread creation
        m.lock();
        *c = *c+1;
        ...
        m.unlock();
      }
    }

Locks mark the boundary of the durable code section.

Example: linked list
NVthreads guarantees that the linked list is atomically appended w.r.t. failures

    // L is a persistent list
    Start threads {T1, T2, T3}

    // Add element to the tail of list
    pthread_mutex_lock(&m);
    nvmalloc(&e, sizeof(*e));
    e->val = localval;
    tail->next = e;
    e->prev = tail;   // crash!
    tail = e;
    pthread_mutex_unlock(&m);

[Timeline: T1, T2 and T3 each run the critical section in turn (add e1, add e2, add e3). The state of the list data structure L in NVM goes from L={} to L={e1} to L={e1, e2}. After the crash, the recovery phase executes redo ops.]

Implementing atomic durability
Convert threads to processes (cf. DThreads [SOSP 11])
- Each process works on private memory, no undo log
- Threads share one address space; processes have disjoint address spaces
At synchronization points, propagate private updates, execute processes sequentially
Track dirty pages and log them to NVM for recovery
- Apply redo log in the event of a crash

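From threads to processes
[Diagram: timelines for T1 and T2 across a parallel phase, a critical section, and another parallel phase; segments are marked Execute or Wait. T1 tracks dirty pages, then starts the critical section, writes the NVM log, stops, merges shared state, and passes the token. T2 waits for the token, then goes through the same steps: track dirty pages, start, NVM log write, stop, merge shared state.]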

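Redo logging
Steps at the end of T1's critical section:
1. Log dirty pages to the redo log
2. Merge updated bytes into the shared state
3. Write back to NVM
4. sync()

[Diagram: T1's pages during the parallel phase and the critical section, with clean and dirtied pages distinguished; dirtied pages flow to the redo log and the shared state, both backed by NVM.]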
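The steps above can be sketched with a twin page (a snapshot taken before the process starts writing) and an in-memory array standing in for the NVM log. This is a simplified model under stated assumptions, not the NVthreads implementation: redo_entry, redo_log, redo_log_page, merge_updated_bytes and redo_replay are hypothetical names, the log has a fixed capacity, and replay copies whole pages, which is only correct here because a single writer touches each page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096
#define LOG_CAP   8

/* Hypothetical redo-log record: one full copy of a dirtied page. */
struct redo_entry {
    size_t        page_no;
    unsigned char data[PAGE_SIZE];
};

struct redo_log {
    size_t            len;
    struct redo_entry entries[LOG_CAP];
};

/* Step 1: append a dirty private page to the log. Returns 0 on success. */
int redo_log_page(struct redo_log *rl, size_t page_no,
                  const unsigned char *page) {
    if (rl->len >= LOG_CAP)
        return -1;
    rl->entries[rl->len].page_no = page_no;
    memcpy(rl->entries[rl->len].data, page, PAGE_SIZE);
    rl->len++;
    return 0;
}

/* Step 2: copy into the shared page only the bytes that differ from the
 * twin. Returns the number of bytes merged. */
size_t merge_updated_bytes(unsigned char *shared, const unsigned char *priv,
                           const unsigned char *twin) {
    size_t merged = 0;
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        if (priv[i] != twin[i]) {
            shared[i] = priv[i];
            merged++;
        }
    }
    return merged;
}

/* Recovery: replay logged pages, oldest first, onto the persistent image. */
void redo_replay(const struct redo_log *rl, unsigned char *image,
                 size_t n_pages) {
    for (size_t i = 0; i < rl->len; i++) {
        if (rl->entries[i].page_no < n_pages)
            memcpy(image + rl->entries[i].page_no * PAGE_SIZE,
                   rl->entries[i].data, PAGE_SIZE);
    }
}
```

Because the private copy buffers all writes until the merge, nothing in shared state needs to be undone after a crash; replaying the log forward is enough.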

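Tracking data dependencies
NVthreads maintains metadata for memory pages per lockset to track data dependencies.

[Diagram: initially X=Y=0. Labels on T1: X=1, cond_wait(). Labels on T2: Y=X, cond_signal(). An arrow marked "dependence" links the two threads, whose critical sections are labeled A and B. The NVM log holds entries Log1, Log2, Log3.]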

Evaluation
Environment
- Ubuntu 14.04 (Linux 3.16.7)
- Two Intel Xeon X5650 processors (12 cores at 2.67 GHz)
- 198 GB RAM and 600 GB SSD
Applications
- PARSEC benchmarks, Phoenix benchmarks, PageRank, K-means
NVM emulator
- Linux tmpfs on DRAM emulating nvmfs (provided by Hewlett Packard Labs)
- Injected a 1000 ns delay into each 4 KB page write via the RDTSCP instruction
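The delay injection in the emulator can be pictured as a busy-wait charged after every page copy. The sketch below is an assumption-laden stand-in, not the emulator used in the evaluation: it reads the clock with POSIX clock_gettime(CLOCK_MONOTONIC) instead of the RDTSCP instruction named on the slide, and now_ns, spin_delay_ns and emulated_nvm_page_write are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define PAGE_SIZE 4096

/* Monotonic time in nanoseconds (swapped in for a raw RDTSCP read). */
uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Busy-wait until at least delay_ns nanoseconds have elapsed. */
void spin_delay_ns(uint64_t delay_ns) {
    uint64_t start = now_ns();
    while (now_ns() - start < delay_ns) {
        /* spin */
    }
}

/* Copy one 4 KB page to emulated NVM, then charge the extra write latency. */
void emulated_nvm_page_write(void *dst, const void *src, uint64_t delay_ns) {
    memcpy(dst, src, PAGE_SIZE);
    spin_delay_ns(delay_ns);
}
```

Calling emulated_nvm_page_write(dst, src, 1000) models the 1000 ns per-page setting; spinning rather than sleeping keeps the delay close to the requested value at this time scale.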

Performance vs pthreads
Phoenix and PARSEC benchmarks, no recovery protocol
- 9 out of 14 applications: NVthreads incurs less than 20% overhead vs pthreads
- Remaining 5 applications: 4x to 7x slowdown vs pthreads

[Figure: slowdown (x), axis 0 to 16, for histogram, kmeans, linear regression, matrix multiply, pca, reverse index, string match, word count, blackscholes, canneal, dedup, ferret, streamcluster, swaptions. Series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas.]

Performance vs Atlas [OOPSLA 14]
- 10 out of 12 applications: NVthreads is 7% to 100x faster vs Atlas
- Remaining 2 applications: 7% to 2x slower vs Atlas

[Figure: slowdown (x), axis 0 to 16, for the same 14 applications; two bars run off the axis and are labeled 101.96 and 46.92, and two applications are marked "x". Series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas.]

Is coarse-grained tracking a good fit?
- 9 out of 14 applications touch more than 55% of each page
- It is worthwhile to track data at page granularity in these apps

[Figure: % of each page modified (0 to 100) per application. Applications, with the figure given in parentheses on the axis labels: linear regression (25), string match (37), histogram (44), blackscholes (89), swaptions (483), matrix multiply (4K), kmeans (10K), pca (11K), word count (12K), ferret (150K), streamcluster (180K), dedup (2.3M), reverse index (2.7M), canneal (7.4M).]
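A metric like "% of each page modified" can be approximated by comparing a dirtied page against a snapshot taken before the writes. The function below is a hypothetical sketch (percent_modified is not from the source, and the slides do not say how the measurement was taken); it counts bytes whose value changed, so a byte rewritten with the same value is not counted.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* Percentage (0 to 100) of bytes in a page that differ from its snapshot. */
double percent_modified(const unsigned char *page,
                        const unsigned char *snapshot) {
    size_t changed = 0;
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        if (page[i] != snapshot[i])
            changed++;
    }
    return 100.0 * (double)changed / (double)PAGE_SIZE;
}
```

When this percentage is high, logging the whole page costs little more than logging each modified word, while avoiding per-word bookkeeping.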

NVthreads is faster than fine-grained tracking
- Microbenchmark: 4 threads randomly modify parts of 1000 memory pages
- Mnemosyne [ASPLOS 11] and Atlas [OOPSLA 14] use word-level tracking
- NVthreads is 3x to 30x faster than fine-grained tracking

[Figure: slowdown over pthreads (x), axis 0 to 250, vs percentage of page modified (5%, 10%, 25%, 50%, 75%, 100%). Series: NVthreads (nvm-1000ns), Atlas (no-clflush), Mnemosyne, Atlas.]

Benefits of recovery (K-means)
- We made K-means crash at synthetic program points, recover, and continue until convergence at around the 160th iteration
- NVthreads K-means provides up to 1.9x speedup vs pthreads
- NVthreads requires only 4 SLOC changes to make K-means crash tolerant

[Figure: speedup over pthreads (x), axis 0 to 2, for input sizes 1M, 10M, 20M, 30M, grouped by the iteration at which the crash occurred (10, 50, 75, 150). Series: Pthreads, NVthreads (nvm=1000ns).]

Summary
- NVthreads allows programmers to easily leverage NVM with just a few lines of source code changes
- Recovery requires only a redo log because multi-process execution buffers private updates
- Coarse-grained page-level tracking amortizes logging overheads
- NVthreads prototype is publicly available at: https://github.com/hewlettpackard/nvthreads