Hardware Support for NVM Programming

Similar documents
Lazy Persistency: a High-Performing and Write-Efficient Software Persistency Technique

Blurred Persistence in Transactional Persistent Memory

SoftWrAP: A Lightweight Framework for Transactional Support of Storage Class Memory

Dalí: A Periodically Persistent Hash Map

Loose-Ordering Consistency for Persistent Memory

Loose-Ordering Consistency for Persistent Memory

3. Memory Persistency Goals. 2. Background

System Software for Persistent Memory

arxiv: v1 [cs.dc] 3 Jan 2019

WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Reliability, Availability, Serviceability (RAS) and Management for Non-Volatile Memory Storage

An Analysis of Persistent Memory Use with WHISPER

Hardware Undo+Redo Logging. Matheus Ogleari Ethan Miller Jishen Zhao CRSS Retreat 2018 May 16, 2018

Efficient Persist Barriers for Multicores

STORAGE LATENCY x. RAMAC 350 (600 ms) NAND SSD (60 us)

An Analysis of Persistent Memory Use with WHISPER

Topic 18: Virtual Memory

Deukyeon Hwang UNIST. Wook-Hee Kim UNIST. Beomseok Nam UNIST. Hanyang Univ.

Steal but No Force: Efficient Hardware Undo+Redo Logging for Persistent Memory Systems

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Leave the Cache Hierarchy Operation as It Is: A New Persistent Memory Accelerating Approach

CIS Operating Systems Memory Management Cache and Demand Paging. Professor Qiang Zeng Spring 2018

Lecture 18: Reliable Storage

Phase Change Memory An Architecture and Systems Perspective

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

CIS Operating Systems Memory Management Cache. Professor Qiang Zeng Fall 2017

ThyNVM. Enabling So1ware- Transparent Crash Consistency In Persistent Memory Systems

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

Relaxed Memory Consistency

Chapter. Out of order Execution

CMU Introduction to Computer Architecture, Spring 2014 HW 5: Virtual Memory, Caching and Main Memory

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Log-Free Concurrent Data Structures

Relaxing Persistent Memory Constraints with Hardware-Driven Undo+Redo Logging

Linearizability of Persistent Memory Objects

RISC-V Support for Persistent Memory Systems

Closing the Performance Gap Between Volatile and Persistent K-V Stores

Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers

Phase Change Memory An Architecture and Systems Perspective

High Performance Transactions in Deuteronomy

BzTree: A High-Performance Latch-free Range Index for Non-Volatile Memory

Distributed Shared Persistent Memory

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

Lecture 7: PCM, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro

LogSI-HTM: Log Based Snapshot Isolation in Hardware Transactional Memory

SOLVING THE DRAM SCALING CHALLENGE: RETHINKING THE INTERFACE BETWEEN CIRCUITS, ARCHITECTURE, AND SYSTEMS

Lecture X: Transactions

Computer Architecture: Out-of-Order Execution II. Prof. Onur Mutlu Carnegie Mellon University

DISTRIBUTED SHARED MEMORY

Linearizability of Persistent Memory Objects

Energy Aware Persistence: Reducing Energy Overheads of Memory-based Persistence in NVMs

DudeTx: Durable Transactions Made Decoupled

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Caching and reliability

FAULT TOLERANT SYSTEMS

What are Transactions? Transaction Management: Introduction (Chap. 16) Major Example: the web app. Concurrent Execution. Web app in execution (CS636)

Module 1: Basics and Background Lecture 4: Memory and Disk Accesses. The Lecture Contains: Memory organisation. Memory hierarchy. Disks.

Onyx: A Prototype Phase-Change Memory Storage Array

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

Topic 18 (updated): Virtual Memory

JOURNALING techniques have been widely used in modern

Transaction Management: Introduction (Chap. 16)

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Leveraging Flash in HPC Systems

Conventional Computer Architecture. Abstraction

Lecture 20: Transactional Memory. Parallel Computer Architecture and Programming CMU , Spring 2013

NV-Tree Reducing Consistency Cost for NVM-based Single Level Systems

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

TRANSACTIONAL FLASH CARSTEN WEINHOLD. Vijayan Prabhakaran, Thomas L. Rodeheffer, Lidong Zhou

Crash Consistency: FSCK and Journaling. Dongkun Shin, SKKU

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Remote Persistent Memory SNIA Nonvolatile Memory Programming TWG

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Memory. Lecture 22 CS301

Lecture 7: PCM Wrap-Up, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro

arxiv: v1 [cs.dc] 8 Sep 2017

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

COMPUTER ORGANIZATION AND DESI

Soft Updates Made Simple and Fast on Non-volatile Memory

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Using Memory-Style Storage to Support Fault Tolerance in Data Centers

Virtual Memory and Interrupts

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois

Cache Performance (H&P 5.3; 5.5; 5.6)

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Precise Exceptions and Out-of-Order Execution. Samira Khan

What are Exceptions? EE 457 Unit 8. Exception Processing. Exception Examples 1. Exceptions What Happens When Things Go Wrong

NVthreads: Practical Persistence for Multi-threaded Applications

Topics. Digital Systems Architecture EECE EECE Need More Cache?

Chapter 17: Recovery System

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili

1. Memory technology & Hierarchy

Journaling. CS 161: Lecture 14 4/4/17

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand

File Systems: Consistency Issues

Transcription:

Hardware Support for NVM Programming 1

Outline Ordering Transactions Write endurance 2

Volatile Memory Ordering Write-back caching Improves performance Reorders writes to DRAM STORE A STORE B CPU CPU B A Write-back Cache A B DRAM Reordering to DRAM does not break correctness Memory consistency orders stores between CPUs 3

Persistent Memory (PM) Ordering Recovery depends on write ordering STORE data[0] = 0xFOOD STORE data[1] = 0xBEEF STORE valid = 1 CPU CPU V D D Write-back Cache D D Crash V NVM 4

Persistent Memory (PM) Ordering Recovery depends on write ordering STORE data[0] = 0xFOOD STORE data[1] = 0xBEEF STORE valid = 1 NVM CPU CPU Write-back Cache data valid GARBAGE 1 Reordering breaks recovery Recovery incorrectly considers garbage as valid data 5

Simple Solutions Disable caching Write-through caching Flush entire cache at commit 6

Generalizing PM Ordering 1: STORE data[0] = 0xFOOD 2: STORE data[1] = 0xBEEF 3: STORE valid = 1 7

Generalizing PM Ordering 1: PERSIST data[0] 2: PERSIST data[1] 3: PERSIST valid 1 2 3 Program order implies unnecessary constraints Need interface to describe necessary constraints 8

Generalizing PM Ordering 1: PERSIST data[0] 2: PERSIST data[1] 3: PERSIST valid 1 2 3 Need interface to expose necessary constraints Expose persist concurrency; sounds like consistency! 9

Memory Persistency: [Pelley, ISCA14] Memory Consistency for NVM Framework to reason about persist order while maximizing concurrency Memory consistency Constrains order of loads and stores between CPUs Memory persistency Constrains order of writes with respect to failure 10

Memory Persistency = Consistency + Recovery Observer Abstract failure as recovery observer Observer sees writes to NVM Memory persistency Constrains order of writes with respect to observer STORES PERSISTS CPU CPU Consistency Memory Persistency (Observer) 11

Ordering With Respect to Recovery Observer STORE data[0] = 0xFOOD STORE data[1] = 0xBEEF STORE valid = 1 STORES PERSISTS CPU CPU V D D Write-back Cache D D V NVM Memory Persistency 12

Ordering With Respect to Recovery Observer STORE data[0] = 0xFOOD STORE data[1] = 0xBEEF STORE valid = 1 NVM CPU CPU Write-back Cache data valid GARBAGE 1 Memory Persistency (view after crash) 13

Happens Before: Persistency Design Space Volatile Memory Order Persistent Memory Order Strict persistency: single memory order Relaxed persistency: separate volatile and (new) persistent memory orders 14

Outline Ordering Intel x86 ISA extensions [Intel14] BPFS epochs barriers [Condit, SOSP09] Strand persistency [Pelley, ISCA14] Transactions Write endurance Relax persistency 15

Ordering with Existing Hardware Order writes by flushing cachelines via CLFLUSH But CLFLUSH: STORE data[0] = 0xFOOD STORE data[1] = 0xBEEF CLFLUSH data[0] CLFLUSH data[1] STORE valid = 1 Stalls the CPU pipeline and serializes execution data[0] ST CLFLUSH data[1] valid stall (~200ns) ST CLFLUSH ST time 16

Ordering with Existing Hardware Order writes by flushing cachelines via CLFLUSH But CLFLUSH: STORE data[0] = 0xFOOD STORE data[1] = 0xBEEF CLFLUSH data[0] CLFLUSH data[1] STORE valid = 1 Stalls the CPU pipeline and serializes execution Invalidates the cacheline Only sends data to the memory subsystem does not commit data to NVM 17

Fixing CLFLUSH: Intel x86 Extensions CLFLUSHOPT CLWB PCOMMIT 18

CLFLUSHOPT Provides unordered version of CLFLUSH Supports efficient cache flushing STORE data[0] = 0xFOOD STORE data[1] = 0xBEEF CLFLUSHOPT data[0] CLFLUSHOPT data[1] Implicit orderings SFENCE // explicit ordering point STORE valid = 1 19

CLFLUSHOPT Provides unordered version of CLFLUSH Supports efficient cache flushing data[0] ST CLFLUSH data[1] ST CLFLUSH valid ST data[0] ST CLFLUSHOPT time data[1] ST CLFLUSHOPT valid ST time 20

CLWB Write backs modified data of a cacheline Does not invalidate the line from the cache Marks the line as non-modified Note: Following examples use CLWB 21

PCOMMIT Commits data writes queued in the memory subsystem to NVM STORE data[0] = 0xFOOD STORE data[1] = 0xBEEF CLWB data[0] CLWB data[1] SFENCE // orders subsequent PCOMMIT PCOMMIT // commits data[0], data[1] SFENCE // orders subsequent stores STORE valid = 1 Limitation: PCOMMITs execute serially 22

Example: Copy on Write R Z X Y 23

Example: Copy on Write R X Z R STORE Y STORE Z Z CLWB Y Y Y CLWB Z PCOMMIT STORE R CLWB R 24

Example: Copy on Write R X X Z R Y Z R Z Y STORE Y STORE Z CLWB Y CLWB Z PCOMMIT STORE R CLWB R PCOMMIT STORE X STORE Z CLWB X CLWB Z PCOMMIT STORE R CLWB R 25

Example: Copy on Write Timeline Y ST CLWB Z ST CLWB PCOMMIT stall R ST CLWB PCOMMIT X Z R stall ST CLWB ST CLWB PCOMMIT stall ST CLWB PCOMMITs execute serially time 26

Outline Ordering Intel x86 ISA extensions [Intel14] BPFS epochs barriers [SOSP09] Strand persistency [ISCA14] Transactions Write endurance Relax persistency 27

BPFS Epochs Barriers [Condit, SOSP09] Barriers separate execution into epochs: sequence of writes to NVM from the same thread Epoch Epoch Epoch STORE STORE EPOCH_BARRIER STORE EPOCH_BARRIER STORE Writes within same epoch are unordered A younger write is issued to NVM only after all previous epochs commit 28

Example: Copy on Write R X Z R STORE Y STORE Z Z EPOCH_BARRIER STORE R Y Y Epoch Epoch 29

Example: Copy on Write R X X Z R R STORE Y STORE Z Z Z EPOCH_BARRIER STORE R Y Y EPOCH_BARRIER STORE X STORE Z EPOCH_BARRIER STORE R 30

Example: Copy on Write - Failure R R Z R STORE Y STORE Z Z EPOCH_BARRIER STORE R Y Y L1/L2 WRITEBACK X WRITEBACK Y Y Y X X NVM WRITEBACK R 31

PCOMMIT/CLWB VS Epochs Barriers Y ST CLWB Z R ST CLWB X ST CLWB PCOMMIT stall Commits happen synchronously PCOMMIT ST CLWB X ST WRITEBACK Z ST WRITEBACK time Commits happen asynchronously R ST WRITEBACK X ST WRITEBACK time 32

BPFS Epochs Barriers: Ordering between threads Epochs also capture read-write dependencies between threads Epoch Epoch Epoch Thread 0 Thread 1 STORE STORE EPOCH_BARRIER STORE R EPOCH_BARRIER STORE Must make dependency visible to NVM to ensure crash consistency Memory Consistency PERSIST makes Missing this dependency PERSIST persists visible to thread 1 PERSIST R LOAD R STORE V Recovery Observer PERSIST V 33

Controller Cache Lines Epoch Hardware Proposal CPU0 Cache EpochID EpochID P Stage/Tag/Data CPU1 EpochID EpochID P EpochID P EpochID P State/Tag/Data State/Tag/Data State/Tag/Data NVM CPUn EpochID CPU0 CPU1 CPUn Oldest in-flight EpochID Oldest in-flight EpochID Oldest in-flight EpochID Per-processor epoch ID tags writes Cache line stores epoch ID when it is modified Cache tracks oldest in-flight epoch per CPU 34

Epoch HW: Ordering Within a Thread CPU0 EpochID 0.0 0.1 0.2 STORE Y STORE Z EPOCH_BARRIER STORE R EPOCH_BARRIER STORE X EVICT R WRITEBACK Y WRITEBACK Z WRITEBACK R Cascading Writebacks EpochID 0.1 P 1 Stage/Tag/Data R EpochID 0.0 P 1 State/Tag/Data Y EpochID 0.0 P 1 State/Tag/Data Z EpochID 0.2 P 1 State/Tag/Data X CPU0 CPU1 CPUn Cache Oldest Epoch in-flight 0.0 EpochID Oldest in-flight EpochID Oldest in-flight EpochID NVM Epoch 0.1 is not oldest evict all earlier epochs 35

Epoch HW: Ordering Within a Thread Overwrites CPU0 EpochID 0.2 01 STORE Y STORE Z EPOCH_BARRIER STORE R EPOCH_BARRIER STORE X STORE R WRITEBACK Y WRITEBACK Z EpochID 0.1 P 1 EpochID 0.0 P 1 State/Tag/Data Z EpochID 0.2 P 1 CPU0 CPU1 CPUn Cache Stage/Tag/Data R EpochID 0.0 P 1 State/Tag/Data Y State/Tag/Data X Oldest Epoch in-flight 0.0 EpochID Oldest in-flight EpochID Oldest in-flight EpochID Flush epoch 0.1 (containing Y) and older epochs NVM 36

Epoch HW: Ordering Between Threads CPU0 EpochID 0.1 0 STORE Y EPOCH_BARRIER STORE R CPU1 EpochID 1.0 EpochID 0.1 P 1 EpochID 0.0 P 1 EpochID P EpochID P CPU0 CPU1 CPUn Cache Stage/Tag/Data R State/Tag/Data Y State/Tag/Data State/Tag/Data Oldest Epoch in-flight 0.0 EpochID Oldest Epoch in-flight 1.0 EpochID Oldest in-flight EpochID NVM LOAD R WRITEBACK Y WRITEBACK R Read data tagged by CPU0 flush all old epochs to capture dependency 37

Summary Ordering primitive Persists Commits CLFLUSH Serial N/A PCOMMIT/CLWB Parallel Synchronous Epochs Parallel Asynchronous 38

Outline Ordering Intel x86 ISA extensions BPFS epochs barriers [Condit, SOSP09] Strand persistency [Pelley, ISCA14] Transactions Write endurance 39

Limitation of Epochs seek (fd, 1024, SEEK_SET); write (fd, data, 128); Non conflicting writes seek (fd, 2048, SEEK_SET); write (fd, data, 128); 40

Limitation of Epochs seek (fd, 1024, ); write (fd, data, 128); X B Y D seek (fd, 2048, ); write (fd, data, 128); 41

Limitation of Epochs X Y STORE B EPOCH_BARRIER STORE X B D EPOCH_BARRIER STORE D EPOCH_BARRIER STORE Y 42

Limitation of Epochs 1:PERSIST B 2:PERSIST X 3:PERSIST D 4:PERSIST Y 1 2 3 Epoch order implies unnecessary constraint 4 Can we expose more persist concurrency? 43

Strand Persistency Divide execution into strands Each strand is an independent set of persists All strands initially unordered Conflicting accesses establish persist order NewStrand instruction begins each strand Barriers continue to order persists within each strand as in epoch persistency 44

Strand Persistency: Example Epoch Strand A C B A BARRIER B C or A B BARRIER C NEWSTRAND A BARRIER C NEWSTRAND B Strands remove unnecessary ordering constraints 45

Strands Expose More Persist Concurrency seek (fd, 1024, ); write (fd, data, 128); seek (fd, 2048, ); write (fd, data, 128); NEW_STRAND STORE B EPOCH_BARRIER STORE X NEW_STRAND STORE D EPOCH_BARRIER STORE Y 46

Strands Expose More Persist Concurrency 1:PERSIST B[0] 2:PERSIST B[1] 3:PERSIST X 4:PERSIST D[0] 5:PERSIST D[1] 6:PERSIST Y 1 2 3 4 5 6 47

Persistency Spectrum Strict Relaxed No caching PCOMMIT/CLWB Strand persistency Epoch persistency (BPFS like) 48

Summary Ordering primitive Persists Commits CLFLUSH Serial N/A PCOMMIT/CLWB Parallel Synchronous Epochs Parallel Asynchronous Strands Parallel Asynchronous + Parallel 49

Outline Ordering Transactions Restricted transactional memory [Dulloor, EuroSys14] Multiversioned memory hierarchy [Zhao, MICRO13] Write endurance 50

Software-based Atomicity is Costly Atomicity relies on multiple data copies (versions) for recovery Write-ahead logging: write intended updates to a log Copy on write: write updates to new locations Software cost Mem-copying for creating multiple data versions Bookkeeping information for maintaining versions 51

Restricted Transactional Memory (RTM) Intel s RTM supports failure-atomic 64-byte cache line writes XBEGIN STORE A STORE B XEND A, B can be now written back to NVM atomically Cache B NVM Existence proof that PM can leverage hardware TM ` A RTM prevents A, B from leaving cache before commit (for isolation) 52

Multiversioning: Leveraging Caching for In-place Updates How does a write-back cache work? A processor writes a value Old values remain in lower levels Until the new value gets evicted Cache Memory A New Ver A Old Ver 53

Multiversioning: Leveraging Caching for In-place Updates How does a write-back cache work? A processor writes a value Old values remain in lower levels Until the new value gets evicted Insight: multiversioned system by nature Allow in-place updates to directly overwrite original data No need for logging or copy-on-write NV-LLC NVM A B C Ver N A B C Ver N-1 A Multiversioned Persistent Memory Hierarchy 54

Preserving Write Ordering: Out-of-order Writes + In-order Commits Out-of-order writes to NV-LLC NV-LLC remembers the committing state of each cache line In-order commits of transactions Example: T A before T B T B will not commit until A 2 arrives in NV-LLC Committing a transaction Flush higher-level caches (very fast) Change cache line states in NV-LLC T A = {A 1, A 2, A 3 } T B = {B 1, B 2 } Higher-Level Caches Out-of-order A 3, B 2, A 1, B 1, A 2 NV-LLC NVM 55 Flush

A Hardware Memory Barrier Why Prevents early eviction of uncommitted transactions Avoids violating atomicity T A = {A 1, A 2, A 3 } A B C NV-LLC A 1 Crash How A B C NVM Extend replacement policy with transaction-commit info to keep uncommitted transactions in NV-LLC Handle NV-LLC overflows using OS supported CoW 56

Outline Ordering Transactions Write endurance Start-gap wear leveling [Qureshi, MICRO09] Dynamically replicated memory [Ipek, ASPLOS10] Note: Mechanisms target NVM-based main memory 57

Start-Gap Wear Leveling Table-based wear leveling is too costly for NVM Storage overheads and indirection latency Instead, use algebraic mapping between logical and physical addresses Periodically remap a line to its neighbor 58

Start-Gap Wear Leveling Table-based wear leveling is too costly for NVM Storage overheads and indirection latency Instead, use algebraic mapping between logical and physical addresses Periodically remap a line to its neighbor START GAP A B C D Memory lines Gap line 59

Start-Gap Wear Leveling Table-based wear leveling is too costly for NVM Storage overheads and indirection latency Instead, use algebraic mapping between logical and physical addresses Periodically remap a line to its neighbor Move start every one gap rotation START START A D B A C B D C D GAP GAP GAP GAP GAP Wear-leveling: Move gap every 100 memory writes NVMAddr = (Start+Addr); if (PhysAddr >= Gap) NVMAddr++ 60

Dynamically Replicated Memory Reuse faulty pages with non-overlapping faults FAULT FAULT FAULT FAULT FAULT FAULT Compatible pages Incompatible pages Record pairings in a new level of indirection Recorded in Real Table Virtual Address Space Recorded in Page Tables PhysicalAddress Space Real Address Space 61

Summary Ordering support Reduces unnecessary ordering constraints Exposes persist concurrency Transaction support Removes versioning software overheads Endurance support further increases lifetime Questions? 62

Backup Slides 63

Randomized Start Gap Start gap may move spatially-close hot lines to other hot lines Randomize address space to spread hot regions uniformly Physical Address Randomized Address NVM Address Line Addr Static Randomizer Start-Gap Mapping NVM Hot lines 64 X Y