Relaxing Persistent Memory Constraints with Hardware-Driven Undo+Redo Logging

Matheus A. Ogleari, Ethan L. Miller, Jishen Zhao
University of California, Santa Cruz
November 19, 2016

Abstract

Persistent memory is a new tier of memory that functions as a hybrid of traditional storage systems and main memory. It combines the benefits of both: the data persistence property of storage with the fast load/store interface of memory. Yet, efficiently supporting data persistence in memory requires non-trivial effort. In particular, logging is a widely used data persistence scheme due to its portability and flexibility benefits compared with logless mechanisms. However, traditional software logging mechanisms are expensive to adopt in persistent memory, due to their overhead on performance, energy, and implementation. Additionally, previously proposed hardware-based logging schemes do not fully address the issues with software. To address these challenges, we propose a hardware-driven undo+redo logging scheme, which maintains data persistence by leveraging the otherwise largely wasted hardware information available in the cache hierarchy. Our method allows persistent memory to exploit undo+redo logging to relax data persistence constraints on caching without nonvolatile on-chip buffering or caching components. Our evaluation across persistent memory microbenchmarks and real workloads demonstrates that our design leads to significant system throughput improvement and reductions in both dynamic energy and memory traffic.

1 Introduction

Persistent memory is a new technology that bridges the decades-long divide between memory and storage systems by having the properties of both. Recent studies implement persistent memory by placing emerging nonvolatile memory devices, such as phase-change memory [44], resistive RAM [7], spin-transfer torque RAM [51], and 3D XPoint memory [17], on the memory bus. Persistent memory provides data persistence support in memory (§2). One of the mechanisms that provides such support is logging (§2.1). This paper focuses on devising efficient logging support in persistent memory systems.

Software-based logging is one of the most commonly used schemes for maintaining data persistence in persistent memory systems. It gives programmers substantial control over data persistence. However, it is inefficient for a number of reasons. First, software logging requires changes to existing legacy code. Second, it does not guarantee data persistence in multithreaded programs in the face of context switching, as discussed further in §2.5. Third, software logging inflicts tight ordering constraints, because proper use of undo+redo logging in memory systems is expensive and unfeasible in software (discussed in §2.3 and §2.4). Other works have proposed hardware-based logging solutions (see §8). However, they either give up the advantages of eliminating memory fence instructions and relaxing ordering constraints, or are too costly in hardware. More importantly, they implement either undo or redo logging, but not both. Doing both comes at an especially high hardware cost in those designs because they do not utilize the key information from the cache hierarchy that our design does, which is what allows us to implement this feature efficiently. Our goal in this paper is to design an efficient scheme that both enables high-performance, energy-efficient hardware-based undo+redo logging in persistent memory and relaxes data persistence constraints, without adding nonvolatile buffers or caches to the processor.
We propose a persistent memory undo+redo logging scheme that enables steal/no-force (a concept borrowed from database design) [31] of persistent data updates across caches and memory controllers at low cost.

Our scheme incorporates both undo and redo logging. The undo log allows cache blocks of uncommitted transactions to steal (i.e., write back) into NVRAM, eliminating memory barriers. The redo log allows transactions to restore their latest state without forcing (i.e., no-force) cache blocks of committed transactions out of the processor, reducing cache flushes and forced write-backs. Supporting both undo and redo logging may seem expensive, but we found that the undo and redo information naturally exists in the cache hierarchy. This information, which is wasted in most previous persistent memory designs, allows us to enable practical undo+redo logging support at low performance and implementation cost. As a result, combining undo and redo logging provides both steal and no-force benefits, which significantly relaxes the ordering constraints that persistent memory has placed on caching of original data updates. In order to enable undo+redo logging in persistent memory at low cost, we propose two hardware mechanisms: WAL and FWB. WAL implements automatic write-ahead logging in hardware for persistent data. FWB implements automatic, periodic forced write-back in the caches.

This paper makes the following contributions:

- We identify the critical overheads in traditional software-based logging mechanisms, imposed by under-utilization of the information already available in the CPU cache hierarchy of persistent memory systems.
- We propose a hardware-driven undo+redo logging scheme, which substantially reduces the implementation overheads and relaxes the data persistence constraints. This reduces the number of software instructions executed in the pipeline and simplifies the software interface.
- We ensure the order between original data and its log updates in NVRAM without explicit cache flushes or memory barriers.
- Our design exploits detailed hardware information that can secure data persistence in multithreading scenarios. Previous software-based logging schemes can risk data persistence with multithreaded workloads.
- We devise a set of efficient hardware design techniques that enable our logging scheme with lightweight software support and modifications to processors. We update the log by leveraging the hardware information naturally available in caches.

2 Background and Motivation

Modern computer systems decouple fast data access and data persistence by adopting separate main memory (DRAM) and storage (disks and flash) components. This decades-long model is being enriched by a new tier of memory, persistent memory, which offers a single-level, flat data storage model by incorporating the persistence property of storage and the fast load/store access of memory [1, 4, 30, 24]. Persistent memory is enabled by attaching byte-addressable nonvolatile memories (NVRAMs), such as spin-transfer torque RAM [51], phase-change memory [44], resistive RAM [7], battery-backed DRAM [6], and 3D XPoint memory [17], directly to the processor-memory bus. As NVRAMs are expected to be on the market in 2017 [17], persistent memory can radically change the landscape of software and hardware design in future computer systems.

Persistent memory is fundamentally different from traditional DRAM-based main memory or its NVRAM replacements, due to its persistence (a.k.a. crash consistency) property inherited from storage systems.
It needs to ensure the integrity of in-memory data despite system failures, such as system crashes and power outages [1, 46, 37, 33, 36, 20, 26, 27, 19, 25, 8]. This persistence property is not guaranteed by the memory consistency of traditional memory systems: memory consistency only ensures a consistent global view of the caches and main memory, while hardware can still reorder loads and stores within this hierarchy, and data in volatile caches can be lost during system failures. This leaves NVRAM in a partially updated, inconsistent state. Consequently, persistent memory systems need to carefully control the writes across the processor-NVRAM interface and ensure that the data in NVRAM, standing alone, is consistent [36, 46, 20].

One key data persistence scheme is logging, which maintains a historical record of critical data/metadata changes available for system recovery after failures. Software logging is widely used not only because it is compatible with most log-based legacy code, but also due to several benefits that logless schemes [45, 11, 34, 33] do not support well. First, flexible recovery options: a log allows systems to either recover only the information recorded immediately prior to a failure or roll back to older, recorded states. Second, less fragmented memory space: systems typically allocate logs in one or a few contiguous address ranges, leading to less memory fragmentation.

Yet, traditional software-driven logging schemes can impose substantial performance and energy overhead on persistent memory. These schemes require the processor pipeline to execute logging-related instructions in addition to application operations. They can also impose substantial ordering constraints on cache and memory controller operations: they typically employ memory barriers and cache flushes (or forced write-backs) to ensure that log updates are committed to NVRAM before the corresponding updates to the original data; otherwise, system failures can leave the original data partially updated without log protection [46, 10, 49]. Furthermore, adapting traditional logging mechanisms designed for block-level storage systems to be memory-friendly can substantially alter system software or library designs [46, 10, 45, 42, 9, 43, 1, 33]. These changes also introduce new programming models and complex APIs [46, 10, 43, 1]. As a result, previous research focused on reducing the software overhead of logging by introducing expensive nonvolatile on-chip buffers or caches to store persistent data [49, 47, 21]. Since NVRAM technologies are difficult to integrate with CMOS (traditional processor technology) on the same die, such designs can be costly to fabricate, if not unfeasible. In addition, on-chip nonvolatile buffers/caches can be susceptible to persistent data overflow issues [49], which render the persistence optimization mechanisms unusable. We motivate our design by discussing the background of logging and transactions in persistent memory, the benefit of undo+redo logging, and the inefficiencies of traditional schemes.

2.1 Logging and Transactions

Logging is an important class of data persistence schemes used in persistent memory [46, 10, 20]. It maintains a record of changes to critical data in a reserved NVRAM region separate from the original data structures. The log can store undo (old data values) and redo (new data values) information. In the case where system failures leave the original data in NVRAM in a partially updated, inconsistent state, the data can be recovered to a consistent state in an old version (using the undo log) or a new version (using the redo log). Ensuring the persistence of continuous data updates is difficult, especially when manipulating complex data structures with fine-grained accesses. One way to ease this difficulty is to access persistent data through atomic, durable transactions. Similar to the traditional storage transaction concept, each persistent memory transaction makes a group of updates appear as one atomic unit when a system failure happens. Software can define the critical data that requires a persistence guarantee by including it in transactions. Transactions have been used as a powerful and convenient software interface in most databases and file systems [15, 41] and in many prior persistent memory designs [46, 10, 20, 49].

2.2 Persistent Memory Write-Order Control

Logging alone cannot secure persistent updates of transactions. CPU caches and memory controllers typically optimize system performance (e.g., to exploit data locality in cache and memory interface accesses) by reordering the writes issued by software. As a result, the order of software-issued log and original data updates in a transaction does not necessarily match the order in which they are written into NVRAM. In order to guarantee persistence, most logging-based persistent memory systems enforce write-order control in two ways. First, complete log updates in NVRAM before updating the original data. Otherwise, system failures can leave both the log and the original data partially updated and unrecoverable. Prior persistent memory systems typically employ memory barriers (e.g., mfence, sfence, and pcommit instructions [16]) between logging and original data updates to enforce this ordering. Second, flush original data updates into NVRAM. Undo-logging-based systems need to perform this cache flush before committing each transaction, in order to ensure that the data can be recovered to the new state of the committed transaction. Because logs typically have limited size, redo-logging-based systems need to flush the original data cache lines before recycling the associated log record, to avoid losing persistent data. Both memory barriers and cache flushes can degrade system performance by canceling out the performance optimization provided by the reordering in caches and memory controllers, as shown by our experiments (§6).
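As a concrete illustration of these two ordering requirements, the sketch below shows the kind of write-order control a software undo-logging store has to perform on an x86 machine with CLWB support. This is not the paper's code: the UndoEntry layout and the persistent_store helper are hypothetical, and only the clwb/sfence ordering pattern between the log write and the data write is the point.

    #include <immintrin.h>
    #include <cstdint>

    // Hypothetical undo-log entry: address and old value of one word.
    struct UndoEntry {
        uint64_t addr;
        uint64_t old_val;
    };

    // Update one persistent word with software undo logging.
    // The log entry must become durable in NVRAM *before* the data update,
    // hence the clwb + sfence pair between the two writes.
    void persistent_store(uint64_t* obj, uint64_t new_val, UndoEntry* log_slot) {
        log_slot->addr    = reinterpret_cast<uint64_t>(obj);
        log_slot->old_val = *obj;        // undo information (old value)
        _mm_clwb(log_slot);              // push the log entry toward NVRAM
        _mm_sfence();                    // order: log before original data
        *obj = new_val;                  // original data update
        _mm_clwb(obj);                   // needed before the transaction commits
        _mm_sfence();
    }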
2.3 Relaxing Ordering Constraints

Most previous work employs either undo [20] or redo [46, 10] logging in persistent memory, but not both, due to the high performance and implementation costs of maintaining the logging. Yet undo and redo logging provide different benefits. Undo logging allows updates to original data cache lines to steal their way into NVRAM before all log updates of a transaction complete: once the undo information of a piece of data (e.g., the old value of a cache line) is written into NVRAM, that piece of data is recoverable; the CPU can issue the corresponding original data updates without waiting for the completion of all log updates in the transaction, and the caches can naturally evict the new updates using traditional cache eviction policies.

Redo logging maintains the latest states of transactions, while relaxing the cache force-write-back constraints of persistent memory ("no-force"): with the redo log stored in NVRAM, dirty (updated) cache lines can remain in the caches for as long as traditional cache eviction policies determine; without cache line flushes, the transaction can still commit and be recovered after system failures using the redo log.

(a) A transaction with software-based uncacheable (UC) logging:
    Tx_begin(TxID)
    do some reads
    do some computation
    UC_log(addr(A), new_val(A), old_val(A))
        micro-ops: load A1, load A2, ...
                   store log_A1, store log_A2, ...
    Memory_barrier
    Write A
    clwb            // may be delayed
    Tx_commit

(b) The transaction code with our design:
    Tx_begin(TxID)
    do some reads
    do some computation
    Write A         // triggers hardware-based logging;
                    // the memory barrier gets a free ride
    Tx_commit

Figure 1: Example code that updates an object A in a transaction. (a) A transaction with software-based uncacheable (UC) logging that supports steal/no-force. (b) The transaction code with our design.

2.4 Undo+Redo with Software-based Logging

Undo+redo logging has been used in many traditional database systems stored on disk/flash [32, 15]. But it may appear too costly to adopt in memory systems, which provide much higher access bandwidth than disk/flash. Figure 1(a) shows example code for a transaction that updates a data object A protected by software-based logging. The memory barrier between the logging and the original data update (Write A) ensures that the persistent memory state is protected by completed logging before the original data can be updated. Undo logging consists of a set of load and store instructions that read the old values of the data and then write them out to the log in NVRAM; redo logging consists of the store instructions that write the new values out to the log. As shown in the figure, the logging operations not only increase memory read/write traffic, but also introduce extra load and store instructions in the CPU pipeline. Worse yet, these instructions can block subsequent instruction execution in the pipeline due to the memory barrier instruction between the logging and the original data updates. Due to such performance, energy, and implementation overhead, software-based logging is unfeasible for implementing undo+redo logging.

2.5 Software-based Logging and Multithreading

Concurrent workload execution can further complicate software-based logging in persistent memory. Even if persistent memory software issues clflush or clwb instructions in each transaction, a context switch by the OS can occur before a clwb instruction is executed. Such context switches interrupt the control flow of transactions/epochs and divert the program to other threads, which reintroduces the risk of prematurely overwriting log records in NVRAM. Implementing per-thread circular log buffers can mitigate this risk, but doing so introduces new persistent memory APIs and complicates recovery. Such risks and software implementation complexities further increase the cost of software-based logging.

3 Our Design

Our hardware-driven logging design also enables a progressive cache force-write-back scheme, in which we invoke force-write-backs on individual cache lines only when necessary.
Furthermore, by invoking force-write-backs in hardware, we can address the persistence risk in multithreading scenarios. This section describes our design principles; detailed implementation methods and the required software support are described in §4.

(Figure 2 assumes an update to a data object A, where the old value of A consists of word-sized values {A1, A2, ...}, the undo information, and the new value consists of {A1', A2', ...}, the redo information. Log entries are 64B plus a torn bit, organized as a circular buffer in NVRAM with head and tail pointers and fed through a volatile FIFO log buffer.)

Figure 2: Overview of the proposed hardware-driven logging in persistent memory. (a) Architecture overview (NVRAM components are shaded) and log record format. Hardware-driven logging (b) on a write hit in the L1 cache and (c) on a write miss in the L1 cache.

3.1 Architecture and System Design Overview

Figure 2(a) depicts an overview of our processor and memory architecture. The figure also shows the log structure that is stored in NVRAM. We assume that all storage components in the processor are volatile, that the caches adopt a write-back caching scheme, and that main memory consists of DRAM and NVRAM, both deployed on the processor-memory bus with separate memory controllers [46, 49].

Failure Model. The data in DRAM or caches is lost across system reboots, while data in NVRAM remains. Our design focuses on maintaining the persistence of user-defined critical data stored in NVRAM. After failures, the system can recover critical data structures by replaying the log in NVRAM. DRAM can be used to store data without persistence requirements, such as stacks and data transfer buffers [46, 49].

Persistent Memory Transactions. Similar to most prior persistent memory designs [46, 20], we employ persistent memory transactions (§2.1) as a software abstraction to enforce the persistence of critical data. The writes included in persistent memory transactions are the ones that require a persistence guarantee. Figure 1 illustrates simple code examples of a transaction implemented with traditional logging-based persistent memory transactions (Figure 1(a)) and with our design (Figure 1(b); the detailed software interface design is discussed in §4). The transaction defines object A as a piece of critical data that needs a persistence guarantee. Compared with traditional logging-based persistent memory transactions, our transactions eliminate logging functions, cache force-write-back instructions, and memory barrier instructions.

An Uncacheable Log in NVRAM. We employ a single-consumer/single-producer Lamport circular buffer [22] as the log. It can be allocated and truncated by system software (§4). Our hardware mechanisms perform the log appends. We adopt a circular buffer because it allows simultaneous appends and truncates without locking [22, 46]. As illustrated in Figure 2(a), each log record maintains the undo and redo information of a single data object (e.g., A). Each entry in a log record stores either a) a record header or b) word-sized undo/redo values.
A record header in the log indicates the starting point of a log record; it consists of a 128-bit transaction ID, the 64-bit address of the data object, and a special character (which can be reserved through our software interface) indicating that the entry is a header. We also adopt a torn bit per log entry to indicate the completion of each log update [46]. Torn bits have the same value for all writes in one pass over the log, and the value reverses when a log entry is overwritten in the next pass. Thus, completely written log entries have the same torn bit value, while incomplete entries have mixed values [46]. The log is shared by all threads of an application to simplify system recovery. The log needs to accommodate all the write requests of undo+redo logging, but the lifetime of this NVRAM region is not an issue, as we discuss in §7. The log is typically only used during system recovery, hence it is unlikely to be reused during application execution. In addition, forcing log updates from caches to NVRAM in store order would require cache flushes and memory barriers; therefore, we make the log uncacheable. This is in line with most prior works, which directly write log updates into the write-combining buffer (WCB) of commodity processors [46, 50], which coalesces multiple stores to the same cache line.

An Optional Volatile Log Buffer in the Processor. To improve the performance of log updates written to NVRAM, we provide an optional log buffer (a volatile FIFO, similar to the WCB) in the memory controller to buffer and coalesce log updates. Note that the log buffer is not required for ensuring data persistence; it is only a performance optimization. §6 evaluates system performance across various log buffer sizes.
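To make the log layout concrete, the sketch below writes it down as C++ structures. This is an illustration, not the hardware format: the field names, the pairing of one undo word with one redo word per value entry, and the software-visible CircularLog wrapper are assumptions; only the header contents, the torn bit, and the head/tail pointers come from the description above.

    #include <cstdint>

    // Illustrative layouts only; not the exact on-NVRAM encoding.
    struct LogRecordHeader {           // starts a log record (one per data object)
        uint64_t tx_id_lo, tx_id_hi;   // transaction ID (128 bits in the text)
        uint64_t data_addr;            // 64-bit address of the data object
        uint8_t  header_indicator;     // reserved special character
    };

    struct LogValueEntry {             // one word of the object's undo/redo data
        uint64_t undo_word;            // old value (for rollback)
        uint64_t redo_word;            // new value (for replay)
    };

    // Every entry, header or value, also carries a torn bit that flips on each
    // pass over the circular log, marking incompletely written entries [46].
    struct CircularLog {               // allocated/truncated by system software (§4)
        void*    entries;              // contiguous NVRAM region
        uint64_t capacity_bytes;
        uint64_t head;                 // mirrored in a 64-bit special register
        uint64_t tail;                 // mirrored in a 64-bit special register
    };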

Natural Ordering Guarantee on Data and Log Updates. Our design does not require write-order control mechanisms (e.g., cache flushes and memory barriers) to enforce the order between original data and the corresponding log updates. This order is naturally enforced by our architecture design. Without the log buffer, the uncacheable writing of the log guarantees that log updates arrive at NVRAM earlier than the corresponding original data updates (which need to traverse the cache hierarchy); no memory barrier is needed. With the log buffer, we need to carefully configure the log buffer size so that we can still enforce this guarantee without memory barriers. §4.3 discusses the bound on the log buffer size needed to guarantee data persistence.

3.2 Enabling Undo+Redo Logging with Caches

The goal of our logging mechanism is to enable feasible undo+redo logging of persistent data, which relaxes the ordering constraints on caching to a point that neither undo nor redo logging alone can achieve. To this end, our logging mechanism leverages information naturally available in the commodity cache hierarchy, without the performance overhead of unnecessary data movement or of executing logging, cache flush, and memory barrier instructions in the CPU pipeline. Commodity processors employ a write-back, write-allocate policy in the cache hierarchy [35]. On a write hit, the cache hierarchy only updates the cache line in the hitting level, with old values remaining in lower levels; a dirty bit in the cache tag indicates this inconsistency across cache levels. On a write miss, the write-allocate (also called fetch-on-write) policy makes the cache hierarchy first load the missing cache line from a lower level, followed by a write hit operation. In order to enable efficient steal/no-force support in persistent memory, we propose a hardware-driven logging mechanism that leverages this write-back, write-allocate scheme. In our design, a write in a transaction triggers hardware-based log appends that record both redo and undo information in NVRAM (as shown in Figure 1(b)). We acquire the redo information from the write operation itself, which is currently in flight. We acquire the undo information from the cache line where the current write will be placed. If the write request hits in the L1 cache, we read the old value before overwriting the cache line and use that for the undo logging. If the write request misses in the L1 cache, we acquire the undo information (the old value) from the cache line that will be write-allocated by the cache hierarchy. The log buffer, if adopted, handles log entries (transaction ID, data object address, and undo and redo cache line values) in a FIFO manner. The entries are written out to NVRAM using the head and tail pointer information, which is stored in special registers (§4).

3.3 Progressive Cache Force-Write-Back

It may appear that writes are persistent once their logs are written into NVRAM; indeed, our design allows us to commit a transaction once our hardware-driven logging of that transaction completes. Yet this alone does not secure data persistence, due to the circular-buffering nature of the log in NVRAM (§2.2).
Inserting force-write-back instructions (such as clflush and clwb) in transactions can impose substantial performance overhead (§2.2) and complicates data persistence support in multithreading scenarios (§2.5). In order to guarantee persistence in multithreaded applications while eliminating the need for force-write-back instructions, we design a hardware-based, progressive force-write-back mechanism that only forces certain original data blocks out of the caches when necessary. To this end, we introduce a force-write-back bit (fwb) into the cache tag of each cache line. With the fwb bit and the dirty bit already present in traditional cache tags, we maintain a finite state machine for each cache block (§4.4). The dirty bit is maintained by traditional caching schemes: a cache line update sets the bit and a cache eviction (write-back) resets it. The fwb bit is maintained by the cache controllers, which periodically scan all cache tags. Over every two scans, the cache controllers set the fwb bit of the dirty cache lines in one scan, and force-write-back all cache lines with {fwb, dirty} = {1,1} in the other scan. If the dirty bit of a cache line changes between the two scans, the cache line is not force-written-back later on. Our mechanism performs force-write-backs in a hardware-controlled manner, decoupled from software multithreading mechanisms. As such, it is robust to interruption by software context switches. The frequency of these force-write-backs can vary, but force-write-backs need to happen faster than the rate at which the log wraps around and overwrites its first entry. In fact, we can determine the force-write-back frequency (associated with the scanning frequency) based on the log size and the NVRAM write bandwidth, as we discuss in §4.4. Our evaluation shows that the frequency can be as low as once every three million cycles (§6).

3.4 A Free Ride for Transaction Commits

Most previous designs require software or hardware memory barriers (and/or cache flushes) at transaction commits to enforce the order of writing log updates (or persistent data) into NVRAM across consecutive transactions [49, 25]. Instead, our design gives transaction commits a free ride: no actual instructions need to be executed in the CPU pipeline. In the log, we identify the commit of a transaction by the presence of the log record header of its immediately subsequent transaction. Our mechanisms also naturally enforce the order of intra- and inter-transaction log updates: we issue log updates in the order of the writes to the corresponding original data, and we write the log updates into NVRAM in the order in which they are issued (the log buffer is a FIFO, so it does not affect this order). As such, log updates of subsequent transactions can only be written into NVRAM after the current log updates are written.

3.5 Putting It All Together

Figure 2(b) and (c) illustrate a high-level summary of how our hardware-driven logging works. Hardware treats all writes encompassed in transactions, e.g., Write A in the transaction delimited by tx_begin and tx_commit in Figure 1(b), as persistent writes. Those writes automatically trigger the logging mechanisms described in the sections above. Note that the log updates are sent directly to the WCB or NVRAM if the system does not adopt the log buffer. The processor core sends the writes of a data object A (a variable or other data structure), which consist of the new values of one or multiple cache lines {A1', A2', ...}, to the L1 cache. Upon updating an L1 cache line (e.g., from an old value A1 to a new value A1'), the mechanisms work together as follows:

1. Update the redo information of the cache line (➊). (a) If the update is the first cache line update of the data object A, the CPU core, which has the transaction ID and the address of A, writes a log record header into the log buffer. (b) Otherwise, the L1 cache controller writes the new value of the cache line (e.g., A1') into the log buffer.

2. Obtain the undo information of the cache line (➋). This step can be performed in parallel with Step 1. (a) If the cache line write request hits in the L1 cache (Figure 2(b)), the L1 cache controller extracts the old value (e.g., A1) directly from the cache line before writing it. Rather than using an additional read instruction, the L1 cache controller reads the old value from the hitting line through the L1 cache read port and writes it into the log buffer in Step 3. Note that the read port is typically not used when servicing write requests, so it would otherwise be idle. (b) If the write request misses in the L1 cache (Figure 2(c)), the cache hierarchy naturally performs a write-allocate on that cache line. The cache controller of the lower-level cache that owns that cache line extracts the old value (e.g., A1) and sends it to the log buffer in Step 3.

3. Update the undo information of the cache line: the cache controller writes the old value of the cache line (e.g., A1) to the log buffer (➌).

4. The L1 cache controller can now update the cache line in the L1 cache (➍). The cache line can later be evicted by the natural cache eviction policy without being subject to data persistence constraints. In addition, our log buffer is small enough to guarantee that log updates traverse the log buffer faster than a cache line traverses the cache hierarchy (§4.3). Therefore, this step can be performed without waiting for the corresponding log entries to arrive at NVRAM.

5. The memory controller evicts the log buffer entries to NVRAM in a first-in-first-out manner (➍). This step is independent of the other steps.

6. Repeat Steps 1(b) through 5 if the data object A consists of multiple cache line writes. The log buffer also coalesces the log updates of any writes to the same cache line.

7. After the log updates of all the writes in the transaction have been issued, the transaction can commit (➎).

8. Original data object updates can remain in the caches until they are written back to NVRAM by a natural eviction or by our progressive cache force-write-backs.
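The sketch below condenses this per-write flow into plain C++ so the data movement is easier to follow. It is a software model, not RTL: LogBuffer stands in for the volatile FIFO in the memory controller, and the function arguments (in particular how the old word reaches the L1 controller on a hit versus a miss) are simplifications of the behavior described in Steps 1 through 4.

    #include <cstdint>
    #include <deque>

    struct LogBuffer {
        std::deque<uint64_t> fifo;             // head/tail come from special registers
        void push(uint64_t w) { fifo.push_back(w); }
    };

    // Invoked by the L1 cache controller for each store inside a transaction.
    // old_word is the pre-write value: on an L1 hit it is read from the hitting
    // line via the otherwise-idle read port (Step 2a); on a miss it arrives with
    // the write-allocated line from a lower-level cache (Step 2b).
    void log_persistent_write(LogBuffer& log, uint64_t tx_id, uint64_t obj_addr,
                              bool first_write_of_object,
                              uint64_t new_word, uint64_t old_word,
                              uint64_t& cached_word) {
        if (first_write_of_object) {           // Step 1a: record header
            log.push(tx_id);
            log.push(obj_addr);
        }
        log.push(new_word);                    // Step 1b: redo info (in-flight value)
        log.push(old_word);                    // Step 3:  undo info (old value)
        cached_word = new_word;                // Step 4:  update the L1 line itself
        // Step 5: the memory controller drains the FIFO to the NVRAM log
        // independently; commit (Step 7) needs no extra instructions.
    }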

void persistent_update( int threadid ) {
    tx_begin( threadid );
    // Persistent data updates
    write A[threadid];
    tx_commit();
}
// ...
int main() {
    // Executes one persistent
    // transaction per thread
    for ( int i = 0; i < nthreads; i++ )
        thread t( persistent_update, i );
}

Figure 3: Pseudocode showing an example use of the tx_begin and tx_commit functions, in which the thread ID is assigned as the transaction ID to perform one persistent transaction per thread.

4 Implementation

In this section, we describe the implementation details of our design and its hardware overhead (the impact of the hardware overhead, NVRAM space consumption, and lifetime/endurance are discussed in §7).

4.1 Software Support

Our design employs software support for defining persistent memory transactions, allocating and truncating the circular log in NVRAM, and reserving a special character as the log header indicator.

Transaction Interface. We employ a pair of transaction functions, tx_begin( txid ) and tx_commit(), to define transactions that encompass the writes in the program that need persistence support. The txid argument provides the transaction ID information used by our hardware-based logging scheme; this ID is used to group writes from the same transaction together. Such a transaction interface has been used by many previous persistent memory designs [49, 10]. Figure 3 shows a multithreaded example in pseudocode with these transaction functions in use.

System Library Functions that Maintain the Log. Our hardware mechanisms perform the log updates, whereas the log structure itself is maintained by system software. In particular, we employ a pair of system library functions, log_create() and log_truncate() (similar to the ones used in prior work [46]), to allocate and truncate the log, respectively. The memory controller obtains the log maintenance information by reading special registers (§4.2) that hold the head and tail pointers of the log. Furthermore, a single transaction that exceeds the originally allocated log size could corrupt persistent data. We provide two options to prevent such overflows: 1) the log_create() function allocates a large-enough log by taking a hint of the maximum transaction size from the program interface (e.g., #define MAX_TX_SIZE N); 2) an additional library function, log_grow(), allocates additional log regions when the log is filled by an uncommitted transaction.

4.2 Special Registers

The txid argument provided through the tx_begin() function is translated into an 8-bit unsigned integer, a physical transaction ID, that is stored in a special register in the processor. Because the transaction IDs are only used to group the writes of the same transaction, we can simply pick a not-in-use physical transaction ID to represent a newly received txid. An 8-bit ID can accommodate 256 unique active persistent memory transactions at a time, and a physical transaction ID can be reused after its transaction commits. We also employ two 64-bit special registers to store the head and tail pointers of the log. The system library initializes the pointer values when allocating the log with log_create(). During log updates, the pointers can be updated by the memory controller and by the log_truncate() function. If log_grow() is used, we employ additional registers to store the head and tail pointers of the newly allocated log regions and an indicator of the active log region.
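As a concrete illustration of option 1 above, the fragment below shows how a program might size the log from a maximum-transaction-size hint. The function prototypes and the hint value are hypothetical; the text names log_create(), log_truncate(), and log_grow() but does not give their signatures.

    #include <cstddef>

    // Hypothetical prototypes; the text names these functions but not their signatures.
    void* log_create(size_t log_bytes);   // allocate the circular log in NVRAM
    void  log_truncate(void* log);        // reclaim committed log entries
    void  log_grow(void* log);            // extend the log for an oversized transaction

    #define MAX_TX_SIZE (4u << 20)        // program-supplied hint; value is illustrative

    void init_persistent_log() {
        // Option 1: size the log from the maximum-transaction hint so that a single
        // uncommitted transaction cannot wrap around and overwrite its own entries.
        void* log = log_create(2 * (size_t)MAX_TX_SIZE);
        (void)log;   // head/tail pointers are placed in the special registers of §4.2
    }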

Figure 4: The finite-state machine in the cache controller for implementing the progressive cache write-back. Each cache line moves between the IDLE ({fwb, dirty} = {0,0}), FLAG ({0,1}), and FWB ({1,1}) states on cache line writes, write-backs, force-write-backs, and tag scans.

4.3 Determining the Log Buffer Size

We employ a log buffer in the memory controller to improve the write performance of log updates in NVRAM. Data persistence requires that a log record arrive at NVRAM before the corresponding cache line (original data) update. To ensure this ordering, we bound the log buffer size such that the time for an undo+redo log record to traverse the log buffer is shorter than the minimum time for a cache line to traverse the cache hierarchy, i.e., the time for an L1 cache line to be evicted directly all the way out of the last-level cache. As such, the maximum allowable log buffer size is 15 cache-line-sized entries based on our processor configuration (Table 2); each log entry also employs two additional bits, one indicating whether the entry stores a record header and the other being a torn bit [46]. Further increasing the number of log buffer entries can improve the performance of log updates until we hit the write bandwidth provided by NVRAM, as we demonstrate in §6, but risks losing data persistence.

4.4 Cache Modifications

To implement our progressive cache write-back scheme, we add one fwb bit to the tag of each cache line, alongside the dirty bit that exists in conventional cache implementations. We also maintain three states, IDLE, FLAG, and FWB, for each cache block using these tag bits.

Cache Block State Transition. Figure 4 shows the finite-state machine that we implement in the cache controller of each level. When an application starts to execute, the cache controllers initialize (reset) each cache line to the IDLE state by setting the fwb bit to 0 (conventional cache implementations also initialize the dirty and valid bits to 0). During application execution, existing cache controller implementations set the dirty and valid bits to 1 whenever a cache line is written, and set the dirty bit back to 0 after the cache line is written back to the lower level. To implement our state machine, the cache controllers also periodically scan the tag of each cache line and act as follows:

- A cache line with {fwb, dirty} = {0,0} is in the IDLE state; the cache controller does nothing to such lines.
- A cache line with {fwb, dirty} = {0,1} is in the FLAG state; the cache controller sets fwb to 1. This indicates that the cache line needs to be force-written-back during the next scanning iteration, in case it still remains in the cache.
- A cache line with {fwb, dirty} = {1,1} is in the FWB state; the cache controller force-writes-back this line. After the force-write-back, the cache controller changes the line back to the IDLE state by resetting {fwb, dirty} = {0,0}.

If a cache line is evicted out of the cache at any point, the cache controller resets that cache line to IDLE.
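A behavioral sketch of this state machine in C++ follows. It models one periodic tag scan over a set of {fwb, dirty} tag bits; the container and the write-back hook are illustrative stand-ins for the cache controller logic, not a hardware description.

    #include <vector>

    struct TagBits {
        bool fwb   = false;   // force-write-back bit added by our design
        bool dirty = false;   // maintained by the conventional cache
    };

    void force_write_back_line(TagBits& line);   // stand-in: issue the write-back

    // One periodic tag scan by a cache controller (Figure 4).
    void scan_tags(std::vector<TagBits>& lines) {
        for (TagBits& l : lines) {
            if (l.fwb && l.dirty) {          // FWB state: second scan, still dirty
                force_write_back_line(l);
                l.fwb = false; l.dirty = false;   // back to IDLE
            } else if (!l.fwb && l.dirty) {  // FLAG state: first scan sees it dirty
                l.fwb = true;                     // write back on the next scan if needed
            } else {                         // IDLE, or written back since it was flagged
                l.fwb = false;
            }
        }
    }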
Determining the Cache Force-Write-Back Frequency. The tag scanning frequency determines the frequency of our cache force-write-back operations. The force-write-backs (i.e., the scans) need to be performed frequently enough to ensure that the original data is written back to NVRAM before the corresponding log records are overwritten by newer updates. The more frequently write requests are serviced, the more frequently the log is overwritten; and the larger the log, the less frequently it is overwritten. As such, the scanning frequency can be determined by the maximum log update frequency (bounded by the NVRAM write bandwidth, because applications cannot write to NVRAM faster than its bandwidth) and the log size (see the sensitivity study in §6).
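The back-of-the-envelope calculation below captures this relationship. It is our reading of the constraint rather than the paper's formula: it assumes the log can wrap no faster than (log size) / (NVRAM write bandwidth) and that a flagged line needs two scans before it is force-written-back; the example bandwidth figure is assumed.

    // Upper bound on the tag-scan period that still keeps original data ahead
    // of log wrap-around (sketch, see assumptions above).
    double max_scan_period_seconds(double log_size_bytes,
                                   double nvram_write_bw_bytes_per_sec) {
        double min_wrap_time = log_size_bytes / nvram_write_bw_bytes_per_sec;
        return min_wrap_time / 2.0;    // two scans per force-write-back
    }
    // Example: a 4 MB log and a write bandwidth of a few GB/s (assumed) give a
    // period on the order of a millisecond, i.e., millions of cycles at 2.5 GHz,
    // consistent with the roughly three-million-cycle interval reported in §6.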

4.5 Summary of Hardware Overhead

Table 1 summarizes the hardware overhead of our mechanisms in the processor. Note that these values may vary depending on the native processor and ISA. Our implementation assumes a 64-bit machine, which is why the circular log head and tail pointers are 8 bytes each; only half of these bytes are required on a 32-bit machine. The size of the log buffer varies with the size of the cache line. The size of the overhead needed for the fwb state varies with the total number of cache lines across all levels of cache; our value in the table was computed based on the specifications of the system caches described in §5. Also note that these are the major logic components that make up on-chip storage. Additional gates for logic operations are also necessary to implement the design mechanisms described in the previous sections. However, the simplicity of the design means the gates used are primarily small and medium-sized gates, on the same complexity level as a multiplexer or decoder. Getting the exact impact and count of these gates would require further, lower-level study of this design; we discuss this further in §7.

Mechanism                     Logic Type    Size
Transaction ID register       flip-flops    1 Byte
Log head pointer register     flip-flops    8 Bytes
Log tail pointer register     flip-flops    8 Bytes
Log buffer (optional)         SRAM          964 Bytes
Fwb tag bit                   SRAM          768 Bytes

Table 1: Summary of major hardware overhead.

4.6 Recovery

We outline the steps for recovering the critical in-memory data in systems that adopt our design.

Step 1: Following a power failure, the first step is to obtain the head and tail pointers of the log in NVRAM. These pointers are part of the log structure, and they allow the system to correctly order the log entries. We employ only one circular buffer as the log of all transactions of all threads, so a single reference suffices to look up these pointers.

Step 2: The system recovery handler fetches log entries from NVRAM and uses the address, old value, and new value fields to generate writes to NVRAM at the specified addresses. We identify which writes did not commit by going backwards from the tail pointer; each log entry that does not match its value in NVRAM is considered a non-committed entry. The address stored with each entry corresponds to the address of the data member in the original persistent data structure. Aside from the head and tail pointers, we use the torn bit to correctly order these writes [46].

Step 3: The generated writes go directly to NVRAM and do not use the cache hierarchy. Because the caches are volatile, their state is reset after the failure, and all the writes generated during recovery are persistent, so they can bypass the caches without issue.

Step 4: With each persistent write generated from the circular log, the head and tail pointers are updated accordingly. After all the updates from the log have been redone (or undone), the head and tail pointers of the log point to the same location, and the entries are invalidated.
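A simplified software rendering of this recovery walk is sketched below. It is not the paper's recovery handler: the entry layout reuses the illustrative structures from earlier sketches, torn-bit ordering is omitted, and the rule for detecting non-committed entries is a paraphrase of Step 2.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // NvramImage is a stand-in for direct, cache-bypassing NVRAM access.
    struct LogEntry { bool is_header; uint64_t addr, old_val, new_val; };
    using NvramImage = std::unordered_map<uint64_t, uint64_t>;   // addr -> word

    void recover(const std::vector<LogEntry>& log, size_t head, size_t tail,
                 NvramImage& nvram) {
        // Step 1: head and tail were read from the log structure in NVRAM.
        // Steps 2/3: walk backwards from the tail; an entry whose new value is
        // not yet in NVRAM is treated as non-committed and rolled back with its
        // old value, using writes that bypass the (volatile, now-empty) caches.
        size_t i = tail;
        while (i != head) {
            i = (i == 0) ? log.size() - 1 : i - 1;
            const LogEntry& e = log[i];
            if (e.is_header) continue;            // headers carry no data words
            if (nvram[e.addr] != e.new_val)
                nvram[e.addr] = e.old_val;
        }
        // Step 4: reset head == tail and invalidate the replayed entries (omitted).
    }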
5 Experimental Setup

We evaluate our design by implementing our mechanisms in McSimA+ [2], a Pin-based [28] cycle-level multi-core simulator. The simulator is configured to model a multi-core out-of-order processor and an NVRAM DIMM as described in Table 2. We adopt phase-change memory parameters for the NVRAM DIMM [23]. Because all of the performance numbers shown in §6 are relative, the same observations should hold across various NVRAM latencies and access energies. Our work focuses on improving persistent memory access, so we do not evaluate DRAM accesses in our experiments. We evaluate both microbenchmarks and real workloads. The microbenchmarks repeatedly update persistent memory storing various data structures, including a hash table, a red-black tree, an array, a B+tree, and a graph. These are data structures widely used in applications such as databases and file systems [10]. Table 3 describes the details of these benchmarks. Our experiments use several versions of each benchmark and vary the data type between integers and strings. Data structures with integer elements pack less data (smaller than a cache line) per element, whereas those with strings span multiple cache lines per element and allow us to explore the more complex scenarios found in real-world applications. We compile these benchmarks to native x86 and run them on the McSimA+ simulator. We simulate both single-threaded and multithreaded executions of each benchmark. In addition, we evaluate a persistent version of memcached [29], a real-world in-memory workload studied in prior work [10].
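For concreteness, the loop below sketches what an SPS-style microbenchmark transaction looks like when written against the tx_begin/tx_commit interface of §4.1. The element type, random-index selection, and swap count are illustrative; they are not taken from the benchmark sources listed in Table 3.

    #include <cstdint>
    #include <cstdlib>

    extern void tx_begin(int txid);   // provided by the proposed interface (§4.1)
    extern void tx_commit();

    void sps_like(uint64_t* vec, size_t n, size_t swaps, int threadid) {
        for (size_t s = 0; s < swaps; ++s) {
            size_t i = std::rand() % n, j = std::rand() % n;
            tx_begin(threadid);       // the writes below become persistent writes
            uint64_t tmp = vec[i];
            vec[i] = vec[j];          // hardware logging is triggered automatically
            vec[j] = tmp;
            tx_commit();              // commits get a "free ride" (§3.4)
        }
    }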

Processor            Similar to Intel Core i7 / 22 nm
Cores                4 cores, 2.5GHz, 2 threads/core
IL1 Cache            32KB, 8-way set-associative, 64B cache lines, 1.6ns latency
DL1 Cache            32KB, 8-way set-associative, 64B cache lines, 1.6ns latency
L2 Cache             8MB, 16-way set-associative, 64B cache lines, 4.4ns latency
Memory Controller    64-/64-entry read/write queues
NVRAM DIMM           8GB, 8 banks, 2KB row, 36ns row-buffer hit, 100/300ns read/write row-buffer conflict [23]
Power and Energy     Processor: 149W (peak); NVRAM: row buffer read (write): 0.93 (1.02) pJ/bit, array read (write): 2.47 (16.82) pJ/bit [23]

Table 2: Processor and memory configurations.

Name          Memory Footprint   Description
Hash [10]     256 MB             Searches for a value in an open-chain hash table. Insert if absent, remove if found.
RBTree [49]   256 MB             Searches for a value in a red-black tree. Insert if absent, remove if found.
SPS [49]      1 GB               Random swaps between entries in a 1 GB vector of values.
BTree [5]     256 MB             Searches for a value in a B+ tree. Insert if absent, remove if found.
SSCA2 [3]     16 MB              A transactional implementation of SSCA 2.2, performing several analyses of a large, scale-free graph.

Table 3: A list of evaluated benchmarks.

6 Results

To demonstrate the benefits of our design, we study instructions per cycle (IPC), operational throughput, NVRAM traffic reduction, and dynamic energy. The operational throughput measures the number of executed application-level operations (e.g., node inserts, deletes, swaps, etc.) per second.

6.1 Microbenchmark Results

In our microbenchmarks, the number of operations is proportional to the data structure size, listed as the memory footprint in Table 3. We observe that processor dynamic energy does not change significantly across workloads, so we only show our memory dynamic energy results. All results are speedups (higher is better) normalized to an unsafe baseline (dashed line), which performs logging but does not employ cache flushing or write-backs; this unsafe baseline does not attempt to guarantee data persistence. The threshold line in the figures shows the best case of the unsafe baseline (i.e., the best value achieved by employing redo or undo logging). Each cluster of bars (from left to right along the x-axis) shows the other three baselines (non-pers, which employs NVRAM as a working memory without any data persistence guarantee or logging; redo-clwb-base and undo-clwb-base, which issue clwb instructions to ensure persistence on top of the unsafe baseline), our logging mechanism with steal/no-force support (wal), and our progressive cache write-back on top of our logging mechanism (fwb), which is the full implementation of all our proposed mechanisms. Note that the non-pers bar may go beyond the displayed axis.

Figure 5: Operational throughput speedup of microbenchmarks (normalized to the unsafe baseline, dashed line).

Figure 6: IPC speedup of microbenchmarks (normalized to the unsafe baseline, dashed line).

Figure 7: Dynamic energy reduction across all microbenchmarks (normalized to the unsafe baseline, dashed line).

This configuration can yield an ideal yet unachievable performance target for persistent memory systems [49]. We execute our microbenchmarks with one, two, and four threads on one, two, and four cores, respectively; the corresponding sets of bars are labeled by appending -1c, -2c, and -4c to each benchmark name. We make the following major observations from our microbenchmark results.

Overall Performance and Energy Improvement. In general, wal alone already significantly improves IPC, dynamic energy consumption, system throughput, and write memory traffic. With all our proposed mechanisms, fwb extends the improvements in all measured metrics further. Our design improves IPC by up to 1.73x, with an average of 1.49x, compared with the unsafe baseline. Operational throughput is improved by up to 2.12x, with an average of 1.51x, over the unsafe baseline. Our design also significantly reduces NVRAM traffic, by an average factor of 1.79x against the unsafe baseline. Due to the memory traffic reduction, fwb also reduces dynamic memory energy consumption; the reduction can be as high as 2.38x over the baselines with clwb.

Figure 8: Memory write traffic reduction across all microbenchmarks (normalized to the unsafe baseline, dashed line).

Figure 9: Sensitivity studies of (a) system throughput with various log buffer sizes (showing the persistence constraint and the NVRAM write bandwidth limitation) and (b) cache force-write-back frequency with various NVRAM log sizes.

Figure 10: Single- and multithreaded memcached results on performance, energy, throughput, and memory write traffic (for one, two, and four threads).

The net decrease in memory accesses, shown in Figure 8, leads to a decrease in the average memory access latency (AMAL), which directly improves IPC and system throughput.

Performance Sensitivity to the Number of Cores. Our results map the change across the microbenchmarks as we increase the number of processor cores (and threads). They show that the IPC and throughput of wal appear insensitive to the number of cores. In all multi-core, multithreaded experimental cases, our design always yields significant improvement over all baselines.

Performance Sensitivity to Log Buffer Size. As discussed in §4.3, the log buffer size is bounded by the data persistence requirement: log updates need to arrive at NVRAM before the corresponding original data updates. This bound is 15 entries for our processor configuration. Larger log buffers do improve throughput, as we studied using the Hash benchmark (Figure 9(a)): an 8-entry log buffer improves system throughput by 10%, and our implementation with a 15-entry log buffer improves throughput by 18%. Further increasing the log buffer size, although it no longer guarantees data persistence, continues to improve system throughput until hitting the NVRAM write bandwidth limitation (64 entries for our NVRAM configuration). Note that the system throughput results with 128 and 256 entries are generated assuming infinite NVRAM write bandwidth.

The Relationship Between Force-Write-Back Frequency and Log Size. As discussed in §4.4, the force-write-back frequency is determined by the NVRAM write bandwidth and the log size. For a given NVRAM write bandwidth, we studied the relationship between the required force-write-back frequency and the log size. As shown in Figure 9(b), we only need to perform force-write-backs every three million cycles once we have a 4MB log.

The Impact of Application Complexity. Note, however, that results for the SSCA2 and BTree benchmarks do not follow the pattern of clear improvement over the baselines; rather, their metrics fall within 5-10% of the baselines. We attribute this to the fact that SSCA2 and BTree employ data structures whose complexities far outweigh the


More information

High Performance Transactions in Deuteronomy

High Performance Transactions in Deuteronomy High Performance Transactions in Deuteronomy Justin Levandoski, David Lomet, Sudipta Sengupta, Ryan Stutsman, and Rui Wang Microsoft Research Overview Deuteronomy: componentized DB stack Separates transaction,

More information

Michael Adler 2017/05

Michael Adler 2017/05 Michael Adler 2017/05 FPGA Design Philosophy Standard platform code should: Provide base services and semantics needed by all applications Consume as little FPGA area as possible FPGAs overcome a huge

More information

Chapter 8. Virtual Memory

Chapter 8. Virtual Memory Operating System Chapter 8. Virtual Memory Lynn Choi School of Electrical Engineering Motivated by Memory Hierarchy Principles of Locality Speed vs. size vs. cost tradeoff Locality principle Spatial Locality:

More information

Chapter 8: Virtual Memory. Operating System Concepts

Chapter 8: Virtual Memory. Operating System Concepts Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2009 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

Journaling. CS 161: Lecture 14 4/4/17

Journaling. CS 161: Lecture 14 4/4/17 Journaling CS 161: Lecture 14 4/4/17 In The Last Episode... FFS uses fsck to ensure that the file system is usable after a crash fsck makes a series of passes through the file system to ensure that metadata

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) ) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) Goal A Distributed Transaction We want a transaction that involves multiple nodes Review of transactions and their properties

More information

Using Memory-Style Storage to Support Fault Tolerance in Data Centers

Using Memory-Style Storage to Support Fault Tolerance in Data Centers Using Memory-Style Storage to Support Fault Tolerance in Data Centers Xiao Liu Qing Yi Jishen Zhao University of California at Santa Cruz, University of Colorado Colorado Springs {xiszishu,jishen.zhao}@ucsc.edu,

More information

Recovery Techniques. The System Failure Problem. Recovery Techniques and Assumptions. Failure Types

Recovery Techniques. The System Failure Problem. Recovery Techniques and Assumptions. Failure Types The System Failure Problem Recovery Techniques It is possible that the stable database: (because of buffering of blocks in main memory) contains values written by uncommitted transactions. does not contain

More information

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5 th Edition Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class. Today s Class. Faloutsos/Pavlo CMU /615

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class. Today s Class. Faloutsos/Pavlo CMU /615 Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos A. Pavlo Lecture#23: Crash Recovery Part 1 (R&G ch. 18) Last Class Basic Timestamp Ordering Optimistic Concurrency

More information

Recovery System These slides are a modified version of the slides of the book Database System Concepts (Chapter 17), 5th Ed McGraw-Hill by

Recovery System These slides are a modified version of the slides of the book Database System Concepts (Chapter 17), 5th Ed McGraw-Hill by Recovery System These slides are a modified version of the slides of the book Database System Concepts (Chapter 17), 5th Ed., McGraw-Hill, by Silberschatz, Korth and Sudarshan. Original slides are available

More information

CHAPTER 3 RECOVERY & CONCURRENCY ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI

CHAPTER 3 RECOVERY & CONCURRENCY ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI CHAPTER 3 RECOVERY & CONCURRENCY ADVANCED DATABASE SYSTEMS Assist. Prof. Dr. Volkan TUNALI PART 1 2 RECOVERY Topics 3 Introduction Transactions Transaction Log System Recovery Media Recovery Introduction

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) ) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) Goal A Distributed Transaction We want a transaction that involves multiple nodes Review of transactions and their properties

More information

CS Computer Architecture

CS Computer Architecture CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 An Example Implementation In principle, we could describe the control store in binary, 36 bits per word. We will use a simple symbolic

More information

Lecture 18: Reliable Storage

Lecture 18: Reliable Storage CS 422/522 Design & Implementation of Operating Systems Lecture 18: Reliable Storage Zhong Shao Dept. of Computer Science Yale University Acknowledgement: some slides are taken from previous versions of

More information

Introduction. Storage Failure Recovery Logging Undo Logging Redo Logging ARIES

Introduction. Storage Failure Recovery Logging Undo Logging Redo Logging ARIES Introduction Storage Failure Recovery Logging Undo Logging Redo Logging ARIES Volatile storage Main memory Cache memory Nonvolatile storage Stable storage Online (e.g. hard disk, solid state disk) Transaction

More information

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability Topics COS 318: Operating Systems File Performance and Reliability File buffer cache Disk failure and recovery tools Consistent updates Transactions and logging 2 File Buffer Cache for Performance What

More information

DHTM: Durable Hardware Transactional Memory

DHTM: Durable Hardware Transactional Memory DHTM: Durable Hardware Transactional Memory Arpit Joshi, Vijay Nagarajan, Marcelo Cintra, Stratis Viglas ISCA 2018 is here!2 is here!2 Systems LLC!3 Systems - Non-volatility over the memory bus - Load/Store

More information

Views of Memory. Real machines have limited amounts of memory. Programmer doesn t want to be bothered. 640KB? A few GB? (This laptop = 2GB)

Views of Memory. Real machines have limited amounts of memory. Programmer doesn t want to be bothered. 640KB? A few GB? (This laptop = 2GB) CS6290 Memory Views of Memory Real machines have limited amounts of memory 640KB? A few GB? (This laptop = 2GB) Programmer doesn t want to be bothered Do you think, oh, this computer only has 128MB so

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Operating Systems: Internals and Design Principles You re gonna need a bigger boat. Steven

More information

Virtual Memory - Overview. Programmers View. Virtual Physical. Virtual Physical. Program has its own virtual memory space.

Virtual Memory - Overview. Programmers View. Virtual Physical. Virtual Physical. Program has its own virtual memory space. Virtual Memory - Overview Programmers View Process runs in virtual (logical) space may be larger than physical. Paging can implement virtual. Which pages to have in? How much to allow each process? Program

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

Operating System Concepts

Operating System Concepts Chapter 9: Virtual-Memory Management 9.1 Silberschatz, Galvin and Gagne 2005 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Caching and reliability

Caching and reliability Caching and reliability Block cache Vs. Latency ~10 ns 1~ ms Access unit Byte (word) Sector Capacity Gigabytes Terabytes Price Expensive Cheap Caching disk contents in RAM Hit ratio h : probability of

More information

High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc.

High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc. High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc. Table of Contents Section I: The Need for Warm Standby...2 The Business Problem...2 Section II:

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

CS6401- Operating System UNIT-III STORAGE MANAGEMENT

CS6401- Operating System UNIT-III STORAGE MANAGEMENT UNIT-III STORAGE MANAGEMENT Memory Management: Background In general, to rum a program, it must be brought into memory. Input queue collection of processes on the disk that are waiting to be brought into

More information

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic

More information

Memory Management Outline. Operating Systems. Motivation. Paging Implementation. Accessing Invalid Pages. Performance of Demand Paging

Memory Management Outline. Operating Systems. Motivation. Paging Implementation. Accessing Invalid Pages. Performance of Demand Paging Memory Management Outline Operating Systems Processes (done) Memory Management Basic (done) Paging (done) Virtual memory Virtual Memory (Chapter.) Motivation Logical address space larger than physical

More information

Operating Systems. File Systems. Thomas Ropars.

Operating Systems. File Systems. Thomas Ropars. 1 Operating Systems File Systems Thomas Ropars thomas.ropars@univ-grenoble-alpes.fr 2017 2 References The content of these lectures is inspired by: The lecture notes of Prof. David Mazières. Operating

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging 6.823, L8--1 Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Highly-Associative

More information

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts Memory management Last modified: 26.04.2016 1 Contents Background Logical and physical address spaces; address binding Overlaying, swapping Contiguous Memory Allocation Segmentation Paging Structure of

More information

RISC-V Support for Persistent Memory Systems

RISC-V Support for Persistent Memory Systems RISC-V Support for Persistent Memory Systems Matheus Ogleari, Storage Architecture Engineer June 26, 2018 7/6/2018 Persistent Memory State-of-the-art, hybrid memory + storage properties Supported by hardware

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I

More information

CSE 120. Translation Lookaside Buffer (TLB) Implemented in Hardware. July 18, Day 5 Memory. Instructor: Neil Rhodes. Software TLB Management

CSE 120. Translation Lookaside Buffer (TLB) Implemented in Hardware. July 18, Day 5 Memory. Instructor: Neil Rhodes. Software TLB Management CSE 120 July 18, 2006 Day 5 Memory Instructor: Neil Rhodes Translation Lookaside Buffer (TLB) Implemented in Hardware Cache to map virtual page numbers to page frame Associative memory: HW looks up in

More information

First-In-First-Out (FIFO) Algorithm

First-In-First-Out (FIFO) Algorithm First-In-First-Out (FIFO) Algorithm Reference string: 7,0,1,2,0,3,0,4,2,3,0,3,0,3,2,1,2,0,1,7,0,1 3 frames (3 pages can be in memory at a time per process) 15 page faults Can vary by reference string:

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Basic Memory Management

Basic Memory Management Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester 10/15/14 CSC 2/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it

More information

SmartSaver: Turning Flash Drive into a Disk Energy Saver for Mobile Computers

SmartSaver: Turning Flash Drive into a Disk Energy Saver for Mobile Computers SmartSaver: Turning Flash Drive into a Disk Energy Saver for Mobile Computers Feng Chen 1 Song Jiang 2 Xiaodong Zhang 1 The Ohio State University, USA Wayne State University, USA Disks Cost High Energy

More information

Chapter 14: Recovery System

Chapter 14: Recovery System Chapter 14: Recovery System Chapter 14: Recovery System Failure Classification Storage Structure Recovery and Atomicity Log-Based Recovery Remote Backup Systems Failure Classification Transaction failure

More information

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) ) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) Transactions - Definition A transaction is a sequence of data operations with the following properties: * A Atomic All

More information

COSC3330 Computer Architecture Lecture 20. Virtual Memory

COSC3330 Computer Architecture Lecture 20. Virtual Memory COSC3330 Computer Architecture Lecture 20. Virtual Memory Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston Virtual Memory Topics Reducing Cache Miss Penalty (#2) Use

More information

CHAPTER 3 ASYNCHRONOUS PIPELINE CONTROLLER

CHAPTER 3 ASYNCHRONOUS PIPELINE CONTROLLER 84 CHAPTER 3 ASYNCHRONOUS PIPELINE CONTROLLER 3.1 INTRODUCTION The introduction of several new asynchronous designs which provides high throughput and low latency is the significance of this chapter. The

More information

Non-Volatile Memory Through Customized Key-Value Stores

Non-Volatile Memory Through Customized Key-Value Stores Non-Volatile Memory Through Customized Key-Value Stores Leonardo Mármol 1 Jorge Guerra 2 Marcos K. Aguilera 2 1 Florida International University 2 VMware L. Mármol, J. Guerra, M. K. Aguilera (FIU and VMware)

More information

EN1640: Design of Computing Systems Topic 06: Memory System

EN1640: Design of Computing Systems Topic 06: Memory System EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

OPERATING SYSTEM. Chapter 9: Virtual Memory

OPERATING SYSTEM. Chapter 9: Virtual Memory OPERATING SYSTEM Chapter 9: Virtual Memory Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory

More information

Recovery System These slides are a modified version of the slides of the book Database System Concepts (Chapter 17), 5th Ed

Recovery System These slides are a modified version of the slides of the book Database System Concepts (Chapter 17), 5th Ed Recovery System These slides are a modified version of the slides of the book Database System Concepts (Chapter 17), 5th Ed., McGraw-Hill, by Silberschatz, Korth and Sudarshan. Original slides are available

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

Faster and Durable Hash Indexes

Faster and Durable Hash Indexes Faster and Durable Hash Indexes Amit Kapila 2017.05.25 2013 EDB All rights reserved. 1 Contents Hash indexes Locking improvements Improved usage of space Code re-factoring to enable write-ahead-logging

More information

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Large and Fast: Exploiting Memory Hierarchy The Basic of Caches Measuring & Improving Cache Performance Virtual Memory A Common

More information

Crash Recovery CMPSCI 645. Gerome Miklau. Slide content adapted from Ramakrishnan & Gehrke

Crash Recovery CMPSCI 645. Gerome Miklau. Slide content adapted from Ramakrishnan & Gehrke Crash Recovery CMPSCI 645 Gerome Miklau Slide content adapted from Ramakrishnan & Gehrke 1 Review: the ACID Properties Database systems ensure the ACID properties: Atomicity: all operations of transaction

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Phase Change Memory An Architecture and Systems Perspective

Phase Change Memory An Architecture and Systems Perspective Phase Change Memory An Architecture and Systems Perspective Benjamin C. Lee Stanford University bcclee@stanford.edu Fall 2010, Assistant Professor @ Duke University Benjamin C. Lee 1 Memory Scaling density,

More information

ACID Properties. Transaction Management: Crash Recovery (Chap. 18), part 1. Motivation. Recovery Manager. Handling the Buffer Pool.

ACID Properties. Transaction Management: Crash Recovery (Chap. 18), part 1. Motivation. Recovery Manager. Handling the Buffer Pool. ACID Properties Transaction Management: Crash Recovery (Chap. 18), part 1 Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke CS634 Class 20, Apr 13, 2016 Transaction Management

More information

Log Manager. Introduction

Log Manager. Introduction Introduction Log may be considered as the temporal database the log knows everything. The log contains the complete history of all durable objects of the system (tables, queues, data items). It is possible

More information

Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories

Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories Adrian M. Caulfield Arup De, Joel Coburn, Todor I. Mollov, Rajesh K. Gupta, Steven Swanson Non-Volatile Systems

More information

3. Memory Persistency Goals. 2. Background

3. Memory Persistency Goals. 2. Background Memory Persistency Steven Pelley Peter M. Chen Thomas F. Wenisch University of Michigan {spelley,pmchen,twenisch}@umich.edu Abstract Emerging nonvolatile memory technologies (NVRAM) promise the performance

More information

Transaction Management: Crash Recovery (Chap. 18), part 1

Transaction Management: Crash Recovery (Chap. 18), part 1 Transaction Management: Crash Recovery (Chap. 18), part 1 CS634 Class 17 Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke ACID Properties Transaction Management must fulfill

More information

some sequential execution crash! Recovery Manager replacement MAIN MEMORY policy DISK

some sequential execution crash! Recovery Manager replacement MAIN MEMORY policy DISK ACID Properties Transaction Management: Crash Recovery (Chap. 18), part 1 Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke CS634 Class 17 Transaction Management must fulfill

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information