Logged Virtual Memory. David R. Cheriton and Kenneth J. Duda. Computer Science Department, Stanford University, Stanford, CA 94305. {cheriton,kjd}@cs.stanford.edu


Abstract

Logged virtual memory (LVM) provides a log of writes to one or more specified regions of the virtual address space. Logging is useful for applications that require rollback and/or persistence, such as parallel simulations and memory-mapped object-oriented databases. It can also be used for output, debugging and distributed consistency maintenance. This paper describes logged virtual memory as an extension of the standard virtual memory system software and hardware, our prototype implementation, and some performance measurements from this prototype. Based on these measurements and our experience with the prototype, we argue that logged virtual memory can be supported with modest extensions to standard virtual memory systems, provides significant benefit to applications and servers, and is faster than other log-generation techniques.

1 Introduction

Logged virtual memory (LVM) is a virtual memory system extension that provides logs of write activity to specified virtual memory regions. Each log is a time-ordered sequence of records, one record per memory write. Each record contains the address, the data value written to the address, and the data size. A variety of important applications can use logged virtual memory if it is provided efficiently by the operating system.

For example, parallel discrete-event simulators that use optimistic concurrency control (TimeWarp, described in [11] and [12]) can use logged virtual memory to log modifications to the states of objects to support rollback. Without LVM, this state saving is a major overhead in optimistic simulation because it is memory-intensive and because the cost is incurred even by the slowest (bottleneck) process in the simulation. A highly efficient logged virtual memory system reduces or eliminates this performance-limiting overhead. The log generated by LVM can also provide data for post-execution analysis, giving a more compact and complete indication of state changes than the sequence of checkpoints generated by conventional techniques.

Object-oriented database management systems can also use logged virtual memory to log updates to the objects mapped into a virtual memory region. The resulting redo log, in combination with checkpointing, can be used to implement transaction atomicity and recoverability efficiently. With an efficient logged virtual memory facility, persistent objects supporting atomic transactions can be read and written in virtual memory with the same efficiency as standard C++ objects. Besides being efficient, logged virtual memory is less error-prone than modifying application code to explicitly generate log updates.

As a final example, a debugger can use logged virtual memory to log the writes of a program being debugged. The debugger can then determine when data was erroneously overwritten, as well as generally monitor the state updates in a program under development. The log can also be used to support reverse execution [7], a debugging technique in which a program is allowed to run until it fails and is then backed up, or reverse-executed, until the problem is located¹. Logging can also be used to obtain a detailed address trace of a program, which can be useful for detecting and isolating performance problems or as input to memory system simulators.

These benefits and others prompted us to explore extensions to the conventional virtual memory system software and hardware to support logging.
We were intrigued by the idea of making logging support a standard part of the virtual memory system. This paper describes the design of a logged virtual memory system and a prototype implementation we have built. The prototype consists of extensions to the virtual memory system software of the V++ Cache Kernel [3], an experimental operating system kernel, hardware support for logging in the ParaDiGM experimental multiprocessor [4], and a user-level library. Our experience and measurements suggest that a logged virtual memory system can be implemented efficiently with a modest extension to virtual memory hardware and software, making it an attractive addition to the basic facilities of standard computer systems.

¹ The logging does not directly handle the problem of undoing system calls unless the calls are performed through a logged virtual memory region. Such actions must otherwise be logged by a separate mechanism.

The next section describes the overall design of our logged virtual memory system, with example applications to illustrate the advantages of having such a facility. Section 3 describes our prototype implementation. Section 4 describes the performance of the implementation. Related work is discussed in Section 5. We close with overall conclusions and an indication of open issues and directions for future research.

2 Logged Virtual Memory

Logged virtual memory provides logged regions, log segments to which the log records are written, and a deferred-copy mechanism to support efficient checkpointing. This section describes these facilities using the interfaces of our prototype implementation.

2.1 Logged Regions and Log Segments

Figure 1 illustrates the mapping of a virtual memory region to a segment and a log segment. [Figure 1: Overview of Logged Virtual Memory] The segments A and B are memory segments, virtual memory system objects that can be mapped to a region (a contiguous range of virtual memory addresses). Segment A is mapped into the application's address space, bound to region R. Region R is called a logged region because it has a segment (segment B) specified as its log segment. Every time the program writes to this region, the virtual memory hardware automatically appends a record of the write operation onto the log (in segment B). This log record contains the virtual address written, the datum written there, the datum size, and a timestamp. The log records are arranged sequentially in the log segment, so an earlier write is stored at a lower offset than a later write. The log segment may also be mapped into the address space, so that the same (or a different) application can read the log records. Specifying the log at the region level allows one segment, such as one containing an object-oriented database, to be accessed by multiple processes simultaneously, with their write operations logged to separate segments, one per process.

2.2 Application Program Interface

The C++ application program interface for LVM is given in Table 1. As an example of its usage, the following code sequence creates the mapping structure shown in Figure 1.

    Segment * seg_a = new StdSegment(size);
    Region * reg_r = new StdRegion(seg_a);
    // Create log segment and specify it to the region.
    LogSegment * ls = new LogSegment();
    reg_r->log(ls);
    as = thisprocess()->addressspace();
    reg_r->bind(as);

This code sample illustrates the simplicity of adding logging: just the two lines that create a new LogSegment and associate it with the region. The creation of the log segment and its association with an existing segment can also be performed by a separate program, such as a debugger.

2.3 Deferred Copy: Support for Rollback

LVM also provides deferred copy, a facility much like copy-on-write, in support of checkpointing. This facility is illustrated in Figure 2. [Figure 2: Deferred-Copy Mapping] In this figure, segment B has segment A specified as its deferred-copy source. Segment B appears initialized by segment A; that is, initial reads from a region bound to B retrieve data from A. Writes are reflected only in segment B, leaving A unchanged, and subsequent reads from a modified location retrieve data from B. The resetDeferredCopy() operation resets the deferred-copy region to the state it was in when the original deferred copy took effect.
The semantics of this operation are the same as copying A to B. However, it significantly outperforms bcopy() in the expected case, as described in Section 4.4. If segment A is a checkpoint of segment B, then resetDeferredCopy() effectively rolls the state of B back to this checkpoint. This facility, together with the logging capability, can be used by optimistic parallel simulation, as described in the next section.
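To make the deferred-copy semantics concrete, the following is a minimal software model of the behavior just described: reads fall through to the source until a location is written, and a reset discards all modifications. The class and member names are ours, invented for illustration; they are not part of the prototype interface (the actual cache-based implementation is described in Section 3.3).

    #include <cstddef>
    #include <unordered_map>
    #include <vector>

    // Model of a deferred-copy mapping from source segment A to destination B.
    class DeferredCopyModel {
        const std::vector<char> & source;            // deferred-copy source (segment A)
        std::unordered_map<std::size_t, char> dest;  // locations written in segment B
    public:
        explicit DeferredCopyModel(const std::vector<char> & src) : source(src) {}
        char read(std::size_t addr) const {          // unmodified locations read from A
            auto it = dest.find(addr);
            return it != dest.end() ? it->second : source[addr];
        }
        void write(std::size_t addr, char value) { dest[addr] = value; }  // only B changes
        void resetDeferredCopy() { dest.clear(); }   // roll B back to the state of A
    };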

1 Standard Virtual Memory Functions

new StdSegment(unsigned size, unsigned flags = 0, SegmentMan * segman = defaultsegmentman)
    Create a memory segment. The given segment manager implements user-level page-fault handling. Note that StdSegment is a "standard" implementation of the abstract Segment base class.

new StdRegion(Segment * segment)
    Create a region representing a mapping to the given segment. This region can later be bound to an address space. StdRegion is an implementation of the abstract Region base class.

VirtAddr Region::bind(AddressSpace * as, VirtAddr virtaddr = 0)
    Bind a region into an address space at the given virtual address.

2 Extensions for Logging

new LogSegment()
    Create a log segment to hold log records. LogSegment is also derived from Segment.

void Region::log(LogSegment * ls)
    Declare that ls is the log segment for this region. Log records for all writes to region this appear in ls.

3 Extensions for Deferred Copy

void Segment::sourceSegment(Segment * source, u_int offset = 0)
    Declare that segment source is the deferred-copy source for segment this, starting at the specified offset. This function sets up the deferred-copy mechanism described in Section 2.3.

void AddressSpace::resetDeferredCopy(VirtAddr start, VirtAddr end)
    Undo all modifications to the deferred-copy destination; i.e., for each memory address in the given range that is mapped in deferred-copy mode, ensure that the next read from that address returns the datum from the deferred-copy source.

Table 1: C++ Virtual Memory System Interface

2.4 Example Application: Optimistic Parallel Simulation

Optimistic parallel simulation is a demanding application domain that can benefit significantly from LVM. In these simulations, the objects being executed on one process or scheduler can run ahead in virtual time of objects on other processes. The current time of a given scheduler is called the scheduler's local virtual time (LVT). The minimum of the LVTs of all the processes is referred to as global virtual time (GVT). If a scheduler receives an event timestamped for a virtual time earlier than its LVT, it rolls its state back to the time of that event or earlier, processes the event, and then continues processing forward in virtual time. There is no need to roll back earlier than GVT, provided each process can roll back to GVT (and not some earlier time), because a process cannot receive an event for a time earlier than GVT.

Figure 3 illustrates the per-scheduler segments and mappings that support efficient rollback of the simulation state using logged virtual memory. The working segment, accessed through the working region, contains the scheduler's current simulation state, i.e., the state of the objects associated with this scheduler. The working region is logged, so records of updates appear in the log segment as the program updates the simulation state. The checkpoint segment contains the state of the scheduler's objects at an earlier checkpoint time (no later than GVT). The checkpoint segment is the deferred-copy source for the working segment, so that when the program reads the working region, if the location read has not been modified since the last checkpoint, the data is loaded from the checkpoint segment.
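For concreteness, the following sketch constructs the Figure 3 mapping using the interface of Table 1. The calls are those defined in the table; the STATE_SIZE constant and the variable names are our own illustrative assumptions.

    // Working state, its checkpoint, and the log of updates (per scheduler).
    Segment * working = new StdSegment(STATE_SIZE);
    Segment * checkpoint = new StdSegment(STATE_SIZE);
    LogSegment * log = new LogSegment();

    // The checkpoint segment is the deferred-copy source for the working segment.
    working->sourceSegment(checkpoint);

    // Map the working segment and log every write to it.
    Region * workingRegion = new StdRegion(working);
    workingRegion->log(log);
    AddressSpace * as = thisprocess()->addressspace();
    VirtAddr base = workingRegion->bind(as);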
[Figure 3: Using Logged Virtual Memory in a Simulation]

To roll back, a scheduler first resets the contents of the working segment to those of the checkpoint segment by calling resetDeferredCopy(). The scheduler then "rolls" the working segment forward by applying each update found in the log to the working segment until it reaches the time of the newly-received event². It then resumes normal execution, processing the event that caused the rollback. To advance the checkpoint segment to the state of the scheduler's objects as of time T, the scheduler applies all logged updates older than T to the checkpoint segment. It may optionally truncate the log segment at this time.

² The scheduler writes a certain memory location each time local virtual time changes. Log records of these writes serve as markers, so the rollback algorithm can tell which log records correspond to which virtual time.
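The rollback step just described might be coded as in the following sketch. The LogRecord layout reflects the record contents described in Section 2.1 (address, datum, size, timestamp), and virtualTimeOf() and applyUpdate() are hypothetical helpers, the first of which would use the marker records of footnote 2; none of these names come from the prototype interface.

    // Hypothetical log record layout (see Sections 2.1 and 3.1).
    struct LogRecord {
        VirtAddr addr;        // address written
        unsigned value;       // datum written
        unsigned size;        // size of the write in bytes
        unsigned timestamp;   // hardware timestamp
    };

    typedef unsigned VirtTime;                      // hypothetical
    VirtTime virtualTimeOf(const LogRecord & r);    // from marker records (footnote 2)
    void applyUpdate(const LogRecord & r);          // redo one logged write

    // Reset the working region to the checkpoint, then roll forward to 'target'.
    void rollback(AddressSpace * as, VirtAddr start, VirtAddr end,
                  const LogRecord * log, unsigned nRecords, VirtTime target) {
        as->resetDeferredCopy(start, end);          // back to the checkpoint state
        for (unsigned i = 0; i < nRecords; i++) {
            if (virtualTimeOf(log[i]) >= target)    // stop at the rollback time
                break;
            applyUpdate(log[i]);                    // reapply the logged update
        }
    }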

This checkpoint update and log truncation (CULT) processing is normally undertaken when a scheduler determines that global virtual time has advanced to time T. However, if the scheduler suspects it might be the bottleneck process (i.e., if its LVT is not far ahead of GVT), it may defer CULT until it catches up with the other processors or actually runs out of memory for the log. CULT processing can also be performed by a separate parallel process, to avoid slowing down the simulation itself.

LVM largely eliminates state-saving overhead from application process execution, especially compared to the conventional rollback implementation, which copies the affected object state before processing each event. That state saving slows down all processors, including the slowest (bottleneck) processor. For typical simulations, CULT is considerably less expensive than state saving, and it can be performed asynchronously or deferred until the process is not the bottleneck in advancing GVT. Although rollback can be more expensive than switching directly to a checkpoint copy of the state at the rollback time (as in a conventional approach), the rollback cost is proportional to how far ahead of GVT the process is (because of the cost of roll-forward). Slowing down a process that is far ahead does not slow overall simulation progress, so this cost is not significant. (A process proceeding ahead in virtual time can be thought of as performing speculative execution as an alternative to going idle waiting for the bottleneck process, as would occur in conservative simulation.)

In theory, one could turn off state saving for the slowest process, because it never needs to roll back, rather than relying on LVM to reduce the logging overhead for this case. In practice, however, which process is slowest in virtual time may change quickly, and it adds further expense to check whether a process needs to save state and to determine which process is slowest at a given time. The log generated by LVM can also be used for debugging the simulation or for better understanding its results, for example through postprocessing and visualization.

2.5 Example Application: RLVM

LVM can be used to implement recoverable logged virtual memory (RLVM), a version of the recoverable virtual memory (RVM) system of Camelot and Coda [16]. Coda RVM requires the application programmer to insert a call to set_range() before modifying recoverable memory, to inform the library of the pending modification. On transaction commit (or abort), the library saves or restores only the address ranges specified with set_range(). The authors of the Coda RVM report [16] acknowledge the problem of correctly specifying all the address ranges to be modified and suggest using language-level support to automatically add calls to set_range() for any write that could possibly be logged; however, to our knowledge this approach has not been tried.

In RLVM, no set_range() calls are needed. Instead, all recoverable segments are logged, so all modifications of a logged segment in the context of a transaction are automatically recorded. By writing the transaction identifier to a special logged location (whenever it changes), RLVM can determine the transaction to which a log record belongs, as sketched below. Using a separate log per region means that each process can have a separate log, so transactions are not randomly intermixed in the log.
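The marker convention might look like the following sketch: the current transaction identifier itself lives in a logged word, so the records that follow it in the log can be attributed to that transaction. The names here (currentTxn, beginTransaction, Account) are hypothetical.

    // 'currentTxn' is allocated in a logged region, so assigning to it
    // emits a log record that marks the start of the transaction's writes.
    volatile unsigned * currentTxn;   // special logged location (hypothetical)

    void beginTransaction(unsigned txnId) {
        *currentTxn = txnId;          // logged write: tags subsequent records
    }

    // Recoverable updates need no set_range() calls: writes to the logged
    // segment are recorded automatically by LVM.
    struct Account { int balance; }; // assumed to be placed in a logged segment

    void debit(Account * a, int amount) {
        a->balance -= amount;         // automatically logged
    }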
Using these techniques, RLVM provides atomic transactions and recoverability with less processing overhead and less error-prone programming than Coda RVM. We measure the performance difference between RVM and RLVM in Section 4.2.

2.6 Other Uses

LVM can be used for high-performance output. For example, a program supporting visualization can set the segment containing its state to be logged. A separate process can then interpret this log and display a visual representation of the program. This approach effectively offloads that activity from the application process and allows a separate process and processor to generate the visual image. This kind of clean functionality partitioning often provides more effective use of parallelism than simply increasing the degree of (symmetric) application parallelism. In this case, for instance, the output process executes asynchronously with respect to the application process and synchronizes only on the end of the log. Note that, in this use of logging, the log can also be used for rollback and recovery, because the same state should be logged for output as for those other purposes.

Output is also supported by two additional logging modes: direct-mapped and indexed. In direct-mapped mode, the logged updates to a segment are written to the corresponding offset in the log segment. This mode allows an output device to be written using mapped I/O without having to support storage and read-back to handle the case of a cache line being loaded from this area of memory: cache reload is handled by normal memory, and updates are written to a log segment corresponding to the device address range. In indexed mode, the logger generates a sequence of data values in the log segment, without addresses or other information. This mode can be used to generate streamed output to a device.

Logging can also be used for consistency. For example, producer-consumer objects in Munin [2] could use LVM to identify updates in the producer, which are then transmitted to the consumers to update their copies. LVM reduces the overhead of determining the updates to transmit and allows just the updated data to be transmitted, rather than whole pages. Moreover, it facilitates streaming the updates to the consumers, so that the processing time on lock release (when these updates are flushed) is reduced to the time required to synchronize with the consumers; that is, there should be little or no backlog of data updates to transmit at that point. In Munin, determining the updates is implemented by write-protecting pages, taking a page fault on each write to such a page, creating a twin of the page, and performing a word-by-word comparison to generate a list of differences when sending an update on a write-shared object. Munin also defers sending the updates until lock-release time. (The amount of data transmitted can be greater with LVM if locations are updated repeatedly between acquiring and releasing locks, but we believe this behavior is relatively uncommon in practice.) A sketch of a log-reading consumer appears below.
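As an illustration of such a consumer, the following sketch scans a mapped log segment and forwards each update. It reuses the hypothetical LogRecord layout from the Section 2.4 sketch; sendUpdate() is a stand-in for whatever transport or display the consumer uses.

    void sendUpdate(VirtAddr addr, unsigned value, unsigned size);  // stand-in

    // Apply or forward all records between 'tail' (how far we have processed)
    // and 'head' (the current end of the log); only the updated data is sent,
    // not whole pages.
    void drainLog(const LogRecord * log, unsigned & tail, unsigned head) {
        while (tail < head) {
            const LogRecord & r = log[tail++];
            sendUpdate(r.addr, r.value, r.size);
        }
    }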

We use the term log-based consistency to refer to a consistency protocol that uses logging to identify and send data updates, using ownership transfer only to synchronize between processes or processors³. This class of consistency protocols appears attractive when the unit of consistency is large relative to the size of a typical update, as can arise with virtual memory pages as the unit of consistency. Log-based consistency can also be applied to microprocessors designed to support multiprocessing, as cache line sizes become larger. In this case, on-chip support for logging would also support the interprocessor consistency mechanism. Note that the portions of the state most likely to be shared between processors are also the most likely to be logged. Thus, the bus overhead for logging provides interprocessor consistency at no additional cost; the consistency snoop simply monitors the logging bus traffic.

2.7 Advantages and Issues

The LVM approach to logging has a number of significant advantages. First, LVM avoids the error-prone and tedious approach of manually specifying in the source code the write operations to be logged. The number of lines of code that may require such an annotation can be a significant fraction of the total source of a program, amounting to thousands of annotations in a non-trivial program. Moreover, if one inadvertently omits an annotation, the effects of the missing log entry may not be visible except under particular circumstances, such as a failure to perform a rollback correctly in a parallel simulation, making the problem hard to detect and track down. In contrast, LVM only requires that the application programmer specify the logging for each region and place each object in the right region. The number of regions is typically quite small, less than 10. Specifying placement is simpler than annotating each write because there are fewer object creations than write operations. Moreover, misplacement of objects in regions can in most cases be detected by audit code.

Second, specifying logging by memory region allows logging to be determined dynamically, orthogonal to type or other static criteria. The logging of a region can be dynamically enabled and disabled, so, for example, a program may log a region only if a given command-line option is specified. Similarly, a separate program such as a debugger can dynamically modify the memory regions used by a program to make them log updates when required, with no change to the program binary. This avoids the danger of modifying or obscuring a bug through code modification. It also reduces the performance impact of logging, making this debugging technique applicable to timing-critical software. Furthermore, a given data type can be instantiated in both logged and unlogged memory regions, providing logging only for the instances in the logged region. For example, a class in C++ can be defined with an overloaded new operator that allows instances of the class to be created in either region, thereby determining whether they are logged or not, as sketched below. Attaching the logging to a memory region also fits with the application structuring required for mapped files and mapped I/O. In contrast, implementing logging by modifying the program code, either manually or automatically, requires the inserted logging code to check flags at run time to adjust the logging behavior, with the attendant run-time overhead, or else causes all instantiations of a particular type or procedure to be logged or not logged.

³ Log-based consistency is similar to the log-based coherency described by Feeley et al. [6].
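The overloaded new operator mentioned above might look like the following sketch. The allocator helpers and tag type are our invention (stand-ins for allocation within regions set up as in Table 1); the technique itself, choosing logged or unlogged placement at creation time, is the one described in the text.

    #include <cstddef>
    #include <cstdlib>

    // Stand-ins: real versions would allocate within a logged or an
    // unlogged region created with the Table 1 interface.
    void * allocFromLoggedRegion(std::size_t n)   { return std::malloc(n); }
    void * allocFromUnloggedRegion(std::size_t n) { return std::malloc(n); }

    struct LoggedTag {};   // tag selecting the logged region

    class SimObject {
    public:
        void * operator new(std::size_t n) {            // default: unlogged
            return allocFromUnloggedRegion(n);
        }
        void * operator new(std::size_t n, LoggedTag) { // logged placement
            return allocFromLoggedRegion(n);
        }
        int state;
    };

    // Same type, logged or not, chosen at creation time:
    SimObject * plain  = new SimObject;                // writes are not logged
    SimObject * logged = new (LoggedTag{}) SimObject;  // writes generate log records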
Finally, LVM incurs essentially no overhead on the writing process when implemented with the appropriate hardware and software support. Even the performance of our prototype implementation comes reasonably close to this ideal for expected computation characteristics, as described in Section 4, even though it was hampered by the lack of logging support in the microprocessor chip. In contrast, instrumenting the application code or trapping every write operation using the page-protect mechanism imposes a significant overhead on the application process. LVM's advantages become more significant in sophisticated simulations, where fine-grained events on complex objects push up the cost of copy-based recovery. This cost may force less frequent state saving under the conventional approach, which means that a simulation may roll back to an earlier time than GVT because the state at GVT has not been saved. LVM also avoids the roll-forward event processing and message logging overhead incurred in TimeWarp [12] to accomplish the same precise rollback behavior.

The worst case for LVM arises with an application using objects that have many fields being rapidly updated, of which only a few need to be logged, and yet these fields need to be logged in all objects of this type. In this case, annotating the class definition to indicate the specific fields to log appears to be more efficient. However, if logging these fields is a relatively fixed requirement, the software can be restructured to move them into a separate memory region that is logged, leaving the rest in the original (unlogged) memory region. Then LVM logs only the fields that need to be logged. This fracturing of an object is not always convenient, as in the case of logging updates only to the last entries of an array. Fortunately, in applications of LVM for recoverability and consistency, all updates to an object need to be logged. (We have not encountered anything close to this worst case in practice.) LVM performance can also suffer if application code places rapidly changing temporary variables in logged objects or repeatedly writes the same location when only the last write is of interest to log. However, avoiding these situations is consistent with good programming style and improves application performance independent of logging. Moreover, the logs provide the information required to identify and eliminate these redundant writes.

3 Prototype Implementation

The prototype implementation consists of a hardware logger that snoops the ParaDiGM multiprocessor's bus, logging write activity, plus software extensions to the V++ Cache Kernel virtual memory system to associate a log segment with a segment of memory. Figure 4 illustrates the logical structure of this implementation. We first describe the hardware in detail.

[Figure 4: Prototype Implementation Block Diagram]

3.1 The Logger

The logger is a hardware device that snoops the system bus for write operations to logged segments and translates each such write operation into a log record, storing it in the associated log segment. The internal structure of the logger is shown in Figure 5. [Figure 5: Logger Block Diagram]

When the application writes to a logged region, the write operation appears on the system bus (because the processor's on-chip cache is in write-through mode). The logger's snoop mechanism reads the operation and checks whether it is tagged to be logged. In our prototype, a bus signal controlled by the page mapping associated with the address indicates whether the write operation is to be logged. If so, the logger stores the address and data in its write FIFO. When a stored write address and datum reach the head of the FIFO, the logger looks up the physical address (the "data address") in the page mapping table, which maps physical page addresses to log table entry indices. The log table contains one entry per log, indicating the address of the end of that log. The logger then retrieves the log address from the log table entry corresponding to the index from the page mapping table, to determine where the log record should be written. It places the log address and a 16-byte log record in the log record FIFO. The log record contains the original data address, the value written, the size of the write, and a high-resolution timestamp (6.25 MHz). The logger then increments the log address in the log table entry by 16; if this address crosses a page boundary, the entry is marked invalid. Finally, when the log record reaches the head of the log record FIFO, the record is DMAed into memory.

If the data address is not found in the page mapping table, or if the log table entry is invalid (typically because the log address has just crossed a page boundary), the hardware generates a logging fault, implemented as a hardware interrupt, and suspends operation until the kernel fixes the problem. In the prototype implementation, the page size is four kilobytes. The logger's page mapping table is implemented as a direct-mapped TLB-like structure: a physical page address is looked up in this table by splitting it into a tag (upper five bits) and an index (lower 15 bits). The logger is built primarily using FPGAs and SRAMs.

3.1.1 Example

The following example illustrates the operation of the prototype hardware mechanism. Consider log tables set up as in Figure 6. [Figure 6: Logger Hardware Tables] The second and third entries of the page mapping table indicate that writes to physical pages 1xxx and 2xxx are logged in log 1. Entry 1 in the log table indicates that the next record for log 1 is to be written to physical address 7d20. Suppose the CPU writes 4321 to a physical address in page 1xxx. The logger maps this address to page mapping table entry 1. The tag matches (0), so it looks at log

table entry 1, which indicates that the log record is to be written to address 7d20. The logger updates the log address in the entry to 7d30 and writes a 16-byte log record (the data address, the value 4321, the size of the write, and a timestamp) to address 7d20.

3.1.2 Virtual versus Physical Addresses

The prototype logger receives physical addresses over the bus rather than virtual addresses. This has two major implications for the prototype. First, the prototype supports only a single logged region per segment, because the physical address specifies a segment in memory, not a region. The logger could be extended to use the number of the processor generating the logged write to provide per-processor logs; a context switch could then unload logs from the logger tables as necessary to implement per-region logs. However, there was inadequate space in our prototype hardware to provide this capability. Second, physical addresses are stored in the log rather than virtual addresses. To provide logs with virtual addresses, the logger could store a reverse translation in its page mapping table, relying on there being a single logged region per segment, or using per-processor page mapping table entries as above. There was inadequate space for this reverse translation in the FPGA-based logger implementation, but an ASIC implementation could accommodate this feature. With a virtually-addressed cache, the processor could generate virtual addresses on the bus, eliminating the need for this reverse translation in the logger and directly supporting per-region logging. Alternatively, logging support directly inside the CPU's virtual memory unit, as described in Section 4.6, would allow per-region logging and virtual addresses to be stored in the log while still using a physically addressed cache.

3.1.3 Logger Overruns

In the prototype implementation, the processors can generate log records faster than the logger can DMA them to memory, causing data to fill up the logger's FIFOs and eventually overflow them. The FIFOs hold 819 entries. When the amount of data exceeds a threshold (512 entries), the logger is "overloaded" and generates an interrupt. The kernel responds to the interrupt by suspending all processes that might be generating log data until the FIFOs drain. This action is a significant performance penalty for a logging application, but it occurs only when the logged write rate is high, as discussed in Section 4.5. Logging support in the processor chip eliminates this problem, as described in Section 4.6.

3.2 Virtual Memory System Extensions

To support LVM in the operating system kernel, the virtual memory data structures are augmented to allow a log segment to be associated with a virtual memory region, and the virtual memory fault handling code is extended to handle faults on logged pages. On a page fault for a page that belongs to a logged region, the page fault handler first executes the normal page fault handling code, allocating a page frame and bringing the data into main memory. It then puts the on-chip data cache in write-through mode for the logged page, so that all logged writes are immediately visible to the logger. Then, if there is no entry for the page's log in the logger's log table, the page fault handler loads one. Finally, it loads an entry in the logger's page mapping table that maps the page's physical address to the log's index. The faulting process is then allowed to resume the application.

On a logging fault, the kernel determines whether the fault is caused by a missing page mapping table entry or by an invalid log address.
In the former case, the fault handler selects a table location, unloads its current contents, and initializes the entry to correspond to the address on which the fault occurred. It also loads an entry for the log into the log table if that entry is missing, initializing the log address to the end of the log segment's data. In the latter case, it determines the address of the next page frame for the log segment. In our implementation, the user explicitly extends the log segment, normally in advance of a fault at the end of the log segment, so the kernel can efficiently resume log writing after the logger crosses a page boundary. If the user has not provided a page, the kernel uses a default log page to absorb the log records and clear the logger FIFOs; log records may be lost in this case. However, the kernel must be prepared to discard data in any implementation when the amount exceeds system resources or user limits, because the logger can generate arbitrary amounts of data.

3.3 Deferred Copy Implementation

The prototype implements the deferred-copy mechanism using extensions in the second-level cache that associate a source address and a destination address with each cache line, as developed earlier in VMP [5]. A deferred-copy mapping at the software level associates a source page address with each page of the destination segment, corresponding to the appropriate page frame in the source segment. When a cache line in the destination segment is referenced, it is loaded into the second-level cache from the source page. When a cache line is written back, it is written to its destination address, and its source address is set to that destination address so that subsequent loads are taken from the destination page. When a resetDeferredCopy() operation is performed, the cache lines of the segment that have been modified but not written back are invalidated in the cache, and the source addresses for all portions of the segment are reset to point to the source segment. As an optimization, our implementation checks the per-page dirty bit to detect the pages that have been modified, rather than inspecting the tags of every cache line only to find that they are all clean.

With this implementation, the logical "copy" incurs no processor overhead because it is performed as part of cache load and writeback. Moreover, resetting the deferred copy requires no copying; the processor just resets the source addresses and invalidates the modified cache lines. If only a small localized portion of the segment has been modified, this reset is much less expensive than copying, as shown by the measurements presented in Section 4.
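To summarize the logger's lookup path (Sections 3.1 and 3.2), here is a small software model of handling one snooped write. The table shapes mirror the hardware (a direct-mapped page mapping table indexed by the low 15 bits of the page address with a 5-bit tag, per-log end-of-log addresses, and 16-byte records); the type names, the log table size, and emitRecord() are our own modeling conventions, not kernel code.

    #include <cstdint>

    struct PMTEntry { uint32_t tag; uint32_t logIndex; bool valid; };
    struct LogEntry { uint32_t logAddr; bool valid; };  // next record address

    PMTEntry pageMappingTable[1 << 15];  // direct-mapped, indexed by low 15 bits
    LogEntry logTable[256];              // one entry per active log (size assumed)

    // Stand-in for the DMA of a 16-byte record into the log segment.
    void emitRecord(uint32_t logAddr, uint32_t dataAddr, uint32_t value,
                    uint32_t size, uint32_t timestamp) { /* DMA in hardware */ }

    // Returns false where the hardware would raise a logging fault.
    bool logWrite(uint32_t physAddr, uint32_t value, uint32_t size, uint32_t ts) {
        uint32_t page = physAddr >> 12;             // 4 KB pages
        PMTEntry & pme = pageMappingTable[page & 0x7fff];
        if (!pme.valid || pme.tag != (page >> 15))
            return false;                           // missing page mapping entry
        LogEntry & le = logTable[pme.logIndex];
        if (!le.valid)
            return false;                           // log address crossed a page
        emitRecord(le.logAddr, physAddr, value, size, ts);
        le.logAddr += 16;                           // 16-byte records
        if ((le.logAddr & 0xfff) == 0)              // crossed a page boundary:
            le.valid = false;                       // kernel must supply next page
        return true;
    }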

4 Performance

We measured the performance of the prototype so as to quantitatively evaluate the benefits of LVM relative to other techniques for creating logs. We evaluate its benefit at the application level using RLVM and a "simulated" simulation as example applications. Finally, the raw overhead of logged writes is measured, comparing the prototype implementation to an "ideal" implementation. Measurements are also provided for the deferred-copy mechanism. We first describe the parameters of the prototype implementation used for these measurements.

4.1 Prototype Parameters

The ParaDiGM multiprocessor contains four 25-megahertz 68040s sharing the system bus with the logger and a four-megabyte second-level cache. The 68040s have an eight-kilobyte split I/D cache with a 16-byte line size. Table 2 lists the cost of some basic machine operations, indicating both the total cost and the portion that uses the bus. The measured times are given in cycles to make the results easier to interpret for other (newer) hardware. A cycle is 40 nanoseconds in the prototype.

    Operation            Total time   Bus time
    Word write-through   6 cycles     5 cycles
    Cache block write    9 cycles     8 cycles
    Log-record DMA       18 cycles    8 cycles

    Table 2: Basic Machine Performance

4.2 RLVM

RLVM is a version of the Coda RVM described in Section 2.5. It was measured to determine the performance benefits of using LVM for this application. The results are shown in Table 3.

    Benchmark           RVM             RLVM
    Single write        3515 cycles     16 cycles
    TPC-A throughput    418 trans/sec   552 trans/sec

    Table 3: Performance of RVM with and without LVM

The first line gives the processing cost in cycles of a single recoverable write in the two implementations. A recoverable write is a single write operation to a recoverable segment, including the cost of modifying the segment, adding a record of the write to the log, and ensuring an "old value" exists to undo the transaction if necessary. LVM reduces the cycle cost of a recoverable write by a factor of approximately 200. The second line of Table 3 indicates the improvement in transaction processing performance from using LVM for the TPC-A benchmark, using a RAM disk to hold the log⁴.

⁴ RLVM does not actually use the log generated by LVM to do rollback or recovery; the TPS given is estimated by adding RLVM's transaction time to RVM's commit and log truncation times.

The performance improvement for TPC-A on RLVM is less than one might expect because only about 25% of the CPU time in RVM is actually spent inside the transaction. The rest is spent performing the commit and truncating the log, and RLVM does not reduce these costs. However, it does reduce the time TPC-A spends inside the transaction to less than 1% of the benchmark's total runtime. Moreover, optimizing the commit and log-truncation processing would further improve the benefits of LVM. Longer transactions would also show greater benefit from LVM, assuming correspondingly more write operations as well. TPC-A is a sequence of simple debit-credit operations; transactions in object-oriented database systems tend to be longer and involve far more processing.

4.3 Optimistic Simulation

A "simulated" simulation was developed that could use either copy-based state saving or LVM to support rollback. The elapsed-time performance of this application was measured, varying:

1. c, the compute cycles per event
2. s, the size in bytes of an object
3. w, the writes per event

The results are shown in Figure 7 for four different object sizes.
[Figure 7: LVM versus Copy-based Checkpointing; speedup versus compute cycles per event for four (w, s) combinations]

The graph shows that LVM provides a speedup over copy-based checkpointing ranging from 3 percent for large values of c to 25 percent for smaller values of c. The larger values of s provide the greatest improvement in performance and are the most important values to consider, because sophisticated simulations use fairly large objects to hold the state associated with a detailed model. (The performance for larger values of w drops off for LVM when c is below 200 cycles or so, because the logger overflows. Overflow is an artifact of restrictions in our prototype implementation; in a production-quality implementation this overflow would not occur, and the benefit of LVM would increase with decreasing c. See Section 4.5.3 for measurements of the cost incurred

by log FIFO overload.) LVM should thus provide greater benefit in the future as simulations move to more detailed models and, therefore, to finer-grain event processing and larger objects.

Varying the number of write operations per event does not significantly affect performance, because the copy-based approach is independent of the number of writes and LVM incurs only a slight increase in overhead for the additional write-throughs to the system bus. (Even that overhead would largely disappear if the processor had more write buffers on-chip.) Figure 8 shows this behavior for a range of writes per event and object sizes. [Figure 8: Effect of Number of Writes on LVM Performance; speedup versus fraction of the object written, for several (s, c) combinations]

As expected, the speedup decreases slowly as the fraction of the object being written increases, up to the occurrence of log FIFO overflow (which is not covered by this graph). For example, with an s of 64 bytes and a c of 512 cycles, there is relatively little change in the speedup between writing 1/8, 1/4, or 1/2 of the object. Only as the fraction approaches one does the difference become significant, and that overhead is largely due to write-through. Note that in practice enough computation cycles are required for event scheduling and dispatch that a processor would rarely overload the log FIFO. Thus, these measurements should be indicative of real application performance, even with this prototype implementation.

The measurements do not incorporate the overhead of rollbacks, advancing global virtual time, and performing log truncation. The process that is furthest behind in an optimistic simulation does not perform rollbacks, so these overheads are not expected to affect the progress of a simulation. Moreover, the effect of rollback and recovery performance is sufficiently complex that full simulations using the two forms of state saving would be required to provide an accurate indication of the overall performance benefit of LVM. However, we do not expect these other factors to detract from LVM's benefits in practice.

4.4 Deferred Copy Performance

resetDeferredCopy() is a key operation when logging is used for checkpointing and rollback. Its performance was

measured by performing resetDeferredCopy() on a pair of segments, varying the fraction of dirty pages. Figure 9 shows its performance relative to bcopy(), using 32-kilobyte, 512-kilobyte, and 2-megabyte segments. (These sizes were chosen to represent small, medium, and large segments.) [Figure 9: Execution time of resetDeferredCopy() versus bcopy() as a function of dirty data, for the three segment sizes]

These measurements show that resetDeferredCopy() performs better than a raw copy if less than about two-thirds of the segment is dirty. A sophisticated simulation generally has a large amount of state, most of which does not change even across several event-processing steps, so only a small portion of the segment is expected to be modified on most rollbacks. Moreover, if the segment has a high proportion of modified data relative to the checkpoint segment (the state at approximately GVT), the associated process is probably far ahead of GVT in virtual time; in this case, the cost of the rollback is less important because it does not tend to slow down the overall progress of the simulation. Therefore, resetDeferredCopy() appears to provide an improvement over bcopy() in the cases that matter for TimeWarp simulation.

4.5 Write-through and Overflow Overheads

Ideally, LVM would be as fast as ordinary (unlogged) memory. This performance could be achieved with on-chip logging support. Unfortunately, the performance of the prototype suffered from the need to use write-through to make writes visible outside the processor chip and from the need to handle logger overload. Logger overload occurs when the application issues logged writes faster than the logger can output them, causing the logger's FIFOs to hit their threshold and forcing the system to pause the processors until the FIFOs drain. This section describes measurements that quantify these effects.

4.5.1 Test Methodology

All tests presented in this section consist of the following steps (a sketch of the loop appears after the list):

1. Ensure the relevant memory regions are in the second-level cache.
2. Start the timer.
3. Run several thousand iterations of the following code sequence:
   (a) Perform c compute cycles.
   (b) Perform w normal write operations.
   (c) Perform l logged write operations.
4. Stop the timer.

The addresses of the writes and logged writes increase as the test proceeds, so accesses always hit in the second-level cache but not generally in the first-level cache.
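A sketch of the inner measurement loop, following the steps above; the timer and cycle-burning helpers are stand-ins for the actual test harness.

    void startTimer();            // stand-ins for the harness
    void stopTimer();
    void burnCycles(unsigned c);

    // One configuration: c compute cycles, w plain writes, l logged writes per
    // iteration. 'plain' points into an unlogged region and 'logged' into a
    // logged one; both walks stay within the second-level cache.
    void runTest(unsigned c, unsigned w, unsigned l,
                 volatile int * plain, volatile int * logged, unsigned iters) {
        startTimer();
        for (unsigned i = 0; i < iters; i++) {
            burnCycles(c);                       // (a) compute
            for (unsigned j = 0; j < w; j++)
                *plain++ = j;                    // (b) normal writes
            for (unsigned j = 0; j < l; j++)
                *logged++ = j;                   // (c) logged (write-through) writes
        }
        stopTimer();
    }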

4.5.2 CPU Slowdown due to Logging

Figure 10 shows the results of a series of tests with w = 0, varying c and using l ∈ {2, 4, 8}. For comparison, the tests were rerun with l = 0, using w ∈ {2, 4, 8}, to measure the direct cost of logging a write relative to not logging it. [Figure 10: CPU Cost of Logged Writes; cycles per write, with and without logging, versus compute cycles per iteration, for clusters of 2, 4, and 8 writes]

For small values of c, the logger is overloaded, resulting in poor performance. For larger values of c (corresponding to the flat portion of the curve), the difference between logged and unlogged writes is the cost of the cache's write-through mode. The cost of write-through increases with the size of the write burst that occurs as part of the processing. A larger write buffer in the processor would largely eliminate the difference between logged and unlogged writes for burst sizes that the buffer could handle.

4.5.3 Overload Penalty

The cost of overloading the logger was measured in a series of tests with c = [0 ... 63], w = 0, and l = 1. Figure 11 shows the total run time per iteration (in CPU cycles) as a function of c, indicating how much overload hurt overall throughput when the compute cycles per iteration were less than 30. [Figure 11: Total Cost of Logged Write] Figure 12 shows how often the logger's threshold was exceeded for the same series of tests (overload events per 1,000 iterations). [Figure 12: Overload Events]

These graphs show that overloading the logger is so expensive (more than 3,000 cycles) that the time per iteration decreases as the computation per loop increases. However, overload is avoided as long as there is no more than one logged write per 27 compute cycles on average. (The logger FIFOs can absorb many bursts of writes without overloading, given their 512-entry overload threshold.)

4.6 Next-Generation LVM Hardware

A processor designed to support logging could tag cache blocks to be logged either in the cache tags or in the TLB entries. The TLB design is illustrated in Figure 13: TLB entries are extended to contain a log table index, and the log table is stored inside the CPU. [Figure 13: Logging inside the CPU's VM Unit]

With this design, the processor can generate log records containing the virtual address rather than the physical address of the write operation. There is also the option of placing other information in the log records (such as the memory data before the write and the program counter value). Per-region logging is directly supported while still allowing the processor to use a physically addressed cache. Finally, the processor is automatically stalled if there is an excessive level of write activity to a logged region, just as when it writes rapidly to a write-through region, eliminating the need for large log FIFOs and a software overload-handling mechanism. Large processor write buffers would reduce this stalling, just as they would reduce the write-through penalty. With this on-chip logging support, the cost of logged writes should be essentially the same as that of unlogged writes (except for the bus overhead of the log records), allowing logging to be used in a variety of applications without performance concerns, just as virtual memory is commonly used without concern for its performance overhead. Moreover, no extra logic would be required on the motherboard, because it would all be contained on the processor chip.
This benefit is analogous to that of providing virtual memory mapping in the processor rather than on the motherboard, as had to be done for early microprocessors such as the Motorola 68000. The more modest approach of simply providing large write buffers on-chip would reduce the write-through overhead for logging as well as for other uses of write-through, and may be sufficient for most applications from a performance standpoint. However, it imposes the development overhead of providing a logging module on the motherboard and makes provision of per-region log segments more difficult. If logging is to be provided as a standard facility in future systems (as we

believe it should be), processor-chip support is preferred, especially since the chip real estate for this facility will be available on future processor chips. Overall, our measurements suggest that LVM significantly reduces the cost of logging compared to the alternatives, and that hardware support on the processor chip, as can be expected in the future, would provide even greater performance benefits.

5 Related Work

Related work can be divided into: (i) operating system work that uses and/or provides logging facilities, (ii) hardware support for logging and state saving, and (iii) application software-level techniques.

5.1 Logging in Operating Systems

Most related operating system work focuses on using logs in file systems, such as the Sprite log-structured file system [15], Hagmann's log-based file system [1], and the standard syslog facility in Unix. The Finlayson/Cheriton log server [8] provides a facility for storing log data on disk, particularly WORMs, and is not concerned with the data-collection aspect of logging, which is the primary concern here.

The virtual-memory-based checkpoint facility of Li and Appel [13] provides checkpointing rather than true logging. In that work, the operating system uses page write-protection to force a trap on the first write to a page after a checkpoint, in order to save a copy of the page as part of that earlier checkpoint. Resetting to a previous checkpoint requires resetting the mappings of the modified pages to the corresponding pages of the checkpoint. Creating a new checkpoint entails write-protecting all the virtual pages in the region to be checkpointed. Their mechanism is strictly oriented to applications using checkpointing and does not provide logging⁵. This scheme is similar to our deferred-copy mechanism for checkpointing with resetDeferredCopy(), but it requires explicit creation of checkpoints rather than simply rolling the checkpoint segment forward by applying the log updates, as we do. It would be relatively straightforward to extend our implementation to provide their form of checkpointing and let applications choose. However, it is impractical to extend their techniques to provide logging at the level of individual write operations, as we provide, because of the cost of taking a write-protection fault on every write to a page in the logged region. A write fault, including completing the write operation and logging the data, would take over 300 cycles on current processors, even if implemented at a low level in the operating system. This cost is what motivates hardware support to make an operating-system solution to logging practical.

⁵ It could, however, be extended to provide logging at page granularity.

5.2 Other Hardware Mechanisms

There are no hardware systems, to our knowledge, that provide logging support similar to what we have developed. Write buffers provide an asynchronous writeback facility but do not provide two destinations for the write, namely the backing store for the virtual memory region and the log segment. As another mechanism, the history registers on some processors retain the immediately previous values of some registers, but they are extremely limited relative to our logger hardware. Fujimoto [9] proposed a virtual time machine architecture for optimistic parallel discrete-event simulation in which a multi-level memory stores the simulation states corresponding to the most recent timesteps.
Resetting the state to an earlier timestep simply requires resetting the memory to make the corresponding level the current timestep (and removing the memory levels corresponding to later timesteps). A chip implementation was proposed, and a VME memory board was apparently actually constructed [1]. This approach optimizes for rollback at the cost of increasing the normal read-access time. In particular, a processor's on-chip cache would have to support this multi-level structure unless it used a single-word cache-line granularity (and the processor was a word-only load/store architecture), and either approach would degrade read access and increase design complexity in modern processor architectures. In contrast, our mechanism optimizes for state saving with no penalty for read access. Even the added mechanism for a logged write is minimal, given the provision of write buffers (for asynchronous writeback) and write-through caching in current processors. Our approach is slower for rollback, but rollback occurs only for processes that are not the performance bottleneck and so has little effect on application performance.

A "recursive" cache similar to Fujimoto's approach was implemented in the emulator layer of Newcastle's recovery block system [14]. Here, the rollback mechanism was used to retry a different algorithm within the recovery block on failure. The same performance issues apply here as with Fujimoto's design.

5.3 Application-level Techniques

The most competitive alternative to LVM as part of the virtual memory system is to insert logging instructions directly into the application code. RVM [16] is an example of a system in which the user has to insert such calls explicitly, e.g., set_range() in RVM. This approach imposes a development burden on the application programmer, is error-prone, and does not provide the word-level logging that our facility provides, which is useful for debugging and monitoring. LVM is also significantly faster than RVM implemented with application-level logging calls. In theory, an RVM application can be structured to call set_range() once and then write the same recoverable location multiple times, avoiding the overhead RLVM incurs of creating a new log record for each write. Moreover, the performance of RVM can be improved by calling set_range() only once over a large region, amortizing its cost over several writes. However, there is a conflict between these two techniques and encapsulation. Proper object encapsulation requires that only the object is aware of its lay-


1. Memory technology & Hierarchy 1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Q.1 Explain Computer s Basic Elements

Q.1 Explain Computer s Basic Elements Q.1 Explain Computer s Basic Elements Ans. At a top level, a computer consists of processor, memory, and I/O components, with one or more modules of each type. These components are interconnected in some

More information

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced?

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced? Chapter 10: Virtual Memory Questions? CSCI [4 6] 730 Operating Systems Virtual Memory!! What is virtual memory and when is it useful?!! What is demand paging?!! When should pages in memory be replaced?!!

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Chapter 11: File System Implementation. Objectives

Chapter 11: File System Implementation. Objectives Chapter 11: File System Implementation Objectives To describe the details of implementing local file systems and directory structures To describe the implementation of remote file systems To discuss block

More information

director executor user program user program signal, breakpoint function call communication channel client library directing server

director executor user program user program signal, breakpoint function call communication channel client library directing server (appeared in Computing Systems, Vol. 8, 2, pp.107-134, MIT Press, Spring 1995.) The Dynascope Directing Server: Design and Implementation 1 Rok Sosic School of Computing and Information Technology Grith

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 23 Hierarchical Memory Organization (Contd.) Hello

More information

Lecture 11 Cache. Peng Liu.

Lecture 11 Cache. Peng Liu. Lecture 11 Cache Peng Liu liupeng@zju.edu.cn 1 Associative Cache Example 2 Associative Cache Example 3 Associativity Example Compare 4-block caches Direct mapped, 2-way set associative, fully associative

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Virtual Memory 1 Chapter 8 Characteristics of Paging and Segmentation Memory references are dynamically translated into physical addresses at run time E.g., process may be swapped in and out of main memory

More information

CSE 410 Final Exam 6/09/09. Suppose we have a memory and a direct-mapped cache with the following characteristics.

CSE 410 Final Exam 6/09/09. Suppose we have a memory and a direct-mapped cache with the following characteristics. Question 1. (10 points) (Caches) Suppose we have a memory and a direct-mapped cache with the following characteristics. Memory is byte addressable Memory addresses are 16 bits (i.e., the total memory size

More information

Addresses in the source program are generally symbolic. A compiler will typically bind these symbolic addresses to re-locatable addresses.

Addresses in the source program are generally symbolic. A compiler will typically bind these symbolic addresses to re-locatable addresses. 1 Memory Management Address Binding The normal procedures is to select one of the processes in the input queue and to load that process into memory. As the process executed, it accesses instructions and

More information

T H. Runable. Request. Priority Inversion. Exit. Runable. Request. Reply. For T L. For T. Reply. Exit. Request. Runable. Exit. Runable. Reply.

T H. Runable. Request. Priority Inversion. Exit. Runable. Request. Reply. For T L. For T. Reply. Exit. Request. Runable. Exit. Runable. Reply. Experience with Real-Time Mach for Writing Continuous Media Applications and Servers Tatsuo Nakajima Hiroshi Tezuka Japan Advanced Institute of Science and Technology Abstract This paper describes the

More information

OPERATING SYSTEM. Chapter 9: Virtual Memory

OPERATING SYSTEM. Chapter 9: Virtual Memory OPERATING SYSTEM Chapter 9: Virtual Memory Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory

More information

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Memory hierarchy J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Computer Architecture ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION Introduction :- An exploits the hardware resources of one or more processors to provide a set of services to system users. The OS also manages secondary memory and I/O devices on behalf of its users. So

More information

Chapter 8 Memory Management

Chapter 8 Memory Management 1 Chapter 8 Memory Management The technique we will describe are: 1. Single continuous memory management 2. Partitioned memory management 3. Relocatable partitioned memory management 4. Paged memory management

More information

Chapter 9: Virtual-Memory

Chapter 9: Virtual-Memory Chapter 9: Virtual-Memory Management Chapter 9: Virtual-Memory Management Background Demand Paging Page Replacement Allocation of Frames Thrashing Other Considerations Silberschatz, Galvin and Gagne 2013

More information

CISC 7310X. C08: Virtual Memory. Hui Chen Department of Computer & Information Science CUNY Brooklyn College. 3/22/2018 CUNY Brooklyn College

CISC 7310X. C08: Virtual Memory. Hui Chen Department of Computer & Information Science CUNY Brooklyn College. 3/22/2018 CUNY Brooklyn College CISC 7310X C08: Virtual Memory Hui Chen Department of Computer & Information Science CUNY Brooklyn College 3/22/2018 CUNY Brooklyn College 1 Outline Concepts of virtual address space, paging, virtual page,

More information

Algorithms Implementing Distributed Shared Memory. Michael Stumm and Songnian Zhou. University of Toronto. Toronto, Canada M5S 1A4

Algorithms Implementing Distributed Shared Memory. Michael Stumm and Songnian Zhou. University of Toronto. Toronto, Canada M5S 1A4 Algorithms Implementing Distributed Shared Memory Michael Stumm and Songnian Zhou University of Toronto Toronto, Canada M5S 1A4 Email: stumm@csri.toronto.edu Abstract A critical issue in the design of

More information

Memory Hierarchy. Goal: Fast, unlimited storage at a reasonable cost per bit.

Memory Hierarchy. Goal: Fast, unlimited storage at a reasonable cost per bit. Memory Hierarchy Goal: Fast, unlimited storage at a reasonable cost per bit. Recall the von Neumann bottleneck - single, relatively slow path between the CPU and main memory. Fast: When you need something

More information

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli Eect of fan-out on the Performance of a Single-message cancellation scheme Atul Prakash (Contact Author) Gwo-baw Wu Seema Jetli Department of Electrical Engineering and Computer Science University of Michigan,

More information

A taxonomy of race. D. P. Helmbold, C. E. McDowell. September 28, University of California, Santa Cruz. Santa Cruz, CA

A taxonomy of race. D. P. Helmbold, C. E. McDowell. September 28, University of California, Santa Cruz. Santa Cruz, CA A taxonomy of race conditions. D. P. Helmbold, C. E. McDowell UCSC-CRL-94-34 September 28, 1994 Board of Studies in Computer and Information Sciences University of California, Santa Cruz Santa Cruz, CA

More information

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III Subject Name: Operating System (OS) Subject Code: 630004 Unit-1: Computer System Overview, Operating System Overview, Processes

More information

Memory Management. To improve CPU utilization in a multiprogramming environment we need multiple programs in main memory at the same time.

Memory Management. To improve CPU utilization in a multiprogramming environment we need multiple programs in main memory at the same time. Memory Management To improve CPU utilization in a multiprogramming environment we need multiple programs in main memory at the same time. Basic CPUs and Physical Memory CPU cache Physical memory

More information

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a Asynchronous Checkpointing for PVM Requires Message-Logging Kevin Skadron 18 April 1994 Abstract Distributed computing using networked workstations oers cost-ecient parallel computing, but the higher rate

More information

Operating Systems. Memory Management. Lecture 9 Michael O Boyle

Operating Systems. Memory Management. Lecture 9 Michael O Boyle Operating Systems Memory Management Lecture 9 Michael O Boyle 1 Memory Management Background Logical/Virtual Address Space vs Physical Address Space Swapping Contiguous Memory Allocation Segmentation Goals

More information

The Memory System. Components of the Memory System. Problems with the Memory System. A Solution

The Memory System. Components of the Memory System. Problems with the Memory System. A Solution Datorarkitektur Fö 2-1 Datorarkitektur Fö 2-2 Components of the Memory System The Memory System 1. Components of the Memory System Main : fast, random access, expensive, located close (but not inside)

More information

Chapter 8. Virtual Memory

Chapter 8. Virtual Memory Operating System Chapter 8. Virtual Memory Lynn Choi School of Electrical Engineering Motivated by Memory Hierarchy Principles of Locality Speed vs. size vs. cost tradeoff Locality principle Spatial Locality:

More information

C. E. McDowell August 25, Baskin Center for. University of California, Santa Cruz. Santa Cruz, CA USA. abstract

C. E. McDowell August 25, Baskin Center for. University of California, Santa Cruz. Santa Cruz, CA USA. abstract Unloading Java Classes That Contain Static Fields C. E. McDowell E. A. Baldwin 97-18 August 25, 1997 Baskin Center for Computer Engineering & Information Sciences University of California, Santa Cruz Santa

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

Chapter 8: Virtual Memory. Operating System Concepts Essentials 2 nd Edition

Chapter 8: Virtual Memory. Operating System Concepts Essentials 2 nd Edition Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Recoverability. Kathleen Durant PhD CS3200

Recoverability. Kathleen Durant PhD CS3200 Recoverability Kathleen Durant PhD CS3200 1 Recovery Manager Recovery manager ensures the ACID principles of atomicity and durability Atomicity: either all actions in a transaction are done or none are

More information

Xinu on the Transputer

Xinu on the Transputer Purdue University Purdue e-pubs Department of Computer Science Technical Reports Department of Computer Science 1990 Xinu on the Transputer Douglas E. Comer Purdue University, comer@cs.purdue.edu Victor

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Modified by Rana Forsati for CSE 410 Outline Principle of locality Paging - Effect of page

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

Virtual Memory. CSCI 315 Operating Systems Design Department of Computer Science

Virtual Memory. CSCI 315 Operating Systems Design Department of Computer Science Virtual Memory CSCI 315 Operating Systems Design Department of Computer Science Notice: The slides for this lecture have been largely based on those from an earlier edition of the course text Operating

More information

Virtual Memory Outline

Virtual Memory Outline Virtual Memory Outline Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations Operating-System Examples

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

Chapter 9: Virtual Memory

Chapter 9: Virtual Memory Chapter 9: Virtual Memory Background Demand Paging Chapter 9: Virtual Memory Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations

More information

Real Time Spectrogram

Real Time Spectrogram Real Time Spectrogram EDA385 Final Report Erik Karlsson, dt08ek2@student.lth.se David Winér, ael09dwi@student.lu.se Mattias Olsson, ael09mol@student.lu.se October 31, 2013 Abstract Our project is about

More information

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T A Mean Value Analysis Multiprocessor Model Incorporating Superscalar Processors and Latency Tolerating Techniques 1 David H. Albonesi Israel Koren Department of Electrical and Computer Engineering University

More information

CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement"

CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement" October 3, 2012 Ion Stoica http://inst.eecs.berkeley.edu/~cs162 Lecture 9 Followup: Inverted Page Table" With

More information

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,

More information

Following are a few basic questions that cover the essentials of OS:

Following are a few basic questions that cover the essentials of OS: Operating Systems Following are a few basic questions that cover the essentials of OS: 1. Explain the concept of Reentrancy. It is a useful, memory-saving technique for multiprogrammed timesharing systems.

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems On Object Orientation as a Paradigm for General Purpose Distributed Operating Systems Vinny Cahill, Sean Baker, Brendan Tangney, Chris Horn and Neville Harris Distributed Systems Group, Dept. of Computer

More information

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging 6.823, L8--1 Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Highly-Associative

More information

Operating Systems Memory Management. Mathieu Delalandre University of Tours, Tours city, France

Operating Systems Memory Management. Mathieu Delalandre University of Tours, Tours city, France Operating Systems Memory Management Mathieu Delalandre University of Tours, Tours city, France mathieu.delalandre@univ-tours.fr 1 Operating Systems Memory Management 1. Introduction 2. Contiguous memory

More information

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA CS 258 Parallel Computer Architecture Data Speculation Support for a Chip Multiprocessor (Hydra CMP) Lance Hammond, Mark Willey and Kunle Olukotun Presented: May 7 th, 2008 Ankit Jain Outline The Hydra

More information

CS450/550 Operating Systems

CS450/550 Operating Systems CS450/550 Operating Systems Lecture 4 memory Palden Lama Department of Computer Science CS450/550 Memory.1 Review: Summary of Chapter 3 Deadlocks and its modeling Deadlock detection Deadlock recovery Deadlock

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2016 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 System I/O System I/O (Chap 13) Central

More information

Page 1. Goals for Today" TLB organization" CS162 Operating Systems and Systems Programming Lecture 11. Page Allocation and Replacement"

Page 1. Goals for Today TLB organization CS162 Operating Systems and Systems Programming Lecture 11. Page Allocation and Replacement Goals for Today" CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement" Finish discussion on TLBs! Page Replacement Policies! FIFO, LRU! Clock Algorithm!! Working Set/Thrashing!

More information

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability Topics COS 318: Operating Systems File Performance and Reliability File buffer cache Disk failure and recovery tools Consistent updates Transactions and logging 2 File Buffer Cache for Performance What

More information

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s] Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of

More information

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 233 6.2 Types of Memory 233 6.3 The Memory Hierarchy 235 6.3.1 Locality of Reference 237 6.4 Cache Memory 237 6.4.1 Cache Mapping Schemes 239 6.4.2 Replacement Policies 247

More information

Outlook. Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium

Outlook. Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium Main Memory Outlook Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium 2 Backgound Background So far we considered how to share

More information

Virtual Memory Management

Virtual Memory Management Virtual Memory Management CS-3013 Operating Systems Hugh C. Lauer (Slides include materials from Slides include materials from Modern Operating Systems, 3 rd ed., by Andrew Tanenbaum and from Operating

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Chapter 9: Virtual Memory. Operating System Concepts 9 th Edition

Chapter 9: Virtual Memory. Operating System Concepts 9 th Edition Chapter 9: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Using the Holey Brick Tree for Spatial Data. in General Purpose DBMSs. Northeastern University

Using the Holey Brick Tree for Spatial Data. in General Purpose DBMSs. Northeastern University Using the Holey Brick Tree for Spatial Data in General Purpose DBMSs Georgios Evangelidis Betty Salzberg College of Computer Science Northeastern University Boston, MA 02115-5096 1 Introduction There is

More information

Unit 2 : Computer and Operating System Structure

Unit 2 : Computer and Operating System Structure Unit 2 : Computer and Operating System Structure Lesson 1 : Interrupts and I/O Structure 1.1. Learning Objectives On completion of this lesson you will know : what interrupt is the causes of occurring

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

Chapter 9: Virtual Memory. Chapter 9: Virtual Memory. Objectives. Background. Virtual-address address Space

Chapter 9: Virtual Memory. Chapter 9: Virtual Memory. Objectives. Background. Virtual-address address Space Chapter 9: Virtual Memory Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2018 L20 Virtual Memory Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Questions from last time Page

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL Jun Sun, Yasushi Shinjo and Kozo Itano Institute of Information Sciences and Electronics University of Tsukuba Tsukuba,

More information

Chapter 9: Virtual Memory

Chapter 9: Virtual Memory Chapter 9: Virtual Memory Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations

More information

MEMORY MANAGEMENT/1 CS 409, FALL 2013

MEMORY MANAGEMENT/1 CS 409, FALL 2013 MEMORY MANAGEMENT Requirements: Relocation (to different memory areas) Protection (run time, usually implemented together with relocation) Sharing (and also protection) Logical organization Physical organization

More information

OPERATING SYSTEMS. After A.S.Tanenbaum, Modern Operating Systems 3rd edition Uses content with permission from Assoc. Prof. Florin Fortis, PhD

OPERATING SYSTEMS. After A.S.Tanenbaum, Modern Operating Systems 3rd edition Uses content with permission from Assoc. Prof. Florin Fortis, PhD OPERATING SYSTEMS #8 After A.S.Tanenbaum, Modern Operating Systems 3rd edition Uses content with permission from Assoc. Prof. Florin Fortis, PhD MEMORY MANAGEMENT MEMORY MANAGEMENT The memory is one of

More information

Page Replacement Chap 21, 22. Dongkun Shin, SKKU

Page Replacement Chap 21, 22. Dongkun Shin, SKKU Page Replacement Chap 21, 22 1 Virtual Memory Concept Virtual memory Concept A technique that allows the execution of processes that are not completely in memory Partition each user s program into multiple

More information

Multiprocessors and Locking

Multiprocessors and Locking Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access

More information

Barrelfish Project ETH Zurich. Message Notifications

Barrelfish Project ETH Zurich. Message Notifications Barrelfish Project ETH Zurich Message Notifications Barrelfish Technical Note 9 Barrelfish project 16.06.2010 Systems Group Department of Computer Science ETH Zurich CAB F.79, Universitätstrasse 6, Zurich

More information

Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. # 18 Transaction Processing and Database Manager In the previous

More information

Clock page algorithm. Least recently used (LRU) NFU algorithm. Aging (NFU + forgetting) Working set. Process behavior

Clock page algorithm. Least recently used (LRU) NFU algorithm. Aging (NFU + forgetting) Working set. Process behavior When a page fault occurs Page replacement algorithms OS 23 32 OS has to choose a page to evict from memory If the page has been modified, the OS has to schedule a disk write of the page The page just read

More information

Redo Log Undo Log. Redo Log Undo Log. Redo Log Tail Volatile Store. Pers. Redo Log

Redo Log Undo Log. Redo Log Undo Log. Redo Log Tail Volatile Store. Pers. Redo Log Recovering from Main-Memory Lapses H.V. Jagadish AT&T Research Murray Hill, NJ 07974 jag@research.att.com Avi Silberschatz Bell Laboratories Murray Hill, NJ 07974 avi@bell-labs.com S. Sudarshan Indian

More information

Move back and forth between memory and disk. Memory Hierarchy. Two Classes. Don t

Move back and forth between memory and disk. Memory Hierarchy. Two Classes. Don t Memory Management Ch. 3 Memory Hierarchy Cache RAM Disk Compromise between speed and cost. Hardware manages the cache. OS has to manage disk. Memory Manager Memory Hierarchy Cache CPU Main Swap Area Memory

More information

Memory Management Ch. 3

Memory Management Ch. 3 Memory Management Ch. 3 Ë ¾¾ Ì Ï ÒÒØ Å ÔÔ ÓÐÐ 1 Memory Hierarchy Cache RAM Disk Compromise between speed and cost. Hardware manages the cache. OS has to manage disk. Memory Manager Ë ¾¾ Ì Ï ÒÒØ Å ÔÔ ÓÐÐ

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information