Enabling Transparent Memory-Compression for Commodity Memory Systems


Enabling Transparent Memory-Compression for Commodity Memory Systems
Vinson Young*, Sanjay Kariyappa*, Moinuddin K. Qureshi
Georgia Institute of Technology
(*These authors contributed equally to this work.)

Abstract: Transparent Memory-Compression (TMC) allows the system to obtain the bandwidth benefits of memory compression in an OS-transparent manner. Unfortunately, prior designs for TMC (MemZip) rely on using non-commodity memory modules, which can limit their adoption. We show that TMC can be implemented with commodity memories by storing multiple compressed lines in a single memory location and retrieving all these lines in a single memory access, thereby increasing the effective memory bandwidth. TMC requires metadata to specify the compressibility and location of the line. Unfortunately, even with dedicated metadata caches, maintaining and accessing this metadata incurs significant bandwidth overheads and causes slowdown. Our goal is to enable TMC for commodity memories by eliminating the bandwidth overheads of metadata accesses. This paper proposes PTMC (Practical and Transparent Memory-Compression), a simple design for obtaining the bandwidth benefits of memory compression while relying only on commodity (non-ECC) memory modules and avoiding any OS support. Our design uses a novel inline-metadata mechanism, whereby the compressibility of the line can be determined by scanning the line for a special marker word, eliminating the overheads of metadata access. We also develop a low-cost Line Location Predictor (LLP) that can determine the location of the line with 98% accuracy, and a dynamic solution that disables compression if the benefits of compression are smaller than the overheads. Our evaluations show that PTMC provides a speedup of up to 73%, is robust (no slowdown for any workload), and can be implemented with a total storage overhead of less than 300 bytes.

Index Terms: Memory; Compression; Bandwidth; DRAM.

I. INTRODUCTION

As modern compute systems pack more and more cores on the processor chip, the memory system must also scale proportionally in terms of bandwidth in order to supply data to all the cores. Unfortunately, memory bandwidth is dictated by the pin count of the processor chip, and this limited memory bandwidth is one of the bottlenecks for system performance. Data compression is a promising solution for increasing the effective bandwidth of the memory system. Prior works on memory compression [1] [2] [3] aim to obtain both the capacity and bandwidth benefits from compression, trying to accommodate as many pages as possible in the main memory, depending on the compressibility of the data. As the effective memory capacity of such designs can change at runtime, these designs need support from the Operating System (OS) or the hypervisor to handle the dynamically changing memory capacity. Unfortunately, this means that such memory compression solutions are not viable unless both the hardware vendors (e.g., Intel, AMD) and the OS vendors (Microsoft, Linux, etc.) coordinate with each other on the interfaces, or such solutions will be limited to systems where the same vendor provides both the hardware and the OS. Ideally, we want a memory-compression design that can be implemented entirely in hardware, without requiring any OS/hypervisor support. Transparent Memory-Compression (TMC) can provide the bandwidth benefits of memory compression in an OS-transparent manner by exploiting only the increased bandwidth and not the extra capacity (see Footnote 1).
An example of TMC is the MemZip [5] design, which tries to increase memory bandwidth using hardware-based compression and avoids any OS support. Unfortunately, MemZip requires significant changes to the memory organization and the memory access protocols. Instead of striping the line across all the chips on a memory DIMM, MemZip places the entire line in one chip and changes the number of bursts required to stream out the line, depending on the compressibility of the line. MemZip therefore requires significant changes to the data organization of commodity memories and to the memory controller to support variable burst lengths. Ideally, we want to enable TMC for commodity memory modules while retaining conventional data organization and bus protocols.

We observe that TMC can be implemented with commodity memories by storing multiple compressed lines in a single memory location and retrieving all these lines in a single memory access, thereby increasing the effective bandwidth. For example, if two neighboring lines A and B are compressible, then they can both be stored in the location of A, and a single access to A can provide both A and B, as shown in Figure 1(a). If the lines are not compressible (such as lines X and Y), then they remain resident in their respective locations.

Compression can change both the size and the location of the line. Without additional information, the memory controller would not know how to interpret the data obtained from the memory (compressed or uncompressed). For example, an access to A can either give only A (uncompressed) or both A and B (compressed). Therefore, all compressed-memory designs require additional metadata that indicates the compression status of each line. Thus, for lines A and B the compression status would be 1 (compressed), and for lines X and Y the compression status would be 0 (uncompressed).

Footnote 1: In fact, Qualcomm's Centriq [4] system was recently announced with a feature that tries to provide higher bandwidth through memory compression while forgoing the extra capacity available from memory compression. Centriq's design relies on a large linesize of 128 bytes, striping this wide line across two channels, having ECC DIMMs in each channel to track compression status, and obtaining the 128-byte line from one channel if the line is compressible. Ideally, we want to obtain bandwidth benefits independent of linesize, without relying on ECC DIMMs, and without being limited to a 2x compression ratio. Nonetheless, the Centriq announcement shows the commercial appeal of Transparent Memory-Compression.

[Figure 1: (a) the TMC layout on commodity memory, with compressible pair A,B co-located and incompressible X,Y in their own locations; access sequences for (b) no compression [4 accesses], (c) ideal TMC with no metadata [3 accesses], and (d) TMC with metadata accesses [5 accesses].] Fig. 1. (a) Transparent Memory-Compression (TMC) design for commodity memories. Sequence of accesses to read 4 lines (A, B, X, Y) for (b) uncompressed memory, (c) ideal compressed memory, and (d) TMC with metadata lookups.

Note that the metadata overhead is not unique to TMC and also occurs in all prior designs of compressed memory. Conventional designs for memory compression typically store the per-line metadata in a Metadata Table, which is kept in a dedicated region of memory. Unfortunately, accessing the metadata can incur significant bandwidth overhead, even in the presence of dedicated metadata caches [3], and can cause significant slowdown (as much as 49%).

We explain the problem of the bandwidth overhead of metadata with an example. Figure 1(b-d) shows three memory systems, each servicing four memory requests: A, B, X, and Y. A and B are compressible and can reside in one line, whereas X and Y are incompressible. For the baseline system (b), servicing these four requests would require four memory accesses. For an idealized compressed memory system (c) (that does not require metadata lookup), lines A and B can be obtained in a single access, whereas X and Y would require one access each, for a total of 3 accesses for all four lines. However, when we account for metadata lookup (d), it could take up to 5 accesses to read and interpret all the lines, causing degradation relative to an uncompressed scheme.

Our goal is to implement TMC on commodity memory modules without incurring the bandwidth overheads of metadata accesses. To this end, we propose PTMC (Practical and Transparent Memory Compression), an efficient hardware-based main-memory compression design for improving bandwidth. Our design eliminates the metadata lookup by decoupling and separately solving the issues of (i) how to interpret the data, and (ii) where to look for the data. Overall, our design makes the following three contributions.

Contribution-1: Inline-Metadata to Reduce Overheads. To efficiently identify whether the line retrieved from memory is compressed, we propose a novel inline-metadata scheme, whereby compressed lines are required to contain a special value, called a marker value, at the end. We leverage the insight that compressed data rarely uses the full 64-byte space, so we can store compressed data within 60 bytes and use the remaining four bytes to store the marker. On a read, if the line contains the marker it is considered compressed, and uncompressed otherwise. The likelihood that an uncompressed line coincidentally matches a marker is quite small (less than one in a billion with a 4-byte marker), and PTMC handles such rare cases simply by identifying lines that cause marker collisions on a write and storing such lines in an inverted form (more details in Section IV-C).

Contribution-2: Low-Cost Predictor for Line Location. The inline-metadata scheme eliminates the need for a separate metadata lookup. However, we still need an efficient way to predict the location of the line, as the location may depend on compressibility. For example, in Figure 1(a), when A and B are compressible, both A and B are resident in the location of A. Therefore, the location of A remains unchanged regardless of compression.
However, for B, the location depends on compressibility. We propose a history-based Line Location Predictor (LLP) that can identify the correct location of the line with high accuracy (98%). The LLP is based on the observation that lines within a page tend to have similar compressibility. The prediction of the LLP is verified using the inline marker in the retrieved line; therefore, in the common case (of a correct LLP prediction) the line is obtained with only a single memory access.

Contribution-3: Dynamic Compression for Robustness. We observe that even after eliminating the bandwidth overheads of the metadata lookup, some workloads still see slowdown with compression due to the inherent bandwidth overheads associated with compressing memory. For example, the maintenance operation of compressing and writing back clean lines incurs bandwidth overhead, as those lines would not be written to memory in an uncompressed design. For workloads with poor reuse, this bandwidth overhead of writing compressed data does not get amortized by the subsequent accesses. To avoid performance degradation in such scenarios, we develop Dynamic-PTMC, a sampling-based scheme that can dynamically disable compression based on a cost-benefit analysis. Dynamic-PTMC ensures no slowdown for workloads that do not benefit from compression (see Footnote 2).

To the best of our knowledge, PTMC is the only design that provides the bandwidth benefits of memory compression without requiring any OS support and while using only commodity (non-ECC) DIMMs and interfaces. It does so while retaining support for a 64-byte linesize and is also applicable to single-channel systems. Our evaluations show that PTMC provides a speedup of up to 73%, while ensuring no slowdown for any workload. PTMC can be implemented with only minor changes to the memory controller and incurs a total storage overhead of less than 300 bytes.

Footnote 2: Such dynamic-compression designs become viable only with inline-metadata and not with table-based metadata (Section V-A).

II. BACKGROUND AND MOTIVATION

We first provide background on hardware-based memory compression, then discuss an address mapping scheme that can extend TMC to commodity memories, followed by a discussion of the bandwidth overhead of metadata accesses and the potential available from an idealized design that does not incur metadata overheads. We finally provide an insight that can avoid the metadata lookups.

A. Hardware-Based Memory-Compression

Compression exploits the redundancy in data values to store the contents in reduced space. There are several compression algorithms [6] [7] [8] [9] [10] [11] [12] in the literature to compress memory contents. A hardware-based memory compression design uses these algorithms to increase not only the capacity of the main memory but also the bandwidth (as the data can be obtained with fewer accesses). Figure 2 describes the architecture of a typical hardware-based memory compression design. Without loss of generality, we assume that the data obtained from compressed memory is decompressed at the memory controller before being sent to the cache hierarchy. As each line in memory can either be compressed or uncompressed, depending on the data values, metadata is required for each line to identify the compression status (and location) of the line. This per-line metadata is stored in a memory-mapped table and cached on-chip in the metadata cache.

[Figure 2 shows the processor core, LLC, and compression-decompression engine, the memory controller with a metadata cache holding compressibility and location information, and DRAM holding the data and the metadata, with metadata transfers between them.] Fig. 2. Overview of typical hardware-based memory compression. Metadata for the compression status of each line is stored in memory and cached on-chip.

B. Address Mapping for Enabling TMC

Transparent Memory-Compression (TMC) can provide the bandwidth benefits of memory compression in an OS-transparent manner. While the prior design [5] for TMC was developed in the context of non-commodity memory modules, we are interested in developing a solution that works with commodity DIMMs. We observe that TMC can be implemented with commodity memories by co-locating multiple compressible lines at a single location and streaming out all these lines in a single memory access. Figure 3 shows a simple memory mapping scheme that can enable TMC for commodity memories without changing the standard burst length. If the lines are compressible, we can place up to 4 adjacent compressed lines in one location (see Footnote 3) and stream out all these lines in one access. This allows us to retrieve 4 lines at the bandwidth cost of a single memory access and thus constitutes bandwidth-free prefetches (our design installs all the freely obtained lines in the L3 cache, as we found that doing so improves performance). If the lines become incompressible, they get stored in an uncompressed format in their original locations, without having to relocate the entire page.

[Figure 3 shows the mapping of lines A, B, C, D onto addresses 0-3 for the uncompressed, 2:1-compressed, and 4:1-compressed cases.] Fig. 3. Address mapping to enable TMC for commodity memories (for supporting up to 4x compression).

C. The Challenge of Metadata Accesses

An access to TMC on commodity memory would obtain a 64-byte line; however, the memory controller would not know whether the line contains compressed data. For example, an access to A would provide both A and B if both lines are compressible, and only A if the lines are uncompressed.
Simply obtaining the line from location A is insufficient to provide information about the compressibility of the line. To help interpret the lines read, conventional designs keep track of the Compression Status Information (CSI) for each line in a separate region in memory, which we refer to as the Metadata-Table. Compressed memory needs the CSI not only to interpret the data line read from memory, but also to determine the location of the data line. Even if we provisioned only one bit per line in memory, the size of this Metadata-Table would be quite large (e.g., 32 MB). We can keep the metadata for TMC in a memory-mapped table as well, similar to Figure 2. TMC needs to access this metadata to determine the location (and contents) of the line before servicing the read. The metadata cache can alleviate some of these accesses; however, on a miss in the metadata cache, the bandwidth overhead of accessing the metadata from the Metadata-Table is still incurred.

D. Bandwidth Overhead of Metadata

Figure 4 shows the breakdown of the bandwidth consumed by a design with table-based metadata, normalized to uncompressed memory. In general, compression is effective at reducing the number of requests for data.

Footnote 3: When a cacheline gets evicted from the LLC, the memory controller checks if (a) the neighboring cachelines are present in the LLC, and (b) the group of 2 or 4 cachelines can be compressed to the size of a single uncompressed cacheline (64 bytes). If so, the memory controller compacts them together and issues a write containing the 2 or 4 compressed lines to one physical location.
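As a rough illustration of the eviction-time check in Footnote 3 (not the paper's actual interface), the following Python sketch assumes hypothetical `llc`, `compress`, and `memory_write` helpers and treats addresses as line indices:

```python
LINE_SIZE = 64  # bytes in one uncompressed cacheline

def writeback_with_compaction(evicted_addr, llc, compress, memory_write):
    """On an LLC eviction, try to pack 4 (else 2) naturally aligned neighboring
    lines into a single 64-byte location; otherwise write back uncompressed."""
    for group_size in (4, 2):                          # prefer 4:1, then 2:1
        base = evicted_addr - (evicted_addr % group_size)
        group = [llc.lookup(base + i) for i in range(group_size)]
        if any(line is None for line in group):        # all neighbors must be in the LLC
            continue
        packed = compress(b"".join(group))
        if len(packed) <= LINE_SIZE:                   # whole group fits in one line
            memory_write(base, packed)                 # one write services the group
            return group_size
    memory_write(evicted_addr, llc.lookup(evicted_addr))   # uncompressed fallback
    return 1
```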

[Figure 4: normalized bandwidth consumption, broken into data, additional writes, and metadata.] Fig. 4. Bandwidth consumption for data, additional writes to compress data, and metadata for a TMC design with table-based metadata, normalized to uncompressed memory. Metadata accesses incur significant bandwidth overhead.

However, depending on the workload, the metadata cache can have a poor hit-rate and require frequent accesses to memory to obtain the metadata. These extra metadata accesses can constitute a significant bandwidth overhead; some workloads need over 50% extra bandwidth just to fetch the metadata. Thus, schemes that require a separate metadata lookup can end up degrading performance relative to uncompressed memory. It is important that the implementation of compressed memory does not degrade the performance of such workloads, and solving the challenge of bandwidth bloat due to metadata accesses is key to developing an effective design.

E. Potential for Improvement

To evaluate the impact of metadata, we compare the performance of a TMC design with Metadata-Table accesses to an idealized design that does not require metadata accesses. The table-based design is equipped with a dedicated 32KB metadata cache and is designed to capture spatial locality (similar to prior work). In contrast, the idealized compression scheme does not maintain any metadata and simply streams out the lines that are compressed together in the same location. Figure 5 compares the performance of the two designs.

[Figure 5: per-workload speedup of ideal TMC (no metadata) vs. TMC with metadata.] Fig. 5. Speedup from ideal TMC (no metadata lookup) and TMC with metadata (with a metadata cache), relative to uncompressed memory.

While an ideal design can provide a speedup of 12.3%, prior designs fall short of achieving this speedup due to the overheads associated with the additional metadata accesses. In fact, prior schemes degrade performance significantly (up to 49%) for many graph (GAP) workloads. Therefore, unless the bandwidth overhead of metadata can be reduced, TMC is not viable for commodity memories. The goal of our paper is to enable TMC for commodity memories by mitigating the bandwidth overhead of metadata accesses. We next provide the insight behind a practical solution.

F. Insight: Store Metadata in Unused Space

To reduce the metadata accesses of compressed memory, we leverage the insight that not all of the space of a 64-byte line is used by compressed memory. For example, when we are trying to compress two lines (A and B) they must fit within 64 bytes; however, the compressed size could still be smaller than 64 bytes, with the leftover space not large enough to store additional lines C and D. We can leverage this unused space in the compressed memory line to store metadata information within the line. For example, we could require that compressed lines store a 4-byte marker (a predefined value) at the end of the line, and the space available for the compressed lines would then be reduced to 60 bytes. Figure 6 shows the probability of a pair of adjacent lines compressing to 64B and 60B. As these probabilities are 38% and 36%, respectively, reserving space for the marker does not substantially reduce the likelihood of compressing two lines together and thus has little impact on the compression ratio.

[Figure 6: fraction of adjacent line pairs compressing to 64B ("Double 64") and to 60B ("Double 60") per workload, with the average.] Fig. 6. Probability of a pair of adjacent lines compressing to 64B and 60B. We can use 4B to store metadata without significantly affecting compressibility.
We can use this insight to store the metadata within the line and avoid the bandwidth overheads of accessing the metadata separately. If the line obtained from memory contains the marker value, the line is deemed compressed; if it does not, the line is deemed uncompressed. However, there could be a case where an uncompressed line coincidentally stores the marker value. A practical solution must efficiently handle such collisions, even though they are expected to be extremely rare. The following sketch illustrates the basic marker layout; we then discuss our methodology before presenting the full solution.
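A minimal sketch of this idea for a pair of lines, with a placeholder marker value and a placeholder `compress` function (the real design derives per-line markers, as described in Section IV-C):

```python
LINE_SIZE = 64
MARKER_SIZE = 4
MARKER_2TO1 = bytes.fromhex("5ac37e91")   # placeholder value; real markers are per-line

def pack_pair(line_a, line_b, compress):
    """Write side: store two lines in one 64B location only if their compressed
    form fits in 60B, leaving the last 4B for the marker."""
    payload = compress(line_a + line_b)
    if len(payload) <= LINE_SIZE - MARKER_SIZE:
        padding = bytes(LINE_SIZE - MARKER_SIZE - len(payload))
        return payload + padding + MARKER_2TO1         # compressed layout
    return None                                        # keep both lines uncompressed

def looks_compressed(stored_line):
    """Read side: the line is treated as compressed iff it ends with the marker."""
    return stored_line[-MARKER_SIZE:] == MARKER_2TO1
```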

III. METHODOLOGY

A. Framework and Configuration

We use USIMM [13], an x86 simulator with a detailed memory system model. Table I shows the configuration used in our study. We assume a three-level cache hierarchy (L1, L2, and L3 being on-chip SRAM caches). All caches use a line size of 64 bytes. The DRAM model is based on DDR4. We model a virtual memory system to perform virtual-to-physical address translations, which ensures that the memory accesses of different cores do not map to the same physical page. Note that, other than the virtual memory translation, the OS is not extended to provide any support to enable compressed memory. For compression, we use a hybrid of the FPC and BDI algorithms and compress with the one that gives better compression. We assume a decompression latency of 5 cycles in our evaluations. Information about the compression algorithm used and the compression-specific metadata (e.g., the base for BDI) is stored within the compressed line and counted towards the size of the compressed line.

TABLE I: SYSTEM CONFIGURATION
Processors: 8 cores; 3.2GHz, 4-wide OoO
Last-Level Cache: 8MB, 16-way
Compression Algorithm: FPC + BDI
Main Memory Capacity: 16GB
Bus Frequency: 800MHz (DDR 1.6GHz)
Configuration: 2 channels, 2 ranks, 64-bit bus
tCAS-tRCD-tRP-tRAS: [values missing in source] ns

B. Workloads

We use a representative slice of 1 billion instructions selected by PinPoints [14], from benchmark suites that include SPEC CPU2006 [15], SPEC CPU2017 [16], and GAP [17]. We evaluate all SPEC 2006 and 2017 workloads, and mark them with 06 or 17 to denote the version when the workload is common to both. We additionally run the GAP suite, which is graph analytics with real data sets (twitter, web sk-2005). We show detailed evaluations of the workloads with at least five misses per thousand instructions (MPKI). The evaluations execute benchmarks in rate mode, where all eight cores execute the same benchmark. Table II shows the L3 miss rates and memory footprints of the workloads we have evaluated in detail. In addition, we also include 6 mixed workloads created by randomly mixing these workloads. We perform timing simulation until each benchmark in a workload executes at least 1 billion instructions. We use weighted speedup to measure the aggregate performance of a workload, normalized to the baseline uncompressed memory, and report the geometric mean for the average speedup across the 15 high-MPKI workloads. Additionally, we evaluate the GAP and MIX workloads to show that our compressed memory design does not degrade performance (in contrast to prior work) for workloads with limited spatial locality. For the other workloads that are not memory bound, we present full results of all 64 benchmarks evaluated in Section VI-B.

TABLE II: WORKLOAD CHARACTERISTICS
[Table II lists the L3 MPKI and memory footprint (ranging from a few hundred MB to several GB) for each workload evaluated in detail; the individual workload names and values were not preserved in the source.]

IV. DESIGN OF PRACTICAL TMC

To enable Transparent Memory-Compression for commodity memory systems while avoiding the bandwidth overhead of metadata lookups, we propose Practical Transparent Memory Compression (PTMC). PTMC tries to obtain bandwidth benefits using memory compression without requiring OS support, without changes to the bus protocol, and while maintaining the existing organization of the memory modules. Additionally, our design avoids the use of a metadata table by using novel in-lined markers, effectively side-stepping the overheads of metadata access inherent to prior designs. We start by providing a brief overview of PTMC and then explain the role of the different components of the design.
A. Overview of PTMC

PTMC is based on two key innovations: (1) inline metadata and (2) line-location prediction (LLP). Figure 7 shows an overview of PTMC. Instead of keeping metadata explicitly in a table, inline metadata leverages the unused space in compressed memory lines to store a marker (a 4-byte value in our design). If a line retrieved from memory has the marker value, it is deemed to be compressed, and uncompressed otherwise. Due to the address mapping used for TMC, the location of a line may depend on its compressibility status (for example, line B may be present along with line A at the location of A, if both lines are compressible). Therefore, for obtaining line B, TMC must access the original location of B if B is incompressible, and the location of A otherwise. To avoid the requirement of looking up both places, we develop a low-cost line location predictor that can correctly predict the compressibility status (and location) of the line with high accuracy (see Footnote 4).

Footnote 4: The proposed LLP could also be used for compressed-memory designs that use table-based metadata. However, for such designs the LLP does not reduce the bandwidth overhead of metadata accesses, as the metadata is still required to verify LLP predictions. Our inline metadata makes it possible to verify LLP predictions with a single access to the data line, thus avoiding the bandwidth of metadata lookups.
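Putting the two ideas together, the common-case read path might look like the following sketch. For simplicity only 2:1 compression of an even/odd pair is shown; `llp`, `dram_read`, and the marker helpers are assumed interfaces rather than the paper's, 4:1 groups and marker collisions are omitted, and `is_invalid` checks the invalid-line marker that PTMC writes into vacated locations (described later in Section IV-C):

```python
def read_line(addr, llp, dram_read, is_marked, is_invalid, unpack_pair):
    """Service a read with one DRAM access in the common case; the LLP picks the
    location and the in-line marker confirms (or refutes) the prediction."""
    base = addr - (addr % 2)                       # even line of the pair
    if addr == base or llp.predict_compressed(addr):
        data = dram_read(base)                     # a compressed pair lives at the base
        if is_marked(data):                        # marker confirms: compressed pair
            return unpack_pair(data)[addr % 2]     # neighbor arrives bandwidth-free
        if addr == base:
            return data                            # uncompressed A, no prediction needed
        llp.update(addr, compressed=False)         # mispredicted: B is uncompressed
        return dram_read(addr)                     # second access to B's own location
    data = dram_read(addr)                         # predicted uncompressed B
    if not is_invalid(data):
        return data                                # prediction correct
    llp.update(addr, compressed=True)              # stale copy: the pair moved to base
    return unpack_pair(dram_read(base))[addr % 2]
```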

[Figure 7: a read request for line B consults the Line Location Predictor (a Last Compressibility Table indexed by a hash of the page address), which predicts 2:1 compression; the request is issued to the predicted location (address 0), and the prediction is confirmed by the marker in the last 4 bytes of the retrieved 64-byte line (60B of compressed data + 4B marker).] Fig. 7. Overview of PTMC: the line location is predicted with the LLP and confirmed using the in-line marker.

Figure 7 illustrates the sequence of events involved in a read operation for PTMC with an example. A read request to line B is serviced by first consulting the Line Location Predictor to determine its location. In this case, the LLP predicts that line B is 2:1 compressed, and hence the read request is sent to the address location of line A. The address mapping scheme ensures that if lines A and B can be compressed together, they must be resident at the location of line A. Once the data is retrieved from memory, the prediction is confirmed by checking the value of the marker (the last 4 bytes) to ensure that the retrieved line indeed contains the compressed line for B. If the line does not contain the marker, then there is an LLP misprediction, and we access line B from its original location. As LLP accuracy is high, this second access is quite uncommon. Note that there is no need for location prediction while accessing line A, as it is always resident in the same location, regardless of compressibility. The two components of PTMC, the Line Location Predictor and the inline metadata, work together to obtain the bandwidth benefits of compression while avoiding the overheads associated with accessing metadata. We now provide a detailed explanation of each of these components.

B. Line Location Predictor

The address mapping scheme (discussed in Section II-B) ensures that we can determine the location of a line if we know its compressibility together with its neighbors. Thus, the Line Location Predictor (LLP) can locate a compressed line by predicting its compression status. To design a low-cost LLP, we exploit the observation that lines within a page tend to have similar compressibility [3] [18] (see Footnote 5). Figure 8 shows the LLP organization. The LLP contains the Last Compressibility Table (LCT), which tracks the last compression status seen for a given index. The LCT is indexed with a hash of the page address; for a given access, the entry corresponding to the page address is used to predict the compressibility, and then the line location. The LCT is used only when a prediction is needed (e.g., line A is always resident in one location and does not need prediction). We use a 512-entry LCT with a storage cost of 128B.

For prior designs with a metadata table, if there is a hit in the metadata cache, then the location of the line is known and the line can be retrieved in one memory access. However, a miss in the metadata cache results in two memory accesses: one for the metadata and a second for the data. Figure 9 compares the hit-rate of the metadata cache (32KB) with the prediction accuracy of the LLP (128 bytes). Even though the LLP is quite small, it provides an accuracy of 98%, much higher than the hit-rate of the metadata cache.

Footnote 5: We describe a dynamic fallback solution in Section V-A that can disable compression if compressibility fails to be predictable [19].

Fig. 8. The Line Location Predictor uses the line address and a compressibility prediction (based on the last-seen compressibility) to predict the line location.
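A minimal software model of the LCT, with an arbitrary hash standing in for the hardware hash function (the structure names and the 64-lines-per-page assumption are ours):

```python
class LineLocationPredictor:
    """Last Compressibility Table: predicts a line's compression status, and
    therefore its location, from the last status observed for its page."""
    UNCOMPRESSED, TWO_TO_ONE, FOUR_TO_ONE = 0, 1, 2

    def __init__(self, num_entries=512, lines_per_page=64):
        self.num_entries = num_entries
        self.lines_per_page = lines_per_page            # 4KB page / 64B lines (assumed)
        self.lct = [self.UNCOMPRESSED] * num_entries    # 2 bits/entry -> about 128B total

    def _index(self, line_addr):
        page_addr = line_addr // self.lines_per_page    # lines in a page behave alike
        return hash(page_addr) % self.num_entries       # stand-in for a simple HW hash

    def predict(self, line_addr):
        return self.lct[self._index(line_addr)]

    def update(self, line_addr, observed_status):
        # Called when the in-line marker reveals the true status (e.g., on a
        # misprediction); remember it as the prediction for this page.
        self.lct[self._index(line_addr)] = observed_status
```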
On an LLP misprediction (detected by our inline metadata), we re-issue the request to the other possible location(s) of the line, and the corresponding entry in the LCT is updated with the correct compression status of the line.

[Figure 9: location prediction accuracy (%) per workload, comparing the metadata-cache hit-rate with the LLP prediction accuracy.] Fig. 9. Probability of finding the line in one access for prior designs with a metadata cache vs. PTMC with the LLP.

C. Inline-Metadata

The LLP predicts the compression status and the location of the line in memory. However, we need a way to detect LLP mispredictions so that we don't read incorrect data from memory. Ideally, the data that is read should itself convey information about the compression status of the line; this would let us confirm the compression status (and hence the prediction) without the need for extra metadata accesses. Our proposal is to inline the metadata within the 64B of the line that is read from memory, using a special 4-byte value called the marker.

[Figure 10 shows the three layouts: a 2-to-1 compressed line (A and B in 60B plus a 4B marker), a 4-to-1 compressed line (A, B, C, and D in 60B plus a 4B marker), and an uncompressed line X with no marker.] Fig. 10. Inline metadata using markers: compressed lines always contain a marker in the last four bytes.

Figure 10 shows the inline-metadata design using markers, for lines that are compressible with 2-to-1 compression, 4-to-1 compression, or no compression. If the line contains two compressed lines (e.g., A and B both reside at the location of A), then the line is required to contain the marker corresponding to 2-to-1 compression in the last four bytes.

Similarly, if the line contains four compressed lines (A, B, C, and D all reside at the location of A), then the line is required to contain the marker corresponding to 4-to-1 compression. The marker reduces the space available for storing compressed lines to 60 bytes. If the compressed lines cannot fit within 60 bytes, then they are stored in uncompressed form (see Footnote 6). An incompressible line is stored in its original form, without any space reserved for the marker. The probability that an uncompressed line coincidentally matches the 32-bit marker is quite small (less than 1 in a billion). Our solution handles such extremely rare collisions with marker values by storing the colliding lines in an inverted form. This ensures that the only lines in memory that contain a marker value (in the last four bytes) are compressed lines.

Determining Compression Status with Markers: When a line is retrieved, the memory controller checks the last four bytes for a match with the markers. If there is a match with either the 2-to-1 marker or the 4-to-1 marker, we know that the line contains compressed data for two lines or four lines, respectively. If there is no match, the line is deemed to store uncompressed data. Thus, with a single access, PTMC obtains both the data and the compression status.

Handling Collisions with Markers via Inversion: We define a marker collision as the scenario where the last four bytes of an uncompressed line match one of the markers. Our design generates per-line markers (more details under Attack-Resilient Marker Codes), so the likelihood of a marker collision is quite low (less than one in a billion). However, we still need a way to handle it without incurring significant storage or complexity. PTMC handles marker collisions simply by inverting the uncompressed line and writing the inverted line to memory, as shown in Figure 11. Doing so ensures that the only lines in memory that contain a marker (in the last four bytes) are compressed lines. A dedicated on-chip structure, called the Line Inversion Table (LIT), keeps track of all the lines in memory that are stored in an inverted form. The likelihood that multiple lines resident in memory concurrently encounter marker collisions is negligibly small. For example, even if the system continuously writes to memory, it would take more than 10 million years to reach a scenario where more than 16 lines are concurrently stored in inverted form. Therefore, for our 16GB memory, we provision a 16-entry LIT in PTMC.

When a line is fetched from memory, it is checked not only against the marker, but also against the complement of the marker. If the line matches the inverted value of the marker, then we know that the line is uncompressed. However, we do not know whether the retrieved data is the original data for the line, or whether the line was stored in memory in an inverted form due to a collision with the marker. In such cases, we consult the LIT.

Footnote 6: We choose a 4B marker because our baseline 16GB memory contains 2^28 lines, so a 32-bit marker would cause less than one colliding line to reside in memory on average, while limiting the space used for storing the marker. For systems with hundreds of gigabytes of memory, we recommend a 5B marker.

[Figure 11: if the line being installed collides with a marker, it is inverted before being written to memory and its address is added to the Line Inversion Table; otherwise it is installed as-is.] Fig. 11. Line inversion handles collisions of uncompressed lines with the marker.
If the line address is present in the LIT, then we know the line was stored in an inverted form, and we write the re-inverted (original) value into the LLC. Otherwise, the data obtained from memory is written as-is to the LLC. On a write to memory, if the line address is present in the LIT and the last four bytes of the line no longer match any of the markers, then we write the line in its original form and remove the line address from the LIT. Each entry in the LIT contains a valid bit and the line address (30 bits), so our 16-entry LIT incurs a storage overhead of only 64 bytes. We recommend that the size of the LIT be increased in proportion to the memory size.

Efficiently Handling LIT Overflows: In the extremely rare case that the LIT overflows, we have two solutions to handle the scenario. (Option-1) Make the LIT memory-mapped (one inversion bit for every line in memory, stored in memory); this can support every line in memory having a collision. On a marker collision, the memory system then has to make two accesses: one access to the memory, and another to the LIT to resolve the collision. Under adversarial settings, the worst-case effect would simply be twice the bandwidth consumption. We implement updates to the LIT by resetting the LIT entry when lines with marker collisions are brought into the LLC and marking those cachelines as dirty. On eviction, these lines are forced to go through the marker-collision check and will appropriately set the corresponding LIT entry. (Option-2) On an LIT overflow, PTMC can regenerate new marker values using the random number generator, re-encode the entire memory with the new marker values, and resume execution. As LIT overflows are rare (once per 10 million years), the latency of handling LIT overflows does not affect performance.

Attack-Resilient Marker Codes: The markers in Figure 10 were chosen for simplicity of explanation. We generate per-line markers based on a hash of the line address and the global marker values. However, markers generated from simple address-based hash functions can be a target for a Denial-of-Service attack: an adversary with knowledge of the hash function can write data values intended to cause frequent LIT overflows, resulting in severe performance degradation. We address this vulnerability by using a cryptographically secure hash function (e.g., DES [20], given that marker generation can happen off the critical path) to generate marker values on a per-line basis. This makes the marker values impractical to guess without knowledge of the secret keys of the hash function, which are generated randomly for each machine. Furthermore, the secret keys are regenerated in the event of an LIT overflow, which changes the per-line markers.
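A sketch of the write-path collision check, under the simplifying assumptions of a tiny software LIT and a caller-supplied set of per-line marker values (the real design derives these from a keyed hash, as described above):

```python
class LineInversionTable:
    """Tracks the (very few) uncompressed lines stored inverted because their
    last four bytes happened to match a marker value."""
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.addrs = set()

    def is_full(self):
        return len(self.addrs) >= self.capacity

def install_uncompressed(addr, line, line_markers, lit, memory_write, on_overflow):
    """Guarantee that only genuinely compressed lines end with a marker: on a
    (roughly 2^-31) collision, store the bitwise complement and note it in the LIT."""
    if line[-4:] in line_markers:                   # collision with the 2:1 or 4:1 marker
        if lit.is_full():
            on_overflow()                           # e.g., regenerate markers (Option-2)
        lit.addrs.add(addr)
        memory_write(addr, bytes(b ^ 0xFF for b in line))   # write the inverted form
    else:
        lit.addrs.discard(addr)                     # any earlier collision is now resolved
        memory_write(addr, line)
```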

[Figure 12: per-workload and mix (mix1-mix6, MIX) speedups for TMC (Metadata-Table + Cache) vs. PTMC (Inline-Metadata + LLP); the tallest PTMC bar is labeled 1.74.] Fig. 12. Speedup of a prior TMC design with a metadata table vs. PTMC with inlined metadata and the line location predictor, normalized to uncompressed memory. PTMC eliminates the metadata lookup and improves performance.

Efficiently Invalidating Stale Copies of Data: Compression can relocate lines, and when lines get moved they can leave behind a potentially stale copy of the line. For example, in Figure 13, if adjacent lines A and B become compressible (into values A' and B'), we can move B and store lines A' and B' together in one physical location. However, an old value of B would still exist in the previous location. An LLP misprediction can result in an erroneous access to this stale data value, which could still be interpreted as valid. Keeping all locations of the line in sync would require significant bandwidth overhead. Therefore, we simply mark such lines as invalid using a special 64-byte marker value, called the Invalid Line Marker (Marker-IL). Marker-IL is also initialized at boot time using a randomly generated value, and per-line Marker-IL values can be generated as in Section IV-C. Collisions with Marker-IL are extremely rare (less than one in a quadrillion years), are also handled using line inversion, and are tracked by the LIT.

[Figure 13: before compression, lines A and B occupy their own locations; after compression, A' and B' are stored together and the old location of B is overwritten with the 64-byte Invalid Marker.] Fig. 13. Compression relocates lines and can create stale copies of data. We mark such lines as invalid to ensure correct operation.

Handling Updates to Compressed Lines: An update to a compressed line can render the entire group (of two or four lines) incompressible. Such updates must be performed carefully so that the data of the other line(s) in the group gets relocated to their original location(s). To accomplish this, we need to know the compressibility of the line when it was obtained from memory. To track this information, we provision 2 bits in the tag store of the LLC that denote the compression level when the data was read from memory. On an eviction, we can determine whether the lines were previously uncompressed, 2-to-1 compressed, or 4-to-1 compressed by checking these two bits, and we can send writes and invalidates (when applicable) appropriately.

Ganged Eviction: The write-back of a cacheline that belongs to a compressed group can require a read-modify-write operation if the other cachelines in the group are not present in the cache. Our design avoids this by using a ganged-eviction scheme that forces the eviction of all members of a compressed group if one of its members gets evicted. This ensures that all the members of a compressed group are either simultaneously present in or absent from the LLC, effectively avoiding the need for read-modify-write. Our evaluations show that ganged eviction has negligible impact on the LLC hit rate (see Footnote 7).

D. Speedup of PTMC

PTMC, with its inline metadata and LLP, can accomplish the task of locating and interpreting lines without the need for a separate metadata lookup. Figure 12 shows the performance of PTMC (with inline metadata + LLP) compared to a practical implementation of a prior TMC design (with a metadata table and metadata cache). PTMC eliminates the metadata lookup, which significantly helps both compressible and incompressible workloads. For the SPEC and MIX workloads, PTMC provides substantial speedup.
However, for the graph (GAP) workloads, PTMC still causes a slowdown. We investigate the bandwidth of PTMC to determine the cause.

E. Bandwidth Breakdown of PTMC

Figure 14 shows the bandwidth consumption of PTMC (with inline metadata + LLP), normalized to uncompressed memory. The components of the bandwidth consumption of PTMC are data, second accesses due to LLP mispredictions, and clean writebacks plus invalidates (for writing compressed data). The high location-prediction accuracy means we are able to effectively remove the cost of the metadata lookup for almost all workloads. For the graph workloads, however, the inherent cost of compression (i.e., compressing and writing back clean lines, and invalidating stale copies) is the dominant source of bandwidth overhead and the cause of the performance degradation. We therefore develop an effective scheme to disable compression when compression degrades performance.

[Figure 14: normalized bandwidth consumption broken into data, clean evictions + invalidates, and LLP mispredictions.] Fig. 14. Bandwidth consumption of the PTMC approach, normalized to uncompressed memory.

Footnote 7: We have evaluated a more complex scheme that retains lines (which requires re-fetching adjacent lines to re-compress them together), and found the difference to be minimal. This is because the remaining lines are often similarly old and likely to be evicted soon after.

[Figure 15: per-workload and mix speedups for TMC, Static-PTMC, Dynamic-PTMC, and Ideal TMC (no metadata); the tallest bar is labeled 1.74.] Fig. 15. Speedup of Static-PTMC, Dynamic-PTMC, and Ideal TMC, normalized to uncompressed memory. Dynamic-PTMC avoids slowdown for workloads that do not benefit from compression, and performs similarly to the ideal design.

V. PTMC: DYNAMIC DESIGN

Thus far, we have focused only on avoiding the metadata access overheads of compressed memory. However, even after eliminating all of the bandwidth overheads of metadata, there is still slowdown for several workloads. Compression requires additional writebacks to memory, which consume additional bandwidth. For example, when a cacheline is found to be compressible on eviction from the LLC, it needs to be written back to memory in its compressed form. What would have been a clean eviction in an uncompressed memory is now an additional writeback, which becomes a bandwidth overhead (see Footnote 8). Additionally, PTMC requires invalidates to be sent, which further adds to the bandwidth cost of implementing compression. In general, if the workload has enough reuse and spatial locality, the bandwidth cost of compression yields bandwidth savings in the long run. But for a workload with poor reuse and spatial locality (such as several graph workloads), the cost of compression does not get recovered, causing performance degradation.

A. Design of Dynamic-PTMC

To get around the inherent cost of compression, we need an effective mechanism to disable compression when the costs of compression outweigh the benefits. Fortunately, due to our robust inline-metadata approach, we can remove the cost of compression by simply deciding to stop actively compressing lines. This is in contrast to prior approaches, which would need to globally decompress memory to disable their major cost of compression, the metadata-table access. We simply need an effective mechanism to determine when to disable compression so as to provide robust performance. We make this decision by comparing, at runtime, the bandwidth cost of doing compression with the bandwidth benefits obtained from compression, and enabling/disabling compression based on this cost-benefit analysis. We call this design Dynamic-PTMC.

Bandwidth Cost of Compression: The bandwidth overhead of compression comes from sending extra writebacks (compressed writebacks from clean locations), sending invalidates, and sending requests to mispredicted locations. These are additional requests incurred due to compression that would have been avoided in an uncompressed design.

Bandwidth Benefits of Compression: Compression provides bandwidth benefits by enabling bandwidth-free prefetching. On reading a compressed line, adjacent lines get fetched without any extra bandwidth. This saves bandwidth if the prefetched lines are useful. Tracking useful prefetches therefore allows us to determine the benefits from compression. Dynamic-PTMC monitors the bandwidth costs and benefits of compression at run-time to determine whether compression should be enabled or disabled.

Footnote 8: PTMC installs new pages in an uncompressed form to avoid inaccurate prefetches. We have performed evaluations where compressible lines are initialized compressed, but we find many cases where bringing in neighboring but never-used-together lines causes cache pollution and degrades performance. As our primary goal is to design a robust, no-hurt memory compression scheme, we opted for the conservative approach of initializing lines uncompressed.
To efficiently implement Dynamic-PTMC, we use set-sampling, whereby a small fraction of the sets in the LLC (1% in our study) always implement compression, and we track cost-benefit statistics only for these sampled sets. The decision for the remaining 99% of the sets is determined by the cost-benefit analysis on the sampled sets, as shown in Figure 16. To track the cost and benefit of compression, we use a simple saturating counter. The counter is decremented on seeing a bandwidth cost and incremented on seeing a bandwidth benefit of compression. The most significant bit (MSB) of the counter determines whether compression should be enabled or disabled for the remaining sets. We use a 12-bit counter in our design. We extend Dynamic-PTMC to support per-core decisions by maintaining per-core counters and 3 bits of storage per line in the sampled sets to identify the requesting core.

[Figure 16: the sampled sets (which always compress) feed a saturating utility counter: useful prefetches increment it, while compressed writebacks, mispredictions, and invalidate requests decrement it; the resulting enable/disable decision is enforced on the remaining sets.] Fig. 16. Dynamic-PTMC analyzes the cost-benefit of compression on sampled sets to decide the compression policy.

B. Effectiveness of Dynamic-PTMC

Figure 15 shows the speedup of PTMC (which always tries to compress) and Dynamic-PTMC. PTMC without the dynamic optimization provides speedup for compressible workloads; however, it causes slowdown for several graph workloads. Dynamic-PTMC eliminates all of these degradations, ensuring robust performance: the design obtains the gains when compression is beneficial and avoids degradation when compression is harmful, be it from increased writes or low location-prediction accuracy. Dynamic-PTMC provides a speedup of up to 74% (with worst-case slowdown within 1%), nearing two-thirds of the performance of an idealized TMC design that does not incur any bandwidth overheads. Thus, Dynamic-PTMC is a robust and practical way to implement hardware-based main-memory compression.
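A rough model of the cost-benefit mechanism; the 12-bit width and 1% sampling follow the text, while the event names and the set-sampling test are illustrative:

```python
class DynamicPTMC:
    """Set-sampled cost-benefit control: sampled LLC sets always compress and
    drive a saturating counter whose MSB enables compression for the rest."""
    def __init__(self, counter_bits=12, sample_fraction=0.01):
        self.max_value = (1 << counter_bits) - 1
        self.counter = (self.max_value + 1) // 2        # start at the midpoint
        self.sample_fraction = sample_fraction

    def is_sampled_set(self, set_index, num_sets):
        return set_index < max(1, int(num_sets * self.sample_fraction))

    def record_benefit(self):
        """A prefetched neighbor was useful: one memory access was saved."""
        self.counter = min(self.counter + 1, self.max_value)

    def record_cost(self):
        """An extra compressed writeback, invalidate, or LLP misprediction."""
        self.counter = max(self.counter - 1, 0)

    def compression_enabled(self):
        return self.counter >= (self.max_value + 1) // 2   # MSB of the counter
```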

VI. RESULTS AND ANALYSIS

A. Storage Overhead of PTMC Structures

PTMC can be implemented with minor changes at the memory controller. Table III shows the storage overheads required for Dynamic-PTMC. The total storage of the additional structures at the memory controller is less than 300 bytes. In addition to these structures, PTMC needs 2 bits in the tag store of each LLC line to track prior compressibility (<0.1% overhead), and per-core Dynamic-PTMC needs 4 bits per line in the sampled sets (1% of sets) for reuse and core id.

TABLE III: STORAGE OVERHEAD OF PTMC STRUCTURES
Marker for 2-to-1: 4 bytes
Marker for 4-to-1: 4 bytes
Marker for Invalid Line: 64 bytes
Line Inversion Table (LIT): 64 bytes
Line Location Predictor (LLP): 128 bytes
Dynamic-PTMC counters: 12 bytes
Total: 276 bytes

B. Extended Evaluation

Our detailed study uses the 27 workloads that are memory intensive. Figure 17 shows the speedup with Dynamic-PTMC across an extended set of 64 workloads (SPEC2006, SPEC2017, GAP, and the 6 mixes), including ones that are not memory intensive. Dynamic-PTMC is robust in terms of performance, as it avoids degradation for all of the workloads while retaining the improvement when compression helps.

Fig. 17. Speedup of Dynamic-PTMC for 64 workloads, sorted by speedup.

C. Impact on Energy and Power

Figure 18 shows the power, energy consumption, and energy-delay product (EDP) of a system using Dynamic-PTMC, normalized to a baseline uncompressed main memory. Energy consumption is reduced as a consequence of the smaller number of requests to main memory. Overall, Dynamic-PTMC reduces energy by 5% and improves EDP by 10%.

[Figure 18: speedup, power, energy, and EDP normalized to the baseline.] Fig. 18. Dynamic-PTMC impact on energy and power.

D. Sensitivity to Memory Channels

PTMC offers bandwidth-free adjacent-line prefetches, a latency benefit that exists regardless of the number of memory channels. Table IV shows that PTMC provides consistent speedup, even with a larger number of channels.

TABLE IV: SENSITIVITY TO NUMBER OF MEMORY CHANNELS
1 channel: 8.1% average speedup
2 channels: 8.5% average speedup
4 channels: 7.8% average speedup

E. Impact of PTMC on Hit-Rate of L3

Table V shows the hit rate of the L3 cache for the baseline (uncompressed memory) and a system using PTMC. Under PTMC, the L3 hit-rate improves significantly. Thus, the adjacent lines obtained due to compression with PTMC are useful, and installing them in the L3 cache provides speedup.

TABLE V: EFFECT OF PTMC ON L3 HIT-RATE (Baseline vs. Dynamic-PTMC)
SPEC: 17.3% vs. 23.9%
GAP: 15.7% vs. 15.7%
MIX: 13.3% vs. 15.8%

F. Comparison to Larger Fetch for L3

PTMC can install adjacent lines from memory into the L3 cache. While this may resemble prefetching, there is a fundamental difference: PTMC installs additional lines in the L3 only when those lines are obtained without bandwidth overhead, whereas prefetches result in extra memory accesses that consume additional bandwidth. We compare the performance of next-line prefetching and Dynamic-PTMC in Table VI. PTMC avoids the bandwidth overheads of next-line prefetching and provides robust speedup across all workloads.

TABLE VI: COMPARISON OF PTMC TO NEXT-LINE PREFETCH (Next-Line Prefetch vs. Dynamic-PTMC)
SPEC: -5.7% vs. +8.5%
GAP: -21.1% vs. +0.0%
MIX: -7.3% vs. +4.2%

G. Applicability to Multi-Socket or DMA

Every access to a particular DRAM channel is managed by its corresponding memory controller, even if the access is on behalf of another socket or a DMA request. As we implement PTMC in the memory controller, every write to or response from DRAM can be intercepted and inverted appropriately.
Thus, PTMC works even in multi-socket and DMA scenarios.

VII. RELATED WORK

To the best of our knowledge, this is the first paper to propose a robust hardware-based main-memory compression scheme for bandwidth improvement that requires no OS support and no changes to the memory organization and protocols. We discuss prior research related to our study.

A. Low-Latency Compression Algorithms

As decompression latency is on the critical path of memory accesses, hardware compression techniques typically use simple per-line compression schemes [6] [7] [8] [9] [10] [11] [12]. We evaluate PTMC using a hybrid of FPC [6] and BDI [10]. However, PTMC is orthogonal to the compression algorithm and can be implemented with any compression algorithm, including dictionary-based ones [21] [22] [23] [24] [25].
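The FPC+BDI hybrid used in our evaluation simply keeps whichever algorithm yields the smaller encoding. A sketch with placeholder compressor callables (not real FPC/BDI implementations), where a one-byte tag stands in for the algorithm-identifying metadata that the paper stores inside the compressed line:

```python
def hybrid_compress(line, compressors):
    """Try each candidate algorithm (e.g., FPC, BDI) and keep the smallest output.
    Each compressor returns compressed bytes, or None if it cannot compress the
    line; the chosen algorithm's index is prepended for later decompression."""
    best_tag, best_payload = None, None
    for tag, compress in enumerate(compressors):
        payload = compress(line)
        if payload is not None and (best_payload is None or len(payload) < len(best_payload)):
            best_tag, best_payload = tag, payload
    if best_payload is None:
        return None                                  # incompressible under every algorithm
    return bytes([best_tag]) + best_payload          # tag counts toward the compressed size
```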


More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

Spring 2018 :: CSE 502. Cache Design Basics. Nima Honarmand

Spring 2018 :: CSE 502. Cache Design Basics. Nima Honarmand Cache Design Basics Nima Honarmand Storage Hierarchy Make common case fast: Common: temporal & spatial locality Fast: smaller, more expensive memory Bigger Transfers Registers More Bandwidth Controlled

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand Caches Nima Honarmand Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required by all of the running applications

More information

A Comprehensive Analytical Performance Model of DRAM Caches

A Comprehensive Analytical Performance Model of DRAM Caches A Comprehensive Analytical Performance Model of DRAM Caches Authors: Nagendra Gulur *, Mahesh Mehendale *, and R Govindarajan + Presented by: Sreepathi Pai * Texas Instruments, + Indian Institute of Science

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2017 Lecture 15 LAST TIME: CACHE ORGANIZATION Caches have several important parameters B = 2 b bytes to store the block in each cache line S = 2 s cache sets

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation

Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation Xiangyao Yu 1 Christopher J. Hughes 2 Nadathur Satish 2 Onur Mutlu 3 Srinivas Devadas 1 1 MIT 2 Intel Labs 3 ETH Zürich ABSTRACT

More information

SILC-FM: Subblocked InterLeaved Cache-Like Flat Memory Organization

SILC-FM: Subblocked InterLeaved Cache-Like Flat Memory Organization 2017 IEEE International Symposium on High Performance Computer Architecture SILC-FM: Subblocked InterLeaved Cache-Like Flat Memory Organization Jee Ho Ryoo The University of Texas at Austin Austin, TX

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories

More information

V. Primary & Secondary Memory!

V. Primary & Secondary Memory! V. Primary & Secondary Memory! Computer Architecture and Operating Systems & Operating Systems: 725G84 Ahmed Rezine 1 Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM)

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Computer Architecture Lecture 24: Memory Scheduling

Computer Architecture Lecture 24: Memory Scheduling 18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM

More information

PreSET: Improving Performance of Phase Change Memories by Exploiting Asymmetry in Write Times

PreSET: Improving Performance of Phase Change Memories by Exploiting Asymmetry in Write Times PreSET: Improving Performance of Phase Change Memories by Exploiting Asymmetry in Write Times Moinuddin K. Qureshi Michele M. Franceschini Ashish Jagmohan Luis A. Lastras Georgia Institute of Technology

More information

Limiting the Number of Dirty Cache Lines

Limiting the Number of Dirty Cache Lines Limiting the Number of Dirty Cache Lines Pepijn de Langen and Ben Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology

More information

arxiv: v1 [cs.ar] 10 Apr 2017

arxiv: v1 [cs.ar] 10 Apr 2017 Banshee: Bandwidth-Efficient DRAM Caching Via Software/Hardware Cooperation Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, Onur Mutlu, Srinivas Devadas CSAIL, MIT, Intel Labs, ETH Zurich {yxy, devadas}@mit.edu,

More information

virtual memory. March 23, Levels in Memory Hierarchy. DRAM vs. SRAM as a Cache. Page 1. Motivation #1: DRAM a Cache for Disk

virtual memory. March 23, Levels in Memory Hierarchy. DRAM vs. SRAM as a Cache. Page 1. Motivation #1: DRAM a Cache for Disk 5-23 March 23, 2 Topics Motivations for VM Address translation Accelerating address translation with TLBs Pentium II/III system Motivation #: DRAM a Cache for The full address space is quite large: 32-bit

More information

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0.

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0. IBM Optim Performance Manager Extended Edition V4.1.0.1 Best Practices Deploying Optim Performance Manager in large scale environments Ute Baumbach (bmb@de.ibm.com) Optim Performance Manager Development

More information

Architectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad

Architectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad nc. Application Note AN1801 Rev. 0.2, 11/2003 Performance Differences between MPC8240 and the Tsi106 Host Bridge Top Changwatchai Roy Jenevein risc10@email.sps.mot.com CPD Applications This paper discusses

More information

Flexible Cache Error Protection using an ECC FIFO

Flexible Cache Error Protection using an ECC FIFO Flexible Cache Error Protection using an ECC FIFO Doe Hyun Yoon and Mattan Erez Dept Electrical and Computer Engineering The University of Texas at Austin 1 ECC FIFO Goal: to reduce on-chip ECC overhead

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti Computer Sciences Department University of Wisconsin-Madison somayeh@cs.wisc.edu David

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2015 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2015 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2015 Lecture 15 LAST TIME! Discussed concepts of locality and stride Spatial locality: programs tend to access values near values they have already accessed

More information

Key Point. What are Cache lines

Key Point. What are Cache lines Caching 1 Key Point What are Cache lines Tags Index offset How do we find data in the cache? How do we tell if it s the right data? What decisions do we need to make in designing a cache? What are possible

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

Token Coherence. Milo M. K. Martin Dissertation Defense

Token Coherence. Milo M. K. Martin Dissertation Defense Token Coherence Milo M. K. Martin Dissertation Defense Wisconsin Multifacet Project http://www.cs.wisc.edu/multifacet/ University of Wisconsin Madison (C) 2003 Milo Martin Overview Technology and software

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350):

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): Motivation for The Memory Hierarchy: { CPU/Memory Performance Gap The Principle Of Locality Cache $$$$$ Cache Basics:

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 12: Advanced Caching Prof. Onur Mutlu Carnegie Mellon University Announcements Chuck Thacker (Microsoft Research) Seminar Tomorrow RARE: Rethinking Architectural

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University

More information

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic

More information

Slide Set 9. for ENCM 369 Winter 2018 Section 01. Steve Norman, PhD, PEng

Slide Set 9. for ENCM 369 Winter 2018 Section 01. Steve Norman, PhD, PEng Slide Set 9 for ENCM 369 Winter 2018 Section 01 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary March 2018 ENCM 369 Winter 2018 Section 01

More information

Show Me the $... Performance And Caches

Show Me the $... Performance And Caches Show Me the $... Performance And Caches 1 CPU-Cache Interaction (5-stage pipeline) PCen 0x4 Add bubble PC addr inst hit? Primary Instruction Cache IR D To Memory Control Decode, Register Fetch E A B MD1

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 18: Virtual Memory Lecture Outline Review of Main Memory Virtual Memory Simple Interleaving Cycle

More information

Caches. Hiding Memory Access Times

Caches. Hiding Memory Access Times Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY

More information

Practical Near-Data Processing for In-Memory Analytics Frameworks

Practical Near-Data Processing for In-Memory Analytics Frameworks Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design Edited by Mansour Al Zuair 1 Introduction Programmers want unlimited amounts of memory with low latency Fast

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

Improving Cache Performance using Victim Tag Stores

Improving Cache Performance using Victim Tag Stores Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

More information

Mo Money, No Problems: Caches #2...

Mo Money, No Problems: Caches #2... Mo Money, No Problems: Caches #2... 1 Reminder: Cache Terms... Cache: A small and fast memory used to increase the performance of accessing a big and slow memory Uses temporal locality: The tendency to

More information

CPU issues address (and data for write) Memory returns data (or acknowledgment for write)

CPU issues address (and data for write) Memory returns data (or acknowledgment for write) The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives

More information

Virtualized and Flexible ECC for Main Memory

Virtualized and Flexible ECC for Main Memory Virtualized and Flexible ECC for Main Memory Doe Hyun Yoon and Mattan Erez Dept. Electrical and Computer Engineering The University of Texas at Austin ASPLOS 2010 1 Memory Error Protection Applying ECC

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

1. Creates the illusion of an address space much larger than the physical memory

1. Creates the illusion of an address space much larger than the physical memory Virtual memory Main Memory Disk I P D L1 L2 M Goals Physical address space Virtual address space 1. Creates the illusion of an address space much larger than the physical memory 2. Make provisions for

More information

Doppelgänger: A Cache for Approximate Computing

Doppelgänger: A Cache for Approximate Computing Doppelgänger: A Cache for Approximate Computing ABSTRACT Joshua San Miguel University of Toronto joshua.sanmiguel@mail.utoronto.ca Andreas Moshovos University of Toronto moshovos@eecg.toronto.edu Modern

More information

Intel Software Guard Extensions (Intel SGX) Memory Encryption Engine (MEE) Shay Gueron

Intel Software Guard Extensions (Intel SGX) Memory Encryption Engine (MEE) Shay Gueron Real World Cryptography Conference 2016 6-8 January 2016, Stanford, CA, USA Intel Software Guard Extensions (Intel SGX) Memory Encryption Engine (MEE) Shay Gueron Intel Corp., Intel Development Center,

More information

Computer Architecture Memory hierarchies and caches

Computer Architecture Memory hierarchies and caches Computer Architecture Memory hierarchies and caches S Coudert and R Pacalet January 23, 2019 Outline Introduction Localities principles Direct-mapped caches Increasing block size Set-associative caches

More information

Memory Management. Dr. Yingwu Zhu

Memory Management. Dr. Yingwu Zhu Memory Management Dr. Yingwu Zhu Big picture Main memory is a resource A process/thread is being executing, the instructions & data must be in memory Assumption: Main memory is infinite Allocation of memory

More information

CSE 120. Translation Lookaside Buffer (TLB) Implemented in Hardware. July 18, Day 5 Memory. Instructor: Neil Rhodes. Software TLB Management

CSE 120. Translation Lookaside Buffer (TLB) Implemented in Hardware. July 18, Day 5 Memory. Instructor: Neil Rhodes. Software TLB Management CSE 120 July 18, 2006 Day 5 Memory Instructor: Neil Rhodes Translation Lookaside Buffer (TLB) Implemented in Hardware Cache to map virtual page numbers to page frame Associative memory: HW looks up in

More information

Introduction to OpenMP. Lecture 10: Caches

Introduction to OpenMP. Lecture 10: Caches Introduction to OpenMP Lecture 10: Caches Overview Why caches are needed How caches work Cache design and performance. The memory speed gap Moore s Law: processors speed doubles every 18 months. True for

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

CS162 Operating Systems and Systems Programming Lecture 14. Caching (Finished), Demand Paging

CS162 Operating Systems and Systems Programming Lecture 14. Caching (Finished), Demand Paging CS162 Operating Systems and Systems Programming Lecture 14 Caching (Finished), Demand Paging October 11 th, 2017 Neeraja J. Yadwadkar http://cs162.eecs.berkeley.edu Recall: Caching Concept Cache: a repository

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Virtual Memory 11282011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review Cache Virtual Memory Projects 3 Memory

More information

Advanced Memory Organizations

Advanced Memory Organizations CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU

More information

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5 th Edition Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic

More information

Motivations for Virtual Memory Virtual Memory Oct. 29, Why VM Works? Motivation #1: DRAM a Cache for Disk

Motivations for Virtual Memory Virtual Memory Oct. 29, Why VM Works? Motivation #1: DRAM a Cache for Disk class8.ppt 5-23 The course that gives CMU its Zip! Virtual Oct. 29, 22 Topics Motivations for VM Address translation Accelerating translation with TLBs Motivations for Virtual Use Physical DRAM as a Cache

More information

Virtual Memory. Motivation:

Virtual Memory. Motivation: Virtual Memory Motivation:! Each process would like to see its own, full, address space! Clearly impossible to provide full physical memory for all processes! Processes may define a large address space

More information

1/19/2009. Data Locality. Exploiting Locality: Caches

1/19/2009. Data Locality. Exploiting Locality: Caches Spring 2009 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Data Locality Temporal: if data item needed now, it is likely to be needed again in near future Spatial: if data item needed now, nearby

More information

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti and David A. Wood University of Wisconsin-Madison 1 Please find the power point presentation

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

12 Cache-Organization 1

12 Cache-Organization 1 12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 13

ECE 571 Advanced Microprocessor-Based Design Lecture 13 ECE 571 Advanced Microprocessor-Based Design Lecture 13 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 21 March 2017 Announcements More on HW#6 When ask for reasons why cache

More information