Cache Memory Configurations and Their Respective Energy Consumption


Dylan Petrae
Department of Electrical and Computer Engineering
University of Central Florida
Orlando, FL

Abstract

When it comes to accessing data in a computer system, the memory hierarchy is critical. Accessing data held in larger-capacity memory, or in the Last Level Cache (LLC) of the system, quickly becomes time- and energy-consuming. The purpose of computer systems is to provide accurate, quick, and efficient data manipulation, and accessing these LLCs is becoming a bottleneck for fast computation. This paper analyzes the benefits of using Spin-Transfer Torque Random Access Memory (STT-RAM) instead of Static RAM (SRAM) in different levels of cache, and how different applications and cache configurations yield different cache latency and energy consumption.

Keywords: CPU, DRAM, SRAM, STT-RAM, Cache, Registers, Memory Hierarchy, Blocks, Lines, Hit Rate, Miss Rate, Associativity, Direct-Mapped Cache, Fully Associative, Set Associative, Volatile Memory, Non-Volatile Memory

I. INTRODUCTION

On the motherboard sit the central processing unit (CPU) and dynamic random access memory (DRAM). Instructions and data are stored in DRAM and must be referenced and accessed by the CPU, since storage space inside the CPU is limited. This limited space introduces the need for a memory hierarchy to improve the efficiency of the machine. The closer a memory component is to the CPU, the faster its access time, since signals have a shorter distance to travel. Another contributing factor to memory access speed is the size of the memory device in question. Smaller memory, such as the limited number of registers in the CPU itself, can be accessed much faster than the much larger memory space of DRAM, since there are fewer locations to search and fewer address bits to decode in order to find the requested data.
An element of memory falls between the CPU registers and DRAM in the memory hierarchy: the cache. The cache itself appears in multiple levels. Since programs tend to access only a small portion of their address space at any given time, it is very beneficial to place the most frequently used parts of the program in the memory devices that are closest to the CPU and smallest in capacity. The cache levels (ordered L1, L2, L3, etc.) that are closer to the CPU and smaller in capacity can be accessed faster than the higher-numbered levels. Following the 90/10 rule, which states that 90% of the work a program does stems from 10% of its code, frequently used parts of the program, such as loops, should reside in the L1 cache. The goal is to access the farther elements of memory as seldom as possible.

Two metrics measure the effectiveness of memory access. Hit rate is the fraction of memory accesses found in cache. The higher this metric, the faster the program runs, since the data is in the first place it is looked for. Miss rate is the fraction of memory accesses that are NOT found in cache, and equals 1 - Hit Rate.

A conflict miss occurs when more than one memory location maps to the same location in cache, so that one block evicts another. Associativity is a technique to reduce conflict misses and improve hit rate. Three cache block placement designs can be used in different applications.

The first design is direct-mapped cache, which has no associativity: each memory block maps to exactly one cache line. The advantage of this placement is that the sought line is found with a simple bitwise modulo of the address, while the disadvantage is that conflict misses lower the hit rate.

The second design is fully associative, which is unrestricted associativity. Its advantage is a fully flexible mapping: a block may be placed in any line, up to the capacity of the cache. Its disadvantage is that it has the largest tag field, which requires more comparisons and yields a longer tag search time.

The final design is set associative, which is bounded associativity. This design is a hybrid of the previous two. It is also known as k-way set associative, where each block can be stored in any of k lines. The advantage of this hybrid is that it balances the flexibility of fully associative placement against the complexity of tag matching.

The contents of the cache are addressed differently under the different strategies. For direct-mapped cache, let the width of the address bus be a bits, and let each word contain 2^w bytes. The cache size is 2^n blocks, so n bits are needed for the line index. The block size is 2^m words, so m bits are needed for the word index within the block. The tag field size is the difference between the address bus width and the sum of n, m, and w, that is, a - (n + m + w) bits.

[Figure 1. Direct-mapped cache fields]

The set-associative strategy is laid out in a similar way to direct-mapped, but the line (n) field is replaced with a set (s) field of s bits, so the cache contains 2^s sets. Every block maps to exactly one set, but within that set it may be placed in any of the k lines (ways), so each block has k possible locations in the cache.

[Figure 2. Set-associative cache fields]

Device technologies are distinguished as volatile or non-volatile. Volatile memory requires a voltage supply to maintain its values. Two volatile devices are static RAM (SRAM) and dynamic RAM (DRAM). SRAM storage relies on transistors; DRAM, on the other hand, relies on capacitors. STT-RAM is non-volatile memory. Non-volatile memory is primarily used in secondary storage devices located farther from the CPU. Since it does not need a voltage supply to maintain values, non-volatile memory is well suited to long-term storage that must be preserved when the power is turned off.

In this paper, cache latency and energy consumption are the two metrics covered. Energy consumption is measured in nanojoules (nJ) or joules (J) and is the energy required to access memory and manipulate bits. Cache latency is measured in nanoseconds (ns) or seconds. Eight sources will be identified and analyzed, examining these two metrics and how improvements have been made from the year 2000 through today with the potential switch to STT-RAM from SRAM-based cache designs.

II. LITERATURE REVIEW

In the last 10 years, Spin-Transfer Torque Random Access Memory (STT-RAM) has emerged as a new memory design. It has a multitude of trade-offs versus one of its predecessors, SRAM. SRAM has equal read and write speeds, whereas STT-RAM, being non-volatile, has slow write speeds and high dynamic energy. However, STT-RAM needs no standby power and offers high density and low leakage power compared with SRAM. Older RAM technologies, which use stored electron charge to represent data, are reaching a technological limit where speeds cannot become much faster. The new spintronic devices instead use the angular momentum (spin) of electrons.

An STT-RAM device is built from Magnetic Tunnel Junctions (MTJs). The magnetization direction of the storage layer determines the stored state, and switching the magnetization direction switches the value from a 1 to a 0. If the reference layer and storage layer are magnetized in the same direction, the MTJ is in its low-resistance state and the value of the data is 0. When they are magnetized in opposite directions, the MTJ is in its high-resistance state and the value is 1. To write, the device only needs to drive a spin-polarized current through the MTJs that need to be altered. STT-RAM can exhibit slower read and write latency and higher energy usage (nJ), but offers higher density and low leakage power [4].

Proposals have been made to create multi-level cells (MLC) for the STT-RAM design. This would increase the density of the cell, but creates a write disturbance. Since the soft bits of the cell have to be reset right after the hard bits change, these resets can become cumbersome and create an energy consumption issue. A possible solution is to overwrite these soft bits using a read reuse distance (RRD) replacement policy, which uses the instructions in cache to predict the reuse of blocks [3].

Another technology analyzed is the eDRAM design, and how the probability of failure of different technology nodes correlates with their retention time. Evidence shows that a smaller (lower-nanometer) technology node is more likely to suffer retention failure than a larger node [2].

In 2013, a study examined whether eDRAM could be a viable alternative to SRAM. While eDRAM has low leakage energy and higher density, its frequent refreshes become the primary energy consumer. The results show that, with proper control of the refresh functions, eDRAM is potentially a very viable memory device technology because it can be much more energy efficient [9].

In 2012, studies were conducted to determine whether STT-RAM can outperform SRAM despite its slow write speeds and long retention time. Figure 3 compares SRAM with STT-RAM of varying retention times, as well as an iteration of STT-RAM with a refresh period. Total energy is reduced because leakage power is reduced at the same time [6]. The design of STT-RAM can also be improved by adding a refresh period to compensate for the shortened retention time. Even though dynamic energy increases because of more frequent writes within the caches, the increase is negligible compared with total energy usage.

A similar study in 2011 investigated trading the non-volatility of STT-RAM for a more energy- and performance-efficient device, using the aforementioned refresh scheme and a reduced retention time. Figure 4 shows how much lower the leakage power is for STT-RAM compared with SRAM. The study's system also used different retention levels at different cache levels, optimizing for their differing access patterns [8].

In 2006, a study suggested that an increase in data cache size correlates with more efficient data sharing. This correlation means that it is beneficial to have a shared last level cache (LLC), as opposed to an LLC partitioned into numerous private caches [7]. Latency between the LLC and the memory devices beyond it can be significantly reduced by exploiting the increased efficiency of data sharing.

Looking a little further into the past, a 1994 study examined the debate between Concurrent-Read-Exclusive-Write (CREW) access and Concurrent-Read-Concurrent-Write (CRCW) access, comparing the percentage of reads against the time (in nanoseconds) to complete them. It found that the fastest access time comes from a cache-to-cache scheme for transferring data, but that similar results can be obtained from directory schemes that are much less complex [1].
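To make the Introduction's hit rate, miss rate, and conflict-miss definitions concrete, the following is a minimal sketch of a cache simulator. It is not taken from any of the cited studies: the function name `simulate`, the 64-byte line size, the eight-line cache, and the LRU replacement policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict


def simulate(addresses, num_lines, ways, line_bytes=64):
    """Return the hit rate of a k-way set-associative cache.

    ways=1 models a direct-mapped cache; ways=num_lines models a fully
    associative cache. LRU replacement is an assumption of this sketch.
    """
    num_sets = num_lines // ways
    # One ordered dict per set: keys are tags, order tracks recency of use.
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in addresses:
        block = addr // line_bytes   # drop the byte offset within the line
        index = block % num_sets     # set (or line) index: a simple modulo
        tag = block // num_sets      # remaining upper bits form the tag
        lines = sets[index]
        if tag in lines:
            hits += 1
            lines.move_to_end(tag)   # mark as most recently used
        else:
            if len(lines) == ways:   # set is full: evict least recently used
                lines.popitem(last=False)
            lines[tag] = True
    return hits / len(addresses)


if __name__ == "__main__":
    # Two addresses that map to the same line of an 8-line direct-mapped cache.
    trace = [0, 8 * 64] * 4
    for ways in (1, 2, 8):
        hit_rate = simulate(trace, num_lines=8, ways=ways)
        print(f"{ways}-way: hit rate {hit_rate:.2f}, miss rate {1 - hit_rate:.2f}")
```

With this hypothetical trace, the direct-mapped configuration misses on every access because the two blocks keep evicting each other, while the 2-way and fully associative configurations miss only on the first touch of each block.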

III. DATA ANALYSIS

[Figure 3. Total energy usage of SRAM vs. STT-RAM of varying retention times. Vertical axis: normalized energy usage. Bars: S-1MB, M-4MB, M-4MB(1s), M-4MB(1s), M-4MB(1ms)]

This initial graph shows that STT-RAM can become more energy efficient than standard SRAM when the retention time is reduced to avoid the potentially large energy overhead that STT-RAM incurs with its slow write speeds.

[Figure 4. SRAM vs. STT-RAM power leakage. Vertical axis: leakage power (mW). Bars: SRAM, Low-Retention STT-RAM, Med-Retention STT-RAM, Hi-Retention STT-RAM]

This graph shows how significantly lower the leakage power is for STT-RAM compared with SRAM, which can benefit power consumption.
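The trade-off behind Figures 3 and 4 can be summarized with a first-order energy decomposition. This is a generic textbook-style model, not a formula taken from the cited studies, and every symbol below is an assumption introduced for illustration: T is the run time, P_leak the leakage power, and N and E the count and per-operation energy of reads, writes, and refreshes.

```latex
\begin{align}
E_{\text{total}} &= E_{\text{dyn}} + P_{\text{leak}} \, T \\
E_{\text{dyn}} &= N_{\text{read}} E_{\text{read}}
                + N_{\text{write}} E_{\text{write}}
                + N_{\text{refresh}} E_{\text{refresh}}
\end{align}
```

Under this model, lowering the retention time and adding a refresh scheme raises the dynamic term, but a large cache with a much smaller leakage term can still come out ahead in total energy, which is the behaviour the two figures describe.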

[Figure 5. R/W latencies for SRAM, STT-RAM, and eDRAM. Vertical axis: latency (ns). Bars: read latency and write latency for each technology]

Read and write speeds are fairly symmetrical for SRAM and eDRAM; the write latency is where STT-RAM differs greatly. The write speed for STT-RAM is drastically slower.

[Figure 6. Areas of SRAM, STT-RAM, and eDRAM. Vertical axis: area (mm^2)]

The areas shown in this graph for the different memory device technologies suggest that STT-RAM is a much denser option for cache, which can benefit fabrication costs as well as data access.

TABLE I. COMPARING MEMORY SPACE AND LATENCY BETWEEN CACHE LEVELS

Columns: Processor (# of cores, Freq.), followed by Capacity, Set Assoc., Device Tech., # of CL, and Protocol for each of Level 1 (L1) for Instruction (I) or Data (D), Level 2 (L2), and Level 3 (L3) or Last Level Cache (LLC).

Khoshavi [2]: 8 cores, 3GHz; L1: 32KB, 8-way, SRAM, 512, MESI; L2: 512KB, 8-way, SRAM, 8192, MESI; L3: 96MB, 16-way, eDRAM, ~1M, WB
Sun [8]: 4 cores, 2GHz; L1: 32KB, 4-way, SRAM, 512, N/A; L2: 256KB, 8-way, SRAM, 4096, N/A; L3: 4MB, 16-way, STT-RAM, N/A
Crawford [1]: N/A in all columns
Chen [3]: 4 cores, 3.3GHz; L1: 32KB, 8-way, SRAM; STT
Khoshavi [4]: 8 cores, 3GHz; L1: 32KB, 8-way, SRAM, 512
Jog [6]: 4 cores, 2GHz; L1: 32KB (per core), 4-way, SRAM, 512
Jaleel [7]: 8 cores, N/A; L1: 32KB, 4-way, DRAM, 512
Chang [9]: 8 cores, 2GHz; L1: 32KB, 8-way, SRAM; STT, eDRAM
Sun [10]: 8 cores, 2GHz; L1: 16KB, 2-way, SRAM, 256
Lin [5]: 2 cores, 8 MHz; L1: 32KB, 4-way, DRAM, N/A

Remaining L1 protocol, L2, and L3 cells (row assignment not recoverable): 4MB 8-way STT-RAM N/A 8-way SRAM N/A 65,536 N/A N/A N/A N/A N/A N/A 96MB 16-way eDRAM ~1.5M N/A 1MB 16-way SRAM 16,384 N/A N/A N/A N/A N/A N/A Through 256KB 8-way DRAM MESI 256KB 8-way SRAM 4096 MESI 32MB 16-way Through 8MB 32-way STT-RAM 512KB 16-way DRAM ,72 64MB 16-way DRAM ~1M SRAM STT 524,288 eDRAM N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A

Number of cache lines (# of CL) = cache capacity (bytes) / cache line size (bytes), assuming a cache line size of 64 bytes in all cases.
Protocol column = {Write Back (WB), Write Through (WT), MESI, MOESI, Not Available (N/A)}
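The cache-line formula stated with Table I, together with the tag, index, and offset fields described in the Introduction, can be sketched as follows. The helper name `cache_geometry` and the 32-bit address width are assumptions made for illustration; only the 64-byte line size and the capacity / line-size formula come from the table notes.

```python
KB = 1024
MB = 1024 * KB


def is_power_of_two(x):
    """True when x is a positive power of two."""
    return x > 0 and x & (x - 1) == 0


def cache_geometry(capacity_bytes, ways=1, line_bytes=64, address_bits=32):
    """Compute line count, set count, and address field widths.

    lines = capacity / line size (the Table I formula); sets = lines / ways.
    Field widths are reported only when the set count and line size are
    powers of two. The 32-bit address width is an assumed default.
    """
    lines = capacity_bytes // line_bytes
    sets = lines // ways
    geometry = {"lines": lines, "sets": sets}
    if is_power_of_two(sets) and is_power_of_two(line_bytes):
        offset_bits = line_bytes.bit_length() - 1   # log2(line size)
        index_bits = sets.bit_length() - 1          # log2(number of sets)
        geometry["offset_bits"] = offset_bits
        geometry["index_bits"] = index_bits
        geometry["tag_bits"] = address_bits - index_bits - offset_bits
    return geometry


if __name__ == "__main__":
    # Capacity and associativity pairs that appear in Table I.
    examples = [
        ("32KB 8-way", 32 * KB, 8),
        ("256KB 8-way", 256 * KB, 8),
        ("512KB 8-way", 512 * KB, 8),
        ("1MB 16-way", 1 * MB, 16),
        ("4MB 16-way", 4 * MB, 16),
        ("96MB 16-way", 96 * MB, 16),
    ]
    for label, capacity, ways in examples:
        print(label, cache_geometry(capacity, ways))
```

Because 96MB is not a power of two, the sketch reports only the line and set counts for that entry rather than guessing how its index bits are encoded.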

IV. CONCLUSION

I conclude from the reading that STT-RAM device technology is capable of being an efficient alternative to SRAM and eDRAM. When presented in an iteration that combines a clock-controlled refresh scheme with a lower retention time, it proves to be extremely energy efficient without sacrificing crucial read latency. The devices tend to be much denser than other memory devices, which benefits fabrication costs. With upcoming roadblocks from fundamental physical limitations in technology, the industry needs a denser and more energy-efficient memory device.

REFERENCES

Baseline designs:

[1] S. E. Crawford and R. F. DeMara, "Cache coherence in a multiport memory environment," in Proceedings of the Second International Conference on Massively Parallel Computing Systems (MPCS-95), Ischia, Italy, May 2-6.
[2] N. Khoshavi, X. Chen, J. Wang, and R. F. DeMara, "Bit-Upset Vulnerability Factor for eDRAM Last Level Cache Immunity Analysis," in Proceedings of 17th International Symposium on Quality Electronic Design (ISQED 2016), Santa Clara, CA, USA, March 15-16, 2016.
[3] X. Chen, N. Khoshavi, J. Zhou, D. Huang, R. F. DeMara, J. Wang, W. Wen, and Y. Chen, "AOS: Adaptive Overwrite Scheme for Energy-Efficient MLC STT-RAM Cache," 53rd Design Automation Conference, Austin, TX, USA, 2016.
[4] N. Khoshavi, X. Chen, J. Wang, and R. F. DeMara, "Read-Tuned STT-RAM and eDRAM Cache Hierarchies for Throughput and Energy Enhancement," arXiv preprint, 2016.
[5] M. Lin, et al., "ASTRO: Synthesizing application-specific reconfigurable hardware traces to exploit memory-level parallelism," Microprocessors and Microsystems 39.7 (2015).

Comparison designs:

[6] A. Jog, A. K. Mishra, C. Xu, Y. Xie, V. Narayanan, R. Iyer, and C. R. Das, "Cache Revive: Architecting Volatile STT-RAM Caches for Enhanced Performance in CMPs," in Proceedings of 49th Annual Design Automation Conference (DAC), 2012.
[7] A. Jaleel, M. Mattina, and B. Jacob, "Last Level Cache (LLC) Performance of Data Mining Workloads on a CMP-a Case Study of Parallel Bioinformatics Workloads," in Proceedings of 12th International Symposium on High Performance Computer Architecture (HPCA), 2006.
[8] Z. Sun, X. Bi, H. H. Li, W.-F. Wong, Z.-L. Ong, X. Zhu, and W. Wu, "Multi Retention Level STT-RAM Cache Designs with a Dynamic Refresh Scheme," in Proceedings of 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011.
[9] M.-T. Chang, P. Rosenfeld, S.-L. Lu, and B. Jacob, "Technology Comparison for Large Last-level Caches (L3Cs): Low-leakage SRAM, Low energy STT-RAM, and Refresh-optimized eDRAM," in Proceedings of 19th International Symposium on High Performance Computer Architecture (HPCA), 2013.
[10] Z. Sun, X. Bi, and H. Li, "Process variation aware data management for STT-RAM cache design," in Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), 2012.
