Analysis of Cache Configurations and Cache Hierarchies Incorporating Various Device Technologies over the Years


Sakeenah Khan
EEL 30C: Computer Organization, Summer Semester
Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL

Abstract
The objective of this paper was to evaluate fundamental metrics for the cache configurations of selected studies, observe trends in those configurations, and compare design approaches and device technologies. The metrics studied include read and write energy consumption, read latency, k-way set-associativity, and cache capacity. Trends were identified with regard to cache hierarchy, device technology, and k-way set-associativity. Over time, the number of cache levels and the associativity (ways per set) have generally increased. While SRAM was originally the predominant device technology, newer technologies such as STT-RAM, ReRAM, eDRAM, and RRAP have replaced SRAM's role in L2 and L3/LLC caches. The performance of the device technologies was compared by examining read and write energy and read latency. STT-RAM had the highest energy consumption on average, while RRAP and eDRAM had the lowest read latency on average.

Keywords
fast/slow memory, memory hierarchy, multilevel cache, hit/miss ratio, hit time, miss penalty, associativity, direct-mapped, fully-associative, set-associative, SRAM, STT-RAM, ReRAM, eDRAM, RRAP, cache lines, read/write energy, read latency

I. INTRODUCTION
Primary memory consists of registers, cache, and main memory (DRAM). The registers and cache are located in the CPU and are considered fast memory, while main memory is considered slow memory. This split stems from the fact that capacity and speed are opposing properties. A memory hierarchy exploits the principle of locality (programs use a small part of their memory space frequently) to create memory that behaves as if it were large, fast, and inexpensive. This is accomplished by storing frequently used data in fast memory and infrequently used data in slow memory.
The cache holds recently and frequently used data for fast reference and can be extended to a hierarchy of levels. With each level farther from the CPU, speed decreases while capacity increases. Multilevel caches are useful because they decrease the average memory access time, provided that the requested data is in the cache. Multiple levels reduce the miss penalty, because a miss in one level can often be served by the next cache level instead of by main memory.

FIGURE I. A TYPICAL MEMORY HIERARCHY

If the requested data is located in the cache, the access is a hit; if not, it is a miss. The hit ratio is the fraction of memory accesses found in the cache, while the miss ratio is the fraction of memory accesses not found in the cache. Likewise, the hit time is the time to access data in the cache, and the miss penalty is the time to access data from main memory and replace a block in the cache. When the CPU requests a word at a read address (RA), the cache is first checked for the block containing the RA's contents. On a hit, the memory access is fast. On a miss, main memory must be accessed, the block containing the RA's contents is read and transferred to the cache, and finally the requested word is forwarded to the processor. This is a longer process, so in general the hit time is much smaller than the miss penalty.

Associativity determines which cache lines a given block of memory may occupy. There are three design strategies pertaining to cache associativity: direct-mapped, fully-associative, and set-associative. In the direct-mapped strategy, there is no associativity; each block is mapped to exactly one possible line. When data that is in the cache is referenced, it is located by the line index and the tag, both of which are fields of the memory address.
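The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of any cache from the surveyed studies: the class name `Cache`, the helper `run_trace`, the 64-byte line size, the 4-line capacity, and the use of LRU replacement are all assumptions chosen for the example. Setting `ways=1` gives a direct-mapped cache, and larger values give a k-way set-associative cache of the same capacity.

```python
from collections import OrderedDict


class Cache:
    """Toy k-way set-associative cache; ways=1 behaves as direct-mapped."""

    def __init__(self, num_lines: int, ways: int = 1, line_size: int = 64):
        if num_lines % ways != 0:
            raise ValueError("num_lines must be a multiple of ways")
        self.line_size = line_size
        self.ways = ways
        self.num_sets = num_lines // ways
        # One ordered mapping per set: keys are tags, order tracks recency of use.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def locate(self, address: int):
        """Split a byte address into (tag, index) for this cache geometry."""
        block = address // self.line_size      # drop the byte offset within the line
        return block // self.num_sets, block % self.num_sets

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, load the block (assumed LRU eviction)."""
        tag, index = self.locate(address)
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)             # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) == self.ways:
            lines.popitem(last=False)          # evict the least recently used tag
        lines[tag] = None
        return False

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def run_trace(cache: Cache, trace) -> float:
    """Replay a list of byte addresses and return the resulting hit ratio."""
    for address in trace:
        cache.access(address)
    return cache.hit_ratio()


# Hypothetical trace: two addresses that share one line in a 4-line direct-mapped cache.
trace = [0, 256] * 3
direct = Cache(num_lines=4, ways=1)
two_way = Cache(num_lines=4, ways=2)
print("direct-mapped hit ratio:", run_trace(direct, trace))
print("2-way hit ratio:", run_trace(two_way, trace))
```

In this trace the two addresses evict each other on every access in the direct-mapped cache, while the 2-way cache holds both blocks in the same set after the first two misses.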
Under the direct-mapped strategy, there are more conflict misses, which occur when multiple memory locations are mapped to the same cache location. If the memory access is a miss and the cache is full, a capacity miss has occurred. The data that was just accessed replaces a line in the cache (which line depends on the line index; the tag and data are rewritten).

In the fully-associative strategy, there is unrestricted associativity; any block of memory can be mapped to any line of the cache. This allows fully flexible mapping up to the capacity of the cache. However, since this strategy has the largest tag field and every line is a candidate, the tag search takes longer. There are no conflict misses, only capacity misses.

In the set-associative strategy (referred to as k-way set-associative), there is bounded associativity. Each block of memory is mapped to one set. There are k lines in each set, so a given block of memory can be stored in any one of those k lines. When data that is in the cache is referenced, it is located by the set and the tag. Some conflict misses can occur, but fewer than with the direct-mapped strategy. For both the fully-associative and set-associative strategies, when a block must be evicted, the blocks most likely to be used are kept in the cache. Either the LRU (least recently used) or the LFU (least frequently used) block is replaced.

The device technologies studied in this paper include SRAM, STT-RAM, eDRAM, RRAP, and ReRAM. SRAM, eDRAM, and RRAP are volatile memory technologies, meaning they require a voltage supply to maintain values. STT-RAM and ReRAM are non-volatile [].

Within this paper, thirteen studies with unique cache configurations were analyzed and compared. These studies span from 1995 to 2016. Section II summarizes the literature and observes trends in cache hierarchies, device technologies, and set-associativity. Section III analyzes the data and observes how different device technologies have different energy consumptions and read latencies.

II. LITERATURE REVIEW
Thirteen research studies, spanning 1995 to 2016, were analyzed throughout this paper. Most of the studies' cache configurations employed set-associative strategies. From 2005 to 2007, the three selected studies all had two cache levels.
In the 2005 to 2007 studies, the set-associativity for L1 was 2-way for sources [11] and [12] and 4-way for source [10]. For L2, the set-associativity was 8-way for sources [10] and [12] and 16-way for source [11].

From 2011 to 2013, SRAM and STT-RAM technologies were popular among the four selected studies. The study from 2011 employed three cache levels [7], while the others employed two cache levels [6, 8, 13]. The device technology for L1 in these studies was SRAM, with either 2-way or 4-way set associativity [7, 6, 8]. The study from 2011 used SRAM for L1, L2, and L3 [7]. In the two studies from 2012, STT-RAM technology was used for L2 with 16-way and 8-way set associativity [6, 8].

Of the studies from 2016, two had two cache levels [2, 3] while the other two had three [4, 9]. SRAM and STT-RAM were still popular, while new technologies such as eDRAM, RRAP, and ReRAM emerged. Three of the studies used SRAM technology for L1, and all four used 8-way set-associativity for L1 [2, 3, 4, 9]. None of the studies employed SRAM technology for L2 or L3.

Over the past decade, three-level caches became more common. The associativity (number of ways per set) of the L1 cache has generally increased. For most of the studies, SRAM technology was used for the L1 cache. As for L2 and L3, different alternatives were introduced over the decade to replace SRAM technology.

To meet the demands for high performance and energy efficiency, a large portion of a modern processor is occupied by multilevel SRAM caches. This fast, low-capacity technology is most often employed in the L1 and L2 levels of the cache. However, SRAM's significant leakage power and cell area are great disadvantages [3, 4]. Leakage power can be greatly decreased by replacing SRAM LLCs with non-volatile memory technologies such as STT-RAM (spin-transfer torque RAM) and ReRAM (resistive RAM). STT-RAM's advantages include its near-zero leakage power, high cell density, and short read access time [3, 4, 6, 8, 9]. However, key drawbacks of STT-RAM include its long write latency and high write energy [6, 8].
ReRAM's largest advantage is its high compatibility with CMOS, which makes it a strong cost competitor to SRAM. However, it has a longer access latency and lower cell endurance than STT-RAM, making it more suitable as the LLC technology in a deep cache hierarchy (e.g. a three-level cache) [9].

More recently, large eDRAM (embedded dynamic RAM) has been introduced as the LLC technology to further narrow the core-memory speed gap. eDRAM offers high cache capacity, smaller area, and faster on-chip communication. However, it also has a high refresh demand, because stored values must be periodically refreshed to remain valid, which increases the dynamic energy consumption [2, 4].

Another recently introduced technology is RRAP (read reference activity persistent), which optimizes the L2 cache and maximizes the benefit of STT-RAM's extra capacity by using a heterogeneous STT-RAM. RRAP accelerates service to critical load requests from the LLC while also efficiently managing regular L2 cache requests.

III. DATA ANALYSIS
Table I details the information and metrics provided in each of the studies, including the cache hierarchies, cache capacities, set-associativity, device technologies, and protocols. The number of cache lines (CL) was also included in Table I and was calculated with the following equation (assuming the cache line size is always 64 Bytes):

EQUATION I. [# of CL] = [cache capacity] / (64 Bytes)

Table II contains the read and write energy and the read latency from five studies. The read and write energy is the sum of the read energy and the write energy. To convert the read latency from cycles to ns, the following equation was used:

EQUATION II. [latency (ns)] = [cycles] / [frequency (GHz)]

Figures II and III illustrate the data given in Table II in the form of bar graphs. Figure II shows the read and write energy consumption for different device technologies over the years.
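As a quick check of Equations I and II, the two conversions can be written as short Python helpers. The function names and the sample latency inputs (6 cycles at 3 GHz) are illustrative assumptions and are not taken from any of the surveyed studies; the 64-byte line size is the one assumed in Equation I.

```python
LINE_SIZE_BYTES = 64  # cache line size assumed by Equation I
KB = 1024


def num_cache_lines(capacity_bytes: int, line_size: int = LINE_SIZE_BYTES) -> int:
    """Equation I: number of cache lines = cache capacity / line size."""
    return capacity_bytes // line_size


def latency_ns(cycles: float, frequency_ghz: float) -> float:
    """Equation II: latency in ns = cycles / clock frequency in GHz."""
    return cycles / frequency_ghz


print(num_cache_lines(32 * KB))    # 32 KB cache
print(num_cache_lines(512 * KB))   # 512 KB cache
print(latency_ns(6, 3.0))          # hypothetical 6-cycle access at 3 GHz
```

For example, a 32 KB cache with 64-byte lines holds 32768 / 64 = 512 lines, and a 512 KB cache holds 8192 lines.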

Figure III shows the cache read latency for different device technologies over the years. As Figure II shows, using SRAM for L1 requires little energy. The order of technologies from least to most energy consuming on average is SRAM, RRAP, eDRAM, and STT-RAM. STT-RAM's average energy consumption is significantly higher than that of the other technologies. The order of technologies from lowest to highest average read latency is eDRAM, RRAP, SRAM, and STT-RAM. The average read latencies of eDRAM and RRAP are almost half those of SRAM and STT-RAM.

IV. CONCLUSION
After analyzing the literature, clear trends became apparent over time and across device technologies. Most of the studies' cache configurations were set-associative. Over the past decade, three-level caches became more common, and the associativity (number of ways per set) generally increased. Across the board, SRAM technology was used for the L1 cache. In earlier literature, SRAM technology was much more prevalent and was often used for every cache level. As years passed, SRAM technology was increasingly replaced in L2 and L3 by non-volatile technologies such as STT-RAM and ReRAM. Most recently, eDRAM and RRAP technologies have emerged. eDRAM is often used for the LLC, while the RRAP strategy would be used in L2. As observed from the literature, among the different device technologies, STT-RAM had the highest average energy consumption, while SRAM and STT-RAM had the highest average read latencies.

REFERENCES
[1] S. E. Crawford and R. F. DeMara, "Cache coherence in a multiport memory environment," in Proceedings of the Second International Conference on Massively Parallel Computing Systems (MPCS-95), Ischia, Italy, May -6, 1995.
[2] N. Khoshavi, X. Chen, J. Wang, and R. F. DeMara, "Bit-Upset Vulnerability Factor for eDRAM Last Level Cache Immunity Analysis," in Proceedings of the 17th International Symposium on Quality Electronic Design (ISQED), Santa Clara, CA, USA, March 15-16, 2016.
[3] X. Chen, N. Khoshavi, J. Zhou, D. Huang, R. F. DeMara, J. Wang, W. Wen, and Y. Chen, "AOS: Adaptive Overwrite Scheme for Energy-Efficient MLC STT-RAM Cache," in 53rd Design Automation Conference, Austin, TX, USA, 2016.
[4] N. Khoshavi, X. Chen, J. Wang, and R. F. DeMara, "Read-Tuned STT-RAM and eDRAM Cache Hierarchies for Throughput and Energy Enhancement," arXiv preprint, 2016.
[5] M. Lin, et al., "ASTRO: Synthesizing application-specific reconfigurable hardware traces to exploit memory-level parallelism," Microprocessors and Microsystems 39.7 (2015).
[6] A. Jog, A. K. Mishra, C. Xu, Y. Xie, V. Narayanan, R. Iyer, and C. R. Das, "Cache Revive: Architecting Volatile STT-RAM Caches for Enhanced Performance in CMPs," in Proceedings of the 49th Annual Design Automation Conference (DAC), 2012.
[7] Z. Sun, X. Bi, H. H. Li, W.-F. Wong, Z.-L. Ong, X. Zhu, and W. Wu, "Multi Retention Level STT-RAM Cache Designs with a Dynamic Refresh Scheme," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011.
[8] Z. Sun, X. Bi, and H. Li, "Process variation aware data management for STT-RAM cache design," in Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), 2012.
[9] M. R. Jokar, M. Arjomand, and H. Sarbazi-Azad, "Sequoia: High-Endurance NVM-Based Cache Architecture," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2016.
[10] D. Chandra, et al., "Predicting inter-thread cache contention on a chip multi-processor architecture," in 11th International Symposium on High-Performance Computer Architecture, 2005.
[11] J. Huh, et al., "A NUCA substrate for flexible CMP cache sharing," IEEE Transactions on Parallel and Distributed Systems 18.8 (2007).
[12] M. K. Qureshi, D. Thompson, and Y. N. Patt, "The V-Way cache: demand-based associativity via global replacement," in 32nd International Symposium on Computer Architecture (ISCA'05), 2005.
[13] R. Parihar, et al., "Protection, utilization and collaboration in shared cache through rationing," URL cs.rochester.edu/u/cding/documents/publications/tr-ration.pdf (2013).
[14] R. F. DeMara, Memory Hierarchy [Module PowerPoint], EEL 380C: Computer Organization, University of Central Florida, 2016.

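TABLE I. METRICS FOR VARIOUS MULTILEVEL CACHE DESIGNS
Columns: technique and year; processor parameters (# of cores, frequency); then, for Level 1 (L1, instruction or data), Level 2 (L2), and Level 3 (L3) or Last Level Cache (LLC): capacity, associativity, device technology, # of cache lines (CL), and protocol.

Technique | Year | # of cores | Freq. | L1 capacity | L1 assoc. | L1 tech. | L1 # of CL | L1 protocol | L2 capacity | L2 assoc. | L2 tech. | L2 # of CL | L2 protocol | L3/LLC capacity | L3/LLC assoc. | L3/LLC tech. | L3/LLC # of CL | L3/LLC protocol
Crawford [1] | 1995 | N/A | N/A | infinite | N/A | N/A | N/A | CREW/CRCW | infinite | N/A | N/A | N/A | CREW/CRCW | infinite | N/A | N/A | N/A | CREW/CRCW
Khoshavi [2] | 2016 | 8 | 3GHz | 32KB | 8-way | SRAM | 512 | MESI | 512KB | 8-way | SRAM | 8192 | MESI | 96MB | 16-way | eDRAM | ~00M | WB
Chen [3] | 2016 | 4 | 3.3GHz | 32KB | 8-way | SRAM | 512 | WB | 4MB | 8-way | STT-RAM | | WB | N/A | N/A | N/A | N/A | N/A
Khoshavi [4] | 2016 | 8 | 3GHz | 32KB | 8-way | SRAM | 512 | WB | 1024KB | 8-way | RRAP | | WB | 96MB | 16-way | eDRAM | ~00M | WB
Lin [5] | 2015 | N/A | N/A | 32KB | N/A | N/A | 512 | MOESI | 512KB | N/A | N/A | 8192 | MOESI | N/A | N/A | N/A | N/A | N/A
Jog [6] | 2012 | 4 | GHz | 32KB | 4-way | SRAM | 512 | WB | 1MB or 4MB | 16-way | SRAM or STT-RAM | | WB | N/A | N/A | N/A | N/A | N/A
Sun [7] | 2011 | 4 | GHz | 32KB | 4-way | SRAM | 512 | MESI | 256KB | 8-way | SRAM | 4096 | WB | 4MB | 16-way | SRAM | | WB
Sun [8] | 2012 | 8 | GHz | 16KB | 2-way | SRAM | 256 | WT | 8MB | 32-way | STT-RAM | | WB | N/A | N/A | N/A | N/A | N/A
Jokar [9] | 2016 | | GHz | 32KB | 8-way | N/A | 512 | WB | 256KB or MB | 8-way | STT-RAM | 4096 or | WB | 8MB | 8-way | ReRAM | 131072 | WB
Chandra [10] | 2005 | | GHz | 32KB | 4-way | N/A | 512 | WB | 512KB | 8-way | N/A | 8192 | WB | N/A | N/A | N/A | N/A | N/A
Huh [11] | 2007 | N/A | 5GHz | 32KB | 2-way | N/A | 512 | N/A | 256KB | 16-way | N/A | 4096 | N/A | N/A | N/A | N/A | N/A | N/A
Qureshi [12] | 2005 | N/A | N/A | 16KB | 2-way | N/A | 256 | N/A | 256KB | 8-way | N/A | 4096 | N/A | N/A | N/A | N/A | N/A | N/A
Parihar [13] | 2013 | N/A | N/A | 32KB | 2-way | N/A | 512 | N/A | 512KB | 8-way | N/A | 8192 | N/A | N/A | N/A | N/A | N/A | N/A

TABLE II. ENERGY & LATENCY FOR DIFFERENT DEVICE TECHNOLOGIES

Technology and Details | Read and Write Energy (nJ) | Read Latency (ns)
32 KB SRAM, L1 - 2011 [7] | |
256 KB SRAM, L2 - 2011 [7] | |
4 MB SRAM, L3 - 2011 [7] | |
1 MB SRAM, L2 - 2012 [6] | |
32 KB SRAM, L1 - 2016 [3] | |
4 MB STT-RAM, L2 - 2012 [6] | |
4 MB STT-RAM, L2 - 2016 [3] | |
MB STT-RAM, [9] | N/A |
LRSC | |
HRSC | |
MB eDRAM | |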

FIGURE II. READ AND WRITE ENERGY (bar graph; vertical axis: energy (nJ); one bar per technology and cache level listed in Table II)

FIGURE III. READ LATENCY (bar graph; vertical axis: latency (ns); one bar per technology and cache level listed in Table II)
