Revolutionizing Technological Devices such as STT-RAM and their Multiple Implementations in the Cache Level Hierarchy

Michael Mosquera
Department of Electrical and Computer Engineering
University of Central Florida
Orlando, FL 32816-2362

Abstract — Many devices are currently being tested to replace the antiquated SRAM (static random access memory), which has been used for decades. New technological devices such as STT-RAM, eDRAM, and even PRAM are being introduced and tested to replace SRAM. Not only may these devices eventually replace SRAM, but testing is also being conducted to determine which cache level these devices should be placed in for maximum efficiency and data retrieval performance: Level 1, Level 2, or Level 3. Although devices such as STT-RAM have mostly been placed at Level 3 of the cache hierarchy, new designs now incorporate these devices at Level 1 or Level 2.

Keywords: STT-RAM, SRAM, eDRAM, PRAM, volatile, non-volatile, cache, Level 1, Level 2, Level 3, LLC, associativity, protocol, write instruction, read instruction

I. INTRODUCTION
As with many technological advancements of recent years, developments in computer system processors are being realized. Within these processors exists a memory called cache: a section of memory located inside the CPU that stores data the CPU needs to retrieve on request. Compared to main memory located on the motherboard, cache is incredibly fast, and its placement admits a variety of multi-level configurations: Level 1, Level 2, and Level 3. Variability exists when optimizing a processor's cache: either an individual cache level's capacity can be increased to store additional data, or the level structure itself can be optimized, from two-level up to three-level cache hierarchies. Multi-level caches are vital for data retrieval performance.
When a cache level reaches its maximum capacity, rather than having to store data in main memory, data can instead be placed in the next cache level, such as Level 2. The miss penalty also decreases with multi-level caches [13]. As well as having multiple cache level organizations, several cache associativity approaches exist for placing and retrieving data blocks within the cache. These methods include full associativity, set associativity, and direct mapping, each having unique advantages and disadvantages in implementation [13]. Memory is another very important aspect of computer systems. Many technological devices exist, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Spin Transfer Torque RAM (STT-RAM), and embedded Dynamic Random Access Memory (eDRAM) [13]. While every device listed is used for storing data, they differ in that some are volatile while others are non-volatile. Components such as SRAM, DRAM, and eDRAM are volatile: they can lose data without a continuous supply of voltage. Devices such as STT-RAM are non-volatile: no data leakage or loss occurs without a voltage source. Depending on the cache associativity, the manner in which data is stored and accessed differs from method to method. Using a direct-mapped cache, each block from memory maps to exactly one line in the cache, whereas in set associativity a block from memory may be placed in any line of a specific set of cache lines. Set-associative mapping is constrained but flexible, while direct mapping is fixed: one block for one cache line. Direct mapping works by transferring the data at a specific memory address and using a tag to determine where the desired block is positioned in the cache [13]. Set-associative mapping works by placing memory blocks in a limited number of cache lines, where depending on the associativity a block can be placed in any one of, for example, 2, 4, or 8 cache lines.
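The address decomposition behind direct mapping can be sketched in a few lines. This is an illustrative example, not taken from the paper; the cache and line sizes below are assumed values chosen to match common L1 configurations.

```python
# Illustrative sketch: splitting an address into (tag, index, offset) for a
# direct-mapped cache. Parameters are assumptions: 32 KB cache, 64-byte lines.
CACHE_SIZE = 32 * 1024               # total cache capacity in bytes (assumed)
LINE_SIZE = 64                       # bytes per cache line (assumed)
NUM_LINES = CACHE_SIZE // LINE_SIZE  # 512 lines

def decompose(addr):
    """Return (tag, index, offset) for a direct-mapped cache lookup."""
    offset = addr % LINE_SIZE                  # byte position within the line
    index = (addr // LINE_SIZE) % NUM_LINES    # which cache line the block maps to
    tag = addr // (LINE_SIZE * NUM_LINES)      # identifies the block in memory
    return tag, index, offset
```

Two addresses whose index bits match compete for the same line, which is exactly why a set-associative cache, offering several candidate lines per index, can reduce conflict misses.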
Although many computer systems differ in cache level organization, the cache structure itself remains consistent. Just as memory cells contain data, within the cache exist cache lines, which store the data from memory. Cache lines are segmented to contain the specific information needed by the processor: each line contains both a tag and the data. The tag is essential for determining the location in main memory from which the block of data was retrieved [13]. Cache operates and functions much like main memory; the difference is that main memory is located significantly lower in the memory hierarchy, whereas cache is located within the CPU chip. The hardware placement of cache on the chip yields significant speed for data retrieval yet lacks space, whereas main memory offers capacity yet lacks speed [13].
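The tag-plus-data structure of a cache line, and the hit/miss lookup it enables, can be modeled with a minimal sketch. Everything below is a hypothetical toy model for illustration; the class names and sizes are assumptions, not from the cited work.

```python
# Minimal direct-mapped cache model: each line holds a valid bit, a tag,
# and a data block, as described in the text above. Toy example only.
class CacheLine:
    def __init__(self):
        self.valid = False
        self.tag = None
        self.data = None

class DirectMappedCache:
    def __init__(self, num_lines=8, line_size=64):
        self.lines = [CacheLine() for _ in range(num_lines)]
        self.num_lines = num_lines
        self.line_size = line_size

    def access(self, addr, memory):
        """Look up addr; on a miss, fetch the block from simulated main memory."""
        index = (addr // self.line_size) % self.num_lines
        tag = addr // (self.line_size * self.num_lines)
        line = self.lines[index]
        if line.valid and line.tag == tag:
            return "hit", line.data
        # Miss: install the block (here, memory is just a dict keyed by
        # block-aligned address) and return it.
        line.valid, line.tag = True, tag
        line.data = memory.get(addr - addr % self.line_size)
        return "miss", line.data
```

A first access to an address misses and fills the line; a repeated access to the same block then hits, which is the behavior the hit/miss discussion below quantifies.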
In the figure, cache placement is shown to be within the CPU chip, and contained within the cache are segmented sections, or cells, known as cache lines. The data placed into cache lines is retrieved from main memory, as displayed in the figure below. Whenever a requested segment of data from memory is located in the cache, a hit occurs. Whenever a block of memory cannot be located within any of the cache levels, whether Level 1, Level 2, or Level 3, a miss occurs. With a high hit ratio, data retrieval is significantly faster, since the data already exists within the cache, delivering considerably low retrieval times. With a high miss ratio, data that cannot be located within the cache must be retrieved from main memory and placed into cache lines for the CPU to access; this entire process of locating and relocating data delays retrieval, decreasing overall speed. In the upcoming sections of the paper, advancements of the past decade in technological devices such as Spin Transfer Torque RAM and embedded Dynamic Random Access Memory, among others, will be discussed. Along with these devices, certain cache level configurations will be discussed, as well as optimizations that have taken place within each cache level, particularly Level 2 and, most importantly, Level 3.

II. LITERATURE REVIEW
While adding cache memory to a computer system can be an excellent way to increase data access speed while decreasing retrieval time, certain issues have arisen, such as cache coherence [2]. The issue takes place when multiple levels of cache contain copies of data that is altered in one level; precautions must be taken to ensure the modification propagates throughout all levels of cache to maintain consistency [2].
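The effect of hit and miss ratios on retrieval speed described above can be quantified with the standard average memory access time (AMAT) formula. The latency numbers below are assumed round figures for illustration, not measurements from the cited papers.

```python
# AMAT = hit time + miss ratio * miss penalty.
# Illustrates why a high hit ratio dominates data retrieval performance.
def amat(hit_time_ns, miss_penalty_ns, hit_ratio):
    """Average memory access time in nanoseconds."""
    return hit_time_ns + (1.0 - hit_ratio) * miss_penalty_ns

# Assumed values: 1 ns cache hit, 100 ns penalty to reach main memory.
fast = amat(1.0, 100.0, 0.98)  # high hit ratio -> about 3 ns per access
slow = amat(1.0, 100.0, 0.80)  # high miss ratio -> about 21 ns per access
```

Dropping the hit ratio from 98% to 80% multiplies the average access time roughly sevenfold under these assumptions, which is the delay the text attributes to locating and relocating blocks from main memory.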
With the immense enhancements in multi-level cache structures, new hybrid technological devices are being designed that can sustain heavily memory-oriented tasks. One of these hybrids, known as ASTRO, focuses on retrieving instructions stored in the main memory of the system [3]. Not only does ASTRO retrieve instructions, but the hybrid also reduces energy dissipation when compared to other devices [3]. Not only are enhancements being made in cache level structures, but advancements are also being made in energy preservation through minimal dissipation [4]. With issues surfacing in the use of devices such as SRAM, substitutions are being made, such as the use of STT-RAM. Spin Transfer Torque RAM, the new device being implemented, unlike SRAM does not suffer leakage power loss but rather preserves energy, while also being a non-volatile device. The Last Level Cache, also known as the LLC, is placed lower in the memory hierarchy while still maintaining remarkable speeds, yet problems arise when the CPU must wait for data retrieval from its cache lines [5]. Solutions for these processor-related issues include replacing the SRAM with STT-RAM, Spin Transfer Torque Random Access Memory [5]. STT-RAM is specifically used for data storage in the CPU's last level cache, which can also lead to a decrease in miss ratio, reducing the time to return data from the cache directly to the processor [5]. As specified earlier, STT-RAM has shown remarkable improvements when placed in cache Level 3 compared to cache Level 1 and cache Level 2. Not only is the overall area reduced, but this improvement leads to fewer misses while also retaining information without any data loss, at the cost of increased delay in data retrieval [7]. Although STT-RAM has shown significant results when placed in cache Level 3, certain factors inhibit its placement at any other cache level [7].
One of these factors is the excessive number of read and write instructions serviced by the device, which can cause overheating and inaccurate placement of blocks within the pertaining cache lines. As shown in the tested configuration, when STT-RAM was placed in Level 1 cache, its write instructions proved considerably slower than with SRAM in cache Level 1, along with an increase in read and write energy consumption [7]. With new discoveries being made in cache level organization, new technological devices are coming to light for the purpose of replacing SRAM in the cache levels [8]. These include STT-RAM, as mentioned throughout the paper, but also eDRAM [8]. Many cache levels are now being tested with these various device types, optimizing them for maximum efficiency. Although some prove faster for data and instruction retrieval, these devices show downsides as well. Downsides for STT-RAM include increased energy use, while eDRAM requires periodic refresh to retain the correct data block retrieved from memory and prevent data corruption [8]. With many options presented to replace devices such as SRAM with STT-RAM or eDRAM, alternatives exist: alternatives that, rather than requiring complete replacement, combine two types of devices, known as hybrids [10]. These hybrids are composed of a new classification of device known as non-volatile RAM. Not only do the hybrids offer greater storage than single-type devices, but they also do not require as much energy as SRAM [10].
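The read/write trade-off between SRAM and STT-RAM described above can be sketched with a simple energy model. All per-access energies and leakage values below are invented placeholder numbers chosen only to illustrate the qualitative trade-off (STT-RAM: costlier writes, far lower leakage); they are not figures from the cited papers.

```python
# Toy energy model: total energy = read energy + write energy + leakage.
# Every numeric parameter below is an assumption for illustration only.
def total_energy_j(n_reads, n_writes, e_read_j, e_write_j, leak_w, seconds):
    """Total energy in joules for a workload over a given interval."""
    return n_reads * e_read_j + n_writes * e_write_j + leak_w * seconds

READS, WRITES = 1_000_000, 100_000   # a read-heavy workload (assumed)

# SRAM sketch: cheap accesses, but constant leakage power.
sram = total_energy_j(READS, WRITES, e_read_j=0.5e-9, e_write_j=0.5e-9,
                      leak_w=0.1, seconds=1.0)

# STT-RAM sketch: writes ~4x costlier, but negligible leakage (non-volatile).
stt = total_energy_j(READS, WRITES, e_read_j=0.4e-9, e_write_j=2.0e-9,
                     leak_w=0.001, seconds=1.0)
```

Under these assumed numbers the leakage term dominates SRAM's total, so STT-RAM wins on a read-heavy workload even though each of its writes costs more, which mirrors the paper's claim that write-intensive placements (such as L1) are where STT-RAM struggles.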
As discussed throughout the paper, STT-RAM is currently being tested to replace the antiquated SRAM, which, although it offers fast data access, is outpaced on certain instructions by other devices that also offer a tremendous increase in storage capacity and happen to be non-volatile [11]. One issue that arises with STT-RAM is the quantity of refresh instructions executed in the cache, which can cause excessive energy loss [11]. A solution that can reduce the quantity of refresh instructions is cache-coherence-enabled adaptive refresh, which is capable of minimizing refresh instructions, leading to a decrease in energy loss [11].

III. DATA ANALYSIS
[Figure: Read cache latency (nsec), ranging 0-5 ns, for device and cache level configurations including L1 SRAM 32KB, L2 SRAM 512KB, L2 STT-RAM 4MB, L3 eDRAM 8MB, and ReRAM.] The figure above depicts the read cache latency in nanoseconds for each technological device shown on the x-axis. Read latency is shown for multiple cache level configurations, with device name and capacity listed. [Figure: Read energy consumption (nJ), ranging 0-2.5 nJ, for various devices.] The graph above displays the read energy consumption, measured in nanojoules, for various devices such as STT-RAM, eDRAM, SRAM, and others.

IV. CONCLUSION
Throughout the paper, many technological devices were mentioned, many of which are currently undergoing examination to replace the long-lasting static RAM, or SRAM. Some of these new devices included STT-RAM, which has increased storage capacity but also increased energy use. Other devices such as eDRAM, or embedded DRAM, are also being examined, with mechanisms to retain data reliably without corruption. Not only are some of the new devices replacing SRAM, but they are also being tested at cache levels other than Level 3, such as Level 2 or Level 1. As more testing is done, solutions to decrease energy loss and minimize excessive write instructions will continue to surface, as current testing is proving.
REFERENCES
[1] N. Khoshavi, X. Chen, J. Wang, and R. F. DeMara, "Bit-Upset Vulnerability Factor for eDRAM Last Level Cache Immunity Analysis," in Proceedings of the 17th International Symposium on Quality Electronic Design (ISQED 2016), Santa Clara, CA, USA, March 15-16, 2016.
[2] S. E. Crawford and R. F. DeMara, "Cache coherence in a multiport memory environment," in Proceedings of the Second International Conference on Massively Parallel Computing Systems (MPCS-95), pp. 632-642, Ischia, Italy, May 2-6, 1995.
[3] M. Lin et al., "ASTRO: Synthesizing application-specific reconfigurable hardware traces to exploit memory-level parallelism," Microprocessors and Microsystems, vol. 39, no. 7, pp. 553-564, 2015.
[4] X. Chen, N. Khoshavi, J. Zhou, D. Huang, R. F. DeMara, J. Wang, W. Wen, and Y. Chen, "AOS: Adaptive Overwrite Scheme for Energy-Efficient MLC STT-RAM Cache," in Proceedings of the 53rd Design Automation Conference (DAC), Austin, TX, USA, 2016.
[5] N. Khoshavi, X. Chen, J. Wang, and R. F. DeMara, "Read-Tuned STT-RAM and eDRAM Cache Hierarchies for Throughput and Energy Enhancement," arXiv preprint, 2016.
[6] A. Jog, A. K. Mishra, C. Xu, Y. Xie, V. Narayanan, R. Iyer, and C. R. Das, "Cache Revive: Architecting Volatile STT-RAM Caches for Enhanced Performance in CMPs," in Proceedings of the 49th Annual Design Automation Conference (DAC), 2012, pp. 243-252.
[7] Z. Sun, X. Bi, H. H. Li, W.-F. Wong, Z.-L. Ong, X. Zhu, and W. Wu, "Multi Retention Level STT-RAM Cache Designs with a Dynamic Refresh Scheme," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011, pp. 329-338.
[8] M.-T. Chang, P. Rosenfeld, S.-L. Lu, and B. Jacob, "Technology Comparison for Large Last-Level Caches (L3Cs): Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized eDRAM," in Proceedings of the 19th International Symposium on High Performance Computer Architecture (HPCA), 2013, pp. 143-154.
[9] M. R. Jokar, M. Arjomand, and H. Sarbazi-Azad, "Sequoia: A High-Endurance NVM-Based Cache Architecture," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2016.
[10] Y. Joo and S. Park, "A hybrid PRAM and STT-RAM cache architecture for extending the lifetime of PRAM caches," IEEE Computer Architecture Letters, vol. 12, no. 2, pp. 55-58, 2013.
[11] J. Li et al., "Low-energy volatile STT-RAM cache design using cache-coherence-enabled adaptive refresh," ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 19, no. 1, article 5, 2013.
[12] Y. Zhang et al., "Read performance: The newest barrier in scaled STT-RAM," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 6, pp. 1170-1174, 2015.
[13] R. F. DeMara, Module 11: Memory Hierarchy, course notes, 2016.
TABLE I. Processor parameters for the techniques above. Each level lists: capacity, associativity, technology, # of cache lines (CL), protocol. The # of CL is computed from the capacity assuming a 64-byte cache line. Protocol = {Write Back (WB), Write Through (WT), MESI, MOESI, Not Available (N/A)}.

Khoshavi [1]: 8 cores, 3 GHz. L1: 32KB, 8-way, SRAM, 512 CL, MESI. L2: 512KB, 8-way, SRAM, 8192 CL, MESI. L3 (LLC): 96MB, 16-way, eDRAM, ~1.5M CL, WB.
Sun [7]: 4 cores, 2 GHz. L1: 32KB, 4-way, SRAM, 512 CL, N/A. L2: 256KB, 8-way, SRAM, 4096 CL, N/A. L3: 4MB, 16-way, STT-RAM, 65536 CL, N/A.
Jokar [9]: 4 cores, 3 GHz. L1: 32KB, 8-way, D, 512 CL, MOESI. L2: 2MB, 8-way, STT-RAM, 32768 CL, MOESI. L3: 8MB, 8-way, ReRAM, 131072 CL, MOESI.
Zhang [12]: 16 cores, 3.5 GHz. L1: 32KB, 4-way, SRAM, 512 CL, MESI. L2: 256KB, 8-way, SRAM, 4096 CL, N/A. L3: 16MB, 16-way, SRAM, 262144 CL, N/A.
Chang [8]: 8 cores, 2 GHz. L1: 32KB, 8-way, N/A, 512 CL, MESI. L2: 256KB, 8-way, N/A, 4096 CL, MESI. L3: 32MB, 16-way, N/A, 524288 CL, WB.
Chen [4]: 4 cores, 3.3 GHz. L1: 32KB, 8-way, SRAM, 512 CL, WB. L2: 4MB, 8-way, STT-RAM, 65536 CL, WB. L3: N/A.
Khoshavi [5]: N/A cores, 3 GHz. L1: 32KB, 8-way, SRAM, 512 CL, WB. L2: N/A, 8-way, STT-RAM, N/A, WB. L3: 96MB, 16-way, eDRAM, ~1.5M CL, WB.
Jog [6]: N/A cores, 2 GHz. L1: 32KB, 4-way, SRAM, 512 CL, WB. L2: 1MB, 16-way, SRAM, 16384 CL, N/A. L3: N/A.
Li [11]: 16 cores, 2 GHz. L1: 32KB, 2-way, STT-RAM, 512 CL, WB. L2: N/A. L3: 8MB, 16-way, STT-RAM, 131072 CL, WB.
Joo [10]: 1 core, 2 GHz. L1: 32KB, N/A, SRAM, 512 CL, WB. L2: 8MB, 16-way, Hybrid (PRAM + STT-RAM), 131072 CL, WB. L3: N/A.
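The "# of CL" columns in the table follow a single calculation, which can be checked directly. This is a straightforward restatement of the table's own rule (capacity divided by a 64-byte line size), not new data.

```python
# Number of cache lines = capacity in bytes / cache line size (64 bytes,
# per the table's stated assumption).
def num_cache_lines(capacity_bytes, line_size=64):
    return capacity_bytes // line_size

KB, MB = 1024, 1024 * 1024
assert num_cache_lines(32 * KB) == 512        # all 32KB L1 rows
assert num_cache_lines(512 * KB) == 8192      # Khoshavi [1] L2
assert num_cache_lines(4 * MB) == 65536       # Sun [7] L3, Chen [4] L2
assert num_cache_lines(8 * MB) == 131072      # Li [11] L3, Joo [10] L2
```

Note that the same rule gives 96MB / 64B = 1,572,864 (about 1.5M) lines for the 96MB eDRAM LLC rows.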