Way-Predicting Cache and Pseudo-Associative Cache for High Performance and Low Energy Consumption


Xiaomin Ding, Minglei Wang (Team 28)
Electrical and Computer Engineering, University of Florida, Gainesville, USA {dingxiaomin,

Abstract - The cache is a crucial part of a computer architecture and strongly influences the performance of the whole system. In this paper, we adapted two previously proposed cache optimization methods, the way-predicting cache and the pseudo-associative cache; chose AMAT (miss rate, hit time) and energy consumption as the main analytical factors; constructed simulators by modifying the source code of SimpleScalar; and carried out simulation and evaluation experiments on the SimpleScalar platform using SPEC2000 benchmarks. The test results largely match the principles of theoretical computer architecture. Finally, we present the outcomes and conclusions.

Keywords- cache performance; way prediction; pseudo-associativity; energy consumption

I. INTRODUCTION

From the earliest days of computing, programmers have wanted unlimited amounts of fast memory, and they took advantage of the principle of locality by organizing the memory of a computer into a hierarchy [1]. A memory hierarchy consists of multiple levels of memory with different speeds and sizes. The cache [2] is the level of the memory hierarchy between the processor and main memory. The disparity between processor and DRAM memory speed has kept increasing in recent years: microprocessor performance has improved at roughly 60% per year, while DRAM performance has improved at less than 10% per year [3]. Because the cache bridges the gap between the CPU and DRAM memory, cache performance affects the performance of the computer much more than before. Many cache optimization strategies already exist [4], such as using way prediction to reduce hit time, using pipelined cache access to increase cache bandwidth, and using critical word first and early restart to reduce miss penalty [5].
In this paper, we choose two optimization methods, way prediction and the pseudo-associative cache, to improve cache performance and evaluate the outcomes of the simulations. In Section II, we briefly introduce the background and related work, including the principles of caches and the tools used in this paper. In Section III, the specific approaches (the principles and realization of the two optimizations) are presented. In Section IV, we list the results of our project and evaluate and compare them. In Section V, we draw conclusions from the implemented simulations.

II. BACKGROUND AND RELATED RESEARCH

In this section, we first briefly introduce the principle of the cache, evaluation methods, and some related research. Then we describe the simulation tools and the benchmark used in our project.

A. Principle of the Cache

In a computer system, a cache stores data so that probable future requests for that data can be served faster [6]. The data stored in a cache might be values that were requested or computed earlier by the processor, or duplicates of original values stored in lower-level memory (main memory or disks). A cache hit occurs when the requested data is contained in the cache [7]; such requests can be served by simply reading the cache, which is relatively fast. Otherwise, on a cache miss [8], the data has to be fetched from lower-level memory, which is considerably slower.

There are three categories of cache organization that define how a block is placed in a cache: direct mapped, fully associative, and set associative. When the processor sends a request for data at a specified address to the cache, the address is first divided into two parts: the block address and the block offset. The block address can be further divided into the tag field and the index field. The block offset field selects the desired data from the block, the index field selects the set, and the tag field is compared against the stored tags for a hit. Figure 1 shows the three portions of an address in a set-associative or direct-mapped cache [2].

Figure 1. Three portions of a block address (tag, index, block offset) [2]

B. Evaluation Methods

There are several methods for evaluating cache performance. In this paper, we focus on average memory access time (AMAT), as shown in equation (1):

AMAT = HitTime + MissRate × MissPenalty    (1)

Hence, three factors influence AMAT; reducing any of them improves cache performance.

C. SimpleScalar and SPEC2000

SimpleScalar [9] is an open-source computer architecture simulator developed by Todd Austin at the University of Wisconsin-Madison. It is written in the C programming language and can be used to show that machine A is better than machine B without building either machine [10]. SimpleScalar [11] is a set of tools that model a virtual computer system with a CPU, caches, and a memory hierarchy. Using the SimpleScalar tools, users can build modeling applications that simulate real programs running on a range of modern processors and systems [12]. The tool set includes sample simulators ranging from a fast functional simulator to a detailed, dynamically scheduled processor model that supports non-blocking caches, speculative execution, and state-of-the-art branch prediction.
In our paper, in order to model caches, we use the sim-cache simulator [9] as our basic tool and realize our goals by modifying the source code of sim-cache and related files.

We use the SPEC2000 benchmark suite to test cache performance [13]. SPEC CPU2000 [14] is the next-generation industry-standardized CPU-intensive benchmark suite. SPEC [15] designed CPU2000 to provide a comparative measure of compute-intensive performance across the widest practical range of hardware [16]. The suite consists of source-code benchmarks developed from real user applications. These benchmarks measure the performance of the processor, memory, and compiler of the tested system. In this paper, we choose two benchmarks to test the cache with and without optimization.

D. Related Research

There are dozens of studies on cache performance optimization in these two areas. For the pseudo-associative cache, Yongjoon Lee and Byung-Kwon Chung proposed the pseudo 3-way set-associative cache [17], which overcomes the hit-rate limitation of the 2-way set-associative cache; Bobbala, Salvatierra, and Byeong Kil Lee proposed a composite cache mechanism [18] that emphasizes primary-way utilization and pseudo-associativity for the L2 cache to maximize cache performance. For the way-predicting cache, Inoue, Ishihara, and Murakami proposed way prediction [19] for achieving high performance and low energy consumption in set-associative caches; Hsin-Chuan Chen and Jen-Shiun Chiang proposed a cache scheme that uses valid-bit pre-decision [20] for way prediction to improve cache performance; and Cuiping Xu, Ge Zhang, and Shouqing Hao proposed a way-prediction scheme [21] for achieving low energy consumption and high performance in set-associative instruction caches.

III. APPROACH

In this section, we focus on the two optimization strategies: the pseudo-associative cache and the way-predicting cache. We first introduce the principles of these optimization methods and then describe their realization in SimpleScalar in detail.

A. Pseudo-Associative Cache

1) Principle of the pseudo-associative cache

The pseudo-associative cache, also called the column-associative cache [5], is a cache whose space is logically divided into two zones. On each access, the pseudo-associative cache first acts like a direct-mapped cache in the first zone, meaning each block has only one place where it may appear. If there is a cache hit, the cache behaves exactly like a direct-mapped cache. If the cache misses, the CPU visits a specified location in the other zone. If the cache hits this time, a pseudo hit occurs, and the block is swapped with the block of the first entry; otherwise, the processor accesses the next level of memory to find the desired data, and a real cache miss occurs. In other words, the pseudo-associative cache combines the lower hit time of a direct-mapped cache with the lower miss rate of a 2-way associative cache.

2) Realization of the pseudo-associative cache

In general, the next location to check is found by inverting the highest index bit [5] of the block address. Based on this principle, we use SimpleScalar as our simulation tool and modify its source code to realize a 2-way pseudo-associative cache. Figure 2 shows how the pseudo-associative cache works.

Figure 2. Pseudo-associative cache (block address fields and cache blocks)

In step 1, a block address is given by the processor and the index field is used to select the set; in step 2, the tag field is compared for a hit. If the cache hits, since each set contains only one block, it acts as a direct-mapped cache, and we call this a fast hit.
If the cache misses, the high-order index bit is flipped to form a new index field that selects a block in the cache in step 3, and the tag field is compared again. If the cache hits, it is a pseudo hit, and the blocks are swapped in step 5. Otherwise, the cache misses, and the block must be found in the lower-level memory.

In SimpleScalar, we added a macro definition in the cache.c file to invert the original index of the desired block address. To realize step 5, we wrote a swap function, swap_blk_data(), and we added an energy consumption parameter in the cache.h and cache.c files. All newly added functions must first be declared in the cache.h file. We then mainly modified the cache_access() function to carry out the pseudo-associative behavior: we first configure the pseudo-associative cache as a 2-way set-associative cache and then use a different way of accessing it, which is what the cache_access() function does. When the cache hits on the first access, it activates only one block; on a pseudo hit, it needs to activate two blocks; and on a miss, it also activates two blocks. We then compare the performance of the direct-mapped cache, the 2-way associative cache, and the pseudo-associative cache.

B. Way-Predicting Cache

1) Principle of the way-predicting cache

In the way-predicting cache, we designed a mechanism to predict the way. As a result, the multiplexor is set early to select the desired block, and only a single tag comparison is needed when accessing the cache [4]. Hit time is reduced consequently. If the prediction is correct, the cache access latency is the fast hit time; if not, the cache tries the other block, updates the way predictor, and incurs a latency of one extra clock cycle.

2) Realization of the way-predicting cache

We designed two kinds of way-predicting caches: static way prediction and dynamic way prediction. The first approach uses way prediction throughout the whole cache access process; the second decides whether to use way prediction according to the prediction hit rate. Each approach has pros and cons. The reason for using the dynamic way-predicting cache is that for some programs the locality is so poor that way prediction performs badly, hurting overall cache performance. We modified the cache hit logic and designed a way-prediction model: the predictor selects the last-accessed block as the next target and is updated after every cache access. For the static way-predicting cache, we simply use this predictor during cache access; for the dynamic version, we added several parameters to calculate the prediction hit rate.
We introduced a basic access unit, predict_time_slice, which counts cache accesses; we set its value (the number of cache accesses per slice) to 600 and gathered the number of prediction hits within each slice. If the prediction hit rate is larger than a predetermined threshold (set to 0.9 in our project), the following accesses also use way prediction; if not, the locality in that time slice is poor, so the cache disables way prediction, and all blocks are activated and all tags compared during the following cache accesses. However, the cache still records the number of way-prediction hits, and when the hit rate rises above the threshold again, the cache restarts the way-prediction mechanism. When prediction is active, the cache needs to activate only one block, so hit time and energy consumption are reduced. To be precise and quantitative, we added an energy consumption parameter to evaluate the way-predicting cache.

C. Combining way prediction and the pseudo-associative cache in SimpleScalar

The two methods are combined as the last step. For simplicity, we use different input parameters on the command line to decide which optimization strategy to apply. For example, to apply the dynamic way-prediction method to the level 1 data cache, the command line is as follows:

-cache:dl1 dl1:256:32:1:d

where d selects the dynamic way-predicting cache, w the static way-predicting cache, and p the pseudo-associative cache.

IV. RESULTS AND ANALYSIS

A. Evaluation Setup

To evaluate the proposed approaches and compare the cache optimization strategies, we test the constructed simulator with benchmarks. We used GZIP and BZIP2, which are part of the SPEC2000 benchmark suite, and ran them in a virtual machine. We simulated the first 200 million instructions of each benchmark and used the level 1 data cache as our target cache.

B. Evaluation results and analysis

In this section, the overall benefits of the proposed approaches compared with the unoptimized cache are summarized first, followed by more detailed results and an analysis of how cache performance is improved by the pseudo-associative and way-prediction optimization strategies.

Based on its principle, the pseudo-associative cache combines the shorter hit time of the direct-mapped cache with the lower miss rate of the 2-way associative cache. Because of the short hit time, AMAT improves, and the number of blocks the pseudo-associative cache needs to activate is smaller than for a 2-way set-associative cache. Its miss rate is also lower than that of the direct-mapped cache. The way-predicting cache reduces hit time, so AMAT improves as well; moreover, each time it hits, it needs to activate only one block, and thanks to locality, energy consumption is lower.

1) Pseudo-associative cache

First, Figure 3 shows the miss rate of the direct-mapped cache and the pseudo-associative cache, plotting cache size on the x-axis and miss rate on the y-axis.

Figure 3. Miss rate for direct-mapped cache and pseudo-associative cache

From Figure 3, it is apparent that the pseudo-associative cache has a lower miss rate than the direct-mapped cache. Theoretically, this is because the pseudo-associative cache effectively behaves like a 2-way associative cache: each address can map to two locations, so fewer conflicts occur. Second, Figure 4 shows the energy consumption of the different cache associativity strategies, with cache size on the x-axis and energy consumption on the y-axis.

Figure 4. Energy consumption for different cache associativity strategies

In our experiments, we assume that activating one block costs 1 unit of energy. From Figure 4, for instance, after running 200 million instructions of the GZIP benchmark with a 16 KB level 1 data cache, the pseudo-associative cache shows the lowest energy consumption of the three organizations. At first sight, this simulation result seems to contradict the theory. Whenever the processor accesses the pseudo-associative cache, only one block is activated if the cache hits in the first zone (a cache fast hit); if the cache misses there, the corresponding block in the second zone is also activated (a hit there is a cache slow hit). Since there is a probability of activating two blocks, the total energy consumption could be expected to be higher than for the direct-mapped cache. However, the higher the fast-hit rate, the lower the energy consumed, and because the pseudo-associative cache swaps blocks after a miss in the first zone, the fast-hit rate is relatively high. As a result, it can still consume less energy than the direct-mapped cache. Figure 5 shows the ratio of cache fast hits to cache slow hits; the fast-hit ratio is far higher than the slow-hit ratio, which means the cache saves energy remarkably.

Figure 5. Hit ratio between cache fast hit and cache slow hit

In this example, the pseudo-associative cache saved 52% of the energy compared with the 2-way set-associative cache.

2) Way-predicting cache

First, based on the principle of the way-predicting cache, the next block to be selected is set in advance, so this mechanism mainly aims at reducing the hit time and consequently decreases AMAT. Moreover, if the prediction hit rate is high, energy consumption drops dramatically. Figure 6 shows the energy consumption of caches with and without way prediction.

Figure 6. Energy consumption of caches with and without way prediction

We can see that the static way-predicting cache has the lowest energy consumption, and the cache without way prediction has the highest. The static way-predicting cache needs to activate only one block on each access, so it must have the lowest energy. The dynamic way-predicting cache disables way prediction when the prediction hit ratio falls below the value we set, which indicates that locality is poor in that period. Although disabling way prediction increases energy consumption during those phases, it avoids the overhead of a long run of useless predictions.

Figure 7 shows the predictor miss rate for the caches using static and dynamic way prediction. In the dynamic predictor, the prediction mechanism is suspended when the prediction hit rate falls below the threshold value (0.9 here); as a result, its predictor miss rate is lower. Suspending prediction avoids the large overhead and wasted time of way prediction when locality is poor in some programs, where bandwidth and time would otherwise be wasted on a high proportion of incorrect predictions.

Figure 7. Predictor miss rate for caches using static and dynamic way prediction

Table 1 shows the energy saved, relative to a cache without way prediction, for the different way-prediction models: no energy is saved for a direct-mapped cache (associativity 1), while the savings reach 29% and 40.5% at the higher associativities. From Table 1 we can conclude that the higher the associativity, the more energy way prediction saves, because a highly associative cache activates more blocks per access.

Table 1. Energy saved compared with a cache without way prediction, for different way-prediction models

3) Other benchmarks

We also tested our simulator with the BZIP2 benchmark, running the first 300 million instructions. The results are shown in Figures 8-11 and conform to the theoretical expectations except for small discrepancies.
Figure 8 shows the miss rate of the direct-mapped and pseudo-associative caches for different cache sizes. Figure 9 shows the energy consumption of the pseudo-associative, 2-way associative, and direct-mapped caches. Figure 10 shows the energy consumption of caches with and without way prediction. Figure 11 shows the prediction hit ratio of the static and dynamic way-predicting caches.

Figure 8. Miss rate of direct-mapped and pseudo-associative caches

Figure 9. Energy consumption for different cache associativity strategies

Figure 10. Energy consumption of caches with and without way prediction

Figure 11. Predictor hit ratio of the static and dynamic way-predicting caches

V. CONCLUSIONS

In this paper, we focused on two popular cache optimization methods, the way-predicting cache and the pseudo-associative cache, to improve cache performance in terms of AMAT (hit time and miss rate) and energy consumption. The pseudo-associative cache combines the shorter hit time of the direct-mapped cache with the relatively lower miss rate of the 2-way set-associative cache. The static way-predicting cache reduces hit time and saves energy, but it may perform badly when it encounters poor locality. We also evaluated the dynamic way-predicting cache, which overcomes this drawback of the static version because it can start or stop the way-prediction mechanism according to the observed locality. We used SPEC2000 benchmarks to test the simulators and to check the agreement between the simulation results and the theoretical conclusions. The simulation results largely comply with the theory.

REFERENCES

[1] Rami J. Ammari, "A Study for Reducing Conflict Misses in Data Cache".
[2] David A. Patterson, John L. Hennessy, "Computer Organization and Design: The Hardware/Software Interface", 2nd ed., San Francisco, CA: Morgan Kaufmann Publishers.
[3] John L. Hennessy, David A. Patterson, "Computer Architecture: A Quantitative Approach", 4th ed., San Francisco: Morgan Kaufmann Publishers.
[4] C. Kozyrakis, "Advanced Caching Techniques".
[5] Markus Kowarschik, Christian Weis, "An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms".
[6]
[7] Santanu Kumar Dash, Thambipillai Srikanthan, "Rapid Estimation of Instruction Cache Hit Rates Using Loop Profiling", IEEE, 2008.
[8] Vlastimil Babka, Lukas Marek, Petr Tuma, "When Misses Differ: Investigating Impact of Cache Misses on Observed Performance", 15th International Conference on Parallel and Distributed Systems.
[9]
[10]
[11] Naraig Manjikian, "Enhancements and Applications of the SimpleScalar Simulator for Undergraduate and Graduate Computer Architecture Education", WCAE '00: Proceedings of the 2000 Workshop on Computer Architecture Education.
[12]
[13]
[14] Jason F. Cantin, "Cache Performance for SPEC CPU2000".
[15] MA Hai-feng, YAO Nian-min, FAN Hong-bo, "Cache Performance Simulation and Analysis under SimpleScalar Platform".
[16] Henning, J. L., "SPEC CPU2000: Measuring CPU Performance in the New Millennium", IEEE Computer, vol. 33, no. 7, July.
[17] Yongjoon Lee, Byung-Kwon Chung, "Pseudo 3-Way Set-Associative Cache: A Way of Reducing Miss Ratio with Fast Access Time", 1999 IEEE Canadian Conference on Electrical and Computer Engineering.

[18] Bobbala, L. D., Salvatierra, J., Byeong Kil Lee, "Composite Pseudo-Associative Cache for Mobile Processors", 2010 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS).
[19] Inoue, K., Ishihara, T., Murakami, K., "Way-Predicting Set-Associative Cache for High Performance and Low Energy Consumption", International Symposium on Low Power Electronics and Design.
[20] Hsin-Chuan Chen, Jen-Shiun Chiang, "Low-Power Way-Predicting Cache Using Valid-Bit Pre-Decision for Parallel Architectures", 19th International Conference on Advanced Information Networking and Applications.
[21] Cuiping Xu, Ge Zhang, Shouqing Hao, "Fast Way-Prediction Instruction Cache for Energy Efficiency and High Performance", IEEE International Conference on Networking, Architecture, and Storage.


More information

Using a Victim Buffer in an Application-Specific Memory Hierarchy

Using a Victim Buffer in an Application-Specific Memory Hierarchy Using a Victim Buffer in an Application-Specific Memory Hierarchy Chuanjun Zhang Depment of lectrical ngineering University of California, Riverside czhang@ee.ucr.edu Frank Vahid Depment of Computer Science

More information

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design ECE331: Hardware Organization and Design Lecture 22: Direct Mapped Cache Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Intel 8-core i7-5960x 3 GHz, 8-core, 20 MB of cache, 140

More information

CISC 662 Graduate Computer Architecture Lecture 16 - Cache and virtual memory review

CISC 662 Graduate Computer Architecture Lecture 16 - Cache and virtual memory review CISC 662 Graduate Computer Architecture Lecture 6 - Cache and virtual memory review Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Reducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses

Reducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the. Reducing Miss Penalty: Read Priority over Write on Miss Write buffers may offer RAW

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Chapter-5 Memory Hierarchy Design

Chapter-5 Memory Hierarchy Design Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

ECE232: Hardware Organization and Design

ECE232: Hardware Organization and Design ECE232: Hardware Organization and Design Lecture 22: Introduction to Caches Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Overview Caches hold a subset of data from the main

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Predictive Line Buffer: A fast, Energy Efficient Cache Architecture

Predictive Line Buffer: A fast, Energy Efficient Cache Architecture Predictive Line Buffer: A fast, Energy Efficient Cache Architecture Kashif Ali MoKhtar Aboelaze SupraKash Datta Department of Computer Science and Engineering York University Toronto ON CANADA Abstract

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Cache Architectures Design of Digital Circuits 217 Srdjan Capkun Onur Mutlu http://www.syssec.ethz.ch/education/digitaltechnik_17 Adapted from Digital Design and Computer Architecture, David Money Harris

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141 EECS151/251A Spring 2018 Digital Design and Integrated Circuits Instructors: John Wawrzynek and Nick Weaver Lecture 19: Caches Cache Introduction 40% of this ARM CPU is devoted to SRAM cache. But the role

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 6A: Cache Design Avinash Kodi, kodi@ohioedu Agenda 2 Review: Memory Hierarchy Review: Cache Organization Direct-mapped Set- Associative Fully-Associative 1 Major

More information

Memory Hierarchy. Advanced Optimizations. Slides contents from:

Memory Hierarchy. Advanced Optimizations. Slides contents from: Memory Hierarchy Advanced Optimizations Slides contents from: Hennessy & Patterson, 5ed. Appendix B and Chapter 2. David Wentzlaff, ELE 475 Computer Architecture. MJT, High Performance Computing, NPTEL.

More information

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Thoai Nam Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Reference: Computer Architecture: A Quantitative Approach, John L Hennessy & David a Patterson,

More information

For Problems 1 through 8, You can learn about the "go" SPEC95 benchmark by looking at the web page

For Problems 1 through 8, You can learn about the go SPEC95 benchmark by looking at the web page Problem 1: Cache simulation and associativity. For Problems 1 through 8, You can learn about the "go" SPEC95 benchmark by looking at the web page http://www.spec.org/osg/cpu95/news/099go.html. This problem

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

V. Primary & Secondary Memory!

V. Primary & Secondary Memory! V. Primary & Secondary Memory! Computer Architecture and Operating Systems & Operating Systems: 725G84 Ahmed Rezine 1 Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM)

More information

Skewed-Associative Caches: CS752 Final Project

Skewed-Associative Caches: CS752 Final Project Skewed-Associative Caches: CS752 Final Project Professor Sohi Corey Halpin Scot Kronenfeld Johannes Zeppenfeld 13 December 2002 Abstract As the gap between microprocessor performance and memory performance

More information

Memory Hierarchy Basics. Ten Advanced Optimizations. Small and Simple

Memory Hierarchy Basics. Ten Advanced Optimizations. Small and Simple Memory Hierarchy Basics Six basic cache optimizations: Larger block size Reduces compulsory misses Increases capacity and conflict misses, increases miss penalty Larger total cache capacity to reduce miss

More information

A Framework for the Performance Evaluation of Operating System Emulators. Joshua H. Shaffer. A Proposal Submitted to the Honors Council

A Framework for the Performance Evaluation of Operating System Emulators. Joshua H. Shaffer. A Proposal Submitted to the Honors Council A Framework for the Performance Evaluation of Operating System Emulators by Joshua H. Shaffer A Proposal Submitted to the Honors Council For Honors in Computer Science 15 October 2003 Approved By: Luiz

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016 Caches and Memory Hierarchy: Review UCSB CS240A, Winter 2016 1 Motivation Most applications in a single processor runs at only 10-20% of the processor peak Most of the single processor performance loss

More information

Advanced Computer Architecture (CS620)

Advanced Computer Architecture (CS620) Advanced Computer Architecture (CS620) Background: Good understanding of computer organization (eg.cs220), basic computer architecture (eg.cs221) and knowledge of probability, statistics and modeling (eg.cs433).

More information

for Energy Savings in Set-associative Instruction Caches Alexander V. Veidenbaum

for Energy Savings in Set-associative Instruction Caches Alexander V. Veidenbaum Simultaneous Way-footprint Prediction and Branch Prediction for Energy Savings in Set-associative Instruction Caches Weiyu Tang Rajesh Gupta Alexandru Nicolau Alexander V. Veidenbaum Department of Information

More information

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Pipeline Thoai Nam Outline Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Reference: Computer Architecture: A Quantitative Approach, John L Hennessy

More information

EITF20: Computer Architecture Part 5.1.1: Virtual Memory

EITF20: Computer Architecture Part 5.1.1: Virtual Memory EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Sarah L. Harris and David Money Harris. Digital Design and Computer Architecture: ARM Edition Chapter 8 <1>

Sarah L. Harris and David Money Harris. Digital Design and Computer Architecture: ARM Edition Chapter 8 <1> Chapter 8 Digital Design and Computer Architecture: ARM Edition Sarah L. Harris and David Money Harris Digital Design and Computer Architecture: ARM Edition 215 Chapter 8 Chapter 8 :: Topics Introduction

More information

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design ECE331: Hardware Organization and Design Lecture 27: Midterm2 review Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Midterm 2 Review Midterm will cover Section 1.6: Processor

More information

Question?! Processor comparison!

Question?! Processor comparison! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 5.1-5.2!! (Over the next 2 lectures)! Lecture 18" Introduction to Memory Hierarchies! 3! Processor components! Multicore processors and programming! Question?!

More information

Course Administration

Course Administration Spring 207 EE 363: Computer Organization Chapter 5: Large and Fast: Exploiting Memory Hierarchy - Avinash Kodi Department of Electrical Engineering & Computer Science Ohio University, Athens, Ohio 4570

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 02 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on top (a) and in

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 26 Cache Optimization Techniques (Contd.) (Refer

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors

Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors Dan Nicolaescu Alex Veidenbaum Alex Nicolau Dept. of Information and Computer Science University of California at Irvine

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Memory hier ar hier ch ar y ch rev re i v e i w e ECE 154B Dmitri Struko Struk v o

Memory hier ar hier ch ar y ch rev re i v e i w e ECE 154B Dmitri Struko Struk v o Memory hierarchy review ECE 154B Dmitri Strukov Outline Cache motivation Cache basics Opteron example Cache performance Six basic optimizations Virtual memory Processor DRAM gap (latency) Four issue superscalar

More information

14:332:331. Week 13 Basics of Cache

14:332:331. Week 13 Basics of Cache 14:332:331 Computer Architecture and Assembly Language Spring 2006 Week 13 Basics of Cache [Adapted from Dave Patterson s UCB CS152 slides and Mary Jane Irwin s PSU CSE331 slides] 331 Week131 Spring 2006

More information

Memory Hierarchy Basics

Memory Hierarchy Basics Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Memory Hierarchy Basics Six basic cache optimizations: Larger block size Reduces compulsory misses Increases

More information

Locality. Cache. Direct Mapped Cache. Direct Mapped Cache

Locality. Cache. Direct Mapped Cache. Direct Mapped Cache Locality A principle that makes having a memory hierarchy a good idea If an item is referenced, temporal locality: it will tend to be referenced again soon spatial locality: nearby items will tend to be

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Hitting the Memory Wall: Implications of the Obvious. Wm. A. Wulf and Sally A. McKee {wulf

Hitting the Memory Wall: Implications of the Obvious. Wm. A. Wulf and Sally A. McKee {wulf Hitting the Memory Wall: Implications of the Obvious Wm. A. Wulf and Sally A. McKee {wulf mckee}@virginia.edu Computer Science Report No. CS-9- December, 99 Appeared in Computer Architecture News, 3():0-,

More information

The University of Adelaide, School of Computer Science 13 September 2018

The University of Adelaide, School of Computer Science 13 September 2018 Computer Architecture A Quantitative Approach, Sixth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Modern Computer Architecture

Modern Computer Architecture Modern Computer Architecture Lecture3 Review of Memory Hierarchy Hongbin Sun 国家集成电路人才培养基地 Xi an Jiaotong University Performance 1000 Recap: Who Cares About the Memory Hierarchy? Processor-DRAM Memory Gap

More information

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB Memory Technology Caches 1 Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per GB Ideal memory Average access time similar

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 2

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 2 CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 2 Instructors: John Wawrzynek & Vladimir Stojanovic http://insteecsberkeleyedu/~cs61c/ Typical Memory Hierarchy Datapath On-Chip

More information

COSC 6385 Computer Architecture - Memory Hierarchy Design (III)

COSC 6385 Computer Architecture - Memory Hierarchy Design (III) COSC 6385 Computer Architecture - Memory Hierarchy Design (III) Fall 2006 Reducing cache miss penalty Five techniques Multilevel caches Critical word first and early restart Giving priority to read misses

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Caching Basics. Memory Hierarchies

Caching Basics. Memory Hierarchies Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017 Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the

More information

Let!s go back to a course goal... Let!s go back to a course goal... Question? Lecture 22 Introduction to Memory Hierarchies

Let!s go back to a course goal... Let!s go back to a course goal... Question? Lecture 22 Introduction to Memory Hierarchies 1 Lecture 22 Introduction to Memory Hierarchies Let!s go back to a course goal... At the end of the semester, you should be able to......describe the fundamental components required in a single core of

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

Agenda. EE 260: Introduction to Digital Design Memory. Naive Register File. Agenda. Memory Arrays: SRAM. Memory Arrays: Register File

Agenda. EE 260: Introduction to Digital Design Memory. Naive Register File. Agenda. Memory Arrays: SRAM. Memory Arrays: Register File EE 260: Introduction to Digital Design Technology Yao Zheng Department of Electrical Engineering University of Hawaiʻi at Mānoa 2 Technology Naive Register File Write Read clk Decoder Read Write 3 4 Arrays:

More information

CS 152 Computer Architecture and Engineering. Lecture 11 - Virtual Memory and Caches

CS 152 Computer Architecture and Engineering. Lecture 11 - Virtual Memory and Caches CS 152 Computer Architecture and Engineering Lecture 11 - Virtual Memory and Caches Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

Virtual Memory. Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK]

Virtual Memory. Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK] Virtual Memory Adapted from instructor s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] Virtual Memory Usemain memory asa cache a for secondarymemory

More information

LECTURE 10: Improving Memory Access: Direct and Spatial caches

LECTURE 10: Improving Memory Access: Direct and Spatial caches EECS 318 CAD Computer Aided Design LECTURE 10: Improving Memory Access: Direct and Spatial caches Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses

More information