Power/Performance Advantages of Victim Buffer in High-Performance Processors


Gianluca Albera*† (* Politecnico di Torino, Dip. di Automatica e Informatica, Torino, ITALY) and R. Iris Bahar† († Brown University, Division of Engineering, Providence, RI)

Abstract

In this paper, we propose several different data cache configurations and analyze their power as well as performance implications on the processor. Unlike most existing work in low power microprocessor design, we explore a high performance processor with the latest innovations for performance. Using a detailed, architectural-level simulator, we evaluate full system performance under several different power/performance sensitive cache configurations. We then use the information obtained from the simulator to calculate the energy consumption of the memory hierarchy of the system. We show that a victim buffer offers better cache energy savings than other techniques (10% compared to 3.8%), while at the same time providing comparable performance gains (3.54% compared to 3.45%).

1: Introduction

In this paper we concentrate on reducing the energy demands of an ultra high-performance processor, such as the Pentium Pro or the Alpha 21264, which uses superscalar, speculative, out-of-order execution. In particular, we investigate architectural-level solutions that achieve a power reduction in the memory subsystem of the processor without compromising performance.

Prior research has been aimed at measuring and recommending optimal cache configurations for power. In [9], the authors determined that high performance caches were also the lowest power consuming caches, since they reduce the traffic to the lower levels of the memory system. The work by Kin [7] proposed accessing a small filter cache before accessing the first level cache in order to reduce the number of accesses to (and the energy consumption of) the DL1. The idea led to a large reduction in memory hierarchy energy consumption, but also resulted in a substantial reduction in processor performance. While this reduction in performance may be tolerable for some applications, the high-end market will not make such a sacrifice. This paper proposes memory hierarchy configurations that reduce power while retaining performance.

Reducing cache misses due to line conflicts has been shown to be effective in improving overall system performance in high-performance processors. Techniques to reduce conflicts include increasing cache associativity, use of victim caches [5], or cache bypassing with and without the aid of a buffer [4, 8, 10]. Figure 1 shows the design of the memory hierarchy when using a buffer alongside the first level data cache.

Also included in the figure is a representation of the five main stages found in a speculative, out-of-order execution processor. Note that the instruction cache access occurs in the fetch stage, while the data cache access occurs in the issue stage for a load operation and in the commit stage for a store operation. An instruction may spend more than one cycle in any of these five stages depending on the type of instruction, data dependencies, and cache hit outcome.

[Figure 1. Memory hierarchy design using a buffer alongside the DL1 cache; the buffer helps boost the effectiveness of the cache. The figure shows the L1 instruction and data caches between the processor and the unified L2 cache, with the buffer placed next to the DL1, and the five pipeline stages (fetch, decode, issue, writeback, commit) annotated with the IL1 access in fetch and the DL1 accesses for loads in issue and stores in commit.]

The buffer is a small cache, between 8 and 16 entries, located between the first level and second level caches. The buffer may be used to hold specific data (e.g., non-temporal or speculative data), or it may be used for general data (e.g., "victim" data). Note that this buffer may be used in combination with the first level data cache (DL1) and/or the first level instruction cache (IL1). Previous work has shown that, for some specific applications, using a buffer alongside the instruction or data cache exclusively for non-temporal or speculative data can improve cache performance and energy consumption over a strictly "victim buffer" scheme [1, 8].

This paper focuses on the effectiveness of using a buffer alongside the DL1 only. In addition, we focus more closely on the impact of the buffer when used as a general purpose victim buffer. Specifically, we compare the use of the victim buffer to more traditional techniques for reducing cache conflicts, such as increasing cache size and/or associativity. We find that the optimal victim buffer configuration depends on the base data cache size, the second level cache configuration, and the processor configuration. In addition, we show that use of a victim buffer is usually a better choice for both power and performance compared to increasing either the data cache size or associativity.

2: Experimental setup

This section presents our experimental environment. First, the CPU simulator is briefly introduced; we then describe how we obtained data about the energy consumption of the caches.

2.1: Full model simulator

We use an extension of the SimpleScalar [2] tool suite. SimpleScalar is an execution-driven simulator that uses binaries compiled to a MIPS-like target. SimpleScalar can accurately model a high-performance, dynamically-scheduled, multi-issue processor. We use an extended version of the simulator that more accurately models the entire memory hierarchy, implementing non-blocking caches and complete bus bandwidth and contention modeling [3]. Other modifications were added to handle precise modeling of cache fills.

Table 1 shows the configuration of the processor modeled. Note that the first level caches are on-chip, while the unified second level cache is off-chip. In addition, we have an 8 to 16 entry fully associative buffer associated with each first level cache. Note that the ALU resources listed in Table 1 may incur different latency and occupancy values depending on the type of operation being performed by the unit. Although an 8KB L1 cache may seem small by today's standards for a high-performance processor, the SPEC95 benchmarks we use for our experiments tend to use a small working set. Therefore, we use a smaller cache to obtain reasonable hit/miss rates.

Table 1. Machine configuration and processor resources.

  Parameter     Configuration
  L1 Icache     8KB direct; 32B line; 1 cycle
  L1 Dcache     8KB direct; 32B line; 1 cycle
  L2 Cache      256KB 4-way; 64B line; 12 cycles
  Memory        64 bit-wide; 20 cycles on hit, 40 cycles on page miss
  Branch Pred.  2k bimodal
  BTB           2048 entry, 4-way set assoc.
  RAS           8 entry queue
  ITLB          32 entry, fully assoc.
  DTLB          64 entry, fully assoc.

  Parameter                   Units
  Fetch/Issue/Commit Width    4
  Integer ALU                 4
  Integer Mult/Div            1
  FP ALU                      4
  FP Mult/Div/Sqrt            1
  DL1 Read Ports              2
  DL1 Write Ports             1
  Instruction Window Entries  32
  Load/Store Queue Entries    16
  Fetch Queue                 8

Our simulations are executed on the SPECint95 and SPECfp95 benchmarks [12]; they were compiled using a re-targeted version of the GNU gcc compiler with full optimization. This compiler generates SimpleScalar machine instructions. Since we are executing a full model on a very detailed simulator, the benchmarks take several hours to complete; due to time constraints, we feed the simulator with a small set of inputs. Integer benchmarks are executed to completion. Floating point benchmarks are simulated only for the first 600 million clock cycles.

2.2: Power model

Energy dissipation in CMOS circuits is mainly due to charging and discharging gate capacitances; every transition dissipates an energy E_t = (1/2) C_eq V_dd^2 Joules. To obtain the values of the equivalent capacitances, C_eq, for the components in the memory subsystem, we follow the model given by Wilton and Jouppi [11]. Their model assumes a 0.8 um process; if a different process is used, only the transistor capacitances need to be recomputed. To obtain the number of transitions that occur on each transistor, we refer to Kamble and Ghose [6], adapting their work to our overall architecture.
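As a concrete illustration of this model, the sketch below computes the per-transition energy from an equivalent capacitance and a supply voltage. The numeric values are hypothetical placeholders chosen for illustration, not figures taken from [11] or from this paper; in the actual methodology the capacitances come from the Wilton/Jouppi model and cacti.

  # Minimal sketch of the per-transition energy model, E_t = 1/2 * C_eq * Vdd^2.
  # Capacitance and supply voltage are illustrative placeholders; real C_eq
  # values would come from the Wilton/Jouppi model and the cacti tool.

  def transition_energy(c_eq_farads, vdd_volts):
      """Energy (Joules) to charge or discharge one equivalent capacitance."""
      return 0.5 * c_eq_farads * vdd_volts ** 2

  if __name__ == "__main__":
      c_eq = 50e-15        # 50 fF, hypothetical bit-line capacitance
      vdd = 3.3            # volts, hypothetical supply voltage
      n_transitions = 1_000_000
      total = n_transitions * transition_energy(c_eq, vdd)
      print(f"Energy for {n_transitions} transitions: {total:.3e} J")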

An m-way set associative cache consists of three main parts: a data array, a tag array, and the necessary control logic. The data array consists of S rows containing m lines each. Each line contains L bytes of data and a tag T which is used to uniquely identify the data. Upon receiving a data request, the address is divided into three parts: the first part indexes one row in the cache, the second selects the bytes or words desired, and the last is compared to the stored tag entry to detect a hit or a miss. On a hit, the processor accesses the data from the first level cache. On a miss, we use a write-back, write-allocate policy.

The latency of the access is directly proportional to the capacitance driven and to the length of the bit-lines and word-lines. In order to maximize speed, we kept the arrays as square as possible by splitting the data and tag arrays vertically and/or horizontally. Sub-arrays can also be folded. We used the tool cacti [11] to compute the sub-arraying parameters for all our caches.

Following [6], we consider the main sources of power to be the following three components: E_bit, E_word, and E_output. We do not consider the energy dissipated in the address decoders, since we found this value to be negligible compared to the other components; similar to Kin [7], we found the energy consumption of the decoders to be about three orders of magnitude smaller than that of the other components. The energy consumption is computed as:

    E_cache = E_bit + E_word + E_output    (1)

A brief description of each of these components follows; further details can be found in [1] and [6].

Energy dissipated in bit-lines. E_bit is the energy consumption in the bit-lines; it is due to precharging the lines (including driving the precharge logic) and to reading or writing data. Since we assume a sub-arrayed cache, we need to precharge and discharge only the portion directly related to the address we need to read or write. Switching counts depend on the number of accesses to the cache and on the array parameters given by cacti [11], while capacitance values depend on the sizes of the transistors within the SRAM cell. Note that in order to minimize the power overhead introduced by buffers, in the fully associative configuration we first perform a tag look-up and access the data array only on a hit.

Energy dissipated in word-lines. E_word is the energy consumption due to the assertion of word-lines; once the bit-lines are all precharged, we select one row, performing the read/write of the desired data.

Energy dissipated driving external buses. E_output is the energy used to drive external buses; this component includes both the data sent/returned and the address sent to the lower level memory on a miss request:

    E_output = E_a_output + E_d_output    (2)

where E_a_output and E_d_output account for the address and data buses, respectively.
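To make the bookkeeping concrete, here is a minimal sketch of how equations (1) and (2) combine per-component transition counts and capacitances into a total cache energy. All names and numeric values are hypothetical stand-ins; in the actual methodology the transition counts come from the simulator and the capacitances from cacti.

  # Minimal sketch of equations (1) and (2): E_cache = E_bit + E_word + E_output,
  # with E_output split into address-bus and data-bus terms. Counts and
  # capacitances below are hypothetical placeholders.

  def transition_energy(c_eq_farads, vdd_volts):
      """E_t = 1/2 * C_eq * Vdd^2, in Joules."""
      return 0.5 * c_eq_farads * vdd_volts ** 2

  def cache_energy(counts, caps, vdd):
      """Total energy per equation (1); counts/caps are keyed by component."""
      e_bit = counts["bit"] * transition_energy(caps["bit"], vdd)
      e_word = counts["word"] * transition_energy(caps["word"], vdd)
      # Equation (2): external bus energy = address part + data part.
      e_output = (counts["addr"] * transition_energy(caps["addr"], vdd)
                  + counts["data"] * transition_energy(caps["data"], vdd))
      return e_bit + e_word + e_output

  if __name__ == "__main__":
      counts = {"bit": 4_000_000, "word": 500_000, "addr": 40_000, "data": 60_000}
      # Board-level bus capacitances (pF) dwarf on-chip array capacitances (fF),
      # which is why UL2 traffic dominates total energy in this model.
      caps = {"bit": 50e-15, "word": 30e-15, "addr": 20e-12, "data": 20e-12}
      print(f"E_cache = {cache_energy(counts, caps, vdd=3.3):.3e} J")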

3: Experimental results

3.1: Base case

In this section we describe the base case; all other experiments are compared against it. As stated before, our base case uses 8K direct-mapped, on-chip first level caches (i.e., DL1 for data and IL1 for instructions), with a unified 256K 4-way off-chip second level cache (UL2). Table 2 shows the instructions per cycle (IPC) and, for each cache, its miss rate and energy consumption (measured in Joules).

[Table 2. Baseline results: instructions per cycle (IPC), cache miss rate, and energy consumption (in Joules) for the DL1, IL1, and UL2 caches, for each of the benchmarks compress, go, vortex, gcc, li, ijpeg, m88ksim, perl, tomcatv, swim, su2cor, hydro2d, mgrid, and applu. The numeric entries were lost in extraction.]

The next sections' results are reported as percentage decreases relative to the base case; thus positive numbers indicate an improvement in energy consumption or performance, and negative numbers indicate a degradation. First level caches are on-chip, so their energy consumption counts at the CPU level, while the off-chip second level cache energy counts at the board level. In the following sections we show what happens to the energy in the overall cache architecture (L1 plus L2), and also clarify whether variations in power belong to the CPU or to the board. As shown in Table 2, the dominant portion of energy consumption is due to the UL2, because of its larger size and high-capacitance board buses. Furthermore, we see that the instruction cache demands more power than the data cache, due to a higher number of accesses.

3.2: Traditional techniques

Traditional approaches for reducing cache misses utilize bigger cache sizes and/or increased associativity. These techniques may offer good performance, but they present several drawbacks: first, an area increase is required; second, a larger access latency is likely to result whenever bigger caches and/or higher associativity are used. Table 3 presents results obtained by changing the data cache size or associativity. Specifically, we present results for 16K direct-mapped and 8K 2-way configurations, both maintaining a 1 cycle latency. Reductions in cycles, energy in the DL1, and total energy are shown.

[Table 3. Traditional techniques: percent improvement in performance (%IPC), DL1 miss rate (%M_DL1), DL1 energy (%E_DL1), and total energy (%E_total) from increasing DL1 size (16K direct-mapped) or associativity (8K 2-way) compared to the base case, at a 1 cycle latency, for each benchmark. Negative results indicate a decrease in performance or an increase in energy consumption. The numeric entries were lost in extraction.]

The overall energy consumption (i.e., %E_total) is generally reduced, due to reduced activity in the second level cache (since we decrease the DL1 miss rate). However, we can also see cases in which energy consumption increases. Benchmark su2cor already has a low miss rate (1.48%), so, although IPC improves, more energy is consumed in the DL1 than is saved by reduced UL2 accesses. On the other hand, m88ksim, mgrid and applu have higher miss rates, but these are due mainly to conflict misses.

Therefore, increasing the size of the cache does not improve the DL1 miss rate enough to reduce energy consumption. The opposite is true for swim, which has a very large miss rate (21.2%); its misses are predominately capacity misses, so although using a 16KB DL1 improves both performance and energy consumption, both degrade when using an 8KB 2-way associative DL1. It is important to note that even when the total energy consumption improves, on-chip consumption increases significantly (up to 61% for mgrid) as the cache becomes larger. In addition, we have observed that if the increase in size/associativity requires moving to a 2 cycle latency, the performance improvements are no longer so dramatic, and in some cases we also saw a decrease.

3.3: Victim cache

We next consider the use of a victim buffer, originally presented by Jouppi [5]. In the original implementation, on a main cache miss, the victim cache is accessed; if the address hits in the victim cache, the data is returned to the CPU and at the same time promoted to the main cache; the replaced line in the main cache is moved to the victim cache, therefore performing a "swap". If the victim cache also misses, an L2 access is performed; the incoming data fills the main cache, and the replaced line is moved to the victim cache. The replaced entry in the victim cache is discarded and, if dirty, written back to the second level cache. We modified the algorithm by performing a parallel look-up in the main and victim caches, as we saw that this helps performance without significant drawbacks on power. We also found that the time required by the processor to perform the swapping on a victim hit was detrimental to performance, so we also tried a variation of the algorithm that does not require swapping (i.e., on a victim cache hit, the line is not promoted to the main cache).
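The sketch below captures the two lookup policies just described: a swapping victim buffer in the style of [5], and the non-swapping variant in which a victim hit serves the data without promoting the line. It is a behavioral model only (direct-mapped main cache, FIFO victim replacement, no timing, write-backs of dirty victims omitted); the class and field names are our own illustrative choices, not code from the paper's simulator.

  # Behavioral sketch of a direct-mapped L1 backed by a small victim buffer.
  # swap=True models Jouppi's original scheme (a victim hit promotes the line
  # and demotes the conflicting L1 line); swap=False models the non-swapping
  # variant studied here (a victim hit serves data and leaves the L1 intact).
  # L1 and victim are probed in parallel in the modified scheme; the
  # sequential checks below are behaviorally equivalent.
  from collections import deque

  class VictimBufferCache:
      def __init__(self, n_sets, victim_entries=8, swap=True):
          self.n_sets = n_sets
          self.l1 = {}                                 # set index -> tag
          self.victim = deque(maxlen=victim_entries)   # FIFO of (tag, index)
          self.swap = swap

      def access(self, block_addr):
          """Classify one access; block_addr is the address minus offset bits."""
          index = block_addr % self.n_sets
          tag = block_addr // self.n_sets
          if self.l1.get(index) == tag:
              return "l1_hit"
          if (tag, index) in self.victim:
              if self.swap:
                  # Promote the victim line; demote the conflicting L1 line.
                  self.victim.remove((tag, index))
                  evicted = self.l1.get(index)
                  if evicted is not None:
                      self.victim.append((evicted, index))
                  self.l1[index] = tag
              return "victim_hit"
          # Miss in both: fill from L2; the replaced L1 line moves to the
          # victim buffer, whose oldest entry is silently discarded.
          evicted = self.l1.get(index)
          if evicted is not None:
              self.victim.append((evicted, index))
          self.l1[index] = tag
          return "miss"

  if __name__ == "__main__":
      trace = [0, 256, 0, 256, 512, 0]   # hypothetical conflicting block addresses
      cache = VictimBufferCache(n_sets=256, victim_entries=8, swap=False)
      print([cache.access(a) for a in trace])
      # -> ['miss', 'miss', 'victim_hit', 'l1_hit', 'miss', 'victim_hit']

Running both variants over the same address trace and tallying the returned labels gives the kind of hit/miss breakdown that underlies the comparison in Table 4.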

Table 4 shows the effect of using a fully associative victim cache alongside the data cache. We present performance (in IPC) and overall energy reduction using either an 8-entry or a 16-entry victim buffer, with and without swapping.

[Table 4. Fully associative victim cache: percent improvement in performance (%IPC) and energy consumption (%Energy) compared to the base case, for 8-entry and 16-entry victim caches with swapping and without swapping, for each benchmark. The numeric entries were lost in extraction.]

As stated before, the non-swapping mechanism generally outperforms the original algorithm of [5]. In [5], the data cache of a single-issue processor was considered, where a memory access occurs roughly once every four cycles; thus the victim cache had ample time to perform the necessary swapping. The same cannot be said of a 4-way issue processor that has, on average, one data memory access per cycle. In this case the improved miss rate obtained by swapping is often outweighed by the extra latency introduced by having to stall subsequent accesses.

It is interesting to note that, when using a victim buffer, increased performance and energy reduction go hand-in-hand, since reduced L2 activity helps both. As we saw in Section 3.2, this is not always the case when doubling the size or associativity. There are two reasons for this. First, the victim buffer can adapt better to either capacity or conflict miss problems. The swim benchmark is the best example of this: its performance increases by over 16% using a 16-entry non-swapping victim buffer, but by only 10% when doubling the size of the cache, and it degrades by 1.8% when doubling the associativity. The second reason is that the victim buffer adds less power dissipation on-chip than either of the other techniques. We found that on-chip energy consumption increased by only 2% on average for the 16-entry buffer (not shown in the table). This allows total cache energy savings to be much greater using the victim buffer than either increasing cache size or associativity (10% on average, vs. 3.8% for a 16KB DL1 or 5.0% for an 8K 2-way DL1).

3.4: Effectiveness of the Victim Cache using Different Sized L1 Data Caches

This section explores the effectiveness of the victim cache when a smaller or larger base L1 data cache is used. We first explored performance and energy consumption tradeoffs using a 4KB L1 data cache. We expect a 4KB DL1 cache to suffer more from capacity and conflict misses than an 8KB DL1 cache; accordingly, doubling the size, doubling the associativity, or using a victim buffer all obtained larger improvements in performance and energy consumption than those shown in Tables 3 and 4 (energy consumption often improves by 50% more, and performance by 30% more).

Victim caches still have a clear advantage over increasing the size or associativity of the DL1 in terms of energy consumption, while obtaining approximately the same or better performance improvement. However, one notable difference is the effectiveness of swapping versus non-swapping victim buffers. For an 8KB DL1 we found that swapping interferes with performance, since it can stall subsequent accesses to the DL1. On the other hand, since a processor with only a 4KB cache spends more time servicing DL1 misses (and therefore a longer time executing the program), DL1 accesses are spread out more, making concurrent swaps and accesses less frequent. The end result is that swapping helps to improve the cache hit rate without hindering processor performance. Alternatively, if the base DL1 cache size is increased to 16KB, then a 16-entry victim buffer still provides some improvement in performance and energy consumption, though not as great as with a 4KB or 8KB DL1. In addition, as the cache size increases, swapping continues to hinder performance.

3.5: Alternatives to Changing the Cache Configuration

Our ultimate goal is to reduce energy consumption while retaining performance. High-performance processors obtain performance increases through a variety of techniques, not just through improving the L1 cache structure. One question, then, is what effect other techniques have on performance and energy consumption compared to improving the cache configuration. We now look at increasing the issue and decode width of the processor from 4 to 8 instructions.

[Table 5. Increasing issue width: percent improvement in performance (%IPC), DL1 miss rate (%M_DL1), and energy consumption (%E_DL1, %E_IL1, %E_UL2, %E_total) compared to the base case when the decode and issue width is increased from 4 to 8 instructions, for each benchmark. The numeric entries were lost in extraction.]

Table 5 shows the effect on IPC and energy consumption. It is interesting to note that although performance does improve, energy consumption degrades in almost all cases for all three caches (as indicated by the negative values in the columns labeled %E_DL1, %E_IL1, and %E_UL2). The increased energy consumption is a direct result of increased cache accesses. With a wider issue processor, more instructions can enter the fetch and execute stages per cycle. This allows more speculative instructions on mispredicted paths to be executed before the branch misprediction is detected.

For these benchmarks, adding a victim cache would be a better choice, especially since wider issue also requires more hardware in the decode and fetch stages, increasing energy consumption even further in the non-cache portion of the chip. However, increasing issue width and using a victim cache are not mutually exclusive. Both can be implemented, and we have found that performance and energy consumption improve by roughly the same percentages as those shown in Tables 3 and 4. As with a 4-wide issue machine, a 16-entry non-swapping victim cache is overall the best choice compared to doubling the size or associativity of the DL1.

4: Conclusions and Future Work

In this paper we presented tradeoffs between power and performance in cache architectures; we showed that adding small buffers can offer an overall energy reduction without decreasing performance. These small buffers suffice to reduce the overall cache energy consumption without the notable increase in on-chip power that traditional techniques exhibit. Furthermore, use of a buffer is not mutually exclusive with increasing the cache size/associativity or the decode/issue width, offering the possibility of a combined effect. We are currently investigating further improvements in combined techniques, replacement policies, and buffer utilization.

Acknowledgments

We wish to thank Bobbie Manne for her valuable comments throughout this work and Doug Burger for providing us with the framework for the SimpleScalar memory model.

References

[1] R. I. Bahar, G. Albera, and S. Manne, "Using Confidence to Reduce Energy Consumption in High-Performance Microprocessors," International Symposium on Low Power Electronics and Design, 1998.
[2] D. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0," Technical Report TR#1342, University of Wisconsin, June 1997.
[3] D. Burger and T. M. Austin, "SimpleScalar Tutorial," presented at the 30th International Symposium on Microarchitecture, December 1997.
[4] T. L. Johnson and W. W. Hwu, "Run-time Adaptive Cache Hierarchy Management via Reference Analysis," International Symposium on Computer Architecture, pp. 315-326, June 1997.
[5] N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," International Symposium on Computer Architecture, pp. 364-373, May 1990.
[6] M. B. Kamble and K. Ghose, "Analytical Energy Dissipation Models for Low Power Caches," ACM/IEEE International Symposium on Low-Power Electronics and Design, August 1997.
[7] J. Kin, M. Gupta, and W. H. Mangione-Smith, "The Filter Cache: An Energy Efficient Memory Structure," 30th International Symposium on Microarchitecture, pp. 184-193, December 1997.
[8] J. A. Rivers and E. S. Davidson, "Reducing Conflicts in Direct-Mapped Caches with a Temporality-Based Design," International Conference on Parallel Processing, August 1996.
[9] C. Su and A. Despain, "Cache Design Tradeoffs for Power and Performance Optimization: A Case Study," ACM/IEEE International Symposium on Low-Power Design, pp. 63-68, April 1995.
[10] G. Tyson, M. Farrens, J. Matthews, and A. R. Pleszkun, "Managing Data Caches using Selective Cache Line Replacement," International Journal of Parallel Programming, Vol. 25, No. 3, pp. 213-242, June 1997.
[11] S. J. E. Wilton and N. Jouppi, "An Enhanced Access and Cycle Time Model for On-Chip Caches," Digital WRL Research Report 93/5, July 1994.
[12] J. Gee, M. Hill, D. Pnevmatikatos, and A. J. Smith, "Cache Performance of the SPEC Benchmark Suite," IEEE Micro, Vol. 13, No. 4, August 1993.
