Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches through Programmable Decoders


Chuanjun Zhang
Department of Computer Science and Electrical Engineering, University of Missouri-Kansas City

Abstract

Level one cache normally resides on a processor's critical path, which determines the clock frequency. Direct-mapped caches exhibit fast access time but poor hit rates compared with same sized set-associative caches because of non-uniform accesses to the cache sets: some sets suffer many conflict misses while other sets are underutilized. We propose a technique that reduces the miss rate of direct-mapped caches by balancing the accesses to cache sets. We lengthen the decoder index and thereby reduce the accesses to heavily used sets without dynamically detecting cache set usage information, and we introduce a replacement policy to the direct-mapped cache design that increases the accesses to underutilized sets with the help of programmable decoders. On average, the proposed balanced cache, or B-Cache, achieves 64.5% and 37.8% miss rate reductions over all 26 SPEC2K benchmarks for the instruction and data caches, respectively. This translates into an average IPC improvement of 5.9%. The B-Cache consumes 10.5% more power per access but achieves a 2% total memory-access-related energy saving, owing to the miss rate reductions and hence the shorter application execution time. Compared with previous techniques that aim at reducing the miss rate of direct-mapped caches, our technique requires only one cycle for all cache hits and has the same access time as a direct-mapped cache.

1. Introduction

The increasing gap between memory latency and processor speed is a critical bottleneck to achieving a high performance computing system. To bridge the gap, a multilevel memory hierarchy is used to hide the memory latency. Level one cache normally resides on a processor's critical path, which determines the clock frequency; therefore fast access to the level one cache is important for processor performance. A conventional direct-mapped cache accesses only one tag array and one data array per access, whereas a set-associative cache accesses multiple tag arrays and data arrays per access. Thus, a direct-mapped cache does not require a multiplexer to combine multiple accessed data items and hence can have a faster access time: a direct-mapped cache is 29.5% and 19.3% faster than a same sized 8-way cache [21] at sizes of 8kB and 16kB, respectively. A direct-mapped cache also consumes less power per access because it accesses only one way instead of multiple ways: 74.7% and 68.8% less power than a same sized 8-way cache at sizes of 8kB and 16kB, respectively. A direct-mapped cache is also simple to design, easy to implement, and occupies less area. However, a direct-mapped cache may have a higher miss rate than a set-associative cache, depending on the access pattern of the executing application, and a higher miss rate means more time spent waiting for the next level of the memory hierarchy. On average, the miss rate of a direct-mapped instruction cache is 29.0% and 100.0% higher, and the miss rate of a direct-mapped data cache is 37.7% and 28.0% higher, than a same sized 8-way cache at sizes of 8kB and 16kB, respectively. Therefore, a direct-mapped cache may or may not result in better overall performance or energy for a particular application.
Ideally, a cache should have the access time of a direct-mapped cache but a miss rate as low as that of a highly associative cache. A set-associative cache has two distinct advantages over a direct-mapped cache: conflict miss reduction and a replacement policy. A set-associative cache reduces conflict misses by choosing the victim from multiple cache blocks, whereas only one cache block can be chosen in a direct-mapped cache. The replacement policy of a set-associative cache can select a better victim by considering the cache access history. This is particularly valuable because memory accesses in a program run are extremely unbalanced, causing some cache sets to be accessed heavily while others remain underutilized. However, accessing a set-associative cache requires a longer access time, more power, and more area. We propose a novel mechanism to provide the benefit of cache block replacement while maintaining the constant access time of a direct-mapped cache. We call this the Balanced Cache, or simply, the B-Cache. The major contributions of our work are as follows:
1. We propose to increase the decoder length of a traditional direct-mapped cache by three bits. This has two effects: (a) Accesses to heavily used sets can potentially be reduced to one eighth of the original design; therefore we neither have to detect those heavily used sets dynamically nor pay the corresponding hardware overhead. (b) Only one eighth of the memory address space has a mapping to the cache sets. We call this the limited memory address mapping.

2. We propose to incorporate a replacement policy into the B-Cache. On a cache miss, when the desired address cannot find a mapping to the cache sets because of the above-mentioned limited address mapping, the B-Cache can increase the accesses to underutilized sets through the replacement policy, without explicitly detecting those underutilized sets dynamically.
3. We propose to use a programmable decoder for the B-Cache, since the B-Cache must dynamically determine which memory address has a mapping to the cache sets after a cache miss.
Using execution-driven simulations, we demonstrate that the B-Cache achieves an average miss rate reduction of 64.5% and 37.8% over all 26 benchmarks from the SPEC2K [13] suite for level one direct-mapped 16kB instruction and data caches, respectively. This translates into an instruction per cycle (IPC) improvement of up to 27.1% and an average IPC improvement of 5.9%. Although the B-Cache consumes 10.5% more power per access than the original direct-mapped cache, it achieves a 2% total memory-access-related energy saving because of the miss rate reductions and hence the reduction in application execution time. Furthermore, compared with other techniques that reduce a direct-mapped cache's conflict misses, the access time of the B-Cache is the same as that of a traditional direct-mapped cache. Lastly, the B-Cache requires only one cycle for all cache hits, while other techniques either need a second cycle for part of the cache hits or have a longer access time than a direct-mapped cache.
Section 2 further motivates the proposed technique. Section 3 describes the organization of the B-Cache. Experimental methodology and results are presented in Section 4. Section 5 describes the programmable decoder design. Performance and energy analysis are presented in Section 6. Related work is discussed in Section 7 and concluding remarks are given in Section 8.

2. Motivation

2.1 The Problem

Memory reference addresses are mapped to cache sets based on index decoding. Because of the well-known locality [20][17] exhibited in both instructions and data, some cache sets are accessed more frequently than others, generating more conflict misses, while other cache sets are underutilized. Substantial research effort has been devoted to reducing the conflict misses of direct-mapped caches. A victim buffer [14] is a small fully-associative cache that can resolve the conflict misses of small direct-mapped caches. However, an extra cycle is required to access the victim buffer when the direct-mapped cache and the buffer are accessed sequentially, or the access time of the cache is prolonged if they are checked concurrently, since a multiplexer is required to select the desired output from the buffer or the cache. The column-associative cache [1], the adaptive group-associative cache [20], the predictive sequential associative cache [5], and the partial address matching cache [18] trade varied hit latencies for the reduced access time of set-associative caches. Multiple hit latencies disrupt the datapath pipeline and complicate the design of pipelined caches [2].

2.2 Example

Figure 1 shows a conventional direct-mapped cache (a), a conventional 2-way cache (b), and the proposed B-Cache (c).
Figure 1: (a) A conventional direct-mapped cache. (b) A conventional 2-way cache. (c) The proposed B-Cache with programmable decoders. PI: programmable index; NPI: non-programmable index. Invalid PD entries during the cache's cold start are marked with x; the two PIs sharing an NPI must differ to maintain unique address decoding.

For simplicity, the cache contains only eight sets and the address contains only eight bits. For the address sequence 0, 1, 8, 9, 0, 1, 8, 9, the direct-mapped cache experiences the worst case of having no cache hits at all. This occurs because the cache accesses are completely non-uniform, which is also the case in real applications [20]. A traditional direct-mapped cache cannot adaptively accommodate these thrashing addresses since the index decoding is fixed. On the other hand, the 2-way cache hits after the first four warm-up accesses. The 2-way cache achieves this high hit rate because its decoder length is one bit shorter than that of the direct-mapped cache. During a miss, the 2-way cache has two candidates for the victim while the direct-mapped cache has just one. Intuitively, decreasing the decoder length creates more opportunities to choose a better victim; for this example, the index lengths of 4-way and 8-way caches would be 1 bit and 0 bits, respectively.

2.3 Solution

Figure 1 (c) shows the proposed B-Cache. Instead of decreasing the decoder length, we increase the decoder length by one bit.

The two most significant index bits are programmable indices feeding the programmable decoders (PD), and the two least significant index bits are non-programmable indices feeding the non-programmable decoders (NPD). The outputs of the two decoders are ANDed together to control the activation of a word line. We discuss how to determine the index length and how to divide the index into programmable and non-programmable parts in Section 6. For this example, the B-Cache exhibits the same hit rate as the 2-way cache. The defining feature of the B-Cache is the PD, which is programmed with the index of each desired address on the fly during a cache refill after a miss. For this simple example, there are two victim candidates, determined by the non-programmable index of an address. The PD is dynamically programmed in the following three situations.
First, during the cache's cold start, the contents of the PDs are invalid and are programmed using the programmable index of each desired address. For addresses that have the same non-programmable index, such as addresses 0 and 8, the victim is chosen using the replacement policy (least recently used replacement is assumed).
Second, the B-Cache misses but the PD hits. For example, this happens when address 25 is accessed after the aforementioned address sequence. The non-programmable index of address 25 (11001) is 01 and, from Figure 1 (c), the corresponding programmable indices in the two PDs are 00 and 10. Since the programmable index of address 25 is 10, the B-Cache has a PD hit. Recall that the B-Cache is a direct-mapped cache and only one cache block is activated during an access, so address 25 must replace address 9. In this situation, the B-Cache cannot choose a better victim based on the access history. If the B-Cache instead replaced address 1 with address 25, then address 9 would have to be evicted as well to maintain unique address decoding; this would inadvertently hurt the hit rate and should be avoided.
Lastly, both the B-Cache and the PD miss. This happens when address 13 is accessed, since the programmable index of address 13 (1101) is 11, which differs from the indices stored in the two PDs, 00 and 10. Neither PD is activated. Note that this cache miss is not an extra miss caused by the limited address mapping; the same situation exists in a fully associative cache that uses the whole tag as the decoding index, and no extra misses are incurred. In fact, the PD predetermines the cache miss, so neither data nor tag is read out from the memory during a PD miss. Since the cache miss signal is true by default, this causes no problem. The victim for address 13 can be chosen from either of the two candidate sets based on the replacement policy.
It is important to point out the difference between the B-Cache and the 2-way cache for this example: for the 2-way cache, both addresses 13 and 25 can be mapped to either of the two sets. Decreasing the PD hit rate during a cache miss improves the B-Cache hit rate, since a low PD hit rate means that the replacement policy can be fully exploited to balance the cache accesses and thus reduce the conflict misses.
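The three situations can be made concrete with a minimal behavioral sketch of the toy cache in Figure 1 (c), assuming LRU replacement as in the text; the class and field names are ours, and real hardware would implement this with CAM-based decoders rather than Python lists.

```python
# Hypothetical behavioral model of the toy example in Figure 1 (8 sets, 8-bit
# addresses, no block offset). Names are illustrative, not from the paper.
class ToyBCache:
    def __init__(self, oi=3, pi=2, npi=2):
        self.pi, self.npi = pi, npi
        self.bas = 2 ** (oi - npi)            # victim candidates per NPI value
        # one entry per physical set: [pd_value, tag, valid, lru_stamp]
        self.sets = {n: [[None, None, False, 0] for _ in range(self.bas)]
                     for n in range(2 ** npi)}
        self.clock = 0

    def access(self, addr):
        self.clock += 1
        npi = addr & ((1 << self.npi) - 1)
        pi = (addr >> self.npi) & ((1 << self.pi) - 1)
        tag = addr >> (self.npi + self.pi)
        cluster = self.sets[npi]
        pd_hit = next((e for e in cluster if e[0] == pi), None)
        if pd_hit and pd_hit[2] and pd_hit[1] == tag:      # B-Cache hit
            pd_hit[3] = self.clock
            return True
        # miss: the victim is forced on a PD hit, otherwise chosen by LRU
        victim = pd_hit if pd_hit else min(cluster, key=lambda e: e[3])
        victim[:] = [pi, tag, True, self.clock]            # reprogram the PD
        return False

seq = [0, 1, 8, 9] * 2
bc = ToyBCache()
print(sum(bc.access(a) for a in seq), "hits out of", len(seq))  # 4, like 2-way
bc.access(25)   # PD hit on a miss: must evict the block holding address 9
bc.access(13)   # PD miss: victim freely chosen among the two candidates
```

Running the sketch reproduces the walkthrough above: four hits on the repeated 0, 1, 8, 9 sequence (matching the 2-way cache), a forced eviction of address 9 when address 25 arrives, and a free LRU choice of victim for address 13.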
3. The B-Cache Organization

3.1 Terminology

The fundamental idea behind the B-Cache is to reduce conflict misses by balancing accesses to the cache sets. The defining property of the B-Cache is that its index is longer than that of a conventional direct-mapped cache.
We define the memory address mapping factor (MF): MF = 2^(PI+NPI) / 2^OI, where MF ≥ 1, and PI, NPI, and OI are the index lengths of the PD and NPD of the B-Cache and of the original direct-mapped cache, respectively. This means that only 1/MF of the total memory addresses have a mapping to the cache sets after the B-Cache's index length is increased. We show in Section 4.3.2 that larger values of MF decrease the PD hit rate during a cache miss.
The B-Cache can reduce the miss rate because multiple cache locations can be chosen for a victim, as shown in Figure 1 (c). We define the B-Cache associativity (BAS): BAS = 2^OI / 2^NPI, where BAS ≥ 1. This means that the B-Cache logically divides the cache sets into BAS clusters, as shown in Figure 2. On a cache miss, the victim can be chosen from these BAS clusters, so the miss rate of the B-Cache approaches the miss rate of a same sized BAS-way cache. The case MF = 1 or BAS = 1 is equivalent to a traditional direct-mapped cache; for the B-Cache, both MF and BAS must be larger than 1.

3.2 Organization

Cache memory is partitioned into a number of identically sized subarrays to trade off access time, area, and power consumption [10][21]. Each subarray stores part of the accessed word, or several words, depending on how the cache memory is divided by word line and bit line cuts. Figure 2 shows the organization of a 16kB direct-mapped cache with a line size of 32 bytes. The address is assumed to have 32 bits. The data memory is partitioned into four subarrays and the tag memory is partitioned into eight subarrays (not shown in the figure) [21]. Each subarray has its own local decoder. The divided word line technique [25] is adopted to achieve both fast access time and low per-access energy consumption. For ease of illustration, we use the two least significant index bits, I1 and I0, as the inputs to the global word line decoder, which selects a subarray. The other seven bits, I8 to I2, select the cache line within a particular subarray. The combination of the global and local word lines locates a particular set for an address. Note that any combination of address bits can be used as the global and local word line decoder inputs; the B-Cache is not restricted to any particular index decoding scheme or cache memory partition. There are optimization opportunities in selecting the address bits for index decoding to lower the miss rate, but indexing optimization [11] is outside the scope of this paper. For the B-Cache, we replace the original local decoders with new decoders. The new decoder includes eight 4×16 NPDs and eight 6×16 PDs. The NPD uses index bits I5 to I2, while the PD uses I8, I7, I6 and three tag bits T2, T1, and T0 as inputs.
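As a concrete check of these definitions for the configuration just described, the sketch below splits a 32-bit address into the fields consumed by the global decoder, the NPD, and the PD, and evaluates MF and BAS. The bit positions follow from the 5-bit block offset and 9-bit index, the helper names are ours, and NPI here is taken to include both the global decoder bits and the local NPD bits.

```python
# Assumed bit layout for the 16kB, 32-byte-line baseline:
# bits [4:0] offset, [13:5] index I0..I8, [31:14] tag (T0..T2 are bits 16:14).
def bcache_fields(addr):
    index = (addr >> 5) & 0x1FF                      # I8..I0 (9 bits)
    tag = addr >> 14                                 # 18-bit tag
    return {
        "global": index & 0x3,                       # I1, I0: selects 1 of 4 subarrays
        "npd": (index >> 2) & 0xF,                   # I5..I2: 4x16 non-programmable decoder
        "pd": ((index >> 6) << 3) | (tag & 0x7),     # I8..I6 + T2..T0: 6-bit PD search key
    }

def mf(pi, npi, oi):                                 # MF = 2^(PI+NPI) / 2^OI
    return 2 ** (pi + npi) / 2 ** oi

def bas(npi, oi):                                    # BAS = 2^OI / 2^NPI
    return 2 ** oi / 2 ** npi

# PI = 6 (I8..I6 plus T2..T0), NPI = 6 (2 global + 4 local NPD bits), OI = 9.
print(mf(6, 6, 9), bas(6, 9))                        # -> 8.0 8.0
```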

For this particular design, the B-Cache's two parameters are MF = 8 and BAS = 8. We determine MF and BAS through experimentation and thus defer the discussion of how to choose the B-Cache parameters to Sections 4.3.1 and 4.3.2.

3.3 The B-Cache's Replacement Policy

The B-Cache increases the accesses to underutilized sets and reduces conflict misses through its replacement policy. Any replacement policy available to conventional set-associative caches is applicable to the B-Cache. We evaluate both random replacement and the least recently used (LRU) policy for the B-Cache design. The random policy is simple to design and needs trivial extra hardware; LRU may achieve a better hit rate but has more area overhead than the random policy. Designing a new replacement policy specifically for the B-Cache, one that detects the underutilized sets and chooses the victim accordingly, might reduce the miss rate further. However, we show in Section 4.3.3 that the B-Cache's miss rate reduction approaches that of an 8-way cache, which makes further miss rate reduction through novel replacement policies unimportant.

4. Experimental Methodology and Results

We use miss rate as the primary metric for the B-Cache's effectiveness, collected with a 4-issue out-of-order processor simulator. We determine the B-Cache parameters MF and BAS through experimentation. Overall performance improvement and energy are discussed in Section 6.

4.1 Cache Hierarchy

The baseline level one cache is a direct-mapped 16kB cache with a line size of 32 bytes for both the instruction and data caches. We use a unified 4-way 256kB L2 cache with LRU replacement and a hit latency of 6 cycles. Results for level one cache sizes of 8kB and 32kB are also collected to show the effectiveness of the B-Cache.

4.2 Benchmarks

We ran all 26 SPEC2K benchmarks using the SimpleScalar tool set [3]. The benchmarks are pre-compiled for the Alpha ISA and were obtained from the SimpleScalar developers [24]. The benchmarks were fast-forwarded for two billion instructions and then executed for 500 million instructions using reference inputs. For the data cache, all results are reported. For the instruction cache, the results of benchmarks whose miss rates are less than 0.01% are not reported (to save space in the plots), since further reducing the miss rate is unlikely to matter for them; these benchmarks include applu, art, bzip, facerec, galgel, gzip, lucas, mcf, mgrid, swim, and vpr. The overall performance and energy improvements, however, are reported for all 26 benchmarks.

4.3 Experimental Results

Figure 4 and Figure 5 show the relative miss rate reductions of nine cache configurations compared to the baseline. The data cache results are reported separately for the CINT2K (integer) and CFP2K (floating point) components of SPEC2K. Four conventional set-associative cache configurations, 2-way, 4-way, 8-way, and 32-way, are included to show their miss rate reductions over the baseline. The miss rate reduction of a 16-entry victim buffer is also included and is compared with the B-Cache design in Section 6.6. The other four configurations are the B-Cache at memory address mapping factor MF = 2, 4, 8, or 16 with B-Cache associativity BAS = 8. The bar marked Ave is the average of the miss rate reductions over all the benchmarks.
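For reference, the reduction metric reported in Figures 4 and 5 (and later in Figure 12), and the unweighted averaging behind the Ave bar, can be written as the following small helpers; the function names are ours.

```python
# Miss rate reduction of a configuration relative to the direct-mapped baseline,
# and the unweighted per-benchmark average shown as the "Ave" bar.
def miss_rate_reduction(baseline_miss_rate, config_miss_rate):
    return 100.0 * (baseline_miss_rate - config_miss_rate) / baseline_miss_rate

def ave_bar(per_benchmark_reductions):
    return sum(per_benchmark_reductions) / len(per_benchmark_reductions)

# e.g. a benchmark whose miss rate drops from 4% to 1.5% shows a 62.5% reduction
print(miss_rate_reduction(0.04, 0.015))
```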
4.3.1 The B-Cache Associativity - BAS

We determine the B-Cache associativity, BAS, through experimentation. From Figures 4 and 5, we can see that increasing the associativity beyond eight does not bring significant miss rate reductions. On average, the miss rate reduction of the 32-way cache is only 1% higher than that of the 8-way cache, with one exception: the benchmark perlbmk shows a 20% improvement over the 8-way cache. On the other hand, the 8-way cache outperforms the 4-way cache by 2.6% and 4.6% for the data cache on CINT2K and CFP2K, respectively, and by 7.9% for the instruction cache on the reported benchmarks. For some benchmarks (crafty, eon, equake, gap, and twolf), the 8-way cache has more than a 10% miss rate reduction over a 4-way instruction cache; the same holds for the data cache on crafty and fma3d.

Figure 2: B-Cache organization. The 32-bit address is divided into an 18-bit tag, a 9-bit index, and a 5-bit offset. The 2-bit global decoder (I1, I0) drives the global word lines to select one of the four data subarrays; in each subarray the original local decoder is replaced by a new decoder consisting of eight 6×16 programmable decoders (PD) and eight 4×16 non-programmable decoders (NPD), and the sets are logically divided into clusters 1 through 8.

Since the miss rate of the B-Cache with associativity BAS = 8 approaches the miss rate of an 8-way cache, we choose BAS = 8. Increasing BAS to 16 or higher would not improve the miss rate reductions significantly but may incur more overhead in the PDs, which is discussed in Section 5.

4.3.2 Memory Address Mapping Factor - MF

We determine the optimal MF through experimentation. From Figures 4 and 5, as MF is increased from 2 to 4 to 8, the miss rate reduction of the B-Cache for the data cache on CINT2K increases from 18.3% to 31.3% to 38.1%, and for CFP2K from 21.8% to 31.1% to 36.4%. Continuing to increase MF to 16 improves the miss rate reduction by a mere 1.7% and 1.0% for CINT2K and CFP2K, respectively. For the instruction cache, the miss rate reductions for MF = 2, 4, and 8 are 33.2%, 53.4%, and 64.7%, respectively, and the miss rate is reduced by merely 0.4% more when MF is increased to 16. For some benchmarks, such as wupwise, facerec, galgel, and sixtrack for the data cache, the miss rate reduction of the B-Cache is lower than that of the 4-way cache. We therefore increase MF and measure the B-Cache's miss rate and PD hit rate during cache misses for the benchmark wupwise; the results are shown in Figure 3. The miss rate decreases as the decoder hit rate does, because the B-Cache cannot fully exploit the replacement policy when there is a PD hit.

Figure 3: Data cache miss rate (left) and PD hit rate (right) of the benchmark wupwise at a cache size of 16kB for varied MF values.

When MF is increased from 32 to 64, the PD hit rate drops sharply, and so does the cache miss rate. However, a larger MF means more bits in the PD and thus more overhead in terms of area and access time. Considering the access time analysis of the B-Cache in Section 5.1, we choose a memory address mapping factor MF = 8.

4.3.3 Miss Rate Reduction of the B-Cache

The upper bound on the miss rate reduction of the B-Cache with BAS = 8 is the miss rate reduction of an 8-way cache, and it approaches this upper bound as the memory address mapping factor MF increases.

Figure 4: Data cache miss rate reductions of a 2-way, 4-way, 8-way, and 32-way cache, a 16-entry victim buffer, and the B-Cache with MF = 2, 4, 8, or 16 and BAS = 8, compared to the baseline (replacement policy is LRU).

Figure 5: Instruction cache miss rate reductions of a 2-way, 4-way, 8-way, and 32-way cache, a 16-entry victim buffer, and the B-Cache with MF = 2, 4, 8, or 16 and BAS = 8, compared to the baseline (replacement policy is LRU).

From Figure 4 and Figure 5, the miss rate reduction of the B-Cache for the data cache is as good as that of a 4-way cache; for the instruction cache, the miss rate reduction is on average 5% better than a 4-way cache.

5. Programmable Decoder Design

In this section we evaluate the latency, storage, and power costs associated with the B-Cache. We use HSPICE simulation to measure the PD's access time. Storage is measured in terms of equivalent SRAM bits. Both HSPICE and Cacti v3.2 are used to model the B-Cache's access time and energy.

5.1 Timing Analysis

The critical path of a conventional direct-mapped cache normally resides on the tag side instead of the data side [19]. The B-Cache incorporates three tag bits into the PD; therefore the tag length is reduced by three bits, which reduces the tag side access time. However, we do not claim this access time reduction, since the data and tag paths may already be balanced [19] so that the data side is on the critical path. The B-Cache modifies the local decoder without changing the global decoder. Therefore, the B-Cache's local decoder should run faster than, or at least as fast as, the original local decoder of each subarray to avoid increasing the access time.

Figure 6: Decoder timing analysis. (a) The original decoder and the B-Cache's 4×16 NPD. (b) The original and the B-Cache's eight 4×16 NPDs. (c) Search bitline segmentation.

The B-Cache decoder consists of a PD and an NPD, as shown in Figure 6 (a). The original decoder is composed of two 2-input NAND gates and one 3-input NAND gate, as shown in Figure 6 (b); the outputs of these NAND gates are ORed through a 3-input NOR gate. The NPD in the B-Cache decoder has two 2-input NAND gates whose outputs are ORed through a 2-input NOR gate. The PD of the B-Cache is composed of eight 6×16 CAMs. We use standard ten-transistor CAM cells; each CAM cell contains an SRAM cell and a dynamic XOR gate for comparison. The search line and bit line are separated to reduce the capacitive load. The matchlines of the CAM cells are precharged high and conditionally discharged on a mismatch. Therefore, we need an AND gate to combine the outputs of the NPD and PD of the B-Cache. Inserting a new AND gate would increase the access time, so the technique proposed in [28] is employed, which changes the inverter in the original wordline driver to a 2-input NAND gate. The transistor sizes of the 2-input NAND are selected to make the NAND gate as fast as the original inverter, so it incurs no access time overhead. With this technique, as long as there is time slack between the B-Cache decoder and the original decoder, the B-Cache does not lengthen the cache's access time. Table 1 shows the access time for decoders of 8×256, 7×128, 6×64, 5×32, and 4×16, which correspond to subarray sizes of 8kB, 4kB, 2kB, 1kB, and 512 bytes with a cache line size of 32 bytes, respectively. As far as we know, no level one cache has a subarray size larger than 8kB or smaller than 512 bytes [21]. For each original decoder, the corresponding B-Cache decoder is shown. The access time of the conventional decoders is calculated with Cacti 3.2, while the access time of the PD of the B-Cache is obtained from HSPICE simulation of circuits extracted from our own CAM layout using Cadence tools [4]. The area of the CAM cell is 25% larger than that of the SRAM cell.
Our HSPICE simulation results are comparable with those reported by other researchers [8].

Table 1: Timing analysis of the B-Cache decoder. PD: programmable decoder; NPD: non-programmable decoder. Decoder compositions are given in NAND and NOR gates; for example, 2D_3R stands for 2-input NAND gates followed by a 3-input NOR gate.

Decoder                 8×256    7×128    6×64     5×32     4×16
Original composition    3D-3R    3D-3R    2D-3R    3D-2R    2D-2R
B-Cache PD              CAM      CAM      CAM      CAM      CAM
B-Cache NPD             3D-2R    2D-2R    NAND3    NAND2    INV

In the B-Cache's 4×16 NPD, the NOR3 is replaced by a NOR2 and the NAND3 gates are removed.

From this table, we can see that all of the decoders have time slack left; therefore, the B-Cache has no access time overhead compared with a conventional direct-mapped cache. One may ask how the B-Cache's decoder can be as fast as that of the original direct-mapped cache given that the B-Cache's index is three bits longer than the original index. Compared with the baseline, the B-Cache's 4×16 NPD is faster than the original decoder because the NOR3 gates are replaced by faster NOR2 gates, and the critical path of the decoder, for this example, changes from the 3-input NAND gates to the 2-input NAND gates. We should point out, however, that the B-Cache's 4×16 NPD is much slower than the 4×16 decoder of the original direct-mapped cache with a subarray size of 512 bytes, as shown in Table 1. This is because they have different subarray dimensions and output loads: the 2-input NAND gate in the B-Cache's 4×16 NPD has a fan-out of 8 × 4 = 32 gates, while the fan-out of the 4×16 decoder in the conventional direct-mapped cache is 4 gates. This also explains why the decoder in the conventional 16kB direct-mapped cache is not itself split into two smaller decoders combined through the NAND gate (changed from the inverter): the performance improvement would be less than 0.5%, at an area overhead of 128 2-input NOR gates and enlarged inverters. To speed up the CAM comparison, the search bit lines of the CAM are segmented as shown in Figure 6 (c), similar to the technique proposed in [10]. Each search bit line needs 9 extra inverters; for this 16kB direct-mapped cache, a total of 648 inverters is required, which represents a small fraction of the total area.

5.2 Different Partitions of Tag and Data Memory

Tag and data memory typically use different memory partitions to achieve the best tradeoff among performance, power, and area. For example, for this 16kB direct-mapped cache the data memory is partitioned into four subarrays while the tag memory is partitioned into eight subarrays [14]. Figure 7 shows the B-Cache decoder design for the tag and data memory subarrays. For the tag decoder, the eight subarrays need 3 bits for the global decoder, while the global data decoder requires 2 bits for four subarrays. Since both the tag and data decoders must have the same B-Cache parameters, MF = 8, BAS = 8, and programmable index length PI = 6, the NPD length of the B-Cache is 4 bits for the data decoder and 3 bits for the tag decoder. From Table 1 we can see that these decoders' access times are faster than those of the original design.

Figure 7: PD design when the memory partitions of tag and data differ. PD: programmable decoder; NPD: non-programmable decoder. (a) Data memory partition: 2-bit global index, 4-bit NPD, 6-bit PD. (b) Tag memory partition: 3-bit global index, 3-bit NPD, 6-bit PD. The four-input AND gates are implemented as two stages of NAND and NOR gates.

5.3 Storage Overhead

The additional hardware for the B-Cache is the CAM-based PD. The area of a CAM cell [8] is 25% larger than that of the SRAM cell used by the data and tag memory. There are sixty-four 6×8 and thirty-two 6×16 CAMs. The total storage requirements of both the baseline and the B-Cache are given in Table 2.

Table 2: Storage cost analysis.

            Tag decoder       Tag memory      Data decoder      Data memory
Baseline    no memory cells   20 bit × 512    no memory cells   256 bit × 512
B-Cache     64 6×8 CAMs       17 bit × 512    32 6×16 CAMs      256 bit × 512
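A back-of-the-envelope check of Table 2, assuming the 25% CAM-to-SRAM cell area factor quoted above is applied to the CAM bits, reproduces the overhead figure quoted next; the script is our arithmetic, not the paper's tooling.

```python
# Equivalent-SRAM-bit storage estimate for the 16kB baseline and the B-Cache
# (Table 2), assuming each CAM bit costs 1.25 SRAM bits (25% larger cell).
SETS = 512                                  # 16kB / 32B lines

baseline = 20 * SETS + 256 * SETS           # tag memory + data memory

cam_bits = 64 * 6 * 8 + 32 * 6 * 16         # tag PDs + data PDs
bcache = 17 * SETS + 256 * SETS + cam_bits * 1.25

overhead = (bcache - baseline) / baseline
print(f"{overhead:.1%}")                    # ~4.3%
```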
The overhead of the B-Cache increases the total cache area of the baseline by 4.3%, which is less than that of a same sized 4-way cache, whose area is 7.98% larger than the baseline even without counting the area used to implement the replacement policy [21].

5.4 Power Overhead

The extra power consumption comes from the PD of each subarray; the data and tag memory have four and eight subarrays, respectively. We measure the power consumption of the PD using HSPICE simulation in a 0.18um technology. A 6×8 and a 6×16 CAM decoder consume 0.78pJ and 1.62pJ per search, respectively. The B-Cache uses sixty-four 6×8 CAMs for the tag PDs and thirty-two 6×16 CAMs for the data PDs. We calculate the power reductions due to the 3-bit tag length reduction and the removal of the 3-input NAND gates in the original tag and data decoders using Cacti 3.2. The energy per cache access of both the baseline and the B-Cache is shown in Table 3. The power consumption of the B-Cache is 10.5% higher than the baseline; however, it is still 17.4%, 44.4%, and 65.5% lower than same sized 2-way, 4-way, and 8-way caches, respectively. As described in Section 6.2, the overall energy analysis shows that the B-Cache reduces the overall memory related energy consumption by 2% relative to the baseline, owing to the reduced miss rate and hence the reduction in application execution time.

Table 3: Energy (pJ) per cache access for the baseline and the B-Cache. T: tag; D: data; SA: sense amplifier; Dec: decoder; BL: bitline; WL: wordline. Columns: T-SA, T-Dec, T-BL-WL, D-SA, D-Dec, D-BL-WL, D-others, and Total.

6. Analysis

In this section, we present the impact of the B-Cache on overall processor performance and energy consumption. We also discuss the tradeoffs among the B-Cache design parameters, evaluate the balance of the B-Cache, and compare the B-Cache with other related techniques. Finally, we discuss the issues involving virtually or physically addressed tagged caches.

6.1 Overall Performance

Table 4 shows the processor configuration for both the baseline and the proposed B-Cache. Figure 8 shows the performance improvements, measured in IPC, of a processor with a level one B-Cache, an 8-way cache, and a 16-entry victim buffer over the baseline processor. The processor with the B-Cache outperforms the baseline by an average of 5.9%. The greatest performance improvement is seen in equake, where the IPC increases by 27.1%. The performance of the B-Cache is only 0.3% lower than that of the 8-way cache but is 3.7% higher than that of the victim buffer. Although the victim buffer outperforms the B-Cache in data cache miss rate reduction for the benchmark wupwise, the overall performance of the B-Cache is still higher, since the B-Cache outperforms the victim buffer by 50% in miss rate reduction for the instruction cache, as shown in Figure 5. It is important to point out that both the B-Cache and the victim cache have a faster access time than the set-associative caches.

Table 4: Baseline and B-Cache processor configuration.
Fetch/Issue/Retire Width: 4 instructions/cycle, 4 functional units
Instruction Window Size: 16 instructions
L1 cache: 16kB, 32B line size, direct mapped
L2 Unified Cache: 256kB, 128B line size, 4-way, 6 cycle hit
Main Memory: infinite size, 100 cycle access

Figure 8: Performance improvement of a 2-way, 4-way, 8-way, the B-Cache, and a 16-entry victim buffer over the baseline at a cache size of 16kB (only the first four letters of each benchmark name are shown to save space).

Figure 9: Overall energy of a 2-way, 4-way, 8-way, the B-Cache, and a 16-entry victim buffer, normalized to the baseline, at a cache size of 16kB (only the first four letters of each benchmark name are shown to save space).

6.2 Overall Energy

There are two main sources of power dissipation in CMOS circuits: static power dissipation due to leakage current, and dynamic power dissipation due to logic switching current and the charging and discharging of load capacitance. We consider both types of energy in our evaluation. We evaluate the memory related energy consumption [28], including the on-chip caches described in Section 4.1 and the off-chip memory. Figure 10 lists the equations for computing the total memory related energy consumption; the italic terms are those we obtain through measurements or simulations. We compute cache_access, cache_miss, and cycles by running SimpleScalar simulations for each cache configuration. We compute E_cache_access and E_cache_block_refill using Cacti 3.2 for both the level one and level two caches. E_next_level_mem includes the energy of accessing the level two cache and the energy of accessing off-chip memory.
E_static_per_cycle is the total static energy consumed per cycle. Both the energy of accessing off-chip memory and E_static_per_cycle are highly system dependent. Using a methodology similar to [28], we evaluate the energy of accessing off-chip memory as 100 times larger than that of the baseline.

E_mem = E_dyn + E_static
E_dyn = cache_access * E_cache_access + cache_miss * E_misses
E_misses = E_next_level_mem + E_cache_block_refill
E_static = cycles * E_static_per_cycle
E_static_per_cycle = k_static * E_total_per_cycle

Figure 10: Equations for the energy evaluation.
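The model in Figure 10 can be restated directly in code; the sketch below uses placeholder inputs where the paper obtains values from Cacti 3.2 and SimpleScalar runs.

```python
# Total memory-access-related energy, following the equations in Figure 10.
# All e_* inputs are per-event energies (e.g. in pJ); the example numbers are
# placeholders, not values from the paper. The paper sets e_static_per_cycle
# so that static energy is 50% of the total (k_static = 0.5).
def memory_energy(cache_access, cache_miss, cycles,
                  e_cache_access, e_cache_block_refill, e_next_level_mem,
                  e_static_per_cycle):
    e_misses = e_next_level_mem + e_cache_block_refill
    e_dyn = cache_access * e_cache_access + cache_miss * e_misses
    e_static = cycles * e_static_per_cycle
    return e_dyn + e_static

total = memory_energy(cache_access=1_000_000, cache_miss=20_000, cycles=1_500_000,
                      e_cache_access=1.0, e_cache_block_refill=4.0,
                      e_next_level_mem=10.0, e_static_per_cycle=0.5)
print(total)
```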

We evaluate the static energy as 50% (k_static) of the total energy, which includes both dynamic and static energy. Figure 9 shows the energy of the 2-way, 4-way, and 8-way caches, the B-Cache, and the victim buffer, normalized to the baseline. On average, the B-Cache consumes the least energy, 2% less than the baseline. The greatest energy reduction is seen in crafty, where the energy is reduced by 14%. The B-Cache reduces the miss rate and hence the accesses to the second level cache, which is more power costly. During a cache miss, the B-Cache also reduces the cache memory accesses through the miss prediction of the PD, which makes the effective power overhead much smaller. In Section 6.3, we show that the PD can predict on average around 80% of the cache misses, thus saving the power of accessing both the data and the tag during those misses.

6.3 Design Tradeoffs for MF and BAS for a Fixed Length of PD

For a fixed PD length, such as PD = 4, the B-Cache has two design options, as shown in Figure 11. In design A, the B-Cache uses MF = 2 and BAS = 8; in design B, it uses MF = 4 and BAS = 4. The question is which design gives a higher miss rate reduction. In design A, the B-Cache has eight clusters and the miss rate reduction would approach that of an 8-way cache; however, MF = 2 means that the PD hit rate is high, so the B-Cache cannot fully exploit the replacement policy and the miss rate reduction may be low. Table 5 and Table 6 show the miss rate reductions and PD hit rates of the B-Cache at various MF, BAS, and PD combinations. For the same PD length, design B (BAS = 4) outperforms design A (BAS = 8) in miss rate reduction when PD is less than 6, because the PD hit rate of design B is lower than that of design A (design B has a larger MF). However, when PD is increased to 6, the PD hit rates are low for both designs and the effect of having more clusters becomes dominant, so design A achieves a higher miss rate reduction than design B. Based on our timing analysis in Section 5.1, we can use a 6-bit PD without incurring access time overhead; therefore we design the B-Cache with MF = 8 and BAS = 8. This result also presents the design options for other PD lengths and their corresponding miss rate reductions.

Table 5: Miss rate reductions of the B-Cache compared with a direct-mapped cache at varied MF (2, 4, 8, 16), BAS, and PD.

Table 6: PD hit rate during cache misses of the B-Cache at varied MF (2, 4, 8, 16), BAS, and PD.

Figure 11: Tradeoff of MF, BAS, and PD: (A) MF = 2, BAS = 8, PD = 4 and (B) MF = 4, BAS = 4, PD = 4, relative to the index of the original direct-mapped cache.
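One way to see this tradeoff is that the definitions in Section 3.1 imply MF × BAS = 2^PI, so fixing the PD length fixes the product of the two parameters; the helper below (ours) enumerates the resulting options for the 16kB baseline with OI = 9.

```python
# For a fixed programmable-index length PI, MF = 2^(PI+NPI)/2^OI and
# BAS = 2^OI/2^NPI imply MF * BAS = 2^PI, so the design space for one PD
# length is the set of factorizations of 2^PI into powers of two.
def design_options(pd_bits, oi=9):
    options = []
    for log_bas in range(1, pd_bits):          # require MF > 1 and BAS > 1
        bas = 2 ** log_bas
        mf = 2 ** pd_bits // bas
        npi = oi - log_bas                      # total non-programmable index bits
        options.append({"MF": mf, "BAS": bas, "NPI": npi})
    return options

print(design_options(4))   # PD = 4 -> design A (MF=2, BAS=8), design B (MF=4, BAS=4), plus (MF=8, BAS=2)
print(design_options(6))   # PD = 6 includes the chosen point MF = 8, BAS = 8
```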
6.4 Balance Evaluation

We classify a set as a frequent hit or frequent miss set when the cache hits or misses occurring in that set are more than 2 times the average. We classify a set as a less accessed set when the total accesses to the set are less than half of the average. We measure the frequent hit, frequent miss, and less accessed sets of the original direct-mapped cache and the B-Cache and show the results in Table 7 for the baseline data cache; similar results hold for the instruction cache but are not shown due to space limits. Comparing the B-Cache with the baseline, we find that the number of frequent hit sets remains almost unchanged but accounts for 39.8% of the total hits instead of 57.2%, a 17.4% reduction, which means that cache hits are spread across more sets. The frequent miss sets are reduced from 5.6% to 2.2% of the sets, and the misses occurring in them drop from 36.5% to 15.7% of the total, explaining the conflict miss reductions of the B-Cache. The less accessed sets are reduced from 50.2% to 32.4%, which means that more cache sets are used efficiently to accommodate the cache accesses.
From Figures 4 and 5 we observe that the miss rate reductions of the benchmarks art, lucas, swim, and mcf are less than 10% for both the set-associative caches and the B-Cache. From Table 7, we observe that these benchmarks have no frequent miss sets, which means that their cache misses are spread evenly over all cache sets; continuing to balance the accesses to the sets would not bring significant miss rate reductions. For other benchmarks, such as equake, the miss rate reductions are higher than 80%: the frequent miss sets are reduced from 5.5% to 2.3%, but the share of total cache misses occurring in the frequent miss sets drops from 76.9% to 7.0%, which means that the conflicting addresses have largely been remapped to less accessed sets, thus reducing the conflict misses. The B-Cache still has many cache sets that are less accessed. This is because we try to minimize the cache misses and remap missed addresses to less accessed sets: on average, the cache miss rate is 1% and 9.2% for the instruction and data caches, respectively, so even if we remapped all the cache misses to less accessed sets, those sets would still be accessed much less often than the frequent hit sets. Based on the above discussion, other cache design techniques that take advantage of non-uniform cache accesses, such as Drowsy cache [9] and Cache decay [16], can still be used on the B-Cache, since the less accessed sets can still be put into a drowsy state.
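The set classification used in this evaluation maps directly to code; the sketch below (our naming) labels each set from per-set hit and miss counts using the twice-the-average and half-the-average thresholds defined above.

```python
# Classify cache sets as frequent-hit, frequent-miss, or less-accessed,
# given per-set hit and miss counts gathered from a simulation run.
def classify_sets(hits, misses):
    n = len(hits)
    avg_hits = sum(hits) / n
    avg_misses = sum(misses) / n
    avg_accesses = (sum(hits) + sum(misses)) / n
    labels = []
    for h, m in zip(hits, misses):
        tags = set()
        if h > 2 * avg_hits:
            tags.add("frequent_hit")
        if m > 2 * avg_misses:
            tags.add("frequent_miss")
        if (h + m) < 0.5 * avg_accesses:
            tags.add("less_accessed")
        labels.append(tags)
    return labels

# e.g. fraction of sets that are frequent-miss sets:
# sum("frequent_miss" in t for t in classify_sets(hits, misses)) / len(hits)
```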

Table 7: Data cache memory access behavior. fhs: frequent hit sets; fms: frequent miss sets; ch: cache hits; cm: cache misses; las: less accessed sets; tca: total cache accesses. The table is read as follows: for benchmark ammp at the baseline, for example, the 6.8% of sets that are frequent hit sets account for 54.3% of the cache hits; 54.9% of the total cache misses occur on 6.8% of the cache sets; and the 59.8% of sets that are less accessed account for 14.4% of the total cache accesses.

6.5 The Effect of L1 Cache Sizes

Figure 12 shows the miss rate reductions of twelve cache configurations: the B-Cache at memory address mapping factor MF = 2, 4, 8, or 16 with B-Cache associativity BAS = 4 or 8, the conventional 2-way, 4-way, and 8-way caches at sizes of 8kB and 32kB, and a 16-entry victim buffer. From Figure 12, the B-Cache exhibits miss rate reductions at 8kB and 32kB similar to those for the 16kB direct-mapped caches. The miss rate reductions grow as MF is increased from 2 to 16, for both BAS = 4 and BAS = 8. For the same PD length of 6 bits, the B-Cache with MF = 8 and BAS = 8 has a higher miss rate reduction than the design with MF = 16 and BAS = 4. Therefore, we conclude that the design with MF = 8 and BAS = 8 is the best B-Cache configuration for cache sizes of 8kB, 16kB, and 32kB.

Figure 12: Miss rate reductions of the B-Cache at cache sizes of 32kB and 8kB (data and instruction caches).

6.6 Comparison with a Victim Buffer

We compare the B-Cache with a 16-entry victim buffer with a line size of 32 bytes; a victim buffer with more than 16 entries may not bring significant additional miss rate reduction but may increase the buffer's access time and energy. The miss rate reductions of the buffer for both the instruction and data caches are shown in Figures 4, 5, and 12 at cache sizes of 8kB, 16kB, and 32kB. For the 16kB baseline, the average miss rate reduction of the B-Cache is 13.6% and 20.7% higher than that of the victim buffer for CINT2K and CFP2K, respectively, and 37.9% higher for the instruction cache; only one benchmark, wupwise, has a lower miss rate reduction than with the victim buffer. On average, the miss rate reduction of the B-Cache is 14.4% and 52.3% higher than the victim buffer for the 32kB data and instruction caches, respectively, and 15.3% and 17.2% higher for the 8kB data and instruction caches.

6.7 Comparison with a Highly Associative Cache

The highly associative cache (HAC) [22] has been proposed for low-power embedded systems. To reduce power consumption, the HAC is aggressively partitioned with a subarray size of 1kB and only one subarray is accessed per reference; therefore the search of the CAM tags has to wait for the completion of the global decoding, which lengthens the access time of the HAC. In fact, the HAC is an extreme case of the B-Cache in which the decoder is fully programmable. For a 16kB HAC with a line size of 32 bytes and an associativity of 32, the CAM tag (also the PD) length of the HAC is 23 + 3 (status) = 26 bits, while the B-Cache uses only 6 bits of CAM to achieve similar miss rate reductions. Therefore, the HAC could be improved using the technique we propose, reducing both the power consumption and the area of the CAM.

6.8 Issues on Virtually/Physically Addressed Tagged Caches

Our scheme requires a decoder that is three bits longer than the baseline, and the extra three bits come from the tag bits of the original design. These three tag bits are required no later than the set index. If the tag, but not the set index, must first be translated by a translation lookaside buffer (TLB), there is a problem, since the programmable index lookup cannot proceed. This situation arises in a virtually indexed, physically tagged (V/P) cache, as in the PowerPC [12]. A similar problem exists for the skewed-associative cache [23] and the way-halting cache [29], where the lowest four bits of the virtual tags from the processor must equal the lowest four bits of the physical tags stored in the cache, which means those tag bits need no translation. For the B-Cache, only the lowest three bits of the tag are required to be the same as those stored in the tag memory, and we may simply treat these three bits as part of the virtual index. The B-Cache works under the other tag and data array addressing schemes, using either the virtual or the physical address: virtually indexed, virtually tagged; physically indexed, virtually tagged; and physically indexed, physically tagged caches.

7. Related Work

The related work can generally be categorized into two types: techniques that reduce the miss rate of direct-mapped caches and techniques that reduce the access time of set-associative caches.

7.1 Reducing the Miss Rate of Direct-Mapped Caches

Techniques that resolve the conflict misses of direct-mapped caches with the help of the operating system have been proposed. Page allocation [7] can be optimized to reduce the conflict misses of a direct-mapped cache with operating system involvement: a Cache Miss Lookaside buffer detects conflict misses by recording a history of cache misses, and a software policy implemented in the operating system removes them by dynamically remapping pages whenever large numbers of conflict misses are detected. This technique enables a direct-mapped cache to perform nearly as well as a two-way set-associative cache; the B-Cache is implemented entirely in hardware, with miss rate reductions approaching those of a 4-way cache. The column-associative cache [1] uses a direct-mapped cache and an extra bit for dynamically selecting alternate hashing functions. This design improves the miss rate to that of a 2-way cache at the cost of an extra rehash bit and a multiplexer (for address generation) that could affect the critical path of a cache hit.
The column-associative cache can be extended to include multiple alternative locations, as described in [6][30]. The B-Cache achieves higher miss rate reductions while maintaining the constant latency of a direct-mapped cache. Peir et al. [20] use cache space intelligently by taking advantage of the cache holes that appear during the execution of a program; their adaptive group-associative cache (AGAC) dynamically allocates data to these holes and thereby reduces the conflict misses of a direct-mapped cache. Both the AGAC and the B-Cache achieve miss rate reductions comparable to a 4-way cache. However, the AGAC needs three cycles to access the relocated cache lines, which account for 5.24% of the total cache hits, while the B-Cache needs one cycle for all cache hits. The skewed-associative cache [23] is a 2-way cache that exploits two or more indexing functions, derived by XORing two m-bit fields of an address to generate an m-bit cache index, to achieve a miss rate close to that of a same sized 4-way cache. The B-Cache is a direct-mapped cache with a faster access time and achieves the same miss rate reductions as the skewed-associative cache. Our earlier work [26][27] could only balance the accesses to instruction caches; the proposed B-Cache is a complete overhaul of that design that balances the accesses to both instruction and data caches.

7.2 Reducing the Access Time of Set-Associative Caches

Partial address matching [18] reduces the access time of set-associative caches by predicting the hit way. The tag memory is separated into two arrays, a Main Directory (MD) and a Partial Address Directory (PAD). The PAD contains only a small part of the full tag bits (e.g., 5 bits), so the PAD comparison is faster than a full tag comparison; the result of the PAD comparison predicts the hit way, while the result of the MD comparison verifies the hit. If the PAD prediction is incorrect, a second cycle is required to access the correct way. The difference-bit cache [15] is a two-way set-associative cache with an access time close to that of a direct-mapped cache: the bit in which two tags differ is determined dynamically by a special decoder and used to select the potential hit way from the two ways. Compared with the partial address matching method, our technique never requires an extra cycle to fetch the desired data after a misprediction in the PAD comparison. The access time of the difference-bit cache is slower than that of the B-Cache; furthermore, the B-Cache can achieve a miss rate as low as a traditional 4-way cache, while the difference-bit cache cannot do better than a 2-way cache. Compared with the previous techniques, the B-Cache can be applied to both high performance and low-power


More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

Circuit and Microarchitectural Techniques for Reducing Cache Leakage Power

Circuit and Microarchitectural Techniques for Reducing Cache Leakage Power Circuit and Microarchitectural Techniques for Reducing Cache Leakage Power Nam Sung Kim, Krisztián Flautner, David Blaauw, and Trevor Mudge Abstract On-chip caches represent a sizable fraction of the total

More information

Comparative Analysis of Contemporary Cache Power Reduction Techniques

Comparative Analysis of Contemporary Cache Power Reduction Techniques Comparative Analysis of Contemporary Cache Power Reduction Techniques Ph.D. Dissertation Proposal Samuel V. Rodriguez Motivation Power dissipation is important across the board, not just portable devices!!

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES Shashikiran H. Tadas & Chaitali Chakrabarti Department of Electrical Engineering Arizona State University Tempe, AZ, 85287. tadas@asu.edu, chaitali@asu.edu

More information

250 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 2, FEBRUARY 2011

250 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 2, FEBRUARY 2011 250 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 2, FEBRUARY 2011 Energy-Efficient Hardware Data Prefetching Yao Guo, Member, IEEE, Pritish Narayanan, Student Member,

More information

AN ENERGY EFFICIENT TCAM ENHANCED CACHE ARCHITECTURE. A Thesis JASON MATHEW SURPRISE

AN ENERGY EFFICIENT TCAM ENHANCED CACHE ARCHITECTURE. A Thesis JASON MATHEW SURPRISE AN ENERGY EFFICIENT TCAM ENHANCED CACHE ARCHITECTURE A Thesis by JASON MATHEW SURPRISE Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

Drowsy Instruction Caches

Drowsy Instruction Caches Drowsy Instruction Caches Leakage Power Reduction using Dynamic Voltage Scaling and Cache Sub-bank Prediction Nam Sung Kim, Krisztián Flautner, David Blaauw, Trevor Mudge {kimns, blaauw, tnm}@eecs.umich.edu

More information

Parallel Computing 38 (2012) Contents lists available at SciVerse ScienceDirect. Parallel Computing

Parallel Computing 38 (2012) Contents lists available at SciVerse ScienceDirect. Parallel Computing Parallel Computing 38 (2012) 533 551 Contents lists available at SciVerse ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Algorithm-level Feedback-controlled Adaptive data

More information

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Keerthi Bhushan Rajesh K Chaurasia Hewlett-Packard India Software Operations 29, Cunningham Road Bangalore 560 052 India +91-80-2251554

More information

Improving Cache Performance using Victim Tag Stores

Improving Cache Performance using Victim Tag Stores Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

More information

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example 1 Which is the best? 2 Lecture 05 Performance Metrics and Benchmarking 3 Measuring & Improving Performance (if planes were computers...) Plane People Range (miles) Speed (mph) Avg. Cost (millions) Passenger*Miles

More information

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 02 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on top (a) and in

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah ABSTRACT The growing dominance of wire delays at future technology

More information

Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts

Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts Yoav Etsion Dror G. Feitelson School of Computer Science and Engineering The Hebrew University of Jerusalem 14 Jerusalem, Israel Abstract

More information

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand Caches Nima Honarmand Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required by all of the running applications

More information

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N.

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N. Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical

More information

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas 78712-24 TR-HPS-26-3

More information

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Aneesh Aggarwal Electrical and Computer Engineering Binghamton University Binghamton, NY 1392 aneesh@binghamton.edu Abstract With the

More information

Energy-Effective Instruction Fetch Unit for Wide Issue Processors

Energy-Effective Instruction Fetch Unit for Wide Issue Processors Energy-Effective Instruction Fetch Unit for Wide Issue Processors Juan L. Aragón 1 and Alexander V. Veidenbaum 2 1 Dept. Ingen. y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, Spain

More information

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Memory Hierarchy & Caches Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required

More information

Program Phase Directed Dynamic Cache Way Reconfiguration for Power Efficiency

Program Phase Directed Dynamic Cache Way Reconfiguration for Power Efficiency Program Phase Directed Dynamic Cache Reconfiguration for Power Efficiency Subhasis Banerjee Diagnostics Engineering Group Sun Microsystems Bangalore, INDIA E-mail: subhasis.banerjee@sun.com Surendra G

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Minimizing Power Dissipation during. University of Southern California Los Angeles CA August 28 th, 2007

Minimizing Power Dissipation during. University of Southern California Los Angeles CA August 28 th, 2007 Minimizing Power Dissipation during Write Operation to Register Files Kimish Patel, Wonbok Lee, Massoud Pedram University of Southern California Los Angeles CA August 28 th, 2007 Introduction Outline Conditional

More information

Se-Hyun Yang, Michael Powell, Babak Falsafi, Kaushik Roy, and T. N. Vijaykumar

Se-Hyun Yang, Michael Powell, Babak Falsafi, Kaushik Roy, and T. N. Vijaykumar AN ENERGY-EFFICIENT HIGH-PERFORMANCE DEEP-SUBMICRON INSTRUCTION CACHE Se-Hyun Yang, Michael Powell, Babak Falsafi, Kaushik Roy, and T. N. Vijaykumar School of Electrical and Computer Engineering Purdue

More information

Performance Oriented Prefetching Enhancements Using Commit Stalls

Performance Oriented Prefetching Enhancements Using Commit Stalls Journal of Instruction-Level Parallelism 13 (2011) 1-28 Submitted 10/10; published 3/11 Performance Oriented Prefetching Enhancements Using Commit Stalls R Manikantan R Govindarajan Indian Institute of

More information

An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron High-Performance I-Caches

An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron High-Performance I-Caches To appear in the Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA), 2001. An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron

More information

Slide Set 5. for ENCM 501 in Winter Term, Steve Norman, PhD, PEng

Slide Set 5. for ENCM 501 in Winter Term, Steve Norman, PhD, PEng Slide Set 5 for ENCM 501 in Winter Term, 2017 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary Winter Term, 2017 ENCM 501 W17 Lectures: Slide

More information

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 s Joshua J. Yi and David J. Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Spring 2018 :: CSE 502. Cache Design Basics. Nima Honarmand

Spring 2018 :: CSE 502. Cache Design Basics. Nima Honarmand Cache Design Basics Nima Honarmand Storage Hierarchy Make common case fast: Common: temporal & spatial locality Fast: smaller, more expensive memory Bigger Transfers Registers More Bandwidth Controlled

More information

Reactive-Associative Caches

Reactive-Associative Caches Reactive-Associative Caches Brannon Batson Alpha Design Group Compaq Computer Corporation bbatson@pa.dec.com T. N. Vijaykumar School of Electrical & Computer Engineering Purdue University vijay@ecn.purdue.edu

More information

Caching Basics. Memory Hierarchies

Caching Basics. Memory Hierarchies Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby

More information

CS152 Computer Architecture and Engineering

CS152 Computer Architecture and Engineering CS152 Computer Architecture and Engineering Caches and the Memory Hierarchy Assigned 9/17/2016 Problem Set #2 Due Tue, Oct 4 http://inst.eecs.berkeley.edu/~cs152/fa16 The problem sets are intended to help

More information

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring 2019 Caches and the Memory Hierarchy Assigned February 13 Problem Set #2 Due Wed, February 27 http://inst.eecs.berkeley.edu/~cs152/sp19

More information

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn A Cross-Architectural Interface for Code Cache Manipulation Kim Hazelwood and Robert Cohn Software-Managed Code Caches Software-managed code caches store transformed code at run time to amortize overhead

More information

Exploiting Core Working Sets to Filter the L1 Cache with Random Sampling

Exploiting Core Working Sets to Filter the L1 Cache with Random Sampling Exploiting Core Working Sets to Filter the L Cache with Random Sampling Yoav Etsion and Dror G. Feitelson Abstract Locality is often characterized by working sets, defined by Denning as the set of distinct

More information

Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures

Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures Zeshan Chishti, Michael D. Powell, and T. N. Vijaykumar School of Electrical and Computer Engineering, Purdue

More information

Many Cores, One Thread: Dean Tullsen University of California, San Diego

Many Cores, One Thread: Dean Tullsen University of California, San Diego Many Cores, One Thread: The Search for Nontraditional Parallelism University of California, San Diego There are some domains that feature nearly unlimited parallelism. Others, not so much Moore s Law and

More information

Exploiting Streams in Instruction and Data Address Trace Compression

Exploiting Streams in Instruction and Data Address Trace Compression Exploiting Streams in Instruction and Data Address Trace Compression Aleksandar Milenkovi, Milena Milenkovi Electrical and Computer Engineering Dept., The University of Alabama in Huntsville Email: {milenka

More information

Memory hier ar hier ch ar y ch rev re i v e i w e ECE 154B Dmitri Struko Struk v o

Memory hier ar hier ch ar y ch rev re i v e i w e ECE 154B Dmitri Struko Struk v o Memory hierarchy review ECE 154B Dmitri Strukov Outline Cache motivation Cache basics Opteron example Cache performance Six basic optimizations Virtual memory Processor DRAM gap (latency) Four issue superscalar

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2017 Lecture 15 LAST TIME: CACHE ORGANIZATION Caches have several important parameters B = 2 b bytes to store the block in each cache line S = 2 s cache sets

More information

Reconfigurable STT-NV LUT-based Functional Units to Improve Performance in General-Purpose Processors

Reconfigurable STT-NV LUT-based Functional Units to Improve Performance in General-Purpose Processors Reconfigurable STT-NV LUT-based Functional Units to Improve Performance in General-Purpose Processors Adarsh Reddy Ashammagari 1, Hamid Mahmoodi 2, Tinoosh Mohsenin 3, Houman Homayoun 1 1 Dept. of Electrical

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

Dynamically Controlled Resource Allocation in SMT Processors

Dynamically Controlled Resource Allocation in SMT Processors Dynamically Controlled Resource Allocation in SMT Processors Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona

More information

Automatically Characterizing Large Scale Program Behavior

Automatically Characterizing Large Scale Program Behavior Automatically Characterizing Large Scale Program Behavior Timothy Sherwood Erez Perelman Greg Hamerly Brad Calder Department of Computer Science and Engineering University of California, San Diego {sherwood,eperelma,ghamerly,calder}@cs.ucsd.edu

More information

Using Aggressor Thread Information to Improve Shared Cache Management for CMPs

Using Aggressor Thread Information to Improve Shared Cache Management for CMPs Appears in Proc. of the 18th Int l Conf. on Parallel Architectures and Compilation Techniques. Raleigh, NC. Sept. 2009. Using Aggressor Thread Information to Improve Shared Cache Management for CMPs Wanli

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2014 Lecture 14 LAST TIME! Examined several memory technologies: SRAM volatile memory cells built from transistors! Fast to use, larger memory cells (6+ transistors

More information

PARE: A Power-Aware Hardware Data Prefetching Engine

PARE: A Power-Aware Hardware Data Prefetching Engine PARE: A Power-Aware Hardware Data Prefetching Engine Yao Guo Mahmoud Ben Naser Csaba Andras Moritz Department of Electrical and Computer Engineering University of Massachusetts, Amherst, MA 01003 {yaoguo,

More information

Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs

Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs University of Maryland Technical Report UMIACS-TR-2008-13 Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs Wanli Liu and Donald Yeung Department of Electrical and Computer Engineering

More information

Workloads, Scalability and QoS Considerations in CMP Platforms

Workloads, Scalability and QoS Considerations in CMP Platforms Workloads, Scalability and QoS Considerations in CMP Platforms Presenter Don Newell Sr. Principal Engineer Intel Corporation 2007 Intel Corporation Agenda Trends and research context Evolving Workload

More information

A Power and Temperature Aware DRAM Architecture

A Power and Temperature Aware DRAM Architecture A Power and Temperature Aware DRAM Architecture Song Liu, Seda Ogrenci Memik, Yu Zhang, and Gokhan Memik Department of Electrical Engineering and Computer Science Northwestern University, Evanston, IL

More information

An Energy-Efficient High-Performance Deep-Submicron Instruction Cache

An Energy-Efficient High-Performance Deep-Submicron Instruction Cache An Energy-Efficient High-Performance Deep-Submicron Instruction Cache Michael D. Powell ϒ, Se-Hyun Yang β1, Babak Falsafi β1,kaushikroy ϒ, and T. N. Vijaykumar ϒ ϒ School of Electrical and Computer Engineering

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Power Reduction Techniques in the Memory System. Typical Memory Hierarchy

Power Reduction Techniques in the Memory System. Typical Memory Hierarchy Power Reduction Techniques in the Memory System Low Power Design for SoCs ASIC Tutorial Memories.1 Typical Memory Hierarchy On-Chip Components Control edram Datapath RegFile ITLB DTLB Instr Data Cache

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Picking Statistically Valid and Early Simulation Points

Picking Statistically Valid and Early Simulation Points In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), September 23. Picking Statistically Valid and Early Simulation Points Erez Perelman Greg Hamerly

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Cache Designs and Tricks. Kyle Eli, Chun-Lung Lim

Cache Designs and Tricks. Kyle Eli, Chun-Lung Lim Cache Designs and Tricks Kyle Eli, Chun-Lung Lim Why is cache important? CPUs already perform computations on data faster than the data can be retrieved from main memory and microprocessor execution speeds

More information

Computing Architectural Vulnerability Factors for Address-Based Structures

Computing Architectural Vulnerability Factors for Address-Based Structures Computing Architectural Vulnerability Factors for Address-Based Structures Arijit Biswas 1, Paul Racunas 1, Razvan Cheveresan 2, Joel Emer 3, Shubhendu S. Mukherjee 1 and Ram Rangan 4 1 FACT Group, Intel

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Implicitly-Multithreaded Processors

Implicitly-Multithreaded Processors Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University {parki,vijay}@ecn.purdue.edu http://min.ecn.purdue.edu/~parki http://www.ece.purdue.edu/~vijay Abstract

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

RECENT studies have shown that, in highly associative

RECENT studies have shown that, in highly associative IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 4, APRIL 2008 433 Counter-Based Cache Replacement and Bypassing Algorithms Mazen Kharbutli, Member, IEEE, and Yan Solihin, Member, IEEE Abstract Recent studies

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Design of Experiments - Terminology

Design of Experiments - Terminology Design of Experiments - Terminology Response variable Measured output value E.g. total execution time Factors Input variables that can be changed E.g. cache size, clock rate, bytes transmitted Levels Specific

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Cache Pipelining with Partial Operand Knowledge

Cache Pipelining with Partial Operand Knowledge Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin - Madison {egunadi,mikko}@ece.wisc.edu Abstract

More information

Koji Inoue Department of Informatics, Kyushu University Japan Science and Technology Agency

Koji Inoue Department of Informatics, Kyushu University Japan Science and Technology Agency Lock and Unlock: A Data Management Algorithm for A Security-Aware Cache Department of Informatics, Japan Science and Technology Agency ICECS'06 1 Background (1/2) Trusted Program Malicious Program Branch

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science CS 2002 03 A Large, Fast Instruction Window for Tolerating Cache Misses 1 Tong Li Jinson Koppanalil Alvin R. Lebeck Jaidev Patwardhan Eric Rotenberg Department of Computer Science Duke University Durham,

More information