Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches through Programmable Decoders


Chuanjun Zhang
Department of Computer Science and Electrical Engineering, University of Missouri-Kansas City

Abstract

Level one cache normally resides on a processor's critical path, which determines the clock frequency. Direct-mapped caches exhibit fast access time but poor hit rates compared with same sized set-associative caches because of non-uniform accesses to the cache sets: some sets suffer many conflict misses while other sets are underutilized. We propose a technique that reduces the miss rate of direct-mapped caches by balancing the accesses to cache sets. We lengthen the decoder index and thereby reduce the accesses to heavily used sets without dynamically detecting cache set usage information, and we introduce a replacement policy to the direct-mapped cache design that increases the accesses to underutilized sets with the help of programmable decoders. On average, the proposed balanced cache, or B-Cache, achieves 64.5% and 37.8% miss rate reductions over all 26 SPEC2K benchmarks for the instruction and data caches, respectively. This translates into an average IPC improvement of 5.9%. The B-Cache consumes 10.5% more power per access but achieves a 2% total memory-access-related energy saving, owing to the miss rate reductions and hence the shorter application execution time. Compared with previous techniques that aim at reducing the miss rate of direct-mapped caches, our technique requires only one cycle for all cache hits and has the same access time as a direct-mapped cache.

1. Introduction

The increasing gap between memory latency and processor speed is a critical bottleneck to achieving a high performance computing system. To bridge the gap, a multilevel memory hierarchy is used to hide the memory latency. Level one cache normally resides on a processor's critical path, which determines the clock frequency; therefore fast access to the level one cache is important for processor performance. A conventional direct-mapped cache accesses only one tag array and one data array per access, whereas a set-associative cache accesses multiple tag arrays and data arrays per access. Thus, a direct-mapped cache does not require a multiplexer to combine multiple accessed data items and hence can have a faster access time: a direct-mapped cache is 29.5% and 19.3% faster than a same sized 8-way cache [21] at sizes of 8kB and 16kB, respectively. A direct-mapped cache also consumes less power per access because it accesses only one way instead of multiple ways: 74.7% and 68.8% less power than a same sized 8-way cache at sizes of 8kB and 16kB, respectively. A direct-mapped cache is also simple to design, easy to implement, and occupies less area. However, a direct-mapped cache may have a higher miss rate than a set-associative cache, depending on the access pattern of the executing application, and a higher miss rate means more time spent waiting for the next level of the memory hierarchy. On average, the miss rate of a direct-mapped instruction cache is 29.0% and 100.0% higher, and the miss rate of a direct-mapped data cache is 37.7% and 28.0% higher, than a same sized 8-way cache at sizes of 8kB and 16kB, respectively. Therefore, a direct-mapped cache may or may not result in better overall performance or energy for a particular application.
Ideally, a cache should have the access time of a direct-mapped cache but a miss rate as low as that of a highly associative cache. A set-associative cache has two distinct advantages over a direct-mapped cache: conflict miss reduction and a replacement policy. A set-associative cache reduces conflict misses by choosing the victim from multiple cache blocks, whereas only one cache block can be chosen in a direct-mapped cache. The replacement policy of a set-associative cache can select a better victim by considering the cache access history. This is particularly valuable because memory accesses in a program run are extremely unbalanced, causing some cache sets to be accessed heavily while others remain underutilized. However, accessing a set-associative cache requires a longer access time, more power, and more area. We propose a novel mechanism to provide the benefit of cache block replacement while maintaining the constant access time of a direct-mapped cache. We call this the Balanced Cache, or simply, the B-Cache. The major contributions of our work are as follows:
1. We propose to increase the decoder length of a traditional direct-mapped cache by three bits. This has two effects: (a) Accesses to heavily used sets can potentially be reduced to one eighth of the original design; therefore we neither have to detect those heavily used sets dynamically nor pay the corresponding hardware overhead. (b) Only one eighth of the memory address space has a mapping to the cache sets. We call this the limited memory address mapping.

2. We propose to incorporate a replacement policy into the B-Cache. On a cache miss, when the desired address cannot find a mapping to the cache sets because of the above-mentioned limited address mapping, the B-Cache can increase the accesses to underutilized sets through the replacement policy, without explicitly detecting those underutilized sets dynamically.
3. We propose to use a programmable decoder for the B-Cache, since the B-Cache must dynamically determine which memory address has a mapping to the cache sets after a cache miss.
Using execution-driven simulations, we demonstrate that the B-Cache achieves an average miss rate reduction of 64.5% and 37.8% over all 26 benchmarks from the SPEC2K [13] suite for level one direct-mapped 16kB instruction and data caches, respectively. This translates into an instruction per cycle (IPC) improvement of up to 27.1% and an average IPC improvement of 5.9%. Although the B-Cache consumes 10.5% more power per access than the original direct-mapped cache, it achieves a 2% total memory-access-related energy saving because of the miss rate reductions and hence the reduction in application execution time. Furthermore, compared with other techniques that reduce a direct-mapped cache's conflict misses, the access time of the B-Cache is the same as that of a traditional direct-mapped cache. Lastly, the B-Cache requires only one cycle for all cache hits, while other techniques either need a second cycle for part of the cache hits or have a longer access time than a direct-mapped cache.
Section 2 further motivates the proposed technique. Section 3 describes the organization of the B-Cache. Experimental methodology and results are presented in Section 4. Section 5 describes the programmable decoder design. Performance and energy analysis are presented in Section 6. Related work is discussed in Section 7 and concluding remarks are given in Section 8.

2. Motivation

2.1 The Problem

Memory reference addresses are mapped to cache sets based on index decoding. Because of the well-known locality [20][17] exhibited in both instructions and data, some cache sets are accessed more frequently than others, generating more conflict misses, while other cache sets are underutilized. Substantial research effort has been devoted to reducing the conflict misses of direct-mapped caches. A victim buffer [14] is a small fully-associative cache that can resolve the conflict misses of small direct-mapped caches. However, an extra cycle is required to access the victim buffer when the direct-mapped cache and the buffer are accessed sequentially, or the access time of the cache is prolonged if they are checked concurrently, since a multiplexer is required to select the desired output from the buffer or the cache. The column-associative cache [1], the adaptive group-associative cache [20], the predictive sequential associative cache [5], and the partial address matching cache [18] trade varied hit latencies for the reduced access time of set-associative caches. Multiple hit latencies disrupt the datapath pipeline and complicate the design of pipelined caches [2].

2.2 Example

Figure 1 shows a conventional direct-mapped cache (a), a conventional 2-way cache (b), and the proposed B-Cache (c).
Figure 1: (a) A conventional direct-mapped cache. (b) A conventional 2-way cache. (c) The proposed B-Cache with programmable decoders. PI: programmable index; NPI: non-programmable index. Invalid PD entries during the cache's cold start are marked with x; the two PIs sharing an NPI must differ to maintain unique address decoding.

For simplicity, the cache contains only eight sets and the address contains only eight bits. For the address sequence 0, 1, 8, 9, 0, 1, 8, 9, the direct-mapped cache experiences the worst case of having no cache hits at all. This occurs because the cache accesses are completely non-uniform, which is also the case in real applications [20]. A traditional direct-mapped cache cannot adaptively accommodate these thrashing addresses since the index decoding is fixed. On the other hand, the 2-way cache hits after the first four warm-up accesses. The 2-way cache achieves this high hit rate because its decoder length is one bit shorter than that of the direct-mapped cache. During a miss, the 2-way cache has two candidates for the victim while the direct-mapped cache has just one. Intuitively, decreasing the decoder length creates more opportunities to choose a better victim; for this example, the index lengths of 4-way and 8-way caches would be 1 bit and 0 bits, respectively.

2.3 Solution

Figure 1 (c) shows the proposed B-Cache. Instead of decreasing the decoder length, we increase the decoder length by one bit.

The two most significant index bits are programmable indices feeding the programmable decoders (PD), and the two least significant index bits are non-programmable indices feeding the non-programmable decoders (NPD). The outputs of the two decoders are ANDed together to control the activation of a word line. We discuss how to determine the index length and how to divide the index into programmable and non-programmable parts in Section 6. For this example, the B-Cache exhibits the same hit rate as the 2-way cache. The defining feature of the B-Cache is the PD, which is programmed with the index of each desired address on the fly during a cache refill after a miss. For this simple example, there are two victim candidates, determined by the non-programmable index of an address. The PD is dynamically programmed in the following three situations.
First, during the cache's cold start, the contents of the PDs are invalid and are programmed using the programmable index of each desired address. For addresses that have the same non-programmable index, such as addresses 0 and 8, the victim is chosen using the replacement policy (least recently used replacement is assumed).
Second, the B-Cache misses but the PD hits. For example, this happens when address 25 is accessed after the aforementioned address sequence. The non-programmable index of address 25 (11001) is 01 and, from Figure 1 (c), the corresponding programmable indices in the two PDs are 00 and 10. Since the programmable index of address 25 is 10, the B-Cache has a PD hit. Recall that the B-Cache is a direct-mapped cache and only one cache block is activated during an access, so address 25 must replace address 9. In this situation, the B-Cache cannot choose a better victim based on the access history. If the B-Cache instead replaced address 1 with address 25, then address 9 would have to be evicted as well to maintain unique address decoding; this would inadvertently hurt the hit rate and should be avoided.
Lastly, both the B-Cache and the PD miss. This happens when address 13 is accessed, since the programmable index of address 13 (1101) is 11, which differs from the indices stored in the two PDs, 00 and 10. Neither PD is activated. Note that this cache miss is not an extra miss caused by the limited address mapping; the same situation exists in a fully associative cache that uses the whole tag as the decoding index, and no extra misses are incurred. In fact, the PD predetermines the cache miss, so neither data nor tag is read out from the memory during a PD miss. Since the cache miss signal is true by default, this causes no problem. The victim for address 13 can be chosen from either of the two candidate sets based on the replacement policy.
It is important to point out the difference between the B-Cache and the 2-way cache for this example: for the 2-way cache, both addresses 13 and 25 can be mapped to either of the two sets. Decreasing the PD hit rate during a cache miss improves the B-Cache hit rate, since a low PD hit rate means that the replacement policy can be fully exploited to balance the cache accesses and thus reduce the conflict misses.
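The three situations can be made concrete with a minimal behavioral sketch of the toy cache in Figure 1 (c), assuming LRU replacement as in the text; the class and field names are ours, and real hardware would implement this with CAM-based decoders rather than Python lists.

```python
# Hypothetical behavioral model of the toy example in Figure 1 (8 sets, 8-bit
# addresses, no block offset). Names are illustrative, not from the paper.
class ToyBCache:
    def __init__(self, oi=3, pi=2, npi=2):
        self.pi, self.npi = pi, npi
        self.bas = 2 ** (oi - npi)            # victim candidates per NPI value
        # one entry per physical set: [pd_value, tag, valid, lru_stamp]
        self.sets = {n: [[None, None, False, 0] for _ in range(self.bas)]
                     for n in range(2 ** npi)}
        self.clock = 0

    def access(self, addr):
        self.clock += 1
        npi = addr & ((1 << self.npi) - 1)
        pi = (addr >> self.npi) & ((1 << self.pi) - 1)
        tag = addr >> (self.npi + self.pi)
        cluster = self.sets[npi]
        pd_hit = next((e for e in cluster if e[0] == pi), None)
        if pd_hit and pd_hit[2] and pd_hit[1] == tag:      # B-Cache hit
            pd_hit[3] = self.clock
            return True
        # miss: the victim is forced on a PD hit, otherwise chosen by LRU
        victim = pd_hit if pd_hit else min(cluster, key=lambda e: e[3])
        victim[:] = [pi, tag, True, self.clock]            # reprogram the PD
        return False

seq = [0, 1, 8, 9] * 2
bc = ToyBCache()
print(sum(bc.access(a) for a in seq), "hits out of", len(seq))  # 4, like 2-way
bc.access(25)   # PD hit on a miss: must evict the block holding address 9
bc.access(13)   # PD miss: victim freely chosen among the two candidates
```

Running the sketch reproduces the walkthrough above: four hits on the repeated 0, 1, 8, 9 sequence (matching the 2-way cache), a forced eviction of address 9 when address 25 arrives, and a free LRU choice of victim for address 13.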
3. The B-Cache Organization

3.1 Terminology

The fundamental idea behind the B-Cache is to reduce conflict misses by balancing accesses to the cache sets. The defining property of the B-Cache is that its index is longer than that of a conventional direct-mapped cache.
We define the memory address mapping factor (MF): MF = 2^(PI+NPI) / 2^OI, where MF ≥ 1, and PI, NPI, and OI are the index lengths of the PD and NPD of the B-Cache and of the original direct-mapped cache, respectively. This means that only 1/MF of the total memory addresses have a mapping to the cache sets after the B-Cache's index length is increased. We show in Section 4.3.2 that larger values of MF decrease the PD hit rate during a cache miss.
The B-Cache can reduce the miss rate because multiple cache locations can be chosen for a victim, as shown in Figure 1 (c). We define the B-Cache associativity (BAS): BAS = 2^OI / 2^NPI, where BAS ≥ 1. This means that the B-Cache logically divides the cache sets into BAS clusters, as shown in Figure 2. On a cache miss, the victim can be chosen from these BAS clusters, so the miss rate of the B-Cache approaches the miss rate of a same sized BAS-way cache. The case MF = 1 or BAS = 1 is equivalent to a traditional direct-mapped cache; for the B-Cache, both MF and BAS must be larger than 1.

3.2 Organization

Cache memory is partitioned into a number of identically sized subarrays to trade off access time, area, and power consumption [10][21]. Each subarray stores part of the accessed word, or several words, depending on how the cache memory is divided by word line and bit line cuts. Figure 2 shows the organization of a 16kB direct-mapped cache with a line size of 32 bytes. The address is assumed to have 32 bits. The data memory is partitioned into four subarrays and the tag memory is partitioned into eight subarrays (not shown in the figure) [21]. Each subarray has its own local decoder. The divided word line technique [25] is adopted to achieve both fast access time and low per-access energy consumption. For ease of illustration, we use the two least significant index bits, I1 and I0, as the inputs to the global word line decoder, which selects a subarray. The other seven bits, I8 to I2, select the cache line within a particular subarray. The combination of the global and local word lines locates a particular set for an address. Note that any combination of address bits can be used as the global and local word line decoder inputs; the B-Cache is not restricted to any particular index decoding scheme or cache memory partition. There are optimization opportunities in selecting the address bits for index decoding to lower the miss rate, but indexing optimization [11] is outside the scope of this paper. For the B-Cache, we replace the original local decoders with new decoders. The new decoder includes eight 4×16 NPDs and eight 6×16 PDs. The NPD uses index bits I5 to I2, while the PD uses I8, I7, I6 and three tag bits T2, T1, and T0 as inputs.
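As a concrete check of these definitions for the configuration just described, the sketch below splits a 32-bit address into the fields consumed by the global decoder, the NPD, and the PD, and evaluates MF and BAS. The bit positions follow from the 5-bit block offset and 9-bit index, the helper names are ours, and NPI here is taken to include both the global decoder bits and the local NPD bits.

```python
# Assumed bit layout for the 16kB, 32-byte-line baseline:
# bits [4:0] offset, [13:5] index I0..I8, [31:14] tag (T0..T2 are bits 16:14).
def bcache_fields(addr):
    index = (addr >> 5) & 0x1FF                      # I8..I0 (9 bits)
    tag = addr >> 14                                 # 18-bit tag
    return {
        "global": index & 0x3,                       # I1, I0: selects 1 of 4 subarrays
        "npd": (index >> 2) & 0xF,                   # I5..I2: 4x16 non-programmable decoder
        "pd": ((index >> 6) << 3) | (tag & 0x7),     # I8..I6 + T2..T0: 6-bit PD search key
    }

def mf(pi, npi, oi):                                 # MF = 2^(PI+NPI) / 2^OI
    return 2 ** (pi + npi) / 2 ** oi

def bas(npi, oi):                                    # BAS = 2^OI / 2^NPI
    return 2 ** oi / 2 ** npi

# PI = 6 (I8..I6 plus T2..T0), NPI = 6 (2 global + 4 local NPD bits), OI = 9.
print(mf(6, 6, 9), bas(6, 9))                        # -> 8.0 8.0
```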

For this particular design, the B-Cache's two parameters are MF = 8 and BAS = 8. We determine MF and BAS through experimentation and thus defer the discussion of how to choose the B-Cache parameters to Sections 4.3.1 and 4.3.2.

3.3 The B-Cache's Replacement Policy

The B-Cache increases the accesses to underutilized sets and reduces conflict misses through its replacement policy. Any replacement policy available to conventional set-associative caches is applicable to the B-Cache. We evaluate both random replacement and the least recently used (LRU) policy for the B-Cache design. The random policy is simple to design and needs trivial extra hardware; LRU may achieve a better hit rate but has more area overhead than the random policy. Designing a new replacement policy specifically for the B-Cache, one that detects the underutilized sets and chooses the victim accordingly, might reduce the miss rate further. However, we show in Section 4.3.3 that the B-Cache's miss rate reduction approaches that of an 8-way cache, which makes further miss rate reduction through novel replacement policies unimportant.

4. Experimental Methodology and Results

We use miss rate as the primary metric for the B-Cache's effectiveness, collected with a 4-issue out-of-order processor simulator. We determine the B-Cache parameters MF and BAS through experimentation. Overall performance improvement and energy are discussed in Section 6.

4.1 Cache Hierarchy

The baseline level one cache is a direct-mapped 16kB cache with a line size of 32 bytes for both the instruction and data caches. We use a unified 4-way 256kB L2 cache with LRU replacement and a hit latency of 6 cycles. Results for level one cache sizes of 8kB and 32kB are also collected to show the effectiveness of the B-Cache.

4.2 Benchmarks

We ran all 26 SPEC2K benchmarks using the SimpleScalar tool set [3]. The benchmarks are pre-compiled for the Alpha ISA and were obtained from the SimpleScalar developers [24]. The benchmarks were fast-forwarded for two billion instructions and then executed for 500 million instructions using reference inputs. For the data cache, all results are reported. For the instruction cache, the results of benchmarks whose miss rates are less than 0.01% are not reported (to save space in the plots), since further reducing the miss rate is unlikely to matter for them; these benchmarks include applu, art, bzip, facerec, galgel, gzip, lucas, mcf, mgrid, swim, and vpr. The overall performance and energy improvements, however, are reported for all 26 benchmarks.

4.3 Experimental Results

Figure 4 and Figure 5 show the relative miss rate reductions of nine cache configurations compared to the baseline. The data cache results are reported separately for the CINT2K (integer) and CFP2K (floating point) components of SPEC2K. Four conventional set-associative cache configurations, 2-way, 4-way, 8-way, and 32-way, are included to show their miss rate reductions over the baseline. The miss rate reduction of a 16-entry victim buffer is also included and is compared with the B-Cache design in Section 6.6. The other four configurations are the B-Cache at memory address mapping factor MF = 2, 4, 8, or 16 with B-Cache associativity BAS = 8. The bar marked Ave is the average of the miss rate reductions over all the benchmarks.
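For reference, the reduction metric reported in Figures 4 and 5 (and later in Figure 12), and the unweighted averaging behind the Ave bar, can be written as the following small helpers; the function names are ours.

```python
# Miss rate reduction of a configuration relative to the direct-mapped baseline,
# and the unweighted per-benchmark average shown as the "Ave" bar.
def miss_rate_reduction(baseline_miss_rate, config_miss_rate):
    return 100.0 * (baseline_miss_rate - config_miss_rate) / baseline_miss_rate

def ave_bar(per_benchmark_reductions):
    return sum(per_benchmark_reductions) / len(per_benchmark_reductions)

# e.g. a benchmark whose miss rate drops from 4% to 1.5% shows a 62.5% reduction
print(miss_rate_reduction(0.04, 0.015))
```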
4.3.1 The B-Cache Associativity - BAS

We determine the B-Cache associativity, BAS, through experimentation. From Figures 4 and 5, we can see that increasing the associativity beyond eight does not bring significant miss rate reductions. On average, the miss rate reduction of the 32-way cache is only 1% higher than that of the 8-way cache, with one exception: the benchmark perlbmk shows a 20% improvement over the 8-way cache. On the other hand, the 8-way cache outperforms the 4-way cache by 2.6% and 4.6% for the data cache on CINT2K and CFP2K, respectively, and by 7.9% for the instruction cache on the reported benchmarks. For some benchmarks (crafty, eon, equake, gap, and twolf), the 8-way cache has more than a 10% miss rate reduction over a 4-way instruction cache; the same holds for the data cache on crafty and fma3d.

Figure 2: B-Cache organization. The 32-bit address is divided into an 18-bit tag, a 9-bit index, and a 5-bit offset. The 2-bit global decoder (I1, I0) drives the global word lines to select one of the four data subarrays; in each subarray the original local decoder is replaced by a new decoder consisting of eight 6×16 programmable decoders (PD) and eight 4×16 non-programmable decoders (NPD), and the sets are logically divided into clusters 1 through 8.

Since the miss rate of the B-Cache with associativity BAS = 8 approaches the miss rate of an 8-way cache, we choose BAS = 8. Increasing BAS to 16 or higher would not improve the miss rate reductions significantly but may incur more overhead in the PDs, which is discussed in Section 5.

4.3.2 Memory Address Mapping Factor - MF

We determine the optimal MF through experimentation. From Figures 4 and 5, as MF is increased from 2 to 4 to 8, the miss rate reduction of the B-Cache for the data cache on CINT2K increases from 18.3% to 31.3% to 38.1%, and for CFP2K from 21.8% to 31.1% to 36.4%. Continuing to increase MF to 16 improves the miss rate reduction by a mere 1.7% and 1.0% for CINT2K and CFP2K, respectively. For the instruction cache, the miss rate reductions for MF = 2, 4, and 8 are 33.2%, 53.4%, and 64.7%, respectively, and the miss rate is reduced by merely 0.4% more when MF is increased to 16. For some benchmarks, such as wupwise, facerec, galgel, and sixtrack for the data cache, the miss rate reduction of the B-Cache is lower than that of the 4-way cache. We therefore increase MF and measure the B-Cache's miss rate and PD hit rate during cache misses for the benchmark wupwise; the results are shown in Figure 3. The miss rate decreases as the decoder hit rate does, because the B-Cache cannot fully exploit the replacement policy when there is a PD hit.

Figure 3: Data cache miss rate (left) and PD hit rate (right) of the benchmark wupwise at a cache size of 16kB for varied MF values.

When MF is increased from 32 to 64, the PD hit rate drops sharply, and so does the cache miss rate. However, a larger MF means more bits in the PD and thus more overhead in terms of area and access time. Considering the access time analysis of the B-Cache in Section 5.1, we choose a memory address mapping factor MF = 8.

4.3.3 Miss Rate Reduction of the B-Cache

The upper bound on the miss rate reduction of the B-Cache with BAS = 8 is the miss rate reduction of an 8-way cache, and it approaches this upper bound as the memory address mapping factor MF increases.

Figure 4: Data cache miss rate reductions of a 2-way, 4-way, 8-way, and 32-way cache, a 16-entry victim buffer, and the B-Cache with MF = 2, 4, 8, or 16 and BAS = 8, compared to the baseline (replacement policy is LRU).

Figure 5: Instruction cache miss rate reductions of a 2-way, 4-way, 8-way, and 32-way cache, a 16-entry victim buffer, and the B-Cache with MF = 2, 4, 8, or 16 and BAS = 8, compared to the baseline (replacement policy is LRU).

From Figure 4 and Figure 5, the miss rate reduction of the B-Cache for the data cache is as good as that of a 4-way cache; for the instruction cache, the miss rate reduction is on average 5% better than a 4-way cache.

5. Programmable Decoder Design

In this section we evaluate the latency, storage, and power costs associated with the B-Cache. We use HSPICE simulation to measure the PD's access time. Storage is measured in terms of equivalent SRAM bits. Both HSPICE and Cacti v3.2 are used to model the B-Cache's access time and energy.

5.1 Timing Analysis

The critical path of a conventional direct-mapped cache normally resides on the tag side instead of the data side [19]. The B-Cache incorporates three tag bits into the PD; therefore the tag length is reduced by three bits, which reduces the tag side access time. However, we do not claim this access time reduction, since the data and tag paths may already be balanced [19] so that the data side is on the critical path. The B-Cache modifies the local decoder without changing the global decoder. Therefore, the B-Cache's local decoder should run faster than, or at least as fast as, the original local decoder of each subarray to avoid increasing the access time.

Figure 6: Decoder timing analysis. (a) The original decoder and the B-Cache's 4×16 NPD. (b) The original and the B-Cache's eight 4×16 NPDs. (c) Search bitline segmentation.

The B-Cache decoder consists of a PD and an NPD, as shown in Figure 6 (a). The original decoder is composed of two 2-input NAND gates and one 3-input NAND gate, as shown in Figure 6 (b); the outputs of these NAND gates are ORed through a 3-input NOR gate. The NPD in the B-Cache decoder has two 2-input NAND gates whose outputs are ORed through a 2-input NOR gate. The PD of the B-Cache is composed of eight 6×16 CAMs. We use standard ten-transistor CAM cells; each CAM cell contains an SRAM cell and a dynamic XOR gate for comparison. The search line and bit line are separated to reduce the capacitive load. The matchlines of the CAM cells are precharged high and conditionally discharged on a mismatch. Therefore, we need an AND gate to combine the outputs of the NPD and PD of the B-Cache. Inserting a new AND gate would increase the access time, so the technique proposed in [28] is employed, which changes the inverter in the original wordline driver to a 2-input NAND gate. The transistor sizes of the 2-input NAND are selected to make the NAND gate as fast as the original inverter, so it incurs no access time overhead. With this technique, as long as there is time slack between the B-Cache decoder and the original decoder, the B-Cache does not lengthen the cache's access time. Table 1 shows the access time for decoders of 8×256, 7×128, 6×64, 5×32, and 4×16, which correspond to subarray sizes of 8kB, 4kB, 2kB, 1kB, and 512 bytes with a cache line size of 32 bytes, respectively. As far as we know, no level one cache has a subarray size larger than 8kB or smaller than 512 bytes [21]. For each original decoder, the corresponding B-Cache decoder is shown. The access time of the conventional decoders is calculated with Cacti 3.2, while the access time of the PD of the B-Cache is obtained from HSPICE simulation of circuits extracted from our own CAM layout using Cadence tools [4]. The area of the CAM cell is 25% larger than that of the SRAM cell.
Our HSPICE simulation results are comparable with those reported by other researchers [8].

Table 1: Timing analysis of the B-Cache decoder. PD: programmable decoder; NPD: non-programmable decoder. Decoder compositions are given in NAND and NOR gates; for example, 2D_3R stands for 2-input NAND gates followed by a 3-input NOR gate.

Decoder                 8×256    7×128    6×64     5×32     4×16
Original composition    3D-3R    3D-3R    2D-3R    3D-2R    2D-2R
B-Cache PD              CAM      CAM      CAM      CAM      CAM
B-Cache NPD             3D-2R    2D-2R    NAND3    NAND2    INV

In the B-Cache's 4×16 NPD, the NOR3 is replaced by a NOR2 and the NAND3 gates are removed.

From this table, we can see that all of the decoders have time slack left; therefore, the B-Cache has no access time overhead compared with a conventional direct-mapped cache. One may ask how the B-Cache's decoder can be as fast as that of the original direct-mapped cache given that the B-Cache's index is three bits longer than the original index. Compared with the baseline, the B-Cache's 4×16 NPD is faster than the original decoder because the NOR3 gates are replaced by faster NOR2 gates, and the critical path of the decoder, for this example, changes from the 3-input NAND gates to the 2-input NAND gates. We should point out, however, that the B-Cache's 4×16 NPD is much slower than the 4×16 decoder of the original direct-mapped cache with a subarray size of 512 bytes, as shown in Table 1. This is because they have different subarray dimensions and output loads: the 2-input NAND gate in the B-Cache's 4×16 NPD has a fan-out of 8 × 4 = 32 gates, while the fan-out of the 4×16 decoder in the conventional direct-mapped cache is 4 gates. This also explains why the decoder in the conventional 16kB direct-mapped cache is not itself split into two smaller decoders combined through the NAND gate (changed from the inverter): the performance improvement would be less than 0.5%, at an area overhead of 128 2-input NOR gates and enlarged inverters. To speed up the CAM comparison, the search bit lines of the CAM are segmented as shown in Figure 6 (c), similar to the technique proposed in [10]. Each search bit line needs 9 extra inverters; for this 16kB direct-mapped cache, a total of 648 inverters is required, which represents a small fraction of the total area.

5.2 Different Partitions of Tag and Data Memory

Tag and data memory typically use different memory partitions to achieve the best tradeoff among performance, power, and area. For example, for this 16kB direct-mapped cache the data memory is partitioned into four subarrays while the tag memory is partitioned into eight subarrays [14]. Figure 7 shows the B-Cache decoder design for the tag and data memory subarrays. For the tag decoder, the eight subarrays need 3 bits for the global decoder, while the global data decoder requires 2 bits for four subarrays. Since both the tag and data decoders must have the same B-Cache parameters, MF = 8, BAS = 8, and programmable index length PI = 6, the NPD length of the B-Cache is 4 bits for the data decoder and 3 bits for the tag decoder. From Table 1 we can see that these decoders' access times are faster than those of the original design.

Figure 7: PD design when the memory partitions of tag and data differ. PD: programmable decoder; NPD: non-programmable decoder. (a) Data memory partition: 2-bit global index, 4-bit NPD, 6-bit PD. (b) Tag memory partition: 3-bit global index, 3-bit NPD, 6-bit PD. The four-input AND gates are implemented as two stages of NAND and NOR gates.

5.3 Storage Overhead

The additional hardware for the B-Cache is the CAM-based PD. The area of a CAM cell [8] is 25% larger than that of the SRAM cell used by the data and tag memory. There are sixty-four 6×8 and thirty-two 6×16 CAMs. The total storage requirements of both the baseline and the B-Cache are given in Table 2.

Table 2: Storage cost analysis.

            Tag decoder       Tag memory      Data decoder      Data memory
Baseline    no memory cells   20 bit × 512    no memory cells   256 bit × 512
B-Cache     64 6×8 CAMs       17 bit × 512    32 6×16 CAMs      256 bit × 512
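A back-of-the-envelope check of Table 2, assuming the 25% CAM-to-SRAM cell area factor quoted above is applied to the CAM bits, reproduces the overhead figure quoted next; the script is our arithmetic, not the paper's tooling.

```python
# Equivalent-SRAM-bit storage estimate for the 16kB baseline and the B-Cache
# (Table 2), assuming each CAM bit costs 1.25 SRAM bits (25% larger cell).
SETS = 512                                  # 16kB / 32B lines

baseline = 20 * SETS + 256 * SETS           # tag memory + data memory

cam_bits = 64 * 6 * 8 + 32 * 6 * 16         # tag PDs + data PDs
bcache = 17 * SETS + 256 * SETS + cam_bits * 1.25

overhead = (bcache - baseline) / baseline
print(f"{overhead:.1%}")                    # ~4.3%
```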
The overhead of the B-Cache increases the total cache area of the baseline by 4.3%, which is less than that of a same sized 4-way cache, whose area is 7.98% larger than the baseline even without counting the area used to implement the replacement policy [21].

5.4 Power Overhead

The extra power consumption comes from the PD of each subarray; the data and tag memory have four and eight subarrays, respectively. We measure the power consumption of the PD using HSPICE simulation in a 0.18um technology. A 6×8 and a 6×16 CAM decoder consume 0.78pJ and 1.62pJ per search, respectively. The B-Cache uses sixty-four 6×8 CAMs for the tag PDs and thirty-two 6×16 CAMs for the data PDs. We calculate the power reductions due to the 3-bit tag length reduction and the removal of the 3-input NAND gates in the original tag and data decoders using Cacti 3.2. The energy per cache access of both the baseline and the B-Cache is shown in Table 3. The power consumption of the B-Cache is 10.5% higher than the baseline; however, it is still 17.4%, 44.4%, and 65.5% lower than same sized 2-way, 4-way, and 8-way caches, respectively. As described in Section 6.2, the overall energy analysis shows that the B-Cache reduces the overall memory related energy consumption by 2% relative to the baseline, owing to the reduced miss rate and hence the reduction in application execution time.

Table 3: Energy (pJ) per cache access for the baseline and the B-Cache. T: tag; D: data; SA: sense amplifier; Dec: decoder; BL: bitline; WL: wordline. Columns: T-SA, T-Dec, T-BL-WL, D-SA, D-Dec, D-BL-WL, D-others, and Total.

6. Analysis

In this section, we present the impact of the B-Cache on overall processor performance and energy consumption. We also discuss the tradeoffs among the B-Cache design parameters, evaluate the balance of the B-Cache, and compare the B-Cache with other related techniques. Finally, we discuss the issues involving virtually or physically addressed tagged caches.

6.1 Overall Performance

Table 4 shows the processor configuration for both the baseline and the proposed B-Cache. Figure 8 shows the performance improvements, measured in IPC, of a processor with a level one B-Cache, an 8-way cache, and a 16-entry victim buffer over the baseline processor. The processor with the B-Cache outperforms the baseline by an average of 5.9%. The greatest performance improvement is seen in equake, where the IPC increases by 27.1%. The performance of the B-Cache is only 0.3% lower than that of the 8-way cache but is 3.7% higher than that of the victim buffer. Although the victim buffer outperforms the B-Cache in data cache miss rate reduction for the benchmark wupwise, the overall performance of the B-Cache is still higher, since the B-Cache outperforms the victim buffer by 50% in miss rate reduction for the instruction cache, as shown in Figure 5. It is important to point out that both the B-Cache and the victim cache have a faster access time than the set-associative caches.

Table 4: Baseline and B-Cache processor configuration.
Fetch/Issue/Retire Width: 4 instructions/cycle, 4 functional units
Instruction Window Size: 16 instructions
L1 cache: 16kB, 32B line size, direct mapped
L2 Unified Cache: 256kB, 128B line size, 4-way, 6 cycle hit
Main Memory: infinite size, 100 cycle access

Figure 8: Performance improvement of a 2-way, 4-way, 8-way, the B-Cache, and a 16-entry victim buffer over the baseline at a cache size of 16kB (only the first four letters of each benchmark name are shown to save space).

Figure 9: Overall energy of a 2-way, 4-way, 8-way, the B-Cache, and a 16-entry victim buffer, normalized to the baseline, at a cache size of 16kB (only the first four letters of each benchmark name are shown to save space).

6.2 Overall Energy

There are two main sources of power dissipation in CMOS circuits: static power dissipation due to leakage current, and dynamic power dissipation due to logic switching current and the charging and discharging of load capacitance. We consider both types of energy in our evaluation. We evaluate the memory related energy consumption [28], including the on-chip caches described in Section 4.1 and the off-chip memory. Figure 10 lists the equations for computing the total memory related energy consumption; the italic terms are those we obtain through measurements or simulations. We compute cache_access, cache_miss, and cycles by running SimpleScalar simulations for each cache configuration. We compute E_cache_access and E_cache_block_refill using Cacti 3.2 for both the level one and level two caches. E_next_level_mem includes the energy of accessing the level two cache and the energy of accessing off-chip memory.
E_static_per_cycle is the total static energy consumed per cycle. Both the energy of accessing off-chip memory and E_static_per_cycle are highly system dependent. Using a methodology similar to [28], we evaluate the energy of accessing off-chip memory as 100 times larger than that of the baseline.

E_mem = E_dyn + E_static
E_dyn = cache_access * E_cache_access + cache_miss * E_misses
E_misses = E_next_level_mem + E_cache_block_refill
E_static = cycles * E_static_per_cycle
E_static_per_cycle = k_static * E_total_per_cycle

Figure 10: Equations for the energy evaluation.
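The model in Figure 10 can be restated directly in code; the sketch below uses placeholder inputs where the paper obtains values from Cacti 3.2 and SimpleScalar runs.

```python
# Total memory-access-related energy, following the equations in Figure 10.
# All e_* inputs are per-event energies (e.g. in pJ); the example numbers are
# placeholders, not values from the paper. The paper sets e_static_per_cycle
# so that static energy is 50% of the total (k_static = 0.5).
def memory_energy(cache_access, cache_miss, cycles,
                  e_cache_access, e_cache_block_refill, e_next_level_mem,
                  e_static_per_cycle):
    e_misses = e_next_level_mem + e_cache_block_refill
    e_dyn = cache_access * e_cache_access + cache_miss * e_misses
    e_static = cycles * e_static_per_cycle
    return e_dyn + e_static

total = memory_energy(cache_access=1_000_000, cache_miss=20_000, cycles=1_500_000,
                      e_cache_access=1.0, e_cache_block_refill=4.0,
                      e_next_level_mem=10.0, e_static_per_cycle=0.5)
print(total)
```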

We evaluate the static energy as 50% (k_static) of the total energy, which includes both dynamic and static energy. Figure 9 shows the energy of the 2-way, 4-way, and 8-way caches, the B-Cache, and the victim buffer, normalized to the baseline. On average, the B-Cache consumes the least energy, 2% less than the baseline. The greatest energy reduction is seen in crafty, where the energy is reduced by 14%. The B-Cache reduces the miss rate and hence the accesses to the second level cache, which is more power costly. During a cache miss, the B-Cache also reduces the cache memory accesses through the miss prediction of the PD, which makes the effective power overhead much smaller. In Section 6.3, we show that the PD can predict on average around 80% of the cache misses, thus saving the power of accessing both the data and the tag during those misses.

6.3 Design Tradeoffs for MF and BAS for a Fixed Length of PD

For a fixed PD length, such as PD = 4, the B-Cache has two design options, as shown in Figure 11. In design A, the B-Cache uses MF = 2 and BAS = 8; in design B, it uses MF = 4 and BAS = 4. The question is which design gives a higher miss rate reduction. In design A, the B-Cache has eight clusters and the miss rate reduction would approach that of an 8-way cache; however, MF = 2 means that the PD hit rate is high, so the B-Cache cannot fully exploit the replacement policy and the miss rate reduction may be low. Table 5 and Table 6 show the miss rate reductions and PD hit rates of the B-Cache at various MF, BAS, and PD combinations. For the same PD length, design B (BAS = 4) outperforms design A (BAS = 8) in miss rate reduction when PD is less than 6, because the PD hit rate of design B is lower than that of design A (design B has a larger MF). However, when PD is increased to 6, the PD hit rates are low for both designs and the effect of having more clusters becomes dominant, so design A achieves a higher miss rate reduction than design B. Based on our timing analysis in Section 5.1, we can use a 6-bit PD without incurring access time overhead; therefore we design the B-Cache with MF = 8 and BAS = 8. This result also presents the design options for other PD lengths and their corresponding miss rate reductions.

Table 5: Miss rate reductions of the B-Cache compared with a direct-mapped cache at varied MF (2, 4, 8, 16), BAS, and PD.

Table 6: PD hit rate during cache misses of the B-Cache at varied MF (2, 4, 8, 16), BAS, and PD.

Figure 11: Tradeoff of MF, BAS, and PD: (A) MF = 2, BAS = 8, PD = 4 and (B) MF = 4, BAS = 4, PD = 4, relative to the index of the original direct-mapped cache.
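One way to see this tradeoff is that the definitions in Section 3.1 imply MF × BAS = 2^PI, so fixing the PD length fixes the product of the two parameters; the helper below (ours) enumerates the resulting options for the 16kB baseline with OI = 9.

```python
# For a fixed programmable-index length PI, MF = 2^(PI+NPI)/2^OI and
# BAS = 2^OI/2^NPI imply MF * BAS = 2^PI, so the design space for one PD
# length is the set of factorizations of 2^PI into powers of two.
def design_options(pd_bits, oi=9):
    options = []
    for log_bas in range(1, pd_bits):          # require MF > 1 and BAS > 1
        bas = 2 ** log_bas
        mf = 2 ** pd_bits // bas
        npi = oi - log_bas                      # total non-programmable index bits
        options.append({"MF": mf, "BAS": bas, "NPI": npi})
    return options

print(design_options(4))   # PD = 4 -> design A (MF=2, BAS=8), design B (MF=4, BAS=4), plus (MF=8, BAS=2)
print(design_options(6))   # PD = 6 includes the chosen point MF = 8, BAS = 8
```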
6.4 Balance Evaluation

We classify a set as a frequent hit or frequent miss set when the cache hits or misses occurring in that set are more than 2 times the average. We classify a set as a less accessed set when the total accesses to the set are less than half of the average. We measure the frequent hit, frequent miss, and less accessed sets of the original direct-mapped cache and the B-Cache and show the results in Table 7 for the baseline data cache; similar results hold for the instruction cache but are not shown due to space limits. Comparing the B-Cache with the baseline, we find that the number of frequent hit sets remains almost unchanged but accounts for 39.8% of the total hits instead of 57.2%, a 17.4% reduction, which means that cache hits are spread across more sets. The frequent miss sets are reduced from 5.6% to 2.2% of the sets, and the misses occurring in them drop from 36.5% to 15.7% of the total, explaining the conflict miss reductions of the B-Cache. The less accessed sets are reduced from 50.2% to 32.4%, which means that more cache sets are used efficiently to accommodate the cache accesses.
From Figures 4 and 5 we observe that the miss rate reductions of the benchmarks art, lucas, swim, and mcf are less than 10% for both the set-associative caches and the B-Cache. From Table 7, we observe that these benchmarks have no frequent miss sets, which means that their cache misses are spread evenly over all cache sets; continuing to balance the accesses to the sets would not bring significant miss rate reductions. For other benchmarks, such as equake, the miss rate reductions are higher than 80%: the frequent miss sets are reduced from 5.5% to 2.3%, but the share of total cache misses occurring in the frequent miss sets drops from 76.9% to 7.0%, which means that the conflicting addresses have largely been remapped to less accessed sets, thus reducing the conflict misses. The B-Cache still has many cache sets that are less accessed. This is because we try to minimize the cache misses and remap missed addresses to less accessed sets: on average, the cache miss rate is 1% and 9.2% for the instruction and data caches, respectively, so even if we remapped all the cache misses to less accessed sets, those sets would still be accessed much less often than the frequent hit sets. Based on the above discussion, other cache design techniques that take advantage of non-uniform cache accesses, such as Drowsy cache [9] and Cache decay [16], can still be used on the B-Cache, since the less accessed sets can still be put into a drowsy state.
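The set classification used in this evaluation maps directly to code; the sketch below (our naming) labels each set from per-set hit and miss counts using the twice-the-average and half-the-average thresholds defined above.

```python
# Classify cache sets as frequent-hit, frequent-miss, or less-accessed,
# given per-set hit and miss counts gathered from a simulation run.
def classify_sets(hits, misses):
    n = len(hits)
    avg_hits = sum(hits) / n
    avg_misses = sum(misses) / n
    avg_accesses = (sum(hits) + sum(misses)) / n
    labels = []
    for h, m in zip(hits, misses):
        tags = set()
        if h > 2 * avg_hits:
            tags.add("frequent_hit")
        if m > 2 * avg_misses:
            tags.add("frequent_miss")
        if (h + m) < 0.5 * avg_accesses:
            tags.add("less_accessed")
        labels.append(tags)
    return labels

# e.g. fraction of sets that are frequent-miss sets:
# sum("frequent_miss" in t for t in classify_sets(hits, misses)) / len(hits)
```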

Table 7: Data cache memory access behavior. fhs: frequent hit sets; fms: frequent miss sets; ch: cache hits; cm: cache misses; las: less accessed sets; tca: total cache accesses. The table is read as follows: for benchmark ammp at the baseline, for example, the 6.8% of sets that are frequent hit sets account for 54.3% of the cache hits; 54.9% of the total cache misses occur on 6.8% of the cache sets; and the 59.8% of sets that are less accessed account for 14.4% of the total cache accesses.

6.5 The Effect of L1 Cache Sizes

Figure 12 shows the miss rate reductions of twelve cache configurations: the B-Cache at memory address mapping factor MF = 2, 4, 8, or 16 with B-Cache associativity BAS = 4 or 8, the conventional 2-way, 4-way, and 8-way caches at sizes of 8kB and 32kB, and a 16-entry victim buffer. From Figure 12, the B-Cache exhibits miss rate reductions at 8kB and 32kB similar to those for the 16kB direct-mapped caches. The miss rate reductions grow as MF is increased from 2 to 16, for both BAS = 4 and BAS = 8. For the same PD length of 6 bits, the B-Cache with MF = 8 and BAS = 8 has a higher miss rate reduction than the design with MF = 16 and BAS = 4. Therefore, we conclude that the design with MF = 8 and BAS = 8 is the best B-Cache configuration for cache sizes of 8kB, 16kB, and 32kB.

Figure 12: Miss rate reductions of the B-Cache at cache sizes of 32kB and 8kB (data and instruction caches).

6.6 Comparison with a Victim Buffer

We compare the B-Cache with a 16-entry victim buffer with a line size of 32 bytes; a victim buffer with more than 16 entries may not bring significant additional miss rate reduction but may increase the buffer's access time and energy. The miss rate reductions of the buffer for both the instruction and data caches are shown in Figures 4, 5, and 12 at cache sizes of 8kB, 16kB, and 32kB. For the 16kB baseline, the average miss rate reduction of the B-Cache is 13.6% and 20.7% higher than that of the victim buffer for CINT2K and CFP2K, respectively, and 37.9% higher for the instruction cache; only one benchmark, wupwise, has a lower miss rate reduction than with the victim buffer. On average, the miss rate reduction of the B-Cache is 14.4% and 52.3% higher than the victim buffer for the 32kB data and instruction caches, respectively, and 15.3% and 17.2% higher for the 8kB data and instruction caches.

6.7 Comparison with a Highly Associative Cache

The highly associative cache (HAC) [22] has been proposed for low-power embedded systems. To reduce power consumption, the HAC is aggressively partitioned with a subarray size of 1kB and only one subarray is accessed per reference; therefore the search of the CAM tags has to wait for the completion of the global decoding, which lengthens the access time of the HAC. In fact, the HAC is an extreme case of the B-Cache in which the decoder is fully programmable. For a 16kB HAC with a line size of 32 bytes and an associativity of 32, the CAM tag (also the PD) length of the HAC is 23 + 3 (status) = 26 bits, while the B-Cache uses only 6 bits of CAM to achieve similar miss rate reductions. Therefore, the HAC could be improved using the technique we propose, reducing both the power consumption and the area of the CAM.

6.8 Issues on Virtually/Physically Addressed Tagged Caches

Our scheme requires a decoder that is three bits longer than the baseline, and the extra three bits come from the tag bits of the original design. These three tag bits are required no later than the set index. If the tag, but not the set index, must first be translated by a translation lookaside buffer (TLB), there is a problem, since the programmable index lookup cannot proceed. This situation arises in a virtually indexed, physically tagged (V/P) cache, as in the PowerPC [12]. A similar problem exists for the skewed-associative cache [23] and the way-halting cache [29], where the lowest four bits of the virtual tags from the processor must equal the lowest four bits of the physical tags stored in the cache, which means those tag bits need no translation. For the B-Cache, only the lowest three bits of the tag are required to be the same as those stored in the tag memory, and we may simply treat these three bits as part of the virtual index. The B-Cache works under the other tag and data array addressing schemes, using either the virtual or the physical address: virtually indexed, virtually tagged; physically indexed, virtually tagged; and physically indexed, physically tagged caches.

7. Related Work

The related work can generally be categorized into two types: techniques that reduce the miss rate of direct-mapped caches and techniques that reduce the access time of set-associative caches.

7.1 Reducing the Miss Rate of Direct-Mapped Caches

Techniques that resolve the conflict misses of direct-mapped caches with the help of the operating system have been proposed. Page allocation [7] can be optimized to reduce the conflict misses of a direct-mapped cache with operating system involvement: a Cache Miss Lookaside buffer detects conflict misses by recording a history of cache misses, and a software policy implemented in the operating system removes them by dynamically remapping pages whenever large numbers of conflict misses are detected. This technique enables a direct-mapped cache to perform nearly as well as a two-way set-associative cache; the B-Cache is implemented entirely in hardware, with miss rate reductions approaching those of a 4-way cache. The column-associative cache [1] uses a direct-mapped cache and an extra bit for dynamically selecting alternate hashing functions. This design improves the miss rate to that of a 2-way cache at the cost of an extra rehash bit and a multiplexer (for address generation) that could affect the critical path of a cache hit.
The column-associative cache can be extended to include multiple alternative locations, as described in [6][30]. The B-Cache achieves higher miss rate reductions while maintaining the constant latency of a direct-mapped cache. Peir et al. [20] use cache space intelligently by taking advantage of the cache holes that appear during the execution of a program; their adaptive group-associative cache (AGAC) dynamically allocates data to these holes and thereby reduces the conflict misses of a direct-mapped cache. Both the AGAC and the B-Cache achieve miss rate reductions comparable to a 4-way cache. However, the AGAC needs three cycles to access the relocated cache lines, which account for 5.24% of the total cache hits, while the B-Cache needs one cycle for all cache hits. The skewed-associative cache [23] is a 2-way cache that exploits two or more indexing functions, derived by XORing two m-bit fields of an address to generate an m-bit cache index, to achieve a miss rate close to that of a same sized 4-way cache. The B-Cache is a direct-mapped cache with a faster access time and achieves the same miss rate reductions as the skewed-associative cache. Our earlier work [26][27] could only balance the accesses to instruction caches; the proposed B-Cache is a complete overhaul of that design that balances the accesses to both instruction and data caches.

7.2 Reducing the Access Time of Set-Associative Caches

Partial address matching [18] reduces the access time of set-associative caches by predicting the hit way. The tag memory is separated into two arrays, a Main Directory (MD) and a Partial Address Directory (PAD). The PAD contains only a small part of the full tag bits (e.g., 5 bits), so the PAD comparison is faster than a full tag comparison; the result of the PAD comparison predicts the hit way, while the result of the MD comparison verifies the hit. If the PAD prediction is incorrect, a second cycle is required to access the correct way. The difference-bit cache [15] is a two-way set-associative cache with an access time close to that of a direct-mapped cache: the bit in which two tags differ is determined dynamically by a special decoder and used to select the potential hit way from the two ways. Compared with the partial address matching method, our technique never requires an extra cycle to fetch the desired data after a misprediction in the PAD comparison. The access time of the difference-bit cache is slower than that of the B-Cache; furthermore, the B-Cache can achieve a miss rate as low as a traditional 4-way cache, while the difference-bit cache cannot do better than a 2-way cache. Compared with the previous techniques, the B-Cache can be applied to both high performance and low-power


More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

Circuit and Microarchitectural Techniques for Reducing Cache Leakage Power

Circuit and Microarchitectural Techniques for Reducing Cache Leakage Power Circuit and Microarchitectural Techniques for Reducing Cache Leakage Power Nam Sung Kim, Krisztián Flautner, David Blaauw, and Trevor Mudge Abstract On-chip caches represent a sizable fraction of the total

More information

Comparative Analysis of Contemporary Cache Power Reduction Techniques

Comparative Analysis of Contemporary Cache Power Reduction Techniques Comparative Analysis of Contemporary Cache Power Reduction Techniques Ph.D. Dissertation Proposal Samuel V. Rodriguez Motivation Power dissipation is important across the board, not just portable devices!!

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES Shashikiran H. Tadas & Chaitali Chakrabarti Department of Electrical Engineering Arizona State University Tempe, AZ, 85287. tadas@asu.edu, chaitali@asu.edu

More information

250 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 2, FEBRUARY 2011

250 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 2, FEBRUARY 2011 250 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 2, FEBRUARY 2011 Energy-Efficient Hardware Data Prefetching Yao Guo, Member, IEEE, Pritish Narayanan, Student Member,

More information

AN ENERGY EFFICIENT TCAM ENHANCED CACHE ARCHITECTURE. A Thesis JASON MATHEW SURPRISE

AN ENERGY EFFICIENT TCAM ENHANCED CACHE ARCHITECTURE. A Thesis JASON MATHEW SURPRISE AN ENERGY EFFICIENT TCAM ENHANCED CACHE ARCHITECTURE A Thesis by JASON MATHEW SURPRISE Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

Drowsy Instruction Caches

Drowsy Instruction Caches Drowsy Instruction Caches Leakage Power Reduction using Dynamic Voltage Scaling and Cache Sub-bank Prediction Nam Sung Kim, Krisztián Flautner, David Blaauw, Trevor Mudge {kimns, blaauw, tnm}@eecs.umich.edu

More information

Parallel Computing 38 (2012) Contents lists available at SciVerse ScienceDirect. Parallel Computing

Parallel Computing 38 (2012) Contents lists available at SciVerse ScienceDirect. Parallel Computing Parallel Computing 38 (2012) 533 551 Contents lists available at SciVerse ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Algorithm-level Feedback-controlled Adaptive data

More information

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Keerthi Bhushan Rajesh K Chaurasia Hewlett-Packard India Software Operations 29, Cunningham Road Bangalore 560 052 India +91-80-2251554

More information

Improving Cache Performance using Victim Tag Stores

Improving Cache Performance using Victim Tag Stores Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

More information

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example 1 Which is the best? 2 Lecture 05 Performance Metrics and Benchmarking 3 Measuring & Improving Performance (if planes were computers...) Plane People Range (miles) Speed (mph) Avg. Cost (millions) Passenger*Miles

More information

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 02 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on top (a) and in

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah ABSTRACT The growing dominance of wire delays at future technology

More information

Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts

Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts Yoav Etsion Dror G. Feitelson School of Computer Science and Engineering The Hebrew University of Jerusalem 14 Jerusalem, Israel Abstract

More information

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand Caches Nima Honarmand Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required by all of the running applications

More information

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N.

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N. Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical

More information

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas 78712-24 TR-HPS-26-3

More information

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Aneesh Aggarwal Electrical and Computer Engineering Binghamton University Binghamton, NY 1392 aneesh@binghamton.edu Abstract With the

More information

Energy-Effective Instruction Fetch Unit for Wide Issue Processors

Energy-Effective Instruction Fetch Unit for Wide Issue Processors Energy-Effective Instruction Fetch Unit for Wide Issue Processors Juan L. Aragón 1 and Alexander V. Veidenbaum 2 1 Dept. Ingen. y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, Spain

More information

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Memory Hierarchy & Caches Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required

More information

Program Phase Directed Dynamic Cache Way Reconfiguration for Power Efficiency

Program Phase Directed Dynamic Cache Way Reconfiguration for Power Efficiency Program Phase Directed Dynamic Cache Reconfiguration for Power Efficiency Subhasis Banerjee Diagnostics Engineering Group Sun Microsystems Bangalore, INDIA E-mail: subhasis.banerjee@sun.com Surendra G

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Minimizing Power Dissipation during. University of Southern California Los Angeles CA August 28 th, 2007

Minimizing Power Dissipation during. University of Southern California Los Angeles CA August 28 th, 2007 Minimizing Power Dissipation during Write Operation to Register Files Kimish Patel, Wonbok Lee, Massoud Pedram University of Southern California Los Angeles CA August 28 th, 2007 Introduction Outline Conditional

More information

Se-Hyun Yang, Michael Powell, Babak Falsafi, Kaushik Roy, and T. N. Vijaykumar

Se-Hyun Yang, Michael Powell, Babak Falsafi, Kaushik Roy, and T. N. Vijaykumar AN ENERGY-EFFICIENT HIGH-PERFORMANCE DEEP-SUBMICRON INSTRUCTION CACHE Se-Hyun Yang, Michael Powell, Babak Falsafi, Kaushik Roy, and T. N. Vijaykumar School of Electrical and Computer Engineering Purdue

More information

Performance Oriented Prefetching Enhancements Using Commit Stalls

Performance Oriented Prefetching Enhancements Using Commit Stalls Journal of Instruction-Level Parallelism 13 (2011) 1-28 Submitted 10/10; published 3/11 Performance Oriented Prefetching Enhancements Using Commit Stalls R Manikantan R Govindarajan Indian Institute of

More information

An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron High-Performance I-Caches

An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron High-Performance I-Caches To appear in the Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA), 2001. An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron

More information

Slide Set 5. for ENCM 501 in Winter Term, Steve Norman, PhD, PEng

Slide Set 5. for ENCM 501 in Winter Term, Steve Norman, PhD, PEng Slide Set 5 for ENCM 501 in Winter Term, 2017 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary Winter Term, 2017 ENCM 501 W17 Lectures: Slide

More information

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 s Joshua J. Yi and David J. Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Spring 2018 :: CSE 502. Cache Design Basics. Nima Honarmand

Spring 2018 :: CSE 502. Cache Design Basics. Nima Honarmand Cache Design Basics Nima Honarmand Storage Hierarchy Make common case fast: Common: temporal & spatial locality Fast: smaller, more expensive memory Bigger Transfers Registers More Bandwidth Controlled

More information

Reactive-Associative Caches

Reactive-Associative Caches Reactive-Associative Caches Brannon Batson Alpha Design Group Compaq Computer Corporation bbatson@pa.dec.com T. N. Vijaykumar School of Electrical & Computer Engineering Purdue University vijay@ecn.purdue.edu

More information

Caching Basics. Memory Hierarchies

Caching Basics. Memory Hierarchies Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby

More information

CS152 Computer Architecture and Engineering

CS152 Computer Architecture and Engineering CS152 Computer Architecture and Engineering Caches and the Memory Hierarchy Assigned 9/17/2016 Problem Set #2 Due Tue, Oct 4 http://inst.eecs.berkeley.edu/~cs152/fa16 The problem sets are intended to help

More information

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring 2019 Caches and the Memory Hierarchy Assigned February 13 Problem Set #2 Due Wed, February 27 http://inst.eecs.berkeley.edu/~cs152/sp19

More information

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn A Cross-Architectural Interface for Code Cache Manipulation Kim Hazelwood and Robert Cohn Software-Managed Code Caches Software-managed code caches store transformed code at run time to amortize overhead

More information

Exploiting Core Working Sets to Filter the L1 Cache with Random Sampling

Exploiting Core Working Sets to Filter the L1 Cache with Random Sampling Exploiting Core Working Sets to Filter the L Cache with Random Sampling Yoav Etsion and Dror G. Feitelson Abstract Locality is often characterized by working sets, defined by Denning as the set of distinct

More information

Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures

Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures Zeshan Chishti, Michael D. Powell, and T. N. Vijaykumar School of Electrical and Computer Engineering, Purdue

More information

Many Cores, One Thread: Dean Tullsen University of California, San Diego

Many Cores, One Thread: Dean Tullsen University of California, San Diego Many Cores, One Thread: The Search for Nontraditional Parallelism University of California, San Diego There are some domains that feature nearly unlimited parallelism. Others, not so much Moore s Law and

More information

Exploiting Streams in Instruction and Data Address Trace Compression

Exploiting Streams in Instruction and Data Address Trace Compression Exploiting Streams in Instruction and Data Address Trace Compression Aleksandar Milenkovi, Milena Milenkovi Electrical and Computer Engineering Dept., The University of Alabama in Huntsville Email: {milenka

More information

Memory hier ar hier ch ar y ch rev re i v e i w e ECE 154B Dmitri Struko Struk v o

Memory hier ar hier ch ar y ch rev re i v e i w e ECE 154B Dmitri Struko Struk v o Memory hierarchy review ECE 154B Dmitri Strukov Outline Cache motivation Cache basics Opteron example Cache performance Six basic optimizations Virtual memory Processor DRAM gap (latency) Four issue superscalar

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2017 Lecture 15 LAST TIME: CACHE ORGANIZATION Caches have several important parameters B = 2 b bytes to store the block in each cache line S = 2 s cache sets

More information

Reconfigurable STT-NV LUT-based Functional Units to Improve Performance in General-Purpose Processors

Reconfigurable STT-NV LUT-based Functional Units to Improve Performance in General-Purpose Processors Reconfigurable STT-NV LUT-based Functional Units to Improve Performance in General-Purpose Processors Adarsh Reddy Ashammagari 1, Hamid Mahmoodi 2, Tinoosh Mohsenin 3, Houman Homayoun 1 1 Dept. of Electrical

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

Dynamically Controlled Resource Allocation in SMT Processors

Dynamically Controlled Resource Allocation in SMT Processors Dynamically Controlled Resource Allocation in SMT Processors Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona

More information

Automatically Characterizing Large Scale Program Behavior

Automatically Characterizing Large Scale Program Behavior Automatically Characterizing Large Scale Program Behavior Timothy Sherwood Erez Perelman Greg Hamerly Brad Calder Department of Computer Science and Engineering University of California, San Diego {sherwood,eperelma,ghamerly,calder}@cs.ucsd.edu

More information

Using Aggressor Thread Information to Improve Shared Cache Management for CMPs

Using Aggressor Thread Information to Improve Shared Cache Management for CMPs Appears in Proc. of the 18th Int l Conf. on Parallel Architectures and Compilation Techniques. Raleigh, NC. Sept. 2009. Using Aggressor Thread Information to Improve Shared Cache Management for CMPs Wanli

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2014 Lecture 14 LAST TIME! Examined several memory technologies: SRAM volatile memory cells built from transistors! Fast to use, larger memory cells (6+ transistors

More information

PARE: A Power-Aware Hardware Data Prefetching Engine

PARE: A Power-Aware Hardware Data Prefetching Engine PARE: A Power-Aware Hardware Data Prefetching Engine Yao Guo Mahmoud Ben Naser Csaba Andras Moritz Department of Electrical and Computer Engineering University of Massachusetts, Amherst, MA 01003 {yaoguo,

More information

Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs

Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs University of Maryland Technical Report UMIACS-TR-2008-13 Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs Wanli Liu and Donald Yeung Department of Electrical and Computer Engineering

More information

Workloads, Scalability and QoS Considerations in CMP Platforms

Workloads, Scalability and QoS Considerations in CMP Platforms Workloads, Scalability and QoS Considerations in CMP Platforms Presenter Don Newell Sr. Principal Engineer Intel Corporation 2007 Intel Corporation Agenda Trends and research context Evolving Workload

More information

A Power and Temperature Aware DRAM Architecture

A Power and Temperature Aware DRAM Architecture A Power and Temperature Aware DRAM Architecture Song Liu, Seda Ogrenci Memik, Yu Zhang, and Gokhan Memik Department of Electrical Engineering and Computer Science Northwestern University, Evanston, IL

More information

An Energy-Efficient High-Performance Deep-Submicron Instruction Cache

An Energy-Efficient High-Performance Deep-Submicron Instruction Cache An Energy-Efficient High-Performance Deep-Submicron Instruction Cache Michael D. Powell ϒ, Se-Hyun Yang β1, Babak Falsafi β1,kaushikroy ϒ, and T. N. Vijaykumar ϒ ϒ School of Electrical and Computer Engineering

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Power Reduction Techniques in the Memory System. Typical Memory Hierarchy

Power Reduction Techniques in the Memory System. Typical Memory Hierarchy Power Reduction Techniques in the Memory System Low Power Design for SoCs ASIC Tutorial Memories.1 Typical Memory Hierarchy On-Chip Components Control edram Datapath RegFile ITLB DTLB Instr Data Cache

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Picking Statistically Valid and Early Simulation Points

Picking Statistically Valid and Early Simulation Points In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), September 23. Picking Statistically Valid and Early Simulation Points Erez Perelman Greg Hamerly

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Cache Designs and Tricks. Kyle Eli, Chun-Lung Lim

Cache Designs and Tricks. Kyle Eli, Chun-Lung Lim Cache Designs and Tricks Kyle Eli, Chun-Lung Lim Why is cache important? CPUs already perform computations on data faster than the data can be retrieved from main memory and microprocessor execution speeds

More information

Computing Architectural Vulnerability Factors for Address-Based Structures

Computing Architectural Vulnerability Factors for Address-Based Structures Computing Architectural Vulnerability Factors for Address-Based Structures Arijit Biswas 1, Paul Racunas 1, Razvan Cheveresan 2, Joel Emer 3, Shubhendu S. Mukherjee 1 and Ram Rangan 4 1 FACT Group, Intel

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Implicitly-Multithreaded Processors

Implicitly-Multithreaded Processors Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University {parki,vijay}@ecn.purdue.edu http://min.ecn.purdue.edu/~parki http://www.ece.purdue.edu/~vijay Abstract

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

RECENT studies have shown that, in highly associative

RECENT studies have shown that, in highly associative IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 4, APRIL 2008 433 Counter-Based Cache Replacement and Bypassing Algorithms Mazen Kharbutli, Member, IEEE, and Yan Solihin, Member, IEEE Abstract Recent studies

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Design of Experiments - Terminology

Design of Experiments - Terminology Design of Experiments - Terminology Response variable Measured output value E.g. total execution time Factors Input variables that can be changed E.g. cache size, clock rate, bytes transmitted Levels Specific

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Cache Pipelining with Partial Operand Knowledge

Cache Pipelining with Partial Operand Knowledge Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin - Madison {egunadi,mikko}@ece.wisc.edu Abstract

More information

Koji Inoue Department of Informatics, Kyushu University Japan Science and Technology Agency

Koji Inoue Department of Informatics, Kyushu University Japan Science and Technology Agency Lock and Unlock: A Data Management Algorithm for A Security-Aware Cache Department of Informatics, Japan Science and Technology Agency ICECS'06 1 Background (1/2) Trusted Program Malicious Program Branch

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science CS 2002 03 A Large, Fast Instruction Window for Tolerating Cache Misses 1 Tong Li Jinson Koppanalil Alvin R. Lebeck Jaidev Patwardhan Eric Rotenberg Department of Computer Science Duke University Durham,

More information