Power/Performance Advantages of Victim Buffer in High-Performance Processors


Gianluca Albera*† (* Politecnico di Torino, Dip. di Automatica e Informatica, Torino, ITALY) and R. Iris Bahar† († Brown University, Division of Engineering, Providence, RI)

Abstract

In this paper, we propose several different data cache configurations and analyze their power as well as performance implications on the processor. Unlike most existing work in low power microprocessor design, we explore a high performance processor with the latest innovations for performance. Using a detailed, architectural-level simulator, we evaluate full system performance under several different power/performance sensitive cache configurations. We then use the information obtained from the simulator to calculate the energy consumption of the memory hierarchy of the system. We show that a victim buffer offers better cache energy savings than other techniques (10% compared to 3.8%), while at the same time providing comparable performance gains (3.54% compared to 3.45%).

1: Introduction

In this paper we concentrate on reducing the energy demands of an ultra high-performance processor, such as the Pentium Pro or the Alpha 21264, which uses superscalar, speculative, out-of-order execution. In particular, we investigate architectural-level solutions that achieve a power reduction in the memory subsystem of the processor without compromising performance.

Prior research has been aimed at measuring and recommending optimal cache configurations for power. In [9], the authors determined that high performance caches were also the lowest power consuming caches, since they reduce the traffic to the lower levels of the memory system. The work by Kin [7] proposed accessing a small filter cache before accessing the first level cache in order to reduce the number of accesses to (and the energy consumption of) the DL1. The idea led to a large reduction in memory hierarchy energy consumption, but also resulted in a substantial reduction in processor performance. While this reduction in performance may be tolerable for some applications, the high-end market will not make such a sacrifice. This paper proposes memory hierarchy configurations that reduce power while retaining performance.

Reducing cache misses due to line conflicts has been shown to be effective in improving overall system performance in high-performance processors. Techniques to reduce conflicts include increasing cache associativity, use of victim caches [5], or cache bypassing with and without the aid of a buffer [4, 8, 10]. Figure 1 shows the design of the memory hierarchy when using a buffer alongside the first level data cache.

Also included in the figure is a representation of the five main stages found in a speculative, out-of-order execution processor. Note that the instruction cache access occurs in the fetch stage, while the data cache access occurs in the issue stage for a load operation and in the commit stage for a store operation. An instruction may spend more than one cycle in any of these five stages depending on the type of instruction, data dependencies, and cache hit outcome.

[Figure 1. Memory hierarchy design using a buffer alongside the DL1 cache; the buffer helps boost the effectiveness of the cache. The figure shows the L1 instruction and data caches between the processor and the unified L2 cache, with the buffer placed next to the DL1, and the five pipeline stages (fetch, decode, issue, writeback, commit) annotated with the IL1 access in fetch and the DL1 accesses for loads in issue and stores in commit.]

The buffer is a small cache, between 8 and 16 entries, located between the first level and second level caches. The buffer may be used to hold specific data (e.g., non-temporal or speculative data), or it may be used for general data (e.g., "victim" data). Note that this buffer may be used in combination with the first level data cache (DL1) and/or the first level instruction cache (IL1). Previous work has shown that, for some specific applications, using a buffer alongside the instruction or data cache exclusively for non-temporal or speculative data can improve cache performance and energy consumption over a strictly "victim buffer" scheme [1, 8].

This paper focuses on the effectiveness of using a buffer alongside the DL1 only. In addition, we focus more closely on the impact of the buffer when used as a general purpose victim buffer. Specifically, we compare the use of the victim buffer to more traditional techniques for reducing cache conflicts, such as increasing cache size and/or associativity. We find that the optimal victim buffer configuration depends on the base data cache size, the second level cache configuration, and the processor configuration. In addition, we show that use of a victim buffer is usually a better choice for both power and performance compared to increasing either the data cache size or associativity.

2: Experimental setup

This section presents our experimental environment. First, the CPU simulator is briefly introduced; we then describe how we obtained data about the energy consumption of the caches.

2.1: Full model simulator

We use an extension of the SimpleScalar [2] tool suite. SimpleScalar is an execution-driven simulator that uses binaries compiled to a MIPS-like target. SimpleScalar can accurately model a high-performance, dynamically-scheduled, multi-issue processor. We use an extended version of the simulator that more accurately models the entire memory hierarchy, implementing non-blocking caches and complete bus bandwidth and contention modeling [3]. Other modifications were added to handle precise modeling of cache fills.

Table 1 shows the configuration of the processor modeled. Note that the first level caches are on-chip, while the unified second level cache is off-chip. In addition, we have an 8 to 16 entry fully associative buffer associated with each first level cache. Note that the ALU resources listed in Table 1 may incur different latency and occupancy values depending on the type of operation being performed by the unit. Although an 8KB L1 cache may seem small by today's standards for a high-performance processor, the SPEC95 benchmarks we use for our experiments tend to use a small working set. Therefore, we use a smaller cache to obtain reasonable hit/miss rates.

Table 1. Machine configuration and processor resources.

  Parameter     Configuration
  L1 Icache     8KB direct; 32B line; 1 cycle
  L1 Dcache     8KB direct; 32B line; 1 cycle
  L2 Cache      256KB 4-way; 64B line; 12 cycles
  Memory        64 bit-wide; 20 cycles on hit, 40 cycles on page miss
  Branch Pred.  2k bimodal
  BTB           2048 entry, 4-way set assoc.
  RAS           8 entry queue
  ITLB          32 entry, fully assoc.
  DTLB          64 entry, fully assoc.

  Parameter                   Units
  Fetch/Issue/Commit Width    4
  Integer ALU                 4
  Integer Mult/Div            1
  FP ALU                      4
  FP Mult/Div/Sqrt            1
  DL1 Read Ports              2
  DL1 Write Ports             1
  Instruction Window Entries  32
  Load/Store Queue Entries    16
  Fetch Queue                 8

Our simulations are executed on the SPECint95 and SPECfp95 benchmarks [12]; they were compiled using a re-targeted version of the GNU gcc compiler with full optimization. This compiler generates SimpleScalar machine instructions. Since we are executing a full model on a very detailed simulator, the benchmarks take several hours to complete; due to time constraints, we feed the simulator with a small set of inputs. Integer benchmarks are executed to completion. Floating point benchmarks are simulated only for the first 600 million clock cycles.

2.2: Power model

Energy dissipation in CMOS circuits is mainly due to charging and discharging gate capacitances; every transition dissipates an energy E_t = (1/2) C_eq V_dd^2 Joules. To obtain the values of the equivalent capacitances, C_eq, for the components in the memory subsystem, we follow the model given by Wilton and Jouppi [11]. Their model assumes a 0.8 um process; if a different process is used, only the transistor capacitances need to be recomputed. To obtain the number of transitions that occur on each transistor, we refer to Kamble and Ghose [6], adapting their work to our overall architecture.
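As a concrete illustration of this model, the sketch below computes the per-transition energy from an equivalent capacitance and a supply voltage. The numeric values are hypothetical placeholders chosen for illustration, not figures taken from [11] or from this paper; in the actual methodology the capacitances come from the Wilton/Jouppi model and cacti.

  # Minimal sketch of the per-transition energy model, E_t = 1/2 * C_eq * Vdd^2.
  # Capacitance and supply voltage are illustrative placeholders; real C_eq
  # values would come from the Wilton/Jouppi model and the cacti tool.

  def transition_energy(c_eq_farads, vdd_volts):
      """Energy (Joules) to charge or discharge one equivalent capacitance."""
      return 0.5 * c_eq_farads * vdd_volts ** 2

  if __name__ == "__main__":
      c_eq = 50e-15        # 50 fF, hypothetical bit-line capacitance
      vdd = 3.3            # volts, hypothetical supply voltage
      n_transitions = 1_000_000
      total = n_transitions * transition_energy(c_eq, vdd)
      print(f"Energy for {n_transitions} transitions: {total:.3e} J")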

An m-way set associative cache consists of three main parts: a data array, a tag array, and the necessary control logic. The data array consists of S rows containing m lines each. Each line contains L bytes of data and a tag T which is used to uniquely identify the data. Upon receiving a data request, the address is divided into three parts: the first part indexes one row in the cache, the second selects the bytes or words desired, and the last is compared to the stored tag entry to detect a hit or a miss. On a hit, the processor accesses the data from the first level cache. On a miss, we use a write-back, write-allocate policy.

The latency of the access is directly proportional to the capacitance driven and to the length of the bit-lines and word-lines. In order to maximize speed, we kept the arrays as square as possible by splitting the data and tag arrays vertically and/or horizontally. Sub-arrays can also be folded. We used the tool cacti [11] to compute the sub-arraying parameters for all our caches.

Following [6], we consider the main sources of power to be the following three components: E_bit, E_word, and E_output. We do not consider the energy dissipated in the address decoders, since we found this value to be negligible compared to the other components; similar to Kin [7], we found the energy consumption of the decoders to be about three orders of magnitude smaller than that of the other components. The energy consumption is computed as:

    E_cache = E_bit + E_word + E_output    (1)

A brief description of each of these components follows; further details can be found in [1] and [6].

Energy dissipated in bit-lines. E_bit is the energy consumption in the bit-lines; it is due to precharging the lines (including driving the precharge logic) and to reading or writing data. Since we assume a sub-arrayed cache, we need to precharge and discharge only the portion directly related to the address we need to read or write. Switching counts depend on the number of accesses to the cache and on the array parameters given by cacti [11], while capacitance values depend on the sizes of the transistors within the SRAM cell. Note that in order to minimize the power overhead introduced by buffers, in the fully associative configuration we first perform a tag look-up and access the data array only on a hit.

Energy dissipated in word-lines. E_word is the energy consumption due to the assertion of word-lines; once the bit-lines are all precharged, we select one row, performing the read/write of the desired data.

Energy dissipated driving external buses. E_output is the energy used to drive external buses; this component includes both the data sent/returned and the address sent to the lower level memory on a miss request:

    E_output = E_a_output + E_d_output    (2)

where E_a_output and E_d_output account for the address and data buses, respectively.
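To make the bookkeeping concrete, here is a minimal sketch of how equations (1) and (2) combine per-component transition counts and capacitances into a total cache energy. All names and numeric values are hypothetical stand-ins; in the actual methodology the transition counts come from the simulator and the capacitances from cacti.

  # Minimal sketch of equations (1) and (2): E_cache = E_bit + E_word + E_output,
  # with E_output split into address-bus and data-bus terms. Counts and
  # capacitances below are hypothetical placeholders.

  def transition_energy(c_eq_farads, vdd_volts):
      """E_t = 1/2 * C_eq * Vdd^2, in Joules."""
      return 0.5 * c_eq_farads * vdd_volts ** 2

  def cache_energy(counts, caps, vdd):
      """Total energy per equation (1); counts/caps are keyed by component."""
      e_bit = counts["bit"] * transition_energy(caps["bit"], vdd)
      e_word = counts["word"] * transition_energy(caps["word"], vdd)
      # Equation (2): external bus energy = address part + data part.
      e_output = (counts["addr"] * transition_energy(caps["addr"], vdd)
                  + counts["data"] * transition_energy(caps["data"], vdd))
      return e_bit + e_word + e_output

  if __name__ == "__main__":
      counts = {"bit": 4_000_000, "word": 500_000, "addr": 40_000, "data": 60_000}
      # Board-level bus capacitances (pF) dwarf on-chip array capacitances (fF),
      # which is why UL2 traffic dominates total energy in this model.
      caps = {"bit": 50e-15, "word": 30e-15, "addr": 20e-12, "data": 20e-12}
      print(f"E_cache = {cache_energy(counts, caps, vdd=3.3):.3e} J")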

3: Experimental results

3.1: Base case

In this section we describe the base case; all other experiments are compared against it. As stated before, our base case uses 8K direct-mapped, on-chip first level caches (i.e., DL1 for data and IL1 for instructions), with a unified 256K 4-way off-chip second level cache (UL2). Table 2 shows the instructions per cycle (IPC) and, for each cache, its miss rate and energy consumption (measured in Joules).

[Table 2. Baseline results: instructions per cycle (IPC), cache miss rate, and energy consumption (in Joules) for the DL1, IL1, and UL2 caches, for each of the benchmarks compress, go, vortex, gcc, li, ijpeg, m88ksim, perl, tomcatv, swim, su2cor, hydro2d, mgrid, and applu. The numeric entries were lost in extraction.]

The next sections' results are reported as percentage decreases relative to the base case; thus positive numbers indicate an improvement in energy consumption or performance, and negative numbers indicate a degradation. First level caches are on-chip, so their energy consumption counts at the CPU level, while the off-chip second level cache energy counts at the board level. In the following sections we show what happens to the energy in the overall cache architecture (L1 plus L2), and also clarify whether variations in power belong to the CPU or to the board. As shown in Table 2, the dominant portion of energy consumption is due to the UL2, because of its larger size and high-capacitance board buses. Furthermore, we see that the instruction cache demands more power than the data cache, due to a higher number of accesses.

3.2: Traditional techniques

Traditional approaches for reducing cache misses utilize bigger cache sizes and/or increased associativity. These techniques may offer good performance, but they present several drawbacks: first, an area increase is required; second, a larger access latency is likely to result whenever bigger caches and/or higher associativity are used. Table 3 presents results obtained by changing the data cache size or associativity. Specifically, we present results for 16K direct-mapped and 8K 2-way configurations, both maintaining a 1 cycle latency. Reductions in cycles, energy in the DL1, and total energy are shown.

[Table 3. Traditional techniques: percent improvement in performance (%IPC), DL1 miss rate (%M_DL1), DL1 energy (%E_DL1), and total energy (%E_total) from increasing DL1 size (16K direct-mapped) or associativity (8K 2-way) compared to the base case, at a 1 cycle latency, for each benchmark. Negative results indicate a decrease in performance or an increase in energy consumption. The numeric entries were lost in extraction.]

The overall energy consumption (i.e., %E_total) is generally reduced, due to reduced activity in the second level cache (since we decrease the DL1 miss rate). However, we can also see cases in which energy consumption increases. Benchmark su2cor already has a low miss rate (1.48%), so, although IPC improves, more energy is consumed in the DL1 than is saved by reduced UL2 accesses. On the other hand, m88ksim, mgrid and applu have higher miss rates, but these are due mainly to conflict misses.

Therefore, increasing the size of the cache does not improve the DL1 miss rate enough to reduce energy consumption. The opposite is true for swim, which has a very large miss rate (21.2%); its misses are predominately capacity misses, so although using a 16KB DL1 improves both performance and energy consumption, both degrade when using an 8KB 2-way associative DL1. It is important to note that even when the total energy consumption improves, on-chip consumption increases significantly (up to 61% for mgrid) as the cache becomes larger. In addition, we have observed that if the increase in size/associativity requires moving to a 2 cycle latency, the performance improvements are no longer so dramatic, and in some cases we also saw a decrease.

3.3: Victim cache

We next consider the use of a victim buffer, originally presented by Jouppi [5]. In the original implementation, on a main cache miss, the victim cache is accessed; if the address hits in the victim cache, the data is returned to the CPU and at the same time promoted to the main cache; the replaced line in the main cache is moved to the victim cache, therefore performing a "swap". If the victim cache also misses, an L2 access is performed; the incoming data fills the main cache, and the replaced line is moved to the victim cache. The replaced entry in the victim cache is discarded and, if dirty, written back to the second level cache. We modified the algorithm by performing a parallel look-up in the main and victim caches, as we saw that this helps performance without significant drawbacks on power. We also found that the time required by the processor to perform the swapping on a victim hit was detrimental to performance, so we also tried a variation of the algorithm that does not require swapping (i.e., on a victim cache hit, the line is not promoted to the main cache).
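The sketch below captures the two lookup policies just described: a swapping victim buffer in the style of [5], and the non-swapping variant in which a victim hit serves the data without promoting the line. It is a behavioral model only (direct-mapped main cache, FIFO victim replacement, no timing, write-backs of dirty victims omitted); the class and field names are our own illustrative choices, not code from the paper's simulator.

  # Behavioral sketch of a direct-mapped L1 backed by a small victim buffer.
  # swap=True models Jouppi's original scheme (a victim hit promotes the line
  # and demotes the conflicting L1 line); swap=False models the non-swapping
  # variant studied here (a victim hit serves data and leaves the L1 intact).
  # L1 and victim are probed in parallel in the modified scheme; the
  # sequential checks below are behaviorally equivalent.
  from collections import deque

  class VictimBufferCache:
      def __init__(self, n_sets, victim_entries=8, swap=True):
          self.n_sets = n_sets
          self.l1 = {}                                 # set index -> tag
          self.victim = deque(maxlen=victim_entries)   # FIFO of (tag, index)
          self.swap = swap

      def access(self, block_addr):
          """Classify one access; block_addr is the address minus offset bits."""
          index = block_addr % self.n_sets
          tag = block_addr // self.n_sets
          if self.l1.get(index) == tag:
              return "l1_hit"
          if (tag, index) in self.victim:
              if self.swap:
                  # Promote the victim line; demote the conflicting L1 line.
                  self.victim.remove((tag, index))
                  evicted = self.l1.get(index)
                  if evicted is not None:
                      self.victim.append((evicted, index))
                  self.l1[index] = tag
              return "victim_hit"
          # Miss in both: fill from L2; the replaced L1 line moves to the
          # victim buffer, whose oldest entry is silently discarded.
          evicted = self.l1.get(index)
          if evicted is not None:
              self.victim.append((evicted, index))
          self.l1[index] = tag
          return "miss"

  if __name__ == "__main__":
      trace = [0, 256, 0, 256, 512, 0]   # hypothetical conflicting block addresses
      cache = VictimBufferCache(n_sets=256, victim_entries=8, swap=False)
      print([cache.access(a) for a in trace])
      # -> ['miss', 'miss', 'victim_hit', 'l1_hit', 'miss', 'victim_hit']

Running both variants over the same address trace and tallying the returned labels gives the kind of hit/miss breakdown that underlies the comparison in Table 4.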

Table 4 shows the effect of using a fully associative victim cache alongside the data cache. We present performance (in IPC) and overall energy reduction using either an 8-entry or a 16-entry victim buffer, with and without swapping.

[Table 4. Fully associative victim cache: percent improvement in performance (%IPC) and energy consumption (%Energy) compared to the base case, for 8-entry and 16-entry victim caches with swapping and without swapping, for each benchmark. The numeric entries were lost in extraction.]

As stated before, the non-swapping mechanism generally outperforms the original algorithm of [5]. In [5], the data cache of a single-issue processor was considered, where a memory access occurs roughly once every four cycles; thus the victim cache had ample time to perform the necessary swapping. The same cannot be said of a 4-way issue processor that has, on average, one data memory access per cycle. In this case the improved miss rate obtained by swapping is often outweighed by the extra latency introduced by having to stall subsequent accesses.

It is interesting to note that, when using a victim buffer, increased performance and energy reduction go hand-in-hand, since reduced L2 activity helps both. As we saw in Section 3.2, this is not always the case when doubling the size or associativity. There are two reasons for this. First, the victim buffer can adapt better to either capacity or conflict miss problems. The swim benchmark is the best example of this: its performance increases by over 16% using a 16-entry non-swapping victim buffer, but by only 10% when doubling the size of the cache, and it degrades by 1.8% when doubling the associativity. The second reason is that the victim buffer adds less power dissipation on-chip than either of the other techniques. We found that on-chip energy consumption increased by only 2% on average for the 16-entry buffer (not shown in the table). This allows total cache energy savings to be much greater using the victim buffer than either increasing cache size or associativity (10% on average, vs. 3.8% for a 16KB DL1 or 5.0% for an 8K 2-way DL1).

3.4: Effectiveness of the Victim Cache using Different Sized L1 Data Caches

This section explores the effectiveness of the victim cache when a smaller or larger base L1 data cache is used. We first explored performance and energy consumption tradeoffs using a 4KB L1 data cache. We expect a 4KB DL1 cache to suffer more from capacity and conflict misses than an 8KB DL1 cache; accordingly, doubling the size, doubling the associativity, or using a victim buffer all obtained larger improvements in performance and energy consumption than those shown in Tables 3 and 4 (energy consumption often improves by 50% more, and performance by 30% more).

Victim caches still have a clear advantage over increasing the size or associativity of the DL1 in terms of energy consumption, while obtaining approximately the same or better performance improvement. However, one notable difference is the effectiveness of swapping versus non-swapping victim buffers. For an 8KB DL1 we found that swapping interferes with performance, since it can stall subsequent accesses to the DL1. On the other hand, since a processor with only a 4KB cache spends more time servicing DL1 misses (and therefore a longer time executing the program), DL1 accesses are spread out more, making concurrent swaps and accesses less frequent. The end result is that swapping helps to improve the cache hit rate without hindering processor performance. Alternatively, if the base DL1 cache size is increased to 16KB, then a 16-entry victim buffer still provides some improvement in performance and energy consumption, though not as great as with a 4KB or 8KB DL1. In addition, as the cache size increases, swapping continues to hinder performance.

3.5: Alternatives to Changing the Cache Configuration

Our ultimate goal is to reduce energy consumption while retaining performance. High-performance processors obtain performance increases through a variety of techniques, not just through improving the L1 cache structure. One question, then, is what effect other techniques have on performance and energy consumption compared to improving the cache configuration. We now look at increasing the issue and decode width of the processor from 4 to 8 instructions.

[Table 5. Increasing issue width: percent improvement in performance (%IPC), DL1 miss rate (%M_DL1), and energy consumption (%E_DL1, %E_IL1, %E_UL2, %E_total) compared to the base case when the decode and issue width is increased from 4 to 8 instructions, for each benchmark. The numeric entries were lost in extraction.]

Table 5 shows the effect on IPC and energy consumption. It is interesting to note that although performance does improve, energy consumption degrades in almost all cases for all three caches (as indicated by the negative values in the columns labeled %E_DL1, %E_IL1, and %E_UL2). The increased energy consumption is a direct result of increased cache accesses. With a wider issue processor, more instructions can enter the fetch and execute stages per cycle. This allows more speculative instructions on mispredicted paths to be executed before the branch misprediction is detected.

For these benchmarks, adding a victim cache would be a better choice, especially since wider issue also requires more hardware in the decode and fetch stages, increasing energy consumption even further in the non-cache portion of the chip. However, increasing issue width and using a victim cache are not mutually exclusive. Both can be implemented, and we have found that performance and energy consumption improve by roughly the same percentages as those shown in Tables 3 and 4. As with a 4-wide issue machine, a 16-entry non-swapping victim cache is overall the best choice compared to doubling the size or associativity of the DL1.

4: Conclusions and Future Work

In this paper we presented tradeoffs between power and performance in cache architectures; we showed that adding small buffers can offer an overall energy reduction without decreasing performance. These small buffers suffice to reduce the overall cache energy consumption without the notable increase in on-chip power that traditional techniques exhibit. Furthermore, use of a buffer is not mutually exclusive with increasing the cache size/associativity or the decode/issue width, offering the possibility of a combined effect. We are currently investigating further improvements in combined techniques, replacement policies, and buffer utilization.

Acknowledgments

We wish to thank Bobbie Manne for her valuable comments throughout this work and Doug Burger for providing us with the framework for the SimpleScalar memory model.

References

[1] R. I. Bahar, G. Albera, and S. Manne, "Using Confidence to Reduce Energy Consumption in High-Performance Microprocessors," International Symposium on Low Power Electronics and Design, 1998.
[2] D. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0," Technical Report TR#1342, University of Wisconsin, June 1997.
[3] D. Burger and T. M. Austin, "SimpleScalar Tutorial," presented at the 30th International Symposium on Microarchitecture, December 1997.
[4] T. L. Johnson and W. W. Hwu, "Run-time Adaptive Cache Hierarchy Management via Reference Analysis," International Symposium on Computer Architecture, pp. 315-326, June 1997.
[5] N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," International Symposium on Computer Architecture, pp. 364-373, May 1990.
[6] M. B. Kamble and K. Ghose, "Analytical Energy Dissipation Models for Low Power Caches," ACM/IEEE International Symposium on Low-Power Electronics and Design, August 1997.
[7] J. Kin, M. Gupta, and W. H. Mangione-Smith, "The Filter Cache: An Energy Efficient Memory Structure," 30th International Symposium on Microarchitecture, pp. 184-193, December 1997.
[8] J. A. Rivers and E. S. Davidson, "Reducing Conflicts in Direct-Mapped Caches with a Temporality-Based Design," International Conference on Parallel Processing, August 1996.
[9] C. Su and A. Despain, "Cache Design Tradeoffs for Power and Performance Optimization: A Case Study," ACM/IEEE International Symposium on Low-Power Design, pp. 63-68, April 1995.
[10] G. Tyson, M. Farrens, J. Matthews, and A. R. Pleszkun, "Managing Data Caches using Selective Cache Line Replacement," International Journal of Parallel Programming, Vol. 25, No. 3, pp. 213-242, June 1997.
[11] S. J. E. Wilton and N. Jouppi, "An Enhanced Access and Cycle Time Model for On-Chip Caches," Digital WRL Research Report 93/5, July 1994.
[12] J. Gee, M. Hill, D. Pnevmatikatos, and A. J. Smith, "Cache Performance of the SPEC Benchmark Suite," IEEE Micro, Vol. 13, No. 4, August 1993.
