Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads


Hideki Miwa, Yasuhiro Dougo, Victor M. Goulart Ferreira, Koji Inoue, and Kazuaki Murakami
Dept. of Informatics, Kyushu University, Fukuoka, Japan
Dept. of Elec. Eng. and Computer Science, Fukuoka University, Fukuoka, Japan

Abstract

In recent years, the performance gap between microprocessors and main memory has been widening. This gap suppresses microprocessor performance and is well known in the literature as the Memory Wall Problem (MWP). This paper proposes a new method to reduce the MWP effect by means of re-computation. The basic idea is to replace a frequently cache-missing load instruction (a so-called delinquent load) with a piece of code that regenerates the load value, called re-computation (RC) code. This method reduces both the number of main memory accesses and the cache miss penalty, thereby alleviating the MWP. In our experiments, we obtain up to a 45.3% reduction of execution time on SPEC CPU 2000 benchmark programs.

1 Introduction

The performance of microprocessors has been improving by about 55% per year, whereas that of the DRAM commonly used as main memory has been improving by only 7% [4]. The resulting performance gap keeps increasing and is one of the main factors suppressing microprocessor performance; it is called the memory wall problem. To solve it, the memory access latency needs to be reduced or hidden. One traditional latency reduction technique is the on-chip cache memory, widely used in state-of-the-art commercial microprocessors. A common latency hiding technique, on the other hand, is prefetching, and many researchers have proposed techniques to improve its accuracy. In recent years, some research groups have proposed prefetching techniques focused on the load instructions which frequently cause cache misses [3, 7].
Those load instructions have come to be known as delinquent load (DL) instructions; in one experiment they were responsible for 90% of total cache misses [3]. These prefetching techniques are based on multithreaded processors. Besides the normal program execution in one thread (the main thread), the data addresses accessed by DL instructions are speculatively calculated in another thread (a helper thread). The microprocessor can then prefetch the data whose addresses are calculated in the helper thread. However, prefetching still accesses main memory to improve the cache hit rate, so it cannot reduce the number of main memory accesses. A further drawback of prefetching is that inaccurately loaded data pollutes the on-chip cache, resulting in higher cache miss rates. These points are the limitations of prefetching.

In this paper, we propose a novel approach to the memory wall problem based on Re-Computation (RC). It originates in a concept we call CCC (Computing Centric Computation), one of our measures against the memory wall problem. Today, the memory access latency is far longer than the computation time of a single operation in a microprocessor, so CCC replaces memory accesses with computations to reduce total execution time. We apply this concept to load instructions and propose the Load Data Re-Computation (LDRC) technique. In order to avoid the growing off-chip memory access latency, we replace load instructions with corresponding RC code; that is, we trade memory access latency for RC code execution time. Because cache misses concentrate on DL instructions, we apply this method only to DL instructions among all loads. Hence, if the microprocessor can execute the RC codes faster than the corresponding DL instructions, LDRC executes the RC codes instead and shortens total execution time. In this paper, we preliminarily evaluate

our technique and show its potential for execution time reduction. The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 describes the concept of our re-computation technique. Section 4 presents the experiments and evaluation results. Section 5 concludes this paper.

2 Related work

Figure 1. Outline of LDRC: an original source code (c = a + b; ... z = x + c;), the corresponding object code, the RC code (Load a; Load b; Add c, a, b), and the LDRC-applied object code in which Load c is replaced with the RC code, assuming that values a, b, and x are in the cache.

There are many techniques to reduce or hide the latency of main memory accesses; however, as far as we know, no technique based on re-computation has been proposed. In this section, we take up methods dealing with DL instructions. Recently, to hide the access latency, speculative data prefetching techniques for DL instructions have been proposed. Roth and Sohi propose Speculative Data-Driven Multithreading (DDMT) [7]. In DDMT, instructions are classified by their criticality to performance from trace data. If the processor detects upcoming critical instructions at run time, it forks a speculative thread to execute them. The speculative thread is called a data-driven thread (DDT) and is executed in parallel with the main thread. A DDT includes load instructions which cause cache misses or branch instructions which are frequently mispredicted.

Collins et al. propose Speculative Precomputation (SP) [3]. SP also uses a simultaneous multithreading processor to improve the performance of single-thread applications. In SP, precomputation slices (p-slices), which include DL instructions and their dependent instructions, are executed in another thread to calculate the data addresses accessed by DL instructions and prefetch them. SP targets Itanium processors, which have an in-order pipeline, whereas DDMT targets out-of-order processors.

Annavaram et al. propose Data Prefetching by Dependence Graph Precomputation (DGP) [1]. DGP employs a Dependence Graph Generator (DGG) in the IFQ and generates a dependence graph for a load or store instruction which is likely to cause a cache miss. A Precomputation Engine (PE) executes the dependence graph to generate the prefetch address and prefetch the data. DGP does not need a multithreaded processor, but it needs the additional DGG and PE.

The techniques described above are limited to data prefetching: if the predicted data address is wrong, they cannot hide the latency of main memory accesses. In the next section, we propose a data re-computation method for DL instructions.

3 Load data re-computation

In this section, we propose the load data re-computation technique, called LDRC, and discuss its effectiveness. To make the problem simple, we assume that benchmark programs run single-threaded on a normal 32-bit RISC-type microprocessor. The microprocessor accesses a 2-level on-chip cache memory and an off-chip main memory.

3.1 Basic idea

To explain the concept of LDRC, consider the execution of a load instruction. If the data accessed by the load is not in the on-chip L2 cache, it has to be fetched from the off-chip main memory. Since fetching data from off-chip memory takes a long time, the cache miss penalty is very large. However, if the data was defined previously in the program, we can regenerate it by re-computation. In addition, if the re-computation time for the load data is shorter than the memory access time, re-computation reduces the execution time. This is the basic idea of LDRC. Namely, we attempt to replace DL instructions with corresponding re-computation codes to reduce execution time.
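The tradeoff above can be sketched as a simple cost comparison. This is a minimal illustration, not the paper's model: the miss penalty and CPI values below are assumptions chosen for the example, and the function name is invented.

```python
# Sketch of the LDRC cost tradeoff (illustrative parameters, not the
# paper's simulator configuration).
MISS_PENALTY = 250   # assumed clock cycles to fetch from off-chip main memory
CPI = 1.0            # assumed mean clock cycles per RC-code instruction

def worth_recomputing(n_rc_instructions: int, rc_operands_cached: bool) -> bool:
    """A delinquent load is worth replacing with its RC code only if the
    RC code's operands hit in the cache and its execution time is below
    the main-memory miss penalty."""
    if not rc_operands_cached:
        return False                     # re-computation would itself miss
    return n_rc_instructions * CPI < MISS_PENALTY

# A 3-instruction RC code (Load a; Load b; Add c, a, b) with cached operands:
print(worth_recomputing(3, True))    # True: 3 cycles << 250-cycle miss
print(worth_recomputing(300, True))  # False: RC code slower than the miss
```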

We give an example in Figure 1 to explain this method. Figure 1 shows an original source code, the corresponding object code, an RC code, and the modified object code. In the original source code, the variable c is defined in the first statement (c = a + b) and used in the later one (z = x + c). At the later statement, we assume that the value of c has been evicted from the on-chip cache, so in the corresponding object code the instruction Load c causes a cache miss and c must be loaded from main memory. In the LDRC-applied object code, the RC code is inserted in place of the Load c of the original object code, and c is regenerated by the RC code. If variables a and b are in the cache, c is successfully re-computed by the RC code without any cache miss. The execution cycles are reduced if the re-computation replaces the accesses to main memory and takes less time than those accesses.

3.2 RC code generation

The RC code generation procedure is as follows:

1. Determine the DL instructions by simulation with benchmark programs. The DL instructions depend on the input vectors of a benchmark program, so DL instruction determination has to be carried out for each input.

2. Construct a CDFG (control/data flow graph) for each benchmark program.

3. On the CDFG, find the definition point of the loaded data for each DL instruction.

4. Extract from the CDFG the code that defines each loaded value.

This algorithm yields the RC codes. Furthermore, if we know that a load instruction at a leaf of an RC code will itself cause a cache miss, an RC code for that load should be generated as well. RC codes of this kind are called multi-layered RC codes, whereas the normal ones are called single-layered RC codes.
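Steps 3 and 4 of the procedure above amount to taking a backward slice from the loaded value to its defining instructions. The following sketch shows that idea on a toy dependence graph; the instruction and graph representations here are invented for illustration and are not the paper's CDFG format.

```python
# Hypothetical sketch of RC-code extraction: walk back from a delinquent
# load's value to its defining instructions on a dependence graph.
def extract_rc_code(defs, target):
    """defs maps a value name to (instruction, source operands) at its
    definition point; returns the defining instructions for `target`
    in dependence order (a single-layered RC code)."""
    rc, seen = [], set()

    def visit(value):
        if value in seen or value not in defs:
            return                      # leaf: value must come from the cache
        seen.add(value)
        instr, sources = defs[value]
        for src in sources:             # recurse into operands first
            visit(src)
        rc.append(instr)

    visit(target)
    return rc

# Figure 1's example: c is defined as a + b, so the RC code for `Load c`
# reloads a and b (assumed cached) and redoes the add.
defs = {
    "a": ("Load a", []),
    "b": ("Load b", []),
    "c": ("Add c, a, b", ["a", "b"]),
}
print(extract_rc_code(defs, "c"))  # ['Load a', 'Load b', 'Add c, a, b']
```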
3.3 Conditions for performance improvement with LDRC

This subsection discusses the conditions that should be satisfied for LDRC to cut down the execution time. We ignore implementation issues because the objective of the evaluation is to show the potential of LDRC to reduce execution time. To reduce execution time, applications should satisfy at least the following four conditions:

1. The DL instructions cause a lot of cache misses.

2. The DL instructions can be replaced with RC codes.

3. The execution time of an RC code is shorter than the memory access latency.

4. The load instructions inside the RC codes do not cause any cache misses.

Conditions 1 and 2 describe the opportunity to apply the proposed method: the more opportunities we have, the more benefit we can obtain. Conditions 3 and 4 bound the overhead introduced by the proposed method: execution time cannot be reduced unless re-computation is fast and the load instructions in the RC codes hit in the cache.

4 Evaluation

In this section, we quantitatively evaluate our method using benchmark programs. Since the objective is to show the maximum performance improvement due to our method, we ignore several implementation issues.

4.1 Experimental environment

To evaluate our method, we used the SimpleScalar simulation tool set [2]. The processor configuration assumed in this evaluation is shown in Table 1. We selected the following benchmark programs from the SPEC CPU 2000 benchmark set [5], compiled with the MIRV C compiler [6].

Floating-point benchmark programs (4 programs): 177.mesa, 179.art, 183.equake, 188.ammp

Integer benchmark programs (7 programs): 164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, 255.vortex, 256.bzip2

We used the reference inputs of SPEC CPU 2000 for these benchmarks. We selected command line parameters so as to cause the maximum number of

Table 1. Simulator parameters (cc: clock cycle(s))

Instruction issue: out of order
Branch predictor: 2-level (gshare, 2K entries); BTB: 512 entries, 4-way; RAS: 32
Inst. decode width: 8 instructions/cc
Inst. issue width: 8 instructions/cc
IFQ size: 8 entries; RUU size: 64 entries; LSQ size: 32 entries
Cache memory:
  L1 data: 32KB (64B/entry, 2-way, 256 entries)
  L1 instruction: 32KB (64B/entry, 1-way, 512 entries)
  L2 unified: 2MB (64B/entry, 4-way, 8192 entries)
Latency: L1 cache 1 cc; L2 cache 16 cc; main memory 250 cc
Memory bandwidth: 8B; memory ports: 2
ITLB, DTLB: 1M entries (4KB/entry, 256 entries/way, 4-way); miss penalty 30 cc
Integer units (units, exec. lat., issue lat.): ALU 4, 1 cc, 1 cc; Mult. 1, 3 cc, 1 cc; Div. 1, 20 cc, 19 cc
Floating-point units (units, exec. lat., issue lat.): ALU 4, 2 cc, 1 cc; Mult. 1, 4 cc, 1 cc; Div. 1, 12 cc, 12 cc; SQRT 1, 24 cc, 24 cc

L2 data cache misses by DL instructions. To shorten the simulation time, we evaluated 200M instructions executed after skipping the first 2G instructions.

4.2 Evaluation assumptions

The assumptions in this evaluation are as follows. We define as DL instructions the 16 static load instructions which cause L2 cache misses most frequently. This definition is based on the fact that these 16 static loads are responsible for the greater part of the L2 cache misses. Figure 2 shows the percentage of L2 cache misses caused by the DL instructions for each benchmark program; we ranked the DL instructions by their cache miss counts as DL1, DL2, DL3, and DL4-16.

Figure 2. Percentage of L2 cache misses due to DL instructions

According to Figure 2, the DL1-3 instructions are responsible for more than 40% of the total cache misses in most of the benchmark programs, and all 16 DL instructions together account for an even larger share. In Subsection 3.2, we explained the way to generate RC codes.
However, we could not generate the CDFGs, so we used instruction traces instead. RC codes should therefore include all the paths needed for re-computation, but the generated RC codes include only one path. To count the number of instructions in the RC codes, we would have to generate all of them, which takes a long time. In this evaluation, to reduce the RC code generation time, we group the RC codes by their similarity and generate only one representative RC code for each group. We assume that all the RC codes in a group have the same number of instructions as the representative RC code. In order to always satisfy condition 4 of Subsection 3.3, we assume that all the load instructions in all the RC codes hit in the L2 cache. If the execution time of an RC code is longer than the memory access latency, we suppose that the load instruction fetches the data from the off-chip memory; in this case, the RC code is not executed and the access to off-chip memory is carried out instead. We do not consider the drawbacks of LDRC for the instruction and data caches.
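The grouping assumption above can be sketched as follows. The similarity criterion used here (grouping by opcode sequence) is an invented placeholder, since the paper does not specify its similarity measure; the point is only that every member of a group inherits the representative's instruction count.

```python
# Sketch of grouping RC codes by similarity and keeping one representative
# per group (the similarity key below is a hypothetical choice).
from collections import defaultdict

def group_rc_codes(rc_codes):
    """rc_codes: list of RC codes, each a list of instruction mnemonics.
    Returns {similarity_key: (representative_code, member_count)}."""
    groups = defaultdict(list)
    for code in rc_codes:
        key = tuple(instr.split()[0] for instr in code)  # opcode sequence
        groups[key].append(code)
    return {k: (members[0], len(members)) for k, members in groups.items()}

codes = [
    ["Load a", "Load b", "Add c, a, b"],
    ["Load x", "Load y", "Add z, x, y"],   # same shape -> same group
    ["Load p", "Mul q, p, p"],
]
grouped = group_rc_codes(codes)
for key, (rep, n) in grouped.items():
    # every member of the group is assumed to have len(rep) instructions
    print(key, len(rep), n)
```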

4.3 Evaluation index

We adopt the percentage of execution time reduced by LDRC as the evaluation index. Formula (1) gives the execution time (in clock cycles) under this method:

C_ldrc = C_ideal + C_rc + C_norc    (1)

C_ldrc: the execution time (in clock cycles) under LDRC.
C_ideal: the execution time (in clock cycles) ignoring the data cache miss penalty of the DL instructions; in this case, the DL instructions are assumed to always hit in the cache.
C_rc: the execution time (in clock cycles) of all the RC codes.
C_norc: the execution time (in clock cycles) of the DL instructions which cannot be replaced with RC codes.

We estimate C_rc with Formula (2):

C_rc = CPI × Σ_{i=0}^{dl} Σ_{k=0}^{rcc} I_rccg(i, k) × N_rccg(i, k)    (2)

CPI: the mean execution clock cycles per instruction.
I_rccg(i, k): the number of instructions in the k-th representative RC code for the i-th DL instruction.
N_rccg(i, k): the execution count of the k-th representative RC code for the i-th DL instruction.
dl: the number of DL instructions, fixed to 16 in this experiment.
rcc: the number of RC code groups.

Formula (3) gives the execution time of the DL instructions which cannot be replaced with RC codes:

C_norc = (C_org − C_ideal) × N_norc / N_dl2miss    (3)

C_org: the original execution time (in clock cycles).
N_dl2miss: the total number of L2 misses due to DL instructions.
N_norc: the total number of DL instructions which cannot be replaced with RC codes.

Figure 3. Percentage of reduced execution time (ideal vs. LDRC)

Figure 4. The average execution time of RC codes (in clock cycles)

Formula (4) gives the percentage of execution time reduction, using Formulas (1), (2), and (3):

R_ldrc = (1 − C_ldrc / C_org) × 100 [%]    (4)

R_ldrc: the percentage of execution time reduction by the proposed method.
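Formulas (1)-(4) can be collected into a single executable cost model, as in the sketch below. The numeric inputs in the example run are illustrative values, not measurements from the paper; the double sum of Formula (2) is flattened into one list of (I_rccg, N_rccg) pairs for brevity.

```python
# Formulas (1)-(4) as an executable cost model (illustrative inputs).
def ldrc_reduction(c_org, c_ideal, cpi, rc_groups, n_norc, n_dl2miss):
    """rc_groups: list of (I_rccg, N_rccg) pairs, i.e. instruction count
    and execution count of each representative RC code."""
    c_rc = cpi * sum(i_rccg * n_rccg for i_rccg, n_rccg in rc_groups)  # (2)
    c_norc = (c_org - c_ideal) * n_norc / n_dl2miss                    # (3)
    c_ldrc = c_ideal + c_rc + c_norc                                   # (1)
    return (1 - c_ldrc / c_org) * 100                                  # (4)

# Illustrative run: a 1e9-cycle original execution, 6e8 cycles if all DL
# instructions hit, 5-instruction RC codes executed 1e6 times, and 10% of
# DL misses not replaceable by RC codes.
r = ldrc_reduction(c_org=1e9, c_ideal=6e8, cpi=1.0,
                   rc_groups=[(5, 1_000_000)], n_norc=100_000,
                   n_dl2miss=1_000_000)
print(f"{r:.1f}% execution time reduction")  # prints "35.5% execution time reduction"
```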
To estimate the percentage of execution time reduced by LDRC, we obtain C_org, C_ideal, I_rccg(i, k), N_rccg(i, k), N_norc, N_dl2miss, and CPI from our experiment.

4.4 Experimental results and discussion

We show the percentage of execution time reduced by LDRC in Figure 3, the mean execution time of RC codes in Figure 4, and the percentage of DL instructions which can be replaced with RC codes in Figure 5. In Figure 3, there are two bars per benchmark. The first bar corresponds to the ideal case, assuming all the DL instructions cause cache hits,

and the second bar corresponds to the LDRC-applied case. Figure 4 plots the average execution time of the representative RC codes for each RC code group mentioned in Subsection 4.2. Figure 5 indicates the percentage of DL instructions which can be replaced with RC codes.

Figure 5. The percentage of DL instructions which can be replaced with RC codes

According to Figure 3, one benchmark program, 181.mcf, can reduce execution time by up to 45.3%. On the other hand, we see no execution time reduction from LDRC on 177.mesa, 179.art, 176.gcc, and 255.vortex. We discuss the reasons for these results in terms of the average execution time of RC codes and the number of DL instructions which can be replaced with RC codes.

First, according to Figure 3, the execution time reduction is small on 177.mesa, 176.gcc, and 255.vortex because these programs seldom satisfy condition 1 of Subsection 3.3: since the DL instructions have little influence on these programs, LDRC can obtain little benefit.

Secondly, according to Figure 5, 179.art has very few DL instructions which can be replaced with RC codes, so it seldom satisfies condition 2 of Subsection 3.3. Such a program offers little opportunity to apply LDRC.

Finally, we discuss one more benchmark program, 188.ammp. According to Figure 3, this program shows high potential to reduce execution time in the ideal case, but not in the LDRC case. The reason, seen in Figure 4, is that its RC codes take a long time to execute; the program does not satisfy condition 3 of Subsection 3.3.

From the above discussion, if one of the performance conditions of Subsection 3.3 is not satisfied, we cannot obtain a large execution time reduction.
One benchmark program in this evaluation, 181.mcf, satisfies every condition, so it obtains a large execution time reduction.

5 Conclusion

In this paper, we proposed a load data re-computation technique called LDRC as a measure against the memory wall problem. We applied LDRC to DL instructions and evaluated its potential for execution time reduction. From the experimental results, one benchmark program, 181.mcf, can reduce execution time by up to 45.3%, confirming the potential of the method. In this evaluation, we ignored several factors related to overheads, such as changes in the miss rates of the data and instruction caches. Further evaluations that consider realistic implementations will show the effectiveness of LDRC more precisely.

References

[1] M. Annavaram, J. M. Patel, and E. S. Davidson. Data prefetching by dependence graph precomputation. In Proc. of the 28th Intl. Symposium on Computer Architecture.
[2] D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. University of Wisconsin-Madison Computer Sciences Department Technical Report, June.
[3] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, G. Hoflener, D. Lavery, and J. P. Shen. Speculative precomputation: Long-range prefetching of delinquent loads. In Proc. of the 28th Intl. Symposium on Computer Architecture.
[4] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, 3rd Edition. Morgan Kaufmann Publishers.
[5] J. L. Henning. SPEC CPU2000: Measuring CPU performance in the new millennium. 33(7), July.
[6] M. Postiff, D. Greene, C. Lefurgy, D. Helder, and T. Mudge. The MIRV SimpleScalar/PISA compiler. University of Michigan EECS Department Technical Report, April.
[7] A. Roth and G. Sohi. Speculative data-driven multithreading. In Proc. of the 7th Intl. Symposium on High-Performance Computer Architecture, January 2001.


More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

Area-Efficient Error Protection for Caches

Area-Efficient Error Protection for Caches Area-Efficient Error Protection for Caches Soontae Kim Department of Computer Science and Engineering University of South Florida, FL 33620 sookim@cse.usf.edu Abstract Due to increasing concern about various

More information

Cache Performance Research for Embedded Processors

Cache Performance Research for Embedded Processors Available online at www.sciencedirect.com Physics Procedia 25 (2012 ) 1322 1328 2012 International Conference on Solid State Devices and Materials Science Cache Performance Research for Embedded Processors

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

A Low Energy Set-Associative I-Cache with Extended BTB

A Low Energy Set-Associative I-Cache with Extended BTB A Low Energy Set-Associative I-Cache with Extended BTB Koji Inoue, Vasily G. Moshnyaga Dept. of Elec. Eng. and Computer Science Fukuoka University 8-19-1 Nanakuma, Jonan-ku, Fukuoka 814-0180 JAPAN {inoue,

More information

An Algorithm for Register-Synchronized Precomputation In Intelligent Memory Systems

An Algorithm for Register-Synchronized Precomputation In Intelligent Memory Systems Purdue University Purdue e-pubs ECE Technical Reports Electrical and Computer Engineering 1-1-2003 An Algorithm for Register-Synchronized Precomputation In Intelligent Memory Systems Wessam Hassanein José

More information

Way-Predicting Cache and Pseudo- Associative Cache for High Performance and Low Energy Consumption

Way-Predicting Cache and Pseudo- Associative Cache for High Performance and Low Energy Consumption Way-Predicting Cache and Pseudo- Associative Cache for High Performance and Low Energy Consumption Xiaomin Ding, Minglei Wang (Team 28) Electrical and Computer Engineering, University of Florida Gainesville,

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

Cache Pipelining with Partial Operand Knowledge

Cache Pipelining with Partial Operand Knowledge Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin - Madison {egunadi,mikko}@ece.wisc.edu Abstract

More information

Using a Serial Cache for. Energy Efficient Instruction Fetching

Using a Serial Cache for. Energy Efficient Instruction Fetching Using a Serial Cache for Energy Efficient Instruction Fetching Glenn Reinman y Brad Calder z y Computer Science Department, University of California, Los Angeles z Department of Computer Science and Engineering,

More information

The Predictability of Computations that Produce Unpredictable Outcomes

The Predictability of Computations that Produce Unpredictable Outcomes The Predictability of Computations that Produce Unpredictable Outcomes Tor Aamodt Andreas Moshovos Paul Chow Department of Electrical and Computer Engineering University of Toronto {aamodt,moshovos,pc}@eecg.toronto.edu

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Speculative Execution for Hiding Memory Latency

Speculative Execution for Hiding Memory Latency Speculative Execution for Hiding Memory Latency Alex Pajuelo, Antonio Gonzalez and Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Barcelona-Spain {mpajuelo,

More information

SEVERAL studies have proposed methods to exploit more

SEVERAL studies have proposed methods to exploit more IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 4, APRIL 2005 1 The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture Resit Sendag, Member, IEEE, Ying

More information

One-Level Cache Memory Design for Scalable SMT Architectures

One-Level Cache Memory Design for Scalable SMT Architectures One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract

More information

Principles in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008

Principles in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008 Principles in Computer Architecture I CSE 240A (Section 631684) CSE 240A Homework Three November 18, 2008 Only Problem Set Two will be graded. Turn in only Problem Set Two before December 4, 2008, 11:00am.

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information

Data Access History Cache and Associated Data Prefetching Mechanisms

Data Access History Cache and Associated Data Prefetching Mechanisms Data Access History Cache and Associated Data Prefetching Mechanisms Yong Chen 1 chenyon1@iit.edu Surendra Byna 1 sbyna@iit.edu Xian-He Sun 1, 2 sun@iit.edu 1 Department of Computer Science, Illinois Institute

More information

Reducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses

Reducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the. Reducing Miss Penalty: Read Priority over Write on Miss Write buffers may offer RAW

More information

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 CHERRY: CHECKPOINTED EARLY RESOURCE RECYCLING José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 1 2 3 MOTIVATION Problem: Limited processor resources Goal: More

More information

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical

More information

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications Byung In Moon, Hongil Yoon, Ilgu Yun, and Sungho Kang Yonsei University, 134 Shinchon-dong, Seodaemoon-gu, Seoul

More information

for Energy Savings in Set-associative Instruction Caches Alexander V. Veidenbaum

for Energy Savings in Set-associative Instruction Caches Alexander V. Veidenbaum Simultaneous Way-footprint Prediction and Branch Prediction for Energy Savings in Set-associative Instruction Caches Weiyu Tang Rajesh Gupta Alexandru Nicolau Alexander V. Veidenbaum Department of Information

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example 1 Which is the best? 2 Lecture 05 Performance Metrics and Benchmarking 3 Measuring & Improving Performance (if planes were computers...) Plane People Range (miles) Speed (mph) Avg. Cost (millions) Passenger*Miles

More information

Dynamic Cache Partitioning for CMP/SMT Systems

Dynamic Cache Partitioning for CMP/SMT Systems Dynamic Cache Partitioning for CMP/SMT Systems G. E. Suh (suh@mit.edu), L. Rudolph (rudolph@mit.edu) and S. Devadas (devadas@mit.edu) Massachusetts Institute of Technology Abstract. This paper proposes

More information

Program Phase Directed Dynamic Cache Way Reconfiguration for Power Efficiency

Program Phase Directed Dynamic Cache Way Reconfiguration for Power Efficiency Program Phase Directed Dynamic Cache Reconfiguration for Power Efficiency Subhasis Banerjee Diagnostics Engineering Group Sun Microsystems Bangalore, INDIA E-mail: subhasis.banerjee@sun.com Surendra G

More information

IMPLEMENTING HARDWARE MULTITHREADING IN A VLIW ARCHITECTURE

IMPLEMENTING HARDWARE MULTITHREADING IN A VLIW ARCHITECTURE IMPLEMENTING HARDWARE MULTITHREADING IN A VLIW ARCHITECTURE Stephan Suijkerbuijk and Ben H.H. Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science

More information

for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami

for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami 3D Implemented dsram/dram HbidC Hybrid Cache Architecture t for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami Kyushu University

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Pipeline Thoai Nam Outline Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Reference: Computer Architecture: A Quantitative Approach, John L Hennessy

More information

Speculation Control for Simultaneous Multithreading

Speculation Control for Simultaneous Multithreading Speculation Control for Simultaneous Multithreading Dongsoo Kang Dept. of Electrical Engineering University of Southern California dkang@usc.edu Jean-Luc Gaudiot Dept. of Electrical Engineering and Computer

More information

Dynamic Speculative Precomputation

Dynamic Speculative Precomputation In Proceedings of the 34th International Symposium on Microarchitecture, December, 2001 Dynamic Speculative Precomputation Jamison D. Collins y, Dean M. Tullsen y, Hong Wang z, John P. Shen z y Department

More information

Precise Instruction Scheduling

Precise Instruction Scheduling Journal of Instruction-Level Parallelism 7 (2005) 1-29 Submitted 10/2004; published 04/2005 Precise Instruction Scheduling Gokhan Memik Department of Electrical and Computer Engineering Northwestern University

More information

Energy Efficient Asymmetrically Ported Register Files

Energy Efficient Asymmetrically Ported Register Files Energy Efficient Asymmetrically Ported Register Files Aneesh Aggarwal ECE Department University of Maryland College Park, MD 20742 aneesh@eng.umd.edu Manoj Franklin ECE Department and UMIACS University

More information

Designing and Optimizing the Fetch Unit for a RISC Core

Designing and Optimizing the Fetch Unit for a RISC Core Journal of Computer and Robotics (200) 3-25 3 Designing and Optimizing the Fetch Unit for a RISC Core Mojtaba Shojaei, Bahman Javadi *, Mohammad Kazem Akbari, Farnaz Irannejad Computer Engineering and

More information

Chip-Multithreading Systems Need A New Operating Systems Scheduler

Chip-Multithreading Systems Need A New Operating Systems Scheduler Chip-Multithreading Systems Need A New Operating Systems Scheduler Alexandra Fedorova Christopher Small Daniel Nussbaum Margo Seltzer Harvard University, Sun Microsystems Sun Microsystems Sun Microsystems

More information

Using Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation

Using Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation Using Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation Houman Homayoun + houman@houman-homayoun.com ABSTRACT We study lazy instructions. We define lazy instructions as those spending

More information

Prefetching-Aware Cache Line Turnoff for Saving Leakage Energy

Prefetching-Aware Cache Line Turnoff for Saving Leakage Energy ing-aware Cache Line Turnoff for Saving Leakage Energy Ismail Kadayif Mahmut Kandemir Feihui Li Dept. of Computer Engineering Dept. of Computer Sci. & Eng. Dept. of Computer Sci. & Eng. Canakkale Onsekiz

More information

INSTRUCTION LEVEL PARALLELISM

INSTRUCTION LEVEL PARALLELISM INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson,

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Thoai Nam Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Reference: Computer Architecture: A Quantitative Approach, John L Hennessy & David a Patterson,

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

Targeted Data Prefetching

Targeted Data Prefetching Targeted Data Prefetching Weng-Fai Wong Department of Computer Science, and Singapore-MIT Alliance National University of Singapore 3 Science Drive 2, Singapore 117543 wongwf@comp.nus.edu.sg Abstract.

More information

The MIRV SimpleScalar/PISA Compiler

The MIRV SimpleScalar/PISA Compiler March 29, 2 12:56 pm 1 Abstract The MIRV SimpleScalar/PISA Compiler Matthew Postiff, David Greene, Charles Lefurgy, Dave Helder and Trevor Mudge {postiffm,greened,lefurgy,dhelder,tnm}@eecs.umich.edu EECS

More information

Reducing Reorder Buffer Complexity Through Selective Operand Caching

Reducing Reorder Buffer Complexity Through Selective Operand Caching Appears in the Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2003 Reducing Reorder Buffer Complexity Through Selective Operand Caching Gurhan Kucuk Dmitry Ponomarev

More information

The Limits of Speculative Trace Reuse on Deeply Pipelined Processors

The Limits of Speculative Trace Reuse on Deeply Pipelined Processors The Limits of Speculative Trace Reuse on Deeply Pipelined Processors Maurício L. Pilla, Philippe O. A. Navaux Computer Science Institute UFRGS, Brazil fpilla,navauxg@inf.ufrgs.br Amarildo T. da Costa IME,

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

A Quantitative/Qualitative Study for Optimal Parameter Selection of a Superscalar Processor using SimpleScalar

A Quantitative/Qualitative Study for Optimal Parameter Selection of a Superscalar Processor using SimpleScalar A Quantitative/Qualitative Study for Optimal Parameter Selection of a Superscalar Processor using SimpleScalar Abstract Waseem Ahmad, Enrico Ng {wahmad1, eng3}@uic.edu Department of Electrical and Computer

More information

Instruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers

Instruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers Instruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers Joseph J. Sharkey, Dmitry V. Ponomarev Department of Computer Science State University of New York Binghamton, NY 13902 USA

More information

Simultaneously Improving Code Size, Performance, and Energy in Embedded Processors

Simultaneously Improving Code Size, Performance, and Energy in Embedded Processors Simultaneously Improving Code Size, Performance, and Energy in Embedded Processors Ahmad Zmily and Christos Kozyrakis Electrical Engineering Department, Stanford University zmily@stanford.edu, christos@ee.stanford.edu

More information

Low-Complexity Reorder Buffer Architecture*

Low-Complexity Reorder Buffer Architecture* Low-Complexity Reorder Buffer Architecture* Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower

More information

Improving Cloud Application Performance with Simulation-Guided CPU State Management

Improving Cloud Application Performance with Simulation-Guided CPU State Management Improving Cloud Application Performance with Simulation-Guided CPU State Management Mathias Gottschlag, Frank Bellosa April 23, 2017 KARLSRUHE INSTITUTE OF TECHNOLOGY (KIT) - OPERATING SYSTEMS GROUP KIT

More information

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:

More information

Eric Rotenberg Karthik Sundaramoorthy, Zach Purser

Eric Rotenberg Karthik Sundaramoorthy, Zach Purser Karthik Sundaramoorthy, Zach Purser Dept. of Electrical and Computer Engineering North Carolina State University http://www.tinker.ncsu.edu/ericro ericro@ece.ncsu.edu Many means to an end Program is merely

More information

Cost-Driven Hybrid Configuration Prefetching for Partial Reconfigurable Coprocessor

Cost-Driven Hybrid Configuration Prefetching for Partial Reconfigurable Coprocessor Cost-Driven Hybrid Configuration Prefetching for Partial Reconfigurable Coprocessor Ying Chen, Simon Y. Chen 2 School of Engineering San Francisco State University 600 Holloway Ave San Francisco, CA 9432

More information

Filtering of Unnecessary Branch Predictor Lookups for Low-power Processor Architecture *

Filtering of Unnecessary Branch Predictor Lookups for Low-power Processor Architecture * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 24, 1127-1142 (2008) Filtering of Unnecessary Branch Predictor Lookups for Low-power Processor Architecture * Department of Computer Science National Chiao

More information

CSAIL. Computer Science and Artificial Intelligence Laboratory. Massachusetts Institute of Technology

CSAIL. Computer Science and Artificial Intelligence Laboratory. Massachusetts Institute of Technology CSAIL Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Dynamic Cache Partioning for Simultaneous Multithreading Systems Ed Suh, Larry Rudolph, Srinivas Devadas

More information

Accelerating and Adapting Precomputation Threads for Efficient Prefetching

Accelerating and Adapting Precomputation Threads for Efficient Prefetching In Proceedings of the 13th International Symposium on High Performance Computer Architecture (HPCA 2007). Accelerating and Adapting Precomputation Threads for Efficient Prefetching Weifeng Zhang Dean M.

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

PMPM: Prediction by Combining Multiple Partial Matches

PMPM: Prediction by Combining Multiple Partial Matches 1 PMPM: Prediction by Combining Multiple Partial Matches Hongliang Gao Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {hgao, zhou}@cs.ucf.edu Abstract

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

INTERACTION COST: FOR WHEN EVENT COUNTS JUST DON T ADD UP

INTERACTION COST: FOR WHEN EVENT COUNTS JUST DON T ADD UP INTERACTION COST: FOR WHEN EVENT COUNTS JUST DON T ADD UP INTERACTION COST HELPS IMPROVE PROCESSOR PERFORMANCE AND DECREASE POWER CONSUMPTION BY IDENTIFYING WHEN DESIGNERS CAN CHOOSE AMONG A SET OF OPTIMIZATIONS

More information