Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research


Joel Hestness, Lenni Kuff
Computer Science Department, University of Wisconsin, Madison, WI

Abstract

Present-day computer architects use advanced microarchitecture simulators to test the performance of processor designs. The simulator workloads are generally benchmarks that are representative of specific types of real-world applications. Because microarchitecture implementations keep increasing in complexity and the simulation workloads must represent complicated applications, simulation time has greatly increased. To address this problem, researchers are looking for ways to reduce the amount of time benchmarks run while maintaining the workload characteristics of the longer benchmarks. MinneSPEC is a representative reduction of SPEC2000, with the reduced input sets found using SimpleScalar profiling tools [1]. With the release of SPEC CPU2006, new benchmarks have been added to the SPEC suite that will be used to evaluate the performance of tomorrow's microprocessors. These benchmarks are considerably larger than those in SPEC2000, and using SimpleScalar to profile their workloads would take a large amount of time and effort. This paper suggests a different reduction technique, which gathers profiling information from processor performance counters accessed through PAPI. Since workloads run on a native system instead of a simulator, profiling information can be gathered in much less time. This allows fine-grained tuning of reduced input sets, so more representative reduced benchmarks can be found more quickly. Using this technique, we were able to reduce five SPEC2006 benchmarks to under 10 billion instructions while still achieving over 95% Chi Square confidence for the instruction mix between the reduced input set and the reference input set.
While further work in this area needs to be done, the preliminary results of our research show that PAPI-based benchmark reduction may be a much more effective reduction technique than SimpleScalar alone.

1 Introduction

Computer processor design has changed rapidly since the 1980s. Instead of basing microarchitecture design decisions on the advice of experts and rules of thumb, designers now run directed workloads on advanced simulation tools [2]. Running benchmarks is an extremely important part of early design performance evaluation [3]. Because of the increasing complexity of microarchitectures and of the workloads executed on them, simulation times have grown very long. Designers must now choose between running a few long, more representative benchmarks and running many short, less representative benchmarks [1].

This work proposes and implements the use of performance counters available in modern microprocessors to collect information about the instruction mix; branch predictor, cache, and translation lookaside buffer (TLB) performance; and instructions per cycle (IPC) for the reference input sets of the SPEC CPU2006 benchmark suite. This information is statistically compared to that of custom input sets to find representative reduced input sets. Techniques for finding reduced input sets are described in Methodology. Further, using performance counters on native hardware greatly speeds data collection compared with a simulator such as SimpleScalar [6], which was used for MinneSPEC.

2 Related Work

SPEC CPU [4] is a popular benchmarking suite widely used in the computer architecture industry. When SPEC released CPU2000, the benchmark run times increased greatly over the previous version. For example, the SPEC CPU2000 benchmark parser with the reference input executes over 500 billion instructions, which would take about eight weeks to simulate at 100,000 instructions/sec. This was a problem for computer architects who wanted to run simulations of the SPEC2000 benchmarks on their designs. The magnitude of the SPEC2000 suite prompted researchers at the University of Minnesota to develop a set of benchmarks called MinneSPEC [1]. MinneSPEC is a reduced input set version of the SPEC CPU2000 benchmark suite that maintains program characteristics very similar to those of the original SPEC2000 suite.

With the release of SPEC CPU2006, new benchmarks have been added that designers will want to use when simulating future processor designs. These new benchmarks have effective reference set instruction counts that are almost an order of magnitude larger than those of SPEC2000, so it would be beneficial to extend the MinneSPEC research and reduce the input sets for the benchmarks in SPEC CPU2006.

The paper [2] describes the use of Principal Component Analysis to cluster the benchmarks within a suite in order to further reduce the suite by finding reduced input sets that are representative of multiple benchmarks instead of just a single one. This type of analysis was beyond the scope of our project due to the limited time available.

3 Design

This work targets reduction of the SPEC CINT2006 benchmarks due to the time constraints on the project, as well as results from [1].
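
The time constraints mentioned above follow directly from simulator throughput. As a back-of-the-envelope sketch, the snippet below converts an instruction count and a simulation rate into wall-clock days, using the parser figures quoted in Related Work (over 500 billion instructions at 100,000 instructions/sec) and the 10 billion instruction ceiling targeted for reduced inputs. The function name and the constant rate are illustrative assumptions; real simulator throughput varies with the model and the workload.

```python
SECONDS_PER_DAY = 86_400


def simulation_days(instructions: float, instructions_per_second: float) -> float:
    """Wall-clock days needed to simulate `instructions` at a fixed rate."""
    if instructions_per_second <= 0:
        raise ValueError("simulation rate must be positive")
    return instructions / instructions_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = 100_000  # instructions/sec, the figure quoted in Related Work
    workloads = [
        ("reference-sized input (500 billion instructions)", 500e9),
        ("reduced input (10 billion instructions)", 10e9),
    ]
    for label, count in workloads:
        print(f"{label}: {simulation_days(count, rate):.1f} days")
```

At this assumed rate, a 500 billion instruction run occupies a simulator for close to two months, while a 10 billion instruction run finishes in a little over a day.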
In contrast to the MinneSPEC project, we use an open-source package called PAPI [5], from the Innovative Computing Laboratory at the University of Tennessee, which provides a library for accessing the performance counters of many modern microprocessors. We used a 3.0 GHz Pentium 4 processor with 1 GB of memory for data collection. Collecting data on native hardware sped up our runs by a factor of about 1000 over SimpleScalar. In addition, we ran a number of the reference and reduced input set benchmarks on SimpleScalar to validate our reduced input sets on the Alpha instruction set architecture. We used the PAPI preset performance counters for the Pentium 4, as well as sim-profile and sim-cache in SimpleScalar.

4 Methodology

The data that we collected includes the instruction mix; the performance of the cache, branch predictor, and TLB; and the IPC for a run of the benchmark. The instruction mix is defined as the percentages of the following instruction classes:

  Loads
  Stores
  Unconditional Branches
  Conditional Branches
  Integer Instructions
  Floating Point Instructions
  Miscellaneous (Vector, Trap, etc.)

Data cache miss rate information was not available through PAPI, so we relied on the results of sim-cache in SimpleScalar to make comparisons. The branch predictor results came from PAPI counters and included branches predicted taken, mispredicted taken, predicted not taken, and mispredicted not taken. We also collected data TLB misses. Finally, the IPC was collected using PAPI and could, in future work, also be collected using SimpleScalar.

Using the aforementioned data, we were able to reduce five of the CINT benchmarks with the following technique. We used the instruction mix as the major factor in searching for a reduced input set; the other factors were instructions per cycle and branch predictor, cache, and TLB performance.
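
To make the instruction mix definition concrete, the sketch below turns raw per-class instruction counts into percentages. The class names mirror the list in this section; the function name is hypothetical, and the counts in the usage example are invented placeholders, not measurements from this study.

```python
# The seven instruction classes that make up the instruction mix.
MIX_CLASSES = (
    "loads",
    "stores",
    "unconditional_branches",
    "conditional_branches",
    "integer",
    "floating_point",
    "miscellaneous",
)


def instruction_mix(counts):
    """Convert raw per-class instruction counts into percentages of the total."""
    missing = [name for name in MIX_CLASSES if name not in counts]
    if missing:
        raise KeyError(f"missing instruction classes: {missing}")
    total = sum(counts[name] for name in MIX_CLASSES)
    if total <= 0:
        raise ValueError("total instruction count must be positive")
    return {name: 100.0 * counts[name] / total for name in MIX_CLASSES}


if __name__ == "__main__":
    # Placeholder counts for illustration only.
    example_counts = {
        "loads": 250,
        "stores": 100,
        "unconditional_branches": 30,
        "conditional_branches": 120,
        "integer": 450,
        "floating_point": 10,
        "miscellaneous": 40,
    }
    for name, pct in instruction_mix(example_counts).items():
        print(f"{name:24s}{pct:6.2f}%")
```

Working in percentages rather than raw counts lets a reference run and a much shorter candidate run be compared directly.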
When the instruction mix had been collected for the reference set and a trial reduced set, the two mixes were compared using the Chi Square test, with the reference as the expected value, to find our confidence in their dependence. Our target for Chi Square confidence was 90%, per the MinneSPEC results. Further, we wished to reduce the instruction count to between 1 and 10 billion instructions, a slight increase over the target size for MinneSPEC, due to Moore's law and the large increase in size from the SPEC2000 benchmarks to the SPEC2006 suite. Finally, a qualitative analysis of cache, branch predictor, and TLB performance, and of IPC, is used to fully compare reference to reduced benchmarks. The Results and Discussion sections explain how these parts fit together, as well as what would be required to complete this work.

To find a reduced input set, we employed an iterative process. First, a benchmark was selected and run to collect data for the reference, test, and train input sets provided by SPEC. The test and train sets were often very useful as a starting point for the iterative process. The iterative part of the reduction follows one of two distinct paths: (1) change the command line input to the benchmark, or (2) change the input files. For instance, the benchmark 462.libquantum, described later, takes only command line input, while 401.bzip2 compresses data files, so to reduce its input we tried swapping out or changing the contents of its input files. Finally, we checked the data collected, and if the candidate input set met our requirements, we ran that set on SimpleScalar. Otherwise, we iterated the process again. The speed of native hardware allowed us to make quick changes to the trial input and keep momentum, whereas with SimpleScalar we would have had to wait an hour or more for candidate reduced input data to be returned.

5 Results

Figure 1 shows the results of our reduction process for the benchmarks 401.bzip2, 429.mcf, 456.hmmer, 458.sjeng, and 462.libquantum. The run lengths range between 1 and 7.5 billion instructions, which, depending on the benchmark, equates to reductions from the reference size of two to four orders of magnitude. The Alpha run lengths are supplied where we had sufficient time to collect them using SimpleScalar. The three reduced sets that ran to completion on SimpleScalar met the Chi Square confidence target for the instruction mix.

Benchmark      | Type | Description                | Reduction Method                      | Inst Mix Match | Reduction Factor (x86) | Run Length          | ISA
401.bzip2      | Int  | Compression                | Modified command line, new input file | Yes            | 426x                   | 6.19 / 7.52 billion | x86 / Alpha
456.hmmer      | Int  | Search Gene Sequence       | Modified command line                 | Yes            | 735x                   | 4.43 billion        | x86
462.libquantum | Int  | Quantum Computing          | Modified command line                 | Yes            | 3854x                  | 1.00 / 1.01 billion | x86 / Alpha
429.mcf        | Int  | Combinatorial Optimization | Same as test                          | Yes            | 95x                    | 4.73 / 4.88 billion | x86 / Alpha

Figure 1: SPEC CINT2006 Benchmark Reduction Results.

The distribution of instructions among the instruction classes differed greatly between the x86 and Alpha architectures. For instance, the percentage of integer instructions on the x86 was much smaller than the percentage of integer instructions on the Alpha for most benchmarks. This could be due to integer operations embedded within a single x86 instruction, as in some loop instructions. The difference is also reflected in the total instruction counts for each benchmark on the two architectures.

Figure 2 and Figure 3 present example data for 401.bzip2 pertaining to the TLB miss rate per instruction and IPC values. Since the IPC was one of our factors in deciding how well a candidate reduced input set matched the reference input set, observations about correlations within the data guided our reduction process. Figure 4 presents the data cache miss rate for the 429.mcf benchmark over multiple cache sizes, as collected from SimpleScalar. More information on these data components is included in the Discussion.
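
The acceptance check at the heart of the iterative search described in Methodology can be sketched as follows. The paper does not spell out how its Chi Square confidence figure was computed, so this block is an assumption: it uses the conventional Pearson statistic on percentage mixes, with the reference mix as the expected values, and treats a candidate as matching when the statistic stays below the standard 0.05 critical value for the corresponding degrees of freedom. The function names and the sample mixes are hypothetical.

```python
# Upper-tail critical values of the chi-square distribution at alpha = 0.05,
# indexed by degrees of freedom (standard table values).
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    if len(observed) != len(expected) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 classes")
    if any(e <= 0 for e in expected):
        raise ValueError("expected values must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def mixes_match(reduced_pct, reference_pct):
    """True when the reduced mix is not significantly different from the reference.

    Assumption: mixes are percentages per instruction class, and the test uses
    (number of classes - 1) degrees of freedom at alpha = 0.05.
    """
    dof = len(reference_pct) - 1
    return chi_square_statistic(reduced_pct, reference_pct) < CHI2_CRITICAL_05[dof]


if __name__ == "__main__":
    # Placeholder percentages for the seven instruction classes.
    reference = [25, 10, 3, 12, 45, 1, 4]
    close_candidate = [26, 9, 3, 13, 44, 1, 4]
    far_candidate = [40, 5, 3, 12, 35, 1, 4]
    print(mixes_match(close_candidate, reference))  # True
    print(mixes_match(far_candidate, reference))    # False
```

In an iterative search, a candidate input that fails this check would be modified and rerun on native hardware; only candidates that pass would be worth the cost of a SimpleScalar validation run.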

6 Discussion

Figure 2: The comparison of TLB misses per instruction and IPC for 401.bzip2 for two candidate input sets and the reference, train, and reduced input sets. (Columns: Candid 1, Reduced, Candid 2, Train, Ref; rows: TLB Misses Per Instruction, IPC.)

Figure 3: A comparison of TLB misses per instruction and IPC for the 401.bzip2 benchmark (IPC plotted against TLB Misses Per Instruction). Notice that as the TLB miss rate increases, the IPC increases.

We can see from the bzip2 results in Figure 2 and Figure 3 that the TLB misses per instruction for the reduced benchmarks were high. We attributed this to the overhead of filling the TLB at the beginning of the run: shorter runs have a higher number of TLB misses per instruction than longer runs, which miss infrequently as the run goes on. This increase in TLB miss rate as the size of the input set decreased is reflected in the IPC difference between the reference and reduced input sets. File size, file type, and the length of the benchmark run were used as guides to finding a reduced input set that maintained the desired instruction mix and count characteristics while avoiding a large IPC difference. The quick access to TLB and IPC data that PAPI provides is also beneficial for guiding the reduction process.

The data cache miss rate collected in SimpleScalar must be analyzed qualitatively when comparing the reference set to the reduced input set. As can be seen in Figure 4, as the cache size increases, the miss rate decreases logarithmically. However, when comparing the reference data and the reduced data, we see that the reduced input set is more sensitive to a change in cache size. This is mainly due to the memory footprint difference between the reference input and the reduced input. The reduced input set has a smaller memory footprint, so it responds better to more cache. In addition, the poor cache miss rates for the reduced input set with a small cache can probably be attributed to worse data access locality compared with the reference input. The other two benchmarks for which this data was collected showed similar differences in cache sensitivity between the reference and reduced input sets.

Figure 4: Data cache miss rate for the Test/Reduced and Reference 429.mcf input files. The smaller reduced set is more sensitive to cache size changes.

This work needs to be extended to include a function profile collected with a tool like gprof or oprofile. Other research in benchmark reduction uses a function profile to compare reference and reduced input sets. The function profile is the list of C-like functions that are called during execution of the program, along with the percentage of time spent in each. This profile gives a higher-level comparison of how a program executes with the reference and reduced input sets. Relative profile statistics should be less architecture dependent than the instruction mix characteristics that we collected, because a function will be executed on any architecture given the same input, while particular instructions are not necessarily included in every ISA. We found that PAPI cannot be used concurrently with gprof, so, given our time constraints, we were not able to collect function profiles for the benchmarks that we worked on.

We did run across an anomaly when using PAPI to collect branch information. During a test run, we found that the conditional branch count exceeded the total branch count. We tried multiple means of counting the same information, but the results remained the same. This bug needs to be looked into, but on the Pentium 4 the results did not sway the Chi Square test results enough for us to spend more time evaluating it.
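
An anomaly like the one above can be caught automatically with a simple consistency check on the collected counters. The sketch below is illustrative only: the dictionary keys borrow PAPI preset event names for readability (which events were actually read in this study is not stated), the function name is hypothetical, and the counter values are invented.

```python
def find_counter_anomalies(counters):
    """Return messages for branch counters that exceed the total branch count.

    `counters` maps event names to counts. A subset of branches (conditional
    or unconditional) should never be larger than the total.
    """
    anomalies = []
    total = counters["PAPI_BR_INS"]  # total branch instructions
    for part in ("PAPI_BR_CN", "PAPI_BR_UCN"):  # conditional, unconditional
        if part in counters and counters[part] > total:
            anomalies.append(
                f"{part} ({counters[part]}) exceeds PAPI_BR_INS ({total})"
            )
    return anomalies


if __name__ == "__main__":
    # Invented values reproducing the shape of the anomaly described above.
    suspicious = {"PAPI_BR_INS": 100, "PAPI_BR_CN": 120}
    for message in find_counter_anomalies(suspicious):
        print("anomaly:", message)
```

Running a check of this kind after every profiling run would flag suspect data before it feeds into the instruction mix comparison.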
As a brief aside, a couple of the benchmarks that we attempted to reduce continued to show a Chi Square confidence of more than 95% as we continued to reduce their size to what could be considered microbenchmark size (e.g., 462.libquantum was reduced to fewer than 150 million instructions).

7 Conclusions

In the past, reducing benchmark input sets has been a long and tedious task. Every time an input set change was made, the person reducing the benchmark would have to wait for the simulation profile results to return, which could take over an hour. To allow for faster and more flexible benchmark reduction, we have used PAPI to access the processor performance counters and gather the instruction mix and other workload characteristics, such as TLB miss rate. Because the workload runs on a native system, the profile results are returned in minutes instead of hours or days. By using PAPI for workload characterization together with an interactive benchmark reduction process, we have successfully reduced five of the SPEC CPU2006 integer benchmarks. Because PAPI returned profile results in such a short amount of time, we were able to make fine-grained changes to our reduced input set to home in on a representative reduced input set for the benchmark.

Our initial results from using this reduction technique seem very promising. We see no obstacles to continuing on and reducing the remaining SPEC CPU2006 integer and floating point benchmarks in this manner. In fact, this technique could be applied to reducing other benchmarking suites as well.

8 References

[1] A. J. KleinOsowski and D. J. Lilja, "MinneSPEC: A New SPEC Benchmark Workload for Simulation Based Computer Architecture Research," Computer Architecture Letters, IEEE, vol. 1, pp. 7-7.
[2] L. Eeckhout, H. Vandierendonck and K. De Bosschere, "Designing Computer Architecture Research Workloads," Computer, vol. 36, pp.
[3] P. Bose and T. M. Conte, "Performance Analysis and its Impact on Design," Computer, vol. 31, pp.
[4] Standard Performance Evaluation Corporation. SPEC benchmark suites. Various benchmark suites available. Details available at
[5] Performance Application Programming Interface (PAPI). Innovative Computing Laboratory. Available at
[6] Todd Austin, Eric Larson, and Dan Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. IEEE Computer, pages 59-67, February
[7] David J. Lilja. Measuring Computer Performance: A Practitioner's Guide. Cambridge University Press, New York, NY,
[8] W.C. Hsu et al., "On the Predictability of Program Behavior Using Different Input Data Sets," Proc. 6th Workshop on Interaction between Compilers and Computer Architectures (INTERACT 02), held in conjunction with the 8th Int'l Symp. High Performance Computer Architecture (HPCA 8), IEEE CS Press, 2002, pp.

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 s Joshua J. Yi and David J. Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Improving Data Cache Performance via Address Correlation: An Upper Bound Study

Improving Data Cache Performance via Address Correlation: An Upper Bound Study Improving Data Cache Performance via Address Correlation: An Upper Bound Study Peng-fei Chuang 1, Resit Sendag 2, and David J. Lilja 1 1 Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Accurate Cache and TLB Characterization Using Hardware Counters

Accurate Cache and TLB Characterization Using Hardware Counters Accurate Cache and TLB Characterization Using Hardware Counters Jack Dongarra, Shirley Moore, Philip Mucci, Keith Seymour, and Haihang You Innovative Computing Laboratory, University of Tennessee Knoxville,

More information

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department

More information

Feasibility of Combined Area and Performance Optimization for Superscalar Processors Using Random Search

Feasibility of Combined Area and Performance Optimization for Superscalar Processors Using Random Search Feasibility of Combined Area and Performance Optimization for Superscalar Processors Using Random Search S. van Haastregt LIACS, Leiden University svhaast@liacs.nl P.M.W. Knijnenburg Informatics Institute,

More information

Scheduling the Intel Core i7

Scheduling the Intel Core i7 Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

ECE404 Term Project Sentinel Thread

ECE404 Term Project Sentinel Thread ECE404 Term Project Sentinel Thread Alok Garg Department of Electrical and Computer Engineering, University of Rochester 1 Introduction Performance degrading events like branch mispredictions and cache

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Evaluation of Branch Prediction Strategies

Evaluation of Branch Prediction Strategies 1 Evaluation of Branch Prediction Strategies Anvita Patel, Parneet Kaur, Saie Saraf Department of Electrical and Computer Engineering Rutgers University 2 CONTENTS I Introduction 4 II Related Work 6 III

More information

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Performance Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Defining Performance (1) Which airplane has the best performance? Boeing 777 Boeing

More information

Method-Level Phase Behavior in Java Workloads

Method-Level Phase Behavior in Java Workloads Method-Level Phase Behavior in Java Workloads Andy Georges, Dries Buytaert, Lieven Eeckhout and Koen De Bosschere Ghent University Presented by Bruno Dufour dufour@cs.rutgers.edu Rutgers University DCS

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Characterization of Native Signal Processing Extensions

Characterization of Native Signal Processing Extensions Characterization of Native Signal Processing Extensions Jason Law Department of Electrical and Computer Engineering University of Texas at Austin Austin, TX 78712 jlaw@mail.utexas.edu Abstract Soon if

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks

Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks Ajay Joshi 1, Lieven Eeckhout 2, Robert H. Bell Jr. 3, and Lizy John 1 1 - Department of Electrical and Computer

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

A Study for Branch Predictors to Alleviate the Aliasing Problem

A Study for Branch Predictors to Alleviate the Aliasing Problem A Study for Branch Predictors to Alleviate the Aliasing Problem Tieling Xie, Robert Evans, and Yul Chu Electrical and Computer Engineering Department Mississippi State University chu@ece.msstate.edu Abstract

More information

Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors

Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors Dan Nicolaescu Alex Veidenbaum Alex Nicolau Dept. of Information and Computer Science University of California at Irvine

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Design of Experiments - Terminology

Design of Experiments - Terminology Design of Experiments - Terminology Response variable Measured output value E.g. total execution time Factors Input variables that can be changed E.g. cache size, clock rate, bytes transmitted Levels Specific

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

Using Statistical Simulation for Studying Compiler-Microarchitecture Interactions

Using Statistical Simulation for Studying Compiler-Microarchitecture Interactions 1 Using Statistical Simulation for Studying Compiler-Microarchitecture Interactions Lieven Eeckhout John Cavazos ELIS Department, Ghent University, Belgium School of Informatics, University of Edinburgh,

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example 1 Which is the best? 2 Lecture 05 Performance Metrics and Benchmarking 3 Measuring & Improving Performance (if planes were computers...) Plane People Range (miles) Speed (mph) Avg. Cost (millions) Passenger*Miles

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information
