A Quantitative/Qualitative Study for Optimal Parameter Selection of a Superscalar Processor using SimpleScalar

A Quantitative/Qualitative Study for Optimal Parameter Selection of a Superscalar Processor using SimpleScalar

Waseem Ahmad, Enrico Ng
{wahmad1, eng3}@uic.edu
Department of Electrical and Computer Engineering
University of Illinois, Chicago

This report is submitted as a requirement for the term project of ECE 466. Course instructor: Prof. Gyungho Lee.

Abstract

Wide-issue superscalar processors are very complex machines, and the SimpleScalar toolset makes simulating selected parameters of such processors much easier. Selecting optimal parameter values for a processor aimed at a specific target environment requires critical quantitative analysis of SimpleScalar simulations, augmented with qualitative analysis in the form of a rational cost/performance model. We present one such exercise: we build a qualitatively analyzed cost/performance model, apply it to the quantitative results of SimpleScalar simulations, and arrive at a processor model, based on a restricted set of parameters, that gives reasonable performance at reduced cost. The target environment for such processors may range from desktop to embedded systems.

1 Introduction

SimpleScalar is the most popular simulation toolset for performance studies of superscalar microprocessors. We use this toolset to develop a basic framework for optimal parameter selection for a superscalar processor, where "optimal" refers to a rational trade-off between cost and performance. The procedure is as follows. We first run sim-outorder (from the SimpleScalar toolset; details are given in the next section) on two SPEC2000 integer programs, gcc (a compiler) and vortex (a database application), over varying configurations of selected parameters, with a fixed 64 KB I-cache, a 16 KB D-cache, and the assumption of perfect branch prediction. The parameters we experiment with are instruction issue width (and the number of functional units), Register Update Unit (RUU) size, and D-cache associativity. To weigh the quantitative and qualitative aspects of performance bought at the expense of more hardware, we propose a cost model. The proposed cost model enables a rational trade-off between cost and performance and lets us pick the configuration that performs best under this model as the candidate for the next set of experiments. For this candidate configuration, we then experiment with varying D-cache sizes and Branch Target Buffer (BTB) sizes for branch prediction, now using the more practical bimodal predictor with configurations of increasing entry counts. Assuming that cycle time increases 10% with each doubling of the D-cache size (up to 128 KB), we try to decide the D-cache size, along with the BTB size, that provides the best cost/performance ratio.

The rest of the paper is organized as follows. We first give a brief introduction to the SimpleScalar toolset and the variety of tools it provides. We then present the first set of experiments and the conclusions drawn from them based on the proposed cost model, which is presented in the same section. After that we take up the second set of experiments and give a conclusive parameter selection based on the proposed cost model. We conclude by suggesting ways to improve the performance of the selected processor configuration.

2 SimpleScalar Background

The SimpleScalar toolset is an architectural research tool set including compiler, assembler, linker, simulation, and visualization tools for the SimpleScalar architecture [4]. With this tool set, the user can simulate real programs using fast execution-driven simulation. It is a flexible and accurate cycle-resolution simulator that implements a close derivative of the MIPS-IV Instruction Set Architecture (ISA). More precisely, the SimpleScalar instruction set (also called PISA, the "Portable Instruction Set Architecture") is a superset of MIPS with a few minor differences and additions. It can simulate binary programs on one of several processor simulators provided. These specialized simulators fall into four main categories.

Functional simulation. The fastest and most elementary simulator is sim-fast, which performs only functional simulation using in-order execution of the instructions (i.e., they are executed in the order they appear in the program). This simulator is optimized for raw speed and does not account for instruction checking or the existence of cache memory. A separate version of sim-fast, called sim-safe, also performs functional simulation but checks alignment and access permissions for each memory reference.

Profiling. SimpleScalar includes a functional simulator, sim-profile, that can provide detailed profiles of instruction classes and addresses, text symbols, memory accesses, branches, and data segment symbols.

Cache simulation. The tool set provides two functional cache simulators, sim-cache and sim-cheetah. These are ideal for fast simulation of architectures that include cache memory, provided the cache access time is not relevant to execution performance. They are useful for evaluating a variety of cache organizations.

Out-of-order processor timing simulation. The most complicated and detailed simulator in the tool set is sim-outorder, the one we use in this study. It performs out-of-order execution of instructions based on the Register Update Unit (a scheme that uses a reorder buffer to automatically rename registers and hold the results of pending instructions). Out-of-order execution (OOO) is what distinguishes a superscalar architecture from others, and it is where the name SimpleScalar comes from: it is a simple superscalar architecture.

In the sections to come, we use sim-outorder to run simulations for optimal parameter selection under both perfect and practical branch prediction schemes.

3 Parameter Selection for Perfect Branch Prediction

As mentioned earlier, in this set of experiments we define a set of configurations based upon three parameters: instruction issue width (and the number of functional units), RUU size, and D-cache associativity. The fixed parameters are:

- I-cache of 64 KB
- D-cache of 16 KB
- perfect branch prediction

The following assumptions are made about the increase in clock cycle time as certain parameter values grow:

- 10% by doubling the issue rate
- 10% by doubling the RUU size
- 2% by doubling the D-cache associativity
- 2% by doubling the number of functional units (including load/store units)

Forty-seven simulations were executed, first testing each setting separately and then testing combinations. The settings cover the ranges given in Table 1; a scripted sweep of this kind is sketched after the table.

Parameter              Range                              Flag            Base value
Branch prediction      perfect                            -bpred          perfect
I-cache size           64 KB (fixed)                      -cache:il1      il1:2048:32:1:l
Issue width            4, 16, 32, 64                      -issue:width    4
Fetch queue size       4, 16, 32, 64                      -fetch:ifqsize  4
Decode width           4, 16, 32, 64                      -decode:width   4
Commit width           4, 16, 32, 64                      -commit:width   4
RUU size               16, 32, 64, 128, 256, 512, 1024    -ruu:size       16
D-cache associativity  1, 2, 4, 8, 16, 32, 64             -cache:dl1      dl1:512:32:4:l
Integer ALUs           4, 8, 16                           -res:ialu       4
Integer multipliers    1, 2                               -res:imult      1
FP ALUs                4, 8, 16                           -res:fpalu      4
FP multipliers         1, 2                               -res:fpmult     1

Table 1. Ranges of values for the parameters in the first set of experiments. Branch prediction is assumed to be perfect.
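Although this report does not include our scripts, a sweep of this kind is straightforward to automate around sim-outorder. The sketch below is a minimal illustration rather than our actual procedure: the simulator path and the benchmark binaries and inputs (gcc.ss, vortex.ss, and their arguments) are placeholder names, and only the issue-width and RUU-size axes of Table 1 are shown.

```python
import itertools
import subprocess

# Placeholder paths: substitute your own sim-outorder build and benchmark binaries.
SIM = "./sim-outorder"
BENCHMARKS = {"gcc": ["gcc.ss", "input.i"], "vortex": ["vortex.ss", "lendian.raw"]}

# Two of the Table 1 axes; fetch/decode/commit widths track the issue width.
ISSUE_WIDTHS = [4, 16, 32, 64]
RUU_SIZES = [16, 32, 64, 128, 256, 512, 1024]

for name, cmd in BENCHMARKS.items():
    for width, ruu in itertools.product(ISSUE_WIDTHS, RUU_SIZES):
        args = [
            SIM,
            "-bpred", "perfect",                  # perfect branch prediction
            "-cache:il1", "il1:2048:32:1:l",      # fixed 64 KB I-cache
            "-cache:dl1", "dl1:512:32:4:l",       # base D-cache configuration
            "-fetch:ifqsize", str(width),         # IFQ size tracks issue width
            "-decode:width", str(width),
            "-issue:width", str(width),
            "-commit:width", str(width),
            "-ruu:size", str(ruu),
            "-redir:sim", f"{name}_w{width}_ruu{ruu}.out",  # statistics file
        ] + cmd
        subprocess.run(args, check=True)
```

Each run leaves its statistics in a separate file, from which IPC, miss rates, and queue-occupancy figures can be scraped for the analysis that follows.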

After the simulations were run, we pruned the results to select the configurations with the best execution times on the two benchmarks. We observed the following interesting patterns over the ranges and parameters shown in Table 1.

Figure 1. (a) Percentage of time the RUU is full versus RUU size; (b) cost (transistors and die area) versus RUU size. Clearly, for the given benchmarks there is no need to go beyond an RUU size of 16, as a larger RUU would not be utilized anyway. As expected, cost scales linearly with RUU size. Where the curve in the left-hand graph appears to dip below zero, this is an artifact of interpolation, not of actual values.

Figure 2. (a) Percentage of time the fetch queue is full versus issue width; (b) LSQ utilization versus issue width. Graph (a) clearly shows that there is no use going beyond an issue width of 4 to 8: most of the time the IFQ is not full anyway, so the increased issue width cannot be served properly. Graph (b) shows that vortex, being a database application, has more loads and stores and keeps the LSQ full most of the time compared to gcc, but in either case LSQ utilization decreases with increasing issue width. Note that LSQ utilization peaks at an issue width of 8, because a fixed LSQ size was used throughout this set of experiments.

Figure 3. (a) CPU execution time versus issue width; (b) cost versus issue width. Increasing the issue width increases CPU execution time almost linearly. Although CPI improves with issue width to a certain extent, the additional clock-cycle penalty dominates the improvement, effectively reducing performance as issue width grows. Moreover, the perfect branch prediction assumption means that control hazards do not reduce performance here; but since the functional units are still at their default values, structural hazards play their part in decreasing performance at higher issue widths. (We will see the effect of functional units combined with increased issue width shortly.)

Figure 4. D-cache associativity versus execution time. This gives a very insightful result: an associativity of 1 gives the best execution time. This corresponds to a direct-mapped cache, which also has the lowest cost, so it will be part of the final set of configurations when we take up the cost/performance model in detail.

Figure 5. D-cache associativity versus miss rate, and versus cost (in terms of die area and transistors). The graphs show clearly why increased associativity does not give better cost/performance. Increasing associativity improves performance by decreasing the miss rate (there are fewer conflict misses), but the cost paid to achieve this does not scale well with the performance gained. We can therefore predict that optimal parameter selection will involve low associativity values. This also illustrates that even though the miss rate is reduced, the CPU execution time, which is the true measure of performance, eventually increases.

Figure 6. Integer functional units versus CPU execution time. Increasing the number of functional units does not by itself enhance performance as long as the issue width and other parameters stay the same. Since the benchmarks are integer codes, the floating-point functional units do not affect performance.

These illustrations fully explain the logic behind our elimination criteria for the configurations under critical analysis. The configurations finally selected for cost/performance model evaluation are given in Table 2.

Config  bpred    issue:width  ruu:size  cache:dl1 (total size = 16 KB)
Base    perfect  4            16        dl1:512:32:4:l
A       perfect  4            16        dl1:512:32:2:l
B       perfect  4            16        dl1:512:32:1:l
C       perfect  4            32        dl1:512:32:4:l
D       perfect  8            16        dl1:512:32:4:l
E       perfect  4            16        dl1:512:32:4:l

Table 2. The final list of candidate configurations for cost/performance model evaluation. The functional-unit settings (res:ialu, res:imult, res:fpalu, res:fpmult) are also part of each configuration. For all configurations in Table 2, the instruction fetch queue size, decode width, and commit width have the same values as the issue width.

3.1 Cost Model

We assume that all configurations are built in the same VLSI fabrication technology, so the difference in cost is driven by transistor count, which in turn depends on the number of functional units, the complexity of the instruction decode and issue logic, the RUU size, and the L1 cache size and associativity, among many other parameters. Cost also depends on die area: for the same technology the wafer cost is the same, and the die area determines the number of dies per wafer and therefore the cost of an individual die. Die yield, too, depends on die area. Our target is to set performance against the number of transistors and the chip space (die area) required to implement the simulated features.

For this purpose we used the tool of [1], which estimates the transistor count and chip-space requirements of the simulated features; the tool is built on Microsoft Excel. To estimate chip space and transistor count, [1] uses an analytical method for memory-based structures such as register files and internal queues, and an empirical method for logic blocks such as control logic and functional units. The analytical method calculates the number of bit cells needed for the memory-based structures and the number of ports that access them. From this information the transistor count is calculated, assuming four transistors to implement a basic bit cell, two transistors per write port, and one transistor per read port. To calculate the chip space of a memory-based structure, the area of a basic bit cell is estimated; the basic cell area is increased in height and width by the number of ports. From the per-cell area and the number of cells, the whole chip space can be estimated. To be independent of chip technology, the half feature size λ is used as the measure of length; for example, 1 mm² in 0.5-micron technology equals 16 million λ².

For the non-memory-based parts of the processor, the floor plans of existing processors are measured. With this empirical approach, [1] estimates the sizes of the basic logic blocks of the processors and, using this data, calculates the necessary chip space of the logic blocks. To estimate the transistor count of hypothetical processors, the average transistor density of non-memory-based structures of real processors is calculated by measuring floor plans (SPARC64 and HP PA-8000) and by using additional information about the transistor counts of the measured logic blocks.
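As a rough illustration of the analytical half of this method, the sketch below re-implements the counting rule just described: four transistors per bit cell, two per write port, one per read port, and a bit cell that grows in height and width with the port count. It is our simplified reading of [1], not the tool itself; the function names and the register-file example are invented.

```python
def memory_structure_transistors(bits: int, read_ports: int, write_ports: int) -> int:
    """Analytical rule from the text: 4 transistors per bit cell,
    plus 2 per write port and 1 per read port, for every cell."""
    per_cell = 4 + 2 * write_ports + 1 * read_ports
    return bits * per_cell

def memory_structure_area(bits: int, read_ports: int, write_ports: int,
                          base_h: float, base_w: float) -> float:
    """Bit-cell area grows in height and width with the number of ports.
    base_h and base_w are the basic cell dimensions in lambda (half feature size);
    one wire pitch per port is an assumption of this sketch."""
    ports = read_ports + write_ports
    cell_area = (base_h + ports) * (base_w + ports)   # lambda^2 per cell
    return bits * cell_area

# Example: a hypothetical 64-entry, 32-bit register file with 8 read and 4 write ports.
bits = 64 * 32
print(memory_structure_transistors(bits, read_ports=8, write_ports=4))
print(memory_structure_area(bits, read_ports=8, write_ports=4, base_h=8, base_w=8))
```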

Transistor Count Calculation

The estimated transistor counts for all configurations are given in Table 3 (a dash marks a value that is not available); a consistency check on the totals is sketched after Table 4.

                       Base        A           B           C           D           E
Analytical estimates
Registers              51,200      51,200      51,200      51,200      51,200      59,392
LSQ                    20,448      20,448      20,448      20,736      20,448      24,992
RUU                    -           -           -           -           -           -
L1 D-cache             5,713,920   2,846,720   1,418,240   5,713,920   5,713,920   5,713,920
Empirical estimates
Issue (scheduler)      97,937      97,937      97,937      -           -           97,937
Write-back unit        -           -           -           -           -           -
Integer units          -           -           -           -           -           -
Floating-point units   -           -           -           -           -           -
Load/store units       -           -           -           -           -           -
Summary
Total without caches   1,475,411   1,475,411   1,475,411   1,599,699   1,583,268   1,752,927
Total with L1 caches   11,131,731  8,264,531   6,836,051   11,256,019  11,239,588  11,409,247
Total with all caches  48,245,587  45,378,387  43,949,907  48,369,875  48,353,444  48,523,103

Table 3. Estimated transistor counts of the candidate configurations. The total values of the transistor count will be used in the final calculations. Note that the transistor count scales linearly with the increasing complexity of the datapath.

Die Area Calculation

Table 4. Die-area estimates for the different configurations, in millions of λ², broken down into the same rows as Table 3 (registers, LSQ, RUU, L1 D-cache; write-back unit, integer units, floating-point units, load/store units), with totals without caches, with L1 caches, and with all caches. For the Base configuration the totals are on the order of 2,000 Mλ² without caches, 6,000 Mλ² with L1 caches, and 21,000 Mλ² with all caches.
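The summary rows of Table 3 can be cross-checked for internal consistency: for every configuration, the total with L1 caches equals the total without caches plus the D-cache row plus a constant I-cache term of 3,942,400 transistors. (That I-cache figure is inferred from the table totals, not quoted from [1].)

```python
# Per configuration: (total without caches, L1 D-cache, total with L1 caches).
rows = {
    "Base": (1_475_411, 5_713_920, 11_131_731),
    "A":    (1_475_411, 2_846_720,  8_264_531),
    "B":    (1_475_411, 1_418_240,  6_836_051),
    "C":    (1_599_699, 5_713_920, 11_256_019),
    "D":    (1_583_268, 5_713_920, 11_239_588),
    "E":    (1_752_927, 5_713_920, 11_409_247),
}
IL1 = 3_942_400  # inferred transistor count of the fixed 64 KB I-cache

for cfg, (core, dl1, with_l1) in rows.items():
    assert core + dl1 + IL1 == with_l1, cfg   # holds for all six configurations
print("Table 3 totals are internally consistent.")
```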

A graphic summary of Tables 3 and 4 is given in Figure 7.

Figure 7. (a) Transistor counts for the different configurations (without caches, with L1 caches, and with all caches). The values in each category for every configuration other than the Base are normalized to the corresponding Base value, to ease comparison across all three categories. (b) Chip-space estimates for the different configurations; the same normalization applies as in (a).

Cost can now be estimated easily by using the total transistor count and the total die area as the basic criteria of evaluation. But since we are interested in comparing cost/performance ratios, let us first develop our performance model.

3.2 Performance Model

The measure used for performance is the CPU time, given by

CPU execution time = IC × (CPI + misses/inst(L1) × hit_time(L2) + misses/inst(L2) × miss_penalty(L2)) × clock cycle time

The default SimpleScalar values for the hit and miss latencies are used in our calculations, and misses per instruction are taken from the simulation results. The reason we do not choose CPI as our performance measure is clear from the graphs in Figure 8: although configuration C has a better CPI, its cycle time is longer because of its larger RUU, so its CPU execution time is worse than the others'.

Figure 8. (a) CPI values for all configurations on gcc and vortex; (b) execution-time values for all configurations. The values are normalized to the Base values.
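A direct transcription of this performance measure might look as follows. It is only a sketch: the latency and miss numbers in the example call are illustrative placeholders, not SimpleScalar defaults or simulation results.

```python
def cpu_execution_time(ic, cpi, l1_misses_per_inst, l2_misses_per_inst,
                       l2_hit_time, l2_miss_penalty, cycle_time):
    """CPU time = IC * (CPI + m1 * hit(L2) + m2 * penalty(L2)) * cycle time."""
    effective_cpi = (cpi
                     + l1_misses_per_inst * l2_hit_time
                     + l2_misses_per_inst * l2_miss_penalty)
    return ic * effective_cpi * cycle_time

# Illustrative numbers only: 100M instructions, latencies in cycles.
t = cpu_execution_time(ic=100e6, cpi=0.9,
                       l1_misses_per_inst=0.02, l2_misses_per_inst=0.002,
                       l2_hit_time=6, l2_miss_penalty=18, cycle_time=1.0)
print(t)
```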

Performance is inversely proportional to CPU time: the higher the CPU time, the lower the performance, and vice versa. Since our two benchmarks execute different total numbers of instructions, we follow these steps, as suggested by [3], to obtain a single geometric-mean figure from the results of both benchmarks:

- For each benchmark i, look up T_base,i.
- For each benchmark i, run the target machine to get T_new,i.
- Compute the geometric mean GM = ( Π_{i=1..2} (T_new,i / T_base,i) )^(1/2); the lowest value is the best.
- Compute (Cost_machine / Cost_base) × GM, where Cost_machine / Cost_base = (TotalTransistorCount_machine / TotalTransistorCount_base) × (DieArea_machine / DieArea_base). (This procedure is sketched in code at the end of this section.)

Here T_base,i is the CPU time of the base configuration on the ith benchmark and T_new,i is the execution time of the configuration under test on that benchmark. Similarly, Cost_base is the cost of the base configuration and Cost_machine is the cost of the machine model represented by the configuration under test.

Figure 10. Geometric-mean values for the different configurations, normalized to the Base.

This leads us to the final goal of this set of experiments: selecting the best configuration by cost/performance ratio. The cost/performance ratios of the selected configurations are plotted in Figure 11.

Figure 11. Cost/performance ratios for the selected configurations.

As the graph makes clear, configuration B gives us the best cost/performance ratio. This configuration has a direct-mapped D-cache (the other parameter values are in Table 2); it has a performance disadvantage relative to A and the Base, but its cost savings are much more significant than that disadvantage, eventually making it our configuration of choice for the next set of experiments.
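The ranking procedure is mechanical enough to express in a few lines. In the sketch below, the transistor counts are the "total with all caches" figures for Base and B from Table 3, while the execution times and die areas are made-up placeholders.

```python
from math import prod

def geometric_mean_ratio(t_new, t_base):
    """GM of per-benchmark time ratios; lower is better."""
    ratios = [n / b for n, b in zip(t_new, t_base)]
    return prod(ratios) ** (1.0 / len(ratios))

def cost_performance(gm, transistors, die_area, base_transistors, base_area):
    """(Cost_machine / Cost_base) * GM, with cost taken as the product of the
    normalized transistor count and the normalized die area."""
    cost_ratio = (transistors / base_transistors) * (die_area / base_area)
    return cost_ratio * gm

# Hypothetical example for configuration B versus the Base:
gm = geometric_mean_ratio(t_new=[9.8, 10.4], t_base=[10.0, 10.0])
score = cost_performance(gm, transistors=43_949_907, die_area=18_000,
                         base_transistors=48_245_587, base_area=21_000)
print(score)   # the configuration with the smallest score wins
```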

As a word of warning, please note that sim-outorder is not specifically meant for cache simulation, so the comparison done here should not lead us to believe that this configuration will behave the same in all practical machine designs.

4 Optimal D-Cache and BTB Size for a Bimodal Branch Predictor on Configuration B

Having chosen configuration B in the previous section as the optimal configuration, we executed sixteen additional simulations for it using combinations of the settings below:

Parameter           Range                        Flag           Value
Max instructions    -                            -max:inst      -
Fast forward        -                            -fastfwd       -
Branch prediction   bimodal                      -bpred         bimod
Bimodal table size  512, 1024, 2048, 4096        -bpred:bimod   512
BTB size            512, 1024, 2048, 4096        -bpred:btb     512,4
D-cache size        512, 1024, 2048, 4096 sets   -cache:dl1     dl1:512:32:4:l

Table 5. Ranges of values for the second set of experiments.

Again, a simple analytical and quantitative analysis helps us prune these configurations to a few, based on the results of simulations run over all of them. This analysis is illustrated in the figures below.

Figure 12. (a) Miss rate versus cache size for gcc and vortex; (b) transistor count and chip space (die area) versus cache size. It is clear that we have to make a rational trade-off in order to achieve performance at reasonable cost.

Figure 13. Branch-predictor miss rate versus branch lookup-table size, alongside the corresponding cost (transistors and die area). The miss rate decreases as the table size grows, but at increasing cost, as the side graph shows.

Figure 14. D-cache size and BTB size versus CPU execution time. The execution-time values do not include the instruction count, as it is the same in all cases. Although CPI decreases with increasing cache size, the additional cycle-time penalty effectively increases CPU time, thus reducing performance.

Figure 15. RUU size versus CPU execution time. Execution time increases with RUU size: CPI does improve, but the cycle-time penalty of the larger RUU again effectively reduces performance.

The selected configurations are given in Table 6 (the mapping from these cache-configuration strings to total cache sizes is sketched in code after Figure 16).

Config  D-cache size  D-cache config    Bimodal predictor config
Base    16 KB         dl1:512:32:1:l    512
BA      32 KB         dl1:1024:32:1:l   512
BB      16 KB         dl1:512:32:1:l    1024
BC      64 KB         dl1:2048:32:1:l   1024
BD      128 KB        dl1:4096:32:1:l   1024

Table 6. The selected configurations for further analysis.

Let us now apply our cost/performance model to these configurations to select the best of them in terms of cost/performance ratio, following the same steps as in the previous section. Again, the configuration with the best CPI value is not necessarily the best in terms of execution time, as the decrease in CPI is cancelled out by the increase in cycle time in the relevant cases. Note that the execution-time values in the graph are normalized with respect to the Base configuration's value. The graphs clearly show that in terms of execution time, and hence performance, the Base configuration has the best results. Moreover, as outlined in our performance model, we obtain a unified execution-time value by taking the geometric mean of the individual values; the results of this process are shown in the graphs below.

Figure 16. Left: CPI values of the selected configurations (Base, BA, BC, BB, BD). Right: CPU time of the same configurations, normalized to the Base.
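The D-cache sizes in Table 6 follow directly from SimpleScalar's cache-configuration string, name:nsets:bsize:assoc:repl, whose total capacity is nsets × bsize × assoc. A small helper (ours, for illustration) makes the mapping explicit:

```python
def cache_size_bytes(config: str) -> int:
    """Total capacity of a SimpleScalar cache config string,
    e.g. 'dl1:2048:32:1:l' -> 2048 sets * 32 B blocks * 1 way = 64 KB."""
    _, nsets, bsize, assoc, _ = config.split(":")
    return int(nsets) * int(bsize) * int(assoc)

for cfg in ["dl1:512:32:1:l", "dl1:1024:32:1:l",
            "dl1:2048:32:1:l", "dl1:4096:32:1:l"]:
    print(cfg, cache_size_bytes(cfg) // 1024, "KB")   # 16, 32, 64, 128 KB
```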

Figure 17. Geometric-mean values of the selected configurations.

Now let us apply the cost model to these configurations. The results are summarized in Figure 18. Please note that the graph shows the total chip-space and transistor-count estimates for all configurations normalized to the corresponding values of the Base configuration.

Figure 18. Normalized values (with respect to the Base configuration) of the total transistor count and chip space (die area) for the selected configurations. Again the Base configuration has the best cost values.

We can now easily predict that the Base configuration will also have the best cost/performance ratio, which is verified by Figure 19.

Figure 19. Cost/performance ratios for the configurations under test. The Base configuration has the best value.

5 Suggested Improvements

In our design so far, we have selected a direct-mapped cache of relatively small size. For direct-mapped caches, the miss rate is a big concern, as such caches allow a large number of conflict misses. There are two ways to attack this problem: reduce the miss rate, by using a larger block size or a larger cache, or reduce the cache miss penalty. The cache performance formula is given as [3]

Average memory access time = hit time + miss rate × miss penalty

It is easy to see from this formula that improvements in miss penalty can be just as beneficial as improvements in miss rate. One approach to lowering the miss penalty is a victim cache [2], which significantly reduces the miss penalty of a small direct-mapped cache. A victim cache is a small, fully associative cache placed between a cache and its refill path (the L2 cache in our simulated environment). It stores the blocks that are discarded from the cache upon a miss, the victims of address conflicts. With a victim cache, each miss first checks the victim cache for the desired data before going to the next level of memory; if the data is found there, the victim block and the cache block are swapped. Victim caches of one to five entries have been suggested to be effective at reducing misses, especially for small direct-mapped caches [2] like the one in our configuration. Depending on the program, a four-entry victim cache might remove one quarter of the misses in a 4 KB direct-mapped cache [2, 3].

We used sim-outorder to simulate our configurations, and sim-outorder uses a write-back scheme for writing values to the next level of the memory hierarchy (it does not allow a customized write policy). One suggested improvement is to use a write-through scheme with write buffers when cost is at a premium, as in embedded processors. The problem with a write-back scheme is that it is complex and expensive to build, though it greatly reduces memory traffic; the high bandwidth demand of a write-through cache can be reduced with write buffers, effectively achieving performance at a reduced implementation cost.

As mentioned above, similar effects can be achieved by reducing the miss rate through a larger block size, which makes effective use of spatial locality and reduces compulsory misses (block sizes can be changed in sim-outorder too). But there is a trade-off: for a fixed cache size, larger blocks mean fewer blocks, so conflict misses become more frequent and the miss rate may rise. This trade-off can be evaluated by a simple quantitative/qualitative simulation study.

Another technique suggested by [3] for reducing the miss penalty and miss rate is prefetching, which comes in two forms: hardware prefetching and compiler-controlled prefetching. The idea is essentially to bring data or instructions into the processor before they are actually requested. In hardware prefetching, whenever a requested block is not found in the cache, two blocks are fetched from the next level of the memory hierarchy: the required block and the one consecutive to it. The required block is placed in the cache, while the consecutive block is placed in an instruction or data stream buffer. Jouppi [2] found that a single instruction stream buffer would catch 15% to 25% of the misses from a 4 KB direct-mapped instruction cache with 16-byte blocks.
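To make the victim-cache mechanism concrete, here is a minimal behavioral sketch. It is our own illustration, not sim-outorder code: the class names are invented, block addresses stand in for whole blocks, and only hit/miss classification is modeled.

```python
from collections import OrderedDict

class VictimCache:
    """Small fully associative victim cache (Jouppi [2]) with LRU replacement."""
    def __init__(self, entries=4):
        self.entries = entries
        self.blocks = OrderedDict()          # block address -> True, in LRU order

    def insert(self, block_addr):
        self.blocks.pop(block_addr, None)
        if len(self.blocks) >= self.entries:
            self.blocks.popitem(last=False)  # evict the least recently used entry
        self.blocks[block_addr] = True

    def lookup(self, block_addr):
        # A hit removes the block: it gets swapped back into the L1 cache.
        return self.blocks.pop(block_addr, None) is not None

class DirectMappedCacheWithVictim:
    """Behavioral model of a direct-mapped L1 backed by a victim cache."""
    def __init__(self, nsets, victim_entries=4):
        self.nsets = nsets
        self.lines = {}                      # set index -> resident block address
        self.victim = VictimCache(victim_entries)

    def access(self, block_addr):
        index = block_addr % self.nsets
        resident = self.lines.get(index)
        if resident == block_addr:
            return "hit"
        hit_in_victim = self.victim.lookup(block_addr)
        if resident is not None:
            self.victim.insert(resident)     # displaced block becomes a victim
        self.lines[index] = block_addr       # swapped in, or fetched from L2
        return "victim-hit" if hit_in_victim else "miss"

# Toy trace: blocks 0, 512, and 1024 all conflict in set 0 of a 512-set cache.
cache = DirectMappedCacheWithVictim(nsets=512, victim_entries=4)
trace = [0, 512, 0, 512, 1024, 0]
print([cache.access(a) for a in trace])      # miss, miss, victim-hit, victim-hit, ...
```

Replaying an address trace and counting the victim-hit outcomes gives a quick feel for how many conflict misses a four-entry victim cache can absorb in a small direct-mapped cache.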

An alternative to hardware prefetching is for the compiler to insert prefetch instructions that request data before it is needed.

The arguments above deal only with cache-related parameters, but our final parameter selection is based on both cache and branch parameters, so some improvements to branch prediction suggested by the research community are in order here. One such improvement, employed in most recent processors, is a return-address predictor for indirect jumps, that is, jumps whose destination address varies at run time [3]. A procedure return can be predicted with a branch target buffer, but the accuracy of such a prediction can be low if the procedure is called from multiple sites and the calls from one site are not clustered in time [3]. To overcome this problem, the Return Address Stack (RAS), a small buffer of return addresses operating as a stack, has been proposed as one form of return-address predictor (sim-outorder allows specifying the return-address-stack size). It is important to note that a RAS caches the most recent return addresses: it pushes a return address on the stack when a procedure call is made and pops it off when the flow returns from the procedure. If the stack is sufficiently large, i.e., as large as the maximum call depth, the RAS will predict return addresses perfectly.

This concludes our optimal-parameter-selection methodology, analysis, and improvement suggestions.

6 Conclusions

We have presented an exercise in optimal parameter selection for a reduced-cost, reasonable-performance processor. The quantitative and qualitative aspects of a simulation-based processor design process were presented step by step: the quantitative results of the SimpleScalar simulations were judged against the yardstick of a qualitatively analyzed cost/performance model. A processor built around a small direct-mapped cache was justified as giving the best cost/performance ratio for the selected set of parameters. Our cache simulations were also based on sim-outorder, which is not specifically meant for cache simulation; the use of sim-cache or sim-cheetah could give more fine-grained analysis options. Finally, qualitatively derived results from previous studies were presented to further enhance the performance of the proposed processor configuration while still maintaining its reduced-cost advantage.

7 References

1. M. Steinways and T. Ungerer, "Hardware Complexity of Processors".
2. N. P. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers", Proc. 17th Annual Int'l Symposium on Computer Architecture, 1990.
3. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 3rd ed., Morgan Kaufmann Publishers, San Francisco.
4. D. C. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0", UW-Madison Computer Sciences Technical Report #1342, June 1997.

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

Pipelining to Superscalar

Pipelining to Superscalar Pipelining to Superscalar ECE/CS 752 Fall 207 Prof. Mikko H. Lipasti University of Wisconsin-Madison Pipelining to Superscalar Forecast Limits of pipelining The case for superscalar Instruction-level parallel

More information

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 Name: Solutions (please print) 1-3. 11 points 4. 7 points 5. 7 points 6. 20 points 7. 30 points 8. 25 points Total (105 pts):

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

CS 2410 Mid term (fall 2018)

CS 2410 Mid term (fall 2018) CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Portland State University ECE 587/687. Caches and Memory-Level Parallelism

Portland State University ECE 587/687. Caches and Memory-Level Parallelism Portland State University ECE 587/687 Caches and Memory-Level Parallelism Revisiting Processor Performance Program Execution Time = (CPU clock cycles + Memory stall cycles) x clock cycle time For each

More information

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence Exam-2 Scope 1. Memory Hierarchy Design (Cache, Virtual memory) Chapter-2 slides memory-basics.ppt Optimizations of Cache Performance Memory technology and optimizations Virtual memory 2. SIMD, MIMD, Vector,

More information

Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

Chapter 7 The Potential of Special-Purpose Hardware

Chapter 7 The Potential of Special-Purpose Hardware Chapter 7 The Potential of Special-Purpose Hardware The preceding chapters have described various implementation methods and performance data for TIGRE. This chapter uses those data points to propose architecture

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS Homework 2 (Chapter ) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration..

More information

Slide Set 8. for ENCM 501 in Winter Steve Norman, PhD, PEng

Slide Set 8. for ENCM 501 in Winter Steve Norman, PhD, PEng Slide Set 8 for ENCM 501 in Winter 2018 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary March 2018 ENCM 501 Winter 2018 Slide Set 8 slide

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information