A Quantitative/Qualitative Study for Optimal Parameter Selection of a Superscalar Processor using SimpleScalar

A Quantitative/Qualitative Study for Optimal Parameter Selection of a Superscalar Processor using SimpleScalar

Waseem Ahmad, Enrico Ng
{wahmad1, eng3}@uic.edu
Department of Electrical and Computer Engineering
University of Illinois, Chicago

This report is submitted as a requirement for the term project of ECE 466. Course instructor: Prof. Gyungho Lee.

Abstract

Wide-issue superscalar processors are very complex machines, and the SimpleScalar toolset makes simulating selected parameters of such processors much easier. Selecting optimal parameter values for a processor aimed at a specific target environment requires critical quantitative analysis of SimpleScalar simulations, augmented with qualitative analysis in the form of a rational cost/performance model. We present one such exercise: we build a qualitatively analyzed cost/performance model, apply it to the quantitative results of SimpleScalar simulations, and arrive at a processor model, based on a restricted set of parameters, that gives reasonable performance at reduced cost. The target environment for such processors may range from desktop to embedded systems.

1 Introduction

SimpleScalar is the most popular simulation toolset for performance studies of superscalar microprocessors. We use this toolset to develop a basic framework for optimal parameter selection for a superscalar processor, where "optimal" refers to a rational trade-off between cost and performance. The procedure is as follows. We first run sim-outorder (from the SimpleScalar toolset; details are given in the next section) on two SPEC2000 integer programs, gcc (a compiler) and vortex (a database application), over varying configurations of selected parameters, with a fixed 64 KB I-cache, a 16 KB D-cache, and the assumption of perfect branch prediction. The parameters we experiment with are instruction issue width (and the number of functional units), Register Update Unit (RUU) size, and D-cache associativity. To weigh the quantitative and qualitative aspects of performance bought at the expense of more hardware, we propose a cost model. The proposed cost model enables a rational trade-off between cost and performance and lets us pick the configuration that performs best under this model as the candidate for the next set of experiments. For this candidate configuration, we then experiment with varying D-cache sizes and Branch Target Buffer (BTB) sizes for branch prediction, now using the more practical bimodal predictor with configurations of increasing entry counts. Assuming that cycle time increases 10% with each doubling of the D-cache size (up to 128 KB), we try to decide the D-cache size, along with the BTB size, that provides the best cost/performance ratio.

The rest of the paper is organized as follows. We first give a brief introduction to the SimpleScalar toolset and the variety of tools it provides. We then present the first set of experiments and the conclusions drawn from them based on the proposed cost model, which is presented in the same section. After that we take up the second set of experiments and give a conclusive parameter selection based on the proposed cost model. We conclude by suggesting ways to improve the performance of the selected processor configuration.

2 SimpleScalar Background

The SimpleScalar toolset is an architectural research tool set including compiler, assembler, linker, simulation, and visualization tools for the SimpleScalar architecture [4]. With this tool set, the user can simulate real programs using fast execution-driven simulation. It is a flexible and accurate cycle-resolution simulator that implements a close derivative of the MIPS-IV Instruction Set Architecture (ISA). More precisely, the SimpleScalar instruction set (also called PISA, the "Portable Instruction Set Architecture") is a superset of MIPS with a few minor differences and additions. It can simulate binary programs on one of several processor simulators provided. These specialized simulators fall into four main categories.

Functional simulation. The fastest and most elementary simulator is sim-fast, which performs only functional simulation using in-order execution of the instructions (i.e., they are executed in the order they appear in the program). This simulator is optimized for raw speed and does not account for instruction checking or the existence of cache memory. A separate version of sim-fast, called sim-safe, also performs functional simulation but checks alignment and access permissions for each memory reference.

Profiling. SimpleScalar includes a functional simulator, sim-profile, that can provide detailed profiles of instruction classes and addresses, text symbols, memory accesses, branches, and data segment symbols.

Cache simulation. The tool set provides two functional cache simulators, sim-cache and sim-cheetah. These are ideal for fast simulation of architectures that include cache memory, provided the cache access time is not relevant to execution performance. They are useful for evaluating a variety of cache organizations.

Out-of-order processor timing simulation. The most complicated and detailed simulator in the tool set is sim-outorder, the one we use in this study. It performs out-of-order execution of instructions based on the Register Update Unit (a scheme that uses a reorder buffer to automatically rename registers and hold the results of pending instructions). Out-of-order execution (OOO) is what distinguishes a superscalar architecture from others, and it is where the name SimpleScalar comes from: it is a simple superscalar architecture.

In the sections to come, we use sim-outorder to run simulations for optimal parameter selection under both perfect and practical branch prediction schemes.

3 Parameter Selection for Perfect Branch Prediction

As mentioned earlier, in this set of experiments we define a set of configurations based upon three parameters: instruction issue width (and the number of functional units), RUU size, and D-cache associativity. The fixed parameters are:

- I-cache of 64 KB
- D-cache of 16 KB
- perfect branch prediction

The following assumptions are made about the increase in clock cycle time as certain parameter values grow:

- 10% by doubling the issue rate
- 10% by doubling the RUU size
- 2% by doubling the D-cache associativity
- 2% by doubling the number of functional units (including load/store units)

Forty-seven simulations were executed, first testing each setting separately and then testing combinations. The settings cover the ranges given in Table 1; a scripted sweep of this kind is sketched after the table.

Parameter              Range                              Flag            Base value
Branch prediction      perfect                            -bpred          perfect
I-cache size           64 KB (fixed)                      -cache:il1      il1:2048:32:1:l
Issue width            4, 16, 32, 64                      -issue:width    4
Fetch queue size       4, 16, 32, 64                      -fetch:ifqsize  4
Decode width           4, 16, 32, 64                      -decode:width   4
Commit width           4, 16, 32, 64                      -commit:width   4
RUU size               16, 32, 64, 128, 256, 512, 1024    -ruu:size       16
D-cache associativity  1, 2, 4, 8, 16, 32, 64             -cache:dl1      dl1:512:32:4:l
Integer ALUs           4, 8, 16                           -res:ialu       4
Integer multipliers    1, 2                               -res:imult      1
FP ALUs                4, 8, 16                           -res:fpalu      4
FP multipliers         1, 2                               -res:fpmult     1

Table 1. Ranges of values for the parameters in the first set of experiments. Branch prediction is assumed to be perfect.
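Although this report does not include our scripts, a sweep of this kind is straightforward to automate around sim-outorder. The sketch below is a minimal illustration rather than our actual procedure: the simulator path and the benchmark binaries and inputs (gcc.ss, vortex.ss, and their arguments) are placeholder names, and only the issue-width and RUU-size axes of Table 1 are shown.

```python
import itertools
import subprocess

# Placeholder paths: substitute your own sim-outorder build and benchmark binaries.
SIM = "./sim-outorder"
BENCHMARKS = {"gcc": ["gcc.ss", "input.i"], "vortex": ["vortex.ss", "lendian.raw"]}

# Two of the Table 1 axes; fetch/decode/commit widths track the issue width.
ISSUE_WIDTHS = [4, 16, 32, 64]
RUU_SIZES = [16, 32, 64, 128, 256, 512, 1024]

for name, cmd in BENCHMARKS.items():
    for width, ruu in itertools.product(ISSUE_WIDTHS, RUU_SIZES):
        args = [
            SIM,
            "-bpred", "perfect",                  # perfect branch prediction
            "-cache:il1", "il1:2048:32:1:l",      # fixed 64 KB I-cache
            "-cache:dl1", "dl1:512:32:4:l",       # base D-cache configuration
            "-fetch:ifqsize", str(width),         # IFQ size tracks issue width
            "-decode:width", str(width),
            "-issue:width", str(width),
            "-commit:width", str(width),
            "-ruu:size", str(ruu),
            "-redir:sim", f"{name}_w{width}_ruu{ruu}.out",  # statistics file
        ] + cmd
        subprocess.run(args, check=True)
```

Each run leaves its statistics in a separate file, from which IPC, miss rates, and queue-occupancy figures can be scraped for the analysis that follows.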

After the simulations were run, we pruned the results to select the configurations with the best execution times on the two benchmarks. We observed the following interesting patterns over the ranges and parameters shown in Table 1.

Figure 1. (a) Percentage of time the RUU is full versus RUU size; (b) cost (transistors and die area) versus RUU size. Clearly, for the given benchmarks there is no need to go beyond an RUU size of 16, as a larger RUU would not be utilized anyway. As expected, cost scales linearly with RUU size. Where the curve in the left-hand graph appears to dip below zero, this is an artifact of interpolation, not of actual values.

Figure 2. (a) Percentage of time the fetch queue is full versus issue width; (b) LSQ utilization versus issue width. Graph (a) clearly shows that there is no use going beyond an issue width of 4 to 8: most of the time the IFQ is not full anyway, so the increased issue width cannot be served properly. Graph (b) shows that vortex, being a database application, has more loads and stores and keeps the LSQ full most of the time compared to gcc, but in either case LSQ utilization decreases with increasing issue width. Note that LSQ utilization peaks at an issue width of 8, because a fixed LSQ size was used throughout this set of experiments.

Figure 3. (a) CPU execution time versus issue width; (b) cost versus issue width. Increasing the issue width increases CPU execution time almost linearly. Although CPI improves with issue width to a certain extent, the additional clock-cycle penalty dominates the improvement, effectively reducing performance as issue width grows. Moreover, the perfect branch prediction assumption means that control hazards do not reduce performance here; but since the functional units are still at their default values, structural hazards play their part in decreasing performance at higher issue widths. (We will see the effect of functional units combined with increased issue width shortly.)

Figure 4. D-cache associativity versus execution time. This gives a very insightful result: an associativity of 1 gives the best execution time. This corresponds to a direct-mapped cache, which also has the lowest cost, so it will be part of the final set of configurations when we take up the cost/performance model in detail.

Figure 5. D-cache associativity versus miss rate, and versus cost (in terms of die area and transistors). The graphs show clearly why increased associativity does not give better cost/performance. Increasing associativity improves performance by decreasing the miss rate (there are fewer conflict misses), but the cost paid to achieve this does not scale well with the performance gained. We can therefore predict that optimal parameter selection will involve low associativity values. This also illustrates that even though the miss rate is reduced, the CPU execution time, which is the true measure of performance, eventually increases.

Figure 6. Integer functional units versus CPU execution time. Increasing the number of functional units does not by itself enhance performance as long as the issue width and other parameters stay the same. Since the benchmarks are integer codes, the floating-point functional units do not affect performance.

These illustrations fully explain the logic behind our elimination criteria for the configurations under critical analysis. The configurations finally selected for cost/performance model evaluation are given in Table 2.

Config  bpred    issue:width  ruu:size  cache:dl1 (total size = 16 KB)
Base    perfect  4            16        dl1:512:32:4:l
A       perfect  4            16        dl1:512:32:2:l
B       perfect  4            16        dl1:512:32:1:l
C       perfect  4            32        dl1:512:32:4:l
D       perfect  8            16        dl1:512:32:4:l
E       perfect  4            16        dl1:512:32:4:l

Table 2. The final list of candidate configurations for cost/performance model evaluation. The functional-unit settings (res:ialu, res:imult, res:fpalu, res:fpmult) are also part of each configuration. For all configurations in Table 2, the instruction fetch queue size, decode width, and commit width have the same values as the issue width.

3.1 Cost Model

We assume that all configurations are built in the same VLSI fabrication technology, so the difference in cost is driven by transistor count, which in turn depends on the number of functional units, the complexity of the instruction decode and issue logic, the RUU size, and the L1 cache size and associativity, among many other parameters. Cost also depends on die area: for the same technology the wafer cost is the same, and the die area determines the number of dies per wafer and therefore the cost of an individual die. Die yield, too, depends on die area. Our target is to set performance against the number of transistors and the chip space (die area) required to implement the simulated features.

For this purpose we used the tool of [1], which estimates the transistor count and chip-space requirements of the simulated features; the tool is built on Microsoft Excel. To estimate chip space and transistor count, [1] uses an analytical method for memory-based structures such as register files and internal queues, and an empirical method for logic blocks such as control logic and functional units. The analytical method calculates the number of bit cells needed for the memory-based structures and the number of ports that access them. From this information the transistor count is calculated, assuming four transistors to implement a basic bit cell, two transistors per write port, and one transistor per read port. To calculate the chip space of a memory-based structure, the area of a basic bit cell is estimated; the basic cell area is increased in height and width by the number of ports. From the per-cell area and the number of cells, the whole chip space can be estimated. To be independent of chip technology, the half feature size λ is used as the measure of length; for example, 1 mm² in 0.5-micron technology equals 16 million λ².

For the non-memory-based parts of the processor, the floor plans of existing processors are measured. With this empirical approach, [1] estimates the sizes of the basic logic blocks of the processors and, using this data, calculates the necessary chip space of the logic blocks. To estimate the transistor count of hypothetical processors, the average transistor density of non-memory-based structures of real processors is calculated by measuring floor plans (SPARC64 and HP PA-8000) and by using additional information about the transistor counts of the measured logic blocks.
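As a rough illustration of the analytical half of this method, the sketch below re-implements the counting rule just described: four transistors per bit cell, two per write port, one per read port, and a bit cell that grows in height and width with the port count. It is our simplified reading of [1], not the tool itself; the function names and the register-file example are invented.

```python
def memory_structure_transistors(bits: int, read_ports: int, write_ports: int) -> int:
    """Analytical rule from the text: 4 transistors per bit cell,
    plus 2 per write port and 1 per read port, for every cell."""
    per_cell = 4 + 2 * write_ports + 1 * read_ports
    return bits * per_cell

def memory_structure_area(bits: int, read_ports: int, write_ports: int,
                          base_h: float, base_w: float) -> float:
    """Bit-cell area grows in height and width with the number of ports.
    base_h and base_w are the basic cell dimensions in lambda (half feature size);
    one wire pitch per port is an assumption of this sketch."""
    ports = read_ports + write_ports
    cell_area = (base_h + ports) * (base_w + ports)   # lambda^2 per cell
    return bits * cell_area

# Example: a hypothetical 64-entry, 32-bit register file with 8 read and 4 write ports.
bits = 64 * 32
print(memory_structure_transistors(bits, read_ports=8, write_ports=4))
print(memory_structure_area(bits, read_ports=8, write_ports=4, base_h=8, base_w=8))
```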

Transistor Count Calculation

The estimated transistor counts for all configurations are given in Table 3 (a dash marks a value that is not available); a consistency check on the totals is sketched after Table 4.

                       Base        A           B           C           D           E
Analytical estimates
Registers              51,200      51,200      51,200      51,200      51,200      59,392
LSQ                    20,448      20,448      20,448      20,736      20,448      24,992
RUU                    -           -           -           -           -           -
L1 D-cache             5,713,920   2,846,720   1,418,240   5,713,920   5,713,920   5,713,920
Empirical estimates
Issue (scheduler)      97,937      97,937      97,937      -           -           97,937
Write-back unit        -           -           -           -           -           -
Integer units          -           -           -           -           -           -
Floating-point units   -           -           -           -           -           -
Load/store units       -           -           -           -           -           -
Summary
Total without caches   1,475,411   1,475,411   1,475,411   1,599,699   1,583,268   1,752,927
Total with L1 caches   11,131,731  8,264,531   6,836,051   11,256,019  11,239,588  11,409,247
Total with all caches  48,245,587  45,378,387  43,949,907  48,369,875  48,353,444  48,523,103

Table 3. Estimated transistor counts of the candidate configurations. The total values of the transistor count will be used in the final calculations. Note that the transistor count scales linearly with the increasing complexity of the datapath.

Die Area Calculation

Table 4. Die-area estimates for the different configurations, in millions of λ², broken down into the same rows as Table 3 (registers, LSQ, RUU, L1 D-cache; write-back unit, integer units, floating-point units, load/store units), with totals without caches, with L1 caches, and with all caches. For the Base configuration the totals are on the order of 2,000 Mλ² without caches, 6,000 Mλ² with L1 caches, and 21,000 Mλ² with all caches.
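The summary rows of Table 3 can be cross-checked for internal consistency: for every configuration, the total with L1 caches equals the total without caches plus the D-cache row plus a constant I-cache term of 3,942,400 transistors. (That I-cache figure is inferred from the table totals, not quoted from [1].)

```python
# Per configuration: (total without caches, L1 D-cache, total with L1 caches).
rows = {
    "Base": (1_475_411, 5_713_920, 11_131_731),
    "A":    (1_475_411, 2_846_720,  8_264_531),
    "B":    (1_475_411, 1_418_240,  6_836_051),
    "C":    (1_599_699, 5_713_920, 11_256_019),
    "D":    (1_583_268, 5_713_920, 11_239_588),
    "E":    (1_752_927, 5_713_920, 11_409_247),
}
IL1 = 3_942_400  # inferred transistor count of the fixed 64 KB I-cache

for cfg, (core, dl1, with_l1) in rows.items():
    assert core + dl1 + IL1 == with_l1, cfg   # holds for all six configurations
print("Table 3 totals are internally consistent.")
```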

A graphic summary of Tables 3 and 4 is given in Figure 7.

Figure 7. (a) Transistor counts for the different configurations (without caches, with L1 caches, and with all caches). The values in each category for every configuration other than the Base are normalized to the corresponding Base value, to ease comparison across all three categories. (b) Chip-space estimates for the different configurations; the same normalization applies as in (a).

Cost can now be estimated easily by using the total transistor count and the total die area as the basic criteria of evaluation. But since we are interested in comparing cost/performance ratios, let us first develop our performance model.

3.2 Performance Model

The measure used for performance is the CPU time, given by

CPU execution time = IC × (CPI + misses/inst(L1) × hit_time(L2) + misses/inst(L2) × miss_penalty(L2)) × clock cycle time

The default SimpleScalar values for the hit and miss latencies are used in our calculations, and misses per instruction are taken from the simulation results. The reason we do not choose CPI as our performance measure is clear from the graphs in Figure 8: although configuration C has a better CPI, its cycle time is longer because of its larger RUU, so its CPU execution time is worse than the others'.

Figure 8. (a) CPI values for all configurations on gcc and vortex; (b) execution-time values for all configurations. The values are normalized to the Base values.
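A direct transcription of this performance measure might look as follows. It is only a sketch: the latency and miss numbers in the example call are illustrative placeholders, not SimpleScalar defaults or simulation results.

```python
def cpu_execution_time(ic, cpi, l1_misses_per_inst, l2_misses_per_inst,
                       l2_hit_time, l2_miss_penalty, cycle_time):
    """CPU time = IC * (CPI + m1 * hit(L2) + m2 * penalty(L2)) * cycle time."""
    effective_cpi = (cpi
                     + l1_misses_per_inst * l2_hit_time
                     + l2_misses_per_inst * l2_miss_penalty)
    return ic * effective_cpi * cycle_time

# Illustrative numbers only: 100M instructions, latencies in cycles.
t = cpu_execution_time(ic=100e6, cpi=0.9,
                       l1_misses_per_inst=0.02, l2_misses_per_inst=0.002,
                       l2_hit_time=6, l2_miss_penalty=18, cycle_time=1.0)
print(t)
```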

Performance is inversely proportional to CPU time: the higher the CPU time, the lower the performance, and vice versa. Since our two benchmarks execute different total numbers of instructions, we follow these steps, as suggested by [3], to obtain a single geometric-mean figure from the results of both benchmarks:

- For each benchmark i, look up T_base,i.
- For each benchmark i, run the target machine to get T_new,i.
- Compute the geometric mean GM = ( Π_{i=1..2} (T_new,i / T_base,i) )^(1/2); the lowest value is the best.
- Compute (Cost_machine / Cost_base) × GM, where Cost_machine / Cost_base = (TotalTransistorCount_machine / TotalTransistorCount_base) × (DieArea_machine / DieArea_base). (This procedure is sketched in code at the end of this section.)

Here T_base,i is the CPU time of the base configuration on the ith benchmark and T_new,i is the execution time of the configuration under test on that benchmark. Similarly, Cost_base is the cost of the base configuration and Cost_machine is the cost of the machine model represented by the configuration under test.

Figure 10. Geometric-mean values for the different configurations, normalized to the Base.

This leads us to the final goal of this set of experiments: selecting the best configuration by cost/performance ratio. The cost/performance ratios of the selected configurations are plotted in Figure 11.

Figure 11. Cost/performance ratios for the selected configurations.

As the graph makes clear, configuration B gives us the best cost/performance ratio. This configuration has a direct-mapped D-cache (the other parameter values are in Table 2); it has a performance disadvantage relative to A and the Base, but its cost savings are much more significant than that disadvantage, eventually making it our configuration of choice for the next set of experiments.
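The ranking procedure is mechanical enough to express in a few lines. In the sketch below, the transistor counts are the "total with all caches" figures for Base and B from Table 3, while the execution times and die areas are made-up placeholders.

```python
from math import prod

def geometric_mean_ratio(t_new, t_base):
    """GM of per-benchmark time ratios; lower is better."""
    ratios = [n / b for n, b in zip(t_new, t_base)]
    return prod(ratios) ** (1.0 / len(ratios))

def cost_performance(gm, transistors, die_area, base_transistors, base_area):
    """(Cost_machine / Cost_base) * GM, with cost taken as the product of the
    normalized transistor count and the normalized die area."""
    cost_ratio = (transistors / base_transistors) * (die_area / base_area)
    return cost_ratio * gm

# Hypothetical example for configuration B versus the Base:
gm = geometric_mean_ratio(t_new=[9.8, 10.4], t_base=[10.0, 10.0])
score = cost_performance(gm, transistors=43_949_907, die_area=18_000,
                         base_transistors=48_245_587, base_area=21_000)
print(score)   # the configuration with the smallest score wins
```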

As a word of warning, please note that sim-outorder is not specifically meant for cache simulation, so the comparison done here should not lead us to believe that this configuration will behave the same in all practical machine designs.

4 Optimal D-Cache and BTB Size for a Bimodal Branch Predictor on Configuration B

Having chosen configuration B in the previous section as the optimal configuration, we executed sixteen additional simulations for it using combinations of the settings below:

Parameter           Range                        Flag           Value
Max instructions    -                            -max:inst      -
Fast forward        -                            -fastfwd       -
Branch prediction   bimodal                      -bpred         bimod
Bimodal table size  512, 1024, 2048, 4096        -bpred:bimod   512
BTB size            512, 1024, 2048, 4096        -bpred:btb     512,4
D-cache size        512, 1024, 2048, 4096 sets   -cache:dl1     dl1:512:32:4:l

Table 5. Ranges of values for the second set of experiments.

Again, a simple analytical and quantitative analysis helps us prune these configurations to a few, based on the results of simulations run over all of them. This analysis is illustrated in the figures below.

Figure 12. (a) Miss rate versus cache size for gcc and vortex; (b) transistor count and chip space (die area) versus cache size. It is clear that we have to make a rational trade-off in order to achieve performance at reasonable cost.

Figure 13. Branch-predictor miss rate versus branch lookup-table size, alongside the corresponding cost (transistors and die area). The miss rate decreases as the table size grows, but at increasing cost, as the side graph shows.

Figure 14. D-cache size and BTB size versus CPU execution time. The execution-time values do not include the instruction count, as it is the same in all cases. Although CPI decreases with increasing cache size, the additional cycle-time penalty effectively increases CPU time, thus reducing performance.

Figure 15. RUU size versus CPU execution time. Execution time increases with RUU size: CPI does improve, but the cycle-time penalty of the larger RUU again effectively reduces performance.

The selected configurations are given in Table 6 (the mapping from these cache-configuration strings to total cache sizes is sketched in code after Figure 16).

Config  D-cache size  D-cache config    Bimodal predictor config
Base    16 KB         dl1:512:32:1:l    512
BA      32 KB         dl1:1024:32:1:l   512
BB      16 KB         dl1:512:32:1:l    1024
BC      64 KB         dl1:2048:32:1:l   1024
BD      128 KB        dl1:4096:32:1:l   1024

Table 6. The selected configurations for further analysis.

Let us now apply our cost/performance model to these configurations to select the best of them in terms of cost/performance ratio, following the same steps as in the previous section. Again, the configuration with the best CPI value is not necessarily the best in terms of execution time, as the decrease in CPI is cancelled out by the increase in cycle time in the relevant cases. Note that the execution-time values in the graph are normalized with respect to the Base configuration's value. The graphs clearly show that in terms of execution time, and hence performance, the Base configuration has the best results. Moreover, as outlined in our performance model, we obtain a unified execution-time value by taking the geometric mean of the individual values; the results of this process are shown in the graphs below.

Figure 16. Left: CPI values of the selected configurations (Base, BA, BC, BB, BD). Right: CPU time of the same configurations, normalized to the Base.
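The D-cache sizes in Table 6 follow directly from SimpleScalar's cache-configuration string, name:nsets:bsize:assoc:repl, whose total capacity is nsets × bsize × assoc. A small helper (ours, for illustration) makes the mapping explicit:

```python
def cache_size_bytes(config: str) -> int:
    """Total capacity of a SimpleScalar cache config string,
    e.g. 'dl1:2048:32:1:l' -> 2048 sets * 32 B blocks * 1 way = 64 KB."""
    _, nsets, bsize, assoc, _ = config.split(":")
    return int(nsets) * int(bsize) * int(assoc)

for cfg in ["dl1:512:32:1:l", "dl1:1024:32:1:l",
            "dl1:2048:32:1:l", "dl1:4096:32:1:l"]:
    print(cfg, cache_size_bytes(cfg) // 1024, "KB")   # 16, 32, 64, 128 KB
```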

Figure 17. Geometric-mean values of the selected configurations.

Now let us apply the cost model to these configurations. The results are summarized in Figure 18. Please note that the graph shows the total chip-space and transistor-count estimates for all configurations normalized to the corresponding values of the Base configuration.

Figure 18. Normalized values (with respect to the Base configuration) of the total transistor count and chip space (die area) for the selected configurations. Again the Base configuration has the best cost values.

We can now easily predict that the Base configuration will also have the best cost/performance ratio, which is verified by Figure 19.

Figure 19. Cost/performance ratios for the configurations under test. The Base configuration has the best value.

5 Suggested Improvements

In our design so far, we have selected a direct-mapped cache of relatively small size. For direct-mapped caches, the miss rate is a big concern, as such caches allow a large number of conflict misses. There are two ways to attack this problem: reduce the miss rate, by using a larger block size or a larger cache, or reduce the cache miss penalty. The cache performance formula is given as [3]

Average memory access time = hit time + miss rate × miss penalty

It is easy to see from this formula that improvements in miss penalty can be just as beneficial as improvements in miss rate. One approach to lowering the miss penalty is a victim cache [2], which significantly reduces the miss penalty of a small direct-mapped cache. A victim cache is a small, fully associative cache placed between a cache and its refill path (the L2 cache in our simulated environment). It stores the blocks that are discarded from the cache upon a miss, the victims of address conflicts. With a victim cache, each miss first checks the victim cache for the desired data before going to the next level of memory; if the data is found there, the victim block and the cache block are swapped. Victim caches of one to five entries have been suggested to be effective at reducing misses, especially for small direct-mapped caches [2] like the one in our configuration. Depending on the program, a four-entry victim cache might remove one quarter of the misses in a 4 KB direct-mapped cache [2, 3].

We used sim-outorder to simulate our configurations, and sim-outorder uses a write-back scheme for writing values to the next level of the memory hierarchy (it does not allow a customized write policy). One suggested improvement is to use a write-through scheme with write buffers when cost is at a premium, as in embedded processors. The problem with a write-back scheme is that it is complex and expensive to build, though it greatly reduces memory traffic; the high bandwidth demand of a write-through cache can be reduced with write buffers, effectively achieving performance at a reduced implementation cost.

As mentioned above, similar effects can be achieved by reducing the miss rate through a larger block size, which makes effective use of spatial locality and reduces compulsory misses (block sizes can be changed in sim-outorder too). But there is a trade-off: for a fixed cache size, larger blocks mean fewer blocks, so conflict misses become more frequent and the miss rate may rise. This trade-off can be evaluated by a simple quantitative/qualitative simulation study.

Another technique suggested by [3] for reducing the miss penalty and miss rate is prefetching, which comes in two forms: hardware prefetching and compiler-controlled prefetching. The idea is essentially to bring data or instructions into the processor before they are actually requested. In hardware prefetching, whenever a requested block is not found in the cache, two blocks are fetched from the next level of the memory hierarchy: the required block and the one consecutive to it. The required block is placed in the cache, while the consecutive block is placed in an instruction or data stream buffer. Jouppi [2] found that a single instruction stream buffer would catch 15% to 25% of the misses from a 4 KB direct-mapped instruction cache with 16-byte blocks.
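To make the victim-cache mechanism concrete, here is a minimal behavioral sketch. It is our own illustration, not sim-outorder code: the class names are invented, block addresses stand in for whole blocks, and only hit/miss classification is modeled.

```python
from collections import OrderedDict

class VictimCache:
    """Small fully associative victim cache (Jouppi [2]) with LRU replacement."""
    def __init__(self, entries=4):
        self.entries = entries
        self.blocks = OrderedDict()          # block address -> True, in LRU order

    def insert(self, block_addr):
        self.blocks.pop(block_addr, None)
        if len(self.blocks) >= self.entries:
            self.blocks.popitem(last=False)  # evict the least recently used entry
        self.blocks[block_addr] = True

    def lookup(self, block_addr):
        # A hit removes the block: it gets swapped back into the L1 cache.
        return self.blocks.pop(block_addr, None) is not None

class DirectMappedCacheWithVictim:
    """Behavioral model of a direct-mapped L1 backed by a victim cache."""
    def __init__(self, nsets, victim_entries=4):
        self.nsets = nsets
        self.lines = {}                      # set index -> resident block address
        self.victim = VictimCache(victim_entries)

    def access(self, block_addr):
        index = block_addr % self.nsets
        resident = self.lines.get(index)
        if resident == block_addr:
            return "hit"
        hit_in_victim = self.victim.lookup(block_addr)
        if resident is not None:
            self.victim.insert(resident)     # displaced block becomes a victim
        self.lines[index] = block_addr       # swapped in, or fetched from L2
        return "victim-hit" if hit_in_victim else "miss"

# Toy trace: blocks 0, 512, and 1024 all conflict in set 0 of a 512-set cache.
cache = DirectMappedCacheWithVictim(nsets=512, victim_entries=4)
trace = [0, 512, 0, 512, 1024, 0]
print([cache.access(a) for a in trace])      # miss, miss, victim-hit, victim-hit, ...
```

Replaying an address trace and counting the victim-hit outcomes gives a quick feel for how many conflict misses a four-entry victim cache can absorb in a small direct-mapped cache.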

An alternative to hardware prefetching is for the compiler to insert prefetch instructions that request data before it is needed.

The arguments above deal only with cache-related parameters, but our final parameter selection is based on both cache and branch parameters, so some improvements to branch prediction suggested by the research community are in order here. One such improvement, employed in most recent processors, is a return-address predictor for indirect jumps, that is, jumps whose destination address varies at run time [3]. A procedure return can be predicted with a branch target buffer, but the accuracy of such a prediction can be low if the procedure is called from multiple sites and the calls from one site are not clustered in time [3]. To overcome this problem, the Return Address Stack (RAS), a small buffer of return addresses operating as a stack, has been proposed as one form of return-address predictor (sim-outorder allows specifying the return-address-stack size). It is important to note that a RAS caches the most recent return addresses: it pushes a return address on the stack when a procedure call is made and pops it off when the flow returns from the procedure. If the stack is sufficiently large, i.e., as large as the maximum call depth, the RAS will predict return addresses perfectly.

This concludes our optimal-parameter-selection methodology, analysis, and improvement suggestions.

6 Conclusions

We have presented an exercise in optimal parameter selection for a reduced-cost, reasonable-performance processor. The quantitative and qualitative aspects of a simulation-based processor design process were presented step by step: the quantitative results of the SimpleScalar simulations were judged against the yardstick of a qualitatively analyzed cost/performance model. A processor built around a small direct-mapped cache was justified as giving the best cost/performance ratio for the selected set of parameters. Our cache simulations were also based on sim-outorder, which is not specifically meant for cache simulation; the use of sim-cache or sim-cheetah could give more fine-grained analysis options. Finally, qualitatively derived results from previous studies were presented to further enhance the performance of the proposed processor configuration while still maintaining its reduced-cost advantage.

7 References

1. M. Steinways and T. Ungerer, "Hardware Complexity of Processors".
2. N. P. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers", Proc. 17th Annual Int'l Symposium on Computer Architecture, 1990.
3. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 3rd ed., Morgan Kaufmann Publishers, San Francisco.
4. D. C. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0", UW-Madison Computer Sciences Technical Report #1342, June 1997.

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

Pipelining to Superscalar

Pipelining to Superscalar Pipelining to Superscalar ECE/CS 752 Fall 207 Prof. Mikko H. Lipasti University of Wisconsin-Madison Pipelining to Superscalar Forecast Limits of pipelining The case for superscalar Instruction-level parallel

More information

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 Name: Solutions (please print) 1-3. 11 points 4. 7 points 5. 7 points 6. 20 points 7. 30 points 8. 25 points Total (105 pts):

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

CS 2410 Mid term (fall 2018)

CS 2410 Mid term (fall 2018) CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Portland State University ECE 587/687. Caches and Memory-Level Parallelism

Portland State University ECE 587/687. Caches and Memory-Level Parallelism Portland State University ECE 587/687 Caches and Memory-Level Parallelism Revisiting Processor Performance Program Execution Time = (CPU clock cycles + Memory stall cycles) x clock cycle time For each

More information

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence Exam-2 Scope 1. Memory Hierarchy Design (Cache, Virtual memory) Chapter-2 slides memory-basics.ppt Optimizations of Cache Performance Memory technology and optimizations Virtual memory 2. SIMD, MIMD, Vector,

More information

Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

Chapter 7 The Potential of Special-Purpose Hardware

Chapter 7 The Potential of Special-Purpose Hardware Chapter 7 The Potential of Special-Purpose Hardware The preceding chapters have described various implementation methods and performance data for TIGRE. This chapter uses those data points to propose architecture

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS Homework 2 (Chapter ) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration..

More information

Slide Set 8. for ENCM 501 in Winter Steve Norman, PhD, PEng

Slide Set 8. for ENCM 501 in Winter Steve Norman, PhD, PEng Slide Set 8 for ENCM 501 in Winter 2018 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary March 2018 ENCM 501 Winter 2018 Slide Set 8 slide

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information