Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Joel Hestness, Lenni Kuff
Computer Science Department, University of Wisconsin, Madison, WI

Abstract

Present-day computer architects use advanced microarchitecture simulators to test the performance of processor designs. The simulator workloads are generally benchmarks that are representative of specific types of real-world applications. Because microarchitecture implementations keep increasing in complexity and the simulation workloads are required to represent complicated applications, simulation time has greatly increased. To address this problem, researchers are looking into ways to reduce the amount of time benchmarks run while maintaining the workload characteristics of the longer benchmarks. MinneSPEC is a representative reduction of SPEC2000, with the reduced input sets found using SimpleScalar profiling tools [1]. With the release of SPEC CPU2006, new benchmarks have been added to the SPEC benchmarking suite, and they will be used to evaluate the performance of tomorrow's microprocessors. These benchmarks are considerably larger than those in SPEC2000, and using SimpleScalar to profile their workloads would take a large amount of time and effort.

This paper suggests a different reduction technique, which gathers profiling information from processor performance counters accessed through PAPI. Since workloads run on a native system instead of a simulator, profiling information can be gathered in a much shorter amount of time. This allows fine-grained tuning of reduced input sets, so more representative reduced benchmarks can be found much more quickly. Using this technique, we were able to reduce five SPEC2006 benchmarks to under 10 billion instructions while still achieving over 95% Chi Square confidence for the instruction mix between the reduced input set and the reference input set.
While further work in this area needs to be done, the preliminary results of our research show that PAPI benchmark reduction may be a much more effective reduction technique than SimpleScalar alone.

1 Introduction

Computer processor design has changed rapidly since the 1980s. Instead of basing microarchitecture design decisions on the advice of experts and rules of thumb, designers now run directed workloads on advanced simulation tools to make microarchitecture decisions [2]. Running benchmarks is an extremely important part of early design performance evaluations [3]. Because of the increasing complexity of microarchitectures and of the workloads executed on them, simulation times have grown very long. Designers must now choose between running a few long, more representative benchmarks and running more short, less representative benchmarks [1].

This work proposes and implements the use of performance counters available in modern microprocessors to collect information about the instruction mix, branch predictor, cache and translation lookaside buffer (TLB) performance, and instructions per cycle (IPC) for the reference input sets to the SPEC CPU2006 benchmark suite. This information is statistically compared to custom input sets to find representative reduced input sets. Techniques for finding reduced input sets are described in Methodology. Further, the use of performance counters on native hardware greatly speeds data collection compared with a simulator such as SimpleScalar [6], which was used for MinneSPEC.

2 Related Work

SPEC CPU [4] is a popular benchmarking suite widely used in the computer architecture industry. When SPEC released CPU2000, the benchmark run times were greatly increased from the previous version. For example, the SPEC CPU2000 benchmark parser with the reference input is over 500 billion instructions, which could take 3 weeks to simulate at 100,000 instructions/sec. This was a problem for computer architects who wanted to run simulations of the SPEC2000 benchmarks on their designs. The magnitude of the SPEC2000 suite prompted research at the University of Minnesota to develop a set of benchmarks called MinneSPEC [1]. MinneSPEC is a reduced input set version of the SPEC CPU2000 benchmark suite that maintains program characteristics very similar to those of the original SPEC2000 suite.

With the release of SPEC CPU2006, new benchmarks have been added which designers will want to use when simulating future processor designs. These new benchmarks have effective reference set instruction counts that are almost an order of magnitude larger than those of SPEC2000, so it would be beneficial to extend the MinneSPEC research and reduce the input sets for the benchmarks in SPEC CPU2006.

Eeckhout et al. [2] describe the use of Principal Component Analysis to cluster the benchmarks within a suite in order to further reduce the suite by finding reduced input sets that are representative of multiple benchmarks instead of just a single one. This type of analysis was beyond the scope of our project due to the limited time that was available.

3 Design

This work targets reduction of the SPEC CINT2006 benchmarks due to the time constraints on the project, as well as results from [1].
In contrast to the MinneSPEC project, we utilize an open source package called PAPI [5], from the Innovative Computing Laboratory at the University of Tennessee, that provides a library for access to the performance counters of many modern microprocessors. We used a 3.0 GHz Pentium 4 processor with 1 GB of memory for this data collection. By collecting data on native hardware, our runtimes were sped up by a factor of about 1000 over using SimpleScalar. In addition, we ran a number of the reference and reduced input set benchmarks on SimpleScalar to validate our reduced input sets on the Alpha instruction set architecture. We utilized PAPI preset performance counters for the Pentium 4, as well as sim-profile and sim-cache in SimpleScalar.

4 Methodology

The data that we collected includes the instruction mix, the performance of the cache, branch predictor and TLB, and the IPC for a run of the benchmark. The instruction mix is defined as the percentages of the following instruction classes:

  - Loads
  - Stores
  - Unconditional Branches
  - Conditional Branches
  - Integer Instructions
  - Floating Point Instructions
  - Miscellaneous (Vector, Trap, etc.)

Data cache miss rate information was not supported in PAPI, so we relied on the results of sim-cache in SimpleScalar in order to make comparisons. The branch predictor results came from PAPI counters and included branches predicted taken, mispredicted taken, predicted not taken, and mispredicted not taken. We also collected the data TLB misses. Finally, the IPC was collected using PAPI and could, in future work, be collected using SimpleScalar.

Using the aforementioned data, we were able to reduce five of the CINT benchmarks using the following technique. We used the instruction mix as the major factor in searching for a reduced input set; the other factors were instructions per cycle and branch predictor, cache and TLB performance.
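As an illustration of how the characteristics listed above could be derived from raw counter readings, the following Python sketch turns a set of counts into an instruction mix, an IPC value, and a data TLB miss rate per instruction. The dictionary keys are standard PAPI preset event names, but the paper does not state which presets were actually read, so that mapping is an assumption. Treating the instruction classes as disjoint, with the miscellaneous class computed as the remainder, is also an assumption, and the sample counts are hypothetical.

```python
def summarize_counters(counts):
    """Derive instruction mix (percent), IPC, and data TLB misses per
    instruction from a dict of raw counter values.

    Assumption: keys are PAPI preset event names; the paper does not list
    the exact presets used. Classes are treated as disjoint, which may not
    hold on every processor.
    """
    total = counts["PAPI_TOT_INS"]
    classes = {
        "loads": counts["PAPI_LD_INS"],
        "stores": counts["PAPI_SR_INS"],
        "unconditional_branches": counts["PAPI_BR_UCN"],
        "conditional_branches": counts["PAPI_BR_CN"],
        "integer": counts["PAPI_INT_INS"],
        "floating_point": counts["PAPI_FP_INS"],
    }
    # Everything not covered by the classes above (vector, trap, etc.).
    classes["miscellaneous"] = total - sum(classes.values())
    mix_pct = {name: 100.0 * n / total for name, n in classes.items()}
    return {
        "mix_pct": mix_pct,
        "ipc": total / counts["PAPI_TOT_CYC"],
        "dtlb_misses_per_instr": counts["PAPI_TLB_DM"] / total,
    }


# Hypothetical counter readings, for illustration only.
SAMPLE_COUNTS = {
    "PAPI_TOT_INS": 1000, "PAPI_TOT_CYC": 2000, "PAPI_TLB_DM": 10,
    "PAPI_LD_INS": 250, "PAPI_SR_INS": 100, "PAPI_BR_UCN": 30,
    "PAPI_BR_CN": 120, "PAPI_INT_INS": 480, "PAPI_FP_INS": 0,
}

summary = summarize_counters(SAMPLE_COUNTS)
print(summary["mix_pct"], summary["ipc"], summary["dtlb_misses_per_instr"])
```

A negative miscellaneous percentage would indicate that the classes overlap or that a counter is misreporting, so it is worth checking before comparing mixes.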
When the instruction mix was collected for the reference set and a trial reduced set, the mixes were compared using the Chi Square Test, with the reference as the expected value, to find our confidence in their dependence. Our target for Chi Square confidence was 90%, per the MinneSPEC results. Further, we wished to reduce the instruction count to between 1 and 10 billion instructions, a slight increase over the target size for MinneSPEC due to Moore's law and the large increase in size between the SPEC2000 benchmarks and the SPEC2006 suite. Finally, a qualitative analysis of cache, branch predictor and TLB performance, and IPC is used to fully compare reference to reduced benchmarks. The results and discussion explain how these parts fit together as well as what would be required to complete this work.

Benchmark       Type  Description                 Reduction Method                       Inst Mix Match  Reduction Factor (x86)  Run Length           ISA
401.bzip2       Int   Compression                 Modified command line, new input file  Yes             426x                    6.19 / 7.52 billion  x86 / Alpha
456.hmmer       Int   Search Gene Sequence        Modified command line                  Yes             735x                    4.43 billion         x86
462.libquantum  Int   Quantum Computing           Modified command line                  Yes             3854x                   1.00 / 1.01 billion  x86 / Alpha
429.mcf         Int   Combinatorial Optimization  Same as test                           Yes             95x                     4.73 / 4.88 billion  x86 / Alpha

Figure 1: SPEC CINT2006 Benchmark Reduction Results.

To find a reduced input set, we employed an iterative process. First, a benchmark was selected and run to collect data for the reference, test and train input sets provided by SPEC. The test and train sets were often very useful as a starting point for the iterative process. The iterative part of the reduction follows one of two distinct paths: (1) change the command line input to the benchmark, or (2) change the input files. For instance, the benchmark 462.libquantum, described later, only uses command line input, while 401.bzip2 compresses data files, so to reduce its input, we tried swapping out or changing the contents of its input files. Finally, we checked the data collected, and if the candidate input set met our requirements, we ran that set on SimpleScalar.
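The Chi Square comparison and the acceptance targets described above can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: it assumes the statistic is computed on instruction mix percentages with the reference mix as the expected values, that classes with a zero reference percentage are dropped, and that "confidence" means the upper-tail probability of the chi-square distribution. The function names and the example mixes are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with df degrees of
    freedom, using the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 1.0
    a, half = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= half / (a + n)
        total += term
        n += 1
    p_lower = total * math.exp(-half + a * math.log(half) - math.lgamma(a))
    return max(0.0, 1.0 - p_lower)


def mix_match_confidence(ref_pct, cand_pct):
    """Chi-square goodness of fit of a candidate mix against the reference
    mix (both in percent). Assumption: zero-expected classes are skipped."""
    classes = [c for c in ref_pct if ref_pct[c] > 0]
    stat = sum((cand_pct.get(c, 0.0) - ref_pct[c]) ** 2 / ref_pct[c]
               for c in classes)
    return chi2_sf(stat, len(classes) - 1)


def meets_targets(confidence, instr_count, min_conf=0.90,
                  min_instr=1e9, max_instr=10e9):
    """Targets from the text: at least 90% confidence and a run length
    between 1 and 10 billion instructions."""
    return confidence >= min_conf and min_instr <= instr_count <= max_instr


# Hypothetical instruction mixes (percent), for illustration only.
REF = {"loads": 25.0, "stores": 10.0, "unconditional_branches": 3.0,
       "conditional_branches": 12.0, "integer": 48.0,
       "floating_point": 0.0, "miscellaneous": 2.0}
CAND = {"loads": 26.0, "stores": 9.5, "unconditional_branches": 3.0,
        "conditional_branches": 12.5, "integer": 47.0,
        "floating_point": 0.0, "miscellaneous": 2.0}

conf = mix_match_confidence(REF, CAND)
print(conf, meets_targets(conf, 4.43e9))
```

Working on percentages rather than raw counts is a deliberate choice in this sketch: with billions of instructions per run, a test on raw counts would flag even negligible differences in mix as significant.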
Otherwise, we iterated the process again. The speed of native hardware allowed us to make quick changes to the trial input and keep momentum, whereas using SimpleScalar, we would have to wait minutes for candidate reduced input data to be returned.

5 Results

Figure 1 shows the results of our reduction process for the benchmarks 401.bzip2, 429.mcf, 456.hmmer, 458.sjeng, and 462.libquantum. The run lengths range between 1 and 7.5 billion instructions, which, depending on the benchmark, equates to reductions from the reference size of two to four orders of magnitude. The Alpha run lengths are supplied where we had sufficient time to collect them using SimpleScalar. The three reduced sets that ran to completion on SimpleScalar met the Chi Square confidence target for the instruction mix.

The distribution of instructions across classes was very different between the x86 and the Alpha architectures. For instance, the percentage of integer instructions on the x86 was much smaller than the percentage of integer instructions on the Alpha for most benchmarks. This could be due to the integer operations embedded within a single x86 instruction, as in some loop instructions. It is also reflected in the difference between the total instruction counts on the two architectures for each benchmark.

Figure 2 and Figure 3 present example data for 401.bzip2 pertaining to the TLB miss rate per instruction and IPC values. Since the IPC was one of our factors in deciding how well a candidate reduced input set matched the reference input set, observations about correlations within the data guided our reduction process. Figure 4 presents the data cache miss rate for the 429.mcf benchmark over multiple cache sizes as collected from SimpleScalar. More information on these data components is included in the discussion.
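To relate the reduction factors in Figure 1 to the "orders of magnitude" wording above, the short sketch below takes the base-10 logarithm of each x86 reduction factor. It also estimates a reference run length under the assumption that the reduction factor is the reference instruction count divided by the reduced instruction count on x86. The paper does not report reference counts directly, so those estimates are derived values, not measurements.

```python
import math

# (reduction factor on x86, reduced x86 run length in instructions),
# copied from Figure 1.
FIGURE_1 = {
    "401.bzip2": (426, 6.19e9),
    "456.hmmer": (735, 4.43e9),
    "462.libquantum": (3854, 1.00e9),
    "429.mcf": (95, 4.73e9),
}


def orders_of_magnitude(factor):
    """Reduction expressed as a power of ten."""
    return math.log10(factor)


def implied_reference_length(factor, reduced_length):
    """Assumption: factor = reference count / reduced count."""
    return factor * reduced_length


for name, (factor, reduced) in FIGURE_1.items():
    print(f"{name}: {orders_of_magnitude(factor):.2f} orders of magnitude, "
          f"about {implied_reference_length(factor, reduced):.2e} "
          f"reference instructions (derived)")
```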

                            Candid 1  Reduced  Candid 2  Train   Ref
TLB Misses Per Instruction  0.0272    0.0253   0.0209    0.0105  0.0111
IPC                         0.3780    0.3950   0.4948    0.5462  0.

Figure 2: The comparison of TLB misses per instruction and IPC for 401.bzip2 for two candidate input sets, the reference, train, and reduced input sets.

[Plot: IPC vs TLB Misses Per Instruction]

Figure 3: A comparison of TLB misses per instruction and IPC for the 401.bzip2 benchmark. Notice that as the TLB miss rate increases, the IPC decreases.

6 Discussion

We can see from the bzip2 results in Figure 2 and Figure 3 that the TLB misses per instruction for the reduced benchmarks were high. We attribute this to the overhead of filling the TLB at the beginning of the run: shorter runs have a higher number of TLB misses per instruction than longer runs, which miss infrequently as the run goes on. This increase in TLB miss rate as the size of the input set decreases is reflected in the IPC difference between the reference and reduced input sets. File size, file type and the length of the benchmark run were used as a guide to finding a reduced input set that maintained the desired instruction mix and count characteristics while avoiding a large IPC difference. The quick access to TLB and IPC data that PAPI provides is also beneficial for guiding the reduction process.

The data cache miss rate collected in SimpleScalar must be analyzed qualitatively when comparing the reference set to the reduced input set. As can be seen in Figure 4, as the cache size increases, the miss rate decreases logarithmically. However, when comparing the reference data and the reduced data, we see that the reduced input set is more sensitive to a change in cache size. This is mainly due to the memory footprint difference between the reference input and the reduced input. The reduced input set has a smaller memory footprint, so it responds better to more cache. In addition, the poor cache miss rates for the reduced input set with a small cache can probably be attributed to worse data access locality when compared to the reference input. The other two benchmarks for which this data was collected showed similar cache sensitivity from reference to reduced input sets.

[Plot: data cache miss rate vs cache size]

Figure 4: Data cache miss rate for the Test/Reduced and Reference 429.mcf input files. The smaller reduced set is more sensitive to cache size changes.

This work needs to be extended to include a function profile using a tool like gprof or oprofile. Other research in benchmark reduction uses a function profile to compare reference and reduced input sets. The function profile is the list of C-like functions that are called during execution of the program, together with the percentage of time spent in each. This profile gives a higher level comparison of how a program executes between reference and reduced input sets. Relative profile statistics should be less architecture dependent than the instruction mix characteristics that we collected, because a function will be executed on any architecture given the same input, while particular instructions are not necessarily included in every ISA. We found that PAPI cannot be used concurrently with gprof, so given our time constraints, we were not able to produce function profiles of the benchmarks that we worked on.

We did run across an anomaly when using PAPI to collect branch information. During a test run using PAPI, we found that the conditional branch count exceeded the total branch count. We tried multiple means of counting the same information, but the results remained the same. This bug needs to be looked into, but on the P4, the results did not sway the Chi Square Test results enough for us to spend more time evaluating it.
As a brief aside, a couple of the benchmarks that we attempted to reduce continued to show a Chi Square confidence of more than 95% as we reduced their size to what could be considered microbenchmark size (e.g. 462.libquantum was reduced to fewer than 150 million instructions).

7 Conclusions

In the past, reducing benchmark input sets has been a long and tedious task. Every time an input set change was made, the person reducing the benchmark would have to wait for the simulation profile results to return, which could take over an hour. To allow for faster and more flexible benchmark reduction, we have used PAPI to access the processor performance counters to gather the instruction mix and other workload characteristics, such as TLB miss rate. Because the workload runs on a native system, the profile results are returned in minutes instead of hours or days. By using PAPI for workload characterization and an interactive benchmark reduction process, we have successfully reduced five of the SPEC CPU2006 integer benchmarks. Because PAPI returned profile results in such a short amount of time, we were able to make fine-grained changes to our reduced input set to home in on a representative reduced input set for the benchmark.

Our initial results from using this reduction technique seem very promising. We see no obstacle to reducing the remaining SPEC CPU2006 Integer and Floating Point benchmarks in this manner. In fact, this technique could be applied to reducing other benchmarking suites as well.

8 References

[1] A. J. KleinOsowski and D. J. Lilja, "MinneSPEC: A New SPEC Benchmark Workload for Simulation Based Computer Architecture Research," Computer Architecture Letters, IEEE, vol. 1, pp. 7-7,
[2] L. Eeckhout, H. Vandierendonck and K. De Bosschere, "Designing Computer Architecture Research Workloads," Computer, vol. 36, pp ,
[3] P. Bose and T. M. Conte, "Performance Analysis and its Impact on Design," Computer, vol. 31, pp ,
[4] Standard Performance Evaluation Corporation. SPEC benchmark suites. Various benchmark suites available. Details available at
[5] Performance Application Programming Interface (PAPI). Innovative Computing Laboratory. Available at
[6] Todd Austin, Eric Larson, and Dan Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. IEEE Computer, pages 59-67, February
[7] David J. Lilja. Measuring Computer Performance: A Practitioner's Guide. Cambridge University Press, New York, NY,
[8] W.C. Hsu et al., "On the Predictability of Program Behavior Using Different Input Data Sets," Proc. 6th Workshop on Interaction between Compilers and Computer Architectures (INTERACT 02), held in conjunction with the 8th Int'l Symp. High Performance Computer Architecture (HPCA 8), IEEE CS Press, 2002, pp
