Improved Estimation for Software Multiplexing of Performance Counters
Wiplove Mathur, Texas Instruments, Inc., San Diego, CA
Jeanine Cook, New Mexico State University, Las Cruces, NM

Abstract

On-chip performance counters are gaining popularity as an analysis and validation tool. Most contemporary processors have between two and six physical counters that can monitor an equal number of unique events simultaneously at fixed sampling periods. Through multiplexing and estimation, an even greater number of unique events can be monitored in a single program execution. When a program is sampled in multiplexed mode using round-robin scheduling of a specified event set, the number of events that are physically counted during each sampling period is limited by the number of counters that can be simultaneously accessed. During this period, the remaining events of the multiplexed event set are not monitored, but their counts are estimated. Our work quantifies the estimation error of the event counts in the multiplexed mode, which indicates that as many as 42% of sampled intervals are estimated with error greater than 10%. We propose new estimation algorithms that result in an accuracy improvement of up to 40%.

1 Introduction

Performance counters or Performance Monitoring Counters (PMCs) are built-in hardware counters that are fabricated in the CPU chip. They can be programmed using event-select registers to count a specified event from a set of events such as L1 data cache accesses, load misses, and branches taken. These performance counters enable accurate, minimally intrusive monitoring of application performance [14]. Moreover, the statistics are collected in real time and on the hardware platform that is under test, therefore providing a high degree of confidence in the results. Simulators are widely used to gather cycle-by-cycle performance data of a large set of metrics in a single execution of a program on a simulated micro-architecture.
With the data of various events available at the same cycle, an accurate view of the cycle-by-cycle processor state can be studied. However, when performing user-level microarchitecture simulation, system-level details such as interfaces to buses, interrupt controllers, disks, and video memory are not taken into consideration. Additionally, program behavior is affected by external factors such as the operating system and TLB effects [6], which suggests that the performance data generated through user-level simulation may not be completely accurate unless the simulator does full-system simulation. However, full-system simulation is extremely slow. These simulation shortcomings are even more prevalent in the context of multiprocessors, where the availability of accurate simulators is limited and the speed perturbation is much larger. In contrast to simulators, monitoring a workload execution on native hardware using PMCs provides a much faster (i.e., real-time) platform for evaluation. PMCs are used for system tuning, workload characterization, profiling, code optimization, architecture validation, and performance evaluation. The PMCs in modern processors can monitor a very large set of events that is comparable to the number of events monitored by simulators. However, as listed in Table 1, most contemporary processors have only between two and six physical counters that can monitor an equal number of unique events simultaneously. For instance, an Intel Pentium-III processor, which has two PMCs, can monitor two events at a given instant. Moreover, one of the events counted forms the independent variable for analysis purposes (e.g., number of cycles or number of instructions). This implies that only one event is monitored in one run of a benchmark. Hence, in order to gather data for n events, the benchmark has to be executed n times. Furthermore, every architecture has certain events that are defined as conflicting, which the PMCs cannot count concurrently.
The Pentium IV Xeon is an exception with respect to the number of physical counters implemented on-chip. It has eighteen physical counters, which is more than any other contemporary commodity processor that we investigated. This processor is the exception, not the rule. Most contemporary processors, including other Intel architectures, are limited to between two and eight counters. Therefore, our work is still applicable to the majority of processors in use today. To overcome the limitation of only a few counters being available on a CPU and to enable simultaneous monitoring of a larger number of events, the technique of multiplexing is used [12]. The desired events are scheduled to be monitored by the PMCs in a round-robin fashion. During the period a particular event is monitored, the remaining events are not counted by the counters, but their counts are estimated. We quantified the estimation error of event counts in the multiplexed mode and found that 42% of the intervals are estimated with error greater than 10%. This inaccuracy can lead to false conclusions in the analysis and validation of the processor architecture or the program behavior. In order to improve the estimation of event counts, we implement four new estimation algorithms that result in an accuracy improvement of up to 40% and that can be easily incorporated into any PMC interface that supports software multiplexing, such as PAPI [5].

Processor              #PMCs   #Events*   Counter width
Pentium-III [8]        2       ...        ...-bit
R10000 [13]            2       ...        ...-bit
Itanium 2 [9]          4       ...        ...-bit
AMD Athlon64 [3]       4       ...        ...-bit
POWER4 [18]            8       ...        n/a
Pentium-IV Xeon [17]   18      ...        ...-bit

Table 1. Processor PMCs

2 PMC Interface

Several interfaces are available to access the PMCs on different microprocessor families. Some processor-specific interfaces include IBM's Performance Inspector [1] for PowerPC processors, Intel's VTune for Intel processors [2], and Rabbit for Intel and AMD processors [7]. Additionally, interfaces that provide portable instrumentation on multiple platforms include the Performance Counter Library (PCL) [4] and the Performance Application Programming Interface (PAPI) [5]. PCL and PAPI support performance counters on PowerPC, Alpha, MIPS, Pentium, AMD, and UltraSPARC processors. Both PCL and PAPI support multithreading, and PAPI explicitly supports multiplexing.
For this work, we use PAPI as our interface to the PMCs on a Pentium-III processor. We chose the Pentium-III primarily due to its availability and low impact because it is a stand-alone system. Although the Pentium-III has the fewest counters as listed in Table 1, our estimation techniques are applicable in all contexts where the number of countable events is much greater than the number of physical counters. We chose PAPI for its multiplexing support, its widespread use in performance analysis, and the large number of architectures it supports. A number of end-user tools use the PAPI library as their interface to the performance counters [11]. PAPI can monitor any event that is supported by the processor it is running on; a subset of these events is listed in Table 2.

* The number of events supported by each processor is approximate, as interpreted by the authors of this paper.

Category                      Events
L1, L2, Instruction and       Hits, Misses, Accesses, Reads, Writes,
Data Caches                   Load/Store misses, TLB misses
Instruction Mix               Total instructions executed, Total instructions
                              issued, FP instructions executed, Total branch
                              instructions executed, FP mult & div instructions
(Conditional) Branch          Branch instructions taken/not taken,
Prediction                    Branches mispredicted/predicted correctly

Table 2. Subset of common events

2.1 Multiplexed Mode

The multiplexed mode of counting is used to simultaneously monitor a larger number of events than the number of PMCs available on the processor. The events of interest are specified in an event list and are monitored in a round-robin fashion. For example, consider events A, B, C, and D defined as an event set to be counted by the PMCs in the multiplexed mode. Figure 1 shows a possible sequence in which the events may be monitored. All of the events are monitored in an interval, with every event being counted for a fixed time slice.
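The round-robin schedule just described can be simulated in a few lines. The sketch below is purely illustrative (it is not MPX or PAPI code, and all names are ours): it records which event is physically counted in each time slice and which slice counts are therefore left to be estimated.

```python
# Illustrative simulation of round-robin multiplexed sampling (not MPX/PAPI
# code). true_counts[e][t] holds the occurrences of event e in time slice t;
# only the event scheduled in slice t is physically counted.

def multiplex_schedule(events, n_slices):
    """Round-robin: slice t monitors events[t % len(events)]."""
    return [events[t % len(events)] for t in range(n_slices)]

def sample_multiplexed(true_counts, events, n_slices):
    """Return {event: [(slice, count), ...]} of physically counted samples."""
    samples = {e: [] for e in events}
    for t, e in enumerate(multiplex_schedule(events, n_slices)):
        samples[e].append((t, true_counts[e][t]))
    return samples

# With 4 events and 8 slices, each event is physically counted in only 2 of
# the 8 slices; the other 6 slice counts must be estimated.
events = ["A", "B", "C", "D"]
true_counts = {e: [10 * (i + 1)] * 8 for i, e in enumerate(events)}
samples = sample_multiplexed(true_counts, events, 8)
```

Event A is seen only in slices 0 and 4 here, which is exactly the situation the estimation algorithms of Section 5 address.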
At the end of each time slice, the current event count is read and stored in a file, which is followed by the monitoring of the next event in the event list (after resetting the counter). This sequence continues throughout event monitoring. Thus, an event A is physically counted only once in the entire interval. The counts corresponding to A are not known when other events (B, C, and D) are monitored. Since the count of an event corresponding to the complete interval is desired, including the time slices when other events are monitored, a value is estimated by the multiplexing software. PAPI implements multiplexing of counters in software since no contemporary processors (that we are aware of) support hardware multiplexing. The MPX library [12] is the basis of the multiplexing technique in PAPI. The switching of events is triggered by the Unix interval timer ITIMER_PROF, while the SIGPROF signal is used as a trap to the monitored process. After every fixed time duration (10 ms by default), MPX halts the counter, stores the current count, and starts counting the next event. The counts that are not physically counted in an interval are then estimated as discussed in Section 5.1. The Pentium-III has two PMCs and can monitor two events simultaneously [8]. In MPX, one of the two counters is always set to read the total number of cycles (cyc) executed by the instrumented code, whereas the other counter is used to monitor an event of interest.

Figure 1. Multiplexed event count estimation

Table 3 shows the number of multiplexed intervals that occur in the full execution of certain benchmarks. It also shows the error distribution of the estimated multiplexed counts that is calculated by MPX (further discussed in Section 5.1). Although 58.3% of all the intervals are estimated with an error of less than 10%, 30% of the intervals are estimated with error in the range of 10 to 50%. Moreover, 10.6% of the intervals are estimated with error greater than 50%, which clearly motivates the need for new estimation techniques to obtain more accurate data.

                                    Error distribution of intervals (in %)
Workload         Intervals   CV     <10%    10-25%   25-50%   >50%
crafty           ...         ...    ...     ...      ...      ...
mcf              ...         ...    ...     ...      ...      ...
parser           ...         ...    ...     ...      ...      ...
twolf            ...         ...    ...     ...      ...      ...
vortex           ...         ...    ...     ...      ...      ...
vpr              ...         ...    ...     ...      ...      ...
Floating point
ammp             ...         ...    ...     ...      ...      ...
art              ...         ...    ...     ...      ...      ...
equake           ...         ...    ...     ...      ...      ...
Total            ...         ...    58.3    ...      ...      10.6

Table 3. Number of multiplexed intervals; coefficient of variation (CV) for three execution runs; estimation-error distribution per benchmark.

3 Related Work

The accuracy of performance counters is studied by Korn et al. in [10], which identifies the granularity of the measured code to be a major factor contributing to the inaccuracy of performance-counter values. The study of techniques for performance monitoring by Ojha [14] suggests that the on-chip counters are the least intrusive of the common performance analysis techniques. However, we found no previous work that improves the estimation of counts in the multiplexed mode.

4 Experimental Setup

This section describes our experimental setup. The software that is used to interface to and monitor the performance counters, the events that we monitor with this software, and the benchmarks we use in performance analysis are discussed below.
4.1 PAPI

In our work, we use PAPI, version 2.3.2, to interface to the performance counters on a Pentium-III, 1 GHz, dual-processor machine running the Red Hat Linux 7.3 operating system. The Linux kernel is patched with perfctr (version 2.4.1), which is a Linux/x86 performance-monitoring counters driver [15]. PAPI uses this package to gain access to the counters on Linux/x86 platforms. The normal operation of PAPI generates cumulative counts of the events that are selected to be monitored by the PMCs. To obtain the data at the end of every time slice, the PAPI code is compiled in DEBUG mode, which outputs the counter values and the event ID in addition to other information. A timer and a signal handler (similar to the one discussed in Section 2.1) are incorporated for reading the counters in the non-multiplexed mode at regular time slices. The non-multiplexed mode of counting involves monitoring one fixed event A during every time slice, as shown in Figure 2.

4.2 Benchmarks and Event Sets

We chose a subset of benchmarks from the SPEC CPU2000 benchmark suite [16] to use in the performance analysis of the new estimation techniques; we use the reference input size in all experiments. We use a subset of events to study the accuracy of estimation techniques used in multiplexing. The Pentium-III (P6 architecture) has two performance counters that can be configured to count more than 80 events [8]. The events that we monitor and use in this work are randomly chosen and are listed in Table 4. All of the benchmark source codes are hand-instrumented with PAPI calls. Pseudo code for collecting the event counts in multiplexed or non-multiplexed mode is shown below:
    main() {
        /* Benchmark variables defined */
        /* Define PAPI variables */
        /* Set the timers for sampling in multiplexed /
           non-multiplexed mode */
        /* Enable multiplex feature if counters are to be
           run in multiplexed mode */
        /* Create the event set */
        /* Start the counters */

        --- Benchmark code executes ---

        /* Stop the counters */
        return(0);
    }

After declaring the variables of the benchmark, the PAPI library is initialized, which is followed by setting the timers for the required mode. The default counting of events is non-multiplexed mode; multiplexing is enabled if desired. The event set is created and the counters are started, after which the benchmark code executes in its normal sequence. The counters are read at regular time slices of 10 ms duration, which is the smallest time slice as defined by the Unix interval timer ITIMER_PROF, and are stopped just before the completion of the benchmark.

Event                     PAPI Preset     Value
L1 Data Cache Misses      PAPI_L1_DCM     0x...
L1 Data Cache Accesses    PAPI_L1_DCA     0x...
Instructions Committed    PAPI_TOT_INS    0x...
Load Misses               PAPI_L1_LDM     0x...
Store Misses              PAPI_L1_STM     0x...
Branches Taken            PAPI_BR_TKN     0x...
Total Cycles (cyc)        PAPI_TOT_CYC    0x...

Table 4. Multiplexed events used. The PAPI preset is used in the instrumentation code; Value is the mapped event number for the Pentium-III.

5 Methodology and Algorithms

The benchmarks are instrumented with the multiplexed and non-multiplexed code (as discussed in Section 4.2) for monitoring the different events listed in Table 4. We execute an individual benchmark in each mode three times to reduce the error due to variability of data collection in different executions of the same benchmark. Our goal is to obtain minimum absolute error between the multiplexed and the non-multiplexed counts in every interval.
The non-multiplexed counts reflect the actual or accurate count of an event, since the event is monitored continuously throughout the benchmark execution, which is not the case for multiplexed counts (Section 2.1). The steps to calculate the statistics for analyzing the estimation techniques are as follows:

Figure 2. Collection of non-multiplexed and multiplexed counters at regular time slices

1. Benchmarks are executed three times each.

2. A non-multiplexed data vector consists of six* counts of a specific event for each equivalent multiplexed interval. The sum of these six counts is the non-multiplexed event count count^nm_i.

3. Using an algorithm described in the following sections, the multiplexed event count count^m_i is estimated for the intervals during which it is not physically counted.

4. The estimation error, |count^m_i - count^nm_i|, for every interval is calculated.

* This reflects the number of events in the defined event set. The number remains the same for a particular experiment, but changes if more events are multiplexed in an experiment.

Figure 2 illustrates the manner in which data is obtained from the counters in non-multiplexed and multiplexed modes at regular time slices. cyc_i(k) is the total number of cycles elapsed since the beginning of code instrumentation; i is the interval in which all the multiplexed events are physically monitored once by the PMC; k is the time slice for which a counter is accumulating event occurrences (after resetting the counter at the end of time slice k-1). Therefore, if n events are multiplexed, then k can take values from 0 to n. For simplicity, we assume cyc_i(n) to be equivalent to cyc_{i+1}(0). Figure 3 shows the rate plot of a set of events measured in the multiplexed mode. Some important variables (at the i-th interval, k-th time slice) that we use are listed below:

    rate^m_i = count^m_i(n) / (cyc_i(n) - cyc_i(n-1))                      (1)

    rate^nm_i(k) = count^nm_i(k) / (cyc_i(k) - cyc_i(k-1)),   0 < k <= n   (2)
Figure 3. Conversion of event counts to rates

where

    count^nm_i = sum_{k=1}^{n} count^nm_i(k)                               (3)

    slope_i = (rate^m_i - rate^m_{i-1}) / (cyc_i(n) - cyc_{i-1}(n))        (4)

n = number of events being multiplexed
k = the time slice during which the event count is sampled
rate^m_i and rate^nm_i(k) = rate of occurrence of an event in multiplexed and non-multiplexed mode, respectively
count^m_i(n) = number of times an event has occurred in the i-th interval, in the time slice between cyc_i(n-1) and cyc_i(n)
count^nm_i(k) = number of times an event has occurred in the k-th time slice of the i-th interval (that is, the period between cyc_i(k-1) and cyc_i(k))
slope_i = slope of the rate between the (i-1)-th and i-th intervals

We discuss the estimation algorithms in the following sections.

5.1 Base Algorithm

The estimation algorithm (henceforth called the Base Algorithm) used in PAPI is developed and implemented in [12]. It is used to estimate the counts of the multiplexed event in each interval. Consider the case shown in Figures 2 and 3. We discuss the base algorithm that is used to estimate the count of event A in the i-th interval. Event A is monitored in the time slice k=4 (between cycles cyc_i(3) and cyc_i(4)). If count^m_i(4) is the number of occurrences of event A in this time slice, then the rate of event A can be calculated using Eqn. 1 as:

    rate^m_i = count^m_i(4) / (cyc_i(4) - cyc_i(3))                        (5)

Figure 4. Base algorithm event-count calculation

Figure 4 shows the plot of Rate vs. Total Cycles for event A alone. The rate of event A, rate^m_i and rate^m_{i-1}, corresponding to intervals i and i-1 respectively, can be calculated using Eqn. 5.
In the base algorithm, rate^m_i is assumed to be constant for the entire i-th interval, and the count of event A is estimated by the following equation:

    count^m_i ~= rate^m_i * (cyc_i(n) - cyc_{i-1}(n))                      (6)

Recall that rate^m_i is calculated using the data corresponding to the period between i(n-1) and i(n) (that is, one time slice), whereas the count between i-1(n) and i(n) (that is, one interval) is being estimated.

5.2 Trapezoid-area Method (TAM)

Figure 5 shows the plot of Rate vs. Total Cycles for an event A which is being multiplexed. The rate of event A, rate^m_i and rate^m_{i-1}, corresponding to intervals i and i-1 respectively, is calculated using Eqn. 5. In the trapezoid-area method, the rate of occurrence of event A is assumed to change linearly within an interval. Thus, the estimated count of the multiplexed event A in the i-th interval is given by the area under the trapezoid PQRS (Figure 5). Mathematically,

    count^m_i ~= 0.5 * (rate^m_i + rate^m_{i-1}) * (cyc_i(n) - cyc_{i-1}(n))   (7)
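As a concrete numerical sketch of Eqns. 5-7 (function and variable names are ours, not PAPI's; the numbers are invented for illustration):

```python
# Sketch of the base (Eqn. 6) and trapezoid-area (Eqn. 7) estimates for one
# interval. rate_i is the rate over the single slice in which the event was
# physically counted (Eqn. 5); interval_cycles = cyc_i(n) - cyc_{i-1}(n).

def sampled_rate(count_in_slice, slice_cycles):
    """Eqn. 5: events per cycle over the physically counted slice."""
    return count_in_slice / slice_cycles

def base_estimate(rate_i, interval_cycles):
    """Eqn. 6: assume the sampled rate holds for the whole interval."""
    return rate_i * interval_cycles

def tam_estimate(rate_i, rate_prev, interval_cycles):
    """Eqn. 7: assume the rate varies linearly across the interval."""
    return 0.5 * (rate_i + rate_prev) * interval_cycles

# Suppose the rate rose from 0.02 to 0.06 events/cycle over a 1000-cycle
# interval: the base algorithm extrapolates the endpoint rate, while TAM
# averages the two endpoint rates.
r_prev = 0.02
r_i = sampled_rate(6, 100)              # 6 events in a 100-cycle slice
print(base_estimate(r_i, 1000))         # uses 0.06 for the whole interval
print(tam_estimate(r_i, r_prev, 1000))  # averages 0.02 and 0.06
```

When the rate is rising, the base algorithm overshoots relative to TAM, which matches the intuition behind Figure 5.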
Figure 5. TAM event-count calculation

5.3 Divided-interval Rectangular Area (DIRA)

Figure 6 shows the plot of Rate vs. Total Cycles for an event A which is being multiplexed. The rate of event A, rate^m_i and rate^m_{i-1}, corresponding to intervals i and i-1 respectively, is calculated using Eqn. 5. The algorithm is explained in the following steps:

Figure 6. DIRA event-count calculation

1. The i-th interval (where the count is being estimated) is divided or split into j equal parts (Figure 6). The rate at the k-th division is calculated by using linear interpolation as follows:

    rate^m_i(k) = slope_i * (cyc_i(k) - cyc_{i-1}(n)) + rate^m_{i-1}       (8)

where slope_i is given by Eqn. 4.

2. The area corresponding to the k-th division is calculated by assuming the rate to remain constant between cycles cyc_i(k-1) and cyc_i(k). Thus, the area under rectangle PQRS in Figure 6 is given by the formula:

    count^m_i(k) ~= rate^m_i(k) * (cyc_i(k) - cyc_i(k-1))                  (9)

where the value of rate^m_i(k) is obtained from Eqn. 8.

3. Repeat steps 1 and 2 for 1 <= k <= j.

4. The estimated count of the multiplexed event A in the i-th interval is given by:

    count^m_i = sum_{k=1}^{j} count^m_i(k)                                 (10)

In our case, j = n.

5.4 Positional Mean Error (PME)

Figure 7. PME event-count calculation

The Positional Mean Error (PME) algorithm is a two-phase algorithm. Phase 1 involves calculating the rate corrections, or positional mean errors, and Phase 2 consists of using the PMEs to correct the multiplexed rates and estimate the event count. Note that this algorithm is most useful in studies where a program must be executed multiple times, such as program performance and code optimization. The following steps comprise Phase 1:

1. The rate of event A in multiplexed mode at the k-th position, rate^m_i(k), is calculated by using linear interpolation:

    rate^m_i(k) = slope_i * (cyc_i(k) - cyc_{i-1}(n)) + rate^m_{i-1}       (11)

where slope_i is given by Eqn. 4.
2. The difference between the rate of event A in non-multiplexed mode, rate^nm_i(k), and the rate that is calculated in Step 1 above is given by:

    e_k = rate^nm_i(k) - rate^m_i(k)                                       (12)

where rate^nm_i(k) is given by Eqn. 2. This difference is the positional error for position k in the i-th interval and is calculated for 1 <= k <= n and for every interval i.

3. The PME is then given by:

    pme_k = (1 / i_total) * sum_i e_k                                      (13)

where pme_k is the Positional Mean Error for the k-th position and i_total is the total number of intervals.

Phase 1 produces n PMEs that are used in Phase 2 for estimating the event counts. Phase 2 includes the following steps:

1. Same as Step 1 of Phase 1.

2. Calculate the corrected rate at the k-th position:

    c_rate^m_i(k) = rate^m_i(k) + pme_k                                    (14)

3. Assuming a linear rate between corrected positional rates, the count in the k-th slice division is now estimated using the trapezoid-area method discussed in Section 5.2:

    count^m_i(k) ~= 0.5 * (c_rate^m_i(k) + c_rate^m_i(k-1)) * (cyc_i(k) - cyc_i(k-1))   (15)

4. The estimated count in the i-th interval is then given by:

    count^m_i = sum_{k=1}^{n} count^m_i(k)                                 (16)

5.5 Multiple Linear Regression (MLR) Model

The Multiple Linear Regression (MLR) model allows prediction of a response variable as a function of predictor variables using a linear model. In vector notation, it is given by:

    y = X b                                                                (17)

where
y = column vector of non-multiplexed counts, aggregated over the respective multiplexed intervals;
X = matrix in which each column element is a multiplexed sub-interval area, as shown in Figure 8, and the data in a specific row corresponds to a particular interval;
b = predictor parameters.

Figure 8. MLR event-count calculation

Hence, the multiplexed sub-interval areas are represented as a linear model of the actual count (the non-multiplexed count).
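This formulation can be sketched end to end in a few lines: fit b via the usual normal-equation solution b = (X^T X)^-1 (X^T y), then scale and sum the sub-interval areas. The code below is an illustrative pure-Python sketch for small j only, and all names are ours:

```python
# Illustrative sketch of the MLR estimator: fit b = (X^T X)^-1 (X^T y) by
# Gauss-Jordan elimination, then scale each sub-interval area by b[k] and
# sum. Pure Python, intended only for small systems; names are ours.

def solve(A, v):
    """Solve the small square system A b = v by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def mlr_fit(X, y):
    """Normal equations: b = (X^T X)^-1 (X^T y)."""
    m, j = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][c] for i in range(m)) for c in range(j)]
           for a in range(j)]
    Xty = [sum(X[i][a] * y[i] for i in range(m)) for a in range(j)]
    return solve(XtX, Xty)

def mlr_estimate(areas, b):
    """Scale each sub-interval area and sum the scaled areas."""
    return sum(bk * xk for bk, xk in zip(b, areas))

# Toy data: rows are intervals, columns are sub-interval areas; y holds the
# non-multiplexed (reference) counts used to train the predictor.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 3.0, 5.0]
b = mlr_fit(X, y)   # approximately [2.0, 3.0] for this consistent system
```

Once b is computed from a training run, `mlr_estimate` needs only the sub-interval areas of a new interval, which is why a pre-calculated predictor library makes the method usable at run time.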
The predictor parameter estimate is given by:

    b = (X^T X)^{-1} (X^T y)                                               (18)

The estimated parameters are then used to scale the trapezoid areas in an interval, and the sum of the scaled areas is taken as the estimated multiplexed count of that interval. Mathematically,

    scaled_x_k = b[k] * x_k                                                (19)

    count^m_i = sum_{k=1}^{j} scaled_x_k                                   (20)

For our study, the sample size is 0.5 of the population size. This algorithm, like PME, is most useful in studies such as program performance and code optimization.

6 Results

We show the results of applying the algorithms described in Sections 5.1 through 5.5 in Figure 9. These figures describe the accuracy of each algorithm for each event in terms of percentage improvement. The improvement of an algorithm is computed by comparing the estimated multiplexed counts to the non-multiplexed counts of the same event. The total absolute error (sum_k |count^m_k - count^nm_k|) is computed for all of the algorithms and compared with the base.

Figure 9. Improvement is calculated by comparing the total absolute error of estimation by each algorithm to that of the base algorithm. Panels show the integer workloads (crafty, mcf, parser; twolf, vortex, vpr) and the floating-point workloads (ammp, art, equake); the y-axis is % improvement compared to Base.

For all of the benchmarks, the new estimation algorithms result in decreased error compared to the base algorithm for each event. For the benchmark crafty, the improvement varies between 5-40% over the set of events, as shown in Figure 9. For data cache misses and load misses, the improvement realized by PME and MLR is approximately 40%. TAM and DIRA estimated the store misses and branches taken with approximately 15% greater accuracy. Similar improvements are observed for the floating-point benchmarks. The error reduction varies between 7-40% for all of the floating-point benchmarks across the six events. All of the algorithms proved to be the best for art, with improvement of more than 30% for five out of six events. Table 5 summarizes the characteristics of the algorithms; Table 6 summarizes the average estimation improvement realized for each event over all algorithms and benchmarks. The average computed is the geometric mean, and all averages exclude the negative-valued outliers. The maximum estimation improvement for all of the algorithms is at least 32%; the average estimation improvement for TAM is 11.6%, with that for DIRA being 12.0%. PME and MLR provide estimation improvements between 2 and 40% (with a single outlier of -35% for store misses in ammp for MLR), with average improvements of 13.5% and 14.1%, respectively. However, PME and MLR require correction-parameter and predictor-variable libraries, respectively, that are in turn used for estimating the multiplexed counts. These libraries are benchmark and event specific and require non-multiplexed counts for their computation. This also implies that PME and MLR can be implemented in real time, along with the program execution, only if a pre-calculated library is available. On the other hand, TAM and DIRA are more generic in nature.
They can be implemented on any benchmark for any desired event without the need for a library. PME and MLR perform better for crafty, twolf, and equake, whereas TAM and DIRA perform better for parser and mcf. In the case of art, all of the algorithms perform equally well.

Factor / Algorithm                   Base      TAM       DIRA   PME      MLR
Requires a pre-calculated library?   No        No        No     Yes      Yes
Needs non-multiplexed counts for
its implementation?                  No        No        No     Yes      Yes
Real-time implementable?             Yes       Yes       Yes    No/Yes   No/Yes
Computation intensity                Very low  Very low  Low    High     High

Table 5. Algorithm characteristics

Event / Algorithm              TAM     DIRA    PME     MLR
Range of improvement
(% min/max)                    2.6/... 3.1/... 2.1/... 2.6/...
Avg % improvement (overall)    11.6    12.0    13.5    14.1

Table 6. Average % improvement of each event for all algorithms and benchmarks

6.1 Sensitivity

The sensitivity of the algorithms to the number of multiplexed events is studied here. We multiplex six, ten, and fourteen events. When monitoring a higher number of events, the event set is chosen as a superset of the event set used when monitoring a smaller number of events. Thus, the six events listed in Table 4 are monitored in all of the experiments. We expect to see an increase in the estimation errors as the number of events is increased. This is because when a larger number of events is monitored, the frequency of the physical monitoring of an event is smaller and, therefore, the estimation is carried out over a larger duration of time. The sensitivity study is performed using the mcf (integer) and art (floating point) benchmarks. These applications are chosen since all of the algorithms performed consistently on them (Figure 9). Three experiments, namely mult6, mult10, and mult14, are performed in which six, ten, and fourteen events are multiplexed, respectively. The sensitivity of the algorithms on the six events that are monitored in the three experiments for the mcf and art benchmarks is shown in Figure 10. The performance of the algorithms is evaluated as discussed in Section 6.
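One effect that shapes these sensitivity results is that signed per-slice errors around a mean tend to cancel when they are aggregated over a longer estimation window (the "averaging problem" discussed below). A toy illustration of our own construction:

```python
# Toy illustration of the "averaging problem": alternating per-slice errors
# of +5 and -5 cancel completely once they are summed over a window that
# spans both signs, so a longer estimation window can show less total error.

def window_errors(slice_errors, window):
    """Absolute value of the signed error summed over each window."""
    return [abs(sum(slice_errors[i:i + window]))
            for i in range(0, len(slice_errors), window)]

slice_errors = [+5, -5, +5, -5, +5, -5, +5, -5]
print(sum(window_errors(slice_errors, 1)))  # every slice error counted: 40
print(sum(window_errors(slice_errors, 4)))  # errors cancel in each window: 0
```

This mirrors the tradeoff noted in the text: a larger multiplexing window gives more opportunity for cancellation, but each event is then estimated over a longer unmonitored span.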
Even with the increase in the number of multiplexed events, all of our algorithms estimate the multiplexed counts with more accuracy when compared to the base algorithm. This can be seen in Figure 10, where the dotted lines indicating the implemented algorithms are always below the solid line indicating the base algorithm. The gap between the base and any algorithm shows the extent of accuracy improvement; the larger the gap, the greater the improvement. The hypothesis that estimation errors increase with an increase in the number of multiplexed events seems to be incorrect. For instance, the plots for mcf and art show that the estimation error decreases with an increase in multiplexed events. For some of the events, the error increases from mult6 to mult10 but decreases from mult10 to mult14. The reduced error may be attributed to the averaging problem, wherein the positive and negative counts, around a mean value, cancel each other. If the interval is large, the chances of cancellation increase and the estimation error is reduced. Nevertheless, there exists a tradeoff between the number of multiplexed events, the accuracy of the estimated multiplexed counts, and the number of intervals for which the data is collected. The accuracy of the multiplexed counts depends on the behavior of the monitored event in the window where it is estimated.

7 Conclusions and Future Work

The algorithms we introduce reduce the estimation error for all of the multiplexed events for each of the benchmarks. Improvement of up to 40% is achieved by the PME and MLR algorithms, and up to 32% for TAM and DIRA. Utilizing any of these techniques will greatly reduce the estimation errors of the multiplexed counts. PME and MLR require a pre-calculated library of correction parameters (corresponding to the event-benchmark pair) for their implementation, whereas TAM and DIRA are more generic and independent of any event or benchmark.
Since the interval size is defined by time (10 ms in our case), the event counts cannot be collected at a specified cycle. Therefore, it is difficult to collect cycle-synchronized performance metrics that can provide a complete snapshot of cycle-by-cycle program behavior. We plan to address this in the future by incorporating an algorithm that we have developed for cycle synchronization into the techniques discussed in this paper. Because PME and MLR require pre-calculated, benchmark-specific libraries, their usefulness is limited to specific types of studies. Therefore, we are investigating techniques to calculate these libraries that are independent of the benchmark. We are also examining the possibility of collecting benchmark event data using the test input and applying it to reference-input executions for each algorithm. Finally, we are actively applying our estimation techniques
to PMCs on contemporary processors such as the Intel Pentium IV and AMD Opteron.

Figure 10. Sensitivity plots for mcf (left, integer) and art (right, floating point); panels (a)-(f) plot total absolute error against the number of multiplexed events

8 Acknowledgments

This work was supported by the National Science Foundation ADVANCE Institutional Transformation Program at NMSU, fund #NSF.

References

[1] Performance Inspector. ibm.com/developerworks/oss/pi/.
[2] VTune profiling software.
[3] AMD. BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD Opteron Processors. AMD, September.
[4] R. Berrendorf, H. Ziegler, and B. Mohr. PCL - the performance counter library: A common interface to access hardware performance counters on microprocessors. Research Centre Juelich GmbH, version 2.1, February.
[5] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci. A portable programming interface for performance evaluation on modern processors. The International Journal of High Performance Computing Applications, 14(3), Fall.
[6] J. Gibson, R. Kunz, D. Ofelt, and M. Heinrich. FLASH vs. (simulated) FLASH: Closing the simulation loop. In Architectural Support for Programming Languages and Operating Systems, pages 49-58.
[7] D. Heller. Rabbit: A performance counters library for Intel/AMD processors and Linux.
[8] Intel. IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide.
[9] Intel. Intel Itanium 2 Processor Reference Manual for Software Development and Optimization.
[10] W. Korn, P. Teller, and G. Castillo. Just how accurate are performance counters? In 20th IEEE International Performance, Computing, and Communications Conference, Phoenix, Arizona, April.
[11] K. London, J. Dongarra, S. Moore, P. Mucci, K. Seymour, and T. Spencer. End-user tools for application performance analysis, using hardware counters.
In International Conference on Parallel and Distributed Computing Systems, Dallas, TX, August.
[12] J. M. May. MPX: Software for multiplexing hardware performance counters in multithreaded programs. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, April.
[13] MIPS Technologies. R10000 Microprocessor User's Manual, Version 1.0. MIPS Technologies Inc., Mountain View, CA, June.
[14] A. K. Ojha. Techniques in least-intrusive computer system performance monitoring. In SoutheastCon Proceedings. IEEE.
[15] M. Pettersson. Linux x86 performance-monitoring counters driver. mikpe/linux/perfctr/.
[16] Standard Performance Evaluation Corporation (SPEC).
[17] B. Sprunt. Pentium 4 performance-monitoring features. IEEE Micro, 22(4):72-82.
[18] S. Vetter. The POWER4 Processor Introduction and Tuning Guide. IBM Corporation, International Technical Support Organization, first edition, November 2001.