Improved Estimation for Software Multiplexing of Performance Counters


Wiplove Mathur, Texas Instruments, Inc., San Diego, CA
Jeanine Cook, New Mexico State University, Las Cruces, NM

Abstract

On-chip performance counters are gaining popularity as an analysis and validation tool. Most contemporary processors have between two and six physical counters that can monitor an equal number of unique events simultaneously at fixed sampling periods. Through multiplexing and estimation, an even greater number of unique events can be monitored in a single program execution. When a program is sampled in multiplexed mode using round-robin scheduling of a specified event set, the number of events that are physically counted during each sampling period is limited by the number of counters that can be simultaneously accessed. During this period, the remaining events of the multiplexed event set are not monitored, but their counts are estimated. Our work quantifies the estimation error of the event counts in multiplexed mode and shows that as many as 42% of sampled intervals are estimated with error greater than 10%. We propose new estimation algorithms that result in an accuracy improvement of up to 40%.

1 Introduction

Performance counters, or Performance Monitoring Counters (PMCs), are hardware counters built into the CPU chip. They can be programmed through event-select registers to count a specified event from a set of events such as L1 data-cache accesses, load misses, and branches taken. These performance counters enable accurate, minimally intrusive monitoring of application performance [14]. Moreover, the statistics are collected in real time and on the hardware platform that is under test, providing a high degree of confidence in the results.

Simulators are widely used to gather cycle-by-cycle performance data for a large set of metrics in a single execution of a program on a simulated micro-architecture. With the data for the various events available at the same cycle, an accurate view of the cycle-by-cycle processor state can be studied. However, user-level micro-architecture simulation does not take into account system-level details such as interfaces to buses, interrupt controllers, disks, and video memory. Additionally, program behavior is affected by external factors such as the operating system and TLB effects [6], which suggests that the performance data generated through user-level simulation may not be completely accurate unless the simulator performs full-system simulation. Full-system simulation, however, is extremely slow. These simulation shortcomings are even more prevalent in the context of multiprocessors, where the availability of accurate simulators is limited and the speed perturbation is much larger. In contrast to simulators, monitoring a workload execution on native hardware using PMCs provides a much faster (i.e., real-time) platform for evaluation.

PMCs are used for system tuning, workload characterization, profiling, code optimization, architecture validation, and performance evaluation. The PMCs in modern processors can monitor a very large set of events, comparable to the number of events monitored by simulators. However, as listed in Table 1, most contemporary processors have only between two and six physical counters that can monitor an equal number of unique events simultaneously. For instance, an Intel Pentium III processor, which has two PMCs, can monitor two events at a given instant.
Moreover, one of the counted events typically forms the independent variable for analysis purposes (e.g., number of cycles or number of instructions). This implies that only one event of interest is monitored in one run of a benchmark; hence, in order to gather data for n events, the benchmark has to be executed n times. Furthermore, every architecture defines certain events as conflicting, which the PMCs cannot count concurrently.

The Pentium IV Xeon is an exception with respect to the number of physical counters implemented on-chip. It has eighteen physical counters, more than any other contemporary commodity processor that we investigated. This processor is the exception, not the rule. Most contemporary processors, including other Intel architectures, are limited to between two and eight counters. Therefore, our work is still applicable to the majority of processors in use today.

To overcome the limitation of only a few counters being available on a CPU and to enable simultaneous monitoring of a larger number of events, the technique of multiplexing is used [12]. The desired events are scheduled to be monitored by the PMCs in a round-robin fashion. During the period a particular event is monitored, the remaining events are not counted by the counters, but their counts are estimated. We quantified the estimation error of event counts in multiplexed mode and found that 42% of the intervals are estimated with error greater than 10%. This inaccuracy can lead to false conclusions in the analysis and validation of the processor architecture or the program behavior. In order to improve the estimation of event counts, we implement four new estimation algorithms that result in an accuracy improvement of up to 40% and that can be easily incorporated into any PMC interface that supports software multiplexing, such as PAPI [5].

Table 1. Processor PMCs. Columns: processor, #PMCs, #events (1), counter width. Processors listed: Pentium-III [8], R10000 [13], Itanium 2 [9], AMD Athlon64 [3], POWER4 [18], Pentium-IV Xeon [17].

(1) The number of events supported by the processor is an approximation, as interpreted by the authors of this paper.

2 PMC Interface

Several interfaces are available to access the PMCs on different microprocessor families. Processor-specific interfaces include IBM's Performance Inspector [1] for PowerPC processors, Intel's VTune for Intel processors [2], and Rabbit for Intel and AMD processors [7]. Interfaces that provide portable instrumentation on multiple platforms include the Performance Counter Library (PCL) [4] and the Performance Application Programming Interface (PAPI) [5]. PCL and PAPI support performance counters on PowerPC, Alpha, MIPS, Pentium, AMD, and UltraSPARC processors. Both PCL and PAPI support multithreading, and PAPI explicitly supports multiplexing.

For this work, we use PAPI as our interface to the PMCs on a Pentium-III processor. We chose the Pentium-III primarily due to its availability and low measurement impact, because it is a stand-alone system. Although the Pentium III has the fewest counters listed in Table 1, our estimation techniques are applicable in all contexts where the number of countable events is much greater than the number of physical counters. We chose PAPI for its multiplexing support, its widespread use in performance analysis, and the large number of architectures it supports. A number of end-user tools use the PAPI library as their interface to the performance counters [11]. PAPI can monitor any event that is supported by the processor it is running on; a subset of these events is listed in Table 2.

Table 2. Subset of common events
L1, L2, Instruction and Data Caches: Hits, Misses, Accesses, Reads, Writes, Load/Store misses, TLB misses
Instruction Mix: Total instructions executed, Total instructions issued, FP instructions executed, Total branch instructions executed, FP multiply & divide instructions
(Conditional) Branch Prediction: Branch instructions taken/not taken, Branches mispredicted/predicted correctly

2.1 Multiplexed Mode

The multiplexed mode of counting is used to simultaneously monitor a larger number of events than the number of PMCs available on the processor.
The events of interest are specified in an event list and are monitored in a round-robin fashion. For example, consider events A, B, C, and D defined as an event set to be counted by the PMCs in multiplexed mode. Figure 1 shows a possible sequence in which the events may be monitored. All of the events are monitored in an interval, with every event being counted for a fixed time slice. At the end of each time slice, the current event count is read and stored in a file, after which the next event in the event list is monitored (after resetting the counter). This sequence continues throughout event monitoring. Thus, an event A is physically counted only once in the entire interval; the counts corresponding to A are not known while the other events (B, C, and D) are monitored. Since the count of an event over the complete interval is desired, including the time slices when other events are monitored, a value is estimated by the multiplexing software.

Figure 1. Multiplexed event-count estimation.

PAPI implements multiplexing of counters in software, since no contemporary processors (that we are aware of) support hardware multiplexing. The MPX library [12] is the basis of the multiplexing technique in PAPI. The switching of events is triggered by the Unix interval timer ITIMER_PROF, while the SIGPROF signal is used as a trap to the monitored process. After every fixed time duration (10 ms by default), MPX halts the counter, stores the current count, and starts counting the next event. The counts that are not physically counted in an interval are then estimated as discussed in Section 5.1.
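For illustration, the following self-contained C sketch shows the kind of timer-driven round-robin switching described above. It is our own simplified example, not the MPX source; the counter-access helpers program_counter and read_and_reset_counter are placeholders for the platform-specific driver calls (e.g., perfctr).

    /* Minimal sketch of ITIMER_PROF/SIGPROF-driven event switching (illustrative only). */
    #include <signal.h>
    #include <sys/time.h>

    #define NUM_EVENTS 4
    static volatile int current_event = 0;
    static long long counts[NUM_EVENTS];

    /* Placeholder: read and clear the physical counter for the active event. */
    static long long read_and_reset_counter(void) { return 0; }
    /* Placeholder: program the physical counter to count event e. */
    static void program_counter(int e) { (void)e; }

    static void on_sigprof(int sig)
    {
        (void)sig;
        counts[current_event] += read_and_reset_counter(); /* store this time slice */
        current_event = (current_event + 1) % NUM_EVENTS;  /* next event in the list */
        program_counter(current_event);
    }

    int main(void)
    {
        struct itimerval tv = { {0, 10000}, {0, 10000} };  /* 10 ms period */
        signal(SIGPROF, on_sigprof);
        program_counter(current_event);
        setitimer(ITIMER_PROF, &tv, NULL);
        /* ... monitored code would run here ... */
        return 0;
    }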

The Pentium-III has two PMCs and can monitor two events simultaneously [8]. In MPX, one of the two counters is always set to read the total number of cycles (cyc) executed by the instrumented code, whereas the other counter is used to monitor an event of interest.

Table 3 shows the number of multiplexed intervals that occur in the full execution of certain benchmarks. It also shows the error distribution of the estimated multiplexed counts calculated by MPX (discussed further in Section 5.1). Although 58.3% of all the intervals are estimated with an error of less than 10%, 30% of the intervals are estimated with error in the range of 10 to 50%. Moreover, 10.6% of the intervals are estimated with error greater than 50%, which clearly motivates the need for new estimation techniques that produce more accurate data.

Table 3. Number of multiplexed intervals; coefficient of variation (CV) for three execution runs; estimation error distribution per benchmark. Columns: workload, number of intervals, CV, and error distribution of intervals in % (<10%, 10-25%, 25-50%, >50%). Workloads listed: crafty, mcf, parser, twolf, vortex, vpr (integer); ammp, art, equake (floating point); Total.

3 Related Work

The accuracy of performance counters is studied by Korn et al. in [10], which identifies the granularity of the measured code as a major factor contributing to the inaccuracy of performance-counter values. The study of techniques for performance monitoring by Ojha [14] suggests that on-chip counters are the least intrusive of the common performance-analysis techniques. However, we found no previous work that improves the estimation of counts in multiplexed mode.

4 Experimental Setup

This section describes our experimental setup: the software that is used to interface to and monitor the performance counters, the events that we monitor with this software, and the benchmarks we use in the performance analysis.

4.1 PAPI

In our work, we use PAPI, version 2.3.2, to interface to the performance counters on a 1 GHz dual-processor Pentium-III machine running the Red Hat Linux 7.3 operating system. The Linux kernel is patched with perfctr (version 2.4.1), a Linux/x86 performance-monitoring counters driver [15]. PAPI uses this package to gain access to the counters on Linux/x86 platforms. The normal operation of PAPI generates cumulative counts of the events that are selected to be monitored by the PMCs. To obtain the data at the end of every time slice, the PAPI code is compiled in DEBUG mode, which outputs the counter values and the event ID in addition to other information. A timer and a signal handler (similar to the one discussed in Section 2.1) are incorporated for reading the counters in the non-multiplexed mode at regular time slices. The non-multiplexed mode of counting involves monitoring one fixed event A during every time slice, as shown in Figure 2.

4.2 Benchmarks and Event Sets

We chose a subset of benchmarks from the SPEC CPU2000 benchmark suite [16] for the performance analysis of the new estimation techniques; we use the reference input size in all experiments. We use a subset of events to study the accuracy of the estimation techniques used in multiplexing. The Pentium-III (P6 architecture) has two performance counters that can be configured to count more than 80 events [8]. The events that we monitor and use in this work are randomly chosen and are listed in Table 4. All of the benchmark source codes are hand-instrumented with PAPI calls.
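For reference, a minimal sketch of such instrumentation using the present-day PAPI C API is shown below. This is our own illustration rather than the paper's instrumentation code; the PAPI 2.3.2 interface used in the paper differs in some details, and the per-time-slice timer and signal handler described in Section 4.1 are omitted here.

    /* Multiplexed counting of the Table 4 events with the modern PAPI C API (illustrative). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(void)
    {
        int events[] = { PAPI_L1_DCM, PAPI_L1_DCA, PAPI_TOT_INS,
                         PAPI_L1_LDM, PAPI_L1_STM, PAPI_BR_TKN };
        int nev = sizeof(events) / sizeof(events[0]);
        long long counts[6];
        int eventset = PAPI_NULL;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
        PAPI_multiplex_init();                       /* enable software multiplexing */
        PAPI_create_eventset(&eventset);
        PAPI_assign_eventset_component(eventset, 0); /* bind the event set to the CPU component */
        PAPI_set_multiplex(eventset);                /* run this event set in multiplexed mode */
        for (int i = 0; i < nev; i++)
            PAPI_add_event(eventset, events[i]);

        PAPI_start(eventset);
        /* --- benchmark code executes here --- */
        PAPI_stop(eventset, counts);                 /* estimated full-run counts */

        for (int i = 0; i < nev; i++)
            printf("event %d: %lld\n", i, counts[i]);
        return 0;
    }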
Pseudo code for collecting the event counts in multiplexed or non-multiplexed mode is shown below:

    int main(void) {
        /* Benchmark variables defined */
        /* Define PAPI variables */
        /* Set the timers for sampling in multiplexed / non-multiplexed mode */
        /* Enable the multiplex feature if counters are to be run in multiplexed mode */
        /* Create the event set */
        /* Start the counters */
        /* --- Benchmark code executes --- */
        /* Stop the counters */
        return 0;
    }

After declaring the variables of the benchmark, the PAPI library is initialized, followed by setting the timers for the required mode. The default counting mode is non-multiplexed; multiplexing is enabled if desired. The event set is created and the counters are started, after which the benchmark code executes in its normal sequence. The counters are read at regular time slices of 10 ms duration, the smallest time slice defined by the Unix interval timer ITIMER_PROF, and are stopped just before the completion of the benchmark.

Table 4. Multiplexed events used. Columns: event (acronym), PAPI preset used in the instrumentation code, and value (the mapped event number for the Pentium III). Events: L1 Data Cache Misses (PAPI_L1_DCM), L1 Data Cache Accesses (PAPI_L1_DCA), Instructions Committed (PAPI_TOT_INS), Load Misses (PAPI_L1_LDM), Store Misses (PAPI_L1_STM), Branches Taken (PAPI_BR_TKN), Total Cycles (cyc, PAPI_TOT_CYC).

5 Methodology and Algorithms

The benchmarks are instrumented with the multiplexed and non-multiplexed code (as discussed in Section 4.2) for monitoring the different events listed in Table 4. We execute an individual benchmark in each mode three times to reduce the error due to variability of data collection across different executions of the same benchmark. Our goal is to obtain the minimum absolute error between the multiplexed and the non-multiplexed counts in every interval. The non-multiplexed counts reflect the actual, or accurate, count of an event, since the event is monitored continuously throughout the benchmark execution, which is not the case for multiplexed counts (Section 2.1). The steps to calculate the statistics for analyzing the estimation techniques are as follows:

1. Benchmarks are executed three times each.
2. A non-multiplexed data vector consists of six counts of a specific event for each equivalent multiplexed interval. (Six reflects the number of events in the defined event set; it remains the same for a particular experiment but changes if more events are multiplexed.) The sum of these six counts is the non-multiplexed event count count^{nm}_i.
3. Using an algorithm described in the following sections, the multiplexed event count count^{m}_i is estimated for the intervals during which it is not physically counted.
4. The estimation error, count^{m}_i - count^{nm}_i, for every interval is calculated.

Figure 2. Collection of non-multiplexed and multiplexed counters at regular time slices.

Figure 2 illustrates the manner in which data is obtained from the counters in non-multiplexed and multiplexed modes at regular time slices. cyc_{i(k)} is the total number of cycles elapsed since the beginning of code instrumentation; i is the interval in which all of the multiplexed events are each physically monitored once by the PMC; k is the time slice during which a counter is accumulating event occurrences (after resetting the counter at the end of time slice k-1). Therefore, if n events are multiplexed, then k can take values from 0 to n. For simplicity, we assume cyc_{i(n)} to be equivalent to cyc_{(i+1)(0)}. Figure 3 shows the rate plot of a set of events measured in multiplexed mode.

Figure 3. Conversion of event counts to rates.

Some important variables (at the i-th interval, k-th time slice) that we use are listed below:

    rate^{m}_{i} = \frac{count^{m}_{i(n)}}{cyc_{i(n)} - cyc_{i(n-1)}}    (1)

    rate^{nm}_{i(k)} = \frac{count^{nm}_{i(k)}}{cyc_{i(k)} - cyc_{i(k-1)}}, \quad 0 \le k \le n    (2)

where

    count^{nm}_{i} = \sum_{k=1}^{n} count^{nm}_{i(k)}    (3)

    slope_{i} = \frac{rate^{m}_{i} - rate^{m}_{i-1}}{cyc_{i(n)} - cyc_{(i-1)(n)}}    (4)

n = number of events being multiplexed
k = the time slice during which an event count is sampled
rate^{m}_{i} and rate^{nm}_{i(k)} = rate of occurrence of an event in multiplexed and non-multiplexed mode, respectively
count^{m}_{i(n)} = number of times an event has occurred in the i-th interval, in the time slice between cyc_{i(n-1)} and cyc_{i(n)}
count^{nm}_{i(k)} = number of times an event has occurred in the k-th time slice of the i-th interval (that is, the period between cyc_{i(k-1)} and cyc_{i(k)})
slope_{i} = slope of the rate between the (i-1)-th and i-th intervals

We discuss the estimation algorithms in the following sections.

5.1 Base Algorithm

The estimation algorithm used in PAPI (henceforth called the base algorithm) is developed and implemented in [12]. It is used to estimate the counts of the multiplexed event in each interval. Consider the case shown in Figures 2 and 3. We discuss the base algorithm as it is used to estimate the count of event A in the i-th interval. Event A is monitored in time slice k = 4 (between cycles cyc_{i(3)} and cyc_{i(4)}). If count^{m}_{i(4)} is the number of occurrences of event A in this time slice, then the rate of event A can be calculated using Eqn. 1 as:

    rate^{m}_{i} = \frac{count^{m}_{i(4)}}{cyc_{i(4)} - cyc_{i(3)}}    (5)

Figure 4. Base algorithm event-count calculation.

Figure 4 shows the plot of rate vs. total cycles for event A alone. The rates of event A, rate^{m}_{i} and rate^{m}_{i-1}, corresponding to intervals i and i-1 respectively, can be calculated using Eqn. 5. In the base algorithm, rate^{m}_{i} is assumed to be constant for the entire i-th interval, and the count of event A is estimated by the following equation:

    count^{m}_{i} \approx rate^{m}_{i} \cdot (cyc_{i(n)} - cyc_{(i-1)(n)})    (6)

Recall that rate^{m}_{i} is calculated using the data corresponding to the period between i(n-1) and i(n) (that is, one time slice), whereas the count between (i-1)(n) and i(n) (that is, one interval) is being estimated.

5.2 Trapezoid-Area Method (TAM)

Figure 5 shows the plot of rate vs. total cycles for an event A that is being multiplexed. The rates of event A, rate^{m}_{i} and rate^{m}_{i-1}, corresponding to intervals i and i-1 respectively, are calculated using Eqn. 5. In the trapezoid-area method, the rate of occurrence of event A is assumed to change linearly within an interval. Thus, the estimated count of the multiplexed event A in the i-th interval is given by the area under the trapezoid PQRS (Figure 5). Mathematically,

    count^{m}_{i} \approx 0.5 \cdot (rate^{m}_{i} + rate^{m}_{i-1}) \cdot (cyc_{i(n)} - cyc_{(i-1)(n)})    (7)

Figure 5. TAM event-count calculation.
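As a concrete illustration (our own sketch, not code from the paper), the two estimates follow directly from Eqns. 6 and 7; the function names estimate_base and estimate_tam are ours.

    /* Illustrative only: estimate an event's count for interval i from the
     * rates measured in its single physical time slice. */

    /* Base algorithm: assume the slice rate holds for the whole interval. */
    double estimate_base(double rate_i, double cyc_in, double cyc_prev_n)
    {
        return rate_i * (cyc_in - cyc_prev_n);                          /* Eqn. (6) */
    }

    /* Trapezoid-area method: assume the rate changes linearly from the
     * previous interval's rate to this interval's rate. */
    double estimate_tam(double rate_i, double rate_prev,
                        double cyc_in, double cyc_prev_n)
    {
        return 0.5 * (rate_i + rate_prev) * (cyc_in - cyc_prev_n);      /* Eqn. (7) */
    }

Here cyc_in stands for cyc_{i(n)} and cyc_prev_n for cyc_{(i-1)(n)}.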

5.3 Divided-Interval Rectangular Area (DIRA)

Figure 6 shows the plot of rate vs. total cycles for an event A that is being multiplexed. The rates of event A, rate^{m}_{i} and rate^{m}_{i-1}, corresponding to intervals i and i-1 respectively, are calculated using Eqn. 5. The algorithm is explained in the following steps:

Figure 6. DIRA event-count calculation.

1. The i-th interval (where the count is being estimated) is divided into j equal parts (Figure 6). The rate at the k-th division is calculated using linear interpolation as follows:

    rate^{m}_{i(k)} = slope_{i} \cdot (cyc_{i(k)} - cyc_{(i-1)(n)}) + rate^{m}_{i-1}    (8)

where slope_{i} is given by Eqn. 4.

2. The area corresponding to the k-th division is calculated by assuming the rate to remain constant between cycles cyc_{i(k-1)} and cyc_{i(k)}. Thus, the area under rectangle PQRS in Figure 6 is given by:

    count^{m}_{i(k)} \approx rate^{m}_{i(k)} \cdot (cyc_{i(k)} - cyc_{i(k-1)})    (9)

where the value of rate^{m}_{i(k)} is obtained from Eqn. 8.

3. Repeat steps 1 and 2 for 1 \le k \le j.

4. The estimated count of the multiplexed event A in the i-th interval is given by:

    count^{m}_{i} = \sum_{k=1}^{j} count^{m}_{i(k)}    (10)

In our case, j = n.
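The following C sketch (again our own illustration, with a hypothetical estimate_dira helper) follows the four steps above, given the cycle stamps of the j sub-division boundaries.

    /* Illustrative DIRA sketch: interpolate the rate at each of j sub-divisions
     * of interval i (Eqn. 8), take each sub-division as a rectangle (Eqn. 9),
     * and sum the areas (Eqn. 10). */
    double estimate_dira(double rate_i, double rate_prev, const double *cyc, int j)
    {
        /* cyc[0..j]: cycle stamps of the sub-division boundaries of interval i;
         * cyc[0] is cyc_{(i-1)(n)} and cyc[j] is cyc_{i(n)}. */
        double slope = (rate_i - rate_prev) / (cyc[j] - cyc[0]);        /* Eqn. (4) */
        double total = 0.0;
        for (int k = 1; k <= j; k++) {
            double rate_k = slope * (cyc[k] - cyc[0]) + rate_prev;      /* Eqn. (8) */
            total += rate_k * (cyc[k] - cyc[k - 1]);                    /* Eqn. (9) */
        }
        return total;                                                   /* Eqn. (10) */
    }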

5.4 Positional Mean Error (PME)

The Positional Mean Error (PME) algorithm is a two-phase algorithm. Phase 1 calculates the rate corrections, or positional mean errors, and Phase 2 uses the PMEs to correct the multiplexed rates and estimate the event count. Note that this algorithm is most useful in studies where a program must be executed multiple times, such as program performance analysis and code optimization.

Figure 7. PME event-count calculation.

The following steps comprise Phase 1:

1. The rate of event A in multiplexed mode at the k-th position, rate^{m}_{i(k)}, is calculated using linear interpolation:

    rate^{m}_{i(k)} = slope_{i} \cdot (cyc_{i(k)} - cyc_{(i-1)(n)}) + rate^{m}_{i-1}    (11)

where slope_{i} is given by Eqn. 4.

2. The difference between the rate of event A in non-multiplexed mode, rate^{nm}_{i(k)}, and the rate calculated in Step 1 is:

    e_{k} = rate^{nm}_{i(k)} - rate^{m}_{i(k)}    (12)

where rate^{nm}_{i(k)} is given by Eqn. 2. This difference is the positional error for position k in the i-th interval and is calculated for 1 \le k \le n and for every i.

3. The PME is then given by:

    pme_{k} = \frac{1}{i_{total}} \sum_{i} e_{k}    (13)

where pme_{k} is the positional mean error for the k-th position and i_{total} is the total number of intervals.

Phase 1 produces n PMEs that are used in Phase 2 to estimate the event counts. Phase 2 includes the following steps:

1. Same as Step 1 of Phase 1.

2. Calculate the corrected rate at the k-th position:

    c\_rate^{m}_{i(k)} = rate^{m}_{i(k)} + pme_{k}    (14)

3. Assuming a linear rate between corrected positional rates, the count in the k-th slice division is now estimated using the trapezoid-area method discussed in Section 5.2:

    count^{m}_{i(k)} \approx 0.5 \cdot (c\_rate^{m}_{i(k)} + c\_rate^{m}_{i(k-1)}) \cdot (cyc_{i(k)} - cyc_{i(k-1)})    (15)

4. The estimated count in the i-th interval is then given by:

    count^{m}_{i} = \sum_{k=1}^{n} count^{m}_{i(k)}    (16)

5.5 Multiple Linear Regression (MLR) Model

The Multiple Linear Regression (MLR) model allows prediction of a response variable as a function of predictor variables using a linear model. In vector notation, it is given by:

    y = Xb    (17)

where y is the column vector of non-multiplexed counts aggregated over the respective multiplexed intervals, X is a matrix in which each column element is a multiplexed sub-interval area as shown in Figure 8 and each row corresponds to a particular interval, and b contains the predictor parameters.

Figure 8. MLR event-count calculation.

Hence, the multiplexed sub-interval areas are represented as a linear model of the actual count (the non-multiplexed count). The predictor parameter estimate is given by:

    b = (X^{T}X)^{-1}(X^{T}y)    (18)

The estimated parameters are then used to scale the trapezoid areas in an interval, and the sum of the scaled areas is taken as the estimated multiplexed count of that interval. Mathematically,

    scaled\_x_{k} = b[k] \cdot x_{k}    (19)

    count^{m}_{i} = \sum_{k=1}^{j} scaled\_x_{k}    (20)

For our study, the sample size is 0.5 of the population size. This algorithm, like PME, is most useful in studies such as program performance analysis and code optimization.
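To make the fitting step concrete, the sketch below (our own illustration, not the paper's implementation; mlr_fit, mlr_estimate, solve, and the MAXJ bound are ours) forms the normal equations of Eqn. 18 from a set of training intervals and then applies Eqns. 19 and 20 to estimate an interval's count.

    /* Illustrative MLR sketch: fit b = (X^T X)^{-1} X^T y from training
     * intervals, then estimate a new interval's count as the b-weighted
     * sum of its sub-interval trapezoid areas. */
    #include <math.h>

    #define MAXJ 16   /* arbitrary upper bound on j for this sketch */

    /* Solve the j x j system A*b = c by Gaussian elimination with partial
     * pivoting (adequate for small j; singular systems are not handled). */
    static void solve(double A[MAXJ][MAXJ], double c[MAXJ], double b[MAXJ], int j)
    {
        for (int p = 0; p < j; p++) {
            int best = p;
            for (int r = p + 1; r < j; r++)
                if (fabs(A[r][p]) > fabs(A[best][p])) best = r;
            for (int k = 0; k < j; k++) { double t = A[p][k]; A[p][k] = A[best][k]; A[best][k] = t; }
            { double t = c[p]; c[p] = c[best]; c[best] = t; }
            for (int r = p + 1; r < j; r++) {
                double f = A[r][p] / A[p][p];
                for (int k = p; k < j; k++) A[r][k] -= f * A[p][k];
                c[r] -= f * c[p];
            }
        }
        for (int p = j - 1; p >= 0; p--) {
            b[p] = c[p];
            for (int k = p + 1; k < j; k++) b[p] -= A[p][k] * b[k];
            b[p] /= A[p][p];
        }
    }

    /* Fit the predictor parameters b from n_train training intervals:
     * X[i][k] is the k-th sub-interval trapezoid area of interval i,
     * y[i] is the matching non-multiplexed count. */
    void mlr_fit(int n_train, int j, double X[][MAXJ], const double *y, double b[MAXJ])
    {
        double XtX[MAXJ][MAXJ] = {{0}}, Xty[MAXJ] = {0};
        for (int i = 0; i < n_train; i++)
            for (int r = 0; r < j; r++) {
                Xty[r] += X[i][r] * y[i];                  /* X^T y */
                for (int k = 0; k < j; k++)
                    XtX[r][k] += X[i][r] * X[i][k];        /* X^T X */
            }
        solve(XtX, Xty, b, j);                             /* Eqn. (18) */
    }

    /* Estimate one interval's count from its sub-interval areas x[0..j-1]. */
    double mlr_estimate(const double *x, const double b[MAXJ], int j)
    {
        double total = 0.0;
        for (int k = 0; k < j; k++)
            total += b[k] * x[k];                          /* Eqns. (19)-(20) */
        return total;
    }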

6 Results

Figure 9 shows the results of applying the algorithms described in Sections 5.1 through 5.5. It presents the accuracy of each algorithm for each event in terms of percentage improvement. The improvement of an algorithm is computed by comparing its estimated multiplexed counts to the non-multiplexed counts of the same event: the total absolute error, \sum_{k} |count^{m}_{k} - count^{nm}_{k}|, is computed for each algorithm and compared with that of the base algorithm.

Figure 9. Improvement is calculated by comparing the total absolute error of estimation by each algorithm to that of the base algorithm (% improvement over base, plotted per event for the integer and floating-point workloads).

For all of the benchmarks, the new estimation algorithms result in decreased error compared to the base algorithm for each event. For the benchmark crafty, the improvement varies between 5-40% over the set of events, as shown in Figure 9. For data cache misses and load misses, the improvement realized by PME and MLR is approximately 40%. TAM and DIRA estimated the store misses and branches taken with approximately 15% greater accuracy. Similar improvements are observed for the floating-point benchmarks: the error reduction varies between 7-40% for all of the floating-point benchmarks across the six events. All of the algorithms perform best on art, with improvement of more than 30% for five out of six events.

Table 5 summarizes the characteristics of the algorithms; Table 6 summarizes the average estimation improvement realized for each event across all algorithms and benchmarks. The average computed is the geometric mean, and all averages exclude the negative-valued outliers. The maximum estimation improvement for all of the algorithms is at least 32%; the average estimation improvement for TAM is 11.6% and for DIRA 12.0%. PME and MLR provide estimation improvements between 2 and 40% (with a single outlier of -35% for store misses in ammp for MLR), with average improvements of 13.5% and 14.1%, respectively. However, PME and MLR require correction-parameter and predictor-variable libraries, respectively, that are in turn used to estimate the multiplexed counts. These libraries are benchmark- and event-specific and require non-multiplexed counts for their computation. This also implies that PME and MLR can be implemented in real time, along with the program execution, only if a pre-calculated library is available. On the other hand, TAM and DIRA are more generic in nature: they can be applied to any benchmark for any desired event without the need for a library. PME and MLR perform better for crafty, twolf, and equake, whereas TAM and DIRA perform better for parser and mcf; in the case of art, all of the algorithms perform equally well.
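For reference, one straightforward reading of this comparison metric, assuming the reported improvement is the relative reduction in total absolute error with respect to the base algorithm, is sketched below (our own illustration).

    /* Illustrative sketch of the comparison metric: total absolute error over
     * all intervals, and improvement relative to the base algorithm. */
    #include <math.h>

    double total_abs_error(const double *est, const double *nonmux, int n_intervals)
    {
        double e = 0.0;
        for (int k = 0; k < n_intervals; k++)
            e += fabs(est[k] - nonmux[k]);   /* sum_k |count_m_k - count_nm_k| */
        return e;
    }

    /* Assumed definition: percentage reduction in total absolute error vs. base. */
    double pct_improvement(double err_alg, double err_base)
    {
        return 100.0 * (err_base - err_alg) / err_base;
    }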

9 need for a library. PME and MLR perform better for crafty, Factor /Algorithm Base TAM DIRA PME MLR Requires a pre-calculated library? No No No Yes Yes Needs non-multiplexed counts for its No No No Yes Yes implementation? Real time No/ No/ implementable? Yes Yes Yes Yes Yes Computation Very Very Intensity low low Low High High Table 5. Algorithm characteristics twolf, and equake whereas TAM and DIRA perform better for parser and mcf. In the case of art, all of the algorithms perform equally well. Event /Algorithm TAM DIRA PME MLR Range of improvement 2.6/ 3.1/ 2.1/ 2.6/ (% min/max) Avg % Improvement Overall Table 6. Average % improvement of each event for all algorithms and benchmarks 6.1 Sensitivity The sensitivity of the algorithms to the number of multiplexed events is studied here. We multiplex six, ten, and fourteen events. When monitoring a higher number of events, the event set is chosen as a super set of the event set when monitoring a smaller number of events. Thus, the six events listed in Table 2 are monitored in all of the experiments. We expect to see an increase in the estimation errors as the number of events is increased. This is because when a larger number of events is monitored, the frequency of the physical monitoring of an event is smaller and therefore, the estimation is carried out over a larger duration of time. The sensitivity study is performed using the mcf (integer) and art (floating point) benchmarks. These applications are chosen since all of the algorithms performed consistently on them (Figure 9). Three experiments, namely, mult6, mult10, and mult14 are performed in which six, ten and fourteen events are multiplexed, respectively. The sensitivity of the algorithms on the six events that are monitored in the three experiments for the mcf and art benchmarks are shown in Figure 10. The performance of the algorithms is evaluated as discussed in Section 6. Even with the increase in the number of multiplexed events, all of our algorithms estimate the multiplexed counts with more accuracy when compared to the base algorithm. This can be seen in Figure 10 where the dotted lines indicating the implemented algorithms are always below the solid line indicating the base algorithm. The gap between the base and any algorithm shows the extent of accuracy improvement; the larger the gap, the greater the improvement. The hypothesis that estimation errors increase with an increase in the number of multiplexed events seems to be incorrect. For tance, the plot for mcf and art shows that the estimation error decreases with increase in multiplexed events. The and plots show that the error increases from mult6 to mult10, but it decreases from mult10 to mult14. The reduced error may be attributed to the averaging problem, wherein the positive and negative counts, around a mean value, cancel each other. If the interval is large, the chances of cancellation increase and the estimation error reduces. Nevertheless, there exists a tradeoff between the number of multiplexed events, the accuracy of the estimated multiplexed counts, and the number of intervals for which the data is collected. The accuracy of the multiplexed counts depends on the behavior of the monitored event in the window where it is estimated. 7 Conclusions and Future Work The algorithms we introduce reduce the estimation error for all of the multiplexed events for each of the benchmarks. Improvement of up to 40% is achieved by the PME and MLR algorithms and up to 32% for TAM and DIRA. 
Utilizing any of these techniques will greatly reduce the estimation errors of the multiplexed counts. PME and MLR require a pre-calculated library of correction parameters (corresponding to the event-benchmark pair) for their implementation, whereas TAM and DIRA are more generic and independent of any event or benchmark. Since the interval size is defined by time (10 ms in our case), the event counts cannot be collected at a specified cycle. Therefore, it is difficult to collect cycle-synchronized performance metrics that can provide a complete snapshot of cycle-by-cycle program behavior. We plan to address this in the future by incorporating an algorithm that we have developed for cycle synchronization into the techniques discussed in this paper. Because PME and MLR require pre-calculated, benchmark-specific libraries, their usefulness is limited to specific types of studies; therefore, we are investigating techniques to calculate these libraries independently of the benchmark. We are also examining the possibility of collecting benchmark event data using the test input and applying it to reference-input executions for each algorithm. Finally, we are actively applying our estimation techniques to PMCs on contemporary processors such as the Intel Pentium IV and AMD Opteron.

8 Acknowledgments

This work was supported by the National Science Foundation ADVANCE Institutional Transformation Program at NMSU.

References

[1] Performance Inspector. ibm.com/developerworks/oss/pi/.
[2] VTune profiling software. Intel Corporation.
[3] AMD. BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD Opteron Processors.
[4] R. Berrendorf, H. Ziegler, and B. Mohr. PCL - the Performance Counter Library: A common interface to access hardware performance counters on microprocessors. Research Centre Juelich GmbH, version 2.1.
[5] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci. A portable programming interface for performance evaluation on modern processors. The International Journal of High Performance Computing Applications, 14(3), Fall 2000.
[6] J. Gibson, R. Kunz, D. Ofelt, and M. Heinrich. FLASH vs. (simulated) FLASH: Closing the simulation loop. In Architectural Support for Programming Languages and Operating Systems, pages 49-58, 2000.
[7] D. Heller. Rabbit: A performance counters library for Intel/AMD processors and Linux.
[8] Intel. IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide.
[9] Intel. Intel Itanium 2 Processor Reference Manual for Software Development and Optimization.
[10] W. Korn, P. Teller, and G. Castillo. Just how accurate are performance counters? In 20th IEEE International Performance, Computing, and Communications Conference, Phoenix, Arizona, April 2001.
[11] K. London, J. Dongarra, S. Moore, P. Mucci, K. Seymour, and T. Spencer. End-user tools for application performance analysis using hardware counters. In International Conference on Parallel and Distributed Computing Systems, Dallas, TX, August 2001.
[12] J. M. May. MPX: Software for multiplexing hardware performance counters in multithreaded programs. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, April 2001.
[13] MIPS Technologies. R10000 Microprocessor User's Manual, Version 1.0. MIPS Technologies Inc., Mountain View, CA.
[14] A. K. Ojha. Techniques in least-intrusive computer system performance monitoring. In SoutheastCon Proceedings, IEEE.
[15] M. Pettersson. Linux x86 performance-monitoring counters driver. mikpe/linux/perfctr/.
[16] Standard Performance Evaluation Corporation (SPEC).
[17] B. Sprunt. Pentium 4 performance-monitoring features. IEEE Micro, 22(4):72-82, 2002.
[18] S. Vetter. The POWER4 Processor Introduction and Tuning Guide. IBM Corporation, International Technical Support Organization, first edition, November 2001.


More information

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture

More information

Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform

Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Michael Andrews and Jeremy Johnson Department of Computer Science, Drexel University, Philadelphia, PA USA Abstract.

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Agenda Introduction Memory Hierarchy Design CPU Speed vs.

More information

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Jih-Kwon Peir, Shih-Chang Lai, Shih-Lien Lu, Jared Stark, Konrad Lai peir@cise.ufl.edu Computer & Information Science and Engineering

More information

Computer Organization and Design THE HARDWARE/SOFTWARE INTERFACE

Computer Organization and Design THE HARDWARE/SOFTWARE INTERFACE T H I R D E D I T I O N R E V I S E D Computer Organization and Design THE HARDWARE/SOFTWARE INTERFACE Contents v Contents Preface C H A P T E R S Computer Abstractions and Technology 2 1.1 Introduction

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors *

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * Hsin-Ta Chiao and Shyan-Ming Yuan Department of Computer and Information Science National Chiao Tung University

More information

Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement

Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement Adil Gheewala*, Jih-Kwon Peir*, Yen-Kuang Chen**, Konrad Lai** *Department of CISE, University of Florida,

More information

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory

More information

Performance Optimization: Simulation and Real Measurement

Performance Optimization: Simulation and Real Measurement Performance Optimization: Simulation and Real Measurement KDE Developer Conference, Introduction Agenda Performance Analysis Profiling Tools: Examples & Demo KCachegrind: Visualizing Results What s to

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 16 MARKS CS 2354 ADVANCE COMPUTER ARCHITECTURE 1. Explain the concepts and challenges of Instruction-Level Parallelism. Define

More information

Quantitative Evaluation of Intel PEBS Overhead for Online System-Noise Analysis

Quantitative Evaluation of Intel PEBS Overhead for Online System-Noise Analysis Quantitative Evaluation of Intel PEBS Overhead for Online System-Noise Analysis June 27, 2017, ROSS @ Washington, DC Soramichi Akiyama, Takahiro Hirofuchi National Institute of Advanced Industrial Science

More information

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Loop Selection for Thread-Level Speculation, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Chip Multiprocessors (CMPs)

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture The Computer Revolution Progress in computer technology Underpinned by Moore s Law Makes novel applications

More information

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources This Unit: Putting It All Together CIS 501 Computer Architecture Unit 12: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital Circuits

More information

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?) Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static

More information

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical

More information

Using Software Transactional Memory In Interrupt-Driven Systems

Using Software Transactional Memory In Interrupt-Driven Systems Using Software Transactional Memory In Interrupt-Driven Systems Department of Mathematics, Statistics, and Computer Science Marquette University Thesis Defense Introduction Thesis Statement Software transactional

More information

Base Vectors: A Potential Technique for Micro-architectural Classification of Applications

Base Vectors: A Potential Technique for Micro-architectural Classification of Applications Base Vectors: A Potential Technique for Micro-architectural Classification of Applications Dan Doucette School of Computing Science Simon Fraser University Email: ddoucett@cs.sfu.ca Alexandra Fedorova

More information

PowerEdge 3250 Features and Performance Report

PowerEdge 3250 Features and Performance Report Performance Brief Jan 2004 Revision 3.2 Executive Summary Dell s Itanium Processor Strategy and 1 Product Line 2 Transitioning to the Itanium Architecture 3 Benefits of the Itanium processor Family PowerEdge

More information

c 2004 by Ritu Gupta. All rights reserved.

c 2004 by Ritu Gupta. All rights reserved. c by Ritu Gupta. All rights reserved. JOINT PROCESSOR-MEMORY ADAPTATION FOR ENERGY FOR GENERAL-PURPOSE APPLICATIONS BY RITU GUPTA B.Tech, Indian Institute of Technology, Bombay, THESIS Submitted in partial

More information

Analyzing and Improving Clustering Based Sampling for Microprocessor Simulation

Analyzing and Improving Clustering Based Sampling for Microprocessor Simulation Analyzing and Improving Clustering Based Sampling for Microprocessor Simulation Yue Luo, Ajay Joshi, Aashish Phansalkar, Lizy John, and Joydeep Ghosh Department of Electrical and Computer Engineering University

More information

EE382M 15: Assignment 2

EE382M 15: Assignment 2 EE382M 15: Assignment 2 Professor: Lizy K. John TA: Jee Ho Ryoo Department of Electrical and Computer Engineering University of Texas, Austin Due: 11:59PM September 28, 2014 1. Introduction The goal of

More information

Method-Level Phase Behavior in Java Workloads

Method-Level Phase Behavior in Java Workloads Method-Level Phase Behavior in Java Workloads Andy Georges, Dries Buytaert, Lieven Eeckhout and Koen De Bosschere Ghent University Presented by Bruno Dufour dufour@cs.rutgers.edu Rutgers University DCS

More information

COL862 Programming Assignment-1

COL862 Programming Assignment-1 Submitted By: Rajesh Kedia (214CSZ8383) COL862 Programming Assignment-1 Objective: Understand the power and energy behavior of various benchmarks on different types of x86 based systems. We explore a laptop,

More information