Improved Estimation for Software Multiplexing of Performance Counters
Wiplove Mathur, Texas Instruments, Inc., San Diego, CA
Jeanine Cook, New Mexico State University, Las Cruces, NM

Abstract

On-chip performance counters are gaining popularity as an analysis and validation tool. Most contemporary processors have between two and six physical counters that can monitor an equal number of unique events simultaneously at fixed sampling periods. Through multiplexing and estimation, an even greater number of unique events can be monitored in a single program execution. When a program is sampled in multiplexed mode using round-robin scheduling of a specified event set, the number of events that are physically counted during each sampling period is limited by the number of counters that can be simultaneously accessed. During this period, the remaining events of the multiplexed event set are not monitored, but their counts are estimated. Our work quantifies the estimation error of the event counts in the multiplexed mode, which indicates that as many as 42% of sampled intervals are estimated with error greater than 10%. We propose new estimation algorithms that result in an accuracy improvement of up to 40%.

1 Introduction

Performance counters or Performance Monitoring Counters (PMCs) are built-in hardware counters that are fabricated in the CPU chip. They can be programmed using event-select registers to count a specified event from a set of events such as L1 data cache accesses, load misses, and branches taken. These performance counters enable accurate, minimally intrusive monitoring of application performance [14]. Moreover, the statistics are collected in real time and on the hardware platform that is under test, therefore providing a high degree of confidence in the results. Simulators are widely used to gather cycle-by-cycle performance data of a large set of metrics in a single execution of a program on a simulated micro-architecture.
With the data of various events available at the same cycle, an accurate view of the cycle-by-cycle processor state can be studied. However, when performing user-level microarchitecture simulation, system-level details such as interfaces to buses, interrupt controllers, disks, and video memory are not taken into consideration. Additionally, program behavior is affected by external factors such as the operating system and TLB effects [6], which suggests that the performance data generated through user-level simulation may not be completely accurate unless the simulator does full-system simulation. However, full-system simulation is extremely slow. These simulation shortcomings are even more prevalent in the context of multiprocessors, where the availability of accurate simulators is limited and the speed perturbation is much larger. In contrast to simulators, monitoring a workload execution on native hardware using PMCs provides a much faster (i.e., real-time) platform for evaluation. PMCs are used for system tuning, workload characterization, profiling, code optimization, architecture validation, and performance evaluation. The PMCs in modern processors can monitor a very large set of events that is comparable to the number of events monitored by simulators. However, as listed in Table 1, most contemporary processors have only between two and six physical counters that can monitor an equal number of unique events simultaneously. For instance, an Intel Pentium-III processor, which has two PMCs, can monitor two events at a given instant. Moreover, one of the events counted forms the independent variable for analysis purposes (e.g., number of cycles or number of instructions). This implies that only one event is monitored in one run of a benchmark. Hence, in order to gather data for n events, the benchmark has to be executed n times. Furthermore, every architecture has certain events that are defined as conflicting, which the PMCs cannot count concurrently.
The Pentium IV Xeon is an exception with respect to the number of physical counters implemented on-chip. It has eighteen physical counters, which is more than any other contemporary commodity processor that we investigated. This processor is the exception, not the rule. Most contemporary processors, including other Intel architectures, are limited to between two and eight counters. Therefore, our work is still applicable to the majority of processors in use today. To overcome the limitation of only a few counters being available on a CPU and to enable simultaneous monitoring of a larger number of events, the technique of multiplexing is used [12]. The desired events are scheduled to be monitored by the PMCs in a round-robin fashion. During the period a particular event is monitored, the remaining events are not counted by the counters, but their counts are estimated. We quantified the estimation error of event counts in the multiplexed mode and found that 42% of the intervals are estimated with error greater than 10%. This inaccuracy can lead to false conclusions in the analysis and validation of the processor architecture or the program behavior. In order to improve the estimation of event counts, we implement four new estimation algorithms that result in an accuracy improvement of up to 40% and that can be easily incorporated into any PMC interface that supports software multiplexing, such as PAPI [5].

Processor              #PMCs   #Events*   Counter width
Pentium-III [8]        2       ...        ...-bit
R10000 [13]            2       ...        ...-bit
Itanium 2 [9]          4       ...        ...-bit
AMD Athlon64 [3]       4       ...        ...-bit
POWER4 [18]            8       ...        n/a
Pentium-IV Xeon [17]   18      ...        ...-bit

Table 1. Processor PMCs

2 PMC Interface

Several interfaces are available to access the PMCs on different microprocessor families. Some processor-specific interfaces include IBM's Performance Inspector [1] for PowerPC processors, Intel's VTune for Intel processors [2], and Rabbit for Intel and AMD processors [7]. Additionally, interfaces that provide portable instrumentation on multiple platforms include the Performance Counter Library (PCL) [4] and the Performance Application Programming Interface (PAPI) [5]. PCL and PAPI support performance counters on PowerPC, Alpha, MIPS, Pentium, AMD, and UltraSPARC processors. Both PCL and PAPI support multithreading, and PAPI explicitly supports multiplexing.
For this work, we use PAPI as our interface to the PMCs on a Pentium-III processor. We chose the Pentium-III primarily due to its availability and low impact because it is a stand-alone system. Although the Pentium-III has the fewest counters as listed in Table 1, our estimation techniques are applicable in all contexts where the number of countable events is much greater than the number of physical counters. We chose PAPI for its multiplexing support, its widespread use in performance analysis, and the large number of architectures it supports. A number of end-user tools use the PAPI library as their interface to the performance counters [11]. PAPI can monitor any event that is supported by the processor it is running on; a subset of these events is listed in Table 2.

* The number of events supported by each processor is approximate, as interpreted by the authors of this paper.

Category                      Events
L1, L2, Instruction and       Hits, Misses, Accesses, Reads, Writes,
Data Caches                   Load/Store misses, TLB misses
Instruction Mix               Total instructions executed, Total instructions
                              issued, FP instructions executed, Total branch
                              instructions executed, FP mult & div instructions
(Conditional) Branch          Branch instructions taken/not taken,
Prediction                    Branches mispredicted/predicted correctly

Table 2. Subset of common events

2.1 Multiplexed Mode

The multiplexed mode of counting is used to simultaneously monitor a larger number of events than the number of PMCs available on the processor. The events of interest are specified in an event list and are monitored in a round-robin fashion. For example, consider events A, B, C, and D defined as an event set to be counted by the PMCs in the multiplexed mode. Figure 1 shows a possible sequence in which the events may be monitored. All of the events are monitored in an interval, with every event being counted for a fixed time slice.
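The round-robin schedule just described can be simulated in a few lines. The sketch below is purely illustrative (it is not MPX or PAPI code, and all names are ours): it records which event is physically counted in each time slice and which slice counts are therefore left to be estimated.

```python
# Illustrative simulation of round-robin multiplexed sampling (not MPX/PAPI
# code). true_counts[e][t] holds the occurrences of event e in time slice t;
# only the event scheduled in slice t is physically counted.

def multiplex_schedule(events, n_slices):
    """Round-robin: slice t monitors events[t % len(events)]."""
    return [events[t % len(events)] for t in range(n_slices)]

def sample_multiplexed(true_counts, events, n_slices):
    """Return {event: [(slice, count), ...]} of physically counted samples."""
    samples = {e: [] for e in events}
    for t, e in enumerate(multiplex_schedule(events, n_slices)):
        samples[e].append((t, true_counts[e][t]))
    return samples

# With 4 events and 8 slices, each event is physically counted in only 2 of
# the 8 slices; the other 6 slice counts must be estimated.
events = ["A", "B", "C", "D"]
true_counts = {e: [10 * (i + 1)] * 8 for i, e in enumerate(events)}
samples = sample_multiplexed(true_counts, events, 8)
```

Event A is seen only in slices 0 and 4 here, which is exactly the situation the estimation algorithms of Section 5 address.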
At the end of each time slice, the current event count is read and stored in a file, which is followed by the monitoring of the next event in the event list (after resetting the counter). This sequence continues throughout event monitoring. Thus, an event A is physically counted only once in the entire interval. The counts corresponding to A are not known when other events (B, C, and D) are monitored. Since the count of an event corresponding to the complete interval is desired, including the time slices when other events are monitored, a value is estimated by the multiplexing software. PAPI implements multiplexing of counters in software since no contemporary processors (that we are aware of) support hardware multiplexing. The MPX library [12] is the basis of the multiplexing technique in PAPI. The switching of events is triggered by the Unix interval timer ITIMER_PROF, while the SIGPROF signal is used as a trap to the monitored process. After every fixed time duration (10 ms by default), MPX halts the counter, stores the current count, and starts counting the next event. The counts that are not physically counted in an interval are then estimated as discussed in Section 5.1. The Pentium-III has two PMCs and can monitor two events simultaneously [8]. In MPX, one of the two counters is always set to read the total number of cycles (cyc) executed by the instrumented code, whereas the other counter is used to monitor an event of interest.

Figure 1. Multiplexed event count estimation

Table 3 shows the number of multiplexed intervals that occur in the full execution of certain benchmarks. It also shows the error distribution of the estimated multiplexed counts that is calculated by MPX (further discussed in Section 5.1). Although 58.3% of all the intervals are estimated with an error of less than 10%, 30% of the intervals are estimated with error in the range of 10 to 50%. Moreover, 10.6% of the intervals are estimated with error greater than 50%, which clearly motivates the need for new estimation techniques to obtain more accurate data.

                                    Error distribution of intervals (in %)
Workload         Intervals   CV     <10%    10-25%   25-50%   >50%
crafty           ...         ...    ...     ...      ...      ...
mcf              ...         ...    ...     ...      ...      ...
parser           ...         ...    ...     ...      ...      ...
twolf            ...         ...    ...     ...      ...      ...
vortex           ...         ...    ...     ...      ...      ...
vpr              ...         ...    ...     ...      ...      ...
Floating point
ammp             ...         ...    ...     ...      ...      ...
art              ...         ...    ...     ...      ...      ...
equake           ...         ...    ...     ...      ...      ...
Total            ...         ...    58.3    ...      ...      10.6

Table 3. Number of multiplexed intervals; coefficient of variation (CV) for three execution runs; estimation-error distribution per benchmark.

3 Related Work

The accuracy of performance counters is studied by Korn et al. in [10], which identifies the granularity of the measured code to be a major factor contributing to the inaccuracy of performance-counter values. The study of techniques for performance monitoring by Ojha [14] suggests that the on-chip counters are the least intrusive of the common performance analysis techniques. However, we found no previous work that improves the estimation of counts in the multiplexed mode.

4 Experimental Setup

This section describes our experimental setup. The software that is used to interface to and monitor the performance counters, the events that we monitor with this software, and the benchmarks we use in performance analysis are discussed below.
4.1 PAPI

In our work, we use PAPI, version 2.3.2, to interface to the performance counters on a Pentium-III, 1 GHz, dual-processor machine running the Red Hat Linux 7.3 operating system. The Linux kernel is patched with perfctr (version 2.4.1), which is a Linux/x86 performance-monitoring counters driver [15]. PAPI uses this package to gain access to the counters on Linux/x86 platforms. The normal operation of PAPI generates cumulative counts of the events that are selected to be monitored by the PMCs. To obtain the data at the end of every time slice, the PAPI code is compiled in DEBUG mode, which outputs the counter values and the event ID in addition to other information. A timer and a signal handler (similar to the one discussed in Section 2.1) are incorporated for reading the counters in the non-multiplexed mode at regular time slices. The non-multiplexed mode of counting involves monitoring one fixed event A during every time slice, as shown in Figure 2.

4.2 Benchmarks and Event Sets

We chose a subset of benchmarks from the SPEC CPU2000 benchmark suite [16] to use in the performance analysis of the new estimation techniques; we use the reference input size in all experiments. We use a subset of events to study the accuracy of estimation techniques used in multiplexing. The Pentium-III (P6 architecture) has two performance counters that can be configured to count more than 80 events [8]. The events that we monitor and use in this work are randomly chosen and are listed in Table 4. All of the benchmark source codes are hand-instrumented with PAPI calls. Pseudo code for collecting the event counts in multiplexed or non-multiplexed mode is shown below:
    main() {
        /* Benchmark variables defined */
        /* Define PAPI variables */
        /* Set the timers for sampling in multiplexed /
           non-multiplexed mode */
        /* Enable multiplex feature if counters are to be
           run in multiplexed mode */
        /* Create the event set */
        /* Start the counters */

        --- Benchmark code executes ---

        /* Stop the counters */
        return(0);
    }

After declaring the variables of the benchmark, the PAPI library is initialized, which is followed by setting the timers for the required mode. The default counting of events is non-multiplexed mode; multiplexing is enabled if desired. The event set is created and the counters are started, after which the benchmark code executes in its normal sequence. The counters are read at regular time slices of 10 ms duration, which is the smallest time slice as defined by the Unix interval timer ITIMER_PROF, and are stopped just before the completion of the benchmark.

Event                     PAPI Preset     Value
L1 Data Cache Misses      PAPI_L1_DCM     0x...
L1 Data Cache Accesses    PAPI_L1_DCA     0x...
Instructions Committed    PAPI_TOT_INS    0x...
Load Misses               PAPI_L1_LDM     0x...
Store Misses              PAPI_L1_STM     0x...
Branches Taken            PAPI_BR_TKN     0x...
Total Cycles (cyc)        PAPI_TOT_CYC    0x...

Table 4. Multiplexed events used. The PAPI preset is used in the instrumentation code; Value is the mapped event number for the Pentium-III.

5 Methodology and Algorithms

The benchmarks are instrumented with the multiplexed and non-multiplexed code (as discussed in Section 4.2) for monitoring the different events listed in Table 4. We execute an individual benchmark in each mode three times to reduce the error due to variability of data collection in different executions of the same benchmark. Our goal is to obtain minimum absolute error between the multiplexed and the non-multiplexed counts in every interval.
The non-multiplexed counts reflect the actual or accurate count of an event, since the event is monitored continuously throughout the benchmark execution, which is not the case for multiplexed counts (Section 2.1). The steps to calculate the statistics for analyzing the estimation techniques are as follows:

Figure 2. Collection of non-multiplexed and multiplexed counters at regular time slices

1. Benchmarks are executed three times each.

2. A non-multiplexed data vector consists of six* counts of a specific event for each equivalent multiplexed interval. The sum of these six counts is the non-multiplexed event count count^nm_i.

3. Using an algorithm described in the following sections, the multiplexed event count count^m_i is estimated for the intervals during which it is not physically counted.

4. The estimation error, |count^m_i - count^nm_i|, for every interval is calculated.

* This reflects the number of events in the defined event set. The number remains the same for a particular experiment, but changes if more events are multiplexed in an experiment.

Figure 2 illustrates the manner in which data is obtained from the counters in non-multiplexed and multiplexed modes at regular time slices. cyc_i(k) is the total number of cycles elapsed since the beginning of code instrumentation; i is the interval in which all the multiplexed events are physically monitored once by the PMC; k is the time slice for which a counter is accumulating event occurrences (after resetting the counter at the end of time slice k-1). Therefore, if n events are multiplexed, then k can take values from 0 to n. For simplicity, we assume cyc_i(n) to be equivalent to cyc_{i+1}(0). Figure 3 shows the rate plot of a set of events measured in the multiplexed mode. Some important variables (at the i-th interval, k-th time slice) that we use are listed below:

    rate^m_i = count^m_i(n) / (cyc_i(n) - cyc_i(n-1))                      (1)

    rate^nm_i(k) = count^nm_i(k) / (cyc_i(k) - cyc_i(k-1)),   0 < k <= n   (2)
Figure 3. Conversion of event counts to rates

where

    count^nm_i = sum_{k=1}^{n} count^nm_i(k)                               (3)

    slope_i = (rate^m_i - rate^m_{i-1}) / (cyc_i(n) - cyc_{i-1}(n))        (4)

n = number of events being multiplexed
k = the time slice during which the event count is sampled
rate^m_i and rate^nm_i(k) = rate of occurrence of an event in multiplexed and non-multiplexed mode, respectively
count^m_i(n) = number of times an event has occurred in the i-th interval, in the time slice between cyc_i(n-1) and cyc_i(n)
count^nm_i(k) = number of times an event has occurred in the k-th time slice of the i-th interval (that is, the period between cyc_i(k-1) and cyc_i(k))
slope_i = slope of the rate between the (i-1)-th and i-th intervals

We discuss the estimation algorithms in the following sections.

5.1 Base Algorithm

The estimation algorithm (henceforth called the Base Algorithm) used in PAPI is developed and implemented in [12]. It is used to estimate the counts of the multiplexed event in each interval. Consider the case shown in Figures 2 and 3. We discuss the base algorithm that is used to estimate the count of event A in the i-th interval. Event A is monitored in the time slice k=4 (between cycles cyc_i(3) and cyc_i(4)). If count^m_i(4) is the number of occurrences of event A in this time slice, then the rate of event A can be calculated using Eqn. 1 as:

    rate^m_i = count^m_i(4) / (cyc_i(4) - cyc_i(3))                        (5)

Figure 4. Base algorithm event-count calculation

Figure 4 shows the plot of Rate vs. Total Cycles for event A alone. The rate of event A, rate^m_i and rate^m_{i-1}, corresponding to intervals i and i-1 respectively, can be calculated using Eqn. 5.
In the base algorithm, rate^m_i is assumed to be constant for the entire i-th interval, and the count of event A is estimated by the following equation:

    count^m_i ~= rate^m_i * (cyc_i(n) - cyc_{i-1}(n))                      (6)

Recall that rate^m_i is calculated using the data corresponding to the period between i(n-1) and i(n) (that is, one time slice), whereas the count between i-1(n) and i(n) (that is, one interval) is being estimated.

5.2 Trapezoid-area Method (TAM)

Figure 5 shows the plot of Rate vs. Total Cycles for an event A which is being multiplexed. The rate of event A, rate^m_i and rate^m_{i-1}, corresponding to intervals i and i-1 respectively, is calculated using Eqn. 5. In the trapezoid-area method, the rate of occurrence of event A is assumed to change linearly within an interval. Thus, the estimated count of the multiplexed event A in the i-th interval is given by the area under the trapezoid PQRS (Figure 5). Mathematically,

    count^m_i ~= 0.5 * (rate^m_i + rate^m_{i-1}) * (cyc_i(n) - cyc_{i-1}(n))   (7)
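As a concrete numerical sketch of Eqns. 5-7 (function and variable names are ours, not PAPI's; the numbers are invented for illustration):

```python
# Sketch of the base (Eqn. 6) and trapezoid-area (Eqn. 7) estimates for one
# interval. rate_i is the rate over the single slice in which the event was
# physically counted (Eqn. 5); interval_cycles = cyc_i(n) - cyc_{i-1}(n).

def sampled_rate(count_in_slice, slice_cycles):
    """Eqn. 5: events per cycle over the physically counted slice."""
    return count_in_slice / slice_cycles

def base_estimate(rate_i, interval_cycles):
    """Eqn. 6: assume the sampled rate holds for the whole interval."""
    return rate_i * interval_cycles

def tam_estimate(rate_i, rate_prev, interval_cycles):
    """Eqn. 7: assume the rate varies linearly across the interval."""
    return 0.5 * (rate_i + rate_prev) * interval_cycles

# Suppose the rate rose from 0.02 to 0.06 events/cycle over a 1000-cycle
# interval: the base algorithm extrapolates the endpoint rate, while TAM
# averages the two endpoint rates.
r_prev = 0.02
r_i = sampled_rate(6, 100)              # 6 events in a 100-cycle slice
print(base_estimate(r_i, 1000))         # uses 0.06 for the whole interval
print(tam_estimate(r_i, r_prev, 1000))  # averages 0.02 and 0.06
```

When the rate is rising, the base algorithm overshoots relative to TAM, which matches the intuition behind Figure 5.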
Figure 5. TAM event-count calculation

5.3 Divided-interval Rectangular Area (DIRA)

Figure 6 shows the plot of Rate vs. Total Cycles for an event A which is being multiplexed. The rate of event A, rate^m_i and rate^m_{i-1}, corresponding to intervals i and i-1 respectively, is calculated using Eqn. 5. The algorithm is explained in the following steps:

Figure 6. DIRA event-count calculation

1. The i-th interval (where the count is being estimated) is divided or split into j equal parts (Figure 6). The rate at the k-th division is calculated by using linear interpolation as follows:

    rate^m_i(k) = slope_i * (cyc_i(k) - cyc_{i-1}(n)) + rate^m_{i-1}       (8)

where slope_i is given by Eqn. 4.

2. The area corresponding to the k-th division is calculated by assuming the rate to remain constant between cycles cyc_i(k-1) and cyc_i(k). Thus, the area under rectangle PQRS in Figure 6 is given by the formula:

    count^m_i(k) ~= rate^m_i(k) * (cyc_i(k) - cyc_i(k-1))                  (9)

where the value of rate^m_i(k) is obtained from Eqn. 8.

3. Repeat steps 1 and 2 for 1 <= k <= j.

4. The estimated count of the multiplexed event A in the i-th interval is given by:

    count^m_i = sum_{k=1}^{j} count^m_i(k)                                 (10)

In our case, j = n.

5.4 Positional Mean Error (PME)

Figure 7. PME event-count calculation

The Positional Mean Error (PME) algorithm is a two-phase algorithm. Phase 1 involves calculating the rate corrections, or positional mean errors, and Phase 2 consists of using the PMEs to correct the multiplexed rates and estimate the event count. Note that this algorithm is most useful in studies where a program must be executed multiple times, such as program performance and code optimization. The following steps comprise Phase 1:

1. The rate of event A in multiplexed mode at the k-th position, rate^m_i(k), is calculated by using linear interpolation:

    rate^m_i(k) = slope_i * (cyc_i(k) - cyc_{i-1}(n)) + rate^m_{i-1}       (11)

where slope_i is given by Eqn. 4.
2. The difference between the rate of event A in non-multiplexed mode, rate^nm_i(k), and the rate that is calculated in Step 1 above is given by:

    e_k = rate^nm_i(k) - rate^m_i(k)                                       (12)

where rate^nm_i(k) is given by Eqn. 2. This difference is the positional error for position k in the i-th interval and is calculated for 1 <= k <= n and for every interval i.

3. The PME is then given by:

    pme_k = (1 / i_total) * sum_i e_k                                      (13)

where pme_k is the Positional Mean Error for the k-th position and i_total is the total number of intervals.

Phase 1 produces n PMEs that are used in Phase 2 for estimating the event counts. Phase 2 includes the following steps:

1. Same as Step 1 of Phase 1.

2. Calculate the corrected rate at the k-th position:

    c_rate^m_i(k) = rate^m_i(k) + pme_k                                    (14)

3. Assuming a linear rate between corrected positional rates, the count in the k-th slice division is now estimated using the trapezoid-area method discussed in Section 5.2:

    count^m_i(k) ~= 0.5 * (c_rate^m_i(k) + c_rate^m_i(k-1)) * (cyc_i(k) - cyc_i(k-1))   (15)

4. The estimated count in the i-th interval is then given by:

    count^m_i = sum_{k=1}^{n} count^m_i(k)                                 (16)

5.5 Multiple Linear Regression (MLR) Model

The Multiple Linear Regression (MLR) model allows prediction of a response variable as a function of predictor variables using a linear model. In vector notation, it is given by:

    y = X b                                                                (17)

where
y = column vector of non-multiplexed counts, aggregated over the respective multiplexed intervals;
X = matrix in which each column element is a multiplexed sub-interval area, as shown in Figure 8, and the data in a specific row corresponds to a particular interval;
b = predictor parameters.

Figure 8. MLR event-count calculation

Hence, the multiplexed sub-interval areas are represented as a linear model of the actual count (the non-multiplexed count).
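This formulation can be sketched end to end in a few lines: fit b via the usual normal-equation solution b = (X^T X)^-1 (X^T y), then scale and sum the sub-interval areas. The code below is an illustrative pure-Python sketch for small j only, and all names are ours:

```python
# Illustrative sketch of the MLR estimator: fit b = (X^T X)^-1 (X^T y) by
# Gauss-Jordan elimination, then scale each sub-interval area by b[k] and
# sum. Pure Python, intended only for small systems; names are ours.

def solve(A, v):
    """Solve the small square system A b = v by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def mlr_fit(X, y):
    """Normal equations: b = (X^T X)^-1 (X^T y)."""
    m, j = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][c] for i in range(m)) for c in range(j)]
           for a in range(j)]
    Xty = [sum(X[i][a] * y[i] for i in range(m)) for a in range(j)]
    return solve(XtX, Xty)

def mlr_estimate(areas, b):
    """Scale each sub-interval area and sum the scaled areas."""
    return sum(bk * xk for bk, xk in zip(b, areas))

# Toy data: rows are intervals, columns are sub-interval areas; y holds the
# non-multiplexed (reference) counts used to train the predictor.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 3.0, 5.0]
b = mlr_fit(X, y)   # approximately [2.0, 3.0] for this consistent system
```

Once b is computed from a training run, `mlr_estimate` needs only the sub-interval areas of a new interval, which is why a pre-calculated predictor library makes the method usable at run time.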
The predictor parameter estimate is given by:

    b = (X^T X)^{-1} (X^T y)                                               (18)

The estimated parameters are then used to scale the trapezoid areas in an interval, and the sum of the scaled areas is taken as the estimated multiplexed count of that interval. Mathematically,

    scaled_x_k = b[k] * x_k                                                (19)

    count^m_i = sum_{k=1}^{j} scaled_x_k                                   (20)

For our study, the sample size is 0.5 of the population size. This algorithm, like PME, is most useful in studies such as program performance and code optimization.

6 Results

We show the results of applying the algorithms described in Sections 5.1 through 5.5 in Figure 9. These figures describe the accuracy of each algorithm for each event in terms of percentage improvement. The improvement of an algorithm is computed by comparing the estimated multiplexed counts to the non-multiplexed counts of the same event. The total absolute error (sum_k |count^m_k - count^nm_k|) is computed for all of the algorithms and compared with the base.

Figure 9. Improvement is calculated by comparing the total absolute error of estimation by each algorithm to that of the base algorithm. Panels show the integer workloads (crafty, mcf, parser; twolf, vortex, vpr) and the floating-point workloads (ammp, art, equake); the y-axis is % improvement compared to Base.

For all of the benchmarks, the new estimation algorithms result in decreased error compared to the base algorithm for each event. For the benchmark crafty, the improvement varies between 5-40% over the set of events, as shown in Figure 9. For data cache misses and load misses, the improvement realized by PME and MLR is approximately 40%. TAM and DIRA estimated the store misses and branches taken with approximately 15% greater accuracy. Similar improvements are observed for the floating-point benchmarks. The error reduction varies between 7-40% for all of the floating-point benchmarks across the six events. All of the algorithms proved to be the best for art, with improvement of more than 30% for five out of six events. Table 5 summarizes the characteristics of the algorithms; Table 6 summarizes the average estimation improvement realized for each event over all algorithms and benchmarks. The average computed is the geometric mean, and all averages exclude the negative-valued outliers. The maximum estimation improvement for all of the algorithms is at least 32%; the average estimation improvement for TAM is 11.6%, with that for DIRA being 12.0%. PME and MLR provide estimation improvements between 2 and 40% (with a single outlier of -35% for store misses in ammp for MLR), with average improvements of 13.5% and 14.1%, respectively. However, PME and MLR require correction-parameter and predictor-variable libraries, respectively, that are in turn used for estimating the multiplexed counts. These libraries are benchmark and event specific and require non-multiplexed counts for their computation. This also implies that PME and MLR can be implemented in real time, along with the program execution, only if a pre-calculated library is available. On the other hand, TAM and DIRA are more generic in nature.
They can be implemented on any benchmark for any desired event without the need for a library. PME and MLR perform better for crafty, twolf, and equake, whereas TAM and DIRA perform better for parser and mcf. In the case of art, all of the algorithms perform equally well.

Factor / Algorithm                   Base      TAM       DIRA   PME      MLR
Requires a pre-calculated library?   No        No        No     Yes      Yes
Needs non-multiplexed counts for
its implementation?                  No        No        No     Yes      Yes
Real-time implementable?             Yes       Yes       Yes    No/Yes   No/Yes
Computation intensity                Very low  Very low  Low    High     High

Table 5. Algorithm characteristics

Event / Algorithm              TAM     DIRA    PME     MLR
Range of improvement
(% min/max)                    2.6/... 3.1/... 2.1/... 2.6/...
Avg % improvement (overall)    11.6    12.0    13.5    14.1

Table 6. Average % improvement of each event for all algorithms and benchmarks

6.1 Sensitivity

The sensitivity of the algorithms to the number of multiplexed events is studied here. We multiplex six, ten, and fourteen events. When monitoring a higher number of events, the event set is chosen as a superset of the event set used when monitoring a smaller number of events. Thus, the six events listed in Table 4 are monitored in all of the experiments. We expect to see an increase in the estimation errors as the number of events is increased. This is because when a larger number of events is monitored, the frequency of the physical monitoring of an event is smaller and, therefore, the estimation is carried out over a larger duration of time. The sensitivity study is performed using the mcf (integer) and art (floating point) benchmarks. These applications are chosen since all of the algorithms performed consistently on them (Figure 9). Three experiments, namely mult6, mult10, and mult14, are performed in which six, ten, and fourteen events are multiplexed, respectively. The sensitivity of the algorithms on the six events that are monitored in the three experiments for the mcf and art benchmarks is shown in Figure 10. The performance of the algorithms is evaluated as discussed in Section 6.
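One effect that shapes these sensitivity results is that signed per-slice errors around a mean tend to cancel when they are aggregated over a longer estimation window (the "averaging problem" discussed below). A toy illustration of our own construction:

```python
# Toy illustration of the "averaging problem": alternating per-slice errors
# of +5 and -5 cancel completely once they are summed over a window that
# spans both signs, so a longer estimation window can show less total error.

def window_errors(slice_errors, window):
    """Absolute value of the signed error summed over each window."""
    return [abs(sum(slice_errors[i:i + window]))
            for i in range(0, len(slice_errors), window)]

slice_errors = [+5, -5, +5, -5, +5, -5, +5, -5]
print(sum(window_errors(slice_errors, 1)))  # every slice error counted: 40
print(sum(window_errors(slice_errors, 4)))  # errors cancel in each window: 0
```

This mirrors the tradeoff noted in the text: a larger multiplexing window gives more opportunity for cancellation, but each event is then estimated over a longer unmonitored span.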
Even with the increase in the number of multiplexed events, all of our algorithms estimate the multiplexed counts with more accuracy when compared to the base algorithm. This can be seen in Figure 10, where the dotted lines indicating the implemented algorithms are always below the solid line indicating the base algorithm. The gap between the base and any algorithm shows the extent of accuracy improvement; the larger the gap, the greater the improvement. The hypothesis that estimation errors increase with an increase in the number of multiplexed events seems to be incorrect. For instance, the plots for mcf and art show that the estimation error decreases with an increase in multiplexed events. For some of the events, the error increases from mult6 to mult10 but decreases from mult10 to mult14. The reduced error may be attributed to the averaging problem, wherein the positive and negative counts, around a mean value, cancel each other. If the interval is large, the chances of cancellation increase and the estimation error is reduced. Nevertheless, there exists a tradeoff between the number of multiplexed events, the accuracy of the estimated multiplexed counts, and the number of intervals for which the data is collected. The accuracy of the multiplexed counts depends on the behavior of the monitored event in the window where it is estimated.

7 Conclusions and Future Work

The algorithms we introduce reduce the estimation error for all of the multiplexed events for each of the benchmarks. Improvement of up to 40% is achieved by the PME and MLR algorithms, and up to 32% for TAM and DIRA. Utilizing any of these techniques will greatly reduce the estimation errors of the multiplexed counts. PME and MLR require a pre-calculated library of correction parameters (corresponding to the event-benchmark pair) for their implementation, whereas TAM and DIRA are more generic and independent of any event or benchmark.
Since the interval size is defined by time (10 ms in our case), the event counts cannot be collected at a specified cycle. Therefore, it is difficult to collect cycle-synchronized performance metrics that can provide a complete snapshot of cycle-by-cycle program behavior. We plan to address this in the future by incorporating an algorithm that we have developed for cycle synchronization into the techniques discussed in this paper. Because PME and MLR require pre-calculated, benchmark-specific libraries, their usefulness is limited to specific types of studies. Therefore, we are investigating techniques to calculate these libraries that are independent of the benchmark. We are also examining the possibility of collecting benchmark event data using the test input and applying it to reference-input executions for each algorithm. Finally, we are actively applying our estimation techniques
to PMCs on contemporary processors such as the Intel Pentium IV and AMD Opteron.

Figure 10. Sensitivity plots for mcf (left, integer) and art (right, floating point); panels (a)-(f) plot total absolute error against the number of multiplexed events

8 Acknowledgments

This work was supported by the National Science Foundation ADVANCE Institutional Transformation Program at NMSU, fund #NSF.

References

[1] Performance Inspector. ibm.com/developerworks/oss/pi/.
[2] VTune profiling software.
[3] AMD. BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD Opteron Processors. AMD, September.
[4] R. Berrendorf, H. Ziegler, and B. Mohr. PCL - the performance counter library: A common interface to access hardware performance counters on microprocessors. Research Centre Juelich GmbH, version 2.1, February.
[5] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci. A portable programming interface for performance evaluation on modern processors. The International Journal of High Performance Computing Applications, 14(3), Fall.
[6] J. Gibson, R. Kunz, D. Ofelt, and M. Heinrich. FLASH vs. (simulated) FLASH: Closing the simulation loop. In Architectural Support for Programming Languages and Operating Systems, pages 49-58.
[7] D. Heller. Rabbit: A performance counters library for Intel/AMD processors and Linux.
[8] Intel. IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide.
[9] Intel. Intel Itanium 2 Processor Reference Manual for Software Development and Optimization.
[10] W. Korn, P. Teller, and G. Castillo. Just how accurate are performance counters? In 20th IEEE International Performance, Computing, and Communications Conference, Phoenix, Arizona, April.
[11] K. London, J. Dongarra, S. Moore, P. Mucci, K. Seymour, and T. Spencer. End-user tools for application performance analysis, using hardware counters.
In International Conference on Parallel and Distributed Computing Systems, Dallas, TX, August.
[12] J. M. May. MPX: Software for multiplexing hardware performance counters in multithreaded programs. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, April.
[13] MIPS Technologies. R10000 Microprocessor User's Manual, Version 1.0. MIPS Technologies Inc., Mountain View, CA, June.
[14] A. K. Ojha. Techniques in least-intrusive computer system performance monitoring. In SoutheastCon Proceedings. IEEE.
[15] M. Pettersson. Linux x86 performance-monitoring counters driver. mikpe/linux/perfctr/.
[16] Standard Performance Evaluation Corporation (SPEC).
[17] B. Sprunt. Pentium 4 performance-monitoring features. IEEE Micro, 22(4):72-82.
[18] S. Vetter. The POWER4 Processor Introduction and Tuning Guide. IBM Corporation, International Technical Support Organization, first edition, November 2001.