Performance Profiling Techniques on Intel XScale Microarchitecture Processors


Barrie Cross

Application Note August 2002 Document Number:

2 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked reserved or undefined. Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The Intel XScale microarchitecture processors may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. 
Copies of documents which have an ordering number and are referenced in this document, or other Intel literature may be obtained by calling or by visiting Intel's website at Copyright Intel Corporation, 2002 AlertVIEW, i960, AnyPoint, AppChoice, BoardWatch, BunnyPeople, CablePort, Celeron, Chips, Commerce Cart, CT Connect, CT Media, Dialogic, DM3, EtherExpress, ETOX, FlashFile, GatherRound, i386, i486, icat, icomp, Insight960, InstantIP, Intel, Intel logo, Intel386, Intel486, Intel740, IntelDX2, IntelDX4, IntelSX2, Intel ChatPad, Intel Create&Share, Intel Dot.Station, Intel GigaBlade, Intel InBusiness, Intel Inside, Intel Inside logo, Intel NetBurst, Intel NetStructure, Intel Play, Intel Play logo, Intel Pocket Concert, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel TeamStation, Intel WebOutfitter, Intel Xeon, Intel XScale, Itanium, JobAnalyst, LANDesk, LanRover, MCS, MMX, MMX logo, NetPort, NetportExpress, Optimizer logo, OverDrive, Paragon, PC Dads, PC Parents, Pentium, Pentium II Xeon, Pentium III Xeon, Performance at Your Command, ProShare, RemoteExpress, Screamline, Shiva, SmartDie, Solutions960, Sound Mark, StorageExpress, The Computer Inside, The Journey Inside, This Way In, TokenExpress, Trillium, Vivonic, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others. 2 Application Note

Contents

Forward
Introduction
PMU Registers on Intel XScale Core
    Clock Counter (CCNT)
    Performance Monitor Count Register (PMN0 and PMN1)
    Performance Monitor Control Register (PMNC)
Time-Based Sampling and Tuning Strategies
PMU Event-Based Sampling
An Example Program - memtest.c
    Using CCNT as a Timer - PMU.C
        PMU.C Program Output
    Using PMN0 and PMN1 Registers to Count PMU Events - pmu_event.c
        pmu_event.c - Program Output
    Using the PMU for Time-based Sampling
        pmu_tbs.c - Program Output
    Using the PMU to do Event-based Sampling
        pmu_ebs.c - Program Output
Combination PMU Event Sampling
    Instruction Cache Efficiency Mode
    Data Cache Efficiency Mode
    Instruction Fetch Latency Mode
    Data/Bus Request Buffer Full Mode
    Stall/Writeback Statistics
    Instruction TLB Efficiency Mode
    Data TLB Efficiency Mode
Performance Analysis Tools
    GNU gprof
        How to use gprof
        Interpreting the results of gprof
    Intel VTune Performance Analyzer Version
        How to use the Intel VTune Analyzer
Some Sampling Tips and Aids
    Skid
    Writing to the PMU Registers
    Resetting the PMNC Register
    Intel Processor Errata that References the PMU
    Power Management Affects the PMU
Conclusion
Appendix A  Source to 80200.h
Appendix B  Source to 80310_fiq_irq.h

Figures

1  Intel Processor based on Intel XScale Microarchitecture
2  Intel VTune Performance Profiling Module View
3  Intel VTune Performance Profiling HotSpot View of vtunedemo.exe by Function
4  Intel VTune Performance Profiling Source Code View

Tables

1  Clock Counter Timing Data
2  Clock Count Register (CCNT)
3  Performance Monitor Count Register (PMN0 and PMN1)
4  Performance Monitor Control Register (CP14, register 0)
5  Performance Monitoring Events
6  Some Common Uses of the PMU
7  memtest.elf - gprof Result File

Revision History

Date      Description
August    Miscellaneous typo corrections.
June      Added Table 1. Added new Section 8.0. Various text updates.
March     Initial Release.

1.0 Forward

The purpose of this paper is to familiarize the reader with the Performance Monitoring Unit (PMU) of the Intel XScale microarchitecture (ARM* architecture compliant) and with techniques for using the PMU for performance profiling. This paper assumes that a development environment has already been set up, consisting of an Intel IQ80310 Evaluation Platform Board (IQ80310) or Intel IQ80321 Evaluation Platform Board (IQ80321) and a development system with GNUPro* installed, and that the hello world sample program has been downloaded, run and debugged with GDB*. This paper also assumes previous programming experience and knowledge of the Intel XScale microarchitecture.

2.0 Introduction

Programming the PMU on an Intel XScale microarchitecture processor is a simple exercise, especially when using the clock counter as a timer. A few instructions set up the PMU to time the application or sections within the code.

One of the first things to do when analyzing the performance of software is to make sure the hardware is not the bottleneck. For instance, when an application is waiting on the network, optimizing code makes an insignificant difference. The operator needs to verify that the system is not bound by:

- network: upgrade to Gigabit Ethernet on the test system.
- hard disk: use the fastest, best performance disks available.
- memory: consider adding more memory to the test system.
- less than optimum processor speed: upgrade to the highest speed Intel XScale microarchitecture processor for the test system.
- other system factors.

Note: Be careful to change only one thing at a time and to record results for future reference.

Next, check that the target hardware is not a limiting factor, by looking at performance analysis tools. Some tools may be available, depending on the development environment. For example, the following are some of those available to optimize code:

- GNU* gprof*
- ARM* ARMprof*
- Intel VTune Performance Analyzer
- WindRiver* WindView*
- LynuxWorks* SpyKer*

Normally, when developing a proprietary operating system or using an OS without supported tools, the source code has to be instrumented by the operator. On the Intel XScale microarchitecture, however, the PMU can be used to instrument the code.

Intel XScale microarchitecture includes hardware to help collect performance data with minimal overhead for data gathering. This feature is called the Performance Monitoring Unit (PMU). The PMU consists of a set of counters that can be used to collect data on the performance or timing of the application. One counter counts core cycles, for measuring total execution time or time elapsed. Two additional counters are capable of counting specific processor events such as cache misses.

Figure 1. Intel Processor based on Intel XScale Microarchitecture
[Block diagram: instruction cache, data cache, data RAM, mini-data cache, branch target buffer, IMMU, DMMU, fill buffer, performance monitoring, debug, power management, MAC, write buffer, JTAG, interrupt controller and bus controller.]

These counters can be used to time the execution of specific routines or the overall execution time of an application. The timer can also be used to do hot-spot analysis, which shows how much time the processor spends on each part of the program. It works by stopping the processor at a regular interval (such as 1 ms) and recording which instruction the processor was executing at that time. This is called time-based sampling. Building up a large sample of execution data (3000 or more records) gives a statistically significant picture of where the processor spends most of its time. These hot-spots are the best places to spend optimizing time.

The PMU can also collect CPU event data. These events are described in detail in Chapter 5.0, PMU Event-Based Sampling.

3.0 PMU Registers on Intel XScale Core

Note: See chapter 12 in the Intel Processor based on Intel XScale Microarchitecture Developer's Manual for more details.

The PMU consists of four registers:

- a clock counter register.
- two event-counting registers.
- a configuration register used to configure the three counting registers.

3.1 Clock Counter (CCNT)

The clock counter can be used as a timer, to measure the execution time of a particular routine. It is a 32-bit counter that can interrupt the processor at roll-over. The clock counter counts core clock cycles, which on the Intel Processor based on Intel XScale Microarchitecture (80200) can run from 200 MHz to 733 MHz, depending on the core clock multiplier. At 200 MHz, the counter runs from 0 to 0xFFFF FFFF and rolls over to 0x0 in about 21.5 seconds. At 733 MHz, the counter rolls over in about 5.9 seconds, with a resolution of about 1.4 ns.

There is also a divider for the clock counter, which makes it increment once every 64 core clock cycles. With the divider enabled and the core clock running at 733 MHz, the clock counter rolls over in about 375 seconds, with a resolution of about 87 ns.

The core clock that the clock counter counts is derived from the reference clock to the Intel XScale core. For the IQ80310, the reference clock is exactly 66 MHz. According to Table 1, a CCLKCFG value of nine gives a CLK multiplier of 11. Therefore, 66 MHz times an 11 CLK multiplier equals a 726 MHz core clock speed, which is the figure the timing calculations in the example programs use.

Table 1. Clock Counter Timing Data
Columns: CCLKCFG [3:0] (Coprocessor 14, register 16); Multiplier for CCLK; CCLK (MHz); No Divider: CCNT Rollover (secs), Resolution (ns); 64 Clock Divider: CCNT Rollover (secs), Resolution (ns)

Note: The clock speed is also dependent upon operating voltage.
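The rollover and resolution figures quoted above follow directly from the 32-bit counter width and the optional divide-by-64. The following is a minimal sketch of that arithmetic in portable C, not code from the application note; the function names ccnt_rollover_secs and ccnt_resolution_ns are hypothetical and chosen only for illustration.

```c
/* Sketch only: helper names are hypothetical, not from the application note. */

/* Seconds until the 32-bit CCNT wraps, for a core clock given in Hz.
   divider_on != 0 models PMNC.D = 1 (CCNT counts every 64th core cycle). */
double ccnt_rollover_secs(double core_hz, int divider_on)
{
    double counts = 4294967296.0;                    /* 2^32 counter states */
    double cycles_per_count = divider_on ? 64.0 : 1.0;
    return counts * cycles_per_count / core_hz;
}

/* Time represented by one CCNT increment, in nanoseconds. */
double ccnt_resolution_ns(double core_hz, int divider_on)
{
    double cycles_per_count = divider_on ? 64.0 : 1.0;
    return cycles_per_count * 1.0e9 / core_hz;
}
```

Calling ccnt_rollover_secs(733.0e6, 1), for example, gives a value close to the 375 seconds quoted in the text, and the same two functions can be used to fill in one row of Table 1 for any CCLK value.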

The clock counter can also be used for hot-spot analysis using time-based sampling. This is done by stopping the processor at a regularly timed interval, approximately every 1 ms. To achieve this, program into the CCNT register a value that is 1 ms worth of clock cycles short of the point where the counter rolls over. When the clock counter rolls over, it interrupts the processor 1 ms after the CCNT register starts counting. The equation to calculate a 1 ms sampling rate, assuming an IQ80310, is:

CCNT Value = 0x0 - ( ( CCLKCFG + 2 ) * reference clock * sampling rate in seconds )
or
CCNT Value = 0x0 - ( ( 9 + 2 ) * 66,000,000 Hz * 0.001 secs. )
CCNT Value = 0x0 - ( 726,000 )
CCNT Value = 0xFFF4 EC10

Table 2. Clock Count Register (CCNT)
Reset value: unpredictable

Bits 31:0 (Read / Write): 32-bit clock counter. Reset to 0 by the PMNC register. When the clock counter reaches its maximum value 0xFFFF FFFF, the next cycle causes it to roll over to zero and generate an IRQ or FIQ when enabled.

3.2 Performance Monitor Count Register (PMN0 and PMN1)

The PMN0 and PMN1 registers are similar to the CCNT register, except that they are incremented on PMU events instead of core clocks. Examples of PMU events that can be counted are:

- cache misses
- TLB misses
- branch mispredictions

The event counters do not count OS events like context switches and memory region accesses.

These counters can be used to count CPU events in the same way that the CCNT register counts clock cycles, for instance to count the total number of stalls caused by the data cache buffers being full during a run of the application. PMU events can also be used to do event-based sampling, where the processor is stopped at a cache miss and the current location of the Program Counter is saved. Given enough data to be statistically significant, event-based sampling can show which specific instructions of the application are the most cache inefficient.

Table 3. Performance Monitor Count Register (PMN0 and PMN1)
Reset value: unpredictable

Bits 31:0 (Read / Write): 32-bit event counter. Reset to 0 by the PMNC register. When an event counter reaches its maximum value 0xFFFF FFFF, the next event it needs to count causes it to roll over to zero and generate an IRQ or FIQ interrupt when enabled.
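The CCNT preload calculation from Section 3.1 can be sketched in C as follows. This is an illustration under stated assumptions rather than code from the application note: the function name ccnt_preload is hypothetical, the multiplier is taken to be CCLKCFG + 2 as in the equation above, and the 66 MHz reference clock used in the usage note is inferred from the 0xFFF4 EC10 result.

```c
#include <stdint.h>

/* Sketch only: ccnt_preload is a hypothetical helper name.
   Returns the value to write to CCNT so that it overflows (and can raise
   an interrupt) interval_us microseconds after counting starts. Assumes
   the core clock multiplier is CCLKCFG + 2 and the clock divider is off. */
uint32_t ccnt_preload(uint32_t cclkcfg, uint32_t ref_hz, uint32_t interval_us)
{
    uint64_t core_hz = (uint64_t)(cclkcfg + 2u) * ref_hz;   /* core clock in Hz */
    uint64_t cycles  = core_hz * interval_us / 1000000u;    /* cycles per sample */
    return (uint32_t)(0u - (uint32_t)cycles);               /* 0x0 - cycles, mod 2^32 */
}
```

With cclkcfg = 9 and ref_hz = 66000000, an interval of 1000 microseconds yields 0xFFF4EC10, the value derived above, and an interval of 10 microseconds yields 0xFFFFE3A4. The subtraction relies on unsigned wrap-around, so no special handling of the roll-over point is needed.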

3.3 Performance Monitor Control Register (PMNC)

The PMNC is used to set up the PMU. This register controls the events that PMN0 and PMN1 monitor. The PMU can count core clock cycles and PMU events, or interrupt the processor at a counter rollover. The PMNC register also tracks which counter has overflowed.

In order to have the PMU trigger an interrupt to the processor, a programmer must:

- enable the counters with the enable (E) bit of the PMNC.
- enable interrupts with the interrupt enable (inten) bits of the PMNC.
- select an event from the event list in Table 12-4 of the Intel Processor based on Intel XScale Microarchitecture Developer's Manual.
- make sure interrupts are enabled in the INTCTL and CPSR registers.

Note: The IQ80310 and IQ80321 evaluation boards use slightly different interrupt routing paths on the boards. When using the IQ80310, please refer to the Intel IQ80321 Evaluation Platform Board Manual, chapter 5. When using the IQ80321, please refer to the Intel IQ80310 Evaluation Platform Board Manual, chapter 10.

Table 4. Performance Monitor Control Register (CP14, register 0)
Fields: evtcount1, evtcount0, flag, inten, D, C, P, E
Reset value: E and inten are 0, others unpredictable

Bits 31:28 (Read-unpredictable / Write-as-0): Reserved.

Bits 27:20 (Read / Write): Event Count1. Identifies the source of events that PMN1 counts. See Table 5 for a description of the values this field may contain.

Bits 19:12 (Read / Write): Event Count0. Identifies the source of events that PMN0 counts. See Table 5 for a description of the values this field may contain.

Bit 11 (Read-unpredictable / Write-as-0): Reserved.

Bits 10:8 (Read / Write): Overflow/Interrupt Flag. Identifies which counter overflowed.
    Bit 10 = clock counter overflow flag
    Bit 9 = performance counter 1 overflow flag
    Bit 8 = performance counter 0 overflow flag
    Read values: 0 = no overflow, 1 = overflow has occurred
    Write values: 0 = no change, 1 = clear this bit

Bit 7 (Read-unpredictable / Write-as-0): Reserved.

Bits 6:4 (Read / Write): Interrupt Enable. Used to enable/disable interrupt reporting for each counter.
    Bit 6 = clock counter interrupt enable
    Bit 5 = performance counter 1 interrupt enable
    Bit 4 = performance counter 0 interrupt enable
    For each bit: 0 = disable interrupt, 1 = enable interrupt

Bit 3 (Read / Write): Clock Counter Divider (D).
    0 = CCNT counts every processor clock cycle
    1 = CCNT counts every 64th processor clock cycle

Bit 2 (Read-unpredictable / Write): Clock Counter Reset (C).
    0 = no action
    1 = reset the clock counter to 0x0

Bit 1 (Read-unpredictable / Write): Performance Counter Reset (P).
    0 = no action
    1 = reset both performance counters to 0x0

Bit 0 (Read / Write): Enable (E).
    0 = all 3 counters are disabled
    1 = all 3 counters are enabled

The interrupt control register may have to be programmed in order to get time-based sampling and event-based sampling working. The INTSTR register also needs to be set up correctly. Please refer to chapter 9 of the Intel Processor based on Intel XScale Microarchitecture Developer's Manual.
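The bit layout in Table 4 can be captured as a set of masks plus a small helper that assembles a PMNC value. The sketch below is one possible way to write this and is not taken from the application note; the macro names and the pmnc_value function are hypothetical. It only computes the 32-bit value and leaves the actual coprocessor write to target-specific code.

```c
#include <stdint.h>

/* Sketch only: names are hypothetical; bit positions follow Table 4. */
#define PMNC_E           (1u << 0)   /* enable all three counters */
#define PMNC_P           (1u << 1)   /* reset PMN0 and PMN1 to 0 */
#define PMNC_C           (1u << 2)   /* reset CCNT to 0 */
#define PMNC_D           (1u << 3)   /* CCNT counts every 64th cycle */
#define PMNC_INTEN_PMN0  (1u << 4)   /* interrupt on PMN0 overflow */
#define PMNC_INTEN_PMN1  (1u << 5)   /* interrupt on PMN1 overflow */
#define PMNC_INTEN_CCNT  (1u << 6)   /* interrupt on CCNT overflow */
#define PMNC_FLAG_PMN0   (1u << 8)   /* write 1 to clear PMN0 overflow flag */
#define PMNC_FLAG_PMN1   (1u << 9)   /* write 1 to clear PMN1 overflow flag */
#define PMNC_FLAG_CCNT   (1u << 10)  /* write 1 to clear CCNT overflow flag */
#define PMNC_FLAG_ALL    (PMNC_FLAG_PMN0 | PMNC_FLAG_PMN1 | PMNC_FLAG_CCNT)

/* Combine two 8-bit event numbers with control bits into a PMNC value.
   Reserved bits (31:28, 11 and 7) are always left as 0. */
uint32_t pmnc_value(uint32_t evt0, uint32_t evt1, uint32_t control_bits)
{
    return ((evt1 & 0xFFu) << 20)        /* evtcount1, bits 27:20 */
         | ((evt0 & 0xFFu) << 12)        /* evtcount0, bits 19:12 */
         | (control_bits & 0x77Fu);      /* flags, inten, D, C, P, E */
}
```

For example, pmnc_value(0, 0, PMNC_FLAG_ALL | PMNC_C | PMNC_P | PMNC_E) clears all overflow flags, resets and enables all counters with interrupts and the divider disabled, and evaluates to 0x00000707. Masking the control bits keeps a caller from accidentally setting a reserved bit.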

4.0 Time-Based Sampling and Tuning Strategies

Time-based sampling is just one of a few ways to extract performance data from an embedded target. Sampling can also be based on OS events or PMU events. Many performance analysis tools are based on OS events such as task switches. OS event data can be gathered by instrumenting the context switch of the OS. Another method of performance data collection is to collect data when a PMU event occurs, such as a cache miss. This method is described in the next chapter.

Time-based sampling is the method of analysis where the processor is stopped at a regularly timed interval (like 1 ms) and performance data is collected. This is not simply timing the elapsed time of the application and comparing it to a previous run, but actually interrupting the processor every 1 ms while the application is running in a steady state. After collecting data for a significant amount of time, say 15 seconds, 15,000 samples are available that statistically profile the execution performance of the application. Time-based sampling is like taking a large number of snapshots of CPU activity for performance analysis.

One of the biggest benefits of time-based sampling is that it is less intrusive. Being minimally intrusive is crucial in getting valid performance data. The gathering of data should not interfere with the normal operation of the system, and the process of data collection should not introduce additional errors. Sampling can be extended over a long period of time to collect more samples and reach a statistically significant amount of performance data. The concept is to gather data on where the processor is spending a significant amount of its time and not to worry about edge cases. The best return on investment for performance enhancement is to find where the CPU hot-spots are.

When choosing to optimize with sampling strategies, time-based sampling ought to be the first and most often used strategy. It gives an overall view of how the application is running and where time needs to be spent tuning code. Once the code that is hindering the performance of the processor (the hot-spot) has been optimized satisfactorily, the next slowest section of code can be tuned or event-based sampling can be tried.

Some shortcomings of time-based sampling are:

- First, only a statistical picture of the system performance is given. In other words, data for a particular line or section of code may or may not show up, depending on how quickly that section executes.
- Second, time-based sampling gives a good overall picture, but does not show how to take full advantage of the Intel XScale microarchitecture.

Intel XScale microarchitecture processors have a 32 Kbyte instruction cache and performance is highly dependent on reducing the cache miss rate. It can cost over 80 core clock cycles to fill a cache line after a cache miss, because the core is running at 733 MHz and has to go out to relatively slow (100 MHz) SDRAM. PMU event-based sampling can pinpoint the instructions that are causing cache misses and allow the program to be modified to avoid these costly cache misses. See the optimization guide in Appendix B of the Intel Processor based on Intel XScale Microarchitecture Developer's Manual for more details.
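Once a run has produced an array of sampled Program Counter values, the hot-spots are found by counting how many samples fall inside each function. The following is a minimal host-side sketch of that post-processing step, not code from the application note; the struct func_range type, the bin_samples function and the address ranges in the usage note are all hypothetical, and a real tool would take the ranges from the linker map or symbol table.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: hypothetical symbol-table entry for one function. */
struct func_range {
    const char *name;
    uint32_t    start;   /* address of the first instruction */
    uint32_t    end;     /* one past the last instruction */
    unsigned    hits;    /* number of PC samples inside [start, end) */
};

/* Credit each sampled PC to the function whose range contains it.
   Returns the number of samples that matched no function. */
unsigned bin_samples(const uint32_t *pcs, size_t n_pcs,
                     struct func_range *funcs, size_t n_funcs)
{
    unsigned unmatched = 0;
    for (size_t i = 0; i < n_pcs; i++) {
        size_t f;
        for (f = 0; f < n_funcs; f++) {
            if (pcs[i] >= funcs[f].start && pcs[i] < funcs[f].end) {
                funcs[f].hits++;
                break;
            }
        }
        if (f == n_funcs)
            unmatched++;   /* PC outside every known function */
    }
    return unmatched;
}
```

Given made-up ranges 0x1000 to 0x1040 for test_memset and 0x2000 to 0x2080 for memset, the samples 0x1004, 0x1010, 0x2000, 0x103C and 0x3000 produce three hits for test_memset, one for memset and one unmatched sample. Dividing each hit count by the total number of samples gives the share of time spent in each function.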

5.0 PMU Event-Based Sampling

PMU event-based sampling differs from time-based sampling. Instead of interrupting on a regular interval, it interrupts execution of the program when a PMU event occurs. An analogy is a red-light traffic camera, which takes a picture only when someone runs a red light. Think of how long it would take to catch a red-light runner if the camera took a picture once every minute; it might take months. Now apply that to the small and very fast execution inside a processor based on Intel XScale microarchitecture: every time the PMU sees a cache miss, it takes a picture of what is going on at the time of the cache miss.

PMU event-based sampling is different from event counting, in that counting tells how many times an event happened, but does not tell which instruction caused the event. Event counting is much easier to implement: event-based sampling involves interrupting the processor, while event counting happens automatically once the PMU has been told to start and stop.

Finding the instructions that miss the cache matters because every cache miss can cost over 80 cycles. The CPU is running at 733 MHz and the memory bus is running at 100 MHz, so the processor stalls while memory is being accessed and the cache fills. Once it is known that a certain instruction is causing a significant number of cache misses, the cache can be preloaded before getting to that instruction, saving the possible 80 cycles for a cache miss multiplied by the number of times that instruction runs in a loop.

Note: The following information is copied from the Intel Processor based on Intel XScale Microarchitecture Developer's Manual:

Table 5 lists events that may be monitored by the PMU. Each of the Performance Monitor Count Registers (PMN0 and PMN1) can count any listed event. Software selects which event is counted by each PMNx register by programming the evtcountx fields of the PMNC register.

Table 5. Performance Monitoring Events
Event Number (evtcount0 or evtcount1) and Event Definition

0x0   Instruction cache miss requires fetch from external memory.
0x1   Instruction cache cannot deliver an instruction. This could indicate an ICache miss or an ITLB miss. This event occurs every cycle in which the condition is present.
0x2   Stall due to a data dependency. This event occurs every cycle in which the condition is present.
0x3   Instruction TLB miss.
0x4   Data TLB miss.
0x5   Branch instruction executed, branch may or may not have changed program flow.
0x6   Branch mispredicted. (B and BL instructions only.)
0x7   Instruction executed.
0x8   Stall because the data cache buffers are full. This event occurs every cycle in which the condition is present.
0x9   Stall because the data cache buffers are full. This event occurs once for each contiguous sequence of this type of stall.
0xA   Data cache access, not including Cache Operations (defined in Section 7.2.8 of the Intel Processor based on Intel XScale Microarchitecture Developer's Manual).
0xB   Data cache miss, not including Cache Operations (defined in Section 7.2.8 of the Intel Processor based on Intel XScale Microarchitecture Developer's Manual).
0xC   Data cache write-back. This event occurs once for each 1/2 line (four words) that is written back from the cache.
0xD   Software changed the PC. This event occurs any time the PC is changed by software and there is not a mode change. For example, a mov instruction with PC as the destination triggers this event. Executing a swi from User mode does not trigger this event, because it incurs a mode change.
0x10  The BCU received a new memory request from the core.
0x11  The BCU's request queue is full. This event takes place each clock cycle in which the condition is met. A high incidence of this event indicates the BCU is often waiting for transactions to complete on the external bus.
0x12  The number of times the BCU queues were drained due to a Drain Write Buffer command or an I/O transaction as identified by C = 0 and B = 0 (cacheable and bufferable page attribute bits).
0x13  Reserved, unpredictable results.
0x14  The BCU detected an ECC error, but no ELOG register was available in which to log the error. (See ECC Error Registers on page 11-9 of the Intel 80200 Processor based on Intel XScale Microarchitecture Developer's Manual for a description of the ELOG registers.)
0x15  BCU detected a 1-bit error while reading data from the bus. This event may be counted even when reporting of 1-bit errors is disabled. (See Section 11.3, Error Handling, on page 11-2 of the Intel Processor based on Intel XScale Microarchitecture Developer's Manual for a description of 1-bit errors.)
0x16  RMW cycle occurred due to narrow write on ECC-protected memory. (See Section 11.2, ECC, on page 11-1 of the Intel Processor based on Intel XScale Microarchitecture Developer's Manual for a description of ECC and RMW cycles.)
all others  Reserved, unpredictable results.
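Because Table 5 has gaps (0xE, 0xF and 0x13 are reserved, as is everything above 0x16), code that iterates over event numbers benefits from a check before programming evtcount0 or evtcount1. The following is a small sketch of such a check and is not part of the application note; the enum constants and the pmu_event_is_defined function are hypothetical names covering only a few of the events.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch only: a few hypothetical names for event numbers from Table 5. */
enum pmu_event {
    PMU_EVT_ICACHE_MISS   = 0x0,
    PMU_EVT_DATA_STALL    = 0x2,
    PMU_EVT_BRANCH_MISPRD = 0x6,
    PMU_EVT_INSTRUCTION   = 0x7,
    PMU_EVT_DCACHE_ACCESS = 0xA,
    PMU_EVT_DCACHE_MISS   = 0xB
};

/* True when Table 5 gives the event number a definition; reserved
   values produce unpredictable results and should not be programmed. */
bool pmu_event_is_defined(uint32_t evt)
{
    if (evt <= 0xDu)
        return true;                 /* 0x0 through 0xD are all defined */
    if (evt >= 0x10u && evt <= 0x16u)
        return evt != 0x13u;         /* 0x13 is reserved inside this range */
    return false;                    /* 0xE, 0xF and everything above 0x16 */
}
```

A loop that steps through every event number can call pmu_event_is_defined(j) and skip the iteration when it returns false, instead of listing the reserved values by hand.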

6.0 An Example Program - memtest.c

Here is a simple program to demonstrate the value of using existing optimized library routines. This program sets a block of 500 bytes of memory to the value 'a'. It shows two ways to do it: one way is to use a for loop to initialize memory, and the other is to use the pre-optimized memset function supplied in the standard C library. Each of the following examples builds on this piece of example code to demonstrate different performance profiling techniques.

    #include <stdio.h>
    #include <string.h>   /* memset */

    char buf[1000];

    /* Byte-by-byte fill, to compare against the library memset */
    void test_memset(char *buf, char val, int num)
    {
        int i;
        for(i=0; i<num; i++)
            buf[i]=val;
    }

    int main()
    {
        unsigned ticks = 0;   /* used by the later PMU examples */
        int i;

        for(i=0; i<5000; i++)
        {
            test_memset(buf, 'a', 500);
            memset(buf, 'a', 500);
        }
        return 0;
    }

6.1 Using CCNT as a Timer - PMU.C

Since the PMU is being set up to time the execution of both the memset and test_memset functions, the for loop is split in two and some inline assembly functions are defined to set the CPU clock (line #2 below), set up and run the clock counter (line #6) and read the counter (line #10).

First, set the CCLKCFG register for 733 MHz (line #32). This is needed because the timing calculations are based on the CPU clock speed. Next, enable the PMU to do clock counting (line #33). The value written to the PMNC control register is 0x . This value sets the events to 0, clears all flags, disables all interrupts, disables the divider, resets all counters and then enables all counters. Once the loop has been executed, read the clock counter (line #38) and reset the PMU (line #40) to time the second loop.

     1 #include <stdio.h>
     2 inline void _Write_CCLKCFG(unsigned VAL) /* write to the CCLKCFG register */
     3 {
     4     asm( "mcr\tp14, 0, %0, c6, c0, 0" : : "r" (VAL) );
     5 }
     6 inline void _Write_PMNC(unsigned VAL) /* write to the PMNC register of the PMU */
     7 {
     8     asm( "mcr\tp14, 0, %0, c0, c0, 0" : : "r" (VAL) );
     9 }
    10 inline unsigned _Read_CCNT(void) /* read the CCNT register of the PMU */
    11 {
    12     register unsigned _val_;
    13     asm volatile( "mrc\tp14, 0, %0, c1, c0, 0" : "=r" ( _val_ ) );
    14     return _val_;
    15 }
    16
    17 char buf[1000];
    18
    19 void test_memset( char *buf, char val, int num )
    20 {
    21     int i;
    22     for( i=0; i<num; i++ )
    23         buf[i] = val;
    24 }
    25
    26 int main()
    27 {
    28     unsigned ticks = 0;
    29     int i;
    30
    31
    32     _Write_CCLKCFG( 9 ); /* set clock to 733 MHz */
    33     _Write_PMNC( 0x ); /* Enable the PMU and reset all */
    34     for( i=0; i<5000; i++ )
    35     {
    36         test_memset( buf, 'a', 500 );
    37     }
    38     ticks = _Read_CCNT(); /* read the PMU clock counter */
    39     printf( "test_memset took %4u ticks = %02.6f ms and ", ticks, (double)ticks/ );
    40     _Write_PMNC( 0x ); /* Enable the PMU and reset all */
    41     for( i=0; i<5000; i++ )
    42     {
    43         memset( buf, 'a', 500 );
    44     }
    45     ticks = _Read_CCNT(); /* read the PMU clock counter */
    46     printf( "memset took %4u ticks = %02.6f ms\n", ticks, (double)ticks/ );
    47     return 0;
    48 }

PMU.C Program Output

The output reports the time taken by the test_memset loop and shows that the memset library routine took just 3.3 ms to execute. That is an improvement of 31 times in speed, just by using a standard C library routine.

    test_memset took ticks = ms and memset took ticks = ms

6.2 Using PMN0 and PMN1 Registers to Count PMU Events - pmu_event.c

This example shows how to use the PMU to count events such as cache misses and branch mispredictions. A loop (line #39) is set up around the test loop, to count events in sequential order. During the first run of the loop, the PMU is set up to count event #0, "Instruction cache miss requires fetch from external memory". The next run counts event #1 and so on.

     1 #include <stdio.h>
     2 inline void _Write_CCLKCFG(unsigned VAL)
     3 {
     4     asm( "mcr\tp14, 0, %0, c6, c0, 0" : : "r" (VAL) );
     5 }
     6 inline void _Write_PMNC(unsigned VAL)
     7 {
     8     asm( "mcr\tp14, 0, %0, c0, c0, 0" : : "r" (VAL) );
     9 }
    10 inline unsigned _Read_CCNT(void)
    11 {
    12     register unsigned _val_;
    13     asm volatile( "mrc\tp14, 0, %0, c1, c0, 0" : "=r" ( _val_ ) );
    14     return _val_;
    15 }
    16 inline unsigned _Read_PMN0(void)
    17 {
    18     register unsigned _val_;
    19     asm volatile( "mrc\tp14, 0, %0, c2, c0, 0" : "=r" ( _val_ ) );
    20     return _val_;
    21 }
    22
    23 char buf[1000];
    24
    25 void test_memset( char *buf, char val, int num )
    26 {
    27     int i;
    28     for( i=0; i<num; i++ )
    29         buf[i] = val;
    30 }
    31
    32 int main()
    33 {
    34     unsigned ticks = 0, events = 0;
    35     unsigned int pmnc_val;
    36     int i, j;
    37
    38     _Write_CCLKCFG( 9 ); /* set clock to 733 MHz */
    39     for (j=0; j<23; j++ )
    40     {
    41         if( j == 0x0E || j == 0x0F ) /* avoid using these reserved values for the event numbers */
    42             continue;
    43         pmnc_val = 0x | (j<<12); /* shift j left by 12 to program the Event Count 0 bits */
    44         _Write_PMNC( pmnc_val ); /* Enable the PMU and reset all */
    45         for( i=0; i<5000; i++ )
    46         {
    47             test_memset( buf, 'a', 500 );
    48             memset( buf, 'a', 500 );
    49         }
    50         ticks = _Read_CCNT(); /* read the PMU clock counter */
    51         events = _Read_PMN0(); /* read the PMU event counter */
    52         printf( "%6u ticks=%02.4f ms, events=%8u, pmnc=%08x\n",
    53             ticks, (double)ticks/726000, events, pmnc_val);
    54     }
    55     return 0;
    56 }

pmu_event.c - Program Output

Please refer to Table 5, Performance Monitoring Events on page 14, for explanations of the different performance monitoring events. The output of pmu_event shows the count of PMU events in sequential order. The first line corresponds to event #0x0, the second line corresponds to event #0x1 and so on. Notice that on event #0x0 (instruction cache misses) the program took slightly longer to run because the instruction cache was being loaded. Then the event number is incremented and the program runs again, with the PMU counting the new event:

    ticks= ms, events= 15, pmnc=
    ticks= ms, events= , pmnc=
    ticks= ms, events= , pmnc=
    ticks= ms, events= 1, pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= , pmnc=
    ticks= ms, events= , pmnc=
    ticks= ms, events= , pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= , pmnc=0000a
    ticks= ms, events= 0, pmnc=0000b
    ticks= ms, events= 0, pmnc=0000c
    ticks= ms, events= , pmnc=0000d
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= 0, pmnc=

6.3 Using the PMU for Time-based Sampling

The next example changes the use of the PMU from counting and timing to sampling. The processor is interrupted every 10 µs and the value of the PC is saved to an array. A slower sampling rate is normally preferred, but a faster rate is used in this example to collect a more significant amount of data. Over time, the saved PC values give a statistical sampling of where the program spends most of its time.

Notice that the inline functions for the PMU registers have moved into a header file named 80200.h (#2), and that a new header file called 80310_fiq_irq.h (#3) has been included, which contains some specific interrupt handling code. Also added is _Read_LR() (#6), an inline function that reads the Link Register, which holds the value of the Program Counter from before the ISR took over execution of the program.

To service the interrupt, add the code from line #17 to #47 in the sample code below. The inthandlerattach() function replaces the current RedBoot FIQ vector with the new vector, fiqhandlerpmu(), and inthandlerdetach() puts the old vector back when the process is completed.

The function fiqhandlerpmu() runs every time there is an interrupt. The first step is to set up the FIQ mode stack pointer. Choose carefully a location in RAM that is not overwritten by the application. Next, run ISR_PROLOG (#33), an assembly routine that saves the current registers onto the stack. Then find out what caused the interrupt. When the interrupt was caused by the PMU, read the Link Register, which contains the Program Counter from before the interrupt. Also reprogram the CCNT register to interrupt the CPU again at the same sample rate as before (#39) and reset the PMNC register back to interrupt on overflow (#40). When the PMU did not cause the interrupt (#35), call ISR_CHAIN (#44), which calls the old FIQ vector.
Then pop the registers off the stack before exiting the FIQ handler by calling ISR_EPILOG (#46).

The next section is part of main(). First, set up the interrupt steering register (#64) to FIQ by reading the INTSTR register, setting the FIQ bit to 1, and writing the value back into the INTSTR register. Then attach the interrupt handler (#65) as described in the previous paragraph, replacing the old FIQ handler with the new fiqhandlerpmu(). Line #66 loads the sampling rate value into the CCNT register. Instead of counting from zero and stopping after the program finishes, program the CCNT value to (0x0 - 10 µs), or 0xFFFFE3A4; when the clock counter has counted that many cycles, it overflows and interrupts the processor. Once the processor has been interrupted and the interrupt serviced, re-program the CCNT register with the (overflow - 10 µs) value and re-enable counting. Lines #67 and #68 reset the PMU and enable the PMU to cause FIQ interrupts on a CCNT overflow. Notice that the value being programmed into the PMNC has changed from 0x to 0x (#67). The reason for this is to enable the clock counter interrupts without resetting the value just programmed into the CCNT.

Please realize that interrupting the processor every 10 µs severely degrades the performance of the application being profiled; it is done here only as an example, to generate a large amount of data.

After running the target segment of code, disable the interrupts (#76), reset the PMU (#78), and detach the interrupt handler (#79). Then display the results of the time-based sampled data collected.

 1 #include <stdio.h>
 2 #include "80200.h"
 3 #include "80310_fiq_irq.h"
 4 /************** SAMPRATE = ( 0x0 - ( 726 * microseconds ) ) *************/
 5 #define SAMPRATE 0xFFFFE3A4 /* MHz */
 6 inline unsigned _Read_LR(void)
 7 {
 8   register unsigned _val_;
 9   asm volatile("mov\t%0, r14" : "=r" (_val_));

10   return _val_;
11 }

13 unsigned int status, count=0, PCsample[0xFFFF];
14 ISR_SERVICE *fiqval; /* In 80310_fiq_irq.h */
15 ISR_SERVICE next_fiq_service;

17 /*************** ISR STUFF *******************/
18 void inthandlerattach( ISR_SERVICE fiq )
19 {
20   fiqval = (ISR_SERVICE*)FIQVECTOR;
21   next_fiq_service = *fiqval;
22   *fiqval = fiq;
23 }

25 void inthandlerdetach()
26 {
27   *FIQVECTOR = (unsigned int)next_fiq_service;
28 }

30 void __attribute__ ((interrupt("fiq"))) fiqhandlerpmu(void)
31 {
32   asm("ldr sp,=0xa "); /* stack pointer in FIQ mode */
33   ISR_PROLOG;
34   status = _Read_INT_Source();
35   if( status & 0x ) /* If the PMU caused the FIQ */
36   {
37     PCsample[count] = _Read_LR();
38     count++;
39     _Write_CCNT( SAMPRATE );
40     _Write_PMNC( 0x ); /* Enable the PMU and reset */
41   }
42   else
43   {
44     ISR_CHAIN( next_fiq_service ); /* call old vector */
45   }
46   ISR_EPILOG;
47 }

49 /*************** TBS Example Code *****************/
50 char buf[1000];
51 void test_memset( char *buf, char val, int num )
52 {
53   int j;
54   for( j=0; j<num; j++ )
55     buf[j] = val;
56 }

58 int main()
59 {

60   unsigned int i, j, location=1;


63   _Write_CCLKCFG( 9 ); /* 733 MHz */
64   _Write_INT_Steering( _Read_INT_Steering() | 1 ); /* Steer FIQ */
65   inthandlerattach( fiqhandlerpmu ); /* attach the interrupt handler */
66   _Write_CCNT( SAMPRATE ); /* Interrupting every? ms */
67   _Write_PMNC( 0x ); /* Enable the PMU and reset */
68   _Write_INT_Control( _Read_INT_Control() | 5 ); /* Enable FIQ */

70   for( i=0; i<5000; i++ )
71   {
72     test_memset( buf, 'a', 500 );
73     memset( buf, 'a', 500 );
74   }

76   _Write_INT_Control( _Read_INT_Control() & 0xFFFFFFFA ); /* Disable FIQ */

78   _Write_PMNC( 0x ); /* Disable the PMU and reset all */
79   inthandlerdetach();

81   /***************** Display Results ******************/
82   printf( "count=%d\n", count );
83   for( i=0; i<count; i++ )
84   {
85     if( PCsample[i] )
86     {
87       for( j=i+1; j<count; j++ )
88       {
89         if( PCsample[i] == PCsample[j] )
90         {
91           location++;
92           PCsample[j]=0;
93         }
94       }
95       printf( "PCsample[%04d]=0x%08X x %d\n", i, PCsample[i]-4, location );
96       location=1;
97     }
98   }
99   return 0;
100 }
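The SAMPRATE constant follows from the formula in the listing's comment, SAMPRATE = 0x0 - (726 * microseconds). The sketch below works through that arithmetic for a 32-bit up-counter that interrupts on overflow. The function names are hypothetical, and the figure of 726 ticks per microsecond is taken from that comment rather than measured.

```c
#include <stdint.h>

/* Reload value that makes a 32-bit up-counter overflow after `us`
   microseconds, given `ticks_per_us` counter ticks per microsecond.
   Unsigned arithmetic wraps modulo 2^32, which is exactly the
   "0x0 - n" computation described in the listing's comment. */
static uint32_t ccnt_reload(uint32_t ticks_per_us, uint32_t us)
{
    return (uint32_t)0 - ticks_per_us * us;
}

/* Inverse view: how many ticks remain before a counter loaded with
   `reload` overflows. */
static uint32_t ticks_until_overflow(uint32_t reload)
{
    return (uint32_t)0 - reload;
}
```

With 726 ticks per microsecond and a 10 µs period, ccnt_reload() yields 0xFFFFE3A4, the value defined as SAMPRATE; a slower sampling rate only requires a larger microsecond argument.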

pmu_tbs.c - Program Output

The first line of output from the time-based sampling program, pmu_tbs.c, displays the value of the count variable. Because sampling is done at a 10 µs rate, the count multiplied by 10 µs gives the execution time. The difference between this execution time and the pmu_event execution time is the overhead for time-based sampling, which in this case is 340 µs.

To make sense of the next section of output, use GDB to find out where the functions have been mapped in memory. Then match the functions with the memory addresses at which they reside.

0xa - 0xa            test_memset()   samples or 96.9%
0xa - 0xa00203b8     main()          30 samples or 00.3%
0xa - 0xa002111c     memset()        301 samples or 02.8%

In summary, the total number of samples is the count value shown in the first line of output. The function test_memset() accounted for 96.9% of all samples collected, while the optimized library function memset() accounted for only 2.8%. These two functions do exactly the same work, but memset() does it much more efficiently.

count=10627
PCsample[0000]=0xA x 4257
PCsample[0001]=0xA x 1340
PCsample[0002]=0xA002015C x 185
PCsample[0005]=0xA002016C x 1675
PCsample[0006]=0xA x 1537
PCsample[0008]=0xA x 11
PCsample[0010]=0xA x 214
PCsample[0018]=0xA002014C x 628
PCsample[0074]=0xA x 325
PCsample[0121]=0xA002017C x 23
PCsample[0382]=0xA x 40
PCsample[0383]=0xA x 36
PCsample[0384]=0xA x 15
PCsample[0465]=0xA x 1
PCsample[0482]=0xA x 10
PCsample[0499]=0xA00201E8 x 8
PCsample[0516]=0xA x 10
PCsample[0533]=0xA x 3
PCsample[0550]=0xA00210C8 x 23
PCsample[0567]=0xA00210C4 x 109
PCsample[0601]=0xA00210C0 x 40
PCsample[0618]=0xA00210BC x 92
PCsample[1009]=0xA x 1
PCsample[1026]=0xA x 4
PCsample[1500]=0xA002013C x 6
PCsample[1534]=0xA00201EC x 2
PCsample[2044]=0xA002109C x 2
PCsample[2603]=0xA x 5
PCsample[2620]=0xA00210CC x 7
PCsample[3079]=0xA x 4
PCsample[3096]=0xA002106C x 5

PCsample[3570]=0xA x 1
PCsample[3996]=0xA x 1
PCsample[4673]=0xA00210F8 x 1
PCsample[5149]=0xA00210A0 x 2
PCsample[5640]=0xA x 1
PCsample[5657]=0xA x 1
PCsample[10401]=0xA x 1
PCsample[10469]=0xA00210FC x 1
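Matching sampled PC values against the address ranges reported by GDB can be automated. The following is a minimal sketch of that attribution step, assuming each function is described by an inclusive start and end address. The struct and function names are hypothetical and are not part of the note's sample code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one function's address range. */
struct func_range {
    const char *name;
    uint32_t start;   /* first address of the function, inclusive */
    uint32_t end;     /* last address of the function, inclusive */
    unsigned hits;    /* filled in by attribute_samples() */
};

/* Count how many sampled PC values fall inside each function range.
   Samples that match no range are ignored. */
static void attribute_samples(const uint32_t *pcs, size_t npcs,
                              struct func_range *funcs, size_t nfuncs)
{
    for (size_t f = 0; f < nfuncs; f++)
        funcs[f].hits = 0;

    for (size_t i = 0; i < npcs; i++) {
        for (size_t f = 0; f < nfuncs; f++) {
            if (pcs[i] >= funcs[f].start && pcs[i] <= funcs[f].end) {
                funcs[f].hits++;
                break;  /* ranges are assumed not to overlap */
            }
        }
    }
}

/* Share of all samples, as a percentage. */
static double percent_of(unsigned hits, size_t total)
{
    return total ? 100.0 * hits / (double)total : 0.0;
}
```

On the target, the ranges would be filled in from the addresses GDB reports for test_memset(), main(), and memset(), and the sample array would be PCsample with count entries.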

6.4 Using the PMU to do Event-based Sampling

Event-based sampling is very similar to time-based sampling, except that the processor is interrupted after approximately every 256 events instead of every 10 µs as in the previous example. Each time the event occurs, the PMN1 register is incremented until it overflows, which causes an interrupt. The Link Register value is then collected before returning from the interrupt.

Notice that on line #11, PMN1Event is set to event #0x0B, Data Cache miss. Also notice that on line #12 the counter is preloaded so that the processor is interrupted every 256 events. The processor is not interrupted at every event because the overhead of servicing an interrupt on every Data Cache miss would be higher than the work being done. The handler includes new code to service interrupts that can come from PMN0 and PMN1 overflows. The status of the PMNC register tells where the interrupt came from. The rest of the modifications are there to count events on PMN1, instead of on the CCNT as in the previous example.

Again, notice that the value programmed into the PMNC register has changed from the previous example. Before, the value was set to 0x and now the value being programmed is 0x. The difference is that the event counter interrupts are enabled and PMN0 and PMN1 are not reset to zero.
 1 #include <stdio.h>
 2 #include "80200.h"
 3 #include "80310_fiq_irq.h"
 4 extern inline unsigned _Read_LR(void)
 5 {
 6   register unsigned _val_;
 7   asm volatile("mov\t%0, r14" : "=r" (_val_));
 8   return _val_;
 9 }
10 unsigned int status;
11 unsigned int PMN0Event=0, PMN1Event=0x0B;
12 unsigned int PMN0Rate=0xFFFFFF00, PMN1Rate=0xFFFFFF00;
13 unsigned int count=0, countpmn0=0, countpmn1=0;
14 unsigned int PCsample[0xFFFF], PMN0sample[0xFFFF], PMN1sample[0xFFFF];
15 ISR_SERVICE *fiqval;
16 ISR_SERVICE next_fiq_service;

18 /*************** ISR STUFF *******************/
19 void inthandlerattach( ISR_SERVICE fiq )
20 {
21   fiqval = (ISR_SERVICE*)FIQVECTOR;
22   next_fiq_service = *fiqval;
23   *fiqval = fiq;
24 }

26 void inthandlerdetach()
27 {
28   *FIQVECTOR = (unsigned int)next_fiq_service;
29 }

31 void __attribute__ ((interrupt("fiq"))) fiqhandlerpmu(void)

32 {
33   asm("ldr sp,=0xa ");
34   ISR_PROLOG;
35   status = _Read_INT_Source();
36   if( status & 0x ) /* If the PMU caused the FIQ */
37   {
38     status = _Read_PMNC();
39     if( status & 0x )
40     {
41       PMN0sample[countpmn0] = _Read_LR();
42       countpmn0++;
43       _Write_PMN0( PMN0Rate );
44       _Write_PMNC( status );
45     }
46     else if( status & 0x )
47     {
48       PMN1sample[countpmn1] = _Read_LR();
49       countpmn1++;
50       _Write_PMN1( PMN1Rate );
51       _Write_PMNC( status );
52     }
53     else
54     {
55       PCsample[count] = _Read_LR();
56       count++;
57       _Write_CCNT( samprate );
58       _Write_PMNC( status );
59     }
60   }
61   else
62   {
63     ISR_CHAIN( next_fiq_service ); /* call old vector */
64   }
65   ISR_EPILOG;
66 }

68 /*************** Example Code *****************/
69 char buf[1000];
70 void test_memset( char *buf, char val, int num )
71 {
72   int j;
73   for( j=0; j<num; j++ )
74     buf[j] = val;
75 }

77 int main()
78 {
79   unsigned int i, j, location=1;

81   _Write_CCLKCFG( 9 ); /* 733 MHz */

82   _Write_INT_Steering( _Read_INT_Steering() | 1 ); /* Steer FIQ */
83   inthandlerattach( fiqhandlerpmu ); /* attach the interrupt handler */
84   _Write_CCNT( samprate );
85   _Write_PMN0( PMN0Rate );
86   _Write_PMN1( PMN1Rate );
87   _Write_PMNC( 0x | PMN0Event<<12 | PMN1Event<<20 ); /* Enable the PMU and reset */

89   _Write_INT_Control( _Read_INT_Control() | 5 ); /* Enable FIQ */

91   for( i=0; i<5000; i++ )
92   {
93     test_memset( buf, 'a', 500 );
94     memset( buf, 'a', 500 );
95   }

97   _Write_INT_Control( _Read_INT_Control() & 0xFFFFFFFA ); /* Disable FIQ */

99   _Write_PMNC( 0x ); /* Disable the PMU and reset all */
100   inthandlerdetach();

102   /******************** Display the results ***********************/
103   printf( "count=%u, countpmn0=%u, countpmn1=%u\n", count, countpmn0, countpmn1 );
104   for( i=0, location=1; i<countpmn1; i++ )
105   {
106     if( PMN1sample[i] )
107     {
108       for( j=i+1; j<countpmn1; j++ )
109       {
110         if( PMN1sample[i] == PMN1sample[j] )
111         {
112           location++;
113           PMN1sample[j]=0;
114         }
115       }
116       printf( "PMN1sample[%d]=%08X x %d\n", i, PMN1sample[i]-4, location );
117       location=1;
118     }
119   }
120   return 0;
121 }
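The display loop at the end of the listing collapses repeated sample addresses into one line per address with a repeat count. The sketch below isolates that step as a standalone function and adds the arithmetic for estimating the total number of events from the number of interrupts taken. The function names are hypothetical; the figure of 256 events per interrupt follows from the preload value 0xFFFFFF00 on line #12.

```c
#include <stddef.h>
#include <stdint.h>

/* For each distinct nonzero address in samples[0..n), store its number
   of occurrences in counts[] at the index of its first appearance.
   Later duplicates are zeroed in samples[] and get a count of 0,
   mirroring the display loop in the listing.
   Returns the number of distinct addresses found. */
static size_t collapse_samples(uint32_t *samples, uint32_t *counts, size_t n)
{
    size_t distinct = 0;

    for (size_t i = 0; i < n; i++) {
        counts[i] = 0;
        if (samples[i] == 0)
            continue;               /* empty slot or earlier duplicate */
        counts[i] = 1;
        for (size_t j = i + 1; j < n; j++) {
            if (samples[j] == samples[i]) {
                counts[i]++;
                samples[j] = 0;     /* mark as already counted */
            }
        }
        distinct++;
    }
    return distinct;
}

/* Estimated event total = interrupts taken * events per interrupt. */
static uint64_t estimate_events(uint32_t interrupts, uint32_t events_per_interrupt)
{
    return (uint64_t)interrupts * events_per_interrupt;
}
```

Because a sample is taken only on counter overflow, the estimate is approximate: events that occurred after the last overflow are not included.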

pmu_ebs.c - Program Output

As with time-based sampling, use GDB to find out where the functions are mapped in memory.

0xa002026c - 0xa00202cc     test_memset   3169 samples or 96.3%
0xa00202d0 - 0xa002078c     main          21 samples or 00.6%
0xa - 0xa00214ec            memset        101 samples or 03.1%

This example program samples on Data Cache miss events. Since the Intel XScale microarchitecture is a cached architecture, every Data Cache miss causes a cache line to be reloaded. When these Data Cache misses can be reduced, a significant amount of processor time can be saved simply by modifying the code.

count=106, countpmn0=0, countpmn1=3292
PMN1sample[0]=A00202B8 x 81
PMN1sample[1]=A x 768
PMN1sample[2]=A00202A8 x 6
PMN1sample[3]=A x 24
PMN1sample[4]=A00202B0 x 162
PMN1sample[5]=A00202C0 x 734
PMN1sample[6]=A00202C4 x 754
PMN1sample[13]=A00202B4 x 611
PMN1sample[47]=A00202C8 x 35
PMN1sample[52]=A00214B8 x 5
PMN1sample[56]=A002148C x 30
PMN1sample[162]=A x 41
PMN1sample[207]=A x 6
PMN1sample[415]=A00203B4 x 2
PMN1sample[575]=A00202AC x 12
PMN1sample[578]=A x 8
PMN1sample[733]=A x 7
PMN1sample[884]=A00203B0 x 4
PMN1sample[1463]=A x 1

7.0 Combination PMU Event Sampling

The following section is copied from the Intel Processor based on Intel XScale Microarchitecture Developer's Manual:

Table 6. Some Common Uses of the PMU

Mode                            PMNC.evtCount0                  PMNC.evtCount1
Instruction Cache Efficiency    0x7 (instruction count)         0x0 (ICache miss)
Data Cache Efficiency           0xA (Dcache access)             0xB (DCache miss)
Instruction Fetch Latency       0x1 (ICache cannot deliver)     0x0 (ICache miss)
Data/Bus Request Buffer Full    0x8 (DBuffer stall duration)    0x9 (DBuffer stall)
Stall/Writeback Statistics      0x2 (data stall)                0xC (DCache writeback)
Instruction TLB Efficiency      0x7 (instruction count)         0x3 (ITLB miss)
Data TLB Efficiency             0xA (Dcache access)             0x4 (DTLB miss)

7.1 Instruction Cache Efficiency Mode

PMN0 totals the number of instructions that were executed, which does not include instructions fetched from the instruction cache that were never executed. This can happen when a branch instruction changes the program flow; the instruction cache may retrieve the next sequential instructions after the branch before it receives the target address of the branch.

PMN1 counts the number of instruction fetch requests to external memory. Each of these requests loads 32 bytes at a time.

Statistics derived from these two events:
• Instruction cache miss-rate. This is derived by dividing PMN1 by PMN0.
• The average number of cycles it took to execute an instruction, commonly referred to as cycles-per-instruction (CPI). CPI can be derived by dividing CCNT by PMN0, where CCNT was used to measure total execution time.

7.2 Data Cache Efficiency Mode

PMN0 totals the number of data cache accesses, which includes cacheable and non-cacheable accesses, mini-data cache accesses, and accesses made to locations configured as data RAM.

Note: STM and LDM each count as several accesses to the data cache, depending on the number of registers specified in the register list. LDRD registers two accesses.

PMN1 counts the number of data cache and mini-data cache misses. Cache operations do not contribute to this count. See Section of the Intel Processor based on Intel XScale Microarchitecture Developer's Manual for a description of these operations.

The statistic derived from these two events is:
• Data cache miss-rate. This is derived by dividing PMN1 by PMN0.
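The cache efficiency statistics above are simple ratios of counter values. The sketch below expresses them as functions, assuming the three counters were read after the measured code finished and did not overflow. The function names are hypothetical, and the zero-denominator handling is an illustrative choice.

```c
#include <stdint.h>

/* Miss rate = PMN1 / PMN0.
   In instruction cache efficiency mode PMN0 is the instruction count;
   in data cache efficiency mode PMN0 is the data cache access count.
   Returns 0.0 when nothing was counted in PMN0. */
static double miss_rate(uint32_t pmn1_misses, uint32_t pmn0_total)
{
    return pmn0_total ? (double)pmn1_misses / (double)pmn0_total : 0.0;
}

/* Cycles per instruction = CCNT / PMN0, valid when PMN0 is counting
   event 0x7 (instruction count). */
static double cycles_per_instruction(uint32_t ccnt, uint32_t pmn0_instructions)
{
    return pmn0_instructions ? (double)ccnt / (double)pmn0_instructions : 0.0;
}
```

For example, 25 misses over 100 counted accesses gives a miss rate of 0.25, and 300 cycles over 200 instructions gives a CPI of 1.5.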

7.3 Instruction Fetch Latency Mode

PMN0 accumulates the number of cycles when the instruction cache is not able to deliver an instruction to the processor core due to an instruction-cache miss or instruction-TLB miss. This event means that the processor core is stalled.

PMN1 counts the number of instruction fetch requests to external memory. Each of these requests loads 32 bytes at a time. This is the same event as measured in instruction cache efficiency mode and is included in this mode for convenience so that only one performance monitoring run is needed.

Statistics derived from these two events:
• The average number of cycles the processor stalled waiting for an instruction fetch from external memory to return. This is calculated by dividing PMN0 by PMN1. When the average is high, the processor may be starved of the bus external to the 80200.
• The percentage of total execution cycles the processor stalled waiting on an instruction fetch from external memory to return. This is calculated by dividing PMN0 by CCNT, which was used to measure total execution time.

7.4 Data/Bus Request Buffer Full Mode

The Data Cache has buffers available to service cache misses or uncacheable accesses. For every memory request that the Data Cache receives from the processor core, a buffer is speculatively allocated in case an external memory request is required or temporary storage is needed for an unaligned access. When no buffers are available, the Data Cache stalls the processor core. How often the Data Cache stalls depends on the performance of the bus external to the 80200 and on the memory access latency for Data Cache miss requests to external memory. When the memory access latency is high, possibly due to starvation, these Data Cache buffers become full.

This performance monitoring mode is provided to see when the processor is being starved of the bus external to the 80200, which affects the performance of the application running on it. PMN0 accumulates the number of clock cycles the processor is stalled due to this condition and PMN1 counts the number of times this condition occurs.

Statistics derived from these two events:
• The average number of cycles the processor stalled on a data-cache access that may overflow the data-cache buffers. This is calculated by dividing PMN0 by PMN1. This statistic shows whether the stall cycles are spread over many requests or are attributable to just a few requests. When the average is high, the processor may be starved of the bus external to the 80200.
• The percentage of total execution cycles the processor stalled because a Data Cache request buffer was not available. This is calculated by dividing PMN0 by CCNT, which was used to measure total execution time.
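In the two stall-oriented modes above, PMN0 holds a cycle count and PMN1 holds an occurrence count, so the derived statistics are an average duration and a share of total time. The following sketch shows both calculations; the function names are hypothetical and the counters are assumed not to have overflowed during the measurement.

```c
#include <stdint.h>

/* Average stall length in cycles = PMN0 (stall cycles) / PMN1 (stall events).
   Returns 0.0 when no stall events were recorded. */
static double average_stall_cycles(uint32_t pmn0_stall_cycles,
                                   uint32_t pmn1_stall_events)
{
    return pmn1_stall_events
         ? (double)pmn0_stall_cycles / (double)pmn1_stall_events
         : 0.0;
}

/* Share of execution time spent stalled, in percent = 100 * PMN0 / CCNT. */
static double stall_percent(uint32_t pmn0_stall_cycles, uint32_t ccnt)
{
    return ccnt ? 100.0 * (double)pmn0_stall_cycles / (double)ccnt : 0.0;
}
```

A high average with a low percentage points to a few long stalls, while a low average with a high percentage points to many short ones.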

7.5 Stall/Writeback Statistics

When an instruction requires the result of a previous instruction and that result is not yet available, the processor stalls in order to preserve the correct data dependencies. PMN0 counts the number of stall cycles due to data dependencies. Not all data dependencies cause a stall; only the following dependencies cause such a stall penalty:
• Load-use penalty: attempting to use the result of a load before the load completes. To avoid the penalty, software should delay using the result of a load until it is available. This penalty shows the latency effect of data-cache access.
• Multiply/Accumulate-use penalty: attempting to use the result of a multiply or multiply-accumulate operation before the operation completes. Again, to avoid the penalty, software should delay using the result until it is available.
• ALU use penalty: there are a few isolated cases where back-to-back ALU operations may result in a one-cycle delay in execution. These cases are defined in Chapter 14, Performance Considerations, of the Intel Processor based on Intel XScale Microarchitecture Developer's Manual.

PMN1 counts the number of writeback operations emitted by the data cache. These writebacks occur when the data cache evicts a dirty line of data to make room for a newly requested line, or as the result of a clean operation (CP15, register 7).

Statistics derived from these two events:
• The percentage of total execution cycles the processor stalled because of a data dependency. This is calculated by dividing PMN0 by CCNT, which was used to measure total execution time. Often a compiler can reschedule code to avoid these penalties when given the right optimization switches.
• The total number of data writeback requests to external memory, which can be derived solely from PMN1.

7.6 Instruction TLB Efficiency Mode

PMN0 totals the number of instructions that were executed, which does not include instructions that were translated by the instruction TLB and never executed. This can happen when a branch instruction changes the program flow; the instruction TLB may translate the next sequential instructions after the branch before it receives the target address of the branch.

PMN1 counts the number of instruction TLB table-walks, which occur when there is a TLB miss. When the instruction TLB is disabled, PMN1 does not increment.

Statistics derived from these two events:
• Instruction TLB miss-rate. This is derived by dividing PMN1 by PMN0.
• The average number of cycles it took to execute an instruction, commonly referred to as cycles-per-instruction (CPI). CPI can be derived by dividing CCNT by PMN0, where CCNT was used to measure total execution time.

7.7 Data TLB Efficiency Mode

PMN0 totals the number of data cache accesses, which includes cacheable and non-cacheable accesses, mini-data cache accesses, and accesses made to locations configured as data RAM.

Note: STM and LDM each count as several accesses to the data TLB, depending on the number of registers specified in the register list. LDRD registers two accesses.

PMN1 counts the number of data TLB table-walks, which occur when there is a TLB miss. When the data TLB is disabled, PMN1 does not increment.

The statistic derived from these two events is:
• Data TLB miss-rate. This is derived by dividing PMN1 by PMN0.

8.0 Performance Analysis Tools

8.1 GNU gprof

GNU gprof is a free profiler that comes with the GNU tools. GNU gprof uses time-based sampling to collect its data. Intel XScale microarchitecture support has been added to the RedHat GNUPro version xscale. At the time of this writing, the GNUPro version that supports gprof was not publicly available, so please check for the latest version available. Some Linux versions already have gprof support for the Intel XScale microarchitecture, so please check the version to be sure.

How to use gprof

The following tutorial is based on the example program source code shown in Section 6.0, An Example Program - memtest.c on page 16. Using the GNUPro version that supports gprof, execute the following command:

bash$ xscale-elf-gcc -pg -g -O0 -specs=iq80310.specs memtest.c -o memtest.elf

• The -pg option enables the collection of profiling data.
• The -g option creates DWARF2 debugging information for gdb and gprof.
• The -O0 option turns off optimization so the explanation of the example is straightforward.
• The -specs=iq80310.specs option indicates which startup files and support libraries are needed to generate IQ80310 code. Use -specs=redboot.specs for the IQ

Now connect to the target with gdb and download the memtest.elf program:

bash$ xscale-elf-gdb -nw memtest.elf

The -nw option tells gdb not to use the GUI version of gdb.

When using the serial port, type at the gdb prompt:

(gdb) set remotebaud
(gdb) target remote com2 (or whatever com port)

When using the Ethernet port, type at the gdb prompt:

(gdb) target remote :9000 (or whatever IP and port)

Once connected to the target and a Connected message is displayed, type the commands:

(gdb) load
(gdb) break exit
(gdb) continue

The memtest.elf program now runs on the IQ80310 target and collects performance data until it hits the breakpoint in the exit routine.

Then type at the gdb prompt:

(gdb) gmon save gmon.out
(gdb) quit

GDB has now written the profiling data to a file called gmon.out on the host system. To analyze the performance data, type:

bash$ xscale-elf-gprof memtest.elf > memtest.res

gprof has now generated a results file on the host system called memtest.res.

Interpreting the results of gprof

Table 7 shows the result file generated by gprof for memtest.elf.

Table 7. memtest.elf - gprof Result File

Each sample counts as 0.01 seconds.

Columns: % Time, Cumulative Seconds, Self Seconds, Calls, Self ms/call, Total ms/call, Name

Rows, in order of the Name column: test_memset, memset, mcount_internal, atexit, do_global_ctors_aux, get_memtop, exit, frame_dummy, main

The first line of data shows the function test_memset taking 77% of the 33 seconds it took to run the program. This function is clearly the slowest to run, and the source code shows that it is a simple byte-by-byte store in a for loop. This is where the developer needs to spend time optimizing the code to make this function run as efficiently as possible.

The second line shows the standard C library function memset. This library call has been optimized and takes only about 7 seconds to run, or about 22% of the CPU's time. Both the test_memset and memset functions are called 5000 times, as shown in the Calls column. This suggests a simple way to fix the performance of the first function: call an optimized standard library function instead of writing one.

The third line is for mcount_internal. This is the internal data collection part of gprof and does not exist when the application is not being profiled with gprof. The remaining lines took 0.00 seconds to run, so they are of little concern.

8.2 Intel VTune Performance Analyzer Version 6.0

The Intel VTune Performance Analyzer is a graphical performance profiler originally written for the Intel Architecture (x86 instruction set) line of microprocessors. Intel VTune is capable of time-based sampling, event-based sampling, counter monitoring, and call graphing, and also has a built-in Intel Tuning Assistant that automatically suggests code improvements. Intel VTune for the Intel XScale microarchitecture is coming soon, so please check our website for availability.

How to use the Intel VTune Analyzer

Intel VTune works differently on an embedded system than on a Windows system. Set up two systems: the target system running on the Intel XScale microarchitecture and a host system running the Windows OS. Collect data on the target and transfer the data to the Windows host. There is likely to be a batch file that takes care of the details of getting data onto the host machine.

This is an example of using Intel VTune, version 5.0, for performance profiling on a native IA-32 processor running Windows.

Figure 2. Intel VTune Performance Profiling Module View

This is the module view of a profiling session done with Intel VTune on Windows. Notice that Intel VTune has collected data on the operating system as well as on the target application, called vtunedemo. The largest bar is for ntoskrnl.exe, and hal.dll (the hardware abstraction layer) is also taking some CPU time. The module of interest, however, is the demo application, vtunedemo.exe. Double-clicking the vtunedemo.exe bar zooms into a HotSpot view of vtunedemo.exe by function.


Figure 3. Intel VTune Performance Profiling HotSpot View of vtunedemo.exe by Function

This hotspot view shows the percentage of CPU time that function calls are taking; the red bar to the left marks the function that is taking the most CPU time. Double-clicking on the red bar drills down further into the source code view.

Figure 4. Intel VTune Performance Profiling Source Code View

The source code view shows the exact line of code that takes the most CPU cycles during the run of the application. This code is where the developer should focus optimization efforts.
