Performance Profiling Techniques on Intel XScale Microarchitecture Processors


Application Note
August 2002
Document Number:

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, life sustaining applications.

Intel may make changes to specifications and product descriptions at any time, without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked reserved or undefined. Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

The Intel XScale microarchitecture processors may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an ordering number and are referenced in this document, or other Intel literature may be obtained by calling or by visiting Intel's website at

Copyright Intel Corporation, 2002

AlertVIEW, i960, AnyPoint, AppChoice, BoardWatch, BunnyPeople, CablePort, Celeron, Chips, Commerce Cart, CT Connect, CT Media, Dialogic, DM3, EtherExpress, ETOX, FlashFile, GatherRound, i386, i486, icat, icomp, Insight960, InstantIP, Intel, Intel logo, Intel386, Intel486, Intel740, IntelDX2, IntelDX4, IntelSX2, Intel ChatPad, Intel Create&Share, Intel Dot.Station, Intel GigaBlade, Intel InBusiness, Intel Inside, Intel Inside logo, Intel NetBurst, Intel NetStructure, Intel Play, Intel Play logo, Intel Pocket Concert, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel TeamStation, Intel WebOutfitter, Intel Xeon, Intel XScale, Itanium, JobAnalyst, LANDesk, LanRover, MCS, MMX, MMX logo, NetPort, NetportExpress, Optimizer logo, OverDrive, Paragon, PC Dads, PC Parents, Pentium, Pentium II Xeon, Pentium III Xeon, Performance at Your Command, ProShare, RemoteExpress, Screamline, Shiva, SmartDie, Solutions960, Sound Mark, StorageExpress, The Computer Inside, The Journey Inside, This Way In, TokenExpress, Trillium, Vivonic, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

*Other names and brands may be claimed as the property of others.

Contents

Foreword
Introduction
PMU Registers on Intel XScale Core
    Clock Counter (CCNT)
    Performance Monitor Count Register (PMN0 and PMN1)
    Performance Monitor Control Register (PMNC)
Time-Based Sampling and Tuning Strategies
PMU Event-Based Sampling
An Example Program - memtest.c
    Using CCNT as a Timer - PMU.C
        PMU.C Program Output
    Using PMN0 and PMN1 Registers to Count PMU Events - pmu_event.c
        pmu_event.c - Program Output
    Using the PMU for Time-based Sampling
        pmu_tbs.c - Program Output
    Using the PMU to do Event-based Sampling
        pmu_ebs.c - Program Output
Combination PMU Event Sampling
    Instruction Cache Efficiency Mode
    Data Cache Efficiency Mode
    Instruction Fetch Latency Mode
    Data/Bus Request Buffer Full Mode
    Stall/Writeback Statistics
    Instruction TLB Efficiency Mode
    Data TLB Efficiency Mode
Performance Analysis Tools
    GNU gprof
        How to use gprof
        Interpreting the results of gprof
    Intel VTune Performance Analyzer Version
        How to use the Intel VTune Analyzer
Some Sampling Tips and Aids
    Skid
    Writing to the PMU Registers
    Resetting the PMNC Register
    Intel Processor Errata that References the PMU
    Power Management Affects the PMU
Conclusion
A   Source to 80200.h
B   Source to 80310_fiq_irq.h

Figures

Intel Processor based on Intel XScale Microarchitecture
Intel VTune Performance Profiling Module View
Intel VTune Performance Profiling HotSpot View of vtunedemo.exe by Function
Intel VTune Performance Profiling Source Code View

Tables

Clock Counter Timing Data
Clock Count Register (CCNT)
Performance Monitor Count Register (PMN0 and PMN1)
Performance Monitor Control Register (CP14, register 0)
Performance Monitoring Events
Some Common Uses of the PMU
memtest.elf - gprof Result File

Revision History

Date      Description
August    Miscellaneous typo corrections.
June      Added Table 1. Added new Section 8.0. Various text updates.
March     Initial Release.


1.0 Foreword

The purpose of this paper is to familiarize the reader with the Performance Monitoring Unit (PMU) of the Intel XScale microarchitecture (ARM* architecture compliant) and with techniques for using the PMU for performance profiling. This paper assumes that a development environment has already been set up, consisting of an Intel IQ80310 Evaluation Platform Board (IQ80310) or Intel IQ80321 Evaluation Platform Board (IQ80321) and a development system with GNUPro* installed, and that the reader has already downloaded, run and debugged the hello world sample program with GDB*. This paper also assumes previous programming experience and knowledge of the Intel XScale microarchitecture.

2.0 Introduction

Programming the PMU on an Intel XScale microarchitecture processor is a very simple exercise, especially when using the clock counter as a timer. Just a few simple instructions set up the PMU to time the application or sections within the code.

One of the first things to do when analyzing the performance of software is to make sure the hardware is not the bottleneck. For instance, when an application is waiting on the network, optimizing code makes an insignificant difference. The operator needs to verify that the system is not bound by:

  - network: upgrade to Gigabit Ethernet on the test system.
  - hard disk: use the fastest, best performing disks available.
  - memory: consider adding more memory to the test system.
  - less than optimum processor speed: upgrade to the highest speed Intel XScale microarchitecture processor for the test system.
  - other system factors.

Note: Be careful to change only one thing at a time and to record results for future reference.

Next, use performance analysis tools to check that the target hardware is not the limiting factor. Which tools are available depends on the development environment. For example, the following are some of the tools available to optimize code:

  - GNU* gprof*
  - ARM* ARMprof*
  - Intel VTune Performance Analyzer
  - WindRiver* WindView*
  - LynuxWorks* SpyKer*

Normally, when developing a proprietary operating system or using an OS without supported tools, the operator has to instrument the source code by hand. On the Intel XScale microarchitecture, however, the PMU can be used to instrument the code.

Intel XScale microarchitecture includes hardware that helps collect performance data with minimal data-gathering overhead. This feature is called the Performance Monitoring Unit (PMU). The PMU consists of a set of counters that can be used to collect data on the performance or timing of the application. One counter counts core cycles, for measuring total execution time or elapsed time. Two additional counters can count specific processor events such as cache misses.

Figure 1. Intel 80200 Processor based on Intel XScale Microarchitecture
[Block diagram. Blocks shown: Instruction Cache (32 Kbytes, 32 ways, lockable by line); Data Cache (max 32 Kbytes, 32 ways, write-back or write-through, hit under miss); Data RAM (max 28 Kbytes, re-map of data cache); Mini-Data Cache (2 Kbytes, 2 ways); Branch Target Buffer; IMMU and DMMU (each a 32 entry TLB, fully associative, lockable by entry); Fill Buffer (4-8 entries); Performance Monitoring; Debug (hardware breakpoint, branch history table); Power Management (idle, drowsy, sleep); MAC (single cycle throughput (16*32), 16-bit SIMD, 40-bit accumulator); Write Buffer (8 entries, full coalescing); JTAG; Interrupt Controller (interrupt masking, FIQ/IRQ steering, pend register); Bus Controller (1 Gbyte/sec, pipelined, de-multiplexed, ECC protection).]

The PMU registers can be used to time the execution of specific routines or the overall execution time of an application. The timer can also be used to do hot-spot analysis. Hot-spot analysis shows how much time the processor spends on each part of the program. It works by stopping the processor at a regular interval (such as 1 ms) and recording which instruction the processor was executing at that moment. This is called time-based sampling. Building up a large sample of execution data (3000 or more records) yields a statistically significant picture of where the processor spends most of its time. These hot-spots are the best places to spend optimization effort.

The PMU can also collect CPU event data. These events are described in detail in Chapter 5.0, PMU Event-Based Sampling.
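The hot-spot idea described above can be sketched in portable C. This is an illustrative sketch only, not code from this application note: the `func_range` structure, the function names and the address ranges are hypothetical, and a real profiler would take the ranges from the linker map or symbol table. Each saved program-counter sample is attributed to the function whose address range contains it, and the function with the most hits is the hot-spot.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one function's address range [start, end). */
struct func_range {
    const char *name;
    uint32_t    start;
    uint32_t    end;
    unsigned    hits;   /* number of PC samples that fell in this range */
};

/* Attribute each sampled PC to a function; returns the number of
 * samples that matched no known range (e.g. library or OS code). */
size_t attribute_samples(const uint32_t *pcs, size_t n,
                         struct func_range *funcs, size_t nfuncs)
{
    size_t unmatched = 0;
    for (size_t i = 0; i < n; i++) {
        size_t f;
        for (f = 0; f < nfuncs; f++) {
            if (pcs[i] >= funcs[f].start && pcs[i] < funcs[f].end) {
                funcs[f].hits++;
                break;
            }
        }
        if (f == nfuncs)
            unmatched++;
    }
    return unmatched;
}

/* Index of the function with the most samples (0 if nfuncs is 0). */
size_t hottest_function(const struct func_range *funcs, size_t nfuncs)
{
    size_t best = 0;
    for (size_t f = 1; f < nfuncs; f++)
        if (funcs[f].hits > funcs[best].hits)
            best = f;
    return best;
}
```

With a few thousand samples, the ratio of `hits` to the total sample count approximates the fraction of run time spent in each function.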

3.0 PMU Registers on Intel XScale Core

Note: See chapter 12 in the Intel 80200 Processor based on Intel XScale Microarchitecture Developer's Manual for more details.

The PMU consists of four registers:

  - a clock counter register.
  - two event-counting registers.
  - a configuration register used to configure the three counting registers.

3.1 Clock Counter (CCNT)

The clock counter can be used as a timer, to measure the execution time of a particular routine. It is a 32-bit counter that can interrupt the processor at roll-over. The clock counter counts core clock cycles, which on the Intel 80200 Processor based on Intel XScale Microarchitecture (80200) can run from 200 MHz to 733 MHz, depending on the core clock multiplier. At 200 MHz, the counter counts from 0 to 0xFFFF FFFF and rolls over to 0x0 in about 21.5 seconds. At 733 MHz, the counter rolls over in about 5.9 seconds, with a resolution of about 1.4 ns. There is also a divider for the clock counter; with it enabled, the counter increments once every 64 core clock cycles. With the divider enabled and the core clock running at 733 MHz, the clock counter rolls over in about 375 seconds, with a resolution of about 87 ns.

The core clock counted by the clock counter is derived from the reference clock to the Intel XScale core. For the IQ80310, the reference clock is exactly 66 MHz. According to Table 1, a CCLKCFG value of nine gives a CLK multiplier of 11. Therefore, 66 MHz times a CLK multiplier of 11 equals a 726 MHz core clock speed.

Table 1. Clock Counter Timing Data
Columns: CCLKCFG [3:0] (Coprocessor 14, register 6) | Multiplier for CCLK | CCLK (MHz) | No Divider: CCNT Rollover (secs), Resolution (ns) | 64 Clock Divider: CCNT Rollover (secs), Resolution (ns)

Note: The clock speed is also dependent upon operating voltage.
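The rollover and resolution figures quoted in this section follow directly from the counter width and the core clock. The short C sketch below reproduces that arithmetic; the function names are illustrative and not part of any Intel header. The `divider_on` argument stands for the divide-by-64 option described above.

```c
#include <stdint.h>

/* Seconds until the 32-bit CCNT wraps, for a core clock given in Hz.
 * divider_on != 0 models the option where CCNT counts every 64th cycle. */
double ccnt_rollover_seconds(double core_hz, int divider_on)
{
    double counts_per_second = divider_on ? core_hz / 64.0 : core_hz;
    return 4294967296.0 / counts_per_second;   /* 2^32 counts per wrap */
}

/* Time represented by one CCNT increment, in nanoseconds. */
double ccnt_resolution_ns(double core_hz, int divider_on)
{
    double counts_per_second = divider_on ? core_hz / 64.0 : core_hz;
    return 1.0e9 / counts_per_second;
}
```

For a 733 MHz core clock, `ccnt_rollover_seconds(733e6, 0)` is a little under 5.9 seconds and `ccnt_rollover_seconds(733e6, 1)` is about 375 seconds, in line with the values in the text.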

The clock counter can also be used for hot-spot analysis using time-based sampling. This is done by stopping the processor at a regularly timed interval, approximately every 1 ms. To achieve this, program the CCNT register with a value that is 1 ms worth of core clock cycles short of the counter rollover. When the clock counter rolls over, it interrupts the processor exactly 1 ms after the CCNT register started counting. The equation to calculate a 1 ms sampling rate, assuming an IQ80310, is:

CCNT Value = 0x0 - ( ( CCLKCFG + 2 ) * reference clock * sampling rate in seconds )
or
CCNT Value = 0x0 - ( ( 9 + 2 ) * 66,000,000 Hz * 0.001 secs. )
CCNT Value = 0x0 - ( 726,000 )
CCNT Value = 0xFFF4 EC10

Table 2. Clock Count Register (CCNT)
Clock Counter reset value: unpredictable

Bits   Access         Description
31:0   Read / Write   32-bit clock counter - Reset to 0 by PMNC register. When the clock counter reaches its maximum value 0xFFFF,FFFF, the next cycle causes it to roll over to zero and generate an IRQ or FIQ when enabled.

3.2 Performance Monitor Count Register (PMN0 and PMN1)

The PMN0 and PMN1 registers are similar to the CCNT register, except that they are incremented on PMU events instead of core clocks. Examples of PMU events that can be counted are:

  - cache misses
  - TLB misses
  - branch mispredictions

The event counters do not count OS events like context switches and memory region accesses.

Like the CCNT register, these counters can be used to count how many times a CPU event occurs over a run, for instance the total number of stalls caused by the data cache buffers being full while the application runs. PMU events can also be used to do event-based sampling, where the processor is stopped at a cache miss and the current value of the Program Counter is saved. Given enough data to be statistically significant, event-based sampling can show which specific instructions of the application are the most cache inefficient.

Table 3. Performance Monitor Count Register (PMN0 and PMN1)
Event Counter reset value: unpredictable

Bits   Access         Description
31:0   Read / Write   32-bit event counter - Reset to 0 by PMNC register. When an event counter reaches its maximum value 0xFFFF,FFFF, the next event it needs to count causes it to roll over to zero and generate an IRQ or FIQ interrupt when enabled.
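The reload-value equation from Section 3.1 can be checked with a few lines of C. This is a minimal sketch with illustrative function names; it assumes the sampling period is short enough that the cycle count fits in 32 bits, and it takes the relation multiplier = CCLKCFG + 2 from the equation in the text.

```c
#include <stdint.h>

/* Core clock in Hz from the CCLKCFG value and the reference clock,
 * using the (CCLKCFG + 2) multiplier relation from the text. */
uint32_t core_clock_hz(uint32_t cclkcfg, uint32_t ref_clock_hz)
{
    return (cclkcfg + 2u) * ref_clock_hz;
}

/* Value to load into CCNT so that it overflows after period_us
 * microseconds: 0x0 minus the number of core cycles in the period. */
uint32_t ccnt_reload_value(uint32_t core_hz, uint32_t period_us)
{
    uint64_t cycles = (uint64_t)core_hz * period_us / 1000000u;
    return (uint32_t)(0u - (uint32_t)cycles);   /* wraps modulo 2^32 */
}
```

For a 726 MHz core clock, a 1000 µs period gives 0xFFF4EC10, the value derived above, and a 10 µs period gives 0xFFFFE3A4.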

3.3 Performance Monitor Control Register (PMNC)

The PMNC is used to set up the PMU. This register controls the events that PMN0 and PMN1 monitor. The PMU can count core clock cycles and PMU events, or interrupt the processor at a counter rollover. The PMNC register also tracks which counter has overflowed. In order to have the PMU trigger an interrupt to the processor, a programmer must:

  - enable the counters with the (E) bit of the PMNC.
  - enable interrupts with the (inten) bits of the PMNC.
  - select an event from the event list in Table 12-4 of the Intel 80200 Processor based on Intel XScale Microarchitecture Developer's Manual.
  - make sure interrupts are enabled in the INTCTL and CPSR registers.

Note: The IQ80310 and IQ80321 evaluation boards use slightly different interrupt routing paths on the boards. When using the IQ80310, please refer to the Intel IQ80310 Evaluation Platform Board Manual, chapter 5. When using the IQ80321, please refer to the Intel IQ80321 Evaluation Platform Board Manual, chapter 10.

Table 4. Performance Monitor Control Register (CP14, register 0)
Fields, from bit 31 down to bit 0: reserved | evtcount1 | evtcount0 | reserved | flag | reserved | inten | D | C | P | E
Reset value: E and inten are 0, others unpredictable

Bits    Access                            Description
31:28   Read-unpredictable / Write-as-0   Reserved
27:20   Read / Write                      Event Count1 - identifies the source of events that PMN1 counts. See Table 5 for a description of the values this field may contain.
19:12   Read / Write                      Event Count0 - identifies the source of events that PMN0 counts. See Table 5 for a description of the values this field may contain.
11      Read-unpredictable / Write-as-0   Reserved
10:8    Read / Write                      Overflow/Interrupt Flag - identifies which counter overflowed
                                            Bit 10 = clock counter overflow flag
                                            Bit 9 = performance counter 1 overflow flag
                                            Bit 8 = performance counter 0 overflow flag
                                            Read Values: 0 = no overflow, 1 = overflow has occurred
                                            Write Values: 0 = no change, 1 = clear this bit
7       Read-unpredictable / Write-as-0   Reserved
6:4     Read / Write                      Interrupt Enable - used to enable/disable interrupt reporting for each counter
                                            Bit 6 = clock counter interrupt enable
                                            Bit 5 = performance counter 1 interrupt enable
                                            Bit 4 = performance counter 0 interrupt enable
                                            For each bit: 0 = disable interrupt, 1 = enable interrupt
3       Read / Write                      Clock Counter Divider (D) - 0 = CCNT counts every processor clock cycle, 1 = CCNT counts every 64th processor clock cycle
2       Read-unpredictable / Write        Clock Counter Reset (C) - 0 = no action, 1 = reset the clock counter to 0x0
1       Read-unpredictable / Write        Performance Counter Reset (P) - 0 = no action, 1 = reset both performance counters to 0x0
0       Read / Write                      Enable (E) - 0 = all 3 counters are disabled, 1 = all 3 counters are enabled

The interrupt control register may have to be programmed in order to get Time-Based Sampling and Event-Based Sampling working. The INTSTR register also needs to be set up correctly. Please refer to chapter 9 of the Intel 80200 Processor based on Intel XScale Microarchitecture Developer's Manual.
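The field layout in Table 4 can be expressed as a small helper that assembles a PMNC value from its parts. The macro and function names below are illustrative assumptions, not definitions from an Intel header; only the bit positions come from Table 4. Reserved bits are always written as 0.

```c
#include <stdint.h>

/* Bit positions taken from Table 4; the names are illustrative only. */
#define PMNC_E           (1u << 0)   /* enable all three counters        */
#define PMNC_P           (1u << 1)   /* reset PMN0 and PMN1 to 0         */
#define PMNC_C           (1u << 2)   /* reset CCNT to 0                  */
#define PMNC_D           (1u << 3)   /* CCNT counts every 64th cycle     */
#define PMNC_INTEN_PMN0  (1u << 4)   /* interrupt on PMN0 overflow       */
#define PMNC_INTEN_PMN1  (1u << 5)   /* interrupt on PMN1 overflow       */
#define PMNC_INTEN_CCNT  (1u << 6)   /* interrupt on CCNT overflow       */
#define PMNC_FLAG_ALL    (7u << 8)   /* write 1s to clear overflow flags */

/* Compose a PMNC value: event numbers go in bits 19:12 (evtcount0) and
 * 27:20 (evtcount1); control_bits is an OR of the macros above.
 * The 0x77F mask keeps reserved bits 7 and 11 at zero. */
uint32_t pmnc_value(unsigned evtcount0, unsigned evtcount1,
                    uint32_t control_bits)
{
    return ((uint32_t)(evtcount1 & 0xFFu) << 20)
         | ((uint32_t)(evtcount0 & 0xFFu) << 12)
         | (control_bits & 0x77Fu);
}
```

As a usage example, a configuration with both event fields at 0, all three overflow flags cleared, interrupts and the divider disabled, both counter resets requested and the counters enabled is `pmnc_value(0, 0, PMNC_FLAG_ALL | PMNC_C | PMNC_P | PMNC_E)`, which evaluates to 0x00000707.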

4.0 Time-Based Sampling and Tuning Strategies

Time-based sampling is just one of a few ways to extract performance data from an embedded target. Sampling can also be based on OS events or PMU events. Many performance analysis tools are based on OS events such as task switches. OS event data can be gathered by instrumenting the context switch of the OS. Another method of performance data collection is to collect data when a PMU event occurs, such as a cache miss. This method is described in the next chapter.

Time-based sampling is the method of analysis in which the processor is stopped at a regularly timed interval (such as 1 ms) and performance data is collected. This is not simply timing the elapsed time of the application and comparing it to a previous run, but actually interrupting the processor every 1 ms while the application is running in a steady state. After collecting data for a significant amount of time, say 15 seconds, 15,000 samples are available that statistically profile the execution performance of the application. Time-based sampling is therefore like taking a large number of snapshots of CPU activity for performance analysis.

One of the biggest benefits of time-based sampling is that it is minimally intrusive. Being minimally intrusive is crucial to getting valid performance data: the gathering of data must not interfere with the normal operation of the system, and the process of data collection should not introduce additional errors. Sampling can be extended over long periods of time to collect more samples and obtain a statistically significant amount of performance data. The concept is to gather data on where the processor is spending a significant amount of its time and not to worry about edge cases. The best return on investment for performance enhancement is to find where the CPU hot-spots are.

When choosing to optimize with sampling strategies, time-based sampling ought to be the first and most often used strategy. It gives an overall view of how the application is running and where time needs to be spent tuning code. Once the code that is hindering the performance of the processor (the hot-spot) has been optimized satisfactorily, the next slowest section of code can be tuned or event-based sampling can be tried.

Some shortcomings of time-based sampling are:

  - Only a statistical picture of the system performance is given. In other words, when a particular line or section of code is of interest, data for that section may or may not show up, depending on how quickly that section executes.
  - Time-based sampling gives a good overall picture, but does not show how to take full advantage of the Intel XScale microarchitecture. Intel XScale microarchitecture processors have a 32 K instruction cache, and performance is highly dependent on reducing the cache miss rate. It can cost over 80 core clock cycles to fill a cache line after a cache miss, because the core is running at 733 MHz and has to go out to relatively slow (100 MHz) SDRAM.

PMU event-based sampling can pinpoint the instructions that are causing cache misses and allow the program to be modified to avoid these costly cache misses. See the optimization guide in Appendix B of the Intel 80200 Processor based on Intel XScale Microarchitecture Developer's Manual for more details.
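To put the 80-cycle miss penalty in perspective, the arithmetic can be sketched as follows. This is a rough upper-bound estimate under stated assumptions, not a method from this application note: it treats every counted miss as a full stall of `penalty_cycles` core cycles and ignores any overlap between misses and useful work. The function names are illustrative.

```c
#include <stdint.h>

/* Duration of a number of core cycles in nanoseconds, for a core
 * clock given in MHz. */
double cycles_to_ns(uint32_t cycles, double core_mhz)
{
    return (double)cycles * 1000.0 / core_mhz;
}

/* Rough fraction of a run spent waiting on cache misses, given the
 * miss count from an event counter and the cycle count from CCNT. */
double miss_stall_fraction(uint32_t miss_events, uint32_t total_cycles,
                           uint32_t penalty_cycles)
{
    if (total_cycles == 0)
        return 0.0;
    return ((double)miss_events * (double)penalty_cycles)
           / (double)total_cycles;
}
```

At 733 MHz an 80-cycle miss lasts roughly 109 ns, and a run of 1,000,000 cycles containing 1,000 misses would spend up to about 8% of its time stalled under this simple model.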

5.0 PMU Event-Based Sampling

PMU event-based sampling differs from time-based sampling. Instead of interrupting at a regular interval, it interrupts execution of the program when a PMU event occurs. An analogy is a red light traffic camera: the camera takes a picture only when someone runs a red light. Think of how long it would take to catch a red light runner if the camera took a picture once every minute; it might take months. Now apply this to the small and very fast execution inside a processor based on Intel XScale microarchitecture: every time the PMU sees a cache miss, it takes a picture of what is going on at the time of the cache miss.

PMU event-based sampling is different from event counting, in that counting tells how many times an event happened but does not tell which instruction caused the event. Event counting is much easier to implement: event-based sampling involves interrupting the processor, while event counting happens automatically once the PMU has been told to start and stop.

Knowing which instruction caused an event matters because every cache miss can cost over 80 cycles: the CPU is running at 733 MHz and the memory bus is running at 100 MHz, so the processor stalls while memory is being accessed and the cache fills. Once it is known that a certain instruction is causing a significant number of cache misses, the cache can be preloaded before execution reaches that instruction, saving the possibly 80 cycles per cache miss, multiplied by the number of times that instruction runs in a loop.

Note: The following information is copied from the Intel 80200 Processor based on Intel XScale Microarchitecture Developer's Manual:

Table 5 lists events that may be monitored by the PMU. Each of the Performance Monitor Count Registers (PMN0 and PMN1) can count any listed event. Software selects which event is counted by each PMNx register by programming the evtcountx fields of the PMNC register.

Table 5. Performance Monitoring Events

Event Number (evtcount0 or evtcount1)   Event Definition
0x0          Instruction cache miss requires fetch from external memory.
0x1          Instruction cache cannot deliver an instruction. This could indicate an ICache miss or an ITLB miss. This event occurs every cycle in which the condition is present.
0x2          Stall due to a data dependency. This event occurs every cycle in which the condition is present.
0x3          Instruction TLB miss.
0x4          Data TLB miss.
0x5          Branch instruction executed, branch may or may not have changed program flow.
0x6          Branch mispredicted. (B and BL instructions only.)
0x7          Instruction executed.
0x8          Stall because the data cache buffers are full. This event occurs every cycle in which the condition is present.
0x9          Stall because the data cache buffers are full. This event occurs once for each contiguous sequence of this type of stall.
0xA          Data cache access, not including Cache Operations (defined in Section 7.2.8 of the Intel 80200 Processor based on Intel XScale Microarchitecture Developer's Manual).
0xB          Data cache miss, not including Cache Operations (defined in Section 7.2.8 of the Intel 80200 Processor based on Intel XScale Microarchitecture Developer's Manual).
0xC          Data cache write-back. This event occurs once for each 1/2 line (four words) that is written back from the cache.
0xD          Software changed the PC. This event occurs any time the PC is changed by software and there is not a mode change. For example, a mov instruction with PC as the destination triggers this event. Executing a swi from User mode does not trigger this event, because it incurs a mode change.
0x10         The BCU received a new memory request from the core.
0x11         The BCU's request queue is full. This event takes place each clock cycle in which the condition is met. A high incidence of this event indicates the BCU is often waiting for transactions to complete on the external bus.
0x12         The number of times the BCU queues were drained due to a Drain Write Buffer command or an I/O transaction as identified by C = 0 and B = 0 (cacheable and bufferable page attribute bits).
0x13         Reserved, unpredictable results.
0x14         The BCU detected an ECC error, but no ELOG register was available in which to log the error. (See the ECC Error Registers section, page 11-9, of the Intel 80200 Processor based on Intel XScale Microarchitecture Developer's Manual for a description of the ELOG registers.)
0x15         BCU detected a 1-bit error while reading data from the bus. This event may be counted even when reporting of 1-bit errors is disabled. See Section 11.3, Error Handling, on page 11-2 of the Intel 80200 Processor based on Intel XScale Microarchitecture Developer's Manual for a description of 1-bit errors.
0x16         RMW cycle occurred due to narrow write on ECC-protected memory (see Section 11.2, ECC, on page 11-1 of the Intel 80200 Processor based on Intel XScale Microarchitecture Developer's Manual for a description of ECC and RMW cycles).
all others   Reserved, unpredictable results.
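Because Table 5 has gaps (0xE, 0xF, 0x13 and everything above 0x16 are reserved or unlisted), code that walks through event numbers may want a guard before programming an evtcount field. The helper below is a small sketch of such a check; its name is an illustrative assumption and it encodes nothing beyond the rows of Table 5.

```c
#include <stdbool.h>

/* Returns true when evt is an event number with a definition in
 * Table 5, false for reserved or unlisted values. */
bool pmu_event_is_defined(unsigned evt)
{
    if (evt <= 0x0Du)
        return true;                 /* 0x0 through 0xD are defined   */
    if (evt >= 0x10u && evt <= 0x16u)
        return evt != 0x13u;         /* 0x13 is reserved              */
    return false;                    /* 0xE, 0xF and all others       */
}
```

A loop over candidate event numbers can then skip any value for which this function returns false, rather than counting an event with unpredictable results.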

6.0 An Example Program - memtest.c

Here is a simple program that demonstrates the value of using existing optimized library routines. The program sets a block of 500 bytes of memory to the value 'a', and it shows two ways to do it: one is to initialize memory with a for loop, the other is to use the pre-optimized memset function supplied in the standard C library. Each of the following examples builds on this piece of example code to demonstrate different performance profiling techniques.

    #include <stdio.h>
    #include <string.h>   /* declares memset */

    char buf[1000];

    /* Hand-written byte-by-byte fill, for comparison with memset */
    void test_memset(char *buf, char val, int num)
    {
        int i;
        for(i=0; i<num; i++)
            buf[i]=val;
    }

    int main()
    {
        unsigned ticks = 0;
        int i;

        for(i=0; i<5000; i++)
        {
            test_memset(buf, 'a', 500);
            memset(buf, 'a', 500);
        }
        return 0;
    }

6.1 Using CCNT as a Timer - PMU.C

Since the PMU is being set up to time the execution of both the memset and test_memset functions, the for loop is split in two, and some assembly language instructions are defined to set the CPU clock (line #2 below), set up and run the clock counter (line #6) and read the counter (line #10). First, set the CCLKCFG register for 733 MHz (line #32). This is needed because the timing calculations are based on the CPU clock speed. Next, enable the PMU to do clock counting (line #33). The value written to the PMNC Control Register is 0x This value sets the events to 0, clears all flags, disables all interrupts, disables the divider, resets all counters and then enables all counters. Once the loop has been executed, read the clock counter (line #38) and reset the PMU (line #40) to time the second loop.

     1  #include <stdio.h>
     2  inline void _Write_CCLKCFG(unsigned VAL) /* write to the CCLKCFG register */
     3  {
     4      asm( "mcr\tp14, 0, %0, c6, c0, 0" : : "r" (VAL) );
     5  }
     6  inline void _Write_PMNC(unsigned VAL) /* write to the PMNC register of the PMU */
     7  {
     8      asm( "mcr\tp14, 0, %0, c0, c0, 0" : : "r" (VAL) );
     9  }
    10  inline unsigned _Read_CCNT(void) /* read the CCNT register of the PMU */
    11  {
    12      register unsigned _val_;
    13      asm volatile( "mrc\tp14, 0, %0, c1, c0, 0" : "=r" ( _val_ ) );
    14      return _val_;
    15  }
    16  #include <string.h> /* declares memset */
    17  char buf[1000];
    18
    19  void test_memset( char *buf, char val, int num )
    20  {
    21      int i;
    22      for( i=0; i<num; i++ )
    23          buf[i] = val;
    24  }
    25
    26  int main()
    27  {
    28      unsigned ticks = 0;
    29      int i;
    30
    31
    32      _Write_CCLKCFG( 9 ); /* set clock to 733 MHz */
    33      _Write_PMNC( 0x ); /* Enable the PMU and reset all */
    34      for( i=0; i<5000; i++ )
    35      {
    36          test_memset( buf, 'a', 500 );
    37      }
    38      ticks = _Read_CCNT(); /* read the PMU clock counter */
    39      printf( "test_memset took %4u ticks = %02.6f ms and ", ticks, (double)ticks/ );
    40      _Write_PMNC( 0x ); /* Enable the PMU and reset all */
    41      for( i=0; i<5000; i++ )
    42      {
    43          memset( buf, 'a', 500 );
    44      }
    45      ticks = _Read_CCNT(); /* read the PMU clock counter */
    46      printf( "memset took %4u ticks = %02.6f ms\n", ticks, (double)ticks/ );
    47      return 0;
    48  }

PMU.C Program Output

The output shows that the test_memset loop took ms to execute and the memset library routine took just 3.3 ms to execute. That is an improvement of 31 times in speed, just by using a standard C library routine.

    test_memset took ticks = ms and memset took ticks = ms
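The example above resets the PMU before each timed loop. As an alternative sketch, not taken from this application note, two CCNT readings can be subtracted to time a section without resetting the counter. Unsigned 32-bit subtraction gives the right answer even across one rollover, provided the section is shorter than one full counter period. The function names and the 726 MHz figure used in the usage note are assumptions for illustration.

```c
#include <stdint.h>

/* Cycles elapsed between two CCNT readings. Unsigned arithmetic wraps
 * modulo 2^32, so the result is correct across a single rollover. */
uint32_t ccnt_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a tick count to milliseconds for a core clock given in MHz
 * (one millisecond is core_mhz * 1000 ticks). */
double ticks_to_ms(uint32_t ticks, double core_mhz)
{
    return (double)ticks / (core_mhz * 1000.0);
}
```

For example, readings of 0xFFFFFFF0 and 0x00000010 are 0x20 cycles apart, and 726,000 ticks at a 726 MHz core clock correspond to 1 ms.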

6.2 Using PMN0 and PMN1 Registers to Count PMU Events - pmu_event.c

This example shows how to use the PMU to count events such as cache misses and branch mispredictions. A loop (line #39) is set up around the test loop, to count events in sequential order. During the first pass of the loop, the PMU is set up to count event #0, "Instruction cache miss requires fetch from external memory." The next pass counts event #1, and so on.

     1  #include <stdio.h>
     2  inline void _Write_CCLKCFG(unsigned VAL)
     3  {
     4      asm( "mcr\tp14, 0, %0, c6, c0, 0" : : "r" (VAL) );
     5  }
     6  inline void _Write_PMNC(unsigned VAL)
     7  {
     8      asm( "mcr\tp14, 0, %0, c0, c0, 0" : : "r" (VAL) );
     9  }
    10  inline unsigned _Read_CCNT(void)
    11  {
    12      register unsigned _val_;
    13      asm volatile( "mrc\tp14, 0, %0, c1, c0, 0" : "=r" ( _val_ ) );
    14      return _val_;
    15  }
    16  inline unsigned _Read_PMN0(void)
    17  {
    18      register unsigned _val_;
    19      asm volatile( "mrc\tp14, 0, %0, c2, c0, 0" : "=r" ( _val_ ) );
    20      return _val_;
    21  }
    22  #include <string.h> /* declares memset */
    23  char buf[1000];
    24
    25  void test_memset( char *buf, char val, int num )
    26  {
    27      int i;
    28      for( i=0; i<num; i++ )
    29          buf[i] = val;
    30  }
    31
    32  int main()
    33  {
    34      unsigned ticks = 0, events = 0;
    35      unsigned int pmnc_val;
    36      int i, j;
    37
    38      _Write_CCLKCFG( 9 ); /* set clock to 733 MHz */
    39      for (j=0; j<23; j++ )
    40      {
    41          if( j == 0x0E || j == 0x0F ) /* avoid using these reserved values for the event numbers */
    42              continue;
    43          pmnc_val = 0x | (j<<12); /* shift j left by 12 to program the Event Count 0 bits */
    44          _Write_PMNC( pmnc_val ); /* Enable the PMU and reset all */
    45          for( i=0; i<5000; i++ )
    46          {
    47              test_memset( buf, 'a', 500 );
    48              memset( buf, 'a', 500 );
    49          }
    50          ticks = _Read_CCNT(); /* read the PMU clock counter */
    51          events = _Read_PMN0(); /* read the PMU event counter */
    52          printf( "%6u ticks=%02.4f ms, events=%8u, pmnc=%08x\n",
    53              ticks, (double)ticks/726000, events, pmnc_val);
    54      }
    55      return 0;
    56  }

pmu_event.c - Program Output

Please refer to Table 5, Performance Monitoring Events on page 14, for explanations of the different performance monitoring events. The output of pmu_event shows the count of PMU events in sequential order. The first line corresponds to event #0x0, the second line corresponds to event #0x1, and so on. Notice that on event #0x0 (instruction cache misses) the program took slightly longer to run because the instruction cache was being loaded. Then the event number is incremented and the program runs again, this time with the PMU counting the new event data:

    ticks= ms, events= 15, pmnc=
    ticks= ms, events= , pmnc=
    ticks= ms, events= , pmnc=
    ticks= ms, events= 1, pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= , pmnc=
    ticks= ms, events= , pmnc=
    ticks= ms, events= , pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= , pmnc=0000a
    ticks= ms, events= 0, pmnc=0000b
    ticks= ms, events= 0, pmnc=0000c
    ticks= ms, events= , pmnc=0000d
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= 0, pmnc=
    ticks= ms, events= 0, pmnc=

6.3 Using the PMU for Time-based Sampling

The next example changes the use of the PMU from counting and timing to sampling. The processor is interrupted every 10 µs and the value of the PC is saved to an array. A slower sampling rate is normally preferred, but this example uses a faster rate to collect a more significant amount of data. Over time, the saved PC values give a statistical sampling of where the program spends most of its time.

Notice that the inline functions for the PMU registers have moved into a header file named 80200.h (#2), and that a new header file called 80310_fiq_irq.h (#3) has been included; it contains some specific interrupt handling code. Also added is _Read_LR() (#6), an inline function that reads the Link Register, which holds the value of the Program Counter from before the ISR took over execution of the program.

To service the interrupt, add the code from line #17 to #47 in the sample code below. The inthandlerattach() function replaces the current RedBoot FIQ vector with the new vector, fiqhandlerpmu(), and inthandlerdetach() puts the old vector back when the process is completed. The function fiqhandlerpmu() runs every time there is an interrupt. The first step is to set up the FIQ mode stack pointer. Choose an address in RAM that the application does not overwrite, and choose it with care. Next, run ISR_PROLOG (#33), an assembly routine that saves the current registers onto the stack. Then find out what caused the interrupt. When the interrupt is caused by the PMU (#35), read the Link Register, which contains the Program Counter from before the interrupt. Also reprogram the CCNT register to interrupt the CPU again at the same sample rate as before (#39) and set the PMNC register back to interrupt on overflow (#40). When the PMU did not cause the interrupt, call ISR_CHAIN (#44), which calls the old FIQ vector.
Then pop the registers off the stack by calling ISR_EPILOG (#46) before exiting the FIQ handler.

The next section is part of main(). First, set up the interrupt steering register (#64) for FIQ by reading the INTSTR register, setting the FIQ bit to 1 and writing the value back into the INTSTR register. Then attach the interrupt handler (#65) as described in the previous paragraph, replacing the old FIQ handler with the new fiqhandlerpmu(). Line #66 loads the sampling rate value into the CCNT register. Instead of counting up from zero and stopping after the program finishes, the CCNT is programmed to (0x0 - 10 µs), or 0xFFFFE3A4, so that it overflows and interrupts the processor after that many clock cycles. Once the processor has been interrupted and the interrupt serviced, re-program the CCNT register with the (overflow - 10 µs) value and re-enable counting. Lines #67 and #68 reset the PMU and enable it to cause FIQ interrupts on a CCNT overflow. Notice that the value being programmed into the PMNC has changed from 0x to 0x (#67). This change enables the clock counter interrupt and avoids resetting the value just programmed into the CCNT. Be aware that interrupting the processor every 10 µs severely degrades the performance of the application being profiled; it is done here only to generate a large amount of data.

After running the target segment of code, disable the interrupts (#76), reset the PMU (#78) and detach the interrupt handler (#79). Then display the results of the time-based sampling.

  1  #include <stdio.h>
  2  #include "80200.h"
  3  #include "80310_fiq_irq.h"
  4  /************** SAMPRATE = ( 0x0 - ( 726 * microseconds ) ) *************/
  5  #define SAMPRATE 0xFFFFE3A4 /* MHz */
  6  inline unsigned _Read_LR(void)
  7  {
  8      register unsigned _val_;
  9      asm volatile("mov\t%0, r14" : "=r" (_val_));
 10      return _val_;
 11  }
 12
 13  unsigned int status, count=0, PCsample[0xFFFF];
 14  ISR_SERVICE *fiqval; /* In 80310_fiq_irq.h */
 15  ISR_SERVICE next_fiq_service;
 16
 17  /*************** ISR STUFF *******************/
 18  void inthandlerattach( ISR_SERVICE fiq )
 19  {
 20      fiqval = (ISR_SERVICE*)FIQVECTOR;
 21      next_fiq_service = *fiqval;
 22      *fiqval = fiq;
 23  }
 24
 25  void inthandlerdetach()
 26  {
 27      *FIQVECTOR = (unsigned int)next_fiq_service;
 28  }
 29
 30  void __attribute__ ((interrupt("fiq"))) fiqhandlerpmu(void)
 31  {
 32      asm("ldr sp,=0xa "); /* stack pointer in FIQ mode */
 33      ISR_PROLOG;
 34      status = _Read_INT_Source();
 35      if( status & 0x ) /* If the PMU caused the FIQ */
 36      {
 37          PCsample[count] = _Read_LR();
 38          count++;
 39          _Write_CCNT( SAMPRATE );
 40          _Write_PMNC( 0x ); /* Enable the PMU and reset */
 41      }
 42      else
 43      {
 44          ISR_CHAIN( next_fiq_service ); /* call old vector */
 45      }
 46      ISR_EPILOG;
 47  }
 48
 49  /*************** TBS Example Code *****************/
 50  char buf[1000];
 51  void test_memset( char *buf, char val, int num )
 52  {
 53      int j;
 54      for( j=0; j<num; j++ )
 55          buf[j] = val;
 56  }
 57
 58  int main()
 59  {
 60      unsigned int i, j, location=1;
 61
 62
 63      _Write_CCLKCFG( 9 ); /* 733 MHz */
 64      _Write_INT_Steering( _Read_INT_Steering() | 1 ); /* Steer FIQ */
 65      inthandlerattach( fiqhandlerpmu ); /* attach the interrupt handler */
 66      _Write_CCNT( SAMPRATE ); /* Interrupt every 10 us */
 67      _Write_PMNC( 0x ); /* Enable the PMU and reset */
 68      _Write_INT_Control( _Read_INT_Control() | 5 ); /* Enable FIQ */
 69
 70      for( i=0; i<5000; i++ )
 71      {
 72          test_memset( buf, 'a', 500 );
 73          memset( buf, 'a', 500 );
 74      }
 75
 76      _Write_INT_Control( _Read_INT_Control() & 0xFFFFFFFA ); /* Disable FIQ */
 77
 78      _Write_PMNC( 0x ); /* Disable the PMU and reset all */
 79      inthandlerdetach();
 80
 81      /***************** Display Results ******************/
 82      printf( "count=%d\n", count );
 83      for( i=0; i<count; i++ )
 84      {
 85          if( PCsample[i] )
 86          {
 87              for( j=i+1; j<count; j++ )
 88              {
 89                  if( PCsample[i] == PCsample[j] )
 90                  {
 91                      location++;
 92                      PCsample[j]=0;
 93                  }
 94              }
 95              printf( "PCsample[%04d]=0x%08X x %d\n", i, PCsample[i]-4, location );
 96              location=1;
 97          }
 98      }
 99      return 0;
100  }
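The SAMPRATE value follows directly from the comment in the listing, SAMPRATE = ( 0x0 - ( 726 * microseconds ) ): a 32-bit counter preloaded with the negated cycle count overflows after exactly that many cycles. The sketch below reproduces the arithmetic; the function names are hypothetical and the 726 cycles per microsecond figure is the one used in the listing.

```c
#include <stdint.h>

/* Number of core clock cycles in `microseconds` at `mhz` cycles per microsecond. */
static uint32_t cycles_for_us(uint32_t mhz, uint32_t microseconds)
{
    return mhz * microseconds;
}

/* Preload value that makes a 32-bit up-counter overflow after `cycles`
   further increments. Unsigned arithmetic wraps modulo 2^32, so this is
   the same as 0x100000000 - cycles. */
static uint32_t pmu_reload_after(uint32_t cycles)
{
    return (uint32_t)(0u - cycles);
}
```

For 10 µs at 726 cycles per microsecond, the cycle count is 7260 (0x1C5C), and pmu_reload_after( 7260 ) gives 0xFFFFE3A4, the value defined as SAMPRATE.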

pmu_tbs.c - Program Output

The first line of output from the time-based sampling program, pmu_tbs.c, displays the value of the count variable. Because a sample is taken every 10 µs, the count of 10627 samples corresponds to an execution time of about 106.27 ms. The difference between this time and the pmu_event execution time is the overhead of time-based sampling, which in this case is 340 µs.

To make sense of the next section of output, use GDB to find out where the functions have been mapped in memory. Then match the sampled addresses with the functions that reside at them:

0xa xa test_memset() samples or 96.9%
0xa xa00203b8 main() 30 samples or 00.3%
0xa xa002111c memset() 301 samples or 02.8%

In summary, the total number of samples was 10627. The function test_memset() accounted for 96.9% of all samples collected. The optimized library function memset(), on the other hand, had only 301 samples, or 2.8% of all samples. These two functions do exactly the same work, but memset() does it much more efficiently.

count=10627
PCsample[0000]=0xA x 4257
PCsample[0001]=0xA x 1340
PCsample[0002]=0xA002015C x 185
PCsample[0005]=0xA002016C x 1675
PCsample[0006]=0xA x 1537
PCsample[0008]=0xA x 11
PCsample[0010]=0xA x 214
PCsample[0018]=0xA002014C x 628
PCsample[0074]=0xA x 325
PCsample[0121]=0xA002017C x 23
PCsample[0382]=0xA x 40
PCsample[0383]=0xA x 36
PCsample[0384]=0xA x 15
PCsample[0465]=0xA x 1
PCsample[0482]=0xA x 10
PCsample[0499]=0xA00201E8 x 8
PCsample[0516]=0xA x 10
PCsample[0533]=0xA x 3
PCsample[0550]=0xA00210C8 x 23
PCsample[0567]=0xA00210C4 x 109
PCsample[0601]=0xA00210C0 x 40
PCsample[0618]=0xA00210BC x 92
PCsample[1009]=0xA x 1
PCsample[1026]=0xA x 4
PCsample[1500]=0xA002013C x 6
PCsample[1534]=0xA00201EC x 2
PCsample[2044]=0xA002109C x 2
PCsample[2603]=0xA x 5
PCsample[2620]=0xA00210CC x 7
PCsample[3079]=0xA x 4
PCsample[3096]=0xA002106C x 5
PCsample[3570]=0xA x 1
PCsample[3996]=0xA x 1
PCsample[4673]=0xA00210F8 x 1
PCsample[5149]=0xA00210A0 x 2
PCsample[5640]=0xA x 1
PCsample[5657]=0xA x 1
PCsample[10401]=0xA x 1
PCsample[10469]=0xA00210FC x 1
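Matching sampled addresses to functions by hand becomes tedious as the sample count grows. The following sketch shows one way the per-function totals and percentages could be computed once the address ranges are known from GDB. The structure, the function names and the idea of passing the ranges in as a table are all illustrative assumptions and are not part of pmu_tbs.c.

```c
#include <stddef.h>
#include <stdint.h>

/* One function's address range, as read from GDB (hypothetical layout). */
struct func_range {
    const char *name;
    uint32_t    start;  /* first address of the function            */
    uint32_t    end;    /* last address of the function (inclusive) */
    unsigned    hits;   /* number of samples that fell in the range */
};

/* Add each sampled PC to the function whose range contains it.
   Returns the number of samples that matched no range. */
static unsigned attribute_samples(const uint32_t *pc, size_t n,
                                  struct func_range *funcs, size_t nfuncs)
{
    unsigned unmatched = 0;
    for (size_t i = 0; i < n; i++) {
        size_t f;
        for (f = 0; f < nfuncs; f++) {
            if (pc[i] >= funcs[f].start && pc[i] <= funcs[f].end) {
                funcs[f].hits++;
                break;
            }
        }
        if (f == nfuncs)
            unmatched++;
    }
    return unmatched;
}

/* Share of the total, in percent; 0 when no samples were taken. */
static double percent_of(unsigned hits, unsigned total)
{
    return total ? 100.0 * hits / total : 0.0;
}
```

The samples passed in would be the saved Link Register values after the same adjustment of 4 that the display loop applies, so that they fall inside the interrupted function.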

6.4 Using the PMU to do Event-based Sampling

Event-based sampling is very similar to time-based sampling, except that the processor is interrupted after a fixed number of events (256 in this example) instead of every 10 µs as in the previous example. Each time the event occurs, the PMN1 register is incremented until it overflows, which causes an interrupt. The handler then saves the Link Register value and returns from the interrupt.

Notice that on line #11, PMN1Event is set to event #0x0B, or Data Cache miss. Also notice that on line #12 the reload values are set so that the processor is interrupted every 256 events. The processor is not interrupted at every event because the overhead of servicing an interrupt on every Data Cache miss would be higher than the work being done. The interrupt handler includes new code to service interrupts that could come from PMN0 and PMN1 overflows. The status of the PMNC register tells where the interrupt came from. The rest of the modifications are there to count events on PMN1 instead of the CCNT as in the previous example. Again, notice that the value programmed into the PMNC register has changed from the previous example. Before, the value was set to 0x and now the value being programmed is 0x The difference is that the event counter interrupts are being enabled and PMN0 and PMN1 are not being reset to zero.
  1  #include <stdio.h>
  2  #include "80200.h"
  3  #include "80310_fiq_irq.h"
  4  extern inline unsigned _Read_LR(void)
  5  {
  6      register unsigned _val_;
  7      asm volatile("mov\t%0, r14" : "=r" (_val_));
  8      return _val_;
  9  }
 10  unsigned int status;
 11  unsigned int PMN0Event=0, PMN1Event=0x0B;
 12  unsigned int PMN0Rate=0xFFFFFF00, PMN1Rate=0xFFFFFF00;
 13  unsigned int count=0, countpmn0=0, countpmn1=0;
 14  unsigned int PCsample[0xFFFF], PMN0sample[0xFFFF], PMN1sample[0xFFFF];
 15  ISR_SERVICE *fiqval;
 16  ISR_SERVICE next_fiq_service;
 17
 18  /*************** ISR STUFF *******************/
 19  void inthandlerattach( ISR_SERVICE fiq )
 20  {
 21      fiqval = (ISR_SERVICE*)FIQVECTOR;
 22      next_fiq_service = *fiqval;
 23      *fiqval = fiq;
 24  }
 25
 26  void inthandlerdetach()
 27  {
 28      *FIQVECTOR = (unsigned int)next_fiq_service;
 29  }
 30
 31  void __attribute__ ((interrupt("fiq"))) fiqhandlerpmu(void)
 32  {
 33      asm("ldr sp,=0xa ");
 34      ISR_PROLOG;
 35      status = _Read_INT_Source();
 36      if( status & 0x ) /* If the PMU caused the FIQ */
 37      {
 38          status = _Read_PMNC();
 39          if( status & 0x )
 40          {
 41              PMN0sample[countpmn0] = _Read_LR();
 42              countpmn0++;
 43              _Write_PMN0( PMN0Rate );
 44              _Write_PMNC( status );
 45          }
 46          else if( status & 0x )
 47          {
 48              PMN1sample[countpmn1] = _Read_LR();
 49              countpmn1++;
 50              _Write_PMN1( PMN1Rate );
 51              _Write_PMNC( status );
 52          }
 53          else
 54          {
 55              PCsample[count] = _Read_LR();
 56              count++;
 57              _Write_CCNT( samprate );
 58              _Write_PMNC( status );
 59          }
 60      }
 61      else
 62      {
 63          ISR_CHAIN( next_fiq_service ); /* call old vector */
 64      }
 65      ISR_EPILOG;
 66  }
 67
 68  /*************** Example Code *****************/
 69  char buf[1000];
 70  void test_memset( char *buf, char val, int num )
 71  {
 72      int j;
 73      for( j=0; j<num; j++ )
 74          buf[j] = val;
 75  }
 76
 77  int main()
 78  {
 79      unsigned int i, j, location=1;
 80
 81      _Write_CCLKCFG( 9 ); /* 733 MHz */
 82      _Write_INT_Steering( _Read_INT_Steering() | 1 ); /* Steer FIQ */
 83      inthandlerattach( fiqhandlerpmu ); /* attach the interrupt handler */
 84      _Write_CCNT( samprate );
 85      _Write_PMN0( PMN0Rate );
 86      _Write_PMN1( PMN1Rate );
 87      _Write_PMNC( 0x | PMN0Event<<12 | PMN1Event<<20 ); /* Enable the PMU and reset */
 88
 89      _Write_INT_Control( _Read_INT_Control() | 5 ); /* Enable FIQ */
 90
 91      for( i=0; i<5000; i++ )
 92      {
 93          test_memset( buf, 'a', 500 );
 94          memset( buf, 'a', 500 );
 95      }
 96
 97      _Write_INT_Control( _Read_INT_Control() & 0xFFFFFFFA ); /* Disable FIQ */
 98
 99      _Write_PMNC( 0x ); /* Disable the PMU and reset all */
100      inthandlerdetach();
101
102      /******************** Display the results ***********************/
103      printf( "count=%u, countpmn0=%u, countpmn1=%u\n", count, countpmn0, countpmn1 );
104      for( i=0, location=1; i<countpmn1; i++ )
105      {
106          if( PMN1sample[i] )
107          {
108              for( j=i+1; j<countpmn1; j++ )
109              {
110                  if( PMN1sample[i] == PMN1sample[j] )
111                  {
112                      location++;
113                      PMN1sample[j]=0;
114                  }
115              }
116              printf( "PMN1sample[%d]=%08X x %d\n", i, PMN1sample[i]-4, location );
117              location=1;
118          }
119      }
120      return 0;
121  }
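The handler in this listing decides which counter overflowed by testing flag bits in the PMNC value, checking PMN0 first, then PMN1, and treating anything else as a CCNT overflow. The sketch below mirrors that decision order and also shows how a reload value such as 0xFFFFFF00 corresponds to 256 events. The flag bit positions are assumptions (the masks are not legible in the listing), as are all names, so confirm them against the developer's manual.

```c
#include <stdint.h>

/* Assumed overflow-flag positions in PMNC; not taken from the listing. */
#define PMNC_FLAG_PMN0  (1u << 8)   /* assumption: PMN0 overflowed */
#define PMNC_FLAG_PMN1  (1u << 9)   /* assumption: PMN1 overflowed */

enum pmu_source { PMU_SRC_PMN0, PMU_SRC_PMN1, PMU_SRC_CCNT };

/* Same priority order as the handler: PMN0, then PMN1, otherwise CCNT. */
static enum pmu_source pmu_overflow_source(uint32_t pmnc)
{
    if (pmnc & PMNC_FLAG_PMN0)
        return PMU_SRC_PMN0;
    if (pmnc & PMNC_FLAG_PMN1)
        return PMU_SRC_PMN1;
    return PMU_SRC_CCNT;
}

/* Reload value that makes a 32-bit event counter overflow after `events` events. */
static uint32_t reload_for_events(uint32_t events)
{
    return (uint32_t)(0u - events);
}
```

Raising the event count per interrupt lowers the sampling overhead at the cost of fewer samples; reload_for_events( 256 ) yields the 0xFFFFFF00 used for PMN0Rate and PMN1Rate.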

pmu_ebs.c - Program Output

As with time-based sampling, use GDB to find out where the functions are mapped in memory:

0xa002026c - 0xa00202cc test_memset 3169 samples or 96.3%
0xa00202d0 - 0xa002078c main 21 samples or 00.6%
0xa xa00214ec memset 101 samples or 03.1%

This example program samples on Data Cache miss events. Since the Intel XScale microarchitecture is a cached architecture, every Data Cache miss causes a cache line to be reloaded. When these Data Cache misses can be reduced, a significant amount of processor time can be saved simply by modifying the code.

count=106, countpmn0=0, countpmn1=3292
PMN1sample[0]=A00202B8 x 81
PMN1sample[1]=A x 768
PMN1sample[2]=A00202A8 x 6
PMN1sample[3]=A x 24
PMN1sample[4]=A00202B0 x 162
PMN1sample[5]=A00202C0 x 734
PMN1sample[6]=A00202C4 x 754
PMN1sample[13]=A00202B4 x 611
PMN1sample[47]=A00202C8 x 35
PMN1sample[52]=A00214B8 x 5
PMN1sample[56]=A002148C x 30
PMN1sample[162]=A x 41
PMN1sample[207]=A x 6
PMN1sample[415]=A00203B4 x 2
PMN1sample[575]=A00202AC x 12
PMN1sample[578]=A x 8
PMN1sample[733]=A x 7
PMN1sample[884]=A00203B0 x 4
PMN1sample[1463]=A x 1

7.0 Combination PMU Event Sampling

The following section is copied from the Intel 80200 Processor based on Intel XScale Microarchitecture Developer's Manual:

Table 6. Some Common Uses of the PMU

Mode                          PMNC.evtCount0                PMNC.evtCount1
Instruction Cache Efficiency  0x7 (instruction count)       0x0 (ICache miss)
Data Cache Efficiency         0xA (DCache access)           0xB (DCache miss)
Instruction Fetch Latency     0x1 (ICache cannot deliver)   0x0 (ICache miss)
Data/Bus Request Buffer Full  0x8 (DBuffer stall duration)  0x9 (DBuffer stall)
Stall/Writeback Statistics    0x2 (data stall)              0xC (DCache writeback)
Instruction TLB Efficiency    0x7 (instruction count)       0x3 (ITLB miss)
Data TLB Efficiency           0xA (DCache access)           0x4 (DTLB miss)

7.1 Instruction Cache Efficiency Mode

PMN0 totals the number of instructions that were executed, which does not include instructions fetched from the instruction cache that were never executed. This can happen when a branch instruction changes the program flow; the instruction cache may retrieve the next sequential instructions after the branch before it receives the target address of the branch.

PMN1 counts the number of instruction fetch requests to external memory. Each of these requests loads 32 bytes at a time.

Statistics derived from these two events:

- Instruction cache miss-rate. This is derived by dividing PMN1 by PMN0.
- The average number of cycles it took to execute an instruction, commonly referred to as cycles-per-instruction (CPI). CPI can be derived by dividing CCNT by PMN0, where CCNT was used to measure total execution time.

7.2 Data Cache Efficiency Mode

PMN0 totals the number of data cache accesses, which includes cacheable and non-cacheable accesses, mini-data cache accesses and accesses made to locations configured as data RAM.

Note: STM and LDM each count as several accesses to the data cache, depending on the number of registers specified in the register list. LDRD registers two accesses.

PMN1 counts the number of data cache and mini-data cache misses. Cache operations do not contribute to this count. See the Intel 80200 Processor based on Intel XScale Microarchitecture Developer's Manual for a description of these operations.

The statistic derived from these two events is:

- Data cache miss-rate. This is derived by dividing PMN1 by PMN0.
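The miss-rate and CPI statistics above are simple ratios of the three counter values read at the end of a run. A minimal sketch of the calculation is shown below; the function names are hypothetical, and the guard against a zero denominator is an added convenience and is not something the manual prescribes.

```c
#include <stdint.h>

/* Ratio of two counter values; returns 0 when the denominator is 0. */
static double pmu_ratio(uint32_t numerator, uint32_t denominator)
{
    return denominator ? (double)numerator / denominator : 0.0;
}

/* Miss-rate in either cache efficiency mode: PMN1 (misses) / PMN0 (total). */
static double cache_miss_rate(uint32_t pmn0, uint32_t pmn1)
{
    return pmu_ratio(pmn1, pmn0);
}

/* Cycles per instruction in instruction cache efficiency mode: CCNT / PMN0. */
static double cycles_per_instruction(uint32_t ccnt, uint32_t pmn0)
{
    return pmu_ratio(ccnt, pmn0);
}
```

For instance, a run that retires 100 instructions in 200 clock cycles with 25 instruction cache misses has a CPI of 2.0 and a miss-rate of 0.25.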

7.3 Instruction Fetch Latency Mode

PMN0 accumulates the number of cycles when the instruction cache is not able to deliver an instruction to the 80200 due to an instruction cache miss or instruction TLB miss. This event means that the processor core is stalled.

PMN1 counts the number of instruction fetch requests to external memory. Each of these requests loads 32 bytes at a time. This is the same event as measured in instruction cache efficiency mode and is included in this mode for convenience, so that only one performance monitoring run is needed.

Statistics derived from these two events:

- The average number of cycles the processor stalled waiting for an instruction fetch from external memory to return. This is calculated by dividing PMN0 by PMN1. When the average is high, the 80200 may be starved of the bus external to the 80200.
- The percentage of total execution cycles the processor stalled waiting on an instruction fetch from external memory to return. This is calculated by dividing PMN0 by CCNT, which was used to measure total execution time.

7.4 Data/Bus Request Buffer Full Mode

The Data Cache has buffers available to service cache misses or uncacheable accesses. For every memory request that the Data Cache receives from the processor core, a buffer is speculatively allocated in case an external memory request is required or temporary storage is needed for an unaligned access. When no buffers are available, the Data Cache stalls the processor core. How often the Data Cache stalls depends on the performance of the bus external to the 80200 and on the memory access latency for Data Cache miss requests to external memory. When the memory access latency is high, possibly due to starvation, these Data Cache buffers become full.

This performance monitoring mode is provided to see when the 80200 is being starved of the bus external to the 80200, which affects the performance of the application running on the 80200. PMN0 accumulates the number of clock cycles the processor is being stalled due to this condition and PMN1 monitors the number of times this condition occurs.

Statistics derived from these two events:

- The average number of cycles the processor stalled on a data cache access that may overflow the data cache buffers. This is calculated by dividing PMN0 by PMN1. This statistic shows whether the duration event cycles are due to many requests or are attributable to just a few requests. When the average is high, the 80200 may be starved of the bus external to the 80200.
- The percentage of total execution cycles the processor stalled because a Data Cache request buffer was not available. This is calculated by dividing PMN0 by CCNT, which was used to measure total execution time.
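Both the instruction fetch latency mode and the request buffer full mode pair a stall-duration count in PMN0 with an occurrence count in PMN1, so the same two calculations apply to each. The sketch below computes them together; the structure and function names are illustrative assumptions.

```c
#include <stdint.h>

struct stall_stats {
    double avg_cycles_per_event;  /* PMN0 / PMN1         */
    double percent_of_runtime;    /* 100 * PMN0 / CCNT   */
};

/* pmn0: stall cycles, pmn1: number of stall occurrences or fetch requests,
   ccnt: total execution cycles. Zero denominators yield 0. */
static struct stall_stats pmu_stall_stats(uint32_t pmn0, uint32_t pmn1,
                                          uint32_t ccnt)
{
    struct stall_stats s;
    s.avg_cycles_per_event = pmn1 ? (double)pmn0 / pmn1 : 0.0;
    s.percent_of_runtime   = ccnt ? 100.0 * pmn0 / ccnt : 0.0;
    return s;
}
```

A high average with a low percentage suggests a few long stalls, while a low average with a high percentage suggests many short ones.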

7.5 Stall/Writeback Statistics

When an instruction requires the result of a previous instruction and that result is not yet available, the 80200 stalls in order to preserve the correct data dependencies. PMN0 counts the number of stall cycles due to data dependencies. Not all data dependencies cause a stall; only the following dependencies cause such a stall penalty:

- Load-use penalty: attempting to use the result of a load before the load completes. To avoid the penalty, software should delay using the result of a load until it is available. This penalty shows the latency effect of data cache access.
- Multiply/Accumulate-use penalty: attempting to use the result of a multiply or multiply-accumulate operation before the operation completes. Again, to avoid the penalty, software should delay using the result until it is available.
- ALU-use penalty: there are a few isolated cases where back-to-back ALU operations may result in a one-cycle delay in execution. These cases are defined in Chapter 14, Performance Considerations, of the Intel 80200 Processor based on Intel XScale Microarchitecture Developer's Manual.

PMN1 counts the number of writeback operations emitted by the data cache. These writebacks occur when the data cache evicts a dirty line of data to make room for a newly requested line or as the result of a clean operation (CP15, register 7).

Statistics derived from these two events:

- The percentage of total execution cycles the processor stalled because of a data dependency. This is calculated by dividing PMN0 by CCNT, which was used to measure total execution time. Often a compiler can reschedule code to avoid these penalties when given the right optimization switches.
- The total number of data writeback requests to external memory, which can be derived solely from PMN1.

7.6 Instruction TLB Efficiency Mode

PMN0 totals the number of instructions that were executed, which does not include instructions that were translated by the instruction TLB and never executed. This can happen when a branch instruction changes the program flow; the instruction TLB may translate the next sequential instructions after the branch before it receives the target address of the branch.

PMN1 counts the number of instruction TLB table-walks, which occur when there is a TLB miss. When the instruction TLB is disabled, PMN1 does not increment.

Statistics derived from these two events:

- Instruction TLB miss-rate. This is derived by dividing PMN1 by PMN0.
- The average number of cycles it took to execute an instruction, commonly referred to as cycles-per-instruction (CPI). CPI can be derived by dividing CCNT by PMN0, where CCNT was used to measure total execution time.

7.7 Data TLB Efficiency Mode

PMN0 totals the number of data cache accesses, which includes cacheable and non-cacheable accesses, mini-data cache accesses and accesses made to locations configured as data RAM.

Note: STM and LDM each count as several accesses to the data TLB, depending on the number of registers specified in the register list. LDRD registers two accesses.

PMN1 counts the number of data TLB table-walks, which occur when there is a TLB miss. When the data TLB is disabled, PMN1 does not increment.

The statistic derived from these two events is:

- Data TLB miss-rate. This is derived by dividing PMN1 by PMN0.

8.0 Performance Analysis Tools

8.1 GNU gprof

GNU gprof is a free profiler that comes with the GNU tools. It uses time-based sampling to collect its data. Intel XScale microarchitecture support has been added to a RedHat GNUPro xscale release. At the time of this writing, the GNUPro version that supports gprof was not publicly available, so please check for the latest available version. Some Linux versions already have gprof support for the Intel XScale microarchitecture, so check the version to be sure.

How to use gprof

The following tutorial is based on the example program source code shown in Section 6.0, An Example Program - memtest.c on page 16. Using the GNUPro version that supports gprof, execute the following command:

    bash$ xscale-elf-gcc -pg -g -O0 -specs=iq80310.specs memtest.c -o memtest.elf

- The -pg option enables the collection of profiling data.
- The -g option creates DWARF2 debugging information for gdb and gprof.
- The -O0 option turns off optimization so that the explanation of the example is straightforward.
- The -specs=iq80310.specs option indicates which startup files and support libraries are needed to generate IQ80310 code. Use -specs=redboot.specs for the IQ

Now connect to the target with gdb and download the memtest.elf program:

    bash$ xscale-elf-gdb -nw memtest.elf

The -nw option tells gdb not to use the GUI version of gdb.

When using the serial port, type at the gdb prompt:

    (gdb) set remotebaud
    (gdb) target remote com2 (or whatever com port)

When using the Ethernet port, type at the gdb prompt:

    (gdb) target remote :9000 (or whatever IP and port)

Once connected to the target, when a Connected message displays, type the commands:

    (gdb) load
    (gdb) break exit
    (gdb) continue

The memtest.elf program now runs on the IQ80310 target and collects performance data until it hits the breakpoint in the exit routine. Then type at the gdb prompt:

    (gdb) gmon save gmon.out
    (gdb) quit

GDB has now written the profiling data to a file called gmon.out on the host system. To analyze the performance data, type:

    bash$ xscale-elf-gprof memtest.elf > memtest.res

gprof has now generated a results file on the host system called memtest.res.

Interpreting the results of gprof

Table 7 shows the result file generated by gprof for memtest.elf.

Table 7. memtest.elf - gprof Result File

Each sample counts as 0.01 seconds.

% Time  Cumulative Seconds  Self Seconds  Calls  Self ms/call  Total ms/call  Name
                                                                              test_memset
                                                                              memset
                                                                              mcount_internal
                                                                              atexit
                                                                              do_global_ctors_aux
                                                                              get_memtop
                                                                              exit
                                                                              frame_dummy
                                                                              main

The first line of data shows that the function test_memset took 77% of the 33 seconds needed to run the program. This function is clearly the slowest to run, and the source code shows that it is a simple byte-by-byte fill in a for loop. This is where the developer needs to spend time optimizing the code and making the function run as efficiently as possible.

The second line shows the standard C-library function call to memset. This library call has been optimized and takes only about 7 seconds to run, or about 22% of the CPU's time. Both the test_memset and memset functions are called 5000 times, as shown in the Calls column. This points to a simple way to fix the performance of the first function: call an optimized standard library function instead of writing a new one.

The third line is for mcount_internal. This is the internal data collection part of gprof and does not exist when the application is not being profiled with gprof. The rest of the lines took 0.00 seconds to run, so they are of little concern.

8.2 Intel VTune Performance Analyzer Version 6.0

The Intel VTune Performance Analyzer is a graphical performance profiler originally written for the Intel Architecture (x86 instruction set) line of microprocessors. Intel VTune is capable of time-based sampling, event-based sampling, counter monitoring and call graphing, and also has a built-in Intel Tuning Assistant that automatically suggests code improvements. Intel VTune for the Intel XScale microarchitecture is coming soon, so please check our website for availability.

How to use the Intel VTune Analyzer

Intel VTune works differently on an embedded system than it does on a Windows system. Set up two systems: the target system that is running the Intel XScale microarchitecture and a host system running the Windows OS. Collect data on the target and transfer the data to the Windows host. A batch file will probably take care of the details of getting the data onto the host machine. The following is an example of using Intel VTune, version 5.0, for performance profiling on a native IA-32 processor running Windows.

Figure 2. Intel VTune Performance Profiling Module View

This is the module view of a profiling session done with Intel VTune on Windows. Notice that Intel VTune has collected data on the operating system as well as on the target application, called vtunedemo. The largest bar is for ntoskrnl.exe, and hal.dll (the hardware abstraction layer) is also taking some CPU time. The module of interest, however, is the demo application called vtunedemo.exe. Double-clicking the vtunedemo.exe bar zooms into a HotSpot view of vtunedemo.exe by function.

Figure 3. Intel VTune Performance Profiling HotSpot View of vtunedemo.exe by Function

This HotSpot view shows the percentage of CPU time that each function is taking; the red bar to the left is the function that is taking the most CPU time. Double-clicking the red bar drills down further into the source code view.

Figure 4. Intel VTune Performance Profiling Source Code View

The source code view shows the exact line of code that is taking the most CPU cycles during the run of the application. This code is where the developer wants to focus on optimization.


More information

Using the Intel VTune Amplifier 2013 on Embedded Platforms

Using the Intel VTune Amplifier 2013 on Embedded Platforms Using the Intel VTune Amplifier 2013 on Embedded Platforms Introduction This guide explains the usage of the Intel VTune Amplifier for performance and power analysis on embedded devices. Overview VTune

More information
