HPM Hardware Performance Monitor for Bluegene/Q


PRASHOBH BALASUNDARAM, I-HSIN CHUNG, KRIS DAVIS, and JOHN H. MAGERLEIN, International Business Machines

The Hardware Performance Monitor (HPM) is a component of the IBM High Performance Computing Toolkit. HPM is used to instrument applications for extracting hardware performance counter data on Blue Gene/Q. The Blue Gene/Q hardware performance counter design is a marked deviation from the counters on the Blue Gene/L and Blue Gene/P systems. As the number of cores on a multi-core processor increases, the design of hardware performance counters, as well as that of other performance analysis tools, needs modification. This document presents the design, functionality, performance characteristics, and usage of the hardware performance counters on Blue Gene/Q.

General Terms: hardware counters, performance monitoring, Blue Gene/Q

Additional Key Words and Phrases: counters, metrics, tools

1. BLUE GENE/Q PROCESSOR OVERVIEW

The Blue Gene/Q processor is a system-on-chip multi-core processor hosting 18 cores. Sixteen of these 64-bit PowerPC cores are dedicated to computation, while the 17th core offloads operating system tasks. The 18th core is a spare used to improve processor yield during manufacturing. The Blue Gene/Q processor incorporates the direct memory access engine as well as the interconnection network, and includes the I/O unit, network unit, and messaging unit in addition to the processor unit. Each processor core is a 4-way multi-threaded 64-bit PowerPC microprocessor that operates at 1.6 GHz and executes instructions in order in two pipelines. One pipeline executes all integer, control, and memory access instructions; the other executes all floating point arithmetic instructions. The floating point unit is a 4-wide SIMD double precision unit known as the QPU, capable of delivering up to four fused multiply-add results per processor clock. The L1 (first level) data cache is 16 kilobytes with 64-byte lines.
The load-store interface of this processor allows loading and storing of 32 bytes per cycle. Each core has a private level 1 prefetch (L1P) unit, which accepts, decodes, and dispatches all requests raised by the core. Each entry in the L1P unit can hold 128 bytes. The L2 (second level) cache is shared: all 16 compute cores can access it. The L2 cache comprises 16 slices connected via a shared crossbar switch. Each L2 cache slice is 16-way set associative, write-back, with a capacity of 2 megabytes. To achieve even slice utilization, physical addresses are scattered across the slices using a programmable hash function. A direct consequence of this design for hardware performance analysis tools, including HPM, is that the L2 cache related hardware performance metrics are reliably available at the node level only. The L2 cache supports novel features such as memory speculation and atomic memory updates. Prefetching into the L2 cache is directed by hints from the L1P unit. The aggregate read bandwidth from all slices is GB/s, while the write bandwidth is GB/s. The list prefetch and stream prefetch modes of the L1P unit are another interesting innovation, whose goal is to provide more control over data prefetch. There are two modes of data prefetch aiming to reduce the latency of access to the shared L2 cache: the stream prefetch mode and the list prefetch mode. Data is prefetched into the L1 prefetch buffer (128-byte lines, 32 prefetch lines). If the memory access pattern is

contiguous, the stream prefetch unit can offer low-latency access to data by prefetching additional contiguous data in the same sequence. The stream prefetch unit is adaptive: since four threads per core accessing multiple streams can put increased pressure on the L1 cache, the depth of stream prefetching is adjusted so that the total depth of all streams fits within the 32 prefetch lines. The stream prefetch engine can also be programmed to reduce the length of the prefetch streams. When access to memory locations is not contiguous but the same pattern is repeated many times, list prefetch is useful. This type of access pattern is often encountered in scientific computing codes such as iterative solvers. The stream prefetch unit provides guidance to the list prefetch unit by recording the memory address pattern where cache misses occur. Once this pattern is recorded, the list prefetch unit can target the prefetch of memory addresses where cache misses are anticipated. In addition to traditional hardware performance monitoring metrics, the HPM on Blue Gene/Q incorporates counters and derived metrics composed of data from the list prefetch, stream prefetch, and other features. Though the Blue Gene/Q processor is designed as a throughput processor favouring multi-threaded application performance over single-thread performance, there are several innovations to improve the performance of threaded regions. Programming interfaces and compiler support for these features are available, and HPM supports analysis of counter data for a subset of them. The first feature designed to improve multi-thread performance is speculative memory. The L2 cache on Blue Gene/Q is multi-versioning: it can separate state changes caused by a speculative thread from the main memory state.
At the end of the speculative section, these changes are either invalidated or committed. In case of conflicting accesses to the same memory locations from different threads, the L2 can detect the conflict and the speculative runtime system can reason about the correctness of the concurrent execution. This enables features like transactional memory and speculative execution, both of which are supported by the IBM XL compilers for Blue Gene/Q. Effective use of the quad floating point unit of the Blue Gene/Q processor is essential for maximizing the performance of this MPP system. The QPX architecture and instruction set are a new Blue Gene/Q specific development. QPX follows a vector single instruction multiple data computing model: there are four elements per vector and four execution lanes, supported by a register file of 256-bit registers, each containing four 64-bit vector elements. Data manipulation across the four execution lanes is also enabled by the new instruction set through compare, convert, and move instructions. Complex arithmetic instructions allow arithmetic operations across adjacent lanes.

Fig. 1. Layout of UPC counters in a ring

Fig. 2. Hardware counter multiplexing

2. UNIVERSAL PERFORMANCE COUNTERS AND THE BGPM API

The universal performance counter (UPC) is a processor hardware component designed to collect performance events from all 17 processor cores, the L1P units, the 16 L2 units, as well as the message unit. The UPC hardware has central (node-shared) as well as distributed (core-local) components; this is a major design change from the UPC unit on Blue Gene/P. The local UPC unit is named UPC P, while the central UPC unit is named UPC C. Each UPC P unit provides counters for processor core, L1P, completed instruction, and floating point operation events. The UPC L2 unit has 16 counters for each L2 slice, while the I/O units have 43 counters. The central UPC unit (UPC C) contains 64-bit counters into which the UPC P counters are accumulated, and it provides overflow detection and counter aggregation operations. Figure 1 illustrates the layout of the UPC counter components. The UPC P and UPC C units together provide the functionality to implement three modes of operation; HPM uses the distributed mode, which collects 24 counters from each UPC P unit. Figure 2 illustrates the details of the processor event selection logic configurable by the punit logic. Blue Gene/Q's hardware performance monitor implements more than 600 events and other selected filtered opcodes. All these events contend for the 24 counter registers per UPC P unit, and not all combinations of events can be counted simultaneously. These restrictions necessitate design changes in the HPM implementation. The performance monitoring API on Blue Gene/Q is organized hierarchically, as illustrated in Figure 3. The software layers interact with the compute node kernel (the operating system on Blue Gene/Q) and other system software components through a hierarchy of application programming interfaces.
The BGPM (Blue Gene performance monitoring) API is intended for use by developers of performance tools. HPM and PAPI are intended for use by application developers and implement instrumentation functionality. The BGPM layer is used by both HPM and PAPI for accessing the hardware performance monitoring units. HPM is the end-user facing component and implements functionality to instrument application code and to control the formatting of output.

Fig. 3. HPM's software layers

The software layers used by HPM (illustrated in Figure 3) include BGPM, the universal performance counter interface, the compute node kernel, and the UPC low-level SPI. The UPC low-level SPI layer interfaces with hardware components to expose functionality to higher layers of system software. The system programming interface named UPCI (universal performance counter interface) can use both the UPC low-level interface and the compute node kernel's functions. The BGPM API uses system calls to the compute node kernel as well as UPCI API functions. The HPM software is built on the software distributed mode implemented in the BGPM API.

3. HARDWARE PERFORMANCE MONITOR

Corresponding to the thread-safe and non-thread-safe versions of the IBM XL compilers, there are two versions of HPM: a process-only version and a thread-safe version. The process-only version, named libmpihpm, is intended for MPI-only programs compiled with the non-thread-safe compiler; libmpihpm_r is the thread-safe version, designed for OpenMP applications. The HPM libraries use BGPM to initialize, start, stop, and read the hardware counters. Depending on the user-selected options, HPM calculates counter values and derived metrics and formats the output files. The HPM libraries are used in a C/C++ or Fortran application by enclosing code segments within HPM API calls. The initialization step sets up the data structures necessary for HPM to operate, and can be performed from an MPI application. The hpminit() call from C and the f_hpminit call from Fortran initialize HPM's data structures. The hpmstart (C) or f_hpmstart (Fortran) call starts HPM's counter recording. If no environment variables are set, the default event set is enabled for counting. The hpmstart and hpmstop calls enclose the code segment to be profiled.
In the multi-threaded version, the hpmstart and hpmstop calls enclose a threaded section of code. The threaded parallel region must be

started after hpmstart and stopped before hpmstop. The hpmstop function stops the currently selected sets of events and records the final counter values. An example C code is listed in Figure 4, while Figure 5 lists the Fortran example.

// Include the HPM header files
#include "hpm.h"
#include "mpi.h"

// Initialize MPI
ierr = MPI_Init(&argc, &argv);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
hpminit();

hpmstart("MainLoop");   // MainLoop = label for the code segment
// Code to be profiled
hpmstop("MainLoop");    // stop the code segment
// Note: labels started must match the stopped code segments

// Terminate HPM
hpmterminate();
MPI_Finalize();

Fig. 4. Example usage of HPM from C code

! Fortran code to be preprocessed using CPP
! Naming the input file as filename.F or filename.F90 enables preprocessing
#include "f_hpm.h"

call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world, my_id, ierr)
call mpi_comm_size(mpi_comm_world, numprocs, ierr)
call f_hpminit()

! MainLoop = label name and 8 = length of the label name
call f_hpmstart("MainLoop", 8)
! Code to be profiled
call f_hpmstop("MainLoop", 8)

call f_hpmterminate()
call mpi_finalize(ierr)

Fig. 5. Example usage of HPM from Fortran code

The instrumented application must be linked against HPM before it is executed. Example Makefile sections for linking HPM into the application are shown in Figure 7.

// Include the HPM header files
#include "hpm.h"
#include "mpi.h"

// Initialize MPI
ierr = MPI_Init(&argc, &argv);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
hpminit();

hpmstart("OuterLoop");    // OuterLoop = label for the outer code segment
// Code to be profiled
hpmstart("InnerLoop");    // InnerLoop = label for the inner code segment
// Code to be profiled
hpmstop("InnerLoop");     // stop the inner code segment
// Code to be profiled
hpmstop("OuterLoop");     // stop the outer code segment
// Note: labels started must match the stopped code segments

// Terminate HPM
hpmterminate();
MPI_Finalize();

Fig. 6. Nesting of HPM calls

# Declare the make variables
CC = bgxlc_r -I/bgsys/drivers/ppcfloor/comm/xl/include -Ilibmpihpm_r/src
CFLAGS = -O3 -qarch=qp -qtune=qp -qhot -qsimd=auto -qsmp=omp:noauto
LFLAGS = $(CFLAGS)

# MPI libraries
MPI_LIBS = -L/bgsys/drivers/ppcfloor/comm/gcc/lib -lmpich -lmpl -lopa \
           -L/bgsys/drivers/ppcfloor/comm/sys/lib -lpami \
           -L/bgsys/drivers/ppcfloor/spi/lib -lSPI_cnk -lSPI_upci_cnk -lrt -lpthread -lm

SYS_INC = $(MPI_INC) -I/bgsys/drivers/$(FLOOR) -I/bgsys/drivers/ppcfloor/spi/include/kernel/cnk
SYS_LIB = /bgsys/drivers/$(FLOOR)/bgpm/lib/libbgpm.a
HPM_LIB = ../../libmpihpm/src/libhpm.a

# Linking executable step
# Link executable to HPM_LIB, SYS_LIB and MPI_LIBS

Fig. 7. Linking the HPM libraries

3.1. Environment variables

HPM's functionality can be controlled using several environment variables. The list that follows documents the HPM environment variables and their functionality.

HPM_EVENT_SET =

  -1 : the default setting; corresponds to a basic set of non-multiplexed counters
   0 : multiplexed event set providing information about total cycles, instructions, and LSU events
   1 : multiplexed event set to explore branch prediction
   2 : multiplexed event set presenting data about the floating point instruction mix
   3 : multiplexed event set with a mix of different counters
   4 : multiplexed event set for stream prefetching events

HPM_SCOPE - applies only to the non-threaded version
  process : prints output per MPI process
  node : prints output per node

HPM_ASC_OUTPUT : prints output as a text file; yes (default) / no
HPM_CSV_OUTPUT : prints output as a CSV file; yes / no (default)
HPM_VIZ_OUTPUT : prints output as a viz file; yes / no (default)

The output-related environment variables are not mutually exclusive, allowing HPM to write output in several formats at the same time.

HPM_EXCLUSIVE : calculates output of outer nested sections exclusive of inner nested sections; yes / no (default)
HPM_IO_BATCH : reduces the number of output files simultaneously opened by HPM; yes / no (default)

3.2. HPM multiplexing

The BGPM layer implements a feature to multiplex hardware counters, switching between different sets of events while measuring counter data. The user still selects an event set and enables multiplexing. Since counter conflicts on Blue Gene/Q depend on the number of events and the scope of measurement (node, process, or thread level), the BGPM layer implements multiplexing so that it can be reused by user tools. When multiplexing is enabled, BGPM chooses the number of groups to be used. The period of multiplexing is set by HPM using BGPM API calls to minimize overhead while retaining accuracy; the HPM output is accurate even if the compute time is as low as a second.

3.3. Exploring stream and list prefetch using HPM

Stream prefetching on Blue Gene/Q can be programmatically modified.
When the user selects HPM_EVENT_SET=4, these prefetch events are enabled for counting. The available stream and list prefetch events are L1P streams established, L1P stream write invalidate, L1P list hit, and stream depth adaptation events. The L1P list hit metric can be used to explore whether list prefetching is effective. The L1P stream depth adaptation events give an indication of the number of times the stream depth had to be modified; this event, together with the L1P stream write invalidate event, is an indicator of the effectiveness of stream prefetch.

3.4. HPM design

The HPM header file defines a set of macros which replace HPM function calls with calls incorporating the details of the location from which the HPM calls are made. The hpminit function initializes a set of common data structures used by HPM, as well as the BGPM data structures. The main data structures used by HPM are defined in a common header file: arrays to track the code blocks' labels, the number of times each code block is started and stopped, the wall time per code block, and the counter data per code block. A function named index_from_label converts the string label entered by the user to a numeric index, which is used to tally the number of times a code block is visited as the program executes. The counter data is tracked per process for every code block. The maximum number of code blocks allowed by HPM is 20, and the maximum length of a code block label is 80 characters. To keep track of the process number and the number of cores used by a process, a set of mask arrays is defined. Each process maintains, in an array named myslots, information about which slot it should update its counter data to; once the counter data is read, the slots corresponding to the process are updated. HPM prints the labels of the counters used as descriptive text; to translate a counter number to its descriptive text, counter labels are mapped to counter numbers in a function named set_labels. When the hpmstart function is called, it converts the code block label to a code block index. It registers the source file and line number of the start call, as well as the nesting level of this hpmstart call. The slots corresponding to the process are identified.
Then the starting counter values as well as the wall clock time values are recorded for the current code block. The number of events in a specified event set is known in advance, since the event sets are hard coded. The starting values of the events recorded are updated per code block index. The hpmstop function performs steps similar to hpmstart: it converts the code block label to a code block index, reads the counter values, and records them in the array locations of the specified code block index. The hpmterminate function processes the data extracted by HPM and converts it to a form that can be printed in the various formats requested by the user. This function handles the environment variables related to output formats, the calculation of exclusive and inclusive metrics, and the writing of output in batches to reduce file system impact. hpmterminate calls a function named printderived to compute and print the derived metrics; printderived checks whether the counters needed to compute each derived metric are available in the selected event set, and if all components are available it prints the output.

3.5. HPM's overhead measurements

HPM's instrumentation overhead is minimal, since it uses hardware performance counters for measurements. To validate the low overhead, a test application was run at varying lengths of compute time from within three nested HPM regions. The run time of the original, uninstrumented code was measured first. The application was then instrumented using HPM and its run time measured with a non-multiplexed event set, and again with a multiplexed event set. The differences in run times are presented in Figure 8.

Fig. 8. HPM's overhead measurements. (The figure plots time in seconds against number of processes for three cases - without HPM, with a non-multiplexed event set, and with a multiplexed event set - with one panel for a short compute time and one for a larger compute time.)

HPM derived metric definitions:

Million Instructions Per Second : total number of instructions (in millions) divided by total time in seconds
Cycles Per Instruction : total number of cycles divided by total number of instructions
MegaFlops per second : total flops (in millions) divided by total time in seconds
Total AXU Instructions : total number of AXU instructions (in millions)
Total AXU Instructions committed per second : total number of AXU instructions committed per second
Percentage of FP instructions to total instructions : total number of AXU instructions divided by total number of all instructions, expressed as a percentage
Instructions per load/store : total number of instructions divided by the total number of load and store instructions
L1 loads per load miss : total number of L1 loads per load miss
L1 stores per store miss : total number of L1 stores per store miss
L1 cache hit percentage : (1.0 - (total L1 misses / total L1 accesses)) * 100
Percentage of branch mispredictions : total number of branches mispredicted divided by total number of branches, expressed as a percentage

Fig. 9. List of derived metrics

3.6. HPM derived metrics

Derived metrics are values computed from one or more hardware counter values. Since raw hardware counter data is hard to interpret, HPM calculates the derived metrics, which are available in the ASCII and CSV output formats. A derived metric is computed automatically if its component counters are available in the selected event set. The definitions of the derived metrics are listed in Figure 9.

3.7. HPM Peekperf output

Fig. 10. Peekperf view of HPM output

Peekperf is a graphical user interface that displays HPM data in a browsable format. When the HPM_VIZ_OUTPUT environment variable is set, HPM produces peekperf-compatible output, which displays the source code and the counter output in an integrated view.


More information

ECE 341 Final Exam Solution

ECE 341 Final Exam Solution ECE 341 Final Exam Solution Time allowed: 110 minutes Total Points: 100 Points Scored: Name: Problem No. 1 (10 points) For each of the following statements, indicate whether the statement is TRUE or FALSE.

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

Blue Gene/Q User Workshop. Performance analysis

Blue Gene/Q User Workshop. Performance analysis Blue Gene/Q User Workshop Performance analysis Agenda Code Profiling Linux tools GNU Profiler (Gprof) bfdprof Hardware Performance counter Monitors IBM Blue Gene/Q performances tools Internal mpitrace

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Detection and Analysis of Iterative Behavior in Parallel Applications

Detection and Analysis of Iterative Behavior in Parallel Applications Detection and Analysis of Iterative Behavior in Parallel Applications Karl Fürlinger and Shirley Moore Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Inside Intel Core Microarchitecture

Inside Intel Core Microarchitecture White Paper Inside Intel Core Microarchitecture Setting New Standards for Energy-Efficient Performance Ofri Wechsler Intel Fellow, Mobility Group Director, Mobility Microprocessor Architecture Intel Corporation

More information

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2 Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

Introduction to the MMAGIX Multithreading Supercomputer

Introduction to the MMAGIX Multithreading Supercomputer Introduction to the MMAGIX Multithreading Supercomputer A supercomputer is defined as a computer that can run at over a billion instructions per second (BIPS) sustained while executing over a billion floating

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.

More information

ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors

ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors Jeffrey Dean Jamey Hicks Carl Waldspurger William Weihl George Chrysos Digital Equipment Corporation 1 Motivation

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Basics of Performance Engineering

Basics of Performance Engineering ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently

More information

Manycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar.

Manycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar. phi 1 Manycore Processors phi 1 Definition Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar. Manycore Accelerator: [Definition only for this

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Appendix C Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows

More information

IBM High Performance Computing Toolkit

IBM High Performance Computing Toolkit IBM High Performance Computing Toolkit Pidad D'Souza (pidsouza@in.ibm.com) IBM, India Software Labs Top 500 : Application areas (November 2011) Systems Performance Source : http://www.top500.org/charts/list/34/apparea

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Spring 2011 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim PowerPC-base Core @3.2GHz 1 VMX vector unit per core 512KB L2 cache 7 x SPE @3.2GHz 7 x 128b 128 SIMD GPRs 7 x 256KB SRAM for SPE 1 of 8 SPEs reserved for redundancy total

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Carlo Cavazzoni, HPC department, CINECA

Carlo Cavazzoni, HPC department, CINECA Introduction to Shared memory architectures Carlo Cavazzoni, HPC department, CINECA Modern Parallel Architectures Two basic architectural scheme: Distributed Memory Shared Memory Now most computers have

More information

Profiling: Understand Your Application

Profiling: Understand Your Application Profiling: Understand Your Application Michal Merta michal.merta@vsb.cz 1st of March 2018 Agenda Hardware events based sampling Some fundamental bottlenecks Overview of profiling tools perf tools Intel

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

COSC 6385 Computer Architecture. - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors

COSC 6385 Computer Architecture. - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors COSC 6385 Computer Architecture - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors Fall 2012 References Intel Larrabee: [1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M.

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

The IBM Blue Gene/Q 1 System

The IBM Blue Gene/Q 1 System The IBM Blue Gene/Q 1 System IBM Blue Gene Team [10] IBM Somers NY, USA Abstract We describe the architecture of the IBM Blue Gene/Q system, the third generation in the IBM Blue Gene line of massively

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights The Alpha 21264 Microprocessor: Out-of-Order ution at 600 MHz R. E. Kessler Compaq Computer Corporation Shrewsbury, MA 1 Some Highlights Continued Alpha performance leadership 600 MHz operation in 0.35u

More information

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program

More information

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model. Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

Processors. Young W. Lim. May 12, 2016

Processors. Young W. Lim. May 12, 2016 Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Intel Architecture for Software Developers

Intel Architecture for Software Developers Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software

More information

Great Reality #2: You ve Got to Know Assembly Does not generate random values Arithmetic operations have important mathematical properties

Great Reality #2: You ve Got to Know Assembly Does not generate random values Arithmetic operations have important mathematical properties Overview Course Overview Course theme Five realities Computer Systems 1 2 Course Theme: Abstraction Is Good But Don t Forget Reality Most CS courses emphasize abstraction Abstract data types Asymptotic

More information

COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors

COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors Edgar Gabriel Fall 2018 References Intel Larrabee: [1] L. Seiler, D. Carmean, E.

More information

Performance Analysis with Periscope

Performance Analysis with Periscope Performance Analysis with Periscope M. Gerndt, V. Petkov, Y. Oleynik, S. Benedict Technische Universität München periscope@lrr.in.tum.de October 2010 Outline Motivation Periscope overview Periscope performance

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Final Exam Fall 2008

Final Exam Fall 2008 COE 308 Computer Architecture Final Exam Fall 2008 page 1 of 8 Saturday, February 7, 2009 7:30 10:00 AM Computer Engineering Department College of Computer Sciences & Engineering King Fahd University of

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Complex Pipelining: Superscalar Prof. Michel A. Kinsy Summary Concepts Von Neumann architecture = stored-program computer architecture Self-Modifying Code Princeton architecture

More information

The University of Texas at Austin

The University of Texas at Austin EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Blue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft

Blue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft Blue Gene/Q Hardware Overview 02.02.2015 Michael Stephan Blue Gene/Q: Design goals System-on-Chip (SoC) design Processor comprises both processing cores and network Optimal performance / watt ratio Small

More information

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means High Performance Computing: Concepts, Methods & Means OpenMP Programming Prof. Thomas Sterling Department of Computer Science Louisiana State University February 8 th, 2007 Topics Introduction Overview

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Intel Architecture for HPC

Intel Architecture for HPC Intel Architecture for HPC Georg Zitzlsberger georg.zitzlsberger@vsb.cz 1st of March 2018 Agenda Salomon Architectures Intel R Xeon R processors v3 (Haswell) Intel R Xeon Phi TM coprocessor (KNC) Ohter

More information

Short Notes of CS201

Short Notes of CS201 #includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization

More information

Kaisen Lin and Michael Conley

Kaisen Lin and Michael Conley Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC

More information

CMSC411 Fall 2013 Midterm 2 Solutions

CMSC411 Fall 2013 Midterm 2 Solutions CMSC411 Fall 2013 Midterm 2 Solutions 1. (12 pts) Memory hierarchy a. (6 pts) Suppose we have a virtual memory of size 64 GB, or 2 36 bytes, where pages are 16 KB (2 14 bytes) each, and the machine has

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

Porting Applications to Blue Gene/P

Porting Applications to Blue Gene/P Porting Applications to Blue Gene/P Dr. Christoph Pospiech pospiech@de.ibm.com 05/17/2010 Agenda What beast is this? Compile - link go! MPI subtleties Help! It doesn't work (the way I want)! Blue Gene/P

More information

CS201 - Introduction to Programming Glossary By

CS201 - Introduction to Programming Glossary By CS201 - Introduction to Programming Glossary By #include : The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with

More information

Alexandria University

Alexandria University Alexandria University Faculty of Engineering Computer and Communications Department CC322: CC423: Advanced Computer Architecture Sheet 3: Instruction- Level Parallelism and Its Exploitation 1. What would

More information