IBM PSSC Montpellier Customer Center. Content


Content

- Profiling
  - GNU Profiler (Gprof)
  - Vprof
- Communications Tracing
  - MPI Trace Library
- Hardware Performance Monitors
  - Universal Performance Counters (UPC)
  - HPM Library
- IBM System Blue Gene/P Specifics
  - Personality
  - Kernel Interface
- Multi-Purpose Toolkits
  - HPC Toolkit
- Major Open-Source Tools
  - SCALASCA
  - TAU

Code Profiling

Purpose
- Identify the most time-consuming routines of a binary, in order to determine where the optimization effort has to take place
Standard Features
- Construct a display of the functions within an application
- Help users identify the functions that are the most CPU-intensive
- Charge execution time to source lines
Methods & Tools
- GNU Profiler
- Vprof
Notes
- Profiling can be used on both serial and parallel applications
- Based on sampling (with support from both the compiler and the kernel)
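Because the tools below are sampling-based, it can help to see what statistical sampling means in miniature. The sketch below is not how gprof or Vprof are implemented (gprof relies on compiler-inserted hooks plus kernel support); it is a self-contained illustration using POSIX setitimer/SIGPROF, with the sampling interval and loop bound chosen arbitrarily.

    /* Minimal illustration of statistical sampling: a profiling timer
     * delivers SIGPROF at regular intervals of CPU time, and the handler
     * counts how often execution is interrupted while the hot loop runs. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    static volatile sig_atomic_t samples = 0;

    static void on_sample(int sig)
    {
        (void)sig;
        samples++;               /* a real profiler would record the PC here */
    }

    int main(void)
    {
        struct sigaction sa;
        struct itimerval it;
        double s = 0.0;
        long i;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sample;
        sigaction(SIGPROF, &sa, NULL);

        memset(&it, 0, sizeof(it));
        it.it_interval.tv_usec = 10000;   /* sample every 10 ms of CPU time */
        it.it_value.tv_usec    = 10000;
        setitimer(ITIMER_PROF, &it, NULL);

        for (i = 0; i < 200000000L; i++)  /* the "hot loop" being sampled */
            s += 1.0 / (double)(i + 1);

        printf("sum = %f, samples taken = %ld\n", s, (long)samples);
        return 0;
    }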

GNU Profiler (Gprof) How-To: Collection

- Compile the program with the options -g -pg -qfullpath
  - Creates the symbols required for debugging / profiling
- Execute the program the standard way
  - Execution generates profiling files in the execution directory: gmon.out.<MPI Rank>
  - Binary files, not human-readable
  - The number of files depends on an environment variable:
    - 1 profiling file per process, or
    - 3 profiling files only: one file each for the slowest / fastest / median process
- Two options for interpreting the output files
  - GNU Profiler (command-line utility)
  - Xprofiler (graphical utility, part of the HPC Toolkit)
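To try this workflow end to end, any program built with the flags above will do; the toy code below is a made-up example with one obviously hot routine that should dominate the resulting profile (file name, loop bounds, and the compile line in the comment are illustrative only).

    /* hotspot.c - a toy program with an obvious hot routine, handy for
     * trying the gprof workflow described above.  Illustrative build:
     *   mpixlc -g -pg -qfullpath -o hotspot hotspot.c
     */
    #include <stdio.h>

    static double expensive(int n)
    {
        double s = 0.0;
        int i;
        for (i = 1; i <= n; i++)
            s += 1.0 / (double)i;     /* most samples should land here */
        return s;
    }

    static double cheap(void)
    {
        return 42.0;                  /* should barely appear in the profile */
    }

    int main(void)
    {
        double r = 0.0;
        int i;
        for (i = 0; i < 1000; i++)
            r += expensive(100000) + cheap();
        printf("result = %f\n", r);
        return 0;
    }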

GNU Profiler (Gprof) How-To: Visualization

- gprof generates a profiling report from the profiling output files
- Standard usage: gprof <Binary> gmon.out.<MPI Rank> > gprof.out.<MPI Rank>
- The profiling report is limited compared to standard Unix/Linux gprof output:
  - The subroutines and their relative importance
  - Number of calls

VProf (Visual Profiler) Definition

- The Visual Profiler, VProf, is a project developed for optimizing program performance and evaluating algorithm efficiency
- Provides
  - Routines to collect statistical profiling information
  - Programs to view execution profiles (graphical and command-line)
  - Profile data is used to generate performance summaries sorted by source code line, by file, and by function
- Advantages over Gprof
  - Recompilation not necessary (linking only)
  - Significantly lower performance overhead
  - Profile visualization through cprof is very clear
- Development
  - Apparently no longer actively maintained (Sandia is the last known owner)
  - Integrated into the IBM MPI Trace Library

VProf (Visual Profiler) How-To

- Link with the IBM MPI Trace Library
- Set the appropriate environment variable: VPROF_PROFILE=yes
  - Enables profiling; one profiling file per MPI trace file
- Execution produces profiling files: vmon.out.<MPI Rank>
- Analyze the profiling files with cprof: cprof -e <Binary> <Profiling File>
- The final profiling report contains four sections:
  - File summary
  - Function summary
  - Line summary
  - Source code annotations

VProf (Visual Profiler) Cprof Command-Line Options

    Usage: cprof [options] executable [vmon_file...]
      -d, --directory dir             Search dir for source files
      -D, --recursive-directory dir   Search dir recursively for source files
      -e, --everything                Show all information
      -a, --annotate file             Annotate file
      -n, --number                    Show number of samples (not %)
      -s, --show thres                Set threshold for showing aggregate data
      -H, --html dir                  Output HTML into directory dir

IBM MPI Trace Library Principles

MPI Trace Features
- Collects all MPI communications of an application
- Measures the time spent in the MPI routines
- Provides a call graph for communication subroutines
Usage
- Link with the library: /bgp/usermisc/hhhibm11/libraries/libmpitrace/libmpitrace.a
- Execute the program; various environment variables can be specified (cf. next slide)
- Analyze the trace files (text files, human-readable)
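Because instrumentation happens at link time, no source changes are needed; any MPI program relinked against libmpitrace.a should produce the per-task profile files. The toy ring exchange below is an illustrative example (buffer sizes and file names are made up), whose MPI_Isend / MPI_Irecv / MPI_Waitall calls would show up in the trace summary.

    /* ring.c - minimal MPI program to exercise the trace library.
     * No source changes are needed; just relink, e.g. (illustrative):
     *   mpixlc -o ring ring.c \
     *     /bgp/usermisc/hhhibm11/libraries/libmpitrace/libmpitrace.a
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, left, right;
        double sendbuf[1024] = {0.0}, recvbuf[1024];
        MPI_Request req[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        left  = (rank + size - 1) % size;
        right = (rank + 1) % size;

        /* simple ring exchange: these calls appear in the MPI trace summary */
        MPI_Irecv(recvbuf, 1024, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, 1024, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        MPI_Finalize();   /* trace output is typically written at MPI_Finalize */
        return 0;
    }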

IBM MPI Trace Library Environment Variables

Data Collection Settings
- SWAP_BYTES={no*|yes}: switches output file endianness
- TRACE_DIR=<Directory>: output directory
- TRACE_BUFFER_SIZE=<Size>: buffer size
- SAVE_ALL_TASKS={no*|yes}: save all MPI tasks, or only the maximum / median / minimum tasks
Communications Profiling
- PROFILE_BY_CALL_SITE={no*|yes}: provides the call stack for MPI primitives
- TRACE_ALL_EVENTS={no*|yes}
- TRACE_MAX_RANK=<Rank>
- TRACE_SEND_PATTERN={no*|yes}: builds the point-to-point communication matrix
Output Files
- mpi.profile.<process ID>.<MPI Task #>
- events.trc
- hpmdata.x_y_z.<process ID>
(* marks the default value)

IBM MPI Trace Library Sample Output

The per-task profile reports, for each MPI routine (MPI_Comm_size, MPI_Comm_rank, MPI_Isend, MPI_Irecv, MPI_Waitall, MPI_Reduce, ...), the number of calls, the average message size in bytes, and the time in seconds, followed by overall totals:
- total communication time (seconds)
- total elapsed time (seconds)
- user CPU time (seconds)
- system time (seconds)
- maximum memory size (KBytes)
It also reports message size distributions (#calls, avg. bytes, time) per routine, for example for MPI_Isend, MPI_Irecv, and MPI_Reduce.

Hardware Performance Monitors (HPM) Definition

Definition
- Extra logic inserted in the processor to count specific events
- Updated at every cycle
Strengths
- Non-intrusive
- Very accurate
- Low overhead
Weaknesses
- Provides only raw hardware counts
- Specific to each processor
- Access is not well documented
- Lack of standards and documentation on what is counted
Purpose
- Provides comprehensive reports of events that are critical to performance on IBM systems
- Gathers critical hardware performance metrics, for example:
  - Number of misses at all cache levels
  - Number of floating-point instructions executed
  - Number of instruction loads that cause TLB misses
- Helps identify and eliminate performance bottlenecks

Universal Performance Counters (UPC) Principles

- 256 counters, 64 bits each; a hardware unit on the BG/P chip
- 72 counters are in the clock-x1 domain
  - PowerPC 450 core events: FPU, FP load/store
  - Counters specific to each core
- 184 counters are in the clock-x2 domain
  - L2, L3, memory, networks
  - Counters mostly shared across the node
- BG/P counters are tied to hardware resources, either specific to a core or shared across the node
- There is no process- or thread-level context, but processes and threads are pinned to specific cores

Universal Performance Counters (UPC) Principles (continued)

The counter mode and trigger method are programmable.
Modes:
- Mode 0: info on cores 0 and 1 for the clock-x1 counters, plus a set of 184 counters in the clock-x2 domain
- Mode 1: info on cores 2 and 3 for the clock-x1 counters, plus a different set of 184 counters in the clock-x2 domain
- Modes 2 and 3: primarily intended for hardware designers
Trigger methods: rising edge, default edge, falling edge, level high, level low
- The counters are basically looking at a voltage that can be "high" or "low"
- The edge modes count the number of events (for instance, the counter increments at the rising edge of a low-to-high transition)
- The level (high or low) modes count p-clock cycles while the voltage is high or low
  - This makes it possible to know how many p-clocks went by while waiting for a load request (instead of the number of load requests)
- The first 72 counters are not affected by the trigger, but the memory counters can differ
- The recommendation is to use the default edge or level high trigger

Universal Performance Counters (UPC) Counter Access

The BGP_UPC interface definitions and list of events are in:
- /bgsys/drivers/ppcfloor/arch/include/spi/upc.h
- /bgsys/drivers/ppcfloor/arch/include/spi/upc_events.h

    // every process on the node calls BGP_UPC_Initialize()
    BGP_UPC_Initialize();

    // just one rank per node sets the counter config and zeros the counters
    if (local_rank == 0) {
        BGP_UPC_Initialize_Counter_Config(counter_mode, counter_trigger);
        BGP_UPC_Zero_Counter_Values();
        BGP_UPC_Start(0);
    }

    MPI_Barrier(local_comm);   // communicator local to the node

    // ... do work ...

    MPI_Barrier(local_comm);

    if (local_rank == 0) {
        BGP_UPC_Stop();
        BGP_UPC_Read_Counter_Values(&counter_data, sizeof(struct CounterStruct),
                                    BGP_UPC_READ_EXCLUSIVE);
        // save the counter values from the counter_data structure
        BGP_UPC_Start(0);
    }

    struct CounterStruct {
        int32_t        rank;                      // Rank
        int32_t        core;                      // Core
        int32_t        upc_number;                // UPC number
        int32_t        number_processes_per_upc;  // Number of processes per UPC unit
        BGP_UPC_Mode_t mode;                      // User mode
        int32_t        number_of_counters;        // Number of counter values returned
        char           location[24];              // Location
        int64_t        elapsed_time;              // Elapsed time
        uint32_t       reserved_1;                // Reserved for alignment
        uint32_t       reserved_2;                // Reserved for alignment
        int64_t        values[256];               // Counter values
    } counter_data;

counter_mode = 0, 1, 2, 3 (plus some others, see UPC.h)
counter_trigger = BGP_UPC_CFG_LEVEL_HIGH, BGP_UPC_CFG_EDGE_DEFAULT

Universal Performance Counters (UPC) Usage

- The basic operation is BGP_UPC_Read_Counter_Values(&counter_data, ...), which fills out a structure including 256 counter values (64 bits each)
Caveats
- Reading all of the counters takes a long time, on the order of 10^4 cycles
  - Consequence: in practice, you can only use the counters for coarse-grained measurements
- The BG/P headers (UPC.h) require the GNU compiler (mpicc, powerpc-bgp-linux-gcc) for compilation
  - Consequence: it is best to wrap the counter routines in separately compiled source
- Some counters count events and others count cycles, but one cycle in the clock-x2 domain = two processor cycles
  - Consequence: multiply the value obtained by two to get processor cycles
  - Example: counter 80 (mode 0), BGP_PU0_L2_CYCLES_READ_REQUEST_PENDING with trigger = level high, is the number of memory-bus cycles during which the L2 unit attached to core 0 is waiting on a read request
- Any process or thread running on the node can (and will) trigger the shared counters in the clock-x2 domain
  - Consequence: this needs to be remembered to properly interpret the data; cf. Bob Walkup's documentation for recommendations
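As a small worked example of the multiply-by-two rule, the helper below converts a clock-x2 cycle count into processor cycles and seconds. It assumes the 850 MHz PowerPC 450 clock of Blue Gene/P; the function name and the sample value passed in main are made up for illustration.

    #include <stdint.h>
    #include <stdio.h>

    #define BGP_CPU_HZ 850.0e6   /* PowerPC 450 clock on BG/P (assumed here) */

    /* Hypothetical helper: convert a cycle count read from a clock-x2
     * domain counter into processor cycles and seconds. */
    static void report_x2_cycles(const char *label, int64_t x2_cycles)
    {
        int64_t cpu_cycles = 2 * x2_cycles;        /* 1 x2 cycle = 2 CPU cycles */
        double  seconds    = (double)cpu_cycles / BGP_CPU_HZ;
        printf("%s: %lld processor cycles (%.6f s)\n",
               label, (long long)cpu_cycles, seconds);
    }

    int main(void)
    {
        /* dummy value for illustration, e.g. counter_data.values[80] */
        report_x2_cycles("L2 read-request pending (core 0)", 123456789LL);
        return 0;
    }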

Universal Performance Counters (UPC) Other Counter Interfaces

- All hardware counter interfaces for BG/P are layered on top of BGP_UPC
- The BGP_UPC layer is provided, so you can write your own interfaces
- PAPI 4.0 has been ported to BG/P; some information has been posted by Argonne National Laboratory
- The HPC Toolkit provides documentation in HPM_ug.pdf; there is no hpmcount or hpmstat for BG/P, just libhpm.a
  - The environment variable HPM_EVENT_SET selects the counter mode (0, 1, 2, 3; default is 0)
  - The default trigger method was previously edge rise (can be set by the user)

    #include <libhpm.h>

    hpminit(rank, program);
    hpmtstart(number, label);
    do_work();
    hpmtstop(number);
    hpmterminate(rank);   // prints counter values etc.

Universal Performance Counters (UPC) Higher-Level Interfaces

- All hardware counter interfaces for BG/P are layered on top of BGP_UPC
- You can write your own interfaces
- PAPI 4.0 has been ported to BG/P
  - Requires application of patch 003
  - Provides a common interface to many third-party tools (Scalasca, TAU)
- Useful information has been posted by Argonne National Laboratory

HPM: IBM MPI Library How-To

Principle
- The IBM MPI library provides a very easy-to-use implementation of HPM extraction
Usage
- Link with the IBM MPI library
- Execute with the following environment variables:
  - BGP_STATS={0*|1}
  - BGP_STATS_INTERVAL=<Interval (seconds)>
- Execution produces one HPM file per MPI task: hpmdata.<coordinates>.<job ID>

HPM: HPM Library How-To

Instrument the code for HPM measurement:
- call hpm_init()          Initializes counters
- call hpm_start('label')  Starts counting a labeled block
- call hpm_stop('label')   Stops counting a labeled block
- call hpm_print()         Prints counter values and labels
Link with the HPM library.
Execution produces one HPM data file per MPI task: hpm_data.<MPI Rank>

HPM Library How-To

libmpihpm.a
- Uses the MPI profiling interface: starts the BG/P UPC counters in MPI_Init(), stops them in MPI_Finalize(), and produces two counter output files:
  - One text summary with min, max, and average counter values
  - One binary file with all counter data from every node
- The getcounts command can be used to pull out the data for a given node from the aggregate binary file; nodes are numbered in x, y, z order on the partition that the job ran on
- Environment variables:
  - BGP_COUNTER_MODE={0,1,2,3} (default = 0)
  - BGP_COUNTER_TRIGGER={edge, high} (default = high)
- This utility provides aggregate flops for the whole job, from start to finish, along with MPI statistics, but it cannot be used to measure specific code blocks

A simple start/stop interface can be called from Fortran, C, or C++ to get counts around specific code blocks, with one output file per node.

Fortran interface:
    call hpm_init()          ! one time, to initialize counters
    call hpm_start('label')  ! start counting a labeled block
    call hpm_stop('label')   ! stop counting a labeled block
    call hpm_print()         ! print counter values and labels, once at the end

C interface (add extern "C" for C++):
    void HPM_Init(void);
    void HPM_Start(char * label);
    void HPM_Stop(char * label);
    void HPM_Print(void);
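A minimal sketch of how the C start/stop interface above might be wrapped around a specific code block. The header name and the compute kernel are assumptions; only the HPM_Init / HPM_Start / HPM_Stop / HPM_Print calls come from the interface listed above.

    #include <mpi.h>
    #include <libhpm.h>       /* assumed header name for the start/stop interface */

    /* illustrative compute kernel (not part of the library) */
    static void compute_kernel(void)
    {
        volatile double s = 0.0;
        int i;
        for (i = 0; i < 1000000; i++)
            s += (double)i * 1.0e-6;
    }

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        HPM_Init();                 /* once, before any counted region */

        HPM_Start("compute");       /* the label identifies the block in the output */
        compute_kernel();
        HPM_Stop("compute");

        HPM_Print();                /* once at the end; one output file per node */

        MPI_Finalize();
        return 0;
    }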

Automatically Available Performance Counters

Principle
- By providing a hook into the MPI_Init and MPI_Finalize functions, counters are enabled before an application runs, and the results are collected and summarized before the application exits
- Once this feature is enabled, no user intervention is required to collect the performance counter data, but options are provided at run time to change counter modes, counter triggers, and counter data output directories
- It is also possible to disable the collection of performance counter data at run time
How-To
- Source the file /bgsys/drivers/ppcfloor/tools/autoperfcounters/enableautoperfcounters

Personality Definition

A double definition:
- Static data given to every Compute Node and I/O Node at boot time by the control system
  - Personality data contains information that is specific to the node
- A set of C language structures and functions that allows querying the personality data from the node
  - Useful to determine, at run time, where the tasks of the application are running
  - Might be used to tune certain aspects of the application at run time, such as determining which set of tasks share the same I/O Node and then optimizing the network traffic from the Compute Nodes to that I/O Node

Personality Usage Elements

Two include files (in directory /bgsys/drivers/ppcfloor/arch/include):
    #include <common/bgp_personality.h>
    #include <common/bgp_personality_inlines.h>
Structure:
    _BGP_Personality_t personality;
Query function:
    Kernel_GetPersonality(&personality, sizeof(personality));

Personality Provided Information

- personality.Network_Config.[X|Y|Z]nodes: number of X / Y / Z nodes in the torus
- personality.Network_Config.[X|Y|Z]coord: X / Y / Z node coordinates in the torus
- Kernel_PhysicalProcessorID(): core ID on the Compute Node (0, 1, 2, 3)
- BGP_Personality_getLocationString(&personality, location): location string Rxx-Mx-Nxx-Jxx

Personality Example

    #include <stdio.h>
    #include <mpi.h>
    #include <spi/kernel_interface.h>
    #include <common/bgp_personality.h>
    #include <common/bgp_personality_inlines.h>

    int main(int argc, char *argv[])
    {
        int taskid, ntasks;
        int memory_size_mbytes;
        int node_config;
        int xcoord, ycoord, zcoord, xsize, ysize, zsize;
        int pset_num, pset_size, pset_rank, procid;
        char location[128];           /* buffer for the location string */
        _BGP_Personality_t personality;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        Kernel_GetPersonality(&personality, sizeof(personality));

        memory_size_mbytes = personality.DDR_Config.DDRSizeMB;
        printf("memory size = %d MBytes\n", memory_size_mbytes);

        node_config = personality.Kernel_Config.ProcessConfig;
        if (node_config == _BGP_PERS_PROCESSCONFIG_SMP)
            printf("SMP mode\n");
        else if (node_config == _BGP_PERS_PROCESSCONFIG_VNM)
            printf("virtual-node mode\n");
        else if (node_config == _BGP_PERS_PROCESSCONFIG_2x2)
            printf("dual mode\n");
        else
            printf("unknown mode\n");

        xcoord = personality.Network_Config.Xcoord;
        ycoord = personality.Network_Config.Ycoord;
        zcoord = personality.Network_Config.Zcoord;
        xsize  = personality.Network_Config.Xnodes;
        ysize  = personality.Network_Config.Ynodes;
        zsize  = personality.Network_Config.Znodes;

        pset_num  = personality.Network_Config.PSetNum;
        pset_size = personality.Network_Config.PSetSize;
        pset_rank = personality.Network_Config.RankInPSet;

        BGP_Personality_getLocationString(&personality, location);
        procid = Kernel_PhysicalProcessorID();

        printf("task %d of %d: coords (%d,%d,%d) in %dx%dx%d torus, core %d, "
               "pset %d (rank %d of %d), location %s\n",
               taskid, ntasks, xcoord, ycoord, zcoord, xsize, ysize, zsize,
               procid, pset_num, pset_rank, pset_size, location);

        MPI_Finalize();
        return 0;
    }

Kernel Interface

Main include file (in directory /bgsys/drivers/ppcfloor/arch/include):
    #include <spi/kernel_interface.h>
Query functions:
- Kernel_GetMemorySize: retrieves memory information from the kernel
- Kernel_ActiveProcessCount: retrieves the number of active processes in the kernel

IBM HPC Toolkit

Toolkit Content
- Hardware (CPU) performance: Xprofiler, HPM Toolkit
- Message-passing performance: MPI Profiler / Tracer
- I/O performance: Modular I/O (MIO)
- Performance visualization: PeekPerf
Supported Platforms
- AIX: AIX 5.3, AIX 6.1
- Linux on POWER: Red Hat 5.2, SLES 10
- IBM System Blue Gene/P
Support via the Advanced Computing Technology Center in Research (ACTC)
Link: ct.index.html

HPC Toolkit Xprofiler

- Visualizes CPU time profiling data
- Compile and link with the -g -pg flags plus optimization
- Code execution generates a gmon.out file
  - MPI applications generate gmon.out.1, ..., gmon.out.n
- Analyze the gmon.out file with Xprofiler: Xprofiler a.out gmon.out
Important factors
- On AIX the time-sampling interval is 0.01 sec
- Profiling introduces overhead due to function calls

HPC Toolkit Xprofiler GUI Overview Window

- Width of a bar: time including called routines
- Height of a bar: time excluding called routines
- Call arrows are labeled with the number of calls
- Overview window for easy navigation (View -> Overview)

HPC Toolkit Xprofiler GUI Source Code Window

- The source code window displays the source code with its time profile (in ticks = 0.01 sec)
- Access
  - Select a function in the main display, then use the context menu
  - Select a function in the flat profile, then Code Display -> Show Source Code

HPC Toolkit Xprofiler GUI Disassembler Code

HPC Toolkit Message-Passing Performance

MP_Profiler Library
- Captures summary data for MPI calls
- Source code traceback
- The user MUST call MPI_Finalize() in order to get output files
- No changes to the source code are required
- MUST compile with -g to obtain source line number information
MP_Tracer Library
- Captures timestamped data for MPI calls
- Source traceback

MP_Profiler Output with Peekperf

MP_Profiler Message Size Distribution

The message size distribution report lists, for each MPI function, the number of calls, the message size bin, the number of bytes, and the wall time, with one row per (function, size bin) pair. Size bins range from small messages up through 1K...4K, 4K...16K, 16K...64K, 64K...256K, 256K...1M, and 1M...4M. In the example output, MPI_Comm_size, MPI_Comm_rank, MPI_Isend, MPI_Irecv, MPI_Waitall, and MPI_Barrier appear.

MP_Tracer Output with Peekperf

HPM Toolkit Components

libhpc
- Library for program (including multi-threaded) section instrumentation
- Environment variables: HPM_EVENT_SET=[0-3], HPM_UNIQUE_FILE_NAME={0|1} (the latter not available on Blue Gene/P)
hpccount (not available on Blue Gene/P)
- Starts the application and provides:
  - Wall clock time
  - Hardware performance counter information
  - Resource utilization statistics
hpcstat (not available on Blue Gene/P)
- Provides system-wide reports for root

HPM Toolkit libhpm

- Insert libhpc library calls in the source code and instrument different sections independently
- Supports Fortran, C, and C++
- Provides for each instrumented section:
  - Total count and duration (wall clock time)
  - Hardware performance counter information
  - Derived metrics
- Provides resource usage statistics for the total execution of the instrumented program
- Supports MPI, OpenMP, and pthreads
- Multiple instrumentation points, nested instrumentation, and multiple calls to an instrumented point are all supported (see the sketch below)
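A sketch of how nested instrumentation and repeated calls to one instrumented point might look, using the calls shown earlier for the HPC Toolkit HPM interface (hpminit / hpmtstart / hpmtstop / hpmterminate). The section numbers, labels, and work routines are illustrative, and the exact signatures may differ from the installed library.

    #include <libhpm.h>   /* header as shown on the earlier counter-interface slide */

    /* illustrative work routines (not part of the toolkit) */
    static void halo_exchange(void) { /* ... */ }
    static void step(void)          { /* ... */ }

    /* Nested instrumentation with numbered, labeled sections: an outer
     * section wraps the whole time loop, while an inner section is entered
     * once per iteration (multiple calls to the same instrumented point). */
    void timeloop(int rank, int nsteps)
    {
        int i;
        hpminit(rank, "timeloop_example");
        hpmtstart(1, "timeloop");
        for (i = 0; i < nsteps; i++) {
            hpmtstart(2, "halo_exchange");
            halo_exchange();
            hpmtstop(2);
            step();
        }
        hpmtstop(1);
        hpmterminate(rank);   /* prints counter values for all sections */
    }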

HPM Toolkit libhpm Output

Textual: a summary report for each task, perfhpm<taskid>.<pid>, for example:

    libhpm (V 2.6.0) summary
    Total execution time of instrumented code (wall time): seconds

    Instrumented section: 3 - Label: job 1 - process: 1
    file: sanity.c, lines: 33 <--> 70
    Count: 1
    Wall Clock Time: seconds
    BGL_FPU_ARITH_MULT_DIV (Multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)) : 0
    BGL_FPU_LDST_DBL_ST ( ) : 23
    BGL_UPC_L3_WRBUF_LINE_ALLOC (Write buffer line was allocated) : 1702

Peekperf performance file: hpm<taskid>_<progname>_<pid>.viz
Table performance file: tb_hpm<taskid>.<pid>

HPM Toolkit libhpm Output with Peekperf

Environment Flags

- HPM_EVENT_SET: selects the event set to be recorded; integer (0-15)
- HPM_NUM_INST_PTS: overrides the default of 100 instrumentation sections in the application; integer value > 0
- HPM_WITH_MEASUREMENT_ERROR: deactivates the procedure that removes measurement errors; True or False (0 or 1)
- HPM_OUTPUT_NAME: defines an output file name different from the default; string
- HPM_VIZ_OUTPUT: indicates whether a .viz file (for input to PeekPerf) should be generated; True or False (0 or 1)
- HPM_TABLE_OUTPUT: indicates whether a table text file should be generated; True or False (0 or 1)

Peekperf

- Visualization and analysis tool
- Offline analysis and viewing capability
- Supported platforms: AIX, Linux (POWER/Intel), Windows (Intel), Blue Gene

MP_Profiler Visualization Using PeekPerf

MP_Tracer Visualization Using PeekPerf

HPM Visualization Using PeekPerf

Modular I/O Performance Tool (MIO)

I/O Analysis
- Trace module: summary of file I/O activity plus a binary events file
- Low CPU overhead
I/O Performance Enhancement Library
- Prefetch module: optimizes asynchronous prefetch and write-behind
- System buffer bypass capability
- User-controlled pages (size and number)
Recoverable Error Handling
- Recover module: monitors return values and errno, and reissues failed requests
Remote Data Server
- Remote module: simple socket protocol for moving data
Shared object library for AIX

Performance Visualization

Example plot: JFS performance with vmtune -p20 -P80 -f120 -F128 -r2 -R8, showing file position (bytes) of reads and writes over time (seconds).

Scalasca Definition

- Scalasca = SCalable performance Analysis of LArge SCale Applications
- Performance measurement and analysis tool developed by the Innovative Computing Laboratory (ICL) and the Jülich Supercomputing Centre (JSC)
- Scalable trace analysis tool
- Scalasca analyzes separate local trace files in parallel by replaying the original communication on as many CPUs as were used to execute the target application itself

Scalasca Usage

Easy to use
- No source code modification; wrapper functions for compilation and execution
- Recompilation required
Three-stage process
1. Instrument: prepare application objects and the executable for measurement
       scalasca -instrument [options] <compile-or-link-command>
2. Analyze: run the application under control of the measurement system
       scalasca -analyze [options] <application-launch-command>
3. Examine: interactively explore the measurement analysis report
       scalasca -examine [options] <experiment-archive|report>

Scalasca Graphical User Interface

Scalasca Personal Experience Feedback

- Hard to install but easy to use
  - Exception: the mpirun command line is a pain in the neck
- More useful than standard profiling / MPI tracing?
  - Rich but complex GUI; requires X11 forwarding or VNC
- Probably mandatory for very large numbers of nodes, where standard profiling reaches its limits
  - But performance analysis is often performed on a lower number of nodes

TAU

- TAU = Tuning and Analysis Utilities
- Program and performance analysis tool framework being developed for the DOE Office of Science, ASC initiatives at LLNL, the ZeptoOS project at ANL, and the Los Alamos National Laboratory
- Provides a suite of static and dynamic tools that offer graphical user interaction and interoperation, forming an integrated analysis environment for parallel Fortran, C++, C, Java, and Python applications

More information