IBM PSSC Montpellier Customer Center. Content


Content

- Profiling
  - GNU Profiler (Gprof)
  - Vprof
- Communications Tracing
  - MPI Trace Library
- Hardware Performance Monitors
  - Universal Performance Counters (UPC)
  - HPM Library
- IBM System Blue Gene/P Specifics
  - Personality
  - Kernel Interface
- Multi-Purpose Toolkits
  - HPC Toolkit
- Major Open-Source Tools
  - SCALASCA
  - TAU

Code Profiling

Purpose
- Identify the most time-consuming routines of a binary, in order to determine where the optimization effort has to take place
Standard Features
- Construct a display of the functions within an application
- Help users identify the functions that are the most CPU-intensive
- Charge execution time to source lines
Methods & Tools
- GNU Profiler
- Vprof
Notes
- Profiling can be used on both serial and parallel applications
- Based on sampling (with support from both the compiler and the kernel)
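Because the tools below are sampling-based, it can help to see what statistical sampling means in miniature. The sketch below is not how gprof or Vprof are implemented (gprof relies on compiler-inserted hooks plus kernel support); it is a self-contained illustration using POSIX setitimer/SIGPROF, with the sampling interval and loop bound chosen arbitrarily.

    /* Minimal illustration of statistical sampling: a profiling timer
     * delivers SIGPROF at regular intervals of CPU time, and the handler
     * counts how often execution is interrupted while the hot loop runs. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    static volatile sig_atomic_t samples = 0;

    static void on_sample(int sig)
    {
        (void)sig;
        samples++;               /* a real profiler would record the PC here */
    }

    int main(void)
    {
        struct sigaction sa;
        struct itimerval it;
        double s = 0.0;
        long i;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sample;
        sigaction(SIGPROF, &sa, NULL);

        memset(&it, 0, sizeof(it));
        it.it_interval.tv_usec = 10000;   /* sample every 10 ms of CPU time */
        it.it_value.tv_usec    = 10000;
        setitimer(ITIMER_PROF, &it, NULL);

        for (i = 0; i < 200000000L; i++)  /* the "hot loop" being sampled */
            s += 1.0 / (double)(i + 1);

        printf("sum = %f, samples taken = %ld\n", s, (long)samples);
        return 0;
    }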

GNU Profiler (Gprof) How-To: Collection

- Compile the program with the options -g -pg -qfullpath
  - Creates the symbols required for debugging / profiling
- Execute the program the standard way
  - Execution generates profiling files in the execution directory: gmon.out.<MPI Rank>
  - Binary files, not human-readable
  - The number of files depends on an environment variable:
    - 1 profiling file per process, or
    - 3 profiling files only: one file each for the slowest / fastest / median process
- Two options for interpreting the output files
  - GNU Profiler (command-line utility)
  - Xprofiler (graphical utility, part of the HPC Toolkit)
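To try this workflow end to end, any program built with the flags above will do; the toy code below is a made-up example with one obviously hot routine that should dominate the resulting profile (file name, loop bounds, and the compile line in the comment are illustrative only).

    /* hotspot.c - a toy program with an obvious hot routine, handy for
     * trying the gprof workflow described above.  Illustrative build:
     *   mpixlc -g -pg -qfullpath -o hotspot hotspot.c
     */
    #include <stdio.h>

    static double expensive(int n)
    {
        double s = 0.0;
        int i;
        for (i = 1; i <= n; i++)
            s += 1.0 / (double)i;     /* most samples should land here */
        return s;
    }

    static double cheap(void)
    {
        return 42.0;                  /* should barely appear in the profile */
    }

    int main(void)
    {
        double r = 0.0;
        int i;
        for (i = 0; i < 1000; i++)
            r += expensive(100000) + cheap();
        printf("result = %f\n", r);
        return 0;
    }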

GNU Profiler (Gprof) How-To: Visualization

- gprof generates a profiling report from the profiling output files
- Standard usage: gprof <Binary> gmon.out.<MPI Rank> > gprof.out.<MPI Rank>
- The profiling report is limited compared to standard Unix/Linux gprof output:
  - The subroutines and their relative importance
  - Number of calls

VProf (Visual Profiler) Definition

- The Visual Profiler, VProf, is a project developed for optimizing program performance and evaluating algorithm efficiency
- Provides
  - Routines to collect statistical profiling information
  - Programs to view execution profiles (graphical and command-line)
  - Profile data is used to generate performance summaries sorted by source code line, by file, and by function
- Advantages over Gprof
  - Recompilation not necessary (linking only)
  - Significantly lower performance overhead
  - Profile visualization through cprof is very clear
- Development
  - Apparently no longer actively maintained (Sandia is the last known owner)
  - Integrated into the IBM MPI Trace Library

VProf (Visual Profiler) How-To

- Link with the IBM MPI Trace Library
- Set the appropriate environment variable: VPROF_PROFILE=yes
  - Enables profiling; one profiling file per MPI trace file
- Execution produces profiling files: vmon.out.<MPI Rank>
- Analyze the profiling files with cprof: cprof -e <Binary> <Profiling File>
- The final profiling report contains four sections:
  - File summary
  - Function summary
  - Line summary
  - Source code annotations

VProf (Visual Profiler) Cprof Command-Line Options

    Usage: cprof [options] executable [vmon_file...]
      -d, --directory dir             Search dir for source files
      -D, --recursive-directory dir   Search dir recursively for source files
      -e, --everything                Show all information
      -a, --annotate file             Annotate file
      -n, --number                    Show number of samples (not %)
      -s, --show thres                Set threshold for showing aggregate data
      -H, --html dir                  Output HTML into directory dir

IBM MPI Trace Library Principles

MPI Trace Features
- Collects all MPI communications of an application
- Measures the time spent in the MPI routines
- Provides a call graph for communication subroutines
Usage
- Link with the library: /bgp/usermisc/hhhibm11/libraries/libmpitrace/libmpitrace.a
- Execute the program; various environment variables can be specified (cf. next slide)
- Analyze the trace files (text files, human-readable)
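Because instrumentation happens at link time, no source changes are needed; any MPI program relinked against libmpitrace.a should produce the per-task profile files. The toy ring exchange below is an illustrative example (buffer sizes and file names are made up), whose MPI_Isend / MPI_Irecv / MPI_Waitall calls would show up in the trace summary.

    /* ring.c - minimal MPI program to exercise the trace library.
     * No source changes are needed; just relink, e.g. (illustrative):
     *   mpixlc -o ring ring.c \
     *     /bgp/usermisc/hhhibm11/libraries/libmpitrace/libmpitrace.a
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, left, right;
        double sendbuf[1024] = {0.0}, recvbuf[1024];
        MPI_Request req[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        left  = (rank + size - 1) % size;
        right = (rank + 1) % size;

        /* simple ring exchange: these calls appear in the MPI trace summary */
        MPI_Irecv(recvbuf, 1024, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, 1024, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        MPI_Finalize();   /* trace output is typically written at MPI_Finalize */
        return 0;
    }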

IBM MPI Trace Library Environment Variables

Data Collection Settings
- SWAP_BYTES={no*|yes}: switches output file endianness
- TRACE_DIR=<Directory>: output directory
- TRACE_BUFFER_SIZE=<Size>: buffer size
- SAVE_ALL_TASKS={no*|yes}: save all MPI tasks, or only the maximum / median / minimum tasks
Communications Profiling
- PROFILE_BY_CALL_SITE={no*|yes}: provides the call stack for MPI primitives
- TRACE_ALL_EVENTS={no*|yes}
- TRACE_MAX_RANK=<Rank>
- TRACE_SEND_PATTERN={no*|yes}: builds the point-to-point communication matrix
Output Files
- mpi.profile.<process ID>.<MPI Task #>
- events.trc
- hpmdata.x_y_z.<process ID>
(* marks the default value)

IBM MPI Trace Library Sample Output

The per-task profile reports, for each MPI routine (MPI_Comm_size, MPI_Comm_rank, MPI_Isend, MPI_Irecv, MPI_Waitall, MPI_Reduce, ...), the number of calls, the average message size in bytes, and the time in seconds, followed by overall totals:
- total communication time (seconds)
- total elapsed time (seconds)
- user CPU time (seconds)
- system time (seconds)
- maximum memory size (KBytes)
It also reports message size distributions (#calls, avg. bytes, time) per routine, for example for MPI_Isend, MPI_Irecv, and MPI_Reduce.

Hardware Performance Monitors (HPM) Definition

Definition
- Extra logic inserted in the processor to count specific events
- Updated at every cycle
Strengths
- Non-intrusive
- Very accurate
- Low overhead
Weaknesses
- Provides only raw hardware counts
- Specific to each processor
- Access is not well documented
- Lack of standards and documentation on what is counted
Purpose
- Provides comprehensive reports of events that are critical to performance on IBM systems
- Gathers critical hardware performance metrics, for example:
  - Number of misses at all cache levels
  - Number of floating-point instructions executed
  - Number of instruction loads that cause TLB misses
- Helps identify and eliminate performance bottlenecks

Universal Performance Counters (UPC) Principles

- 256 counters, 64 bits each; a hardware unit on the BG/P chip
- 72 counters are in the clock-x1 domain
  - PowerPC 450 core events: FPU, FP load/store
  - Counters specific to each core
- 184 counters are in the clock-x2 domain
  - L2, L3, memory, networks
  - Counters mostly shared across the node
- BG/P counters are tied to hardware resources, either specific to a core or shared across the node
- There is no process- or thread-level context, but processes and threads are pinned to specific cores

Universal Performance Counters (UPC) Principles (continued)

The counter mode and trigger method are programmable.
Modes:
- Mode 0: info on cores 0 and 1 for the clock-x1 counters, plus a set of 184 counters in the clock-x2 domain
- Mode 1: info on cores 2 and 3 for the clock-x1 counters, plus a different set of 184 counters in the clock-x2 domain
- Modes 2 and 3: primarily intended for hardware designers
Trigger methods: rising edge, default edge, falling edge, level high, level low
- The counters are basically looking at a voltage that can be "high" or "low"
- The edge modes count the number of events (for instance, the counter increments at the rising edge of a low-to-high transition)
- The level (high or low) modes count p-clock cycles while the voltage is high or low
  - This makes it possible to know how many p-clocks went by while waiting for a load request (instead of the number of load requests)
- The first 72 counters are not affected by the trigger, but the memory counters can differ
- The recommendation is to use the default edge or level high trigger

Universal Performance Counters (UPC) Counter Access

The BGP_UPC interface definitions and list of events are in:
- /bgsys/drivers/ppcfloor/arch/include/spi/upc.h
- /bgsys/drivers/ppcfloor/arch/include/spi/upc_events.h

    // every process on the node calls BGP_UPC_Initialize()
    BGP_UPC_Initialize();

    // just one rank per node sets the counter config and zeros the counters
    if (local_rank == 0) {
        BGP_UPC_Initialize_Counter_Config(counter_mode, counter_trigger);
        BGP_UPC_Zero_Counter_Values();
        BGP_UPC_Start(0);
    }

    MPI_Barrier(local_comm);   // communicator local to the node

    // ... do work ...

    MPI_Barrier(local_comm);

    if (local_rank == 0) {
        BGP_UPC_Stop();
        BGP_UPC_Read_Counter_Values(&counter_data, sizeof(struct CounterStruct),
                                    BGP_UPC_READ_EXCLUSIVE);
        // save the counter values from the counter_data structure
        BGP_UPC_Start(0);
    }

    struct CounterStruct {
        int32_t        rank;                      // Rank
        int32_t        core;                      // Core
        int32_t        upc_number;                // UPC number
        int32_t        number_processes_per_upc;  // Number of processes per UPC unit
        BGP_UPC_Mode_t mode;                      // User mode
        int32_t        number_of_counters;        // Number of counter values returned
        char           location[24];              // Location
        int64_t        elapsed_time;              // Elapsed time
        uint32_t       reserved_1;                // Reserved for alignment
        uint32_t       reserved_2;                // Reserved for alignment
        int64_t        values[256];               // Counter values
    } counter_data;

counter_mode = 0, 1, 2, 3 (plus some others, see UPC.h)
counter_trigger = BGP_UPC_CFG_LEVEL_HIGH, BGP_UPC_CFG_EDGE_DEFAULT

Universal Performance Counters (UPC) Usage

- The basic operation is BGP_UPC_Read_Counter_Values(&counter_data, ...), which fills out a structure including 256 counter values (64 bits each)
Caveats
- Reading all of the counters takes a long time, on the order of 10^4 cycles
  - Consequence: in practice, you can only use the counters for coarse-grained measurements
- The BG/P headers (UPC.h) require the GNU compiler (mpicc, powerpc-bgp-linux-gcc) for compilation
  - Consequence: it is best to wrap the counter routines in separately compiled source
- Some counters count events and others count cycles, but one cycle in the clock-x2 domain = two processor cycles
  - Consequence: multiply the value obtained by two to get processor cycles
  - Example: counter 80 (mode 0), BGP_PU0_L2_CYCLES_READ_REQUEST_PENDING with trigger = level high, is the number of memory-bus cycles during which the L2 unit attached to core 0 is waiting on a read request
- Any process or thread running on the node can (and will) trigger the shared counters in the clock-x2 domain
  - Consequence: this needs to be remembered to properly interpret the data; cf. Bob Walkup's documentation for recommendations
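As a small worked example of the multiply-by-two rule, the helper below converts a clock-x2 cycle count into processor cycles and seconds. It assumes the 850 MHz PowerPC 450 clock of Blue Gene/P; the function name and the sample value passed in main are made up for illustration.

    #include <stdint.h>
    #include <stdio.h>

    #define BGP_CPU_HZ 850.0e6   /* PowerPC 450 clock on BG/P (assumed here) */

    /* Hypothetical helper: convert a cycle count read from a clock-x2
     * domain counter into processor cycles and seconds. */
    static void report_x2_cycles(const char *label, int64_t x2_cycles)
    {
        int64_t cpu_cycles = 2 * x2_cycles;        /* 1 x2 cycle = 2 CPU cycles */
        double  seconds    = (double)cpu_cycles / BGP_CPU_HZ;
        printf("%s: %lld processor cycles (%.6f s)\n",
               label, (long long)cpu_cycles, seconds);
    }

    int main(void)
    {
        /* dummy value for illustration, e.g. counter_data.values[80] */
        report_x2_cycles("L2 read-request pending (core 0)", 123456789LL);
        return 0;
    }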

Universal Performance Counters (UPC) Other Counter Interfaces

- All hardware counter interfaces for BG/P are layered on top of BGP_UPC
- The BGP_UPC layer is provided, so you can write your own interfaces
- PAPI 4.0 has been ported to BG/P; some information has been posted by Argonne National Laboratory
- The HPC Toolkit provides documentation in HPM_ug.pdf; there is no hpmcount or hpmstat for BG/P, just libhpm.a
  - The environment variable HPM_EVENT_SET selects the counter mode (0, 1, 2, 3; default is 0)
  - The default trigger method was previously edge rise (can be set by the user)

    #include <libhpm.h>

    hpminit(rank, program);
    hpmtstart(number, label);
    do_work();
    hpmtstop(number);
    hpmterminate(rank);   // prints counter values etc.

Universal Performance Counters (UPC) Higher-Level Interfaces

- All hardware counter interfaces for BG/P are layered on top of BGP_UPC
- You can write your own interfaces
- PAPI 4.0 has been ported to BG/P
  - Requires application of patch 003
  - Provides a common interface to many third-party tools (Scalasca, TAU)
- Useful information has been posted by Argonne National Laboratory

HPM: IBM MPI Library How-To

Principle
- The IBM MPI library provides a very easy-to-use implementation of HPM extraction
Usage
- Link with the IBM MPI library
- Execute with the following environment variables:
  - BGP_STATS={0*|1}
  - BGP_STATS_INTERVAL=<Interval (seconds)>
- Execution produces one HPM file per MPI task: hpmdata.<coordinates>.<job ID>

HPM: HPM Library How-To

Instrument the code for HPM measurement:
- call hpm_init()          Initializes counters
- call hpm_start('label')  Starts counting a labeled block
- call hpm_stop('label')   Stops counting a labeled block
- call hpm_print()         Prints counter values and labels
Link with the HPM library.
Execution produces one HPM data file per MPI task: hpm_data.<MPI Rank>

HPM Library How-To

libmpihpm.a
- Uses the MPI profiling interface: starts the BG/P UPC counters in MPI_Init(), stops them in MPI_Finalize(), and produces two counter output files:
  - One text summary with min, max, and average counter values
  - One binary file with all counter data from every node
- The getcounts command can be used to pull out the data for a given node from the aggregate binary file; nodes are numbered in x, y, z order on the partition that the job ran on
- Environment variables:
  - BGP_COUNTER_MODE={0,1,2,3} (default = 0)
  - BGP_COUNTER_TRIGGER={edge, high} (default = high)
- This utility provides aggregate flops for the whole job, from start to finish, along with MPI statistics, but it cannot be used to measure specific code blocks

A simple start/stop interface can be called from Fortran, C, or C++ to get counts around specific code blocks, with one output file per node.

Fortran interface:
    call hpm_init()          ! one time, to initialize counters
    call hpm_start('label')  ! start counting a labeled block
    call hpm_stop('label')   ! stop counting a labeled block
    call hpm_print()         ! print counter values and labels, once at the end

C interface (add extern "C" for C++):
    void HPM_Init(void);
    void HPM_Start(char * label);
    void HPM_Stop(char * label);
    void HPM_Print(void);
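A minimal sketch of how the C start/stop interface above might be wrapped around a specific code block. The header name and the compute kernel are assumptions; only the HPM_Init / HPM_Start / HPM_Stop / HPM_Print calls come from the interface listed above.

    #include <mpi.h>
    #include <libhpm.h>       /* assumed header name for the start/stop interface */

    /* illustrative compute kernel (not part of the library) */
    static void compute_kernel(void)
    {
        volatile double s = 0.0;
        int i;
        for (i = 0; i < 1000000; i++)
            s += (double)i * 1.0e-6;
    }

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        HPM_Init();                 /* once, before any counted region */

        HPM_Start("compute");       /* the label identifies the block in the output */
        compute_kernel();
        HPM_Stop("compute");

        HPM_Print();                /* once at the end; one output file per node */

        MPI_Finalize();
        return 0;
    }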

Automatically Available Performance Counters

Principle
- By providing a hook into the MPI_Init and MPI_Finalize functions, counters are enabled before an application runs, and the results are collected and summarized before the application exits
- Once this feature is enabled, no user intervention is required to collect the performance counter data, but options are provided at run time to change counter modes, counter triggers, and counter data output directories
- It is also possible to disable the collection of performance counter data at run time
How-To
- Source the file /bgsys/drivers/ppcfloor/tools/autoperfcounters/enableautoperfcounters

Personality Definition

A double definition:
- Static data given to every Compute Node and I/O Node at boot time by the control system
  - Personality data contains information that is specific to the node
- A set of C language structures and functions that allows querying the personality data from the node
  - Useful to determine, at run time, where the tasks of the application are running
  - Might be used to tune certain aspects of the application at run time, such as determining which set of tasks share the same I/O Node and then optimizing the network traffic from the Compute Nodes to that I/O Node

Personality Usage Elements

Two include files (in directory /bgsys/drivers/ppcfloor/arch/include):
    #include <common/bgp_personality.h>
    #include <common/bgp_personality_inlines.h>
Structure:
    _BGP_Personality_t personality;
Query function:
    Kernel_GetPersonality(&personality, sizeof(personality));

Personality Provided Information

- personality.Network_Config.[X|Y|Z]nodes: number of X / Y / Z nodes in the torus
- personality.Network_Config.[X|Y|Z]coord: X / Y / Z node coordinates in the torus
- Kernel_PhysicalProcessorID(): core ID on the Compute Node (0, 1, 2, 3)
- BGP_Personality_getLocationString(&personality, location): location string Rxx-Mx-Nxx-Jxx

Personality Example

    #include <stdio.h>
    #include <mpi.h>
    #include <spi/kernel_interface.h>
    #include <common/bgp_personality.h>
    #include <common/bgp_personality_inlines.h>

    int main(int argc, char *argv[])
    {
        int taskid, ntasks;
        int memory_size_mbytes;
        int node_config;
        int xcoord, ycoord, zcoord, xsize, ysize, zsize;
        int pset_num, pset_size, pset_rank, procid;
        char location[128];           /* buffer for the location string */
        _BGP_Personality_t personality;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        Kernel_GetPersonality(&personality, sizeof(personality));

        memory_size_mbytes = personality.DDR_Config.DDRSizeMB;
        printf("memory size = %d MBytes\n", memory_size_mbytes);

        node_config = personality.Kernel_Config.ProcessConfig;
        if (node_config == _BGP_PERS_PROCESSCONFIG_SMP)
            printf("SMP mode\n");
        else if (node_config == _BGP_PERS_PROCESSCONFIG_VNM)
            printf("virtual-node mode\n");
        else if (node_config == _BGP_PERS_PROCESSCONFIG_2x2)
            printf("dual mode\n");
        else
            printf("unknown mode\n");

        xcoord = personality.Network_Config.Xcoord;
        ycoord = personality.Network_Config.Ycoord;
        zcoord = personality.Network_Config.Zcoord;
        xsize  = personality.Network_Config.Xnodes;
        ysize  = personality.Network_Config.Ynodes;
        zsize  = personality.Network_Config.Znodes;

        pset_num  = personality.Network_Config.PSetNum;
        pset_size = personality.Network_Config.PSetSize;
        pset_rank = personality.Network_Config.RankInPSet;

        BGP_Personality_getLocationString(&personality, location);
        procid = Kernel_PhysicalProcessorID();

        printf("task %d of %d: coords (%d,%d,%d) in %dx%dx%d torus, core %d, "
               "pset %d (rank %d of %d), location %s\n",
               taskid, ntasks, xcoord, ycoord, zcoord, xsize, ysize, zsize,
               procid, pset_num, pset_rank, pset_size, location);

        MPI_Finalize();
        return 0;
    }

Kernel Interface

Main include file (in directory /bgsys/drivers/ppcfloor/arch/include):
    #include <spi/kernel_interface.h>
Query functions:
- Kernel_GetMemorySize: retrieves memory information from the kernel
- Kernel_ActiveProcessCount: retrieves the number of active processes in the kernel

IBM HPC Toolkit

Toolkit Content
- Hardware (CPU) performance: Xprofiler, HPM Toolkit
- Message-passing performance: MPI Profiler / Tracer
- I/O performance: Modular I/O (MIO)
- Performance visualization: PeekPerf
Supported Platforms
- AIX: AIX 5.3, AIX 6.1
- Linux on POWER: Red Hat 5.2, SLES 10
- IBM System Blue Gene/P
Support via the Advanced Computing Technology Center in Research (ACTC)
Link: ct.index.html

HPC Toolkit Xprofiler

- Visualizes CPU time profiling data
- Compile and link with the -g -pg flags plus optimization
- Code execution generates a gmon.out file
  - MPI applications generate gmon.out.1, ..., gmon.out.n
- Analyze the gmon.out file with Xprofiler: Xprofiler a.out gmon.out
Important factors
- On AIX the time-sampling interval is 0.01 sec
- Profiling introduces overhead due to function calls

HPC Toolkit Xprofiler GUI Overview Window

- Width of a bar: time including called routines
- Height of a bar: time excluding called routines
- Call arrows are labeled with the number of calls
- Overview window for easy navigation (View -> Overview)

HPC Toolkit Xprofiler GUI Source Code Window

- The source code window displays the source code with its time profile (in ticks = 0.01 sec)
- Access
  - Select a function in the main display, then use the context menu
  - Select a function in the flat profile, then Code Display -> Show Source Code

HPC Toolkit Xprofiler GUI Disassembler Code

HPC Toolkit Message-Passing Performance

MP_Profiler Library
- Captures summary data for MPI calls
- Source code traceback
- The user MUST call MPI_Finalize() in order to get output files
- No changes to the source code are required
- MUST compile with -g to obtain source line number information
MP_Tracer Library
- Captures timestamped data for MPI calls
- Source traceback

MP_Profiler Output with Peekperf

MP_Profiler Message Size Distribution

The message size distribution report lists, for each MPI function, the number of calls, the message size bin, the number of bytes, and the wall time, with one row per (function, size bin) pair. Size bins range from small messages up through 1K...4K, 4K...16K, 16K...64K, 64K...256K, 256K...1M, and 1M...4M. In the example output, MPI_Comm_size, MPI_Comm_rank, MPI_Isend, MPI_Irecv, MPI_Waitall, and MPI_Barrier appear.

MP_Tracer Output with Peekperf

HPM Toolkit Components

libhpc
- Library for program (including multi-threaded) section instrumentation
- Environment variables: HPM_EVENT_SET=[0-3], HPM_UNIQUE_FILE_NAME={0|1} (the latter not available on Blue Gene/P)
hpccount (not available on Blue Gene/P)
- Starts the application and provides:
  - Wall clock time
  - Hardware performance counter information
  - Resource utilization statistics
hpcstat (not available on Blue Gene/P)
- Provides system-wide reports for root

HPM Toolkit libhpm

- Insert libhpc library calls in the source code and instrument different sections independently
- Supports Fortran, C, and C++
- Provides for each instrumented section:
  - Total count and duration (wall clock time)
  - Hardware performance counter information
  - Derived metrics
- Provides resource usage statistics for the total execution of the instrumented program
- Supports MPI, OpenMP, and pthreads
- Multiple instrumentation points, nested instrumentation, and multiple calls to an instrumented point are all supported (see the sketch below)
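A sketch of how nested instrumentation and repeated calls to one instrumented point might look, using the calls shown earlier for the HPC Toolkit HPM interface (hpminit / hpmtstart / hpmtstop / hpmterminate). The section numbers, labels, and work routines are illustrative, and the exact signatures may differ from the installed library.

    #include <libhpm.h>   /* header as shown on the earlier counter-interface slide */

    /* illustrative work routines (not part of the toolkit) */
    static void halo_exchange(void) { /* ... */ }
    static void step(void)          { /* ... */ }

    /* Nested instrumentation with numbered, labeled sections: an outer
     * section wraps the whole time loop, while an inner section is entered
     * once per iteration (multiple calls to the same instrumented point). */
    void timeloop(int rank, int nsteps)
    {
        int i;
        hpminit(rank, "timeloop_example");
        hpmtstart(1, "timeloop");
        for (i = 0; i < nsteps; i++) {
            hpmtstart(2, "halo_exchange");
            halo_exchange();
            hpmtstop(2);
            step();
        }
        hpmtstop(1);
        hpmterminate(rank);   /* prints counter values for all sections */
    }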

HPM Toolkit libhpm Output

Textual: a summary report for each task, perfhpm<taskid>.<pid>, for example:

    libhpm (V 2.6.0) summary
    Total execution time of instrumented code (wall time): seconds

    Instrumented section: 3 - Label: job 1 - process: 1
    file: sanity.c, lines: 33 <--> 70
    Count: 1
    Wall Clock Time: seconds
    BGL_FPU_ARITH_MULT_DIV (Multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)) : 0
    BGL_FPU_LDST_DBL_ST ( ) : 23
    BGL_UPC_L3_WRBUF_LINE_ALLOC (Write buffer line was allocated) : 1702

Peekperf performance file: hpm<taskid>_<progname>_<pid>.viz
Table performance file: tb_hpm<taskid>.<pid>

HPM Toolkit libhpm Output with Peekperf

Environment Flags

- HPM_EVENT_SET: selects the event set to be recorded; integer (0-15)
- HPM_NUM_INST_PTS: overrides the default of 100 instrumentation sections in the application; integer value > 0
- HPM_WITH_MEASUREMENT_ERROR: deactivates the procedure that removes measurement errors; True or False (0 or 1)
- HPM_OUTPUT_NAME: defines an output file name different from the default; string
- HPM_VIZ_OUTPUT: indicates whether a .viz file (for input to PeekPerf) should be generated; True or False (0 or 1)
- HPM_TABLE_OUTPUT: indicates whether a table text file should be generated; True or False (0 or 1)

Peekperf

- Visualization and analysis tool
- Offline analysis and viewing capability
- Supported platforms: AIX, Linux (POWER/Intel), Windows (Intel), Blue Gene

MP_Profiler Visualization Using PeekPerf

MP_Tracer Visualization Using PeekPerf

HPM Visualization Using PeekPerf

Modular I/O Performance Tool (MIO)

I/O Analysis
- Trace module: summary of file I/O activity plus a binary events file
- Low CPU overhead
I/O Performance Enhancement Library
- Prefetch module: optimizes asynchronous prefetch and write-behind
- System buffer bypass capability
- User-controlled pages (size and number)
Recoverable Error Handling
- Recover module: monitors return values and errno, and reissues failed requests
Remote Data Server
- Remote module: simple socket protocol for moving data
Shared object library for AIX

Performance Visualization

Example plot: JFS performance with vmtune -p20 -P80 -f120 -F128 -r2 -R8, showing file position (bytes) of reads and writes over time (seconds).

Scalasca Definition

- Scalasca = SCalable performance Analysis of LArge SCale Applications
- Performance measurement and analysis tool developed by the Innovative Computing Laboratory (ICL) and the Jülich Supercomputing Centre (JSC)
- Scalable trace analysis tool
- Scalasca analyzes separate local trace files in parallel by replaying the original communication on as many CPUs as were used to execute the target application itself

Scalasca Usage

Easy to use
- No source code modification; wrapper functions for compilation and execution
- Recompilation required
Three-stage process
1. Instrument: prepare application objects and the executable for measurement
       scalasca -instrument [options] <compile-or-link-command>
2. Analyze: run the application under control of the measurement system
       scalasca -analyze [options] <application-launch-command>
3. Examine: interactively explore the measurement analysis report
       scalasca -examine [options] <experiment-archive|report>

Scalasca Graphical User Interface

Scalasca Personal Experience Feedback

- Hard to install but easy to use
  - Exception: the mpirun command line is a pain in the neck
- More useful than standard profiling / MPI tracing?
  - Rich but complex GUI; requires X11 forwarding or VNC
- Probably mandatory for very large numbers of nodes, where standard profiling reaches its limits
  - But performance analysis is often performed on a lower number of nodes

TAU

- TAU = Tuning and Analysis Utilities
- Program and performance analysis tool framework being developed for the DOE Office of Science, ASC initiatives at LLNL, the ZeptoOS project at ANL, and the Los Alamos National Laboratory
- Provides a suite of static and dynamic tools that offer graphical user interaction and interoperation, forming an integrated analysis environment for parallel Fortran, C++, C, Java, and Python applications

More information