HPCToolkit: Performance Tools for Parallel Scientific Codes

Size: px

Start display at page:

Download "HPCToolkit: Performance Tools for Parallel Scientific Codes"

Cynthia Stafford
6 years ago
Views:

1 HPCToolkit: Performance Tools for Parallel Scientific Codes John Mellor-Crummey Department of Computer Science Rice University Building Community Codes for Effective Scientific Research on HPC Platforms September 7,

2 Challenges for Computational Scientists Execution environments and applications are rapidly evolving architecture rapidly changing multicore microprocessor designs increasing scale of parallel systems growing use of accelerators applications transition from MPI everywhere to threaded implementations add additional scientific capabilities maintain multiple variants or configurations Computational scientists need to assess weaknesses in algorithms and their implementations improve scalability of executions within and across nodes adapt to changes in emerging architectures Performance tools can play an important role as a guide 2

3 Performance Analysis Challenges Complex architectures are hard to use efficiently multi-level parallelism: multi-core, ILP, SIMD instructions multi-level memory hierarchy result: gap between typical and peak performance is huge Complex applications present challenges measurement and analysis understanding behaviors and tuning performance Supercomputer platforms compound the complexity unique hardware unique microkernel-based operating systems multifaceted performance concerns computation data movement communication I/O 3

4 What Users Want Multi-platform, programming model independent tools Accurate measurement of complex parallel codes large, multi-lingual programs fully optimized code: loop optimization, templates, inlining binary-only libraries, sometimes partially stripped complex execution environments dynamic loading, static linking SPMD parallel codes with threaded node programs batch jobs Effective performance analysis insightful analysis that pinpoints and explains problems correlate measurements with code for actionable results support analysis at the desired level intuitive enough for application scientists and engineers detailed enough for library developers and compiler writers Scalable to petascale and beyond 4

5 We Build It * HPCToolkit - 160K lines, 797 files measurement, data analysis: 110K lines C/C++, scripts; 424 files hpcviewer, hpctraceviewer GUIs: 54K lines Java; 373 files HPCToolkit externals - 2.5M lines C/C++, 5782 files components developed execution control: libmonitor - 7K lines, 35 files binary analysis: Open Analysis - 76K lines, 343 files (+ ANL, Colorado) components extensively modified binary analysis: GNU binutils M total, 1650 files; (448K bfd) other components stack unwinding: libunwind XML: libxml2, xerces understanding binaries: libelf, libdwarf, symtabapi * With support from the US government DOE Office of Science: DE-FC02-07ER25800, DE-FC02-06ER25762 LANL: G, , , AFRL: FA C

6 Contributors Current staff: Michael Fagan, Mark Krentel, Laksono Adhianto students: Xu Liu, Milind Chabbi, Karthik Murthy external: Nathan Tallent (PNNL) Alumni students: Gabriel Marin (ORNL), Nathan Froyd (Mozilla) staff: Rob Fowler (UNC) interns: Sinchan Banerjee (MIT), Michael Franco (Rice), Reed Landrum (Stanford), Bowden Kelly (Georgia Tech), Philip Taffet (St. John s High School) 6

7 HPCToolkit Approach Employ binary-level measurement and analysis observe fully optimized, dynamically linked executions support multi-lingual codes with external binary-only libraries Use sampling-based measurement (avoid instrumentation) controllable overhead minimize systematic error and avoid blind spots enable data collection for large-scale parallelism Collect and correlate multiple derived performance metrics diagnosis typically requires more than one species of metric Associate metrics with both static and dynamic context loop nests, procedures, inlined code, calling context Support top-down performance analysis natural approach that minimizes burden on developers 7

8 Outline Overview of Rice s HPCToolkit Pinpointing scalability bottlenecks scalability bottlenecks on large-scale parallel systems scaling on multicore processors Understanding temporal behavior Assessing process variability Understanding threading, GPU, and memory hierarchy blame shifting attributing memory hierarchy costs to data Summary and conclusions 8

9 HPCToolkit Workflow source code compile & link optimized binary profile execution [hpcrun] binary analysis [hpcstruct] call path profile program structure presentation [hpcviewer/ hpctraceviewer] database interpret profile correlate w/ source [hpcprof/hpcprof-mpi] 9

10 HPCToolkit Workflow source code compile & link optimized binary profile execution [hpcrun] binary analysis [hpcstruct] call path profile program structure For dynamically-linked executables on stock Linux compile and link as you usually do For statically-linked executables (e.g. for Blue Gene, Cray) add monitoring by using hpclink as prefix to your link line presentation [hpcviewer/ hpctraceviewer] database interpret profile correlate w/ source [hpcprof/hpcprof-mpi] 10

11 HPCToolkit Workflow source code compile & link optimized binary profile execution [hpcrun] binary analysis [hpcstruct] call path profile program structure Measure execution unobtrusively launch optimized application binaries dynamically-linked applications: launch with hpcrun e.g., mpirun -np 8192 hpcrun -t -e flash3... statically-linked applications: control with environment variables collect statistical call path profiles of events of interest presentation [hpcviewer/ hpctraceviewer] database interpret profile correlate w/ source [hpcprof/hpcprof-mpi] 11

12 Call Path Profiling Measure and attribute costs in context sample timer or hardware counter overflows gather calling context using stack unwinding Call path sample return address Calling context tree return address return address instruction pointer Overhead proportional to sampling frequency......not call frequency 12

13 HPCToolkit Workflow source code compile & link optimized binary profile execution [hpcrun] binary analysis [hpcstruct] call path profile program structure Analyze binary with hpcstruct: recover program structure analyze machine code, line map, debugging information extract loop nesting & identify inlined procedures map transformed loops and procedures to source presentation [hpcviewer/ hpctraceviewer] database interpret profile correlate w/ source [hpcprof/hpcprof-mpi] 13

14 HPCToolkit Workflow source code compile & link optimized binary profile execution [hpcrun] binary analysis [hpcstruct] call path profile program structure Combine multiple profiles multiple threads; multiple processes; multiple executions Correlate metrics to static & dynamic program structure presentation [hpcviewer/ hpctraceviewer] database interpret profile correlate w/ source [hpcprof/hpcprof-mpi] 14

15 HPCToolkit Workflow source code compile & link optimized binary profile execution [hpcrun] binary analysis call path profile program structure [hpcstruct] Presentation explore performance data from multiple perspectives rank order by metrics to focus on what s important compute derived metrics to help gain insight e.g. scalability losses, waste, CPI, bandwidth graph thread-level metrics for contexts explore evolution of behavior over time presentation [hpcviewer/ hpctraceviewer] database interpret profile correlate w/ source [hpcprof/hpcprof-mpi] 15

16 Analyzing with hpcviewer source pane metric display view control costs for inlined procedures loops function calls in full context navigation pane metric pane 16

17 Outline Overview of Rice s HPCToolkit Pinpointing scalability bottlenecks scalability bottlenecks on large-scale parallel systems scaling on multicore processors Understanding temporal behavior Assessing process variability Understanding threading, GPU, and memory hierarchy blame shifting attributing memory hierarchy costs to data Summary and conclusions 17

18 The Problem of Scaling ? Efficiency Ideal efficiency Actual efficiency CPUs Note: higher is better 18

19 Wanted: Scalability Analysis Isolate scalability bottlenecks Guide user to problems Quantify the magnitude of each problem 19

20 Challenges for Pinpointing Scalability Bottlenecks Parallel applications modern software uses layers of libraries performance is often context dependent Example climate code skeleton main land sea ice ocean atmosphere wait wait wait wait Monitoring bottleneck nature: computation, data movement, synchronization? 2 pragmatic constraints acceptable data volume low perturbation for use in production runs 20

21 Performance Analysis with Expectations You have performance expectations for your parallel code strong scaling: linear speedup weak scaling: constant execution time Put your expectations to work measure performance under different conditions e.g. different levels of parallelism or different inputs express your expectations as an equation compute the deviation from expectations for each calling context for both inclusive and exclusive costs correlate the metrics with the source code explore the annotated call tree interactively 21

22 Pinpointing and Quantifying Scalability Bottlenecks P 600K Q 400K = P Q coefficients for analysis of strong scaling 200K 22

Scalability Analysis Demo: FLASH3 Code: University of

256 processors Scaling type: weak Parallel,

AMR; a block is the unit of computation Designed for

solve a broad range of (astro)physical problems

systems Scales and performs well Fully modular and

different applications Nova outbursts on white dwarfs

Rayleigh-Taylor Cellular detonation Helium burning on

23 Scalability Analysis Demo: FLASH3 Code: University of Chicago FLASH3 Simulation: Platform: Experiment: 8192 vs. 256 processors Scaling type: weak Parallel, adaptive-mesh refinement (AMR) code Block structured AMR; a block is the unit of computation Designed for white compressible dwarf reactive detonation flows Can solve a broad range of (astro)physical problems Portable: runs Blue on many Gene/P massively-parallel systems Scales and performs well Fully modular and extensible: components can be combined to create many different applications Nova outbursts on white dwarfs Laser-driven shock instabilities Magnetic Rayleigh-Taylor Cellular detonation Helium burning on neutron stars Figures courtesy of FLASH Team, University of Chicago Orzag/Tang MHD vortex Rayleigh-Taylor instability 23

24 Improved Flash Scaling of AMR Setup Graph courtesy of Anshu Dubey, U Chicago 24

25 Outline Overview of Rice s HPCToolkit Pinpointing scalability bottlenecks scalability bottlenecks on large-scale parallel systems scaling on multicore processors Understanding temporal behavior Assessing process variability Understanding threading, GPU, and memory hierarchy blame shifting attributing memory hierarchy costs to data Summary and conclusions 25

26 Understanding Temporal Behavior Profiling compresses out the temporal dimension temporal patterns, e.g. serialization, are invisible in profiles What can we do? Trace call path samples sketch: N times per second, take a call path sample of each thread organize the samples for each thread along a time line view how the execution evolves left to right what do we view? assign each procedure a color; view a depth slice of an execution Processes Time Call stack 26

27 hpctraceviewer: detail of Load imbalance among threads appears as different lengths of colored bands along the x axis 27

28 Outline Overview of Rice s HPCToolkit Pinpointing scalability bottlenecks scalability bottlenecks on large-scale parallel systems scaling on multicore processors Understanding temporal behavior Assessing process variability Understanding threading, GPU, and memory hierarchy blame shifting attributing memory hierarchy costs to data Summary and conclusions 28

MPBS @ 960 cores, radix sort Two views of

29 960 cores, radix sort Two views of load imbalance since not on a 2 k cores 29

30 Outline Overview of Rice s HPCToolkit Pinpointing scalability bottlenecks scalability bottlenecks on large-scale parallel systems scaling on multicore processors Understanding temporal behavior Assessing process variability Understanding threading, GPU, and memory hierarchy blame shifting attributing memory hierarchy costs to data Summary and conclusions 30

31 Blame Shifting Problem: in many circumstances sampling measures symptoms of performance losses rather than causes worker threads waiting for work threads waiting for a lock MPI process waiting for peers in a collective communication idle GPU waiting for work Approach: shift blame for losses from victims to perpetrators blame code executing while other threads are idle blame code executed by lock holder when thread(s) are waiting blame processes that arrive late to collectives shift blame between CPU and GPU for hybrid code 31

32 Performance Expectations for Hybrid Code with Blame Shifting 32

33 Performance Expectations for Hybrid Code with Blame Shifting 33

34 Performance Expectations for Hybrid Code with Blame Shifting 34

35 Performance Expectations for Hybrid Code with Blame Shifting 35

GPU Successes with HPCToolkit LAMMPS: identified hardware problem with Keeneland system improperly seated GPUs were observed to have lower

36 GPU Successes with HPCToolkit LAMMPS: identified hardware problem with Keeneland system improperly seated GPUs were observed to have lower data copy bandwidth LULESH: identified the dynamic memory allocation using cudamalloc and cudafree accounted for 90% of the idleness of the GPU 36

37 Data Centric Analysis Goal: associate memory hierarchy performance losses with data Approach intercept allocations to associate with their data ranges measure latency with instruction-based sampling (AMD Opteron) present quantitative results using hpcviewer 37

38 Data Centric Analysis of S3D yspecies latency for this loop is 14.5% of total latency in program 41.2% of memory hierarchy latency related to yspecies array 38

39 Outline Overview of Rice s HPCToolkit Pinpointing scalability bottlenecks scalability bottlenecks on large-scale parallel systems scaling on multicore processors Understanding temporal behavior Assessing process variability Understanding threading, GPU, and memory hierarchy blame shifting attributing memory hierarchy costs to data Summary and conclusions 39

40 Summary Sampling provides low overhead measurement Call path profiling + binary analysis + blame shifting = insight scalability bottlenecks where insufficient parallelism lurks sources of lock contention load imbalance temporal dynamics bottlenecks in hybrid code problematic data structures Other capabilities attribute memory leaks back to their full calling context 40

41 Current Technical Issues Keeping up with emerging platforms... Blue Waters (Cray XK6) AMD Interlagos introduces new vector instructions issue: binary analysis tools need update (even gdb is ignorant at present!) NVIDIA Tesla K20 GPUs issue: monitoring and assessing impact of asynchronous activities OpenACC programming model issue: creates threads that synchronize before main Stampede (Intel MIC) new instruction set for MIC new programming models MIC only programming MIC as MPI rank (ranks with heterogeneous capabilities) offload model massive threading Keeneland (NVIDIA Tesla K20 GPU) multi-socket, multi-gpu nodes 41

42 A Spectrum of Tool Challenges Intellectual research - develop new techniques for measurement, analysis, and presentation challenges of emerging systems increasing scale of systems (e.g. Sequoia) heterogeneity (e.g., host + accelerator (MIC or GPU); AMD Fusion) exploding growth of threading: MIC supports 200+ threads blame shifting: identify causes rather than symptoms analyzing asynchronous activities measure and analyze all facets of application performance CPU, accelerator, intra- and inter-node data move & sync, I/O, interaction with HW, interaction with other jobs, interaction with system software provide higher level insight, diagnosis, and guidance Community leadership and engagement OS support for tools past successes: BG/P CNK and Cray CNL kernels; BG/Q spec; Linux kernel; PMU device drivers today: interfaces for observing communication, I/O network issues, data access latency standards committees: today - OpenMP tools API; future - OpenCL tools API? vendor engagement: NVIDIA, Intel, IBM outreach with workshops Software development: new instructions for new CPUs; new programming models for heterogeneous maintenance deployment & user support 42

43 A Community of End Users Government laboratories DOE Office of Science Labs Argonne National Laboratory Lawrence Berkeley National Laboratory Oak Ridge National Laboratory National Renewable Energy Laboratory DOE NNSA Labs Sandia National Laboratory Los Alamos National Laboratory Lawrence Livermore National Laboratory DOD Engineering Research Development Center TACC 25 PerfExpert Sites Numerous universities University of Utah University of Southern California Ohio State UNC Chapel Hill Mississippi State Georgia Tech... Companies Bull Cray Computer SAS Institute Western Geco AMD Numerical Algorithms Group Shell Foreign centers Juelich Supercomputing Centre Britain s Atomic Weapons Establishment Australian National University Laboratoire d'aérologie, O.M.P. INRIA Norwegian Metacenter for High Performance Computing 43

44 A Community: One Group s Tools are another group s infrastructure PerfExpert (University of Texas and Texas State) uses HPCToolkit s measurement and analysis infrastructure to collect hardware performance counter data, analyze application binaries, and attribute performance to routines and loop nests bullx (Bull) HPCToolkit performance tools are part of the standard bullx software suite deployed on Bull s supercomputers custom extensions to hpcviewer to support cluster analysis MIAMI (ORNL) uses HPCToolkit s hpcviewer user interface for interactive analysis of performance modeling results OpenSpeedshop (Krell) uses HPCToolkit s libmonitor for Scott Pakin (LANL) uses HPCToolkit s hpcviewer user interface for interactive analysis of performance measurement data collected with custom instrumentation 44

45 Software Engineering? Ideally: design abstractions and implement them Harsh reality: performance tools interact with the system at the lowest level, which introduces complexity and design compromises absence of proper interfaces and mechanisms in layers we interact with blocking system calls that are invisible to timer-based profiling, system calls that can t be restarted when interrupted stripped code (e.g., runtimes for OpenMP, CUDA) wrong information from compilers libraries (e.g., libc) that lack proper interfaces for wrapping helper threads launched during initialization of dynamic libraries lack of standard calling conventions in low-level code (e.g., dynamic library loading)... experimentally determine what s possible and how things work e.g., how can I get notification when CUDA kernels complete without serializing multiple threads that share a GPU only a stripped library from NVIDIA; no behavioral documentation grow a piece of infrastructure from a prototype With a larger team, we d re-implement some components from scratch with the benefit of hindsight Many packages we depend upon leave much to be desired but there is insufficient funding to rewrite them 45

46 Open Issues Incentives are lacking for community performance tools rational model: build reusable building tool blocks reality funding for research, not development components help our competitors more than us Ongoing funding for maintenance, deployment, user support? Continuity? 46

47 HPCToolkit Capabilities at a Glance Attribute Costs to Code Analyze Behavior over Time Pinpoint & Quantify Scaling Bottlenecks Shift Blame from Symptoms to Causes Assess Imbalance and Variability Associate Costs with Data hpctoolkit.org

HPCToolkit: Sampling-based Performance Tools for Leadership Computing

HPCToolkit: Sampling-based Performance Tools for Leadership Computing John Mellor-Crummey Department of Computer Science Rice University johnmc@cs.rice.edu http://hpctoolkit.org CScADS 2009 Workshop on