S7445 - WHAT THE PROFILER IS TELLING YOU OPTIMIZING WHOLE APPLICATION PERFORMANCE Mathias Wagner, Jakob Progsch GTC 2017

BEFORE YOU START The five steps to enlightenment
1. Know your application: What does it compute? How is it parallelized? What final performance is expected?
2. Know your hardware: What are the target machines, and how many? Are machine-specific optimizations okay?
3. Know your tools: What are the strengths and weaknesses of each tool? Learn how to use them.
4. Know your process: Performance optimization is a constant learning process.
5. Make it so! 2

PERFORMANCE OPTIMIZATION What exactly is the performance bottleneck? You might have a feeling about where your application spends most of its time, but a more analytic approach is usually better. 3

WHOLE APPLICATION PERFORMANCE Need to consider all components and their interaction. You might spend hours optimizing GPU kernels, yet in the end your application is still not much faster: optimizing only one component can introduce imbalance in your system, and Amdahl's law applies. Kernel performance is covered in the companion talk S7444 (next). Image: https://www.flickr.com/photos/jurvetson/480227362 4

ASSESS PERFORMANCE Tools required. Various tools are available depending on your requirements, with different levels of sophistication (the amount of information scales with the effort). Simple wall-clock time: timers built into your code, simple CPU profilers (gprof). GPU timelines, profiles, and details: CUDA profiling tools nvprof, NVIDIA Visual Profiler (NVVP), NVIDIA Nsight Visual Studio Edition. MPI, OpenMP, CPU details: 3rd-party tools. 5
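To illustrate the simplest level, a wall-clock timer built into the code might look like the minimal sketch below (not from the talk; the timed region is hypothetical):

#include <stdio.h>
#include <sys/time.h>

/* wall-clock seconds, good enough for coarse application-level timing */
static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void)
{
    double t0 = wtime();
    /* ... region of interest, e.g. one multigrid solve ... */
    double t1 = wtime();
    printf("solve took %.3f s\n", t1 - t0);
    return 0;
}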

3RD PARTY PROFILING TOOLS (without even trying to be complete): TAU PERFORMANCE SYSTEM, VAMPIRTRACE, PAPI CUDA COMPONENT, HPCTOOLKIT 6

You should get an idea of how to assess whole-application performance and identify some bottlenecks. There are far too many potential bottlenecks to provide a cookbook. 7

AGENDA Introduction HPGMG as sample application Timeline Data transfer Multi-GPU (MPI) 8

HPGMG 9

MULTI-GRID METHODS Introduction Multi-grid solves elliptic PDEs (Ax=b) using a hierarchical approach: the solution to a hard problem is expressed as the solution to an easier problem. It accelerates iterative methods and provides O(N) complexity. 10

HPGMG High-Performance Geometric Multi-Grid, Lawrence Berkeley National Laboratory. FVM and FEM variants; we focus on FVM. Proxy for AMR and low-Mach combustion codes. Used in Top500 benchmarking. http://crd.lbl.gov/departments/computer-science/par/research/hpgmg/ https://bitbucket.org/nsakharnykh/hpgmg-cuda 11

HPGMG Multi-grid cycles [Figure: V-cycle and F-cycle schematics: smoother & residual steps on the few finer grids, a direct solve on the many coarser grids] HPGMG implements the F-cycle, which has a better convergence rate than the V-cycle. Poisson or Helmholtz operator using a 2nd- or 4th-order discretization. 12

FIRST LOOK AT THE PERFORMANCE 13

RUN THE APPLICATION You might have a feeling about where your application spends most of its time, but a more analytic approach might be better. 14

PROFILE IT On the CPU side Tracing using SCORE-P GUI: CUBE 15

TIMELINE 16

THE PROFILER WINDOW NVVP as one possible tool to display timelines [Screenshot: NVVP window with its Timeline, Summary, Guide, and Analysis Results panes] 17

IF REQUIRED: REMOTE PROFILING The application you want to profile might not run locally; there are various approaches to remote profiling.
A. Run the profiler on the remote machine and use a remote desktop (X11 forwarding, NX, VNC, ...)
B. Collect data using the command-line profiler nvprof and view it on your local workstation:
# generate timeline
nvprof -o myprofile.timeline ./a.out
# collect data needed for guided analysis
nvprof -o myprofile.analysis --analysis-metrics ./a.out
# custom selection of metrics for detailed investigations
nvprof -o myprofile.metrics --metrics <...> ./a.out
C. Use the remote connection feature to create a new session 18

CREATE A REMOTE CONNECTION Start with a new session 19

TIMELINE Only shows GPU activity 20

NVTX MARKUP NVIDIA Tools Extension. NVVP by default only shows GPU activity on timelines; markup can be used to mark regions of CPU activity. Also useful for grouping phases of your application for easier navigation. Annotate your code and link against libnvToolsExt. 21

NVTX MARKUP Code sample
#include "nvToolsExt.h"
...
void init_host_data( int n, double * x )
{
    nvtxRangePushA("init_host_data");
    // initialize x on host
    ...
    nvtxRangePop();
}
22

NVTX MARKUP Simplify use of NVTX: use macros such as PUSH_RANGE(name,cid) and POP_RANGE (see the sketch below), use a C++ tracer class, or exploit compiler instrumentation. Details: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx/ 23
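A minimal sketch of such convenience macros along the lines of the blog post above (the macro names, color table, and cid-based color selection here are illustrative, not the exact ones from the talk):

#include <stdint.h>
#include "nvToolsExt.h"

static const uint32_t range_colors[] = { 0xff00ff00, 0xff0000ff, 0xffffff00,
                                         0xffff00ff, 0xff00ffff, 0xffff0000 };
#define NUM_RANGE_COLORS (sizeof(range_colors) / sizeof(range_colors[0]))

/* push a named, colored range onto the NVTX range stack */
#define PUSH_RANGE(name, cid) do {                                    \
    nvtxEventAttributes_t attr = {0};                                 \
    attr.version       = NVTX_VERSION;                                \
    attr.size          = NVTX_EVENT_ATTRIB_STRUCT_SIZE;               \
    attr.colorType     = NVTX_COLOR_ARGB;                             \
    attr.color         = range_colors[(cid) % NUM_RANGE_COLORS];      \
    attr.messageType   = NVTX_MESSAGE_TYPE_ASCII;                     \
    attr.message.ascii = (name);                                      \
    nvtxRangePushEx(&attr);                                           \
} while (0)

/* pop the most recently pushed range */
#define POP_RANGE() nvtxRangePop()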

CPU TIMELINE 24

CPU TIMELINE One single solve 25

PORTING TO GPUS 26

GPU TIMELINE 27

GPU TIMELINE Shows when GPU Kernels run 28

CAN WE DO BETTER? 29

HYBRID IMPLEMENTATION Take advantage of both architectures [Figure: V-cycle and F-cycle with a threshold level splitting the work between GPU and CPU] Fine levels are executed on throughput-optimized processors (GPU). Coarse levels are executed on latency-optimized processors (CPU). 30

HYBRID TIMELINE 31

HYBRID TIMELINE For each level: decide whether to run on the CPU or the GPU. Naïve performance estimate: GPU time (level 0) - GPU time (level X) + CPU time (level X) 32

HYBRID TIMELINE For each level: decide whether to run on the CPU or the GPU. Naïve estimate: GPU time (level 0) - GPU time (level X) + CPU time (level X) = 81.467 ms - 3.211 ms + 0.419 ms = 78.675 ms 33

TAKEAWAYS
CPU PROFILING: Get an estimate of your hotspots. The profile might not be detailed enough (sum, avg, max). A useful first estimate.
TIMELINE: See what is going on, with information on each call. NVTX markup for CPU activity and grouping. May also show voids, dependencies, ...
HYBRID: Estimate speedups. Strong GPU and strong CPU: use both. Throughput -> GPU, latency -> CPU. 34

DATA MIGRATION 35

MEMORY MANAGEMENT Using Unified Memory: no changes to data structures, no explicit data movements, a single pointer for CPU and GPU data. Use cudaMallocManaged for allocations. [Figure: developer view with Unified Memory, a single unified memory space] 36
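A minimal, self-contained sketch of this allocation style (not the HPGMG code; the kernel and sizes are illustrative):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(double *x, int n, double a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                         // GPU uses the same pointer
}

int main(void)
{
    const int n = 1 << 20;
    double *x;
    cudaMallocManaged(&x, n * sizeof(double));    // one pointer for CPU and GPU

    for (int i = 0; i < n; ++i) x[i] = 1.0;       // CPU touches the pages

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0);   // pages migrate to the GPU on demand
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                  // pages migrate back on CPU access
    cudaFree(x);
    return 0;
}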

UNIFIED MEMORY Simplified GPU programming. Minimal modifications to the original code: (1) malloc replaced with cudaMallocManaged for levels accessed by the GPU, (2) invoke the CUDA kernel if the level size is greater than a threshold.
void smooth(level_type *level, ...) {
  ...
  if (level->use_cuda) {
    // run on GPU
    cuda_cheby_smooth(level, ...);
  } else {
    // run on CPU
    #pragma omp parallel for
    for (block = 0; block < num_blocks; block++)
      ...
  }
}
37

PAGE FAULTS Segmented view 38

PAGE FAULTS Details 39

PAGE FAULTS Details 40

UNIFIED MEMORY Eliminating page migrations and faults [Figure: data flow between level N (large, GPU kernels) and level N+1 (small, CPU functions) through smoother, residual, and restriction steps] Level N+1 is shared between CPU and GPU. Solution: allocate the first CPU level with cudaMallocHost (zero-copy memory). 41
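A minimal sketch of that solution, assuming unified addressing so the pinned host pointer can also be passed to kernels (function and variable names are illustrative, not from the HPGMG source):

#include <cuda_runtime.h>

void allocate_shared_level(double **level_data, size_t num_elements)
{
    /* pinned, zero-copy host memory: CPU functions use it directly, and GPU
       kernels can access it over the bus without triggering page migrations */
    cudaMallocHost(level_data, num_elements * sizeof(double));
}

void free_shared_level(double *level_data)
{
    cudaFreeHost(level_data);
}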

PAGE FAULTS Almost gone 42

NEW HINTS API IN CUDA 8 (not used here)
cudaMemPrefetchAsync(ptr, length, destDevice, stream): migrate data to destDevice, overlapping with compute; updates the page table with much lower overhead than a page fault in a kernel; an async operation that follows CUDA stream semantics.
cudaMemAdvise(ptr, length, advice, device): specifies the allocation and usage policy for a memory region; the user can set and unset it at any time. 43
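A minimal sketch of how these hints could be applied to a managed allocation (not part of the HPGMG code; the sizes and the particular advice chosen are illustrative):

#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = (1 << 20) * sizeof(float);
    float *x;
    int device = 0;
    cudaStream_t stream;

    cudaMallocManaged(&x, bytes);
    cudaStreamCreate(&stream);

    /* usage policy: the GPU is the preferred home, but the CPU also accesses it */
    cudaMemAdvise(x, bytes, cudaMemAdviseSetPreferredLocation, device);
    cudaMemAdvise(x, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

    /* migrate pages to the GPU up front, overlapping with work already queued in
       the stream, instead of paying page faults inside the first kernel */
    cudaMemPrefetchAsync(x, bytes, device, stream);

    /* ... launch kernels using x in the same stream ... */

    /* bring the data back before CPU post-processing */
    cudaMemPrefetchAsync(x, bytes, cudaCpuDeviceId, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(x);
    return 0;
}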

CONCURRENCY THROUGH PIPELINING Use CUDA streams to hide data transfers [Figure: serial timeline: cudaMemcpyAsync(H2D), kernel<<<>>>, cudaMemcpyAsync(D2H); versus a concurrent timeline where the work is split into chunks K1..K4 whose D2H copies DH1..DH4 overlap the following kernels, giving a performance improvement] 44
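A minimal sketch of this pipelining pattern, assuming the data can be split into independent chunks (the kernel, sizes, and chunk count are illustrative):

#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 22, nchunks = 4, chunk = n / nchunks;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned memory is required for async overlap
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t streams[nchunks];
    for (int c = 0; c < nchunks; ++c) cudaStreamCreate(&streams[c]);

    for (int c = 0; c < nchunks; ++c) {
        size_t off = (size_t)c * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[c]);
        process<<<(chunk + 255) / 256, 256, 0, streams[c]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[c]);   // overlaps the next chunk's kernel
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < nchunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}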

OPTIMIZE CPU AND GPU (S7444 UP NEXT) 45

HIGHER PERFORMANCE Optimized GPU Kernels + OpenMP on CPU 46

REVISITING HYBRID STRATEGY [Chart: solve time in seconds (roughly 0.06 to 0.13 s) for the unoptimized and optimized versions as a function of the max-grid-size-on-CPU threshold, swept from 131072 down to 1] 47

TAKEAWAYS
UNIFIED MEMORY: No manual data transfers necessary. Avoid page faults; use prefetching and cudaMemAdvise.
CUDAMALLOC: Familiarize yourself with the variants: cudaMalloc, cudaMallocHost, cudaMallocManaged.
RE-ITERATE: After optimizing kernels, revisit your timeline; previous assumptions might no longer apply. Hybrid approaches strongly depend on the CPU and GPU used. Bottlenecks shift. 48

DEPENDENCY ANALYSIS 49

DEPENDENCY ANALYSIS Easily find the critical parts to optimize [Figure: timeline with CPU functions A and B separated by waits and GPU kernels X and Y; one kernel accounts for only 5% of the critical path, the other for 40%: "optimize here"] The longest-running kernel is not always the most critical optimization target. 50

DEPENDENCY ANALYSIS Visual Profiler [Screenshot: Unguided Analysis, generating the critical path with Dependency Analysis and listing the functions on the critical path] 51

DEPENDENCY ANALYSIS Visual Profiler [Screenshot: selecting a copy_kernel launch shows its inbound dependencies (MemCpy HtoD [sync]) and outbound dependencies (MemCpy DtoH [sync])] 52

MULTI-GPU USING MPI 53

MPI Compiling and Launching
$ mpicc -o myapp myapp.c
$ mpirun -np 4 ./myapp <args>
[Figure: four processes, rank 0 through rank 3, each running myapp] 54

PROFILING MPI APPLICATIONS Using nvprof Embed the MPI rank in the output filename, process name, and context name:
mpirun -np $np nvprof --output-profile profile.%q{OMPI_COMM_WORLD_RANK}.nvvp
Then use the Import Wizard. OpenMPI: OMPI_COMM_WORLD_RANK, MVAPICH2: MV2_COMM_WORLD_RANK 55

MULTI GPU TIMELINE 56

MPI ACTIVITY IN NVVP Use NVTX. MPI provides a profiling interface (PMPI): intercept MPI calls and perform actions before and after each MPI call (see the sketch below). A Python script to generate the necessary wrappers is available:
python wrap/wrap.py -g -o nvtx_pmpi.c nvtx.w
mpicc -c nvtx_pmpi.c
mpicc nvtx_pmpi.o -o myapplication -L$CUDA_HOME/lib64 -lnvToolsExt
Details: https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-track-mpi-calls-nvidia-visual-profiler/ 57
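A minimal, hand-written sketch of the kind of wrapper the script generates (the generated code covers many more MPI calls; this one only instruments MPI_Send):

#include <mpi.h>
#include "nvToolsExt.h"

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    nvtxRangePushA("MPI_Send");        /* mark the call on the CPU timeline   */
    int err = PMPI_Send(buf, count, datatype, dest, tag, comm);  /* real call */
    nvtxRangePop();
    return err;
}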

MULTI GPU TIMELINE MPI activity 58

MPI TRANSFERS Without CUDA aware MPI 59

CUDA AWARE MPI MPI knows about the GPU: use MPI directly on GPU pointers, with no manual copy to the host required (see the sketch below). Unified Memory needs explicit support from the CUDA-aware MPI implementation; check your MPI implementation for support (OpenMPI > 1.8.5, MVAPICH2-GDR > 2.2b). Unified Memory with regular (non-CUDA-aware) MPI requires an unmanaged staging buffer, because regular MPI has no knowledge of Unified Memory. 60
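A minimal sketch, assuming a CUDA-aware MPI build, of passing device pointers straight to MPI instead of staging through a host buffer (the ranks, sizes, and tag are illustrative):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *d_buf;
    cudaMalloc(&d_buf, n * sizeof(double));   /* device memory */

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);            /* device pointer */
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}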

CUDA AWARE MPI TIMELINE 61

NVIDIA GPUDIRECT Peer-to-Peer Transfers [Figure: GPU 1 and GPU 2 with their memories, connected via PCIe through the chipset to the CPU, system memory, and InfiniBand adapter] 62

NVIDIA GPUDIRECT Peer-to-Peer Transfers [Figure: the same system with an NVLink connection directly between GPU 1 and GPU 2] 63

PEER TO PEER Using a pinned buffer for MPI. Not supported on unified-memory buffers, so use a pinned memory buffer for MPI. Still staging through the host??? 64

TOPOLOGY DGX-1: fat GPU nodes. Multiple CPUs, with system memory attached to a single CPU. Multiple GPUs, P2P via NVLink or shared PCIe. Multiple network (IB) adapters. Without a direct connection: staging through the host. 65

TOPOLOGY GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_2 mlx5_1 mlx5_3 CPU Affinity GPU0 X NV1 NV1 NV1 NV1 SOC SOC SOC PIX SOC PHB SOC 0-19 GPU1 NV1 X NV1 NV1 SOC NV1 SOC SOC PIX SOC PHB SOC 0-19 GPU2 NV1 NV1 X NV1 SOC SOC NV1 SOC PHB SOC PIX SOC 0-19 GPU3 NV1 NV1 NV1 X SOC SOC SOC NV1 PHB SOC PIX SOC 0-19 GPU4 NV1 SOC SOC SOC X NV1 NV1 NV1 SOC PIX SOC PHB 20-39 GPU5 SOC NV1 SOC SOC NV1 X NV1 NV1 SOC PIX SOC PHB 20-39 GPU6 SOC SOC NV1 SOC NV1 NV1 X NV1 SOC PHB SOC PIX 20-39 GPU7 SOC SOC SOC NV1 NV1 NV1 NV1 X SOC PHB SOC PIX 20-39 mlx5_0 PIX PIX PHB PHB SOC SOC SOC SOC X SOC PHB SOC mlx5_2 SOC SOC SOC SOC PIX PIX PHB PHB SOC X SOC PHB mlx5_1 PHB PHB PIX PIX SOC SOC SOC SOC PHB SOC X SOC mlx5_3 SOC SOC SOC SOC PHB PHB PIX PIX SOC PHB SOC X Legend: Query information using nvidia-smi topo -m X = Self SOC = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI) PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge) PIX = Connection traversing a single PCIe switch NV# = Connection traversing a bonded set of # NVLinks 66

PEER TO PEER With directly connected GPUs No staging through host Here via NVLink 67

TAKEAWAYS
MARKUP MPI: Use the NVTX wrapper to show MPI calls and their duration. For more details: use 3rd-party tools.
CUDA AWARE MPI: Call MPI on GPU buffers; unified-memory aware. Can use D2D copies, avoids host staging, direct RDMA to the network.
TOPOLOGY: Multiple CPUs, GPUs, NICs? Consider options to place your jobs (mpirun, numactl). Query the topology of your system (hwinfo, nvidia-smi). 68

SUMMARY 69

TAKEAWAYS
PROFILE: Use the available tools, as many as needed. Consider your whole application. Focus on representative parts / smaller problems. Check for overhead.
OPTIMIZE: Look for dependencies. Use streams to overlap work. Be aware of unified-memory page faults. Use CUDA-aware MPI. Consider topology.
LEARN: Build knowledge. There is no optimization cookbook. 70

LEARNING RESOURCES
S7444 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS
3RD PARTY PROFILING: L7115 - PERFORMANCE ANALYSIS [ ] SCORE-P AND VAMPIR; S7684 - PERFORMANCE ANALYSIS OF CUDA DEEP LEARNING NETWORKS USING TAU; S7573 - DEVELOPING, DEBUGGING, AND OPTIMIZING [ ] WITH ALLINEA FORGE
CUDA PROFILING: S7495 - OPTIMIZING APPLICATION PERFORMANCE WITH CUDA PROFILING TOOLS; S7824 - DEVELOPER TOOLS UPDATE IN CUDA 9
MPI AND UNIFIED MEMORY: S7133 - MULTI-GPU PROGRAMMING WITH MPI; L7114 - MULTI GPU PROGRAMMING WITH MPI AND OPENACC; S7142 - MULTI-GPU PROGRAMMING MODELS; S7285 - UNIFIED MEMORY ON THE LATEST GPU ARCHITECTURES
CUDA DOCUMENTATION: Programming Guide, Best Practices, Tuning Guides; Parallel Forall blog, GTC On-Demand, Stack Overflow
71