PERFWORKS A LIBRARY FOR GPU PERFORMANCE ANALYSIS

1 PERFWORKS A LIBRARY FOR GPU PERFORMANCE ANALYSIS
Avinash Baliga, NVIDIA Developer Tools Software Architect
April 4-7, 2016, Silicon Valley
April 5, 3:00 p.m., Room 211B

2 NVIDIA PerfWorks SDK
New API for collecting performance metrics from NVIDIA GPUs.
- Cross-API: CUDA, OpenGL, OpenGL ES, D3D11, and D3D12
- Cross-Platform: Windows, Linux, Mobile
- GPUs: Kepler, Maxwell, Pascal; Tegra, GeForce, Tesla, Quadro
- Target Audience: tools developers, engine developers
Successor to the NVIDIA PerfKit SDK (NVPMAPI):
- Adds range-based profiling
- Supports next-gen APIs featuring multi-threaded GPU work submission

3 GPU Counters and Metrics
PerfWorks delivers actionable, high-level metrics, allowing you to recognize top performance limiters quickly and directly.
- Raw Counters: elapsed_cycles, time_duration
- Metric: average_clock_rate = elapsed_cycles / time_duration
Metric Categories
- Cumulative Work: compute warps launched, shaded pixels
- Timing: elapsed cycles, duration in nanoseconds
- Activity: active, stalled, idle cycles
- Throughput: rate of operations, memory transactions, instruction issue, etc.
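The relationship between raw counters and a derived metric can be sketched in a few lines of C++. This is a minimal illustration, not PerfWorks code: the `RawCounters` struct and the function name are hypothetical, and the duration is assumed to be in nanoseconds as listed under the Timing category.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the two raw counters named on the slide.
struct RawCounters {
    std::uint64_t elapsed_cycles;    // GPU cycles elapsed over the measured range
    std::uint64_t time_duration_ns;  // duration of the same range (nanoseconds assumed)
};

// average_clock_rate = elapsed_cycles / time_duration.
// Cycles per nanosecond is numerically equal to GHz.
double average_clock_rate_ghz(const RawCounters& c) {
    assert(c.time_duration_ns > 0 && "duration must be non-zero");
    return static_cast<double>(c.elapsed_cycles) /
           static_cast<double>(c.time_duration_ns);
}
```

For example, 3000 elapsed cycles over 2000 ns gives an average clock rate of 1.5 GHz.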

4 Speed of Light Metrics
SOL = Speed of Light = peak throughput of a given piece of hardware (max instructions per cycle, max bytes per cycle, etc.).
SOL% = achieved throughput, as a % of the peak; how close are you to perfection?
A unit's SOL% is the max across its sub-units' SOL%s (SM, partition, sub-partition, ALU).
Example: the SM SOL% is the max of
- Instruction Issue utilization
- ALU utilization
- Shared memory utilization
- Texture/L1 utilization
[Image: Maxwell SM sub-partition, from the NVIDIA GeForce GTX 750 Ti Whitepaper]
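The two definitions above (SOL% of a sub-unit, and a unit taking the max over its sub-units) can be sketched as follows. The function names are hypothetical and the inputs are made-up numbers; this only illustrates the arithmetic, not how PerfWorks computes it internally.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// SOL% of one sub-unit: achieved throughput as a percentage of peak throughput.
double sol_percent(double achieved, double peak) {
    assert(peak > 0.0);
    return 100.0 * achieved / peak;
}

// SOL% of a unit: the maximum SOL% across its sub-units.
double unit_sol_percent(const std::vector<double>& sub_unit_sol) {
    assert(!sub_unit_sol.empty());
    return *std::max_element(sub_unit_sol.begin(), sub_unit_sol.end());
}
```

With sub-unit values of 40, 75, 10 and 55 (instruction issue, ALU, shared memory, texture/L1), `unit_sol_percent` reports 75: the unit is as busy as its busiest sub-unit.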

5 Compute Metrics
[Diagram of the compute units: SM, L1, Tex, Shared, L2, Device, System, with % utilization shown per unit]
Metric labels in the diagram: Instruction Issue-Efficiency; Instruction Pipeline Statistics; Stall Reasons; Cache Hit/Miss; Utilization; Efficiency; Utilization by Op Type; Utilization by Client.

6 Compute Metrics: Compute-Bound
[Diagram: SM, L1, Tex, Shared, L2, Device, System]
- High instruction issue utilization
- High pipeline utilization
- Medium-low utilization on all other units

7 Compute Metrics: Memory-Bound
[Diagram: SM, L1, Tex, Shared, L2, Device, System]
- Medium-low utilization in the SM.
- One of the memory units has reached close to its maximum throughput.

8 Compute Metrics: Latency-Bound
[Diagram: SM (Stalls), L1, Tex, Shared, L2, Device, System]
- High number of pipeline stalls.
- Medium-low utilization on everything.
- The same amount of data is transferred from L1 as from L2, or from L2 as from memory.
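The three patterns above (compute-bound, memory-bound, latency-bound) amount to a small decision procedure. The sketch below is one way to write it down; the struct, the 80% threshold, and the order of the checks are illustrative assumptions, not values or logic taken from PerfWorks.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-range summary; all values are percentages in [0, 100].
struct UnitUtilization {
    double issue;       // SM instruction issue utilization
    double pipeline;    // SM pipeline utilization
    double memory_max;  // highest utilization among L1/Tex, Shared, L2, Device, System
    double stall;       // share of cycles spent stalled
};

// Assumed cut-off for "high"; a real tool would tune this per architecture.
constexpr double kHigh = 80.0;

std::string classify(const UnitUtilization& u) {
    // Compute-bound: instruction issue and pipelines are both near their peak.
    if (u.issue >= kHigh && u.pipeline >= kHigh) return "compute-bound";
    // Memory-bound: some memory unit is close to its maximum throughput.
    if (u.memory_max >= kHigh) return "memory-bound";
    // Latency-bound: nothing is saturated, but the pipelines are mostly stalled.
    if (u.stall >= kHigh) return "latency-bound";
    return "unclassified";
}
```

A range with 90% issue, 85% pipeline and 30% peak memory utilization is reported as compute-bound; one with everything low except 90% stalls is reported as latency-bound.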

9 Graphics Metrics
[Diagram of the graphics pipeline. Labels: Front End (decoder), IA (Vertex Fetch), Vertex Shader, Hull Shader, Tess, Domain Shader, Geom Shader, XFB, Raster, Pixel Shader, SM (unified shaders), L1, Tex, CROP, ZROP, L2, CPU, System, Device, Image]

10 Range-Based Profiling
Previous tools profile one kernel or draw call at a time.
With PerfWorks, you can profile them as a range, which accounts for their inherent parallelism.
Optimizing these two cases is very different! Improving the duration of an individual kernel may increase its resource usage, which can prevent parallelism or harm parallel execution time.
Ranges can include diverse workloads and setup cost.
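The difference between per-kernel and per-range timing can be made concrete with a small sketch. The `Interval` type and the timestamps are hypothetical; the point is only that when kernels overlap, the duration of the enclosing range is shorter than the sum of the individual durations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical GPU timestamps (nanoseconds) for one kernel or draw call.
struct Interval {
    std::uint64_t begin_ns;
    std::uint64_t end_ns;
};

// Sum of individual durations: what profiling one kernel at a time adds up to.
std::uint64_t sum_of_durations(const std::vector<Interval>& work) {
    std::uint64_t total = 0;
    for (const Interval& w : work) total += w.end_ns - w.begin_ns;
    return total;
}

// Duration of the enclosing range: earliest begin to latest end.
std::uint64_t range_duration(const std::vector<Interval>& work) {
    assert(!work.empty());
    std::uint64_t first = work.front().begin_ns;
    std::uint64_t last = work.front().end_ns;
    for (const Interval& w : work) {
        first = std::min(first, w.begin_ns);
        last = std::max(last, w.end_ns);
    }
    return last - first;
}
```

For two kernels running over [0, 100] and [20, 110], the individual durations sum to 190 ns while the range takes 110 ns. Shortening one kernel at the cost of overlap can therefore leave the range no faster.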

11 Multi-Pass Profiling
The hardware has a limited number of physical counters. To collect more than the physical limit, PerfWorks requires the application to deterministically replay the GPU work multiple times.
During each replay pass:
- the application must make the same GPU calls, with the same range delimiters
- a different set of counters is collected
[Diagram: two replay passes, each consisting of BeginPass, Range A, Range B, EndPass. The first pass collects ctr0 for both ranges; the second collects ctr1.]
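Why more counters means more passes can be sketched with a simple partitioning function. This is a simplified model under a stated assumption: any counter may occupy any physical slot. Real hardware has further placement constraints, so the actual pass count may be higher. The function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split the requested counters into passes of at most `physical_slots` each.
// Simplifying assumption: any counter can occupy any slot.
std::vector<std::vector<std::string>> schedule_passes(
        const std::vector<std::string>& requested, std::size_t physical_slots) {
    assert(physical_slots > 0);
    std::vector<std::vector<std::string>> passes;
    for (std::size_t i = 0; i < requested.size(); i += physical_slots) {
        const std::size_t end = std::min(i + physical_slots, requested.size());
        // Each pass collects the counters in [i, end).
        passes.emplace_back(requested.begin() + static_cast<std::ptrdiff_t>(i),
                            requested.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return passes;
}
```

Requesting ctr0 and ctr1 with a single physical slot yields two passes, matching the diagram: the first pass collects ctr0 for every range, the second collects ctr1.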

12 CUDA Example
    {
        kernel1<<<1, N, 0, s0>>>(...);
        kernel2<<<1, N, 0, s1>>>(...);
        cuLaunchKernel(...);
    }
    cudaDeviceSynchronize();

13 CUDA Example
    {
        NVPA_CUDA_PushRange('A');        // begin Range A
        kernel1<<<1, N, 0, s0>>>(...);
        kernel2<<<1, N, 0, s1>>>(...);
        NVPA_CUDA_PopRange();            // end Range A
        NVPA_CUDA_PushRange('B');        // begin Range B
        cuLaunchKernel(...);
        NVPA_CUDA_PopRange();            // end Range B
    }
    cudaDeviceSynchronize();
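Because every push must be matched by a pop, one common C++ idiom is to wrap the pair in a scope guard. The sketch below is an assumption about how an application might do this, not part of the PerfWorks API: `ScopedRange` is a hypothetical helper, and the demo records events in a string instead of calling the real push/pop functions, which are not available here.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Calls `push` on construction and `pop` on destruction, so the range is
// closed on every exit path from the enclosing scope.
class ScopedRange {
public:
    ScopedRange(const std::function<void()>& push, std::function<void()> pop)
        : pop_(std::move(pop)) {
        push();
    }
    ~ScopedRange() { pop_(); }
    ScopedRange(const ScopedRange&) = delete;
    ScopedRange& operator=(const ScopedRange&) = delete;

private:
    std::function<void()> pop_;
};

// Stand-in for the profiled work: logs the order of events rather than
// launching kernels or calling the profiler.
std::string demo_trace() {
    std::string log;
    {
        ScopedRange a([&] { log += "push(A);"; }, [&] { log += "pop;"; });
        log += "kernel1;kernel2;";
    }
    {
        ScopedRange b([&] { log += "push(B);"; }, [&] { log += "pop;"; });
        log += "kernel3;";
    }
    return log;
}
```

In a real application the two lambdas would wrap the push and pop calls from the example above; the guard only guarantees that they stay balanced.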

14 CUDA Example
    do {
        // Each iteration of this loop is one replay pass.
        cuCtxGetCurrent(&ctx);
        NVPA_Context_BeginPass(ctx);
        NVPA_CUDA_PushRange('A');
        kernel1<<<1, N, 0, s0>>>(...);
        kernel2<<<1, N, 0, s1>>>(...);
        NVPA_CUDA_PopRange();
        NVPA_CUDA_PushRange('B');
        cuLaunchKernel(...);
        NVPA_CUDA_PopRange();
        NVPA_Context_EndPass(ctx);
        cudaDeviceSynchronize();
    } while (!IsDataReady(ctx));

15 CUDA Example
    do {
        cuCtxGetCurrent(&ctx);
        NVPA_Context_BeginPass(ctx);
        NVPA_CUDA_PushRange('A');
        kernel1<<<1, N, 0, s0>>>(...);
        kernel2<<<1, N, 0, s1>>>(...);
        NVPA_CUDA_PopRange();
        NVPA_CUDA_PushRange('B');
        cuLaunchKernel(...);
        NVPA_CUDA_PopRange();
        NVPA_Context_EndPass(ctx);
        cudaDeviceSynchronize();
    } while (!IsDataReady(ctx));

Resulting metric data:
    Range ID   gpu dispatch_count
    A          2
    B          1
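The control flow of the replay loop can be exercised without a GPU by substituting a mock for the profiler. In the sketch below, `MockProfiler` and `replay_until_ready` are hypothetical names; the mock simply counts completed passes and reports readiness after a fixed number, which stands in for whatever `IsDataReady` checks in the real example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the profiler state behind BeginPass/EndPass.
class MockProfiler {
public:
    explicit MockProfiler(std::size_t passes_needed) : needed_(passes_needed) {}
    void begin_pass() { assert(!in_pass_); in_pass_ = true; }
    void end_pass() { assert(in_pass_); in_pass_ = false; ++done_; }
    bool is_data_ready() const { return done_ >= needed_; }
    std::size_t passes_done() const { return done_; }

private:
    std::size_t needed_;
    std::size_t done_ = 0;
    bool in_pass_ = false;
};

// Replays `submit_work` until the profiler has seen enough passes.
// `submit_work` must issue identical GPU work on every call.
template <typename Work>
std::size_t replay_until_ready(MockProfiler& profiler, Work submit_work) {
    do {
        profiler.begin_pass();
        submit_work();
        profiler.end_pass();
    } while (!profiler.is_data_ready());
    return profiler.passes_done();
}
```

If the mock needs three passes and each pass submits three kernels, the work callback runs three times and nine kernels are launched in total. This is why the replayed work has to be deterministic: every pass must see the same launches in the same ranges.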

16 OpenGL Example
    do {
        // Each iteration of this loop is one replay pass.
        glContext = wglGetCurrentContext();
        NVPA_Context_BeginPass(glContext);
        NVPA_OpenGL_PushRange('A');
        glDrawElements(...);
        glDrawElements(...);
        NVPA_OpenGL_PopRange();
        NVPA_OpenGL_PushRange('B');
        glDrawElements(...);
        NVPA_OpenGL_PopRange();
        NVPA_Context_EndPass(glContext);
        SwapBuffers(...);
    } while (!IsDataReady(glContext));

Resulting metric data:
    Range ID   gpu draw_count
    A          2
    B          1

17 D3D12 Example
    // Prebake draw calls into a CommandList.
    ID3D12GraphicsCommandList* pCmd = ...;
    NVPA_Object_PushRange(pCmd, 'A');        // Range A
    pCmd->DrawInstanced(...);
    pCmd->DrawInstanced(...);
    NVPA_Object_PopRange(pCmd);
    NVPA_Object_PushRange(pCmd, 'B');        // Range B
    pCmd->DrawInstanced(...);
    NVPA_Object_PopRange(pCmd);

    // Submit rendering work (one replay pass).
    ID3D12CommandQueue* pQueue = ...;
    NVPA_Context_BeginPass(pQueue);
    NVPA_Object_PushRange(pQueue, 'F');      // Range F
    pQueue->ExecuteCommandLists(1, &pCmd);
    NVPA_Object_PopRange(pQueue);
    NVPA_Context_EndPass(pQueue);
    pSwapChain->Present(0, 0);

18 D3D12 Metric Data
This example produces nested ranges. The CommandList ranges {A, B} are nested under the Queue range F.
    Range ID   gpu draw_count   gpu time_duration
    F                           usec
    F.A                         usec
    F.B                         usec
[numeric values not preserved]
Deterministic counters like draw count or shaded pixels will sum perfectly.
Activity and throughput are NOT summable, due to parallel execution.
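The claim that deterministic counters sum across nested ranges can be checked mechanically. The sketch below assumes range IDs are dotted paths such as "F.A", as in the table; the `CounterTable` alias and the function name are hypothetical, and the draw counts in the usage note are illustrative values read off the draw calls in the D3D12 example rather than figures from the slide.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Per-range values of a deterministic counter, keyed by dotted range ID.
using CounterTable = std::map<std::string, std::uint64_t>;

// Sums the counter over the direct children of `parent` ("F" -> "F.A", "F.B").
std::uint64_t sum_direct_children(const CounterTable& table,
                                  const std::string& parent) {
    const std::string prefix = parent + ".";
    std::uint64_t total = 0;
    for (const auto& entry : table) {
        const std::string& id = entry.first;
        // A direct child starts with "<parent>." and has no further dot.
        const bool is_child = id.compare(0, prefix.size(), prefix) == 0 &&
                              id.find('.', prefix.size()) == std::string::npos;
        if (is_child) total += entry.second;
    }
    return total;
}
```

With draw counts of 2 for F.A and 1 for F.B, the children sum to 3, which is what range F should report. The same check would not hold for durations or throughput when the child ranges execute in parallel.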

19 NVIDIA Nsight Range Profiler
The new Range Profiler in the Nsight VSE Graphics Debugger allows you to define ranges by performance markers, render targets, shader programs, etc.
This lets you see an overview of performance first, before drilling down into details.
Every requested metric is re-collected per range.
[Image: NVIDIA Nsight VSE, showing perf markers from an Unreal Engine 4 demo]

20 Future: NVIDIA Developer Tools
NVIDIA Developer Tools are moving to PerfWorks.
- Nsight Visual Studio Edition: new Graphics Range Profiler, Analysis CUDA Profiler
- CUDA Profiler Suite: CUDA Visual Profiler, nvprof
Consistent metrics across tools and APIs.
Bringing CUDA profiler features to OpenGL and D3D tools.

21 Future: NVIDIA Developer Tools

22 Future: PerfWorks SDK
- Source-level counters for compute and graphics shaders.
- GPU shader PC sampling, as in the Visual Profiler.
- Lower overhead, realtime counters usable for perf stats in a HUD.
- Frequency-based sampling of GPU counters.
- GPU workload trace events that produce an execution timeline.

23 THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join
SEND QUESTIONS TO

24 BACKUP SLIDES...

25 D3D11 Sample
    ID3D11DeviceContext* pContext = ...;
    NVPA_Context_BeginPass(pContext);        // begin replay pass
    NVPA_Object_PushRange(pContext, 'A');    // Range A
    pContext->DrawIndexed(...);
    pContext->DrawIndexed(...);
    NVPA_Object_PopRange(pContext);
    NVPA_Object_PushRange(pContext, 'B');    // Range B
    pContext->DrawIndexed(...);
    NVPA_Object_PopRange(pContext);
    NVPA_Context_EndPass(pContext);          // end replay pass
    pSwapChain->Present(0, 0);
