Overview of Performance Prediction Tools for Better Development and Tuning Support

Size: px

Start display at page:

Download "Overview of Performance Prediction Tools for Better Development and Tuning Support"

Reginald Caldwell
5 years ago
Views:

1 Overview of Performance Prediction Tools for Better Development and Tuning Support Universidade Federal Fluminense Rommel Anatoli Quintanilla Cruz / Master's Student Esteban Clua / Associate Professor GTC 2016, San Jose, CA, USA, April 7 th, 2016

2 What you will learn from this talk...

3 Outline Motivation Performance models Applications Challenges

4 Performance Optimization Cycle* 1. Profile Application 5. Change and Test Code 2. Identify Performance Limiters 4. Reflect 3. Analyze Profile & Find Indicators * Adapted from S5173 CUDA Optimization with NVIDIA NSIGHT ECLIPSE Edition GTC 2015

5 Performance Analysis Tools NVIDIA Visual Profiler The NVIDIA CUDA Profiling Tools Interface The PAPI CUDA Component

6 Performance tools are still evolving CUDA 7.5 Instruction-level profiling NVIDIA Visual Profiler

7 Performance tools are still evolving But it's still not enough... Concurrent Kernel Execution Power Streaming

8 Outline Motivation Performance models Applications Challenges

9 Performance models Performance model

10 Performance models Pseudocode Source code PTX CUBIN Performance model Target Device Information Input

11 Performance models Pseudocode Source code PTX CUBIN Target Device Information Performance model Power consumption estimation Execution time prediction on a target device Performance bottlenecks identification Input Output

12 Types of performance models Analytical Models Simulation Statistical Models Advantages & Disadvantages

13 Analytical models The MWP-CWP model [Hong & Kim 2009] MWP: Memory warp parallelism CWP: Computation warp parallelism

14 Statistical models * GPGPU performance and power estimation using machine learning. - Wu, Gene, et al.

15 Simulation PTX Emulation PTX Kernel GPU Execution LLVM Translation GPU Ocelot

16 Outline Motivation Performance models Applications Challenges

17 Applications of performance models Successfully used to schedule concurrent kernels make auto-tuning Today! estimate power consumption identify performance bottlenecks make workload balancing

18 Auto-tuning Optimization goals Parameters Large search space

19 Concurrent Kernel Execution Supported since Fermi Limitations: Registers, Shared Memory, Occupancy * Image from

20 Outline Motivation Performance models Applications Challenges

21 Challenges Multiple-gpu systems, heterogeneous systems Each microarchitecture has its own features More complex execution behavior is harder to model accurately

22 References and Further reading Hong, Sunpyo, and Hyesoon Kim. "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness." ACM SIGARCH Computer Architecture News. Vol. 37. No. 3. ACM, Kim, Hyesoon, et al. "Performance analysis and tuning for general purpose graphics processing units (GPGPU)." Synthesis Lectures on Computer Architecture 7.2 (2012): Lopez-Novoa, Unai, Alexander Mendiburu, and José Miguel-Alonso. "A survey of performance modeling and simulation techniques for accelerator-based computing." Parallel and Distributed Systems, IEEE Transactions on 26.1 (2015): Zhong, Jianlong, and Bingsheng He. "Kernelet: High-throughput gpu kernel executions with dynamic slicing and scheduling." Parallel and Distributed Systems, IEEE Transactions on 25.6 (2014):

23 Acknowledgements

24 Thank you! #GTC16 Contact:

25 Questions & Answers

26 Backup Slides

27 Simplified compilation flow.cu CUDA Compiler.cpu Host code Host Compiler nvcc cudafe.gpu Device code cicc.ptx Virtual Instruction Set ptxas CUDA Front End.cubin CUDA Binary File High level optimizer and PTX generator PTX Optimizing Assembler.fatbinary CUDA Executable

28 Concurrent Kernel Execution Leftover policy Timeline K1 16 blocks... K1 16 blocks K1 4 blocks K2 12 blocks K2 16 blocks... Kernel slicing Timeline K1 6 blocks K2 10 blocks K1 6 blocks K1 6 blocks K1 6 blocks K2 10 blocks K2 10 blocks K2 10 blocks... * Improving GPGPU energy-efficiency through concurrent kernel execution and DVFS - Jiao, Qing, et al.

CUDA Development Using NVIDIA Nsight, Eclipse Edition. David Goodwin

CUDA Development Using NVIDIA Nsight, Eclipse Edition David Goodwin NVIDIA Nsight Eclipse Edition CUDA Integrated Development Environment Project Management Edit Build Debug Profile SC'12 2 Powered By