Overview of Performance Prediction Tools for Better Development and Tuning Support

Similar documents
CUDA Development Using NVIDIA Nsight, Eclipse Edition. David Goodwin

Profiling of Data-Parallel Processors

High Performance Computing on GPUs using NVIDIA CUDA

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN

CUDA Architecture & Programming Model

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

Memory-level and Thread-level Parallelism Aware GPU Architecture Performance Analytical Model

GPU Task-Parallelism: Primitives and Applications. Stanley Tzeng, Anjul Patney, John D. Owens University of California at Davis

CSE 591: GPU Programming. Programmer Interface. Klaus Mueller. Computer Science Department Stony Brook University

CUDA Tools for Debugging and Profiling

Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs

Dynamic Cuda with F# HPC GPU & F# Meetup. March 19. San Jose, California

CUDA Programming Model

A Framework for Modeling GPUs Power Consumption

Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs

Ocelot: An Open Source Debugging and Compilation Framework for CUDA

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

General Purpose GPU Programming. Advanced Operating Systems Tutorial 9

Design Space Explorations for Streaming Accelerators using Streaming Architectural Simulator

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming

Predictive Runtime Code Scheduling for Heterogeneous Architectures

GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7

CUDA programming. CUDA requirements. CUDA Querying. CUDA Querying. A CUDA-capable GPU (NVIDIA) NVIDIA driver A CUDA SDK

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Faster Simulations of the National Airspace System

Improving Performance of Machine Learning Workloads

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018

gpucc: An Open-Source GPGPU Compiler

n N c CIni.o ewsrg.au

CS 179: GPU Computing

gpucc: An Open-Source GPGPU Compiler

TUNING CUDA APPLICATIONS FOR MAXWELL

Multi2sim Kepler: A Detailed Architectural GPU Simulator

Generic Polyphase Filterbanks with CUDA

Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University Scalable Tools Workshop 7 August 2017

Practical Introduction to CUDA and GPU

GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS

Hybrid Implementation of 3D Kirchhoff Migration

Parallel Compact Roadmap Construction of 3D Virtual Environments on the GPU

Paper Summary Problem Uses Problems Predictions/Trends. General Purpose GPU. Aurojit Panda

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

GPU Computing with NVIDIA s new Kepler Architecture

Mapping Multi- Agent Systems Based on FIPA Specification to GPU Architectures

Hands-on CUDA Optimization. CUDA Workshop

Tuning CUDA Applications for Fermi. Version 1.2

Profiling & Tuning Applications. CUDA Course István Reguly

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

OPENMP GPU OFFLOAD IN FLANG AND LLVM. Guray Ozen, Simone Atzeni, Michael Wolfe Annemarie Southwell, Gary Klimowicz

CUDA Tools for Debugging and Profiling. Jiri Kraus (NVIDIA)

CS377P Programming for Performance GPU Programming - II

Asynchronous K-Means Clustering of Multiple Data Sets. Marek Fiser, Illia Ziamtsov, Ariful Azad, Bedrich Benes, Alex Pothen

GPU Implementation of a Multiobjective Search Algorithm

MAXWELL COMPATIBILITY GUIDE FOR CUDA APPLICATIONS

Enabling the Next Generation of Computational Graphics with NVIDIA Nsight Visual Studio Edition. Jeff Kiel Director, Graphics Developer Tools

TUNING CUDA APPLICATIONS FOR MAXWELL

Understanding and modeling the synchronization cost in the GPU architecture

Profiling GPU Code. Jeremy Appleyard, February 2016

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA

Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs

Recent Advances in Heterogeneous Computing using Charm++

Big Data Systems on Future Hardware. Bingsheng He NUS Computing

Introduction to GPU programming with CUDA

OpenACC2 vs.openmp4. James Lin 1,2 and Satoshi Matsuoka 2

CS 179 Lecture 4. GPU Compute Architecture

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Compile-time GPU memory access optimizations

Understanding Outstanding Memory Request Handling Resources in GPGPUs

CSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University

PASCAL COMPATIBILITY GUIDE FOR CUDA APPLICATIONS

Architecture of Request Distributor for GPU Clusters

ECE 8823: GPU Architectures. Objectives

Algorithms for Auto- tuning OpenACC Accelerated Kernels

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

The Era of Heterogeneous Compute: Challenges and Opportunities

Mattan Erez. The University of Texas at Austin

NVIDIA Fermi Architecture

Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC

KEPLER COMPATIBILITY GUIDE FOR CUDA APPLICATIONS

OpenCL: History & Future. November 20, 2017

Implementation and Experimental Evaluation of a CUDA Core under Single Event Effects. Werner Nedel, Fernanda Kastensmidt, José.

CS 179: GPU Programming LECTURE 5: GPU COMPUTE ARCHITECTURE FOR THE LAST TIME

GPUCC An Open-Source GPGPU Compiler A Preview

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

Cross-Architecture Performance Prediction and Analysis Using Machine Learning. Newsha Ardalani

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Spring 2009 Prof. Hyesoon Kim

Advancements in V-Ray RT GPU. Vlado Koylazov, CTO & Co-founder Blagovest Taskov, RT GPU Team Lead Alexander Soklev, RT GPU R&D

Fundamental CUDA Optimization. NVIDIA Corporation

CUDA OPTIMIZATIONS ISC 2011 Tutorial

Transcription:

Overview of Performance Prediction Tools for Better Development and Tuning Support Universidade Federal Fluminense Rommel Anatoli Quintanilla Cruz / Master's Student Esteban Clua / Associate Professor GTC 2016, San Jose, CA, USA, April 7 th, 2016

What you will learn from this talk...

Outline Motivation Performance models Applications Challenges

Performance Optimization Cycle* 1. Profile Application 5. Change and Test Code 2. Identify Performance Limiters 4. Reflect 3. Analyze Profile & Find Indicators * Adapted from S5173 CUDA Optimization with NVIDIA NSIGHT ECLIPSE Edition GTC 2015

Performance Analysis Tools NVIDIA Visual Profiler The NVIDIA CUDA Profiling Tools Interface The PAPI CUDA Component

Performance tools are still evolving CUDA 7.5 Instruction-level profiling NVIDIA Visual Profiler

Performance tools are still evolving But it's still not enough... Concurrent Kernel Execution Power Streaming

Outline Motivation Performance models Applications Challenges

Performance models Performance model

Performance models Pseudocode Source code PTX CUBIN Performance model Target Device Information Input

Performance models Pseudocode Source code PTX CUBIN Target Device Information Performance model Power consumption estimation Execution time prediction on a target device Performance bottlenecks identification Input Output

Types of performance models Analytical Models Simulation Statistical Models Advantages & Disadvantages

Analytical models The MWP-CWP model [Hong & Kim 2009] MWP: Memory warp parallelism CWP: Computation warp parallelism

Statistical models * GPGPU performance and power estimation using machine learning. - Wu, Gene, et al.

Simulation PTX Emulation PTX Kernel GPU Execution LLVM Translation GPU Ocelot

Outline Motivation Performance models Applications Challenges

Applications of performance models Successfully used to schedule concurrent kernels make auto-tuning Today! estimate power consumption identify performance bottlenecks make workload balancing

Auto-tuning Optimization goals Parameters Large search space

Concurrent Kernel Execution Supported since Fermi Limitations: Registers, Shared Memory, Occupancy * Image from http://www.turkpaylasim.com/cevahir

Outline Motivation Performance models Applications Challenges

Challenges Multiple-gpu systems, heterogeneous systems Each microarchitecture has its own features More complex execution behavior is harder to model accurately

References and Further reading Hong, Sunpyo, and Hyesoon Kim. "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness." ACM SIGARCH Computer Architecture News. Vol. 37. No. 3. ACM, 2009. Kim, Hyesoon, et al. "Performance analysis and tuning for general purpose graphics processing units (GPGPU)." Synthesis Lectures on Computer Architecture 7.2 (2012): 1-96. Lopez-Novoa, Unai, Alexander Mendiburu, and José Miguel-Alonso. "A survey of performance modeling and simulation techniques for accelerator-based computing." Parallel and Distributed Systems, IEEE Transactions on 26.1 (2015): 272-281. Zhong, Jianlong, and Bingsheng He. "Kernelet: High-throughput gpu kernel executions with dynamic slicing and scheduling." Parallel and Distributed Systems, IEEE Transactions on 25.6 (2014): 1522-1532.

Acknowledgements

Thank you! #GTC16 Contact: rquintanillac@ic.uff.br esteban@ic.uff.br http://medialab.ic.uff.br

Questions & Answers

Backup Slides

Simplified compilation flow.cu CUDA Compiler.cpu Host code Host Compiler nvcc cudafe.gpu Device code cicc.ptx Virtual Instruction Set ptxas CUDA Front End.cubin CUDA Binary File High level optimizer and PTX generator PTX Optimizing Assembler.fatbinary CUDA Executable

Concurrent Kernel Execution Leftover policy Timeline K1 16 blocks... K1 16 blocks K1 4 blocks K2 12 blocks K2 16 blocks... Kernel slicing Timeline K1 6 blocks K2 10 blocks K1 6 blocks K1 6 blocks K1 6 blocks K2 10 blocks K2 10 blocks K2 10 blocks... * Improving GPGPU energy-efficiency through concurrent kernel execution and DVFS - Jiao, Qing, et al.