High-Productivity CUDA Programming. Levi Barnes, Developer Technology Engineer, NVIDIA

1 High-Productivity CUDA Programming Levi Barnes, Developer Technology Engineer, NVIDIA

2 MORE RESOURCES

3 How to learn more. GTC (March 2014, San Jose, CA): gputechconf.com, video archives too. Qwiklabs: nvlabs.qwiklabs.com, hands-on labs/tutorials, more soon. MOOCs: udacity.com, coursera.com

4 HIGH-PRODUCTIVITY PROGRAMMING

5 High-Productivity Programming: What does this mean? What's the goal? Do Less Work (Expend Less Effort).

6 High-Productivity Programming: What does this mean? What's the goal? Do Less Work (Expend Less Effort). Make Fewer Mistakes.

7 High-Productivity Programming: What does this mean? What's the goal? Do Less Work (Expend Less Effort). Make Fewer Mistakes. Find Mistakes Sooner.

8 High-Productivity Programming: What does this mean? What's the goal? Do Less Work (Expend Less Effort). Make Fewer Mistakes. Find Mistakes Sooner. Discover/Publish More.

9 High-Productivity Programming: What kinds of tools exist to meet those goals? Languages/compilers; libraries; integrated development environments (IDEs); profiling, correctness-checking, and debugging tools.

10 HIGH-PRODUCTIVITY CUDA: Programming Languages, Compilers

11 GPU Programming Languages
    Numerical analytics: MATLAB, Mathematica, LabVIEW
    Fortran: OpenACC, CUDA Fortran
    C: OpenACC, CUDA C
    C++: CUDA C++, Thrust, Hemi, ArrayFire
    Python: NumbaPro, PyCUDA, Copperhead
    .NET: CUDAfy.NET, Alea.cuBase, NMath
    developer.nvidia.com/language-solutions

12 HIGH-PRODUCTIVITY CUDA: GPU-Accelerated Libraries

13 GPU Accelerated Libraries: Drop-in Acceleration for your Applications. NVIDIA cuBLAS; NVIDIA cuSPARSE; NVIDIA NPP; NVIDIA cuFFT; Matrix Algebra on GPU and Multicore; GPU Accelerated Linear Algebra; Vector Signal Image Processing; NVIDIA cuRAND; IMSL Library; CenterSpace NMath; Building-block Algorithms; C++ Templated Parallel Algorithms.

14 HIGH-PRODUCTIVITY CUDA: Integrated Development Environments

15 NVIDIA Nsight, Eclipse Edition, for Linux and Mac OS. CUDA-Aware Editor: automated CPU to GPU code refactoring; semantic highlighting of CUDA code; integrated code samples & docs. Nsight Debugger: simultaneously debug CPU and GPU; inspect variables across CUDA threads; use breakpoints & single-step debugging. Nsight Profiler: quickly identifies performance issues; integrated expert system; source line correlation. developer.nvidia.com/nsight

16 NVIDIA Nsight Visual Studio Edition. CUDA Debugger: debug CUDA kernels directly on GPU hardware; examine thousands of threads executing in parallel; use on-target conditional breakpoints to locate errors. CUDA Memory Checker: enables precise error detection. System Trace: review CUDA activities across CPU and GPU; perform deep kernel analysis to detect factors limiting maximum performance. CUDA Profiler: advanced experiments to measure memory utilization, instruction throughput and stalls.

17 HIGH-PRODUCTIVITY CUDA: Profiling and Debugging Tools

18 Debugging Solutions: Command Line to Cluster-Wide. NVIDIA Nsight, Eclipse & Visual Studio Editions; NVIDIA CUDA-GDB for Linux & Mac; NVIDIA CUDA-MEMCHECK for Linux & Mac; Allinea DDT with CUDA (Distributed Debugging Tool); TotalView for CUDA for Linux Clusters. developer.nvidia.com/debugging-solutions

19 Performance Analysis Tools: Single Node to Hybrid Cluster Solutions. NVIDIA Nsight, Eclipse & Visual Studio Editions; NVIDIA Visual Profiler; Vampir Trace Collector; TAU Performance System; PAPI CUDA Component; Under Development. developer.nvidia.com/performance-analysis-tools

20 Want to know more? Visit DeveloperZone: developer.nvidia.com/cuda-tools-ecosystem

21 High-Productivity CUDA Programming: Know the tools at our disposal. Use them wisely. Develop systematically.

22 SYSTEMATIC CUDA DEVELOPMENT

23 APOD: A Systematic Path to Performance. Assess, Parallelize, Optimize, Deploy.

24 Assess: HOTSPOTS. Profile the code, find the hotspot(s). Focus your attention where it will give the most benefit.

25 Parallelize: Applications; Libraries; OpenACC Directives; Programming Languages.

26 Optimize: Profile-driven optimization. Tools: nsight (NVIDIA Nsight IDE); nvvp (NVIDIA Visual Profiler); nvprof (command-line profiling); cuobjdump (decompiler).

27 Deploy: Productize. Check API return values. Run cuda-memcheck tools. Library distribution. Cluster management. Early gains; subsequent changes are evolutionary.

28 SYSTEMATIC CUDA DEVELOPMENT: APOD Case Study

29 APOD CASE STUDY Round 1: Assess

30 Assess: Profile the code, find the hotspot(s). Focus your attention where it will give the most benefit.

31 Assess: We've found a hotspot to work on! What percent of our total time does this represent? How much can we improve it? What is the speed of light?

32 Assess: What percent of total time does our hotspot represent? ~36%? ~93%?

33 Assess: We've found a hotspot to work on! It's 93% of GPU compute time, but only 36% of total time. What's going on during that other ~60% of time? Maybe it's one-time startup overhead not worth looking at. Or maybe it's significant; we should check.

34 Assess: Asynchronicity. This is the kind of case we would be concerned about. Found the top kernel, but the GPU is mostly idle; that is our bottleneck. Need to overlap CPU/GPU computation and PCIe transfers.

35 Parts of your system: CPU; Network Card; GPU.

36 Parts of your system: CPU; Network Card; GPU (copy engine, 2 directions).

37 Parts of your system: CPU; Network Card; GPU (copy engine, 2 directions; compute engine).

38 Parts of your system: CPU; Network Card; GPU (copy engine, 2 directions; compute engine: integer ops, floating point ops, memory read/write).

39 Asynchronicity = Overlap = Parallelism. Heterogeneous system: overlap work and data movement (DMA). Kepler/CUDA 5: Hyper-Q and CPU Callbacks make this fairly easy.

40 APOD CASE STUDY Round 1: Parallelize

41 Parallelize: Achieve Asynchronicity. What we want to see is maximum overlap of all engines. How to achieve it? Use the CUDA APIs for asynchronous copies and stream callbacks. Or use CUDA Proxy and multiple tasks/node to approximate this.
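
The benefit of overlapping the copy and compute engines can be estimated before touching any CUDA code. The sketch below is a plain C++ timing model, not CUDA API usage; `ChunkTimes`, `serial_time`, and `overlapped_time` are hypothetical names, and it assumes the work splits into equal chunks and that the two copy directions and the kernel can all run concurrently.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-chunk stage times in milliseconds (illustrative, not measured).
struct ChunkTimes {
    double h2d;     // host-to-device copy
    double kernel;  // kernel execution
    double d2h;     // device-to-host copy
};

// No overlap: every chunk runs its three stages back to back.
double serial_time(const ChunkTimes& t, int chunks) {
    assert(chunks > 0);
    return chunks * (t.h2d + t.kernel + t.d2h);
}

// Ideal overlap: each engine works on a different chunk at the same time,
// so once the pipeline is full, one chunk finishes every max(stage) ms.
double overlapped_time(const ChunkTimes& t, int chunks) {
    assert(chunks > 0);
    const double fill = t.h2d + t.kernel + t.d2h;
    const double slowest = std::max({t.h2d, t.kernel, t.d2h});
    return fill + (chunks - 1) * slowest;
}
```

With 2 ms copies in each direction and a 3 ms kernel over 8 chunks, this model gives 56 ms without overlap and 28 ms with ideal overlap, even though the kernel itself is unchanged. Real pipelines will show bubbles that the model ignores.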

42 Further or Move On? Even after we fixed overlap, we still have some pipeline bubbles. CPU time per iteration is the limiting factor here. So our next step should be to parallelize more.

43 Further or Move On? Here's what we know so far: Found the top kernel.

44 Further or Move On? Here's what we know so far: Found the top kernel. But the GPU was mostly idle. Used asynchronous API to overlap.

45 Further or Move On? Here's what we know so far: Found the top kernel. But the GPU was mostly idle. Used asynchronous API to overlap. CPU is the new bottleneck. Lesson: Big gains even without optimizing the kernel.

46 Further or Move On? Here's what we know so far: Found the top kernel. But the GPU was mostly idle. Used asynchronous API to overlap. CPU is the new bottleneck. Lesson: Big gains even without optimizing the kernel. Skip ahead to Deploy.

47 APOD CASE STUDY Round 1: Deploy

48 Deploy: Let's reap the benefits of this sooner rather than later!

49 Deploy: Let's reap the benefits of this sooner rather than later! Functionality remains the same as before. We're just keeping as many units busy at once as we can.

50 Deploy: Let's reap the benefits of this sooner rather than later! Functionality remains the same as before. We're just keeping as many units busy at once as we can. CUDA is no excuse for bad coding practices.

51 Deploy: Let's reap the benefits of this sooner rather than later! Functionality remains the same as before. We're just keeping as many units busy at once as we can. CUDA is no excuse for bad coding practices. Add error checking and range bounds, and deploy.

52 APOD CASE STUDY Round 2: Assess

53 Assess: Our first round already gave us a glimpse at what's next: CPU compute time is now dominating. No matter how well we tune our kernels or reduce our PCIe traffic at this point, it won't reduce total time even a little.

54 Assess: We need to tune our CPU code somehow. Tune neglected code. Poor asynchronicity: use idle CPU cores, if any. Tune MPI communication. Offload more work to the GPU. If so, which approach will work best?

55 Assess: We need to tune our CPU code somehow. Tune neglected code. Poor asynchronicity: use idle CPU cores, if any. Tune MPI communication. Offload more work to the GPU. If so, which approach will work best? Notice that these basically say "parallelize".

56 Assess: Offloading Work to GPU. Applications: Libraries; OpenACC Directives; Programming Languages. Pick the best tool for the job.

57 APOD CASE STUDY Round 2: Parallelize

58 Parallelize: e.g., with OpenACC. Simple compiler hints. The compiler parallelizes the code. Works on many-core GPUs & multicore CPUs. Your original Fortran or C code, with an OpenACC compiler hint:
    Program myscience
      ... serial code ...
    !$acc kernels
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
    !$acc end kernels
      ...
    End Program myscience

59 Parallelize: e.g., with Thrust. Similar to C++ STL. High-level interface: enhances developer productivity; enables performance portability between GPUs and multicore CPUs. Flexible: backends for CUDA, OpenMP, TBB; extensible and customizable; integrates with existing software. Open source. thrust.github.com or developer.nvidia.com/thrust
    // generate 32M random numbers on host
    thrust::host_vector<int> h_vec(32 << 20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);
    // transfer data to device (GPU)
    thrust::device_vector<int> d_vec = h_vec;
    // sort data on device
    thrust::sort(d_vec.begin(), d_vec.end());
    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

60 Parallelize: e.g., with CUDA C. developer.nvidia.com/cuda-toolkit
Standard C Code:
    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    // Perform SAXPY on 1M elements
    saxpy_serial(4096*256, 2.0, x, y);
Parallel C Code:
    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        // one thread per element; the guard handles a partial last block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
    // Perform SAXPY on 1M elements
    saxpy_parallel<<<4096,256>>>(n, 2.0, x, y);

61 APOD CASE STUDY Round 2: Optimize

62 Optimizing OpenACC. Usually this means adding some extra directives to give the compiler more information. This is an entire session: S Optimizing OpenACC Codes; S Hands-on Lab: OpenACC Optimization.
    !$acc data copy(u, v)
    do t = 1, 1000
    !$acc kernels
      u(:,:) = u(:,:) + dt * v(:,:)
      do y=2, ny-1
        do x=2,nx-1
          v(x,y) = v(x,y) + dt * c * ...
        end do
      end do
    !$acc end kernels
    !$acc update host(u(1:nx/4,1:2))
      call BoundaryCondition(u)
    !$acc update device(u(1:nx/4,1:2))
    end do
    !$acc end data

63 APOD CASE STUDY Round 2: Deploy

64 Deploy: We've removed (or reduced) a bottleneck. Our app is now faster while remaining fully functional*. Let's take advantage of that! *Don't forget to check correctness at every step.

65 APOD CASE STUDY Round 3: Assess

66 Assess: Finally got rid of those other bottlenecks. Time to go dig in to that top kernel. ~93%

67 Assess: What percent of our total time does this represent? ~93%. How much can we improve it? What is the speed of light? How much will this improve our overall performance?

68 Assess: Let's investigate. Strong scaling and Amdahl's Law. Weak scaling and Gustafson's Law. Expected perf limiters: Bandwidth? Computation? Latency?

69 Assess: Understanding Scaling. Strong Scaling: fixed problem size; seek a faster solution with more processors. Perfect scaling: run time $\propto 1/N$. Amdahl's Law: $S = \frac{1}{(1 - P) + \frac{P}{N}}$, which approaches $\frac{1}{1 - P}$ for large $N$.

70 Assess: Understanding Scaling. Weak Scaling: adding more processors and more work. Perfect scaling: run time stays constant. Gustafson's Law: $S = N + (1 - P)(1 - N)$
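
The two scaling laws are easy to compare numerically. The following is a minimal C++ sketch, assuming $P$ is the parallelizable fraction of run time and $N$ the processor count; the function names are illustrative and do not come from the slides.

```cpp
#include <cassert>

// Amdahl's Law (strong scaling): speedup on n processors when a fraction p
// of the original run time can be parallelized.
double amdahl_speedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

// Upper bound on the Amdahl speedup as n grows without limit.
double amdahl_limit(double p) {
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}

// Gustafson's Law (weak scaling): scaled speedup when the problem size
// grows with the processor count.
double gustafson_speedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return n + (1.0 - p) * (1.0 - n);
}
```

For example, with `p = 0.93` and `n = 16`, `amdahl_speedup` returns about 7.8 while `gustafson_speedup` returns 14.95, which is why it matters which kind of scaling applies to the problem at hand.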

71 Assess: Applying Strong and Weak Scaling. Which scaling is important? Determine possible speedup.

72 Assess: Applying Strong Scaling. Recall that in this case we want to optimize an existing kernel with a pre-determined workload. That's strong scaling, so Amdahl's Law will determine the maximum speedup. ~93%

73 Assess: Applying Strong Scaling. Now that we've removed the other bottlenecks, our kernel is ~93% of total time. Speedup $S = \frac{1}{(1 - P) + \frac{P}{S_P}}$ ($S_P$ = speedup in parallel part). In the limit when $S_P$ is huge, $S$ will approach $\frac{1}{1 - P} \approx 14.3$. In practice, it will be less than that depending on the $S_P$ achieved.

74 Assess: Speed of Light. What's the limiting factor? Memory bandwidth? Compute throughput? Latency? For our example kernel, SpMV, we think it should be bandwidth. We're getting only ~38% of peak bandwidth. If we could get this to 65% of peak, that would mean 1.7× for this kernel, 1.6× overall: $S = \frac{1}{(1 - 0.93) + \frac{0.93}{1.7}} \approx 1.6$
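
The two estimates on this slide can be reproduced by hand. The working below assumes that the run time of a bandwidth-bound kernel scales inversely with achieved bandwidth, an assumption the slide implies but does not state.

```latex
\begin{align*}
S_P &= \frac{0.65}{0.38} \approx 1.71
  && \text{(kernel speedup from 38\% to 65\% of peak bandwidth)} \\
S &= \frac{1}{(1 - P) + \frac{P}{S_P}}
   = \frac{1}{0.07 + \frac{0.93}{1.71}}
   \approx \frac{1}{0.07 + 0.544}
   \approx 1.63
  && \text{(overall speedup with } P = 0.93\text{)}
\end{align*}
```

These round to the 1.7× and 1.6× figures quoted on the slide.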

75 Assess: Limiting Factor. What's the limiting factor? Memory bandwidth, compute throughput, or latency.

76 Assess: Limiting Factor. Compute bytes per instruction, compare to device specs.

77 Assess: Limiting Factor. Compute bytes per instruction, compare to device specs. Compare actual achieved GB/s and Ginstr/s against the theoretical peaks.

78 Assess: Limiting Factor. Compute bytes per instruction, compare to device specs. Compare actual achieved GB/s and Ginstr/s against the theoretical peaks. Underperformance indicates you're latency bound.

79 Assess: Limiting Factor. Compute bytes per instruction, compare to device specs. Compare actual achieved GB/s and Ginstr/s against the theoretical peaks. Underperformance indicates you're latency bound. Profiler can help.

80 Know your GPU: K20 = GK110
    Compute capability: 3.5
    Threads / warp: 32
    Max warps / multiprocessor (MP): 64
    Max threads / MP: ??
    Max threadblocks / MP:
    32-bit registers / MP:
    Max registers / thread: 255
    Max threads / threadblock: 1024
    Shared memory / MP: 48k

81 Specs: K20X. Memory bandwidth: 250 GB/s. Instruction throughput: 3.95 Tflops SP / 1.31 Tflops DP.

82 SpMV (Sparse Matrix-Vector multiply). # of bytes read? # of floating point operations (flops)? Ratio? K20 crossover? 0.25 TB/s / 4 Tflops = 0.0625 bytes/flop

83 SpMV (Sparse Matrix-Vector multiply). # of bytes read? 2 int + 1 float + 1 int + 1 float = 5*4 = 20 bytes per pair. # of floating point operations (flops)? Ratio? K20 crossover? 0.25 TB/s / 4 Tflops = 0.0625 bytes/flop

84 SpMV (Sparse Matrix-Vector multiply). # of bytes read? 2 int + 1 float + 1 int + 1 float = 5*4 = 20 bytes per pair. # of floating point operations (flops)? 1 multiply + 1 add per pair = 2 flops. Ratio? K20 crossover? 0.25 TB/s / 4 Tflops = 0.0625 bytes/flop

85 SpMV (Sparse Matrix-Vector multiply). # of bytes read? 2 int + 1 float + 1 int + 1 float = 5*4 = 20 bytes per pair. # of floating point operations (flops)? 1 multiply + 1 add per pair = 2 flops. Ratio: 10 bytes/flop. K20 crossover? 0.25 TB/s / 4 Tflops = 0.0625 bytes/flop

86 SpMV (Sparse Matrix-Vector multiply). # of bytes read? 2 int + 1 float + 1 int + 1 float = 5*4 = 20 bytes per pair. # of floating point operations (flops)? 1 multiply + 1 add per pair = 2 flops. Ratio: 10 bytes/flop. K20 crossover? 0.25 TB/s / 4 Tflops = 0.0625 bytes/flop
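
The bytes-per-flop comparison above can be written as a small helper. This is a sketch with hypothetical function names; the inputs are the slide's own figures (20 bytes and 2 flops per pair for SpMV, and roughly 0.25 TB/s and 4 Tflops for the K20-class peak).

```cpp
#include <cassert>

// Bytes a kernel moves per floating-point operation it performs.
double bytes_per_flop(double bytes, double flops) {
    assert(flops > 0.0);
    return bytes / flops;
}

// Machine balance ("crossover"): peak memory bandwidth in bytes/s divided by
// peak arithmetic throughput in flop/s.
double crossover_bytes_per_flop(double peak_bytes_per_s, double peak_flops_per_s) {
    assert(peak_flops_per_s > 0.0);
    return peak_bytes_per_s / peak_flops_per_s;
}

// A kernel that needs more bytes per flop than the machine can supply is
// expected to be limited by memory bandwidth rather than by compute.
bool is_bandwidth_bound(double kernel_ratio, double crossover) {
    return kernel_ratio > crossover;
}
```

Here `bytes_per_flop(20.0, 2.0)` is 10 and `crossover_bytes_per_flop(0.25e12, 4.0e12)` is 0.0625, so `is_bandwidth_bound` returns true for SpMV. That matches the expectation that this kernel is limited by bandwidth.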

87 Assess: Limiting Factor. Profiler indicates our bandwidth is 38% of peak. Below indicates problems

88 APOD CASE STUDY Round 3: Optimize

89 Optimize: Our optimization efforts should all be profiler-guided. Eliminate memory replays. Boost occupancy. Handle more than 1 item/thread (increases ILP). Reduce block synchronizations. Reduce warp divergence.

90 HIGH-PRODUCTIVITY CUDA: Wrap-up

91 High-Productivity CUDA Programming Recap: Know the tools at our disposal. Use them wisely. Develop systematically with APOD: Assess, Parallelize, Optimize, Deploy.

92 Online Resources: devtalk.nvidia.com, developer.nvidia.com, docs.nvidia.com

High-Productivity CUDA Programming. Cliff Woolley, Sr. Developer Technology Engineer, NVIDIA

High-Productivity CUDA Programming. Cliff Woolley, Sr. Developer Technology Engineer, NVIDIA High-Productivity CUDA Programming Cliff Woolley, Sr. Developer Technology Engineer, NVIDIA HIGH-PRODUCTIVITY PROGRAMMING High-Productivity Programming What does this mean? What s the goal? Do Less Work

More information

Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator

Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator What is CUDA? Programming language? Compiler? Classic car? Beer? Coffee? CUDA Parallel Computing Platform www.nvidia.com/getcuda Programming

More information

Languages, Libraries and Development Tools for GPU Computing

Languages, Libraries and Development Tools for GPU Computing Languages, Libraries and Development Tools for GPU Computing CPU GPU GPUs have evolved to the point where many real-world applications are easily implemented on them and run significantly faster than on

More information

CUDA 5 and Beyond. Mark Ebersole. Original Slides: Mark Harris 2012 NVIDIA

CUDA 5 and Beyond. Mark Ebersole. Original Slides: Mark Harris 2012 NVIDIA CUDA 5 and Beyond Mark Ebersole Original Slides: Mark Harris The Soul of CUDA The Platform for High Performance Parallel Computing Accessible High Performance Enable Computing Ecosystem Introducing CUDA

More information

Introduction to GPU Computing. 周国峰 Wuhan University 2017/10/13

Introduction to GPU Computing. 周国峰 Wuhan University 2017/10/13 Introduction to GPU Computing chandlerz@nvidia.com 周国峰 Wuhan University 2017/10/13 GPU and Its Application 3 Ways to Develop Your GPU APP An Example to Show the Developments Add GPUs: Accelerate Science

More information

GPU Computing Ecosystem

GPU Computing Ecosystem GPU Computing Ecosystem CUDA 5 Enterprise level GPU Development GPU Development Paths Libraries, Directives, Languages GPU Tools Tools, libraries and plug-ins for GPU codes Tesla K10 Kepler! Tesla K20

More information

Getting Started with CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator

Getting Started with CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator Getting Started with CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator Heterogeneous Computing CPU GPU Once upon a time Past Massively Parallel Supercomputers Goodyear MPP Thinking Machine MasPar Cray 2 1.31

More information

Ian Buck, GM GPU Computing Software

Ian Buck, GM GPU Computing Software Ian Buck, GM GPU Computing Software History... GPGPU in 2004 GFLOPS recent trends multiplies per second (observed peak) NVIDIA NV30, 35, 40 ATI R300, 360, 420 Pentium 4 July 01 Jan 02 July 02 Jan 03 July

More information

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018 S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific

More information

Approaches to GPU Computing. Libraries, OpenACC Directives, and Languages

Approaches to GPU Computing. Libraries, OpenACC Directives, and Languages Approaches to GPU Computing Libraries, OpenACC Directives, and Languages Add GPUs: Accelerate Applications CPU GPU 146X 36X 18X 50X 100X Medical Imaging U of Utah Molecular Dynamics U of Illinois, Urbana

More information

HPC with the NVIDIA Accelerated Computing Toolkit Mark Harris, November 16, 2015

HPC with the NVIDIA Accelerated Computing Toolkit Mark Harris, November 16, 2015 HPC with the NVIDIA Accelerated Computing Toolkit Mark Harris, November 16, 2015 Accelerators Surge in World s Top Supercomputers 125 100 75 Top500: # of Accelerated Supercomputers 100+ accelerated systems

More information

CUDA Update: Present & Future. Mark Ebersole, NVIDIA CUDA Educator

CUDA Update: Present & Future. Mark Ebersole, NVIDIA CUDA Educator CUDA Update: Present & Future Mark Ebersole, NVIDIA CUDA Educator Recent CUDA News Kepler K20 & K20X Kepler GPU Architecture: Streaming Multiprocessor (SMX) 192 SP CUDA Cores per SMX 64 DP CUDA Cores per

More information

Introduction to GPU Computing

Introduction to GPU Computing Introduction to GPU Computing Add GPUs: Accelerate Science Applications CPU GPU 146X 36X 18X 50X 100X Medical Imaging U of Utah Molecular Dynamics U of Illinois, Urbana Video Transcoding Elemental Tech

More information

CUDA 7.5 OVERVIEW WEBINAR 7/23/15

CUDA 7.5 OVERVIEW WEBINAR 7/23/15 CUDA 7.5 OVERVIEW WEBINAR 7/23/15 CUDA 7.5 https://developer.nvidia.com/cuda-toolkit 16-bit Floating-Point Storage 2x larger datasets in GPU memory Great for Deep Learning cusparse Dense Matrix * Sparse

More information

CSE 599 I Accelerated Computing Programming GPUS. Intro to CUDA C

CSE 599 I Accelerated Computing Programming GPUS. Intro to CUDA C CSE 599 I Accelerated Computing Programming GPUS Intro to CUDA C GPU Teaching Kit Accelerated Computing Lecture 2.1 - Introduction to CUDA C CUDA C vs. Thrust vs. CUDA Libraries Objective To learn the

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Review Secret behind GPU performance: simple cores but a large number of them; even more threads can exist live on the hardware (10k 20k threads live). Important performance

More information

An Introduc+on to OpenACC Part II

An Introduc+on to OpenACC Part II An Introduc+on to OpenACC Part II Wei Feinstein HPC User Services@LSU LONI Parallel Programming Workshop 2015 Louisiana State University 4 th HPC Parallel Programming Workshop An Introduc+on to OpenACC-

More information

APPROACHES. TO GPU COMPUTING Libraries, OpenACC Directives, and Languages

APPROACHES. TO GPU COMPUTING Libraries, OpenACC Directives, and Languages APPROACHES TO GPU COMPUTING Libraries, OpenACC Directives, and Languages Add GPUs: Accelerate Science Applications CPU GPU 146X 36X 18X 50X 100X Medical Imaging U of Utah Molecular Dynamics U of Illinois,

More information

CUDA Development Using NVIDIA Nsight, Eclipse Edition. David Goodwin

CUDA Development Using NVIDIA Nsight, Eclipse Edition. David Goodwin CUDA Development Using NVIDIA Nsight, Eclipse Edition David Goodwin NVIDIA Nsight Eclipse Edition CUDA Integrated Development Environment Project Management Edit Build Debug Profile SC'12 2 Powered By

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Profiling & Tuning Applications. CUDA Course István Reguly

Profiling & Tuning Applications. CUDA Course István Reguly Profiling & Tuning Applications CUDA Course István Reguly Introduction Why is my application running slow? Work it out on paper Instrument code Profile it NVIDIA Visual Profiler Works with CUDA, needs

More information

Hands-on CUDA Optimization. CUDA Workshop

Hands-on CUDA Optimization. CUDA Workshop Hands-on CUDA Optimization CUDA Workshop Exercise Today we have a progressive exercise The exercise is broken into 5 steps If you get lost you can always catch up by grabbing the corresponding directory

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

NEW DEVELOPER TOOLS FEATURES IN CUDA 8.0. Sanjiv Satoor

NEW DEVELOPER TOOLS FEATURES IN CUDA 8.0. Sanjiv Satoor NEW DEVELOPER TOOLS FEATURES IN CUDA 8.0 Sanjiv Satoor CUDA TOOLS 2 NVIDIA NSIGHT Homogeneous application development for CPU+GPU compute platforms CUDA-Aware Editor CUDA Debugger CPU+GPU CUDA Profiler

More information

Kepler Overview Mark Ebersole

Kepler Overview Mark Ebersole Kepler Overview Mark Ebersole TFLOPS TFLOPS 3x Performance in a Single Generation 3.5 3 2.5 2 1.5 1 0.5 0 1.25 1 Single Precision FLOPS (SGEMM) 2.90 TFLOPS.89 TFLOPS.36 TFLOPS Xeon E5-2690 Tesla M2090

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION Julien Demouth, NVIDIA Cliff Woolley, NVIDIA WHAT WILL YOU LEARN? An iterative method to optimize your GPU code A way to conduct that method with NVIDIA

More information

Profiling GPU Code. Jeremy Appleyard, February 2016

Profiling GPU Code. Jeremy Appleyard, February 2016 Profiling GPU Code Jeremy Appleyard, February 2016 What is Profiling? Measuring Performance Measuring application performance Usually the aim is to reduce runtime Simple profiling: How long does an operation

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer Parallel Programming and Debugging with CUDA C Geoff Gerfin Sr. System Software Engineer CUDA - NVIDIA s Architecture for GPU Computing Broad Adoption Over 250M installed CUDA-enabled GPUs GPU Computing

More information

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture and CUDA 10 New Features Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture New SM Architecture Multi-Precision Tensor Core RT Core Turing MPS Inference Accelerated,

More information

GPU programming basics. Prof. Marco Bertini

GPU programming basics. Prof. Marco Bertini GPU programming basics Prof. Marco Bertini Data parallelism: GPU computing CUDA: libraries Why use libraries? Libraries are one of the most efficient ways to program GPUs, because they encapsulate the

More information

How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO

How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO Foreword How to write code that will survive the many-core revolution? is being setup as a collective

More information

NVIDIA TECHNOLOGY Mark Ebersole

NVIDIA TECHNOLOGY Mark Ebersole NVIDIA TECHNOLOGY Mark Ebersole ACM Learning Center http://learning.acm.org 1,350+ trusted technical books and videos by leading publishers including O Reilly, Morgan Kaufmann, others Online courses with

More information

S CUDA on Xavier

S CUDA on Xavier S8868 - CUDA on Xavier Anshuman Bhat CUDA Product Manager Saikat Dasadhikari CUDA Engineering 29 th March 2018 1 CUDA ECOSYSTEM 2018 CUDA DOWNLOADS IN 2017 3,500,000 CUDA REGISTERED DEVELOPERS 800,000

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION WHAT YOU WILL LEARN An iterative method to optimize your GPU code Some common bottlenecks to look out for Performance diagnostics with NVIDIA Nsight

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

GPU Computing with NVIDIA s new Kepler Architecture

GPU Computing with NVIDIA s new Kepler Architecture GPU Computing with NVIDIA s new Kepler Architecture Axel Koehler Sr. Solution Architect HPC HPC Advisory Council Meeting, March 13-15 2013, Lugano 1 NVIDIA: Parallel Computing Company GPUs: GeForce, Quadro,

More information

GPU Computing Master Clss. Development Tools

GPU Computing Master Clss. Development Tools GPU Computing Master Clss Development Tools Generic CUDA debugger goals Support all standard debuggers across all OS Linux GDB, TotalView and DDD Windows Visual studio Mac - XCode Support CUDA runtime

More information

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2

More information

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

CPU-GPU Heterogeneous Computing

CPU-GPU Heterogeneous Computing CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems

More information
