Accelerate your Application with Kepler. Peter Messmer


1 Accelerate your Application with Kepler Peter Messmer

2 Goals How to analyze/optimize an existing application for GPUs What to consider when designing new GPU applications How to optimize performance with Kepler/CUDA 5

3 GPU Acceleration

4 GPU acceleration approaches: Libraries, OpenACC Directives, Programming Languages. Here: focus on Programming Languages


6

7

8 APOD: A Systematic Path to Performance - Assess, Parallelize, Optimize, Deploy

9 Starting point: Matrix transpose

    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[j][i] = in[i][j];

10 Matrix transpose on CPU

    void transpose(float in[], float out[]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[i*N + j] = in[j*N + i];
    }

    float in[], out[];
    transpose(in, out);

11 An initial CUDA version

    __global__ void transpose(float in[], float out[]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[i*N + j] = in[j*N + i];
    }

    float in[], out[];
    cudaMemcpy(in, in_host, N*N*sizeof(float), cudaMemcpyHostToDevice);
    transpose<<<1,1>>>(in, out);


13 An initial CUDA version

    __global__ void transpose(float in[], float out[]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[i*N + j] = in[j*N + i];
    }

    transpose<<<1,1>>>(in, out);

+ Quickly implemented
- Performance weak: a single thread does all the work
=> Express parallelism!

14 Recap: Kernel Execution Model

Thread: sequential execution unit
- All threads execute the same sequential program
- Threads execute in parallel

Block: a group of threads
- Executes on a single Streaming Multiprocessor (SM)
- Threads within a block can cooperate: light-weight synchronization, data exchange

Grid: a group of thread blocks
- Communication between blocks is expensive
- Thread blocks of a grid execute on multiple SMs

15 First Parallelization: Inner Loop

Process source rows independently, one thread each:

    __global__ void transpose(float in[], float out[]) {
        int tid = threadIdx.x;
        for (int j = 0; j < N; j++)
            out[tid*N + j] = in[j*N + tid];
    }

    transpose<<<1, N>>>(in, out);

16 Second Parallelization: One Block per Row

    __global__ void transpose(float in[], float out[]) {
        int tid = threadIdx.x;
        int bid = blockIdx.x;
        out[tid*N + bid] = in[bid*N + tid];
    }

    transpose<<<N, N>>>(in, out);

17 NVVP: NVIDIA Visual Profiler
- Application analysis
- Kernel properties

18 Application Assessment with NVVP


21 Source-Level Hot-spot Analysis in NVVP


23 What does "Uncoalesced Store" mean?

- Global memory access happens in transactions of 32 bytes
- Coalesced access: a group of 32 threads (a "warp") accessing adjacent bytes
- Uncoalesced access: a group of 32 threads accessing scattered bytes - results in up to 32 transactions

24 Memory Access Patterns

Array access:
- ~OK: x[i] = a[i+1] - a[i]
- Bad: x[i] = a[64*i] - a[i]

SoA vs AoS:
- OK: point.x[i]
- Bad: point[i].x

Random access:
- Bad: a[rand_fun(i)]


26 How can we improve the write?

- Coalesced read, scattered write (stride N)
- Process a matrix tile, not a single row/column
- Transpose the matrix tile within the block
=> Need threads in a block to cooperate
=> Use shared memory

27 Shared Memory

- Accessible by all threads in a block (each SM has its own SMEM alongside its registers; global memory (DRAM) is shared by all SMs)
- Fast compared to global memory: low access latency, high bandwidth (almost like registers)
- Common uses: software-managed cache, data layout conversion

28 Transpose with coalesced read/write

    __global__ void transpose(float in[], float out[]) {
        __shared__ float tile[TILE][TILE];
        int glob_in  = xIndex + yIndex*N;
        int glob_out = xIndex + yIndex*N;  // as on the slide; built from the transposed block coordinates

        tile[threadIdx.y][threadIdx.x] = in[glob_in];
        __syncthreads();
        out[glob_out] = tile[threadIdx.x][threadIdx.y];
    }

    dim3 grid(N/TILE, N/TILE, 1);
    dim3 threads(TILE, TILE, 1);
    transpose<<<grid, threads>>>(in, out);


31 What happens at the barrier?

Synchronization in the kernel:

    tile[y][x] = in[in_data];
    __syncthreads();
    out[out_index] = tile[x][y];

- Keep the number of threads blocked at the barrier to a minimum


33 Thread Serialization

    tile[y][x] = in[in_data];
    __syncthreads();
    out[out_index] = tile[x][y];

- Keep the number of threads blocked at the barrier to a minimum
- Use more thread blocks, but: # blocks per SM is limited by # threads/block
- Solution: reduce the number of threads per block

34 Impact of Reduced Serialization


36 Shared Memory Organization

- Organized in 32 independent banks
- Optimal access: disjoint banks (any 1:1 mapping) or multicast
- Multiple accesses to the same bank: serialization
- Solution for transpose: padding, tile[16][16] => tile[16][17]

37 Final Solution


39 APOD Cycle Summary

- Assessment: algorithm highly parallel
- Parallelization: 1 thread per column: 12 GB/s
- Parallelization: 1 thread per element: 99 GB/s
- Optimization: memory access coalescing: 93 GB/s
- Optimization: latency hiding: 124 GB/s
- Optimization: bank conflict resolution: 170 GB/s
=> Ready for deployment
APOD applied at each P/O step

40 Additional Metrics

41 Control Flow

    if (...) {
        // then-clause
    } else {
        // else-clause
    }

42 Execution within warps is coherent (a warp is a vector of threads that issues instructions together)


44 Execution diverges within a warp. Solution: group threads with similar control flow.

45 Occupancy

- Need many independent threads per SM to hide latencies (memory access, instructions)
- Hardware resources determine the maximum number of threads/thread blocks per SM
- Consumed resources determine the actual number
- Occupancy = N_actual / N_max

46 Occupancy

Limiting resources:
- Number of threads
- Number of registers per thread
- Number of blocks
- Amount of shared memory per block

No need for 100% occupancy - depends on the kernel

47 Occupancy Calculator Analyze effect of resource consumption on occupancy

48 Alternatives to NVVP: nvprof

Command-line profiler with access to hardware counters.

    % nvprof --print-gpu-trace ./transpose

Prints one line per GPU activity: start time, duration, grid size, block size, registers, memcpy size and throughput, and the kernel or memcpy name.

    % nvprof --print-gpu-trace --aggregate-mode-off --events sm_cta_launched ./transpose

Prints per-device event values (e.g. sm_cta_launched) for each kernel.

- List of supported counters: --query-events

49 Alternatives to NVVP: nvprof

50 Alternatives to NVVP: Instrumentation

    cudaEventRecord(start, 0);
    transpose<<<grid, threads>>>(...);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);

51 Characteristics of an Ideal GPU Candidate

- Sufficient parallelism: K20X keeps tens of thousands of threads in flight; rather >1000-way than ~10-way
- Memory access patterns: ideally close to stride 1 possible
- Control-flow patterns: low divergence, at least for groups of threads


53

54 Streams

[Animation, slides 54-60: the host app enqueues Grids 1-3 on Stream 0; the Grid Management Unit dispatches them to the SMs one after another]

Kernel launches within the same stream are in-order.

61 Streams

[Animation, slides 61-64: grids from Stream 1 and Stream 2 are dispatched to the SMs concurrently]

Kernel launches:
- within the same stream are in-order
- in different streams can be concurrent
All kernel launches are asynchronous to the host.

65 Asynchronous Data Transfer / Pipelining

[Animation, slides 65-70: CPU/GPU timelines; splitting a transfer + kernel into chunks lets copies and compute overlap]

71 Asynchronous Data Transfer / Pipelining

    cudaStream_t stream1, stream2;
    cudaMemcpyAsync(dst, src, size, dir, stream1);
    kernel<<<grid, block, 0, stream2>>>(...);

72 Hyper-Q Enables Efficient Scheduling Grid management unit can select most appropriate grid from 32 streams Improves scheduling of concurrently executed grids Particularly interesting for MPI applications

73 Strong Scaling of MPI Application GPU parallelizable part CPU parallel part Serial part N=1 Multicore CPU only

74 Strong Scaling of MPI Application GPU parallelizable part CPU parallel part Serial part N=1 N=2 Multicore CPU only

75 Strong Scaling of MPI Application GPU parallelizable part CPU parallel part Serial part N=1 N=2 N=4 Multicore CPU only

76 Strong Scaling of MPI Application GPU parallelizable part CPU parallel part Serial part N=1 N=2 N=4 N=8 Multicore CPU only

77 GPU Accelerated MPI Application GPU parallelizable part CPU parallel part Serial part N=1 N=2 N=4 N=8 N=1 Multicore CPU only GPU accelerated CPU

78 GPU Accelerated Strong Scaling GPU parallelizable part CPU parallel part Serial part With Hyper-Q/Proxy Available in K20 N=1 N=2 N=4 N=8 Multicore CPU only N=1 N=2 N=4 N=8 GPU accelerated CPU


80 Example: Hyper-Q/Proxy for CP2K

81 How to use Hyper-Q

- No application modifications necessary
- Proxy process sits between user processes and the GPU
- Start the proxy daemon: nvidia-proxy-server-control -d

82 Don't Forget Large-Scale Behavior

- Profile in a realistic environment; get the profile at scale (Tau, Scalasca, VampirTrace+Vampir, CrayPAT, ...)
- Fix messaging problems first! (Example timeline with N_rank = 384: much time is waste, not compute)
- GPUs will accelerate your compute and amplify messaging problems
- Fixing them will also help CPU-only code

83 CUDA Dynamic Parallelism CPU GPU GPU as Co-Processor

84 CUDA Dynamic Parallelism CPU GPU CPU GPU GPU as Co-Processor Autonomous, Dynamic Parallelism

85 Dynamic Work Generation

- Fixed grid: statically assign a conservative worst-case grid to the initial grid
- Dynamic grid: dynamically assign performance where accuracy is required

88 CUDA Dynamic Parallelism

Kernels launch grids with syntax identical to the host; CUDA runtime functions live in the cudadevrt library.

    __global__ void childKernel() {
        printf("Hello %d", threadIdx.x);
    }

    __global__ void parentKernel() {
        childKernel<<<1,10>>>();
        cudaDeviceSynchronize();
        printf("World!\n");
    }

    int main(int argc, char *argv[]) {
        parentKernel<<<1,1>>>();
        cudaDeviceSynchronize();
        return 0;
    }

89 Characteristics of an Ideal GPU Candidate Sufficient parallelism K20X: Up to threads in flight Rather > way than ~10-way Concurrent grids Memory access patterns Ideally close to stride 1 possible Control-flow patterns Low divergence, at least for groups of threads

90

91 Before CUDA 5: Whole-Program Compilation main.cpp + a.cu b.cu c.cu program.exe #include all files together to build Earlier CUDA releases required a single source file for each kernel Linking with external code was not supported

92 CUDA 5: Separate Compilation & Linking a.cu b.cu c.cu main.cpp + program.exe a.o b.o c.o Separate compilation allows building independent object files CUDA 5 can link multiple object files into one program

93 Benefits of Separate Compilation & Linking

- Easier to reuse your existing code: no need to #include all files together any more; the extern attribute is respected
- Incremental compilation reduces build time (e.g. a 47,000-line single file: 50s down to 4s)
- Use 3rd-party GPU-callable libraries or create your own: a GPU-callable BLAS library (libcublas_device.a) is included in CUDA Toolkit 5.0 and uses Dynamic Parallelism

94 CUDA 5: GPU Callable Libraries main.cpp + a.cu b.cu foo.cu + ab.a a.o + b.o program.exe Can combine object files into static libraries Link and externally call device code

95 CUDA 5: GPU Callable Libraries main.cpp main2.cpp + + foo.cu bar.cu a.cu b.cu + + ab.a ab.a a.o + b.o program.exe program2.exe Combine object files into static libraries Facilitates code reuse, reduces compile time

96 CUDA 5: Callbacks main.cpp + foo.cu Enables closed-source device libraries to call user-defined device callback functions + vendor.a + callback.cu program.exe


98 Device Linker Invocation

Introduction of an optional link step for device code:

    nvcc -arch=sm_20 -dc a.cu b.cu
    nvcc -arch=sm_20 -dlink a.o b.o -o link.o
    g++ a.o b.o link.o -L<path> -lcudart

Link the device-runtime library for dynamic parallelism:

    nvcc -arch=sm_35 -dc a.cu b.cu
    nvcc -arch=sm_35 -dlink a.o b.o -lcudadevrt -o link.o
    g++ a.o b.o link.o -L<path> -lcudadevrt -lcudart

Currently, the link occurs at the cubin level (PTX not yet supported).

99 GPUDirect enables GPU-aware MPI

- GPU-to-GPU transfer across the NIC, without CPU participation
- Unified Virtual Addressing lets the MPI library detect whether a buffer pointer refers to host or device memory

100 GPUDirect enables GPU-aware MPI

    cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
    MPI_Recv(r_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);

Simplifies to:

    MPI_Send(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
    MPI_Recv(r_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

(for CPU and GPU buffers alike)

101 GPU Management: nvidia-smi

Multi-GPU systems are widely available, and different systems are set up differently. nvidia-smi gives quick information on:
- Approximate GPU utilization
- Approximate memory footprint
- Number of GPUs
- ECC state
- Driver version

Sample report (two idle Tesla K20X GPUs): per GPU it lists name, bus ID, display mode, ECC state, fan, temperature, performance state (P8), power usage/cap (28W / 235W), memory usage (12MB / 6143MB), GPU utilization (0%), compute mode, and any running compute processes.

nvidia-smi can inspect and modify GPU state.

102 Where to find additional information

- CUDA documentation: Best Practice Guide, Kepler Tuning Guide
- Kepler whitepaper

103 Where to find additional information: GTC

- Kepler architecture: GTC12 session S0642, "Inside Kepler"
- Assessing performance limiters: GTC10 session 2012, "Analysis-driven Optimization" (slides 5-19)
- Profiling tools: GTC12 sessions S0419, "Optimizing Application Performance with CUDA Performance Tools", and S0420, "Nsight IDE for Linux and Mac", ...
- CUPTI documentation (describes all the profiler counters), included in every CUDA toolkit (/cuda/extras/cupti/doc/cupti_users_guide.pdf)
- GPU computing webinars in general

104 Kepler and CUDA 5: Powerful yet Easy

- Kepler and CUDA 5 simplify GPU acceleration
- Bypass optimization trial/error with APOD
- Profile and analyze efficiently with NVVP
- Improve MPI scalability with Hyper-Q/Proxy
- Parallelize with CUDA Dynamic Parallelism

105 Thank you!

106 Backup Slides


More information

Heidi Poxon Cray Inc.

Heidi Poxon Cray Inc. Heidi Poxon Topics GPU support in the Cray performance tools CUDA proxy MPI support for GPUs (GPU-to-GPU) 2 3 Programming Models Supported for the GPU Goal is to provide whole program analysis for programs

More information

NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010

NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Analysis-Driven Optimization Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Performance Optimization Process Use appropriate performance metric for each kernel For example,

More information

CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)

CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5) CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration

More information

Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Summary

Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Summary Optimizing CUDA Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Summary NVIDIA Corporation 2009 2 Optimize Algorithms for the GPU Maximize

More information

Introduc)on to GPU Programming

Introduc)on to GPU Programming Introduc)on to GPU Programming Mubashir Adnan Qureshi h3p://www.ncsa.illinois.edu/people/kindr/projects/hpca/files/singapore_p1.pdf h3p://developer.download.nvidia.com/cuda/training/nvidia_gpu_compu)ng_webinars_cuda_memory_op)miza)on.pdf

More information

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17 ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux

More information

CUDA Workshop. High Performance GPU computing EXEBIT Karthikeyan

CUDA Workshop. High Performance GPU computing EXEBIT Karthikeyan CUDA Workshop High Performance GPU computing EXEBIT- 2014 Karthikeyan CPU vs GPU CPU Very fast, serial, Low Latency GPU Slow, massively parallel, High Throughput Play Demonstration Compute Unified Device

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Performance Optimization Process

Performance Optimization Process Analysis-Driven Optimization (GTC 2010) Paulius Micikevicius NVIDIA Performance Optimization Process Use appropriate performance metric for each kernel For example, Gflops/s don t make sense for a bandwidth-bound

More information

An Introduction to GPU Architecture and CUDA C/C++ Programming. Bin Chen April 4, 2018 Research Computing Center

An Introduction to GPU Architecture and CUDA C/C++ Programming. Bin Chen April 4, 2018 Research Computing Center An Introduction to GPU Architecture and CUDA C/C++ Programming Bin Chen April 4, 2018 Research Computing Center Outline Introduction to GPU architecture Introduction to CUDA programming model Using the

More information

Maximizing Face Detection Performance

Maximizing Face Detection Performance Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount

More information

Profiling of Data-Parallel Processors

Profiling of Data-Parallel Processors Profiling of Data-Parallel Processors Daniel Kruck 09/02/2014 09/02/2014 Profiling Daniel Kruck 1 / 41 Outline 1 Motivation 2 Background - GPUs 3 Profiler NVIDIA Tools Lynx 4 Optimizations 5 Conclusion

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

CS 314 Principles of Programming Languages

CS 314 Principles of Programming Languages CS 314 Principles of Programming Languages Zheng Zhang Fall 2016 Dec 14 GPU Programming Rutgers University Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing computations

More information

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D

More information

Preparing seismic codes for GPUs and other

Preparing seismic codes for GPUs and other Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of

More information

More CUDA. Advanced programming

More CUDA. Advanced programming More CUDA 1 Advanced programming Synchronization and atomics Warps and Occupancy Memory coalescing Shared Memory Streams and Asynchronous Execution Bank conflicts Other features 2 Synchronization and Atomics

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Tuning Performance on Kepler GPUs

Tuning Performance on Kepler GPUs Tuning Performance on Kepler GPUs An Introduction to Kepler Assembler and Its Usage in CNN optimization Cheng Wang, Zhe Jia, Kai Chen Alibaba Convolution Layer Performance on K40 Height = 16; Width = 16;

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

GPU MEMORY BOOTCAMP III

GPU MEMORY BOOTCAMP III April 4-7, 2016 Silicon Valley GPU MEMORY BOOTCAMP III COLLABORATIVE ACCESS PATTERNS Tony Scudiero NVIDIA Devtech Fanatical Bandwidth Evangelist The Axioms of Modern Performance #1. Parallelism is mandatory

More information

Advanced CUDA Optimizations

Advanced CUDA Optimizations Advanced CUDA Optimizations General Audience Assumptions General working knowledge of CUDA Want kernels to perform better Profiling Before optimizing, make sure you are spending effort in correct location

More information

Introduction to CUDA Programming

Introduction to CUDA Programming Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

Introduction to GPU Computing. Design and Analysis of Parallel Algorithms

Introduction to GPU Computing. Design and Analysis of Parallel Algorithms Introduction to GPU Computing Design and Analysis of Parallel Algorithms Sources CUDA Programming Guide (3.2) CUDA Best Practices Guide (3.2) CUDA Toolkit Reference Manual (3.2) CUDA SDK Examples Part

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Memory spaces and memory access Shared memory Examples Lecture questions: 1. Suggest two significant

More information

Data Parallel Execution Model

Data Parallel Execution Model CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013 Objective To understand the organization and scheduling

More information

CUDA Architecture & Programming Model

CUDA Architecture & Programming Model CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New

More information

A Tutorial on CUDA Performance Optimizations

A Tutorial on CUDA Performance Optimizations A Tutorial on CUDA Performance Optimizations Amit Kalele Prasad Pawar Parallelization & Optimization CoE TCS Pune 1 Outline Overview of GPU architecture Optimization Part I Block and Grid size Shared memory

More information

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes

More information

Programming in CUDA. Malik M Khan

Programming in CUDA. Malik M Khan Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement

More information

GPU programming. Dr. Bernhard Kainz

GPU programming. Dr. Bernhard Kainz GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling

More information

Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015

Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit

More information

CSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University CSE 591: GPU Programming Using CUDA in Practice Klaus Mueller Computer Science Department Stony Brook University Code examples from Shane Cook CUDA Programming Related to: score boarding load and store

More information

Dense Linear Algebra. HPC - Algorithms and Applications

Dense Linear Algebra. HPC - Algorithms and Applications Dense Linear Algebra HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 6 th 2017 Last Tutorial CUDA Architecture thread hierarchy:

More information

Lecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I)

Lecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I) Lecture 9 CUDA CUDA (I) Compute Unified Device Architecture 1 2 Outline CUDA Architecture CUDA Architecture CUDA programming model CUDA-C 3 4 CUDA : a General-Purpose Parallel Computing Architecture CUDA

More information

CUDA Parallelism Model

CUDA Parallelism Model GPU Teaching Kit Accelerated Computing CUDA Parallelism Model Kernel-Based SPMD Parallel Programming Multidimensional Kernel Configuration Color-to-Grayscale Image Processing Example Image Blur Example

More information

GPU programming: Code optimization part 1. Sylvain Collange Inria Rennes Bretagne Atlantique

GPU programming: Code optimization part 1. Sylvain Collange Inria Rennes Bretagne Atlantique GPU programming: Code optimization part 1 Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr Outline Host-side task and memory management Recap: the GPU memory hierarchy Asynchronism

More information

CUDA Performance Optimization

CUDA Performance Optimization Mitglied der Helmholtz-Gemeinschaft CUDA Performance Optimization GPU Programming with CUDA April 25-27, 2016 Jiri Kraus (NVIDIA) based on work by Andrew V. Adinetz What you will learn: What is memory

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions

2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions Administrative L6: Memory Hierarchy Optimization IV, Bandwidth Optimization Next assignment available Goals of assignment: simple memory hierarchy management block-thread decomposition tradeoff Due Tuesday,

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

Tutorial: Parallel programming technologies on hybrid architectures HybriLIT Team

Tutorial: Parallel programming technologies on hybrid architectures HybriLIT Team Tutorial: Parallel programming technologies on hybrid architectures HybriLIT Team Laboratory of Information Technologies Joint Institute for Nuclear Research The Helmholtz International Summer School Lattice

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Unrolling parallel loops

Unrolling parallel loops Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:

More information

Advanced Research Computing. ARC3 and GPUs. Mark Dixon

Advanced Research Computing. ARC3 and GPUs. Mark Dixon Advanced Research Computing Mark Dixon m.c.dixon@leeds.ac.uk ARC3 (1st March 217) Included 2 GPU nodes, each with: 24 Intel CPU cores & 128G RAM (same as standard compute node) 2 NVIDIA Tesla K8 24G RAM

More information

Profiling & Tuning Applica1ons. CUDA Course July István Reguly

Profiling & Tuning Applica1ons. CUDA Course July István Reguly Profiling & Tuning Applica1ons CUDA Course July 21-25 István Reguly Introduc1on Why is my applica1on running slow? Work it out on paper Instrument code Profile it NVIDIA Visual Profiler Works with CUDA,

More information

CUDA Parallel Programming Model Michael Garland

CUDA Parallel Programming Model Michael Garland CUDA Parallel Programming Model Michael Garland NVIDIA Research Some Design Goals Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms not mechanics of a parallel

More information

CUDA Exercises. CUDA Programming Model Lukas Cavigelli ETZ E 9 / ETZ D Integrated Systems Laboratory

CUDA Exercises. CUDA Programming Model Lukas Cavigelli ETZ E 9 / ETZ D Integrated Systems Laboratory CUDA Exercises CUDA Programming Model 05.05.2015 Lukas Cavigelli ETZ E 9 / ETZ D 61.1 Integrated Systems Laboratory Exercises 1. Enumerate GPUs 2. Hello World CUDA kernel 3. Vectors Addition Threads and

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

CS 179 Lecture 4. GPU Compute Architecture

CS 179 Lecture 4. GPU Compute Architecture CS 179 Lecture 4 GPU Compute Architecture 1 This is my first lecture ever Tell me if I m not speaking loud enough, going too fast/slow, etc. Also feel free to give me lecture feedback over email or at

More information

State-of-the-art in Heterogeneous Computing

State-of-the-art in Heterogeneous Computing State-of-the-art in Heterogeneous Computing Guest Lecture NTNU Trond Hagen, Research Manager SINTEF, Department of Applied Mathematics 1 Overview Introduction GPU Programming Strategies Trends: Heterogeneous

More information

Introduction to Parallel Computing with CUDA. Oswald Haan

Introduction to Parallel Computing with CUDA. Oswald Haan Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries

More information

Performance Optimization: Programming Guidelines and GPU Architecture Reasons Behind Them

Performance Optimization: Programming Guidelines and GPU Architecture Reasons Behind Them Performance Optimization: Programming Guidelines and GPU Architecture Reasons Behind Them Paulius Micikevicius Developer Technology, NVIDIA Goals of this Talk Two-fold: Describe how hardware operates Show

More information