Accelerate your Application with Kepler. Peter Messmer
2 Goals: how to analyze/optimize an existing application for GPUs; what to consider when designing new GPU applications; how to optimize performance with Kepler/CUDA 5
3 GPU Acceleration
4 GPU acceleration approaches for applications: Libraries, OpenACC Directives, Programming Languages
5 Here: focus on Programming Languages
8 APOD, a systematic path to performance: Assess, Parallelize, Optimize, Deploy
9 Starting point: matrix transpose

for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        out[j][i] = in[i][j];
10 Matrix transpose on CPU

void transpose(float in[], float out[], int n)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            out[i*n + j] = in[j*n + i];
}

float in[N*N], out[N*N];
transpose(in, out, N);
11 An initial CUDA version

__global__ void transpose(float in[], float out[], int n)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            out[i*n + j] = in[j*n + i];
}

cudaMemcpy(in, in_host, N*N*sizeof(float), cudaMemcpyHostToDevice);
transpose<<<1,1>>>(in, out, N);
12 + Quickly implemented, - performance weak
13 => Express parallelism!
14 Recap: Kernel Execution Model
Thread: sequential execution unit. All threads execute the same sequential program; threads execute in parallel.
Block: a group of threads. Executes on a single Streaming Multiprocessor (SM). Threads within a block can cooperate: light-weight synchronization, data exchange.
Grid: a group of thread blocks. Thread blocks of a grid execute on multiple SMs; communication between blocks is expensive.
15 First parallelization: inner loop. Process source rows independently.

__global__ void transpose(float in[], float out[], int n)
{
    int tid = threadIdx.x;
    for (int j = 0; j < n; j++)
        out[tid*n + j] = in[j*n + tid];
}

transpose<<<1, N>>>(in, out, N);
16 Second parallelization: one block per row.

__global__ void transpose(float in[], float out[], int n)
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    out[tid*n + bid] = in[bid*n + tid];
}

transpose<<<N, N>>>(in, out, N);
17 NVVP: NVIDIA Visual Profiler. Application analysis, kernel properties.
18 Application Assessment with NVVP
21 Source-Level Hot-spot Analysis in NVVP
23 What does Uncoalesced Store mean? Global memory access happens in transactions of 32 bytes. Coalesced access: a group of 32 threads (a warp) accesses adjacent bytes. Uncoalesced access: a group of 32 threads accesses scattered bytes, resulting in up to 32 transactions.
24 Memory Access Patterns
Array access: ~OK: x[i] = a[i+1] - a[i]; Bad: x[i] = a[64*i] - a[i]
SoA vs AoS: OK: point.x[i]; Bad: point[i].x
Random access: Bad: a[rand_fun(i)]
25 How can we improve the write? Coalesced read, scattered write (stride N). Process a matrix tile, not a single row/column; transpose the matrix tile within the block.
26 => Need threads in a block to cooperate => use shared memory
27 Shared memory: accessible by all threads in a block. Fast compared to global memory: low access latency, high bandwidth, almost like registers. Common uses: software-managed cache, data layout conversion.
28 Transpose with coalesced read/write

__global__ void transpose(float in[], float out[], int n)
{
    __shared__ float tile[TILE][TILE];
    int x_in  = blockIdx.x*TILE + threadIdx.x;
    int y_in  = blockIdx.y*TILE + threadIdx.y;
    int x_out = blockIdx.y*TILE + threadIdx.x;
    int y_out = blockIdx.x*TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y_in*n + x_in];
    __syncthreads();
    out[y_out*n + x_out] = tile[threadIdx.x][threadIdx.y];
}

dim3 grid(N/TILE, N/TILE, 1);
dim3 threads(TILE, TILE, 1);
transpose<<<grid, threads>>>(in, out, N);
31 What happens at the barrier? Synchronization in the kernel:
tile[y][x] = in[in_data]
__syncthreads()
out[out_index] = tile[x][y]
Keep the number of threads blocked at the barrier to a minimum.
32 Thread serialization: keep the number of threads blocked at the barrier to a minimum. Use more thread blocks, but the number of blocks per SM is limited by the number of threads per block.
33 Solution: reduce the number of threads per block.
34 Impact of Reduced Serialization
36 Shared Memory Organization
Organized in 32 independent banks. Optimal access: disjoint banks (any 1:1 mapping) or multicast of the same word. Multiple accesses to the same bank: serialization. Solution for the transpose: padding, tile[16][16] => tile[16][17].
37 Final Solution
39 APOD Cycle Summary
Assessment: algorithm highly parallel
Parallelization: 1 thread per column: 12 GB/s
Parallelization: 1 thread per element: 99 GB/s
Optimization: memory access coalescing: 93 GB/s
Optimization: latency hiding: 124 GB/s
Optimization: bank conflict resolution: 170 GB/s
=> Ready for deployment. APOD applied at each Parallelize/Optimize step.
40 Additional Metrics
41 Control flow: if (...) { /* then-clause */ } else { /* else-clause */ }
42 (plot: instructions vs. time) Execution within warps is coherent. A warp is a vector of 32 threads.
43 Execution diverges within a warp: both paths are executed serially.
44 Solution: group threads with similar control flow.
45 Occupancy: need independent threads per SM to hide latencies (memory access, instruction). Hardware resources determine the maximum number of threads/thread blocks per SM; consumed resources determine the actual number. Occupancy = N_actual / N_max.
46 Limiting resources: number of threads, number of registers per thread, number of blocks, amount of shared memory per block. No need for 100% occupancy; it depends on the kernel.
47 Occupancy Calculator Analyze effect of resource consumption on occupancy
48 Alternatives to NVVP: nvprof

% nvprof --print-gpu-trace ./transpose
(per-kernel trace: start time, duration, grid and block size, registers, memcpy size and throughput)

% nvprof --print-gpu-trace --aggregate-mode-off --events sm_cta_launched ./transpose
(per-kernel hardware event values)

Command-line profiler: access to hardware counters. List of supported counters: --query-events
50 Alternatives to NVVP: instrumentation with CUDA events

cudaEventRecord(start, 0);
transpose<<<grid, threads>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
51 Characteristics of an Ideal GPU Candidate
Sufficient parallelism: K20X keeps a very large number of threads in flight; aim for far more than ~10-way parallelism
Memory access patterns: ideally close to stride-1 where possible
Control-flow patterns: low divergence, at least for groups of threads
54 Streams
The host app submits Grid 1, Grid 2, Grid 3 to Stream 0; the Grid Management Unit dispatches them to the SMs one after another.
Kernel launches within the same stream are in-order.
61 Streams
With two streams (Stream 1: Grids 1, 2, 3; Stream 2: Grids 5, 6, 7), the Grid Management Unit can dispatch grids from different streams to the SMs concurrently.
Kernel launches within the same stream are in-order; launches in different streams can be concurrent. All kernel launches are asynchronous to the host.
65 Asynchronous Data Transfer / Pipelining
(timeline figures: with synchronous transfers, CPU-GPU copies and kernels run back-to-back; overlapping asynchronous copies with kernel execution pipelines the work and shortens the total time)
71 Asynchronous Data Transfer / Pipelining

cudaStream_t stream1, stream2;
cudaMemcpyAsync(dst, src, size, dir, stream1);
kernel<<<grid, block, 0, stream2>>>(...);
72 Hyper-Q Enables Efficient Scheduling Grid management unit can select most appropriate grid from 32 streams Improves scheduling of concurrently executed grids Particularly interesting for MPI applications
73 Strong Scaling of MPI Application
(figures: GPU parallelizable part, CPU parallel part, serial part for N = 1, 2, 4, 8 ranks, multicore CPU only)
77 GPU Accelerated MPI Application
(figure: GPU parallelizable part, CPU parallel part, serial part; one GPU-accelerated rank compared against multicore CPU only at N = 1, 2, 4, 8)
78 GPU Accelerated Strong Scaling GPU parallelizable part CPU parallel part Serial part With Hyper-Q/Proxy Available in K20 N=1 N=2 N=4 N=8 Multicore CPU only N=1 N=2 N=4 N=8 GPU accelerated CPU
80 Example: Hyper-Q/Proxy for CP2K
81 How to use Hyper-Q: no application modifications necessary. A proxy process sits between the user processes and the GPU: nvidia-proxy-server-control -d
82 Don't Forget Large-Scale Behavior
Profile in a realistic environment; get the profile at scale (Tau, Scalasca, VampirTrace+Vampir, CrayPat, ...). Fix messaging problems first (profile shown: compute vs. waste at Nrank = 384). GPUs will accelerate your compute and thereby amplify messaging problems; fixing them will also help CPU-only code.
83 CUDA Dynamic Parallelism CPU GPU GPU as Co-Processor
84 CUDA Dynamic Parallelism CPU GPU CPU GPU GPU as Co-Processor Autonomous, Dynamic Parallelism
85 Dynamic Work Generation
Fixed grid: statically assign a conservative worst-case grid to the initial grid.
Dynamic grid: dynamically assign performance where accuracy is required.
88 CUDA Dynamic Parallelism: kernels launch grids. Identical syntax as on the host; CUDA runtime functions live in the cudadevrt library.

__global__ void childKernel() {
    printf("Hello %d", threadIdx.x);
}

__global__ void parentKernel() {
    childKernel<<<1,10>>>();
    cudaDeviceSynchronize();
    printf("World!\n");
}

int main(int argc, char *argv[]) {
    parentKernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
89 Characteristics of an Ideal GPU Candidate
Sufficient parallelism: K20X keeps a very large number of threads in flight, concurrent grids; aim for far more than ~10-way parallelism
Memory access patterns: ideally close to stride-1 where possible
Control-flow patterns: low divergence, at least for groups of threads
91 Before CUDA 5: Whole-Program Compilation main.cpp + a.cu b.cu c.cu program.exe #include all files together to build Earlier CUDA releases required a single source file for each kernel Linking with external code was not supported
92 CUDA 5: Separate Compilation & Linking a.cu b.cu c.cu main.cpp + program.exe a.o b.o c.o Separate compilation allows building independent object files CUDA 5 can link multiple object files into one program
93 Benefits of Separate Compilation & Linking
Easier to reuse your existing code: no need to #include all files together any more; the extern attribute is respected.
Incremental compilation reduces build time, e.g. a 47,000-line single file: 50 s down to 4 s.
Use 3rd-party GPU-callable libraries or create your own: a GPU-callable BLAS library (libcublas_device.a) is included in CUDA Toolkit 5.0 and uses Dynamic Parallelism.
94 CUDA 5: GPU Callable Libraries main.cpp + a.cu b.cu foo.cu + ab.a a.o + b.o program.exe Can combine object files into static libraries Link and externally call device code
95 CUDA 5: GPU Callable Libraries main.cpp main2.cpp + + foo.cu bar.cu a.cu b.cu + + ab.a ab.a a.o + b.o program.exe program2.exe Combine object files into static libraries Facilitates code reuse, reduces compile time
96 CUDA 5: Callbacks main.cpp + foo.cu Enables closed-source device libraries to call user-defined device callback functions + vendor.a + callback.cu program.exe
97 Device Linker Invocation
Introduction of an optional link step for device code:
nvcc -arch=sm_20 -dc a.cu b.cu
nvcc -arch=sm_20 -dlink a.o b.o -o link.o
g++ a.o b.o link.o -L<path> -lcudart
98 Link the device-runtime library for dynamic parallelism:
nvcc -arch=sm_35 -dc a.cu b.cu
nvcc -arch=sm_35 -dlink a.o b.o -lcudadevrt -o link.o
g++ a.o b.o link.o -L<path> -lcudadevrt -lcudart
Currently, the link occurs at the cubin level (PTX not yet supported).
99 GPUDirect enables GPU-aware MPI: GPU-to-GPU transfer across the NIC, without CPU participation. Unified Virtual Addressing allows the MPI library to detect the location of the buffer pointed to.
100 GPUDirect enables GPU-aware MPI

cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
MPI_Recv(r_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, &status);
cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);

simplifies to

MPI_Send(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
MPI_Recv(r_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, &status);

(for CPU and GPU buffers)
101 GPU Management: nvidia-smi
Multi-GPU systems are widely available, and different systems are set up differently. nvidia-smi gives quick information on approximate GPU utilization, approximate memory footprint, number of GPUs, ECC state, and driver version.
(sample output: two Tesla K20X GPUs, ECC off, 0% utilization, 12 MB / 6143 MB memory used, no running compute processes)
Inspect and modify GPU state.
102 Where to find additional information CUDA documentation [1] - Best Practice Guide [2] - Kepler Tuning Guide [3] Kepler whitepaper [4] [1] [2] [3] [4]
103 Where to find additional information: GTC
Kepler architecture: GTC12 session S0642: Inside Kepler
Assessing performance limiters: GTC10 session 2012: Analysis-Driven Optimization (slides 5-19)
Profiling tools: GTC12 sessions S0419: Optimizing Application Performance with CUDA Performance Tools, S0420: Nsight IDE for Linux and Mac, ...
CUPTI documentation (describes all the profiler counters), included in every CUDA toolkit (/cuda/extras/cupti/doc/cupti_users_guide.pdf)
GPU computing webinars in general:
104 Kepler and CUDA 5: Powerful yet Easy
Kepler and CUDA 5 simplify GPU acceleration. Bypass optimization trial and error with APOD. Profile and analyze efficiently with NVVP. Improve MPI scalability with Hyper-Q/Proxy. Parallelize with CUDA Dynamic Parallelism.
105 Thank you!
106 Backup Slides
Profiling of Data-Parallel Processors Daniel Kruck 09/02/2014 09/02/2014 Profiling Daniel Kruck 1 / 41 Outline 1 Motivation 2 Background - GPUs 3 Profiler NVIDIA Tools Lynx 4 Optimizations 5 Conclusion
More informationCUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.
Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication
More informationCS 314 Principles of Programming Languages
CS 314 Principles of Programming Languages Zheng Zhang Fall 2016 Dec 14 GPU Programming Rutgers University Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing computations
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More informationPreparing seismic codes for GPUs and other
Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of
More informationMore CUDA. Advanced programming
More CUDA 1 Advanced programming Synchronization and atomics Warps and Occupancy Memory coalescing Shared Memory Streams and Asynchronous Execution Bank conflicts Other features 2 Synchronization and Atomics
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationTuning Performance on Kepler GPUs
Tuning Performance on Kepler GPUs An Introduction to Kepler Assembler and Its Usage in CNN optimization Cheng Wang, Zhe Jia, Kai Chen Alibaba Convolution Layer Performance on K40 Height = 16; Width = 16;
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationGPU MEMORY BOOTCAMP III
April 4-7, 2016 Silicon Valley GPU MEMORY BOOTCAMP III COLLABORATIVE ACCESS PATTERNS Tony Scudiero NVIDIA Devtech Fanatical Bandwidth Evangelist The Axioms of Modern Performance #1. Parallelism is mandatory
More informationAdvanced CUDA Optimizations
Advanced CUDA Optimizations General Audience Assumptions General working knowledge of CUDA Want kernels to perform better Profiling Before optimizing, make sure you are spending effort in correct location
More informationIntroduction to CUDA Programming
Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview
More informationIntroduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationIntroduction to GPU Computing. Design and Analysis of Parallel Algorithms
Introduction to GPU Computing Design and Analysis of Parallel Algorithms Sources CUDA Programming Guide (3.2) CUDA Best Practices Guide (3.2) CUDA Toolkit Reference Manual (3.2) CUDA SDK Examples Part
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationInformation Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY
Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Memory spaces and memory access Shared memory Examples Lecture questions: 1. Suggest two significant
More informationData Parallel Execution Model
CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013 Objective To understand the organization and scheduling
More informationCUDA Architecture & Programming Model
CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New
More informationA Tutorial on CUDA Performance Optimizations
A Tutorial on CUDA Performance Optimizations Amit Kalele Prasad Pawar Parallelization & Optimization CoE TCS Pune 1 Outline Overview of GPU architecture Optimization Part I Block and Grid size Shared memory
More informationCUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA
CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes
More informationProgramming in CUDA. Malik M Khan
Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement
More informationGPU programming. Dr. Bernhard Kainz
GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling
More informationProfiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015
Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit
More informationCSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591: GPU Programming Using CUDA in Practice Klaus Mueller Computer Science Department Stony Brook University Code examples from Shane Cook CUDA Programming Related to: score boarding load and store
More informationDense Linear Algebra. HPC - Algorithms and Applications
Dense Linear Algebra HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 6 th 2017 Last Tutorial CUDA Architecture thread hierarchy:
More informationLecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I)
Lecture 9 CUDA CUDA (I) Compute Unified Device Architecture 1 2 Outline CUDA Architecture CUDA Architecture CUDA programming model CUDA-C 3 4 CUDA : a General-Purpose Parallel Computing Architecture CUDA
More informationCUDA Parallelism Model
GPU Teaching Kit Accelerated Computing CUDA Parallelism Model Kernel-Based SPMD Parallel Programming Multidimensional Kernel Configuration Color-to-Grayscale Image Processing Example Image Blur Example
More informationGPU programming: Code optimization part 1. Sylvain Collange Inria Rennes Bretagne Atlantique
GPU programming: Code optimization part 1 Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr Outline Host-side task and memory management Recap: the GPU memory hierarchy Asynchronism
More informationCUDA Performance Optimization
Mitglied der Helmholtz-Gemeinschaft CUDA Performance Optimization GPU Programming with CUDA April 25-27, 2016 Jiri Kraus (NVIDIA) based on work by Andrew V. Adinetz What you will learn: What is memory
More informationOverview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
More information2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions
Administrative L6: Memory Hierarchy Optimization IV, Bandwidth Optimization Next assignment available Goals of assignment: simple memory hierarchy management block-thread decomposition tradeoff Due Tuesday,
More informationCS377P Programming for Performance GPU Programming - II
CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline
More informationTutorial: Parallel programming technologies on hybrid architectures HybriLIT Team
Tutorial: Parallel programming technologies on hybrid architectures HybriLIT Team Laboratory of Information Technologies Joint Institute for Nuclear Research The Helmholtz International Summer School Lattice
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationUnrolling parallel loops
Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:
More informationAdvanced Research Computing. ARC3 and GPUs. Mark Dixon
Advanced Research Computing Mark Dixon m.c.dixon@leeds.ac.uk ARC3 (1st March 217) Included 2 GPU nodes, each with: 24 Intel CPU cores & 128G RAM (same as standard compute node) 2 NVIDIA Tesla K8 24G RAM
More informationProfiling & Tuning Applica1ons. CUDA Course July István Reguly
Profiling & Tuning Applica1ons CUDA Course July 21-25 István Reguly Introduc1on Why is my applica1on running slow? Work it out on paper Instrument code Profile it NVIDIA Visual Profiler Works with CUDA,
More informationCUDA Parallel Programming Model Michael Garland
CUDA Parallel Programming Model Michael Garland NVIDIA Research Some Design Goals Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms not mechanics of a parallel
More informationCUDA Exercises. CUDA Programming Model Lukas Cavigelli ETZ E 9 / ETZ D Integrated Systems Laboratory
CUDA Exercises CUDA Programming Model 05.05.2015 Lukas Cavigelli ETZ E 9 / ETZ D 61.1 Integrated Systems Laboratory Exercises 1. Enumerate GPUs 2. Hello World CUDA kernel 3. Vectors Addition Threads and
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationCS 179 Lecture 4. GPU Compute Architecture
CS 179 Lecture 4 GPU Compute Architecture 1 This is my first lecture ever Tell me if I m not speaking loud enough, going too fast/slow, etc. Also feel free to give me lecture feedback over email or at
More informationState-of-the-art in Heterogeneous Computing
State-of-the-art in Heterogeneous Computing Guest Lecture NTNU Trond Hagen, Research Manager SINTEF, Department of Applied Mathematics 1 Overview Introduction GPU Programming Strategies Trends: Heterogeneous
More informationIntroduction to Parallel Computing with CUDA. Oswald Haan
Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries
More informationPerformance Optimization: Programming Guidelines and GPU Architecture Reasons Behind Them
Performance Optimization: Programming Guidelines and GPU Architecture Reasons Behind Them Paulius Micikevicius Developer Technology, NVIDIA Goals of this Talk Two-fold: Describe how hardware operates Show
More information