Memory Bound Wave Propagation at Hardware Limit. Igor Podladtchikov, Spectraseis Inc
Transcription
1 Memory Bound Wave Propagation at Hardware Limit. Igor Podladtchikov, Spectraseis Inc. March 19, 2013
2 Microseismic Monitoring. Geophysical method to locate subsurface events: propagate and image time-reversed data acquired at the surface (Time Reversed Imaging, TRI). Use the full wave equation: acoustic or elastic, heterogeneous materials. Need very fast solvers: thousands of events, big models.
3 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.
4 Performance Limiters. Processors (computation: 1000 Gflops/s) and memory (transfer: 100 GB/s): the two principal performance limiters.
5 Acoustic Equations: 1 variable read and written, 2 variables read only.
6 Elastic Equations: 9 variables read and written, 3 variables read only.
7 Arithmetic Intensity: flops / bytes ratio. Compute or memory bound? Count BYTES, not numbers: single precision is 4 bytes per value, e.g. a 1st derivative takes 2 reads and 1 write, so 12 bytes are transferred.
8 Machine Balance. M2070: peak flops / bytes = 1030 / 117 ~ 9. K10: peak flops / bytes = 4577 / 228 ~ 20. Arithmetic intensity << machine balance: memory bound. Arithmetic intensity >> machine balance: compute bound.
9 Arithmetic Intensity. [Table: flops, bytes and flops / bytes ratio for the acoustic and elastic solvers; the values are not preserved.] Machine balance: Fermi ~ 9, Kepler ~ 20. Ratio << 9: we're memory bound.
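A minimal host-side sketch of the comparison these two slides make, using the peak figures quoted for the M2070 and K10. The function names are hypothetical, and a plain "less than" stands in for the slide's "much less than".

```cpp
#include <string>

// Machine balance: peak flops available per byte of memory bandwidth.
double machine_balance(double peak_gflops, double peak_gbytes_per_s) {
    return peak_gflops / peak_gbytes_per_s;
}

// Compare a kernel's arithmetic intensity (flops per byte moved) with the
// machine balance. Hypothetical helper: the slide uses "<<" and ">>",
// this sketch simplifies that to a single threshold.
std::string classify(double flops_per_point, double bytes_per_point, double balance) {
    const double intensity = flops_per_point / bytes_per_point;
    return intensity < balance ? "memory bound" : "compute bound";
}
```

For instance, `classify(1.0, 12.0, machine_balance(1030.0, 117.0))` treats a first derivative as about one flop per 12 bytes (the flop count is an assumption) and reports it as memory bound.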
10 Memory Bound: What To Do? Option A: give up (don't even try); blame memory bound for slow code. Option B: celebrate, FLOPS are for FREE; optimize memory access efficiency; count bytes, not flops; try to approach memcpy throughput. Our claim: real-world applications can run close to memcpy!
11 How to optimize memory access? Aim for minimum reads / writes: touch everything once (un-improvable) and don't read neighbors twice. "Read me once, don't read me again!" Try to avoid redundant reads / writes.
12 Don't count neighbor reads! Track optimization progress: don't count neighbors in your performance metric! "Remember traffic is the volume of data to a particular memory. It is not the number of loads and stores" (Performance Tuning of Scientific Applications). Don't cheat (yourself).
13 Ideal Memory Throughput. MTP = N_IO * Grid Size * Word Size / Time Elapsed, with N_IO = 2 * DOF + Constants. DOF: degrees of freedom (read and write). Constants: read only. Grid Size: nx * ny * nz * nt. Word Size: 4 bytes (single precision). No neighbors!
14 Ideal Memory Throughput. MTP Acoustic = 4 * Grid Size * 4 bytes / (2^30 * Time Elapsed) [GB/s]. MTP Elastic = 21 * Grid Size * 4 bytes / (2^30 * Time Elapsed) [GB/s]. N_IO Acoustic: 2 * 1 + 2 = 4. N_IO Elastic: 2 * 9 + 3 = 21.
15 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.
16 Results. All solvers include: free surface; absorbing layer; domain decomposition along all 3 axes (IPC if GPUs are mappable, MPI otherwise). All solvers verified against single CPU code. All data from the NVIDIA PSG Cluster (Thank You!). Without further ado...
17 Memory Throughput on M2070. [Plot: MTP (GB/s) vs. cube size; series: Memcpy, Pressure, Density, Elastic.] Real physics at 85% and 52% of hardware limit.
18 Neighbors Don't Count! Acoustic pressure update:
    po[center] = 2*pcc - po[center]*abs + vp2[center] * (
        pc[left] + pc[right] + pc[left2] + pc[right2] + pcm + pcp - 6*pcc );
The terms inside the parentheses are the neighbor reads. If we count neighbor reads as IO operations: 6 additional -> 10 IO operations, and 100 GB/s / 4 * 10 = 250 GB/s MTP peak on M2070. The theoretical hardware limit is 150 GB/s. DON'T COUNT NEIGHBOR READS.
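The arithmetic behind the warning can be written down directly. A small sketch, with a hypothetical function name, that rescales a measured ideal MTP to the figure one would report if extra accesses were counted:

```cpp
// Rescale a throughput measured with the ideal IO count (nio_ideal) to the
// number that results from counting nio_counted accesses per grid point.
// Hypothetical helper, shown only to reproduce the slide's arithmetic.
double inflated_mtp_gbs(double measured_gbs, int nio_ideal, int nio_counted) {
    return measured_gbs / nio_ideal * nio_counted;
}
```

`inflated_mtp_gbs(100.0, 4, 10)` gives 250 GB/s, which exceeds the 150 GB/s theoretical limit quoted on the slide and therefore cannot be a physical transfer rate.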
19 Memory Throughput. [Plot: MTP (GB/s) vs. cube size; series: Memcpy, Pressure, Density, Elastic.] Strong scaling on both K10 GPUs: same size and power consumption as M2070!
20 Other GPUs. [Bar charts: Memcpy, Pressure, Density and Elastic throughput on K10, K20X, K20, M2090, M2070 and GK104.] The green cards win.
21 Multi-GPU Weak Scaling on GK104. [Plots: per GPU MTP, % of single, vs. cube size, for Density and Elastic; series: 2 nodes (IPC), 4 nodes (MPI, PCIe 3), 8 nodes (IB FDR).] PCIe 2: 6 GB/s. PCIe 3: 12 GB/s.
22 Results Summary. Defined ideal, un-improvable memory throughput: MTP = N_IO * Grid Size * Word Size / time elapsed, N_IO = 2*DOF + Const, no neighbors or temporary variables. Came close to memcpy with real-world applications: acoustic 85 %, elastic 52 %; performance proportional to memcpy on various architectures. Solvers scale on multiple GPUs.
23 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.
24 General Considerations. Respect the number 32: 32 x 8 thread-blocks; fast axis sizes multiples of 32 (can be padded); hit global memory segments and L1 cache lines (32 x 4B = 128B). Rely on cache: shared memory requires extra operations; shared memory needs __syncthreads(); registers are faster than shared memory; if the working set fits in cache, cache is faster.
25 First Try: Acoustic Pressure
#define EXIT_BND(xx,yy,nx,ny) \
    int xx = blockIdx.x*blockDim.x + threadIdx.x; if(xx < 1 || xx >= nx - 1) return; \
    int yy = blockIdx.y*blockDim.y + threadIdx.y; if(yy < 1 || yy >= ny - 1) return;
// Flattened 3D indices; i1 is the fast axis
#define CENTER (i1 + i2*n1 + i3*n1*n2)
#define RIGHT  (i1+1 + i2*n1 + i3*n1*n2)
#define LEFT   (i1-1 + i2*n1 + i3*n1*n2)
#define RIGHT2 (i1 + (i2+1)*n1 + i3*n1*n2)
#define LEFT2  (i1 + (i2-1)*n1 + i3*n1*n2)
#define TOP    (i1 + i2*n1 + (i3+1)*n1*n2)
#define BOT    (i1 + i2*n1 + (i3-1)*n1*n2)
__global__ void pressure_gpu_het_vp2(const float* pc, float* po, const float* vp2,
                                     const int n1, const int n2, const int n3){
    EXIT_BND(i1,i2,n1,n2)            // one thread per (i1,i2) column; boundary threads exit
    int i3;
    for(i3 = 1; i3 < n3-1; i3++){    // march along the slow axis
        po[CENTER] = 2*pc[CENTER] - po[CENTER] + vp2[CENTER] * (
            pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2] + pc[BOT] + pc[TOP] - 6*pc[CENTER] );
    }
}
Yes, that's it.
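The results slide states that all solvers were verified against single CPU code. A minimal host-side reference for the same pressure update might look like the sketch below; it is not the author's CPU code, and the names `idx` and `pressure_cpu_het_vp2` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Flattened index with axis 1 fastest, matching i1 + i2*n1 + i3*n1*n2.
inline std::size_t idx(int i1, int i2, int i3, int n1, int n2) {
    return std::size_t(i1) + std::size_t(i2) * n1 + std::size_t(i3) * n1 * n2;
}

// One time step on all interior points. On entry po holds the previous
// field, on exit the new one; pc is the current field, vp2 the material term.
void pressure_cpu_het_vp2(const std::vector<float>& pc, std::vector<float>& po,
                          const std::vector<float>& vp2, int n1, int n2, int n3) {
    for (int i3 = 1; i3 < n3 - 1; ++i3)
        for (int i2 = 1; i2 < n2 - 1; ++i2)
            for (int i1 = 1; i1 < n1 - 1; ++i1) {
                const std::size_t c = idx(i1, i2, i3, n1, n2);
                // 7-point Laplacian stencil: six neighbors minus 6 * center
                const float lap = pc[idx(i1 - 1, i2, i3, n1, n2)] + pc[idx(i1 + 1, i2, i3, n1, n2)]
                                + pc[idx(i1, i2 - 1, i3, n1, n2)] + pc[idx(i1, i2 + 1, i3, n1, n2)]
                                + pc[idx(i1, i2, i3 - 1, n1, n2)] + pc[idx(i1, i2, i3 + 1, n1, n2)]
                                - 6.0f * pc[c];
                po[c] = 2.0f * pc[c] - po[c] + vp2[c] * lap;
            }
}
```

Running the same inputs through this loop and through the kernel, then comparing `po` element by element, is one simple way to check a GPU port.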
26 First Try: Acoustic Pressure. [Plot: MTP (GB/s) vs. cube size, annotated "Yay!" and "Boo Hoo".] Pretty good, but not good enough.
27 First Try: Suspect TLB misses. Translation Lookaside Buffers accelerate translation from virtual to physical memory and act like caches on the page table. "If the kernel's working set exceeds TLB capacity (or associativity) then one generates TLB capacity (or conflict) misses." (Performance Tuning of Scientific Applications)
28 Batched Execution. If the kernel's working set is too big, we'll reduce it:
__global__ void pressure_gpu_het_vp2(const float* pc, float* po, const float* vp2,
                                     const int n1, const int n2, const int n3, const int offset){
    EXIT_BND(i1,i2,n1,n2)
    int i3;
    for(i3 = offset+1; i3 < offset+n3-1; i3++){    // only the planes of this batch
        po[CENTER] = 2*pc[CENTER] - po[CENTER] + vp2[CENTER] * (
            pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2] + pc[BOT] + pc[TOP] - 6*pc[CENTER] );
    }
}
Launch kernel batches for the slowest axis.
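The slide does not show the host loop that launches the batches. One possible way to derive the `offset` and `n3` arguments is sketched below, assuming each launch should update a contiguous run of interior planes and that together the launches cover every interior plane exactly once. `Batch` and `make_batches` are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Arguments for one kernel launch: the kernel updates planes
// offset+1 .. offset+n3-2 along the slowest axis.
struct Batch {
    int offset;
    int n3;
};

// Split the interior planes 1 .. n3_total-2 into runs of at most
// planes_per_batch planes (planes_per_batch must be positive).
std::vector<Batch> make_batches(int n3_total, int planes_per_batch) {
    std::vector<Batch> batches;
    const int interior = n3_total - 2;
    for (int done = 0; done < interior; done += planes_per_batch) {
        const int planes = std::min(planes_per_batch, interior - done);
        // +2 because the kernel loop excludes the first and last plane of its range
        batches.push_back({done, planes + 2});
    }
    return batches;
}
```

Each returned pair would be passed as the last two arguments of one launch. Whether the "batch 32" label in the talk's plots means 32 planes per launch is not stated in the transcript.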
29 Batched Execution: First Try vs. Batched. [Plots: MTP (GB/s) vs. cube size; series: no batch, batch 32.] Done.
30 The Density Problem. The density equation has Vp inside the difference, which means twice the number of neighbors to fetch. [Plot: Pressure vs. Density, MTP (GB/s) vs. cube size; series: Pressure, Density Naïve.]
31 What Problem? Add Variable! Just replace vp*current inside the derivative by a variable! At every timestep: launch the ucvp = uc*vp kernel; launch the solver, take the ucvp derivative. Why is it slower?? We introduced an additional read + write, and the additional read + write don't count! Same problem, same result, same performance metric formula! [Plot: Pressure vs. Density, MTP (GB/s) vs. cube size; series: Pressure, Add Var, Density Naïve.] THAT problem.
32 The Density Trick. The code looks a little repetitive; we're multiplying by vp a whole lot of times:
    unew = 2*ucc - uo[center]*abs
         + uc[right] *v2[right]  + uc[left] *v2[left]
         + uc[right2]*v2[right2] + uc[left2]*v2[left2]
         + ucp*v2p + ucm*v2m - 6*ucc*v2c;
What if we do this:
    unew = (2*ucc - uo[center]*abs) / vp[center]   // divide by vp for time-step!
         + uc[right] + uc[left] + uc[right2] + uc[left2] + ucp + ucm - 6*ucc;
    uo[center] = unew*abs*vp[center];              // store wave-fields pre-multiplied with vp!
Same memory usage, same N_IO, but fewer neighbor reads!
33 The Density Trick. Memory access pattern the same as pressure: same performance as pressure! [Plot: Pressure vs. Density, MTP (GB/s) vs. cube size; series: Pressure, Add Var, Density Naïve, Pre-Mul.] And BOOM goes the dynamite.
34 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.
35 General Considerations. Very similar situation to the acoustic solver: fewer neighbors because of 1st derivatives, but more variables. Use batching, 32x8 thread-blocks, fast axis sizes multiples of 32. Use a staggered grid; average materials on the fly. All variables the same size: all coalesced, same stride for everyone.
36 Staggered Grid: Elementary Cell. Every grid point contains 12 elements: 3 particle velocity components Vx, Vy, Vz; 3 normal stress components Sxx, Syy, Szz; 3 shear stress components Sxy, Sxz, Syz; 3 material properties ρ, λ, μ. [Diagram: elementary cell in the x-y plane with the positions of Sxx, Syy, Szz, ρ, λ, μ; Vx; Vy; Sxy; and, out of screen, Sxz, Syz and Vz.]
37 Staggered Grid. Everyone is surrounded by the correct spatial difference neighbors. Materials over velocity and shear need to be averaged.
38 Staggered Grid. [Diagram legend: Vx; Vy; Sxy; Sxz over Vx; Sxy over Vy; Sxx, Syy, Szz, ρ, λ, μ; area updated; ghost stress, ignored; ghost stress, updated; boundary velocity, from neighbor or boundary condition.]
39 Separate Stress and Velocity Update. [Diagram: Velocity Kernel, Stress Kernel.]
40 Separate Stress and Velocity Update. [Diagram annotations: needs to be at time-step t; needs to be at time-step t+1/2; handled by thread-block, possibly on different SM.] Thread-block scheduling unknown -> read redundancy.
41 Separate Stress and Shear Update. [Diagram: Velocity Kernel, Stress Kernel, Shear Kernel.]
42 Separate Stress and Shear Update. [Diagram: loop ranges for i = 0 .. n-1 and for i = 1 .. n.] Divergence experimentally established to be slightly worse than read redundancy.
43 Individual Kernel Performance. Normal stress has no material averaging. Velocity needs to average density from 2 values, for 3 different positions. Shear stress needs to average the Lamé coefficient from 4 values, for 3 different positions. [Plot: Elastic Kernels, MTP (GB/s) vs. cube size; series: Stress, Shear, Velocity, Elastic.] In sequence they suffer read redundancy.
44 Read Redundancy. Individual kernels are close to the limit, but introduce read redundancy. Shear stress: read 3 V, read 3 SX, read 1 M, write 3 SX (10). Normal stress: read 3 V (AGAIN), read 3 S, read L, read M (AGAIN), write 3 S (11, 4 redundant). Velocity: read 3 V (AGAIN), read 3 SX (AGAIN), read 3 S (AGAIN), read R, write 3 V (13, 9 redundant). Total 34, 13 redundant. We could totally cheat and say we're doing 34 IO, and therefore our peak performance is 60 / 21 * 34 = 97 GB/s -> 83 % of memcpy speed! It's important to know whether an algorithm has room for improvement or not. This one definitely has!
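The bookkeeping on this slide can be reproduced with a small tally. A sketch with hypothetical names, using the per-kernel counts exactly as listed above:

```cpp
#include <vector>

// IO operations of one kernel, per grid point.
struct KernelIO {
    int reads;            // all variables read, including re-reads
    int writes;           // variables written
    int redundant_reads;  // reads of data another kernel already fetched
};

// Total IO actually issued by the kernel sequence.
int total_io(const std::vector<KernelIO>& kernels) {
    int total = 0;
    for (const KernelIO& k : kernels) total += k.reads + k.writes;
    return total;
}

// Ideal IO: the total with all redundant reads removed.
int ideal_io(const std::vector<KernelIO>& kernels) {
    int redundant = 0;
    for (const KernelIO& k : kernels) redundant += k.redundant_reads;
    return total_io(kernels) - redundant;
}
```

With shear `{7, 3, 0}`, normal stress `{8, 3, 4}` and velocity `{10, 3, 9}`, `total_io` returns 34 and `ideal_io` returns 21, the elastic N_IO used throughout the talk.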
45 Implementation Summary. Respected the number 32: memory segments, warps and L1 cache lines. Relied on cache: only works if the working unit is small enough, so reduce your working units. Give the hardware maximum possibility to parallelize: no __syncthreads(), minimum divergence.
46 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.
47 Summary. Ideal Memory Throughput: MTP = N_IO * Grid Size * Word Size / time elapsed, N_IO = 2*DOF + Const. GFlops are misleading in a memory-bound situation. Counting neighbors is a crime. Real-world applications can approach memcpy throughput: acoustic 85 % (100 GB/s on M2070, 180 GB/s on K10); elastic 52 % (60 GB/s on M2070, 100 GB/s on K10). Physics at memcpy throughput: physics for free!
48 Every Algorithm's Dream. For fixed problem size and hardware capabilities: 3D FFT: 40 GB/s, 180 Gflops/s (Read, Compute, Write). Acoustic: 100 GB/s, 70 Gflops/s (Read, C, Write). Memcpy: 117 GB/s, 0 Gflops/s (Read, Write). Which is faster?
49 References. Performance Tuning of Scientific Applications, David H. Bailey and Robert F. Lucas. GPU Performance Analysis and Optimization, Paulius Micikevicius, GTC. 3D Finite Difference Computation on GPUs using CUDA, Paulius Micikevicius, 2010. Numerical Modeling in Fortran, Day 9, Paul Tackley, 2012. Questions?
51 Performance Peak Analysis.
App Throughput (TP), GB/s: application speed.
Hardware (HW) TP, GB/s: hardware's transfer throughput. Profile how many bytes are transferred.
HW TP Limit, GB/s: practical throughput limit (memcpy). Practical instead of theoretical throughput limit.
App / Limit, %: 100 % is ideal.
App / HW, %: less than 100 % means not all bytes transferred are used.
HW / Limit, %: less than 100 % means the memory bus is underutilized.
Less than 100 % is only critical if it's substantially less.
52 Performance Peak Analysis: Density. [Table: App Throughput (TP) GB/s, Hardware (HW) TP GB/s, HW TP Limit GB/s, App / Limit %, App / HW % and HW / Limit % for M2070 and GK104; the values are not preserved.] Data from a 448 cubed, 10 time step run. Access pattern OK. Could have more concurrent memory access, especially on GK104, to increase HW utilization.
53 Register Queue: GK104 Profiled. [Table with columns Metric, Queue, No Q, Comments; rows: APP Time [sec]; APP MTP [GB/s]; Instructions [10^9] (comment: replays); Writes [GB]; Reads [GB] (comment: cache miss); Reads/cube (comment: cache miss); HW MTP [GB/s] (comment: stalls); APP / HW MTP [%] (comment: cache miss). The values are not preserved.] Cache miss causes memory replays and stalls.
54 Density Trick: Profiled on M2070. [Table with columns Metric, Trick, Naive, Comments; rows: APP Time [sec]; APP MTP [GB/s]; Instructions [10^9] (comment: replays); Writes [GB]; Reads [GB] (comment: cache miss); Reads/cube (comment: cache miss); HW MTP [GB/s] (comment: stalls); APP / HW MTP [%] (comment: cache miss). The values are not preserved.] Cache miss causes memory replays and stalls.
55 Profiling Notes. nvprof from CUDA toolkit 5.0: nvprof --events <event name>. Instructions: inst_issued (Fermi), inst_issued1 + 2*inst_issued2 (K10); counted per warp, 32 instructions per count. Writes: fb_subp0_write_sectors + fb_subp1_write_sectors, 32 bytes per count. Reads: fb_subp0_read_sectors + fb_subp1_read_sectors, 32 bytes per count.
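Turning the raw counters into the quantities in the profiling tables is a matter of scaling. The sketch below assumes that "32 instructions per count" means one warp-level count stands for 32 thread-level instructions, and that 1 GB is 2^30 bytes. The function names are hypothetical; only the counter names in the comments come from the slide.

```cpp
#include <cstdint>

// Bytes moved, from fb_subp0_*_sectors + fb_subp1_*_sectors (32 bytes per count).
double sectors_to_gb(std::uint64_t subp0_sectors, std::uint64_t subp1_sectors) {
    return double(subp0_sectors + subp1_sectors) * 32.0 / 1073741824.0;
}

// Thread-level instructions on Fermi, from inst_issued.
std::uint64_t instructions_fermi(std::uint64_t inst_issued) {
    return inst_issued * 32;
}

// Thread-level instructions on K10, from inst_issued1 + 2*inst_issued2.
std::uint64_t instructions_k10(std::uint64_t inst_issued1, std::uint64_t inst_issued2) {
    return (inst_issued1 + 2 * inst_issued2) * 32;
}
```

Dividing the read plus write volume from `sectors_to_gb` by the kernel time gives a hardware throughput that can be compared with the application MTP.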
56 Higher Order Approximations? Reported compute bound for large stencils, so not memory bound anymore: would you prefer to pay for the bus or ride it for free? Reported higher accuracy: assuming the function is well behaved and infinitely differentiable, which is not the case for heterogeneous media. Ironically, the free mall ride in Denver is cleaner and newer than the normal buses you actually pay for.
57 Smooth vs. Real World. Waves are smooth, for sure: sine and cosine are infinitely differentiable, so Taylor approximation seems like a good idea. Let's approximate some derivatives. [Plots: "smooth", sin(x); "real", sin(x) and sin(x) * 0.9.] All differences are multiplied by material properties. If the property has a step, difference x property will have a step. We chose a factor of 0.9 here -> not a very rough step. The function looks smooth.
58 Smooth vs. Heterogeneous: 1st Derivative. Order and error term: 2nd: h^2 * f^(iii)(a); 4th: h^4 * f^(v)(a); 6th: h^6 * f^(vii)(a); 8th: h^8 * f^(ix)(a). [Plots: "smooth 1st deriv" and "real 1st deriv"; error (big to small) vs. h (big to small); series: 2nd, 4th, 6th, 8th.] My oh my, what do we have here?
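The smooth half of this comparison is easy to reproduce. The sketch below uses the standard 2nd and 4th order central differences; `stepped_fn` is only a guess at the talk's "real" test function, with the step location at x = 1 chosen arbitrarily, and all names are hypothetical.

```cpp
#include <cmath>

// 2nd order central difference for f'(a).
double d1_order2(double (*f)(double), double a, double h) {
    return (f(a + h) - f(a - h)) / (2.0 * h);
}

// 4th order central difference for f'(a).
double d1_order4(double (*f)(double), double a, double h) {
    return (f(a - 2.0 * h) - 8.0 * f(a - h) + 8.0 * f(a + h) - f(a + 2.0 * h)) / (12.0 * h);
}

// Smooth test function.
double smooth_fn(double x) { return std::sin(x); }

// Assumed "real world" function: sin(x) scaled by 0.9 beyond a step at x = 1.
double stepped_fn(double x) { return x < 1.0 ? std::sin(x) : 0.9 * std::sin(x); }
```

On `smooth_fn` at a = 1 with h = 0.1, the 4th order error against cos(1) is smaller than the 2nd order error, as the h^2 and h^4 error terms predict. Evaluating both formulas on `stepped_fn` near the step is a way to explore the heterogeneous case; no outcome is claimed here.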
59 Smooth vs. Heterogeneous: 2nd Derivative. Order and error term: 2nd: h^2 * f^(iv)(a); 4th: h^4 * f^(vi)(a); 6th: h^6 * f^(viii)(a); 8th: h^8 * f^(x)(a). [Plots: "smooth 2nd deriv" and "real 2nd deriv"; error (big to small) vs. h (big to small); series: 2nd, 4th, 6th, 8th.] All orders fail, but the higher ones seem worse.
60 1D solvers: Pressure. [Plots: Pressure Hom. and Pressure Het.; sum abs err (1E-6) vs. points per wavelength; series: 2nd, 4th, 6th, 8th.] Higher order not substantially better below 6 ppw.
61 1D solvers: Stress Velocity. [Plots: SV Hom. and SV Het.; sum abs err (1E-9) vs. points per wavelength; series: 2nd, 4th, 6th, 8th.] Higher order WORSE below 6 ppw.
62 Higher Order Approximations? Reported larger time-step possible: smaller time-step required for the same resolution; lower resolution problematic in heterogeneous media. In conclusion: more expensive to develop; no accuracy benefits in heterogeneous media. Building a Ferrari with shopping cart wheels is silly: also need higher order boundary conditions, also need higher order time-stepping, etc. Higher order complications.
63 Kepler GK104 vs. M2070. [Plot: MTP (GB/s) vs. cube size; series: M2070, GK104.] Same memcpy bandwidth -> expect same performance.
64 What's wrong with GK104? GK104: max 2048 threads, 256 threads / TB; occupancy 1 -> 8 TB / SM; 8 SM x 8 TB / SM -> 64 TB concurrently; 512 KB L2 -> 8 KB L2 per TB. M2070: max 1536 threads, 256 threads / TB; occupancy 2/3 -> 4 TB / SM; 14 SM x 4 TB -> 56 TB concurrently; 768 KB L2 -> about 14 KB L2 per TB; AND: 48 KB L1 -> 12 KB L1 per TB. No need to fetch center and top: use the ancient register queue technique (3D Finite Difference Computation on GPUs using CUDA, Paulius Micikevicius, NVIDIA).
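The cache-per-thread-block estimate follows from three divisions. A sketch with hypothetical names, taking the number of resident threads per SM (maximum threads times occupancy) as an input rather than deriving occupancy:

```cpp
// Concurrency and cache share for one kernel configuration.
struct Residency {
    int blocks_per_sm;       // thread-blocks resident on one SM
    int concurrent_blocks;   // thread-blocks resident on the whole GPU
    double l2_kb_per_block;  // L2 capacity divided among resident blocks
};

// resident_threads_per_sm must be at least threads_per_block,
// otherwise no block is resident and the share is undefined.
Residency residency(int sm_count, int resident_threads_per_sm,
                    int threads_per_block, double l2_kb) {
    const int blocks_per_sm = resident_threads_per_sm / threads_per_block;
    const int concurrent = sm_count * blocks_per_sm;
    return {blocks_per_sm, concurrent, l2_kb / concurrent};
}
```

`residency(8, 2048, 256, 512.0)` reproduces the GK104 figures (8, 64, 8 KB), and `residency(14, 1024, 256, 768.0)` the M2070 ones (4, 56, about 14 KB), where 1024 is 2/3 of 1536.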
65 Register Queue: GK104 vs. M2070. [Plot: MTP (GB/s) vs. cube size, data label 95; series: M2070 - no reg Q, GK104 - reg Q, GK104 - no reg Q, M2070 - reg Q.] Further improvement is more likely through a concurrent access increase (more bytes in flight). Looking at compiler numbers, occupancy reduction to increase cache per TB seems like a bad idea (HW utilization limited). Fermi doesn't care, as expected. That's better.
66 The Kink. Why are the volume and pressure performance curves so jagged, and why is there a massive kink down at 384 (12*32)? Suspect: accidental locality. [Diagram: thread-blocks TB 0,0 to TB 0,2 on SM 0 and TB 0,3 to TB 0,5 on SM 1; on each SM the outermost accesses are marked "read" and the ones in between "read or hit cache".] Reading CENTER might prefetch someone's LEFT or RIGHT, or hit in cache. Reading LEFT or RIGHT might prefetch someone's CENTER, or hit in cache.
67 The Kink. If there is no accidental locality, there should be more IO operations than necessary, and a lower performance ceiling. Unnecessary right OR left: 5 instead of 4 IO, 4/5 = 80% throughput. Unnecessary right AND left: 6 instead of 4 IO, 4/6 = 66% throughput. How to test? Create the 80% situation with a chess pattern.
68 The Kink: Chess Experiment. Prevent the possibility of accidental locality by removing all neighbors (chess board pattern). Specifically: fast axis index = (2*blockIdx.x + blockIdx.y%2) * blockDim.x + threadIdx.x. No direct neighbors that can help each other, and either the left or the right overfetch is an unnecessary additional read. Expect 80% of peak performance: 80 GB/s, as the benchmark shows! [Plot: Pressure Chess Pattern, MTP (GB/s) vs. cube size; series: Pressure Normal, Pressure Chess.]
69 The Kink: Locality Effect? Second experiment: comment out left and right neighbor access. Results are relatively flat, not jagged. 14 SM on M2070: peak at 448 = 14*32? 384 = 12*32: some especially bad locality situation? [Plot: Pressure Chess Pattern, MTP (GB/s) vs. cube size; series: Pressure Normal, Pressure Chess, Pressure No Left+Right.]
70 Averaged Materials. Our weakest link is obviously the shear stress kernel, most probably because of the material average. What if we pre-average and store Mue, Mue_x, Mue_y and Mue_z? Less pressure on cache and a faster solver? Interesting.
71 Averaged Materials. The current shear stress kernel peaks at 85 GB/s -> if it goes up to 100 GB/s, overall performance won't improve much. The shear stress kernel currently has 7 reads and 3 writes, total 10. Adding 3 extra Mue to read would increase the total to 13. What memory throughput would MATCH the existing version? With t = s / mtp for s IO operations per grid point, equal run times t1 = t2 give s1 / mtp1 = s2 / mtp2, so mtp2 = mtp1 * s2 / s1 = 85 GB/s * 13 / 10, about 110 GB/s. Maybe possible, but even if so, it would still use much more memory. Alas.
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as
More informationCS 677: Parallel Programming for Many-core Processors Lecture 6
1 CS 677: Parallel Programming for Many-core Processors Lecture 6 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Logistics Midterm: March 11
More informationUnderstanding Outstanding Memory Request Handling Resources in GPGPUs
Understanding Outstanding Memory Request Handling Resources in GPGPUs Ahmad Lashgar ECE Department University of Victoria lashgar@uvic.ca Ebad Salehi ECE Department University of Victoria ebads67@uvic.ca
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationGPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60
1 / 60 GPU Programming Performance Considerations Miaoqing Huang University of Arkansas Fall 2013 2 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More informationOptimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as