Memory Bound Wave Propagation at Hardware Limit. Igor Podladtchikov, Spectraseis Inc


1 Memory Bound Wave Propagation at Hardware Limit Igor Podladtchikov, Spectraseis Inc March 19, 2013

2 Microseismic Monitoring. A geophysical method to locate subsurface events: propagate and image time-reversed data acquired at the surface, known as Time Reversed Imaging (TRI). Uses the full wave equation, acoustic or elastic, in heterogeneous materials. Needs very fast solvers: thousands of events, big models.

3 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.

4 Performance Limiters. [Diagram: processor <-> memory.] Computation: ~1000 Gflops/s on the processor; transfer: ~100 GB/s to memory. These are the two principal performance limiters.

5 Acoustic Equations. [Equations shown on slide.] 1 variable read & write, 2 variables read only.
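The equations themselves did not survive transcription. As a hedged reconstruction, a standard constant-density acoustic formulation consistent with the pressure kernel on slide 25, with one read & write field and two read-only fields, would be:

\[ \frac{\partial^2 p}{\partial t^2} = v_p^2\,\nabla^2 p, \qquad p^{n+1}_{ijk} = 2\,p^{n}_{ijk} - a\,p^{n-1}_{ijk} + \bar v^2_{ijk}\Big(\textstyle\sum_{\text{6 neighbors}} p^{n} - 6\,p^{n}_{ijk}\Big), \quad \bar v^2 = \frac{v_p^2\,\Delta t^2}{h^2} \]

Here p is the read & write variable; the two read-only variables are plausibly \( \bar v^2 \) (vp2 in the code) and the absorbing coefficient a (abs on slides 18 and 32).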

6 Elastic Equations. [Equations shown on slide.] 9 variables read & write, 3 variables read only.
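The elastic equations were likewise lost in transcription. A hedged reconstruction in the standard first-order velocity-stress form, matching the 12-element staggered cell on slide 36, is:

\[ \rho\,\frac{\partial v_i}{\partial t} = \sum_j \frac{\partial \sigma_{ij}}{\partial x_j}, \qquad \frac{\partial \sigma_{ij}}{\partial t} = \lambda\,\delta_{ij}\sum_k \frac{\partial v_k}{\partial x_k} + \mu\left(\frac{\partial v_i}{\partial x_j} + \frac{\partial v_j}{\partial x_i}\right) \]

The 9 read & write variables are the velocities Vx, Vy, Vz and the six independent stresses; the 3 read-only variables are the materials ρ, λ, μ.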

7 Arithmetic Intensity. The flops/bytes ratio decides: compute or memory bound? Count BYTES, not numbers of values: single precision is 4 bytes each. E.g. a 1st derivative: 2 reads, 1 write, 12 bytes transferred.

8 Machine Balance. M2070: peak flops / peak bytes = 1030 / 117 ~ 9. K10: 4577 / 228 ~ 20. Arithmetic intensity << machine balance: memory bound. Arithmetic intensity >> machine balance: compute bound.
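Putting slides 7 and 8 together, and assuming the centered first derivative costs 2 flops (one subtraction, one multiplication; the deck only gives its byte count):

\[ \mathrm{AI}_{\text{1st deriv}} = \frac{2\ \text{flops}}{12\ \text{bytes}} \approx 0.17 \ \ll\ B_{\text{M2070}} = \frac{1030}{117} \approx 9 \]

so stencil kernels of this kind sit deep in memory-bound territory.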

9 Arithmetic Intensity. [Chart: flops and bytes per grid point for the acoustic and elastic solvers; elastic is roughly 2x the flops and 2x the bytes of acoustic.] Machine balance: Fermi ~9, Kepler ~20. Our flops/bytes ratio is << 9: we're memory bound.

10 Memory Bound What To Do? Option A: give up (don't even try), blame "memory bound" for slow code. Option B: celebrate: FLOPS are for FREE; optimize memory access efficiency; count bytes, not flops; try to approach memcpy throughput. Our claim: real-world applications can run close to memcpy!

11 How to optimize memory access? Aim for minimum reads/writes: touch everything once (un-improvable), don't read neighbors twice ("read me once, don't read me again!"), try to avoid redundant reads/writes.

12 Don't count neighbor reads! Track optimization progress, but don't count neighbors in your performance metric! "Remember traffic is the volume of data to a particular memory. It is not the number of loads and stores" (Performance Tuning of Scientific Applications). Don't cheat (yourself).

13 Ideal Memory Throughput.

MTP = N_IO * Grid Size * Word Size / Time Elapsed
N_IO = 2 * DOF + Constants

DOF: degrees of freedom (read and write). Constants: read only. Grid Size: nx * ny * nz * nt. Word Size: 4 bytes (single precision). No neighbors!
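As a sketch, the same formula as a host-side helper (hypothetical names, not code from the deck):

double ideal_mtp_gb_s(int dof, int constants,
                      long long nx, long long ny, long long nz, long long nt,
                      double elapsed_s)
{
    const double n_io      = 2.0 * dof + constants;   // read+write DOFs plus read-only constants
    const double grid_size = (double)nx * ny * nz * nt;
    const double word_size = 4.0;                     // single precision
    return n_io * grid_size * word_size / elapsed_s / 1e9;  // GB/s
}

E.g. acoustic is ideal_mtp_gb_s(1, 2, ...) and elastic is ideal_mtp_gb_s(9, 3, ...), giving the N_IO of 4 and 21 on the next slide.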

14 Ideal Memory Throughput, applied:

MTP Acoustic = 4 * Grid Size * 4 bytes / Time Elapsed
MTP Elastic = 21 * Grid Size * 4 bytes / Time Elapsed

N_IO Acoustic: 2 * 1 + 2 = 4. N_IO Elastic: 2 * 9 + 3 = 21.

15 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.

16 Results. All solvers include a free surface, an absorbing layer, and domain decomposition along all 3 axes (IPC if the GPUs are mappable, MPI otherwise). All solvers verified against single-CPU code. All data from the NVIDIA PSG Cluster. Thank you! Without further ado...

17 Memory Throughput on M2070. [Chart: MTP in GB/s vs. cube size; series: Memcpy, Pressure, Density, Elastic.] Real physics at 85% and 52% of the hardware limit.

18 Neighbors Don't Count! Acoustic pressure update:

po[CENTER] = 2*pcc - po[CENTER]*abs
           + vp2[CENTER] * ( pc[LEFT ] + pc[RIGHT ]
                           + pc[LEFT2] + pc[RIGHT2]
                           + pcm + pcp - 6*pcc );

If we count the 6 neighbor reads as IO operations, N_IO becomes 10 instead of 4, and our measured 100 GB/s turns into 100 / 4 * 10 = 250 GB/s "MTP" peak on the M2070, whose theoretical hardware limit is 150 GB/s. DON'T COUNT NEIGHBOR READS.

19 Memory Throughput on K10. [Chart: MTP in GB/s vs. cube size; series: Memcpy, Pressure, Density, Elastic.] Strong scaling on both K10 GPUs, at the same size and power consumption as the M2070!

20 Other GPUs. [Table: Memcpy, Pressure, Density and Elastic throughput for K10, K20X, K20, M2090, M2070, GK104.] The green cards win.

21 Multi-GPU Weak Scaling on GK104. [Charts: per-GPU MTP as % of a single GPU vs. cube size, for Density and Elastic; 2 nodes (IPC), 4 nodes (MPI PCIe3), 8 nodes (IB FDR).] PCIe 2: 6 GB/s; PCIe 3: 12 GB/s.

22 Results Summary. Defined the ideal, un-improvable memory throughput: MTP = N_IO * Grid Size * Word Size / time elapsed, N_IO = 2*DOF + Const; no neighbors or temporary variables. Came close to memcpy with real-world applications: acoustic 85%, elastic 52%, with performance proportional to memcpy on various architectures. Solvers scale on multiple GPUs.

23 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.

24 General Considerations. Respect the number 32: 32 x 8 thread-blocks; fast axis sizes multiples of 32 (can be padded); hit global memory segments and L1 cache lines (32 x 4 B = 128 B). Rely on cache: shared memory requires extra operations and needs __syncthreads(); registers are faster than shared memory; if the working set fits in cache, cache is faster.

25 First Try Acoustic Pressure.

#define EXIT_BND(xx,yy,nx,ny) \
    int xx = blockIdx.x*blockDim.x + threadIdx.x; if(xx < 1 || xx >= nx - 1) return; \
    int yy = blockIdx.y*blockDim.y + threadIdx.y; if(yy < 1 || yy >= ny - 1) return;

#define CENTER (i1   +  i2   *n1 +  i3   *n1*n2)
#define RIGHT  (i1+1 +  i2   *n1 +  i3   *n1*n2)
#define LEFT   (i1-1 +  i2   *n1 +  i3   *n1*n2)
#define RIGHT2 (i1   + (i2+1)*n1 +  i3   *n1*n2)
#define LEFT2  (i1   + (i2-1)*n1 +  i3   *n1*n2)
#define TOP    (i1   +  i2   *n1 + (i3+1)*n1*n2)
#define BOT    (i1   +  i2   *n1 + (i3-1)*n1*n2)

__global__ void pressure_gpu_het_vp2(const float* pc, float* po, const float* vp2,
                                     const int n1, const int n2, const int n3)
{
    EXIT_BND(i1,i2,n1,n2)
    int i3;
    for(i3 = 1; i3 < n3-1; i3++){
        po[CENTER] = 2*pc[CENTER] - po[CENTER]
                   + vp2[CENTER] * ( pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2]
                                   + pc[BOT] + pc[TOP] - 6*pc[CENTER] );
    }
}

Yes, that's it.
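A minimal host-side launch for this kernel, assuming the 32 x 8 thread-blocks from slide 24 (device pointer names and error handling are illustrative, not from the deck):

dim3 block(32, 8);                            // respect the number 32
dim3 grid((n1 + block.x - 1) / block.x,       // one thread per (i1,i2) column,
          (n2 + block.y - 1) / block.y);      // each thread marches along i3
pressure_gpu_het_vp2<<<grid, block>>>(pc_d, po_d, vp2_d, n1, n2, n3);
cudaDeviceSynchronize();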

26 First Try Acoustic Pressure. [Chart: MTP in GB/s vs. cube size, annotated "Yay!" at small sizes and "Boo Hoo" at large ones.] Pretty good, but not good enough.

27 First Try. Suspect: TLB misses. Translation Lookaside Buffers accelerate translation from virtual to physical memory and act like caches on the page table. "If the kernel's working set exceeds TLB capacity (or associativity) then one generates TLB capacity (or conflict) misses." (Performance Tuning of Scientific Applications)

28 Batched Execution. If the kernel's working set is too big, we'll reduce it:

__global__ void pressure_gpu_het_vp2(const float* pc, float* po, const float* vp2,
                                     const int n1, const int n2, const int n3,
                                     const int offset)
{
    EXIT_BND(i1,i2,n1,n2)
    int i3;
    for(i3 = offset+1; i3 < offset+n3-1; i3++){
        po[CENTER] = 2*pc[CENTER] - po[CENTER]
                   + vp2[CENTER] * ( pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2]
                                   + pc[BOT] + pc[TOP] - 6*pc[CENTER] );
    }
}

Launch kernel batches for the slowest axis.
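A sketch of that batched launch, assuming the kernel's n3 argument is the local batch extent and using the batch depth of 32 seen on slide 29 (the bookkeeping is illustrative, not from the deck):

const int batch = 32;                          // planes per launch along the slow axis
for (int offset = 0; offset < n3 - 2; offset += batch) {
    int depth = n3 - 2 - offset;               // interior planes left to process
    if (depth > batch) depth = batch;
    // each launch walks i3 = offset+1 .. offset+depth, keeping the
    // working set small enough for the TLB
    pressure_gpu_het_vp2<<<grid, block>>>(pc_d, po_d, vp2_d, n1, n2, depth + 2, offset);
}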

29 Batched Execution: First Try vs. Batched. [Charts: MTP in GB/s vs. cube size; series: no batch, batch 32.] Done.

30 The Density Problem. The density equation has Vp inside the difference, which means twice the number of neighbors to fetch. [Chart: Pressure vs. naïve Density, MTP in GB/s over cube size.]

31 What Problem? Add Variable! Just replace vp*current inside the derivative by a variable: at every timestep, launch a ucvp = uc*vp kernel, then launch the solver and take the ucvp derivative. Why is it slower?? We introduced an additional read+write, and that additional read+write doesn't count: same problem, same result, same performance metric formula! [Chart: Pressure vs. Add Var vs. naïve Density.] THAT problem.

32 The Density Trick. The code looks a little repetitive, we're multiplying by vp a whole lot of times:

unew = 2*ucc - uo[CENTER]*abs
     + uc[RIGHT] *v2[RIGHT]  + uc[LEFT] *v2[LEFT]
     + uc[RIGHT2]*v2[RIGHT2] + uc[LEFT2]*v2[LEFT2]
     + ucp*v2p + ucm*v2m - 6*ucc*v2c;

What if we do this:

unew = (2*ucc - uo[CENTER]*abs) / vp[CENTER]  // divide by vp for the time-step!
     + uc[RIGHT] + uc[LEFT] + uc[RIGHT2] + uc[LEFT2]
     + ucp + ucm - 6*ucc;
uo[CENTER] = unew*abs*vp[CENTER];             // store wave-fields pre-multiplied with vp!

Same memory usage, same N_IO, but fewer neighbor reads!
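Storing the wave-fields pre-multiplied with vp needs a one-time setup; a minimal sketch, assuming flat arrays of n points (kernel name is illustrative, not from the deck):

__global__ void premultiply_vp(float* u, const float* vp, const int n)
{
    // one-time conversion so the solver divides/multiplies by vp only at the
    // center point instead of multiplying every neighbor
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] *= vp[i];
}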

33 The Density Trick. The memory access pattern is now the same as pressure, so: same performance as pressure! [Chart: Pressure vs. Add Var vs. naïve Density vs. Pre-Mul.] And BOOM goes the dynamite.

34 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.

35 General Considerations. Very similar situation to the acoustic solver: fewer neighbors because of 1st derivatives, but more variables. Use batching, 32x8 thread-blocks, fast axis sizes multiples of 32. Use a staggered grid and average materials on the fly. All variables the same size, all coalesced, same stride for everyone.

36 Staggered Grid Elementary Cell. [Diagram: positions of Vx, Vy, Vz, Sxy, Sxz, Syz within the cell (Vz, Sxz, Syz out of screen), with Sxx, Syy, Szz, ρ, λ, μ collocated at the cell center.] Every grid point contains 12 elements: 3 particle velocity components Vx, Vy, Vz; 3 normal stress components Sxx, Syy, Szz; 3 shear stress components Sxy, Sxz, Syz; 3 material properties ρ, λ, μ.
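A sketch of the corresponding storage, following slide 35's "all variables the same size, same stride" (a hypothetical allocation helper, not code from the deck):

// twelve equally sized flat arrays (structure-of-arrays), so every kernel
// uses identical, coalesced indexing; error checking omitted for brevity
struct ElasticFields {
    float *vx, *vy, *vz;        // particle velocities
    float *sxx, *syy, *szz;     // normal stresses
    float *sxy, *sxz, *syz;     // shear stresses
    float *rho, *lam, *mu;      // material properties
};

void alloc_fields(ElasticFields* f, int n1, int n2, int n3)
{
    size_t bytes = (size_t)n1 * n2 * n3 * sizeof(float);
    float** all[12] = { &f->vx, &f->vy, &f->vz, &f->sxx, &f->syy, &f->szz,
                        &f->sxy, &f->sxz, &f->syz, &f->rho, &f->lam, &f->mu };
    for (int i = 0; i < 12; i++) cudaMalloc(all[i], bytes);
}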

37 Staggered Grid. Everyone is surrounded by the correct spatial difference neighbors. Materials over the Velocity and Shear positions need to be averaged.

38 Staggered Grid. [Diagram: update areas for Vx, Vy, Sxy, Sxz over Vx, Sxy over Vy, and the collocated Sxx, Syy, Szz, ρ, λ, μ; legend: area updated; ghost stress, ignored; ghost stress, updated; boundary velocity, from neighbor or boundary condition.]

39 Separate Stress and Velocity Update. [Diagram: Velocity Kernel and Stress Kernel.]

40 Separate Stress and Velocity Update. [Diagram annotations: one field needs to be at time-step t, the other at time-step t+1/2; each region is handled by a thread-block, possibly on a different SM.] Thread-block scheduling is unknown -> read redundancy.

41 Separate Stress and Shear Update. [Diagram: Velocity Kernel, Stress Kernel and Shear Kernel.]

42 Separate Stress and Shear Update. [Diagram: fused update with differing loop ranges, for i = 0 .. n-1 vs. for i = 1 .. n.] Divergence was experimentally established to be slightly worse than read redundancy.

43 Individual Kernel Performance. Normal stress has no material averaging; velocity needs to average density from 2 values, for 3 different positions; shear stress needs to average the Lamé coefficient from 4 values, for 3 different positions. [Chart: elastic kernels, MTP in GB/s vs. cube size; series: Stress, Shear, Velocity, Elastic.] Run in sequence, they suffer read redundancy.

44 Read Redundancy. Individual kernels are close to the limit, but running them in sequence introduces read redundancy:

shear stress: read 3 V, read 3 SX, read 1 M, write 3 SX (10)
normal stress: read 3 V (AGAIN), read 3 S, read L, read M (AGAIN), write 3 S (11, 4 redundant)
velocity: read 3 V (AGAIN), read 3 SX (AGAIN), read 3 S (AGAIN), read R, write 3 V (13, 9 redundant)

Total 34, 13 redundant. We could totally cheat and say we're doing 34 IO, and therefore our peak performance is 60 / 21 * 34 = 97 GB/s -> 83% of memcpy speed! It's important to know whether an algorithm has room for improvement or not. This one definitely has!

45 Implementation Summary. Respected the number 32: memory segments, warps and L1 cache lines. Relied on cache: only works if the working unit is small enough, so reduce your working units. Give the hardware maximum opportunity to parallelize: no __syncthreads(), minimum divergence.

46 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.

47 Summary. Ideal Memory Throughput: MTP = N_IO * Grid Size * Word Size / time elapsed, N_IO = 2*DOF + Const. Gflops are misleading in a memory-bound situation; counting neighbors is a crime. Real-world applications can approach memcpy throughput: acoustic 85% (100 GB/s on M2070, 180 GB/s on K10); elastic 52% (60 GB/s on M2070, 100 GB/s on K10). Physics at memcpy throughput: physics for free!

48 Every Algorithm's Dream. For fixed problem size and hardware capabilities: 3D FFT: 40 GB/s, 180 Gflops/s (read, compute, write). Acoustic: 100 GB/s, 70 Gflops/s (read, small compute, write). Memcpy: 117 GB/s, 0 Gflops/s (read, write). Which is faster?

49 References. Performance Tuning of Scientific Applications, David H. Bailey and Robert F. Lucas. GPU Performance Analysis and Optimization, Paulius Micikevicius, GTC 2012. 3D Finite Difference Computation on GPUs using CUDA, Paulius Micikevicius, 2010. Numerical Modeling in Fortran, Day 9, Paul Tackley, 2012. Questions?


51 Performance Peak Analysis.

App Throughput (TP), GB/s: application speed.
Hardware (HW) TP, GB/s: the hardware's transfer throughput; profile how many bytes are actually transferred.
HW TP Limit, GB/s: practical throughput limit (memcpy); practical instead of theoretical.

App / Limit %: 100% is ideal.
App / HW %: less than 100% means not all bytes transferred are used.
HW / Limit %: less than 100% means the memory bus is underutilized.

Less than 100% is only critical if it's substantially less.

52 Performance Peak Analysis: Density. [Table: App TP (GB/s), HW TP (GB/s), HW TP Limit (GB/s), App/Limit (%), App/HW (%), HW/Limit (%) for M2070 and GK104.] Data from a 448-cubed, 10-time-step run. Access pattern OK. Could have more concurrent memory access, especially on GK104, to increase HW utilization.

53 Register Queue GK104 Profiled. [Table: queue vs. no queue; metrics: APP time (sec), APP MTP (GB/s), instructions (10^9; replays), writes (GB), reads (GB; cache misses), reads/cube (cache misses), HW MTP (GB/s; stalls), APP/HW MTP (%).] Cache misses cause memory replays and stalls.

54 Density Trick Profiled on M2070. [Table: trick vs. naïve; same metrics as the previous slide.] Cache misses cause memory replays and stalls.

55 Profiling Notes. nvprof from CUDA toolkit 5.0: nvprof --event <event name>. inst_issued (Fermi), inst_issued1 + 2*inst_issued2 (K10): per warp, 32 instructions per count. fb_subp0_write_sectors + fb_subp1_write_sectors: 32 bytes per count. fb_subp0_read_sectors + fb_subp1_read_sectors: 32 bytes per count.
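As a sketch, the per-count conversions above as host-side helpers (hypothetical names, following the factors stated on this slide):

// sector counters tick once per 32 bytes, instruction counters once per warp
static double sectors_to_gb(unsigned long long subp0, unsigned long long subp1)
{
    return (double)(subp0 + subp1) * 32.0 / 1e9;
}
static unsigned long long warp_counts_to_instructions(unsigned long long count)
{
    return count * 32ULL;   // 32 threads per warp
}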

56 Higher Order Approximations? Reported compute bound for large stencils, so not memory bound anymore: would you prefer to pay for the bus or ride it for free? Reported higher accuracy: this assumes the function is well behaved and infinitely differentiable, which is not the case for heterogeneous media. Ironically, the free mall ride in Denver is cleaner and newer than the normal buses you actually pay for.

57 Smooth vs. Real World. Waves are smooth, for sure: sine and cosine are infinitely differentiable, so Taylor approximation seems like a good idea. Let's approximate some derivatives. But all differences are multiplied by material properties, and if the property has a step, difference x property will have a step. [Plots: smooth sin(x) vs. "real" sin(x) * 0.9 beyond a material step.] We chose a factor of 0.9 here -> not a very rough step, and the function still looks smooth.

58 Smooth vs. Heterogeneous: 1st Derivative. Leading truncation error terms by order: 2nd ~ h^2 f'''(a); 4th ~ h^4 f^(v)(a); 6th ~ h^6 f^(vii)(a); 8th ~ h^8 f^(ix)(a). [Plots: error vs. step size h (big h to small h) for 2nd, 4th, 6th and 8th order; small error throughout on the smooth function, big error on the real one.] My oh my, what do we have here?

59 Smooth vs. Heterogeneous: 2nd Derivative. Leading truncation error terms by order: 2nd ~ h^2 f^(iv)(a); 4th ~ h^4 f^(vi)(a); 6th ~ h^6 f^(viii)(a); 8th ~ h^8 f^(x)(a). [Plots: error vs. step size h for 2nd, 4th, 6th and 8th order, smooth vs. real.] All orders fail, but the higher ones seem worse.

60 1D solvers: Pressure. [Plots: sum of absolute error (1E-6) vs. points per wavelength, homogeneous and heterogeneous, for 2nd, 4th, 6th and 8th order.] Higher order is not substantially better below 6 ppw.

61 1D solvers: Stress-Velocity. [Plots: sum of absolute error (1E-9) vs. points per wavelength, homogeneous (SV Hom.) and heterogeneous (SV Het.), for 2nd, 4th, 6th and 8th order.] Higher order is WORSE below 6 ppw.

62 Higher Order Approximations? Reported that a larger time-step is possible, but a smaller time-step is required for the same resolution, and lower resolution is problematic in heterogeneous media. In conclusion: more expensive to develop, no accuracy benefits in heterogeneous media. Building a Ferrari with shopping-cart wheels is silly: you also need higher order boundary conditions, higher order time-stepping, etc. Higher order complications.

63 Kepler GK104. [Chart: GK104 vs. M2070, MTP in GB/s over cube size.] Same memcpy bandwidth -> expect the same performance.

64 What's wrong with GK104?

GK104: max 2048 threads, 256 threads/TB, occupancy 1 -> 8 TB/SM; 8 SMs x 8 TB/SM -> 64 TBs concurrently; 512 KB L2 -> 8 KB L2 per TB.
M2070: max 1536 threads, 256 threads/TB, occupancy 2/3 -> 4 TB/SM; 14 SMs x 4 TB -> 56 TBs concurrently; 768 KB L2 -> about 14 KB L2 per TB; AND 48 KB L1 -> 12 KB L1 per TB.

No need to fetch center and top: use the ancient register queue technique (3D Finite Difference Computation on GPUs using CUDA, Paulius Micikevicius, NVIDIA).
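A hedged sketch of that register queue in the pressure kernel's i3 march (following Micikevicius's technique, not code from the deck; CENTER, TOP etc. are the index macros from slide 25, which evaluate i3 at the point of use):

int i3 = 0;
float p_bot = pc[CENTER];            // plane below the first interior plane
i3 = 1;
float p_cc  = pc[CENTER];            // current plane
float p_top;
for (i3 = 1; i3 < n3-1; i3++) {
    p_top = pc[TOP];                 // the only new slow-axis fetch per step
    po[CENTER] = 2*p_cc - po[CENTER]
               + vp2[CENTER] * ( pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2]
                               + p_bot + p_top - 6*p_cc );
    p_bot = p_cc;                    // shift the queue: center and bottom
    p_cc  = p_top;                   // planes are never re-fetched
}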

65 Register Queue. [Chart: GK104 vs. M2070 over cube size; series: M2070 no reg Q, GK104 reg Q, GK104 no reg Q, M2070 reg Q.] Further improvement is more likely through increased concurrent access (more bytes in flight). Looking at compiler numbers, reducing occupancy to increase cache per TB seems like a bad idea (HW utilization limited). Fermi doesn't care, as expected. That's better.

66 The Kink. Why is the volume and pressure performance curve so jagged, and why is there a massive kink down at 384 (12*32)? Suspect: accidental locality. [Diagram: thread-blocks TB 0,0 .. TB 0,5 spread across SM 0 and SM 1; each read may prefetch a neighbor block's data or hit in cache.] Reading CENTER might prefetch someone's LEFT or RIGHT, or hit in cache; reading LEFT or RIGHT might prefetch someone's CENTER, or hit in cache.

67 The Kink. If there were no accidental locality, there would be more IO operations than necessary, and a lower performance ceiling: an unnecessary RIGHT or LEFT read: 5 instead of 4 IO -> 4/5 = 80% throughput; unnecessary RIGHT and LEFT reads: 6 instead of 4 IO -> 4/6 = 66% throughput. How to test? Create the 80% situation with a chess pattern.

68 The Kink Chess Experiment. Prevent any possibility of accidental locality by removing all neighbors (chess-board pattern), specifically:

fast axis index = (2*blockIdx.x + blockIdx.y%2) * blockDim.x + threadIdx.x;

No direct neighbors can help each other, and either the LEFT or the RIGHT overfetch becomes an unnecessary additional read. Expect 80% of peak performance: 80 GB/s, as the benchmark shows! [Chart: Pressure Normal vs. Pressure Chess over cube size.]

69 The Kink Locality Effect? Second experiment: comment out the left and right neighbor access. The results are relatively flat, not jagged. 14 SMs on the M2070, peak at 448 = 14*32? And 384 = 12*32: some especially bad locality situation? [Chart: Pressure Normal, Pressure Chess, Pressure No Left+Right over cube size.]

70 Averaged Materials. Our weakest link is obviously the shear stress kernel, most probably because of the material average. What if we pre-average and store Mue, Mue_x, Mue_y and Mue_z? Less pressure on the cache and a faster solver? Interesting.

71 Averaged Materials. The current shear stress kernel peaks at 85 GB/s -> even if it went up to 100 GB/s, overall performance wouldn't improve much. The shear stress kernel currently has 7 reads and 3 writes, total 10. Adding 3 extra Mue to read would increase the total to 13. What memory throughput would MATCH the existing version? Equal elapsed time t = s1/mtp1 = s2/mtp2 means mtp2 = mtp1 * s2/s1 = 85 GB/s * 13/10 = 110.5 GB/s. Maybe possible, but even if, it still uses much more memory. Alas.
