Memory Bound Wave Propagation at Hardware Limit. Igor Podladtchikov, Spectraseis Inc


1 Memory Bound Wave Propagation at Hardware Limit Igor Podladtchikov, Spectraseis Inc March 19, 2013

2 Microseismic Monitoring. A geophysical method to locate subsurface events: propagate and image time-reversed data acquired at the surface, known as Time Reversed Imaging (TRI). Uses the full wave equation, acoustic or elastic, in heterogeneous materials. Needs very fast solvers: thousands of events, big models.

3 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.

4 Performance Limiters. [Diagram: processor <-> memory.] Computation: ~1000 Gflops/s on the processor; transfer: ~100 GB/s to memory. These are the two principal performance limiters.

5 Acoustic Equations. [Equations shown on slide.] 1 variable read & write, 2 variables read only.
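The equations themselves did not survive transcription. As a hedged reconstruction, a standard constant-density acoustic formulation consistent with the pressure kernel on slide 25, with one read & write field and two read-only fields, would be:

\[ \frac{\partial^2 p}{\partial t^2} = v_p^2\,\nabla^2 p, \qquad p^{n+1}_{ijk} = 2\,p^{n}_{ijk} - a\,p^{n-1}_{ijk} + \bar v^2_{ijk}\Big(\textstyle\sum_{\text{6 neighbors}} p^{n} - 6\,p^{n}_{ijk}\Big), \quad \bar v^2 = \frac{v_p^2\,\Delta t^2}{h^2} \]

Here p is the read & write variable; the two read-only variables are plausibly \( \bar v^2 \) (vp2 in the code) and the absorbing coefficient a (abs on slides 18 and 32).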

6 Elastic Equations. [Equations shown on slide.] 9 variables read & write, 3 variables read only.
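The elastic equations were likewise lost in transcription. A hedged reconstruction in the standard first-order velocity-stress form, matching the 12-element staggered cell on slide 36, is:

\[ \rho\,\frac{\partial v_i}{\partial t} = \sum_j \frac{\partial \sigma_{ij}}{\partial x_j}, \qquad \frac{\partial \sigma_{ij}}{\partial t} = \lambda\,\delta_{ij}\sum_k \frac{\partial v_k}{\partial x_k} + \mu\left(\frac{\partial v_i}{\partial x_j} + \frac{\partial v_j}{\partial x_i}\right) \]

The 9 read & write variables are the velocities Vx, Vy, Vz and the six independent stresses; the 3 read-only variables are the materials ρ, λ, μ.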

7 Arithmetic Intensity. The flops/bytes ratio decides: compute or memory bound? Count BYTES, not numbers of values: single precision is 4 bytes each. E.g. a 1st derivative: 2 reads, 1 write, 12 bytes transferred.

8 Machine Balance. M2070: peak flops / peak bytes = 1030 / 117 ~ 9. K10: 4577 / 228 ~ 20. Arithmetic intensity << machine balance: memory bound. Arithmetic intensity >> machine balance: compute bound.
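Putting slides 7 and 8 together, and assuming the centered first derivative costs 2 flops (one subtraction, one multiplication; the deck only gives its byte count):

\[ \mathrm{AI}_{\text{1st deriv}} = \frac{2\ \text{flops}}{12\ \text{bytes}} \approx 0.17 \ \ll\ B_{\text{M2070}} = \frac{1030}{117} \approx 9 \]

so stencil kernels of this kind sit deep in memory-bound territory.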

9 Arithmetic Intensity. [Chart: flops and bytes per grid point for the acoustic and elastic solvers; elastic is roughly 2x the flops and 2x the bytes of acoustic.] Machine balance: Fermi ~9, Kepler ~20. Our flops/bytes ratio is << 9: we're memory bound.

10 Memory Bound What To Do? Option A: give up (don't even try), blame "memory bound" for slow code. Option B: celebrate: FLOPS are for FREE; optimize memory access efficiency; count bytes, not flops; try to approach memcpy throughput. Our claim: real-world applications can run close to memcpy!

11 How to optimize memory access? Aim for minimum reads/writes: touch everything once (un-improvable), don't read neighbors twice ("read me once, don't read me again!"), try to avoid redundant reads/writes.

12 Don't count neighbor reads! Track optimization progress, but don't count neighbors in your performance metric! "Remember traffic is the volume of data to a particular memory. It is not the number of loads and stores" (Performance Tuning of Scientific Applications). Don't cheat (yourself).

13 Ideal Memory Throughput.

MTP = N_IO * Grid Size * Word Size / Time Elapsed
N_IO = 2 * DOF + Constants

DOF: degrees of freedom (read and write). Constants: read only. Grid Size: nx * ny * nz * nt. Word Size: 4 bytes (single precision). No neighbors!
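As a sketch, the same formula as a host-side helper (hypothetical names, not code from the deck):

double ideal_mtp_gb_s(int dof, int constants,
                      long long nx, long long ny, long long nz, long long nt,
                      double elapsed_s)
{
    const double n_io      = 2.0 * dof + constants;   // read+write DOFs plus read-only constants
    const double grid_size = (double)nx * ny * nz * nt;
    const double word_size = 4.0;                     // single precision
    return n_io * grid_size * word_size / elapsed_s / 1e9;  // GB/s
}

E.g. acoustic is ideal_mtp_gb_s(1, 2, ...) and elastic is ideal_mtp_gb_s(9, 3, ...), giving the N_IO of 4 and 21 on the next slide.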

14 Ideal Memory Throughput, applied:

MTP Acoustic = 4 * Grid Size * 4 bytes / Time Elapsed
MTP Elastic = 21 * Grid Size * 4 bytes / Time Elapsed

N_IO Acoustic: 2 * 1 + 2 = 4. N_IO Elastic: 2 * 9 + 3 = 21.

15 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.

16 Results. All solvers include a free surface, an absorbing layer, and domain decomposition along all 3 axes (IPC if the GPUs are mappable, MPI otherwise). All solvers verified against single-CPU code. All data from the NVIDIA PSG Cluster. Thank you! Without further ado...

17 Memory Throughput on M2070. [Chart: MTP in GB/s vs. cube size; series: Memcpy, Pressure, Density, Elastic.] Real physics at 85% and 52% of the hardware limit.

18 Neighbors Don't Count! Acoustic pressure update:

po[CENTER] = 2*pcc - po[CENTER]*abs
           + vp2[CENTER] * ( pc[LEFT ] + pc[RIGHT ]
                           + pc[LEFT2] + pc[RIGHT2]
                           + pcm + pcp - 6*pcc );

If we count the 6 neighbor reads as IO operations, N_IO becomes 10 instead of 4, and our measured 100 GB/s turns into 100 / 4 * 10 = 250 GB/s "MTP" peak on the M2070, whose theoretical hardware limit is 150 GB/s. DON'T COUNT NEIGHBOR READS.

19 Memory Throughput on K10. [Chart: MTP in GB/s vs. cube size; series: Memcpy, Pressure, Density, Elastic.] Strong scaling on both K10 GPUs, at the same size and power consumption as the M2070!

20 Other GPUs. [Table: Memcpy, Pressure, Density and Elastic throughput for K10, K20X, K20, M2090, M2070, GK104.] The green cards win.

21 Multi-GPU Weak Scaling on GK104. [Charts: per-GPU MTP as % of a single GPU vs. cube size, for Density and Elastic; 2 nodes (IPC), 4 nodes (MPI PCIe3), 8 nodes (IB FDR).] PCIe 2: 6 GB/s; PCIe 3: 12 GB/s.

22 Results Summary. Defined the ideal, un-improvable memory throughput: MTP = N_IO * Grid Size * Word Size / time elapsed, N_IO = 2*DOF + Const; no neighbors or temporary variables. Came close to memcpy with real-world applications: acoustic 85%, elastic 52%, with performance proportional to memcpy on various architectures. Solvers scale on multiple GPUs.

23 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.

24 General Considerations. Respect the number 32: 32 x 8 thread-blocks; fast axis sizes multiples of 32 (can be padded); hit global memory segments and L1 cache lines (32 x 4 B = 128 B). Rely on cache: shared memory requires extra operations and needs __syncthreads(); registers are faster than shared memory; if the working set fits in cache, cache is faster.

25 First Try Acoustic Pressure.

#define EXIT_BND(xx,yy,nx,ny) \
    int xx = blockIdx.x*blockDim.x + threadIdx.x; if(xx < 1 || xx >= nx - 1) return; \
    int yy = blockIdx.y*blockDim.y + threadIdx.y; if(yy < 1 || yy >= ny - 1) return;

#define CENTER (i1   +  i2   *n1 +  i3   *n1*n2)
#define RIGHT  (i1+1 +  i2   *n1 +  i3   *n1*n2)
#define LEFT   (i1-1 +  i2   *n1 +  i3   *n1*n2)
#define RIGHT2 (i1   + (i2+1)*n1 +  i3   *n1*n2)
#define LEFT2  (i1   + (i2-1)*n1 +  i3   *n1*n2)
#define TOP    (i1   +  i2   *n1 + (i3+1)*n1*n2)
#define BOT    (i1   +  i2   *n1 + (i3-1)*n1*n2)

__global__ void pressure_gpu_het_vp2(const float* pc, float* po, const float* vp2,
                                     const int n1, const int n2, const int n3)
{
    EXIT_BND(i1,i2,n1,n2)
    int i3;
    for(i3 = 1; i3 < n3-1; i3++){
        po[CENTER] = 2*pc[CENTER] - po[CENTER]
                   + vp2[CENTER] * ( pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2]
                                   + pc[BOT] + pc[TOP] - 6*pc[CENTER] );
    }
}

Yes, that's it.
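A minimal host-side launch for this kernel, assuming the 32 x 8 thread-blocks from slide 24 (device pointer names and error handling are illustrative, not from the deck):

dim3 block(32, 8);                            // respect the number 32
dim3 grid((n1 + block.x - 1) / block.x,       // one thread per (i1,i2) column,
          (n2 + block.y - 1) / block.y);      // each thread marches along i3
pressure_gpu_het_vp2<<<grid, block>>>(pc_d, po_d, vp2_d, n1, n2, n3);
cudaDeviceSynchronize();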

26 First Try Acoustic Pressure. [Chart: MTP in GB/s vs. cube size, annotated "Yay!" at small sizes and "Boo Hoo" at large ones.] Pretty good, but not good enough.

27 First Try. Suspect: TLB misses. Translation Lookaside Buffers accelerate translation from virtual to physical memory and act like caches on the page table. "If the kernel's working set exceeds TLB capacity (or associativity) then one generates TLB capacity (or conflict) misses." (Performance Tuning of Scientific Applications)

28 Batched Execution. If the kernel's working set is too big, we'll reduce it:

__global__ void pressure_gpu_het_vp2(const float* pc, float* po, const float* vp2,
                                     const int n1, const int n2, const int n3,
                                     const int offset)
{
    EXIT_BND(i1,i2,n1,n2)
    int i3;
    for(i3 = offset+1; i3 < offset+n3-1; i3++){
        po[CENTER] = 2*pc[CENTER] - po[CENTER]
                   + vp2[CENTER] * ( pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2]
                                   + pc[BOT] + pc[TOP] - 6*pc[CENTER] );
    }
}

Launch kernel batches for the slowest axis.
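A sketch of that batched launch, assuming the kernel's n3 argument is the local batch extent and using the batch depth of 32 seen on slide 29 (the bookkeeping is illustrative, not from the deck):

const int batch = 32;                          // planes per launch along the slow axis
for (int offset = 0; offset < n3 - 2; offset += batch) {
    int depth = n3 - 2 - offset;               // interior planes left to process
    if (depth > batch) depth = batch;
    // each launch walks i3 = offset+1 .. offset+depth, keeping the
    // working set small enough for the TLB
    pressure_gpu_het_vp2<<<grid, block>>>(pc_d, po_d, vp2_d, n1, n2, depth + 2, offset);
}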

29 Batched Execution: First Try vs. Batched. [Charts: MTP in GB/s vs. cube size; series: no batch, batch 32.] Done.

30 The Density Problem. The density equation has Vp inside the difference, which means twice the number of neighbors to fetch. [Chart: Pressure vs. naïve Density, MTP in GB/s over cube size.]

31 What Problem? Add Variable! Just replace vp*current inside the derivative by a variable: at every timestep, launch a ucvp = uc*vp kernel, then launch the solver and take the ucvp derivative. Why is it slower?? We introduced an additional read+write, and that additional read+write doesn't count: same problem, same result, same performance metric formula! [Chart: Pressure vs. Add Var vs. naïve Density.] THAT problem.

32 The Density Trick. The code looks a little repetitive, we're multiplying by vp a whole lot of times:

unew = 2*ucc - uo[CENTER]*abs
     + uc[RIGHT] *v2[RIGHT]  + uc[LEFT] *v2[LEFT]
     + uc[RIGHT2]*v2[RIGHT2] + uc[LEFT2]*v2[LEFT2]
     + ucp*v2p + ucm*v2m - 6*ucc*v2c;

What if we do this:

unew = (2*ucc - uo[CENTER]*abs) / vp[CENTER]  // divide by vp for the time-step!
     + uc[RIGHT] + uc[LEFT] + uc[RIGHT2] + uc[LEFT2]
     + ucp + ucm - 6*ucc;
uo[CENTER] = unew*abs*vp[CENTER];             // store wave-fields pre-multiplied with vp!

Same memory usage, same N_IO, but fewer neighbor reads!
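Storing the wave-fields pre-multiplied with vp needs a one-time setup; a minimal sketch, assuming flat arrays of n points (kernel name is illustrative, not from the deck):

__global__ void premultiply_vp(float* u, const float* vp, const int n)
{
    // one-time conversion so the solver divides/multiplies by vp only at the
    // center point instead of multiplying every neighbor
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] *= vp[i];
}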

33 The Density Trick. The memory access pattern is now the same as pressure, so: same performance as pressure! [Chart: Pressure vs. Add Var vs. naïve Density vs. Pre-Mul.] And BOOM goes the dynamite.

34 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.

35 General Considerations. Very similar situation to the acoustic solver: fewer neighbors because of 1st derivatives, but more variables. Use batching, 32x8 thread-blocks, fast axis sizes multiples of 32. Use a staggered grid and average materials on the fly. All variables the same size, all coalesced, same stride for everyone.

36 Staggered Grid Elementary Cell. [Diagram: positions of Vx, Vy, Vz, Sxy, Sxz, Syz within the cell (Vz, Sxz, Syz out of screen), with Sxx, Syy, Szz, ρ, λ, μ collocated at the cell center.] Every grid point contains 12 elements: 3 particle velocity components Vx, Vy, Vz; 3 normal stress components Sxx, Syy, Szz; 3 shear stress components Sxy, Sxz, Syz; 3 material properties ρ, λ, μ.
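A sketch of the corresponding storage, following slide 35's "all variables the same size, same stride" (a hypothetical allocation helper, not code from the deck):

// twelve equally sized flat arrays (structure-of-arrays), so every kernel
// uses identical, coalesced indexing; error checking omitted for brevity
struct ElasticFields {
    float *vx, *vy, *vz;        // particle velocities
    float *sxx, *syy, *szz;     // normal stresses
    float *sxy, *sxz, *syz;     // shear stresses
    float *rho, *lam, *mu;      // material properties
};

void alloc_fields(ElasticFields* f, int n1, int n2, int n3)
{
    size_t bytes = (size_t)n1 * n2 * n3 * sizeof(float);
    float** all[12] = { &f->vx, &f->vy, &f->vz, &f->sxx, &f->syy, &f->szz,
                        &f->sxy, &f->sxz, &f->syz, &f->rho, &f->lam, &f->mu };
    for (int i = 0; i < 12; i++) cudaMalloc(all[i], bytes);
}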

37 Staggered Grid. Everyone is surrounded by the correct spatial difference neighbors. Materials over the Velocity and Shear positions need to be averaged.

38 Staggered Grid. [Diagram: update areas for Vx, Vy, Sxy, Sxz over Vx, Sxy over Vy, and the collocated Sxx, Syy, Szz, ρ, λ, μ; legend: area updated; ghost stress, ignored; ghost stress, updated; boundary velocity, from neighbor or boundary condition.]

39 Separate Stress and Velocity Update. [Diagram: Velocity Kernel and Stress Kernel.]

40 Separate Stress and Velocity Update. [Diagram annotations: one field needs to be at time-step t, the other at time-step t+1/2; each region is handled by a thread-block, possibly on a different SM.] Thread-block scheduling is unknown -> read redundancy.

41 Separate Stress and Shear Update. [Diagram: Velocity Kernel, Stress Kernel and Shear Kernel.]

42 Separate Stress and Shear Update. [Diagram: fused update with differing loop ranges, for i = 0 .. n-1 vs. for i = 1 .. n.] Divergence was experimentally established to be slightly worse than read redundancy.

43 Individual Kernel Performance. Normal stress has no material averaging; velocity needs to average density from 2 values, for 3 different positions; shear stress needs to average the Lamé coefficient from 4 values, for 3 different positions. [Chart: elastic kernels, MTP in GB/s vs. cube size; series: Stress, Shear, Velocity, Elastic.] Run in sequence, they suffer read redundancy.

44 Read Redundancy. Individual kernels are close to the limit, but running them in sequence introduces read redundancy:

shear stress: read 3 V, read 3 SX, read 1 M, write 3 SX (10)
normal stress: read 3 V (AGAIN), read 3 S, read L, read M (AGAIN), write 3 S (11, 4 redundant)
velocity: read 3 V (AGAIN), read 3 SX (AGAIN), read 3 S (AGAIN), read R, write 3 V (13, 9 redundant)

Total 34, 13 redundant. We could totally cheat and say we're doing 34 IO, and therefore our peak performance is 60 / 21 * 34 = 97 GB/s -> 83% of memcpy speed! It's important to know whether an algorithm has room for improvement or not. This one definitely has!

45 Implementation Summary. Respected the number 32: memory segments, warps and L1 cache lines. Relied on cache: only works if the working unit is small enough, so reduce your working units. Give the hardware maximum opportunity to parallelize: no __syncthreads(), minimum divergence.

46 Roadmap: Performance Expectations, Results, Acoustic Solver Implementation, Elastic Solver Implementation, Summary.

47 Summary. Ideal Memory Throughput: MTP = N_IO * Grid Size * Word Size / time elapsed, N_IO = 2*DOF + Const. Gflops are misleading in a memory-bound situation; counting neighbors is a crime. Real-world applications can approach memcpy throughput: acoustic 85% (100 GB/s on M2070, 180 GB/s on K10); elastic 52% (60 GB/s on M2070, 100 GB/s on K10). Physics at memcpy throughput: physics for free!

48 Every Algorithm's Dream. For fixed problem size and hardware capabilities: 3D FFT: 40 GB/s, 180 Gflops/s (read, compute, write). Acoustic: 100 GB/s, 70 Gflops/s (read, small compute, write). Memcpy: 117 GB/s, 0 Gflops/s (read, write). Which is faster?

49 References. Performance Tuning of Scientific Applications, David H. Bailey and Robert F. Lucas. GPU Performance Analysis and Optimization, Paulius Micikevicius, GTC 2012. 3D Finite Difference Computation on GPUs using CUDA, Paulius Micikevicius, 2010. Numerical Modeling in Fortran, Day 9, Paul Tackley, 2012. Questions?


51 Performance Peak Analysis.

App Throughput (TP), GB/s: application speed.
Hardware (HW) TP, GB/s: the hardware's transfer throughput; profile how many bytes are actually transferred.
HW TP Limit, GB/s: practical throughput limit (memcpy); practical instead of theoretical.

App / Limit %: 100% is ideal.
App / HW %: less than 100% means not all bytes transferred are used.
HW / Limit %: less than 100% means the memory bus is underutilized.

Less than 100% is only critical if it's substantially less.

52 Performance Peak Analysis: Density. [Table: App TP (GB/s), HW TP (GB/s), HW TP Limit (GB/s), App/Limit (%), App/HW (%), HW/Limit (%) for M2070 and GK104.] Data from a 448-cubed, 10-time-step run. Access pattern OK. Could have more concurrent memory access, especially on GK104, to increase HW utilization.

53 Register Queue GK104 Profiled. [Table: queue vs. no queue; metrics: APP time (sec), APP MTP (GB/s), instructions (10^9; replays), writes (GB), reads (GB; cache misses), reads/cube (cache misses), HW MTP (GB/s; stalls), APP/HW MTP (%).] Cache misses cause memory replays and stalls.

54 Density Trick Profiled on M2070. [Table: trick vs. naïve; same metrics as the previous slide.] Cache misses cause memory replays and stalls.

55 Profiling Notes. nvprof from CUDA toolkit 5.0: nvprof --event <event name>. inst_issued (Fermi), inst_issued1 + 2*inst_issued2 (K10): per warp, 32 instructions per count. fb_subp0_write_sectors + fb_subp1_write_sectors: 32 bytes per count. fb_subp0_read_sectors + fb_subp1_read_sectors: 32 bytes per count.
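As a sketch, the per-count conversions above as host-side helpers (hypothetical names, following the factors stated on this slide):

// sector counters tick once per 32 bytes, instruction counters once per warp
static double sectors_to_gb(unsigned long long subp0, unsigned long long subp1)
{
    return (double)(subp0 + subp1) * 32.0 / 1e9;
}
static unsigned long long warp_counts_to_instructions(unsigned long long count)
{
    return count * 32ULL;   // 32 threads per warp
}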

56 Higher Order Approximations? Reported compute bound for large stencils, so not memory bound anymore: would you prefer to pay for the bus or ride it for free? Reported higher accuracy: this assumes the function is well behaved and infinitely differentiable, which is not the case for heterogeneous media. Ironically, the free mall ride in Denver is cleaner and newer than the normal buses you actually pay for.

57 Smooth vs. Real World. Waves are smooth, for sure: sine and cosine are infinitely differentiable, so Taylor approximation seems like a good idea. Let's approximate some derivatives. But all differences are multiplied by material properties, and if the property has a step, difference x property will have a step. [Plots: smooth sin(x) vs. "real" sin(x) * 0.9 beyond a material step.] We chose a factor of 0.9 here -> not a very rough step, and the function still looks smooth.

58 Smooth vs. Heterogeneous: 1st Derivative. Leading truncation error terms by order: 2nd ~ h^2 f'''(a); 4th ~ h^4 f^(v)(a); 6th ~ h^6 f^(vii)(a); 8th ~ h^8 f^(ix)(a). [Plots: error vs. step size h (big h to small h) for 2nd, 4th, 6th and 8th order; small error throughout on the smooth function, big error on the real one.] My oh my, what do we have here?

59 Smooth vs. Heterogeneous: 2nd Derivative. Leading truncation error terms by order: 2nd ~ h^2 f^(iv)(a); 4th ~ h^4 f^(vi)(a); 6th ~ h^6 f^(viii)(a); 8th ~ h^8 f^(x)(a). [Plots: error vs. step size h for 2nd, 4th, 6th and 8th order, smooth vs. real.] All orders fail, but the higher ones seem worse.

60 1D solvers: Pressure. [Plots: sum of absolute error (1E-6) vs. points per wavelength, homogeneous and heterogeneous, for 2nd, 4th, 6th and 8th order.] Higher order is not substantially better below 6 ppw.

61 1D solvers: Stress-Velocity. [Plots: sum of absolute error (1E-9) vs. points per wavelength, homogeneous (SV Hom.) and heterogeneous (SV Het.), for 2nd, 4th, 6th and 8th order.] Higher order is WORSE below 6 ppw.

62 Higher Order Approximations? Reported that a larger time-step is possible, but a smaller time-step is required for the same resolution, and lower resolution is problematic in heterogeneous media. In conclusion: more expensive to develop, no accuracy benefits in heterogeneous media. Building a Ferrari with shopping-cart wheels is silly: you also need higher order boundary conditions, higher order time-stepping, etc. Higher order complications.

63 Kepler GK104. [Chart: GK104 vs. M2070, MTP in GB/s over cube size.] Same memcpy bandwidth -> expect the same performance.

64 What's wrong with GK104?

GK104: max 2048 threads, 256 threads/TB, occupancy 1 -> 8 TB/SM; 8 SMs x 8 TB/SM -> 64 TBs concurrently; 512 KB L2 -> 8 KB L2 per TB.
M2070: max 1536 threads, 256 threads/TB, occupancy 2/3 -> 4 TB/SM; 14 SMs x 4 TB -> 56 TBs concurrently; 768 KB L2 -> about 14 KB L2 per TB; AND 48 KB L1 -> 12 KB L1 per TB.

No need to fetch center and top: use the ancient register queue technique (3D Finite Difference Computation on GPUs using CUDA, Paulius Micikevicius, NVIDIA).
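A hedged sketch of that register queue in the pressure kernel's i3 march (following Micikevicius's technique, not code from the deck; CENTER, TOP etc. are the index macros from slide 25, which evaluate i3 at the point of use):

int i3 = 0;
float p_bot = pc[CENTER];            // plane below the first interior plane
i3 = 1;
float p_cc  = pc[CENTER];            // current plane
float p_top;
for (i3 = 1; i3 < n3-1; i3++) {
    p_top = pc[TOP];                 // the only new slow-axis fetch per step
    po[CENTER] = 2*p_cc - po[CENTER]
               + vp2[CENTER] * ( pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2]
                               + p_bot + p_top - 6*p_cc );
    p_bot = p_cc;                    // shift the queue: center and bottom
    p_cc  = p_top;                   // planes are never re-fetched
}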

65 Register Queue. [Chart: GK104 vs. M2070 over cube size; series: M2070 no reg Q, GK104 reg Q, GK104 no reg Q, M2070 reg Q.] Further improvement is more likely through increased concurrent access (more bytes in flight). Looking at compiler numbers, reducing occupancy to increase cache per TB seems like a bad idea (HW utilization limited). Fermi doesn't care, as expected. That's better.

66 The Kink. Why is the volume and pressure performance curve so jagged, and why is there a massive kink down at 384 (12*32)? Suspect: accidental locality. [Diagram: thread-blocks TB 0,0 .. TB 0,5 spread across SM 0 and SM 1; each read may prefetch a neighbor block's data or hit in cache.] Reading CENTER might prefetch someone's LEFT or RIGHT, or hit in cache; reading LEFT or RIGHT might prefetch someone's CENTER, or hit in cache.

67 The Kink. If there were no accidental locality, there would be more IO operations than necessary, and a lower performance ceiling: an unnecessary RIGHT or LEFT read: 5 instead of 4 IO -> 4/5 = 80% throughput; unnecessary RIGHT and LEFT reads: 6 instead of 4 IO -> 4/6 = 66% throughput. How to test? Create the 80% situation with a chess pattern.

68 The Kink Chess Experiment. Prevent any possibility of accidental locality by removing all neighbors (chess-board pattern), specifically:

fast axis index = (2*blockIdx.x + blockIdx.y%2) * blockDim.x + threadIdx.x;

No direct neighbors can help each other, and either the LEFT or the RIGHT overfetch becomes an unnecessary additional read. Expect 80% of peak performance: 80 GB/s, as the benchmark shows! [Chart: Pressure Normal vs. Pressure Chess over cube size.]

69 The Kink Locality Effect? Second experiment: comment out the left and right neighbor access. The results are relatively flat, not jagged. 14 SMs on the M2070, peak at 448 = 14*32? And 384 = 12*32: some especially bad locality situation? [Chart: Pressure Normal, Pressure Chess, Pressure No Left+Right over cube size.]

70 Averaged Materials. Our weakest link is obviously the shear stress kernel, most probably because of the material average. What if we pre-average and store Mue, Mue_x, Mue_y and Mue_z? Less pressure on the cache and a faster solver? Interesting.

71 Averaged Materials. The current shear stress kernel peaks at 85 GB/s -> even if it went up to 100 GB/s, overall performance wouldn't improve much. The shear stress kernel currently has 7 reads and 3 writes, total 10. Adding 3 extra Mue to read would increase the total to 13. What memory throughput would MATCH the existing version? Equal elapsed time t = s1/mtp1 = s2/mtp2 means mtp2 = mtp1 * s2/s1 = 85 GB/s * 13/10 = 110.5 GB/s. Maybe possible, but even if, it still uses much more memory. Alas.
