NVIDIA CUDA Massively Parallel Programming. Yangdong Deng, Institute of Microelectronics, Tsinghua University

1 NVIDIA CUDA Massively Parallel Programming Training Course
  Yangdong Deng, Institute of Microelectronics, Tsinghua University

2 Schedule
  Day 1: CUDA overview
  Day 2: Programming model
  Day 3: Multithreading and memory hardware
  Day 4: Performance optimization

3 Outline
  Based mainly on Dr. Mark Harris's talk at the Supercomputing Conference 2007
  Basic optimization principles
    Coalescing memory operations
    Occupancy and latency hiding
    Using shared memory
  Example 1: matrix transpose
    Coalescing and bank conflict avoidance
  Example 2: efficient parallel reductions
    Using peak performance metrics to guide optimization
    Avoiding SIMD divergence & bank conflicts
    Loop unrolling
    Using template parameters to write general-yet-optimized code
    Algorithmic strategy: cost efficiency

4 CUDA Optimization Strategies
  Optimize the algorithm for the GPU
  Use shared memory
  Parallelize effectively
  Optimize memory access coherence

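5 Optimize the Algorithm for the GPU
  Maximize independent parallelism
  Maximize arithmetic intensity (math/bandwidth)
  Recomputing is often better than accessing memory
    GPU spends its transistors on ALUs, not memory
  Compute on the GPU wherever possible to avoid transferring data to and from the CPU
    Even low-parallelism computations are often faster than frequent CPU-GPU transfers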

6 Use Shared Memory
  Hundreds of times faster than global memory
  Threads cooperate through shared memory
  Use one or a few threads to load and compute data shared by all threads in the thread block
  Use it to avoid non-coalesced access
    Stage loads and stores in shared memory to re-order non-coalesceable addressing

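7 Parallelize Effectively
  Partition the computation so that the load is balanced across the GPU's SMs
    Many threads, many thread blocks
  Keep resource usage low so that multiple thread blocks can run on each SM
    Registers, shared memory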

8 Optimize Memory Access Coherence
  Global memory latency: several hundred clock cycles
    Often the performance bottleneck
  Coalesced vs. non-coalesced
    The effect is large: an order-of-magnitude difference in performance
  Experiment:
    > Kernel: read a float, increment, write back
    > 3M floats (12 MB), times averaged over 10K runs
  12K blocks x 256 threads:
    > 356 µs: coalesced
    > 357 µs: coalesced, some threads don't participate
    > 3,494 µs: permuted/misaligned thread access
  Use texture memory to exploit spatial locality

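9 Coalescing
  Accesses to global memory are coordinated at the half-warp level (16 threads)
  The half-warp accesses one contiguous region of global memory:
    64 bytes: each thread reads a word (int, float, ...)
    128 bytes: each thread reads a double-word (int2, float2, ...)
    256 bytes: each thread reads a quad-word (int4, float4, ...)
  Additional restrictions:
    The starting address of the region must be a multiple of the region size
    The k-th thread in the half-warp must access the k-th element of the region
  Exception: some threads in the middle may not participate
    Predicated access, divergence within a warp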
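The rule on this slide can be modelled on the host as a small predicate. The sketch below is plain C++ (no GPU needed) and only encodes the conditions listed above for G80-class hardware; `isCoalesced`, `linearAccess` and `kInactive` are hypothetical names introduced for illustration and are not part of any CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Marker for a thread that does not take part in the access (assumed convention).
constexpr std::uint64_t kInactive = UINT64_MAX;

// addrs[k] is the byte address accessed by thread k of one half-warp.
// wordSize is the per-thread access size in bytes: 4, 8, or 16.
bool isCoalesced(const std::vector<std::uint64_t>& addrs, std::size_t wordSize) {
    assert(addrs.size() == 16);
    assert(wordSize == 4 || wordSize == 8 || wordSize == 16);
    const std::uint64_t regionSize = 16 * wordSize;  // 64, 128, or 256 bytes
    bool haveBase = false;
    std::uint64_t base = 0;
    for (std::size_t k = 0; k < 16; ++k) {
        if (addrs[k] == kInactive) continue;             // non-participating threads are allowed
        if (addrs[k] < k * wordSize) return false;       // cannot be element k of any region
        const std::uint64_t start = addrs[k] - k * wordSize;  // region start implied by thread k
        if (!haveBase) { base = start; haveBase = true; }
        else if (start != base) return false;            // thread k must access element k
    }
    return !haveBase || base % regionSize == 0;          // region must be aligned to its size
}

// Addresses of 16 threads that each access one element of elemSize bytes,
// starting at byte address start. Thread `skip` (if in 0..15) does not participate.
std::vector<std::uint64_t> linearAccess(std::uint64_t start, std::size_t elemSize, int skip = -1) {
    std::vector<std::uint64_t> a(16);
    for (std::size_t k = 0; k < 16; ++k)
        a[k] = (static_cast<int>(k) == skip) ? kInactive : start + k * elemSize;
    return a;
}
```

For example, `isCoalesced(linearAccess(0, 4), 4)` holds for aligned float reads, while a 12-byte element stride (the float3 case discussed on the following slides) fails the check.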

10 Coalesced Access: Reading floats

11 Uncoalesced Access: Reading floats

12 Uncoalesced float3 Code
    __global__ void accessfloat3(float3 *d_in, float3 *d_out)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        float3 a = d_in[index];   // 12-byte element: cannot be coalesced
        a.x += 2;
        a.y += 2;
        a.z += 2;
        d_out[index] = a;
    }

13 Uncoalesced float3 Code
  float3 is 12 bytes: float3 f = d_in[threadIdx.x];
  Each thread ends up executing 3 reads
    sizeof(float3) ≠ 4, 8, or 16
    Half-warp reads three 64B non-contiguous regions

14 Coalescing float3 Access
  A 3-step approach (256 threads/block)
  [Figure: data moving between global memory and shared memory in three steps]

15 Coalescing float3 Access
  Use shared memory to allow coalescing
    256 threads per block
    A thread block needs sizeof(float3) x 256 bytes of SMEM
    Each thread reads 3 scalar floats:
      Offsets: 0, (threads/block), 2*(threads/block)
      These will likely be processed by other threads, so sync
  Processing
    Each thread retrieves its float3 from the SMEM array
      Cast the SMEM pointer to (float3*)
      Use thread ID as index
    Rest of the compute code does not change!

16 Coalescing float3 Access: Code
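The listing for this slide did not survive in the transcript. As a stand-in, the following host-side C++ sketch models the index arithmetic of the 3-step approach for one thread block; it is not the original kernel. `processBlock` is a hypothetical name, the loops over `tid` play the role of the threads, and the places where a GPU kernel would need `__syncthreads()` are marked in comments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU model of one thread block processing `threads` float3 values that are
// stored as 3*threads consecutive floats (x0 y0 z0 x1 y1 z1 ...).
std::vector<float> processBlock(const std::vector<float>& globalIn, std::size_t threads) {
    assert(globalIn.size() == 3 * threads);
    std::vector<float> smem(3 * threads);       // stands in for the shared memory array
    std::vector<float> globalOut(3 * threads);

    // Step 1: in pass k, thread tid copies the float at offset tid + k*threads.
    // Consecutive tids touch consecutive 4-byte words: the coalesced pattern.
    for (std::size_t k = 0; k < 3; ++k)
        for (std::size_t tid = 0; tid < threads; ++tid)
            smem[tid + k * threads] = globalIn[tid + k * threads];
    // (a GPU kernel would call __syncthreads() here)

    // Step 2: thread tid processes "its" float3, stored at smem[3*tid .. 3*tid+2].
    for (std::size_t tid = 0; tid < threads; ++tid)
        for (std::size_t c = 0; c < 3; ++c)
            smem[3 * tid + c] += 2.0f;
    // (a GPU kernel would call __syncthreads() here)

    // Step 3: write back with the same scalar pattern as step 1.
    for (std::size_t k = 0; k < 3; ++k)
        for (std::size_t tid = 0; tid < threads; ++tid)
            globalOut[tid + k * threads] = smem[tid + k * threads];
    return globalOut;
}
```

The float a thread loads in step 1 is generally not part of the float3 it processes in step 2, which is why the synchronization points matter on the GPU.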

17 Coalescing: Structures of Size ≠ 4, 8, or 16 Bytes
  Use a structure of arrays (SoA) instead of an array of structures (AoS)
    Each member can then be accessed with coalescing
  SoA: struct SOA { float x[256], y[256], z[256]; };
  AoS: struct AOS { float x, y, z; }; struct AOS aos[256];
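To see why the SoA layout helps, it is enough to compare the byte offsets that neighbouring threads would touch. The sketch below reuses the two struct definitions from the slide; the helper functions `aosOffsetX`, `soaOffsetX` and `soaOffsetY` are hypothetical names, and a 4-byte `float` is assumed.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");

struct AOS { float x, y, z; };                       // one element of the array of structures
struct SOA { float x[256], y[256], z[256]; };        // structure of arrays

// Byte offset of member x of element k inside `AOS aos[256]`.
std::size_t aosOffsetX(std::size_t k) {
    assert(k < 256);
    return k * sizeof(AOS) + offsetof(AOS, x);
}

// Byte offset of x[k] inside one SOA object.
std::size_t soaOffsetX(std::size_t k) {
    assert(k < 256);
    return offsetof(SOA, x) + k * sizeof(float);
}

// Byte offset of y[k] inside one SOA object.
std::size_t soaOffsetY(std::size_t k) {
    assert(k < 256);
    return offsetof(SOA, y) + k * sizeof(float);
}
```

Thread k and thread k+1 are 12 bytes apart in the AoS layout but 4 bytes apart in the SoA layout, so only the latter gives a half-warp one contiguous 64-byte region per member.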

18 Coalescing: Structures of Size ≠ 4, 8, or 16 Bytes
  If SoA is not viable:
    Force alignment: __align__(X), where X = 4, 8, or 16
      struct __align__(16) { float x, y, z; };
      Wastes some space: the padding after each float3 is unused
    Or use shared memory to achieve coalescing

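19 CPU-GPU Data Transfers
  Device memory to host memory bandwidth is much lower than device memory to device bandwidth
    4 GB/s peak (PCI-e x16) vs. 80 GB/s peak (Quadro FX 5600)
  Minimize CPU-GPU data transfers
    Intermediate data structures can be allocated, operated on, and freed on the GPU without ever using CPU memory
  Group CPU-GPU data transfers
    One large transfer is much faster than many small ones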

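20 Page-Locked Memory Transfers
  cudaMallocHost(): allocates page-locked CPU memory
  Page-locked memory cannot be moved by the operating system
    Accessed through a special DMA mechanism
    The GPU accesses it directly over PCI-express
  Gives cudaMemcpy() its highest performance
    3.2 GB/s+ common on PCI-express x16
    ~4 GB/s measured on nForce 680i motherboards (overclocked PCI-e)
    See the bandwidth test CUDA SDK sample
  Caution: allocating too much page-locked memory degrades overall system performance
    Test on the target system to find the best amount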

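21 Occupancy
  When a thread has to wait on global memory, switching to another warp is the effective way to hide latency and keep the GPU busy
  Occupancy = number of warps actually running concurrently / maximum possible number of warps running concurrently
  Occupancy is maximized by tuning the threads within a thread block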

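22 Occupancy != Performance
  Raising occupancy does not necessarily raise performance directly
    Performance also depends on arithmetic intensity and other forms of parallelism
  But an SM with low occupancy cannot hide latency effectively
    Especially for memory-bound kernels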

23 Grid/Block Size Heuristics
  # of blocks / # of multiprocessors > 1
    Every SM has at least one thread block to execute
  Better: # of blocks / # of multiprocessors > 2
    Every SM has multiple thread blocks to execute
    Each thread block uses less than half of the SM's resources (shared memory and registers)
    Resources then allow multiple thread blocks to run concurrently on an SM
    Lowers the chance that all threads are waiting at __syncthreads()
  # of blocks > 100 to scale to future devices
    1000 blocks per grid will scale across multiple generations

24 Optimizing the Number of Threads per Block
  The number of threads per block should be a multiple of the warp size
    Avoids under-populated warps
  More threads make it easier to hide memory latency
  More threads leave fewer registers per thread
    The kernel may even fail to launch for lack of registers
  Heuristics
    Minimum: 64 threads per block
      Only if multiple concurrent blocks
    128 to 256 threads a better choice
      Usually still enough registers to compile and invoke successfully
    This all depends on your computation!
      Experiment!

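25 Hiding Memory Latency
  Global memory access: latency of several hundred cycles
    Instructions in a thread that depend on the memory access are blocked
  Remedies:
    More threads!
    Higher arithmetic intensity
    Coalescing memory accesses to neighboring addresses
  Example (assuming a 400-cycle latency):
    4 sequential accesses need at least 4*400 = 1,600 cycles
    4 parallel threads with 1 memory access each need at least 400 + 3 = 403 cycles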

26 Hiding Memory Latency
  One SM can run 768 threads concurrently
  The maximum thread block size is 512 threads
  Configurations with 100% occupancy:
    2 blocks x 384 threads
    3 blocks x 256 threads
    4 blocks x 192 threads
    6 blocks x 128 threads
    8 blocks x 96 threads
  To minimize memory latency: 50% or higher occupancy AND 128 or more threads/block
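The configurations listed above follow from simple integer arithmetic, which the host-side C++ sketch below reproduces. The 768-thread (24-warp) limit per SM comes from the slide; the limit of 8 resident blocks per SM is an assumption about G80-class hardware, and register and shared memory limits are ignored here. `residentWarps` and `occupancy` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>

constexpr int kWarpSize = 32;
constexpr int kMaxWarpsPerSM = 768 / kWarpSize;  // 24 warps, from the 768-thread limit
constexpr int kMaxBlocksPerSM = 8;               // assumed G80 limit on resident blocks

// Number of warps resident on one SM for a given block size,
// considering only the thread and block count limits.
int residentWarps(int threadsPerBlock) {
    assert(threadsPerBlock >= 1 && threadsPerBlock <= 512);
    const int warpsPerBlock = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    const int blocks = std::min(kMaxBlocksPerSM, kMaxWarpsPerSM / warpsPerBlock);
    return blocks * warpsPerBlock;
}

// Occupancy as defined earlier in the deck: resident warps / maximum warps.
double occupancy(int threadsPerBlock) {
    return static_cast<double>(residentWarps(threadsPerBlock)) / kMaxWarpsPerSM;
}
```

Under these assumptions, block sizes of 96, 128, 192, 256 and 384 threads all reach 24 resident warps, while 512 threads (one block) or 64 threads (eight blocks) reach only 16.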

27 Registers
  The main way to deal with memory latency = more threads on each SM
  Limiting factors:
    Number of registers per kernel
      8192 per SM, partitioned among concurrent threads
    Amount of shared memory
      16 KB per SM, partitioned among concurrent thread blocks
  Check the .cubin file for # registers / kernel
  Use the -maxrregcount=N switch when invoking NVCC
    N = desired maximum registers / kernel
    At some point spilling into local memory may occur
      Reduces performance: local memory is slow
      Check the .cubin file for local memory usage

28 Checking Resource Usage
  Compile with the -cubin flag
  Inspect the code section of the .cubin file:
    architecture {sm_10}
    abiversion {0}
    modname {cubin}
    code {
        name = BlackScholesGPU
        lmem = 0      (per-thread local memory)
        smem = 68     (per-thread-block shared memory)
        reg = 20      (per-thread registers)
        bar = 0
        bincode { 0xa x x40024c09 0x
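The register and shared memory figures read from the .cubin file can be turned into a rough estimate of how many thread blocks fit on one SM. The host-side C++ sketch below uses the per-SM limits quoted on the previous slides (8192 registers, 16 KB shared memory, 768 threads); the 8-block limit and the assumption that resources are divided without any allocation granularity are simplifications, and `SmLimits` and `blocksPerSM` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits; the first three come from the slides, maxBlocks is assumed.
struct SmLimits {
    int registers = 8192;
    int sharedBytes = 16 * 1024;
    int maxThreads = 768;
    int maxBlocks = 8;
};

// Estimated number of thread blocks that can be resident on one SM.
// Real hardware allocates in coarser units, so treat this as an upper bound.
int blocksPerSM(int threadsPerBlock, int regsPerThread, int smemPerBlock,
                SmLimits lim = SmLimits{}) {
    assert(threadsPerBlock > 0 && regsPerThread > 0 && smemPerBlock >= 0);
    const int byRegs = lim.registers / (regsPerThread * threadsPerBlock);
    const int bySmem = smemPerBlock > 0 ? lim.sharedBytes / smemPerBlock : lim.maxBlocks;
    const int byThreads = lim.maxThreads / threadsPerBlock;
    return std::min({byRegs, bySmem, byThreads, lim.maxBlocks});
}
```

With the values shown for BlackScholesGPU (reg = 20, smem = 68), this estimate gives one resident block at 256 threads per block and three at 128 threads per block, so registers rather than shared memory are the limiting resource.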

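29 Parameterizing Code
  C++ templates help adapt the code to different GPUs
  GPU configurations vary widely:
    > # of multiprocessors
    > Memory bandwidth
    > Shared memory size
    > Register file size
    > Threads per block
  Even automatic self-tuning is possible (like FFTW and ATLAS)
    Experiment on the target system to find the best configuration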

30 Outline
  Basic optimization principles
    Coalescing memory operations
    Occupancy and latency hiding
    Using shared memory
  Example 1: matrix transpose
    Coalescing and bank conflict avoidance
  Example 2: efficient parallel reductions
    Using peak performance metrics to guide optimization
    Avoiding SIMD divergence & bank conflicts
    Loop unrolling
    Using template parameters to write general-yet-optimized code
    Algorithmic strategy: cost efficiency

31 Matrix Transpose
  SDK sample ("transpose")
  Illustrates coalescing through shared memory
  The benefit of the optimization is visible even on small data sizes

32 Uncoalesced Transpose
    __global__ void transpose_naive(float *odata, float *idata, int width, int height)
    {
        unsigned int xindex = blockDim.x * blockIdx.x + threadIdx.x;
        unsigned int yindex = blockDim.y * blockIdx.y + threadIdx.y;
        if (xindex < width && yindex < height)
        {
            unsigned int index_in  = xindex + width * yindex;   // coalesced read
            unsigned int index_out = yindex + height * xindex;  // strided, uncoalesced write
            odata[index_out] = idata[index_in];
        }
    }

33 Uncoalesced Transpose

34 Coalesced Transpose
  Assumption: the matrix has been partitioned into square tiles
  Thread block (bx, by):
    Read the (bx, by) input tile, store into SMEM
    Write the SMEM data to (by, bx) output tile
  Thread (tx, ty):
    Reads element (tx, ty) from input tile
    Writes element (tx, ty) into output tile
  Coalescing is achieved if:
    Block/tile dimensions are multiples of 16

35 Coalesced Transpose

36 Coalesced Transpose
    __global__ void transpose(float *odata, float *idata, int width, int height)
    {
        __shared__ float block[BLOCK_DIM*BLOCK_DIM];
        unsigned int xblock = blockDim.x * blockIdx.x;
        unsigned int yblock = blockDim.y * blockIdx.y;
        unsigned int xindex = xblock + threadIdx.x;
        unsigned int yindex = yblock + threadIdx.y;
        unsigned int index_out, index_transpose;
        if (xindex < width && yindex < height)
        {
            unsigned int index_in = width * yindex + xindex;
            unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
            block[index_block] = idata[index_in];               // coalesced read into the tile
            index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
            index_out = height * (xblock + threadIdx.y) + yblock + threadIdx.x;
        }
        __syncthreads();
        if (xindex < width && yindex < height)
            odata[index_out] = block[index_transpose];          // coalesced write of the transposed tile
    }

37 Shared Memory Optimization
  Threads read SMEM with stride = 16
    Bank conflicts
  Solution
    Allocate an extra column
    Read stride = 17
    Threads read from consecutive banks
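The effect of the extra column can be checked with a few lines of host-side C++. The sketch assumes the G80 shared memory organisation used throughout this deck (16 banks of 4-byte words, requests issued per half-warp of 16 threads); `conflictDegree` is a hypothetical helper that reports how many threads of a half-warp hit the busiest bank.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Thread tid of a half-warp reads the 4-byte word at index tid * strideWords.
// Returns the largest number of threads that map to the same bank
// (1 means conflict-free, 16 means fully serialized).
int conflictDegree(int strideWords) {
    assert(strideWords >= 1);          // stride 0 would be a broadcast, not modelled here
    std::array<int, 16> hits{};        // one counter per bank, zero-initialized
    for (int tid = 0; tid < 16; ++tid)
        ++hits[(tid * strideWords) % 16];
    return *std::max_element(hits.begin(), hits.end());
}
```

A stride of 16 words sends all 16 threads to bank 0, whereas a stride of 17 spreads them over 16 different banks, which is what the padded `[BLOCK_DIM][BLOCK_DIM+1]` tile achieves.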

38 Coalesced Transpose with Shared Memory Optimization
    __global__ void transpose_exp(float *odata, float *idata, int width, int height)
    {
        __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];   // extra column avoids bank conflicts

        unsigned int xindex = blockIdx.x * BLOCK_DIM + threadIdx.x;
        unsigned int yindex = blockIdx.y * BLOCK_DIM + threadIdx.y;
        if ((xindex < width) && (yindex < height))
        {
            unsigned int index_in = xindex + yindex * width;
            block[threadIdx.y][threadIdx.x] = idata[index_in];
        }
        __syncthreads();

        xindex = blockIdx.y * BLOCK_DIM + threadIdx.x;
        yindex = blockIdx.x * BLOCK_DIM + threadIdx.y;
        if ((xindex < height) && (yindex < width))
        {
            unsigned int index_out = yindex * height + xindex;
            odata[index_out] = block[threadIdx.x][threadIdx.y];
        }
    }

39 Transpose Timings
  Speedups with coalescing:
    128x128:    0.011 ms vs. 0.022 ms (2.0X speedup)
    512x512:    0.07 ms vs. 0.33 ms (4.5X speedup)
    1024x1024:  0.30 ms vs. 1.92 ms (6.4X speedup)
    1024x2048:  0.79 ms vs. 6.6 ms (8.4X speedup)
  (Note: removing the shared memory bank conflicts contributes a ~10% speedup)

40 Outline
  Basic optimization principles
    Coalescing memory operations
    Occupancy and latency hiding
    Using shared memory
  Example 1: matrix transpose
    Coalescing and bank conflict avoidance
  Example 2: efficient parallel reductions
    Using peak performance metrics to guide optimization
    Avoiding SIMD divergence & bank conflicts
    Loop unrolling
    Using template parameters to write general-yet-optimized code
    Algorithmic strategy: cost efficiency

41 Parallel Reduction
  Common and important data parallel primitive
  Easy to implement in CUDA
    Harder to get it right
  Serves as a great optimization example
    Step by step through 7 different versions
    Demonstrates several important optimization strategies

42 Parallel Reduction
  Tree-based approach used within each thread block
  Multiple thread blocks are needed
    To process very large arrays
    Each thread block reduces a portion of the array
  How do we communicate between thread blocks?

43 Problem: Global Synchronization
  If all thread blocks could be synchronized, parallel reduction would be easy to implement:
    Global sync after each thread block finishes its computation
    Once all threads reach the sync point, continue recursively

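44 Problem: Global Synchronization
  But CUDA has no global synchronization:
    The more processors, the more expensive the synchronization hardware
    Deadlock has to be avoided
  Solution: decompose into multiple kernels
    A kernel launch serves as a global synchronization point
    Kernel launch has negligible hardware overhead and very low software overhead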

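45 Parallel Reduction: Decomposition
  Global synchronization is achieved by decomposing the computation into multiple kernel invocations
  For parallel reduction, an added benefit is that the code for every level can be identical
    The kernel is invoked recursively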

46 Optimization Goal
  Strive for peak GPU performance!
  Choose the right metric:
    GFLOP/s: for compute-bound kernels
      e.g., QR factorization, convolution, FIR filter, ...
    Bandwidth: for memory-bound kernels
      e.g., database, video playback, ...
    Both
      e.g., pattern matching, singular value decomposition, ...
  Reductions have very low arithmetic intensity
    One floating-point operation per array element
    So aim for peak bandwidth!
  G80 GPU
    384-bit memory interface, 900 MHz DDR
    384 * 900 * 2 / 8 = 86.4 GB/s
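The peak-bandwidth arithmetic on this slide, and the effective bandwidth that a measured kernel time corresponds to, can be written as two small host-side C++ functions. Both function names are hypothetical, and the effective-bandwidth helper assumes the caller supplies the total number of bytes read and written.

```cpp
#include <cassert>

// Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes).
// busWidthBits: memory interface width; memClockMHz: memory clock;
// transfersPerClock: 2 for DDR memory.
double peakBandwidthGBs(int busWidthBits, double memClockMHz, int transfersPerClock) {
    assert(busWidthBits > 0 && memClockMHz > 0.0 && transfersPerClock > 0);
    const double bytesPerTransfer = busWidthBits / 8.0;
    return bytesPerTransfer * memClockMHz * 1e6 * transfersPerClock / 1e9;
}

// Effective bandwidth in GB/s achieved by a kernel that moved `bytes` bytes
// in `seconds` seconds.
double effectiveBandwidthGBs(double bytes, double seconds) {
    assert(bytes >= 0.0 && seconds > 0.0);
    return bytes / seconds / 1e9;
}
```

For the G80 figures above, `peakBandwidthGBs(384, 900.0, 2)` evaluates to 86.4. Comparing a kernel's effective bandwidth against this number is the metric used for the reduction example that follows.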

47 Parallel Reduction: Interleaved Addressing

48 Reduction #1: Interleaved Addressing
    __global__ void reduce0(int *g_idata, int *g_odata)
    {
        extern __shared__ int sdata[];

        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();

        // do reduction in shared mem
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2*s) == 0)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }

49 Parallel Reduction: Interleaved Addressing
  [Figure: reduction tree with a sync after every step]

50 Reduction #1: Interleaved Addressing
  Same kernel as on slide 48; the inner loop is the problem:
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
  Problem: highly divergent branching results in very poor performance!

51 Performance for 4M Element Reduction

52 Reduction #2: Interleaved Addressing
  Replace the divergent branch in the inner loop:
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
  with a strided index and a non-divergent branch:
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

53 Reduction #2: Interleaved Addressing
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

54 Performance for 4M Element Reduction

55 Reduction #2: Interleaved Addressing
  Bank conflict!
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }
  [Figure: mapping of threads 0-15 to shared memory banks 0-15 at step s=2]

56 Parallel Reduction: Sequential Addressing
  Bank conflict free

57 Reduction #3: Sequential Addressing
  Just replace strided indexing in inner loop:
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }
  With reversed loop and threadID-based indexing:
    for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
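The difference between the three indexing schemes seen so far can be made concrete by counting, on the host, how many warps have at least one thread doing an addition in each loop iteration. The C++ sketch below is a counting model only, with hypothetical names (`Scheme`, `isActive`, `activeWarpSteps`); it assumes 32-thread warps and a power-of-two block size, and it says nothing about bank conflicts.

```cpp
#include <cassert>

enum class Scheme { InterleavedModulo, InterleavedStrided, Sequential };

// True if thread tid performs an addition in the loop iteration with stride s.
bool isActive(Scheme scheme, int tid, int s, int blockDim) {
    switch (scheme) {
        case Scheme::InterleavedModulo:  return tid % (2 * s) == 0;       // Reduction #1
        case Scheme::InterleavedStrided: return 2 * s * tid < blockDim;   // Reduction #2
        case Scheme::Sequential:         return tid < s;                  // Reduction #3
    }
    return false;
}

// Sums, over all loop iterations, the number of warps that contain at least
// one active thread and therefore have to execute the addition.
int activeWarpSteps(Scheme scheme, int blockDim) {
    assert(blockDim >= 32 && (blockDim & (blockDim - 1)) == 0);
    int total = 0;
    for (int s = 1; s < blockDim; s *= 2) {          // same set of strides for all schemes
        for (int warpStart = 0; warpStart < blockDim; warpStart += 32) {
            bool anyActive = false;
            for (int lane = 0; lane < 32; ++lane)
                anyActive = anyActive || isActive(scheme, warpStart + lane, s, blockDim);
            total += anyActive ? 1 : 0;
        }
    }
    return total;
}
```

For a 128-thread block the modulo scheme keeps almost every warp busy in almost every iteration, while the strided and sequential schemes pack the active threads into the lowest warps. In this model the step from #2 to #3 does not change the warp count; its benefit lies in the bank-conflict-free addressing.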

58 Performance for 4M Element Reduction

59 Observation: Idle Threads
  Problem:
    for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
  Half of the threads are idle in the first loop iteration!
    Half of the resources are wasted

60 Reduction #4: First Add During Load
  Halve the number of threads per block and replace the single load:
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();
  with two loads and the first add of the reduction:
    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();

61 Reduction #4: First Add During Load
  [Figure: global memory loads by threads t0 ... t(blockDim-1) in Reduction #3 vs. Reduction #4]

62 Performance for 4M Element Reduction

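63 Performance Bottleneck
  At 17 GB/s we are still far from the 80 GB/s peak bandwidth
    Reduction is known to have very low arithmetic intensity
  So instruction overhead is a likely bottleneck
    Ancillary instructions, i.e., instructions that are not loads, stores, or arithmetic
    In other words: address arithmetic and loop overhead
  Strategy: unroll loops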

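64 Unrolling the Last Warp
  As the reduction proceeds, the number of active threads decreases
    Once no more than 32 threads are active, only the first warp is left
    The other warps have finished their computation
  Instructions within a warp are implicitly synchronized
    Why? Scoreboarding automatically maintains synchronization
  So once no more than 32 threads are active:
    __syncthreads() is not needed
    if (tid < s) is not needed
  Unroll the last 6 iterations of the inner loop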

65 Reduction #5: Unroll the Last Warp
  Before:
    for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
  After:
    for (unsigned int s = blockDim.x/2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // Relies on G80-era warp-synchronous execution; current CUDA versions
    // additionally need volatile shared memory or __syncwarp() here.
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }

66 Reduction #5: Unroll the Last Warp
    for (unsigned int s = blockDim.x/2; s > 32; s >>= 1) {
        if (tid < s)   // To save work!
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }

67 Performance for 4M Element Reduction

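68 Completely Unrolled
  If the iteration count is known at compile time, the loop can be unrolled completely
    Fortunately a GPU thread block has at most 512 threads
    And we normally use a power-of-two number of threads per block
  So the loop can be unrolled for any fixed block size
  How do we keep the code generic?
    The block size is not known at compile time, so how can the loop be unrolled?
    The same kernel function is launched with different configurations
  Templates!
    CUDA supports C++ template parameters
    On both device and host functions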

69 Unrolling with Templates
  Specify block size as a function template parameter:
    template <unsigned int blocksize>
    __global__ void reduce5(int *g_idata, int *g_odata)

70 Reduction #6: Completely Unrolled
    if (blocksize >= 512) {
        if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads();
    }
    if (blocksize >= 256) {
        if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads();
    }
    if (blocksize >= 128) {
        if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads();
    }
    if (tid < 32) {
        if (blocksize >= 64) sdata[tid] += sdata[tid + 32];
        if (blocksize >= 32) sdata[tid] += sdata[tid + 16];
        if (blocksize >= 16) sdata[tid] += sdata[tid + 8];
        if (blocksize >= 8) sdata[tid] += sdata[tid + 4];
        if (blocksize >= 4) sdata[tid] += sdata[tid + 2];
        if (blocksize >= 2) sdata[tid] += sdata[tid + 1];
    }
  All code in red on the original slide (the comparisons against blocksize) is evaluated at compile time.

71 Invoking the Templated Kernel
  Do we still need to fix the block size at compile time?
    No! A switch statement over the 10 possible block sizes is enough:
    switch (threads)
    {
    case 512:
        reduce5<512><<< dimgrid, dimblock, smemsize >>>(d_idata, d_odata); break;
    case 256:
        reduce5<256><<< dimgrid, dimblock, smemsize >>>(d_idata, d_odata); break;
    case 128:
        reduce5<128><<< dimgrid, dimblock, smemsize >>>(d_idata, d_odata); break;
    case 64:
        reduce5< 64><<< dimgrid, dimblock, smemsize >>>(d_idata, d_odata); break;
    case 32:
        reduce5< 32><<< dimgrid, dimblock, smemsize >>>(d_idata, d_odata); break;
    case 16:
        reduce5< 16><<< dimgrid, dimblock, smemsize >>>(d_idata, d_odata); break;
    case 8:
        reduce5<  8><<< dimgrid, dimblock, smemsize >>>(d_idata, d_odata); break;
    case 4:
        reduce5<  4><<< dimgrid, dimblock, smemsize >>>(d_idata, d_odata); break;
    case 2:
        reduce5<  2><<< dimgrid, dimblock, smemsize >>>(d_idata, d_odata); break;
    case 1:
        reduce5<  1><<< dimgrid, dimblock, smemsize >>>(d_idata, d_odata); break;
    }

72 Performance for 4M Element Reduction
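The combination of a compile-time block size and a run-time switch can be tried out without a GPU. The sketch below is a host-side C++ analogue, not the kernel itself: the loop over `tid` stands in for the threads of one block, and `reduceStep`, `reduceBlock` and `reduceDispatch` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// One tree level with stride s. blockSize and s are template parameters, so
// the condition is a compile-time constant and the compiler can remove the
// statement entirely when it is false.
template <unsigned int blockSize, unsigned int s>
void reduceStep(std::vector<int>& sdata) {
    if (blockSize >= 2 * s)
        for (unsigned int tid = 0; tid < s; ++tid)   // "threads" 0..s-1
            sdata[tid] += sdata[tid + s];
}

// Fully unrolled reduction of one block of blockSize elements.
template <unsigned int blockSize>
int reduceBlock(std::vector<int> sdata) {
    assert(sdata.size() == blockSize);
    reduceStep<blockSize, 256>(sdata);
    reduceStep<blockSize, 128>(sdata);
    reduceStep<blockSize, 64>(sdata);
    reduceStep<blockSize, 32>(sdata);
    reduceStep<blockSize, 16>(sdata);
    reduceStep<blockSize, 8>(sdata);
    reduceStep<blockSize, 4>(sdata);
    reduceStep<blockSize, 2>(sdata);
    reduceStep<blockSize, 1>(sdata);
    return sdata[0];
}

// Run-time dispatch to the matching compile-time instantiation.
int reduceDispatch(const std::vector<int>& data) {
    switch (data.size()) {
        case 512: return reduceBlock<512>(data);
        case 256: return reduceBlock<256>(data);
        case 128: return reduceBlock<128>(data);
        case 64:  return reduceBlock<64>(data);
        case 32:  return reduceBlock<32>(data);
        case 16:  return reduceBlock<16>(data);
        case 8:   return reduceBlock<8>(data);
        case 4:   return reduceBlock<4>(data);
        case 2:   return reduceBlock<2>(data);
        case 1:   return reduceBlock<1>(data);
        default:
            assert(false && "block size must be a power of two up to 512");
            return 0;
    }
}
```

Each `case` instantiates its own specialised function, so the generic source still yields ten fully unrolled variants, which mirrors the design choice on the slide.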

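73 Brent's Theorem
  Let W(N) be the number of operations that a parallel algorithm A performs within running time T(N).
  Then A can be executed on P processors in time t(N) = O(W(N)/P(N) + T(N)).
  Complexity measures for parallel algorithms:
    Running time t(N): includes computation time and communication time
    Number of processors P(N)
    Cost c(N): c(N) = t(N) P(N)
    Total work W(N): the total number of operations the parallel algorithm performs to solve the problem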

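74 Parallel Reduction Complexity
  Number of processors P(N)
    P threads executing in parallel (P processors)
  Total work W(N): the total number of operations the parallel algorithm performs
    Completed in log(N) parallel steps; step S performs N/2^S independent operations
    T(N) = log N
    For N = 2^D, W(N) = sum over S in [1..D] of 2^(D-S) = N - 1 operations
    Overall complexity O(N): work-efficient
      > i.e., it does no more work than the sequential algorithm
  Running time t(N): includes computation time and communication time
    t(N) = O(W(N)/P + T(N)) = O(N/P + log N)
    In the CUDA model we think in threads: with O(N) threads in one thread block, N = P and t(N) = O(log N)
  Cost c(N): c(N) = t(N) P(N) = O(N log N)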

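75 Algorithm Cost
  The cost of a parallel algorithm is defined as #processors x time complexity
    c(N) = t(N) P(N) = O(N log N): not cost efficient!
  Brent's theorem suggests O(N/log N) threads
    Each thread does O(log N) sequential work
      The number of threads drops by a factor of log N
      Each thread does log N times more work
    All O(N/log N) threads then cooperate for O(log N) steps
    Cost = O((N/log N) * log N) = O(N)
  Also known as algorithm cascading
    Often yields significant speedups in practice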
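The cost comparison above can be checked with concrete numbers. The host-side C++ sketch below counts parallel steps for a plain tree reduction with N threads and for a cascaded reduction with N/log N threads. It is a simplified model under stated assumptions: N and log2(N) are both powers of two, the sequential phase is counted as log2(N) steps, and `Cost`, `log2Exact`, `treeReduction` and `cascadedReduction` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct Cost {
    std::uint64_t threads;  // P(N)
    std::uint64_t steps;    // t(N), counted in parallel steps
    std::uint64_t cost;     // c(N) = t(N) * P(N)
};

// log2 of a power of two.
unsigned log2Exact(std::uint64_t n) {
    assert(n > 0 && (n & (n - 1)) == 0);
    unsigned d = 0;
    while (n > 1) { n >>= 1; ++d; }
    return d;
}

// Tree reduction with P = N threads: log2(N) steps.
Cost treeReduction(std::uint64_t n) {
    const std::uint64_t steps = log2Exact(n);
    return Cost{n, steps, n * steps};
}

// Cascaded reduction: N / log2(N) threads each sum log2(N) elements
// sequentially, then a tree reduction combines the per-thread partial sums.
Cost cascadedReduction(std::uint64_t n) {
    const std::uint64_t chunk = log2Exact(n);          // elements per thread
    assert(chunk > 0 && n % chunk == 0);
    const std::uint64_t threads = n / chunk;
    const std::uint64_t steps = chunk + log2Exact(threads);
    return Cost{threads, steps, threads * steps};
}
```

For N = 65536 the tree version has cost 65536 * 16 = 1,048,576, while the cascaded version uses 4096 threads for 28 steps, a cost of 114,688, which is within a small constant factor of N.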

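76 Algorithm Cascading
  Combine sequential and parallel reduction
    Each thread loads and sums multiple elements into shared memory
    Tree-based reduction in shared memory
  Brent's theorem says each thread should sum O(log N) elements
    i.e., 1024 or 2048 elements per block rather than 256
  In practice it can pay to go even further
    More work per thread helps hide memory latency
    More threads per block reduces the number of levels of recursive kernel invocation
    > In the last levels of the recursion there are few thread blocks, so the relative kernel launch overhead is higher
  On G80, best performance with blocks of 128 threads
    i.e., 1024 to 4096 elements per block, 8 to 32 elements per thread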

77 Reduction #7: Multiple Adds / Thread
  Replace the load and add of two elements:
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();
  with a while loop that adds as many elements as necessary:
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blocksize*2) + threadIdx.x;
    unsigned int gridsize = blocksize*2*gridDim.x;
    sdata[tid] = 0;
    while (i < n) {
        sdata[tid] += g_idata[i] + g_idata[i+blocksize];
        i += gridsize;
    }
    __syncthreads();

78 Reduction #7: Multiple Adds / Thread
  The global memory address advances by gridsize in every loop iteration
    gridsize should be a multiple of 16
    After each increment the next global memory address then still meets the alignment requirement, which preserves coalescing!
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blocksize*2) + threadIdx.x;
    unsigned int gridsize = blocksize*2*gridDim.x;
    sdata[tid] = 0;
    while (i < n) {
        sdata[tid] += g_idata[i] + g_idata[i+blocksize];
        i += gridsize;
    }
    __syncthreads();

79 Performance for 4M Element Reduction
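The index arithmetic of the grid-stride loop is easy to get wrong, so it can be worth checking on the host that every element is visited exactly once. The C++ sketch below emulates the loop for all (block, thread) pairs; it is a CPU model rather than the kernel, `cascadedSum` is a hypothetical name, and it assumes that the array length is a multiple of 2*blockSize so that `i + blockSize` never runs past the end.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates the per-thread while loop of Reduction #7 for a grid of gridDim
// blocks with blockSize threads each, and adds up all per-thread partial sums.
long long cascadedSum(const std::vector<int>& in, std::size_t blockSize, std::size_t gridDim) {
    const std::size_t n = in.size();
    assert(blockSize > 0 && gridDim > 0 && n % (2 * blockSize) == 0);
    const std::size_t gridSize = blockSize * 2 * gridDim;   // stride of the loop
    long long total = 0;
    for (std::size_t block = 0; block < gridDim; ++block) {
        for (std::size_t tid = 0; tid < blockSize; ++tid) {
            long long partial = 0;                           // plays the role of sdata[tid]
            for (std::size_t i = block * (blockSize * 2) + tid; i < n; i += gridSize)
                partial += in[i] + in[i + blockSize];
            total += partial;   // stands in for the shared-memory tree and later kernel passes
        }
    }
    return total;
}
```

If the result equals the plain sum of the input for several grid sizes, the loop covers the array without gaps or overlaps. Note that with blockSize a multiple of 16, gridSize is automatically a multiple of 16 as the slide requires.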

80 Final Optimized Kernel
    template <unsigned int blocksize>
    __global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
    {
        extern __shared__ int sdata[];

        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*(blocksize*2) + tid;
        unsigned int gridsize = blocksize*2*gridDim.x;
        sdata[tid] = 0;

        // bounds are checked before the first load, as on slide 77
        while (i < n) {
            sdata[tid] += g_idata[i] + g_idata[i+blocksize];
            i += gridsize;
        }
        __syncthreads();

        if (blocksize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
        if (blocksize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
        if (blocksize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

        if (tid < 32) {
            if (blocksize >= 64) sdata[tid] += sdata[tid + 32];
            if (blocksize >= 32) sdata[tid] += sdata[tid + 16];
            if (blocksize >= 16) sdata[tid] += sdata[tid + 8];
            if (blocksize >= 8) sdata[tid] += sdata[tid + 4];
            if (blocksize >= 4) sdata[tid] += sdata[tid + 2];
            if (blocksize >= 2) sdata[tid] += sdata[tid + 1];
        }

        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }

81 Performance Comparison

82 Parallel Reduction Summary
  Algorithmic optimizations
    Changes to addressing, algorithm cascading
    11.84x speedup, combined!
  Code optimizations
    Loop unrolling
    2.54x speedup, combined

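83 Summary of Optimization Techniques
  Main factors affecting CUDA performance:
    Memory coalescing
    Divergent branching
    Bank conflicts
    Latency hiding
  Use peak performance metrics to guide optimization
  Understand the complexity theory of parallel algorithms
  Know how to identify the type of bottleneck
    e.g., memory, core computation, or instruction overhead
  Optimize the algorithm first, then unroll loops
  Use template parameters to optimize code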

84 Final Notes
  "People who are really serious about software should make their own hardware." Alan Kay

85 Final Notes
  "The wise are free from perplexity, the benevolent from anxiety, and the brave from fear." Confucius
