High Performance Linear Algebra on Data Parallel Co-Processors I
2 A bit of history [Chart: GPU vs. CPU performance in GFLOPS, 2003-2006. NVIDIA GPUs (NV30, NV35, NV40, G70, G70-512, G71, G80) pull away from an Intel Core2 Duo 3.0 GHz. Original data: Mark Harris (NVIDIA), 2007.]
3 A bit of history [Diagram, built up over several slides: the classic graphics pipeline. The CPU feeds the GPU; data flows through Video Memory, Vertex Processor, Raster Unit, and Fragment Processor to the Render Target.]
9 Modern graphics card [Diagram: Host and Input Assembler feed Setup/Raster/ZCull and a Thread Scheduler, which dispatches work to an array of multiprocessors, each with shared memory (SMEM) and an L1 cache, connected through L2 caches to the memory partitions (MEM).]
10 Data parallel co-processors [Diagram: the same chip viewed as a compute device. Host, Input Assembler, and Thread Scheduler feed the multiprocessors with their SMEM and L1 caches, backed by L2 caches and memory partitions; the graphics-specific raster stages are gone.]
11 Array of multiprocessors The analogue of a multi-core chip, with each multiprocessor corresponding to one core, but the cores are vectorized. Task parallel execution with independent program counters. Between 2 and 16 multiprocessors on one chip.
12 Multiprocessor Data parallel processing unit. Thread block: a set of threads that is logically processed together on a multiprocessor. Thread warp: a set of threads that is physically processed together on a multiprocessor; its size is the width of the processor, typically 16 or 32. Thread: processes one data element; extremely lightweight.
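To make the thread hierarchy concrete, a minimal sketch of how block, warp and thread indices show up in a kernel; the kernel name and the warp size of 32 are illustrative assumptions, not taken from the slides.

__global__ void whoAmI( int* warp_of_thread )
{
    int tid  = threadIdx.x + blockDim.x * blockIdx.x;   // global thread index
    int warp = threadIdx.x / 32;                        // warp index within the block (assuming warp size 32)
    warp_of_thread[tid] = warp;                         // each lightweight thread handles one element
}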
25 Data parallel co-processors [Diagram: the L2 caches and memory partitions of the previous view are abstracted into a single Global Memory shared by all multiprocessors.]
27 Memory vs. Execution Task parallel array of multiprocessors: global memory, visible to host and device, very high latency. Thread block per multiprocessor: local memory (shared memory/L1), visible to the thread block only, low latency. Warp of threads / single thread: registers, visible to the thread only, very low latency.
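A small sketch of how the three levels of this hierarchy typically appear in device code; the kernel and the tile size of 256 are illustrative assumptions.

__global__ void levels( const float* in, float* out )      // in/out live in global memory (high latency)
{
    __shared__ float tile[256];                             // shared memory: thread block only, low latency
    int tid = threadIdx.x + blockDim.x * blockIdx.x;        // assumes blockDim.x <= 256 and a grid covering the input

    float x = in[tid];                                      // register: thread only, very low latency
    tile[threadIdx.x] = x;                                  // stage the value in shared memory
    __syncthreads();

    out[tid] = tile[threadIdx.x];                           // write back to global memory
}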
32 Thread level parallelism [Diagram, built up over several slides: the data array is processed element-wise by many parallel threads.] http://www.cs.berkeley.edu/~volkov/volkov10-gtc.pdf
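A hedged sketch of thread level parallelism in this sense: one thread per data element, so latency is hidden by having many threads in flight. The saxpy-style kernel is an illustration, not part of the slides.

__global__ void saxpy( int n, float a, const float* x, float* y )
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;   // one thread per element
    if (i < n)                                       // guard threads past the end of the data
        y[i] = a * x[i] + y[i];
}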
42 Instruction Level Parallelism
c = c * b + a;
d = d * b + a;
e = e * b + a;
...
Instructions issue without blocking; only dependent operations stall the multiprocessor.
46 Instruction Level Parallelism http://www.cs.berkeley.edu/~volkov/volkov10-gtc.pdf
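As a hedged sketch of the same idea inside one thread: the accumulators c, d and e form independent dependency chains, so their multiply-adds can be issued back to back instead of waiting on each other. The kernel name and the loop are illustrative assumptions.

__global__ void ilp( float a, float b, float* out, int iters )
{
    float c = 1.0f, d = 2.0f, e = 3.0f;   // three independent accumulators
    for (int i = 0; i < iters; ++i)
    {
        c = c * b + a;                     // chain 1
        d = d * b + a;                     // chain 2, independent of chain 1
        e = e * b + a;                     // chain 3, independent of chains 1 and 2
    }
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    out[tid] = c + d + e;                  // keep the results observable (assumes the grid matches out)
}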
47 Memory latency Volkov, V., and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. 2008.
49 Ideal program flow 1. Load data from global to shared memory. 2. Process data in shared memory. 3. Write data from shared to global memory.
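A bare skeleton of this three-step flow, as one possible reading of the slide; the convolution kernel later in the deck is a concrete instance. The tile size and the placeholder operation are assumptions.

__global__ void idealFlow( const float* in, float* out )
{
    __shared__ float tile[512];                       // assumes blockDim.x == 512
    int tid = threadIdx.x + blockDim.x * blockIdx.x;

    tile[threadIdx.x] = in[tid];                      // 1. load from global to shared memory
    __syncthreads();

    float result = 2.0f * tile[threadIdx.x];          // 2. process the data in shared memory (placeholder op)

    out[tid] = result;                                // 3. write the result back to global memory
}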
50 CUDA Compute Unified Device Architecture. Proprietary solution by NVIDIA. But very similar to OpenCL. Low-level software interface based on ANSI C. Explicit parallelization, memory handling,... Latest hardware has full C++ support.
51 CUDA execution model A program is executed as a grid of thread blocks. Each thread block and each thread has a unique {1,2,3}D identifier.
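To make the {1,2,3}D identifiers concrete, a hedged sketch of a 2D launch, e.g. over an image; the kernel name, image layout and block size are assumptions.

__global__ void fill2D( float* img, int width, int height )
{
    int x = threadIdx.x + blockDim.x * blockIdx.x;   // 2D thread identifier
    int y = threadIdx.y + blockDim.y * blockIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 1.0f;
}

// Host side: a 2D grid of 16x16 thread blocks covering the whole image.
// dim3 block(16, 16);
// dim3 grid( (width + 15) / 16, (height + 15) / 16 );
// fill2D<<<grid, block>>>( d_img, width, height );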
55 CUDA program structure Kernel: function executed on the device that can be called from the host. Host program: Device and memory initialization and execution management. C/C++/Fortran program.
56 CUDA Host Extensions Kernel invocation: mykernel<<< grid_dim, block_dim >>>(in, out) Memory management: cudaMalloc(), cudaMemcpy(), cudaFree(), ... Device management: cudaGetDeviceCount(), cudaSetDevice(), ...
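Since the example later in the deck only exercises the memory calls, a minimal host program for the device management calls; purely illustrative.

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);              // number of visible CUDA devices
    printf("%d device(s) found\n", count);
    if (count > 0)
        cudaSetDevice(0);                    // select device 0 for all subsequent calls
    return 0;
}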
57 CUDA Device Extensions Memory declaration: __shared__ float smem[1024] Synchronisation (barrier): __syncthreads() Atomic operations: atomicAdd(), atomicExch(), atomicXor(), ... Thread identifiers: threadIdx, blockIdx, blockDim, ...
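A hedged sketch that combines these device-side extensions: a shared-memory counter updated with atomicAdd and barriers around it. The kernel name and the counting task are assumptions.

__global__ void countPositive( const float* in, int* block_counts )
{
    __shared__ int count;
    if (threadIdx.x == 0) count = 0;                    // one thread initializes the shared counter
    __syncthreads();

    int tid = threadIdx.x + blockDim.x * blockIdx.x;    // assumes the grid exactly covers the input
    if (in[tid] > 0.0f)
        atomicAdd(&count, 1);                           // threads of the block update the counter atomically
    __syncthreads();

    if (threadIdx.x == 0)
        block_counts[blockIdx.x] = count;               // per-block result to global memory
}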
58 CUDA Program Flow 1. Upload data to process into device memory. 2. Define execution environment. 3. Launch kernel. Read input data into shared memory. Process data. Write result from shared to global memory. 4. Read result back to host memory.
59 CUDA Example Simple 1D convolution: a[x] = b[x-1] + 3.0*b[x] + b[x+1]. Assume infinite signals and ignore the necessary block overlap. [Diagram, built up over several slides: each output element a[x] is computed from three neighbouring elements of b; the input signal is split into blocks, one per thread block.]
67 CUDA Example
__global__ void conv( float* in, float* out )
{
    __shared__ float smem[1024];

    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    smem[threadIdx.x] = in[tid];
    __syncthreads();

    float res = smem[threadIdx.x-1];
    res += 3.0 * smem[threadIdx.x];
    res += smem[threadIdx.x+1];

    out[tid] = res;
}
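The kernel reads smem[threadIdx.x-1] and smem[threadIdx.x+1], which fall outside the tile for the first and last thread of each block; the slides deliberately ignore this overlap. A hedged sketch of one way to load the missing halo elements; the two extra shared-memory slots and the 512-thread block are assumptions, not part of the original example.

__global__ void conv_halo( const float* in, float* out )
{
    __shared__ float smem[512 + 2];                    // one halo element on each side
    int tid = threadIdx.x + blockDim.x * blockIdx.x;

    smem[threadIdx.x + 1] = in[tid];                   // interior elements, shifted by one
    if (threadIdx.x == 0)
        smem[0] = in[tid - 1];                         // left halo (still assumes an infinite signal)
    if (threadIdx.x == blockDim.x - 1)
        smem[blockDim.x + 1] = in[tid + 1];            // right halo
    __syncthreads();

    float res = smem[threadIdx.x];                     // b[x-1]
    res += 3.0 * smem[threadIdx.x + 1];                // 3.0 * b[x]
    res += smem[threadIdx.x + 2];                      // b[x+1]
    out[tid] = res;
}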
72 CUDA Example
__host__ void run()
{
    // read input data
    ...
    uint mem_size = signal_size * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc( (void**) &d_in, mem_size);
    cudaMalloc( (void**) &d_out, mem_size);

    cudaMemcpy( d_in, signal, mem_size, cudaMemcpyHostToDevice);
74 CUDA Example
    // assume signal_size > 512
    dim3 threads, blocks;
    threads = 512;
    blocks  = (signal_size / 512) + 1;

    // run kernel
    conv<<<blocks, threads>>>( d_in, d_out);
    cudaThreadSynchronize();

    // copy result back to the host array result
    cudaMemcpy( result, d_out, mem_size, cudaMemcpyDeviceToHost);
}
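One detail worth flagging: with blocks = (signal_size / 512) + 1, a signal whose length is an exact multiple of 512 launches one extra block whose threads read and write past the end of the arrays. A common alternative, shown here as an adjustment using the slide's own variable names rather than as part of the original code, is ceiling division plus a bounds check in the kernel.

    dim3 threads(512);
    dim3 blocks( (signal_size + threads.x - 1) / threads.x );   // round up without an extra block
    conv<<<blocks, threads>>>( d_in, d_out );

    // and inside the kernel, guard the last block:
    // if (tid < signal_size) { ... }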
77 Final Remarks Data parallel co-processor: a task parallel array of data parallel multiprocessors. Efficient programs require keeping the hardware architecture in mind. - Execution model and organization of threads. - Memory hierarchy and latencies.