High Performance Linear Algebra on Data Parallel Co-Processors I

Size: px

Start display at page:

Download "High Performance Linear Algebra on Data Parallel Co-Processors I"

Deirdre Ellis
6 years ago
Views:

1 High Performance Linear Algebra on Data Parallel Co-Processors I

2 2 A bit of history GFLOPS 300 G71 G NV30 NV35 NV40 G70 G Intel Core2 Duo 3.0 GHz Jan. Jun. Apr. May Nov. Mar. Nov Original data Mark Harris (NVIDIA), 2007.

3 3 A bit of history GPU Video Memory CPU Vertex Processor Raster Unit Fragment Processor Render Target

4 4 A bit of history GPU Video Memory CPU Vertex Processor Raster Unit Fragment Processor Render Target

5 5 A bit of history GPU Video Memory CPU Vertex Processor Raster Unit Fragment Processor Render Target

6 6 A bit of history GPU Video Memory CPU Vertex Processor Raster Unit Fragment Processor Render Target

7 7 A bit of history GPU Video Memory CPU Vertex Processor Raster Unit Fragment Processor Render Target

8 8 A bit of history GPU Video Memory CPU Vertex Processor Raster Unit Fragment Processor Render Target

9 9 Modern graphics card Host Input Assembler Setup / Raster / ZCull Thread Scheduler SMEM SMEM SMEM SMEM SMEM SMEM SMEM SMEM L1 L1 L1 L1 L1 L1 L1 L1 L2 L2 L2 L2 L2 L2 MEM MEM MEM MEM MEM MEM

10 10 Data parallel co-processors Host Input Assembler Thread Scheduler SMEM SMEM SMEM SMEM SMEM SMEM SMEM SMEM L1 L1 L1 L1 L1 L1 L1 L1 L2 L2 L2 L2 L2 L2 MEM MEM MEM MEM MEM MEM

11 11 Array of multiprocessors Analogue of a multi-core chip with each multiprocessor corresponding to one core. But the cores are vectorized. Task parallel execution. Independent program counters. Between 2 and 16 multiprocessors on one chip.

12 12 Multiprocessor Data parallel processing unit.

13 13 Multiprocessor Data parallel processing unit.

14 14 Multiprocessor Data parallel processing unit. +

15 15 Multiprocessor Data parallel processing unit. +

16 16 Multiprocessor Data parallel processing unit. Thread block: set of threads that is logically processed together on a multiprocessor.

17 17 Multiprocessor Data parallel processing unit. Thread block: set of threads that is logically processed together on a multiprocessor. Thread warp: set of threads that is physically processed together on a multiprocessor. Width of the processor, typically 16 or 32.

18 18 Multiprocessor

19 19 Multiprocessor

20 20 Multiprocessor

21 21 Multiprocessor

22 22 Multiprocessor

23 23 Multiprocessor

24 24 Multiprocessor Data parallel processing unit. Thread block: set of threads that is logically processed together on a multiprocessor. Thread warp: set of threads that is physically processed together on a multiprocessor. Width of the processor, typically 16 or 32. Thread: processes one data element. Extremely lightweight.

25 25 Data parallel co-processors Host Input Assembler Thread Scheduler SMEM SMEM SMEM SMEM SMEM SMEM SMEM SMEM L1 L1 L1 L1 L1 L1 L1 L1 L2 L2 L2 L2 L2 L2 MEM MEM MEM MEM MEM MEM

26 26 Data parallel co-processors Host Input Assembler Thread Scheduler SMEM SMEM SMEM SMEM SMEM SMEM SMEM SMEM L1 L1 L1 L1 L1 L1 L1 L1 L2 Global Memory

27 27 Memory vs. Execution Task parallel array of multiprocessors Global memory Host and device very high latency

28 28 Memory vs. Execution Task parallel array of multiprocessors Thread block per multiprocessor Global memory Local Memory (Shared Memory/L1) Host and device very high latency Thread block only low latency

29 29 Memory vs. Execution Task parallel array of multiprocessors Thread block per multiprocessor Global memory Local Memory (Shared Memory/L1) Host and device very high latency Thread block only low latency Warp of threads

30 30 Memory vs. Execution Task parallel array of multiprocessors Thread block per multiprocessor Global memory Local Memory (Shared Memory/L1) Host and device very high latency Thread block only low latency Warp of threads Thread Registers Thread only very low latency

31 31 Memory vs. Execution Task parallel array of multiprocessors Thread block per multiprocessor Global memory Local Memory (Shared Memory/L1) Host and device very high latency Thread block only low latency Warp of threads Thread Registers Thread only very low latency

32 32 Thread level parallelism Data

33 33 Thread level parallelism Data

34 34 Thread level parallelism Data

35 35 Thread level parallelism Data

36 36 Thread level parallelism Data

37 37 Thread level parallelism Data

38 38 Thread level parallelism Data

39 39 Thread level parallelism Data

40 40 Thread level parallelism Data

41 41 Thread level parallelism

42 42 Instruction Level Parallelism... c = c * b + a;

43 43 Instruction Level Parallelism... c = c * b + a; d = d * b + a;

44 44 Instruction Level Parallelism... c = c * b + a; d = d * b + a; e = e * b + a;...

45 45 Instruction Level Parallelism... c = c * b + a; d = d * b + a; e = e * b + a;... Commands are non-blocking only dependent operations block multiprocessor.

46 46 Instruction Level Parallelism

47 47 Memory latency Volkov, V., and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra

48 48 Memory latency Volkov, V., and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra

49 49 Ideal program flow 1. Load data form global to shared memory. 2. Process data in shared memory. 3. Write data from shared to global memory.

50 50 CUDA Compute Unified Device Architecture. Proprietary solution by NVIDIA. But very similar to OpenCL. Low-level software interface based on ANSI C. Explicit parallelization, memory handling,... Latest hardware has full C++ support.

51 51 CUDA execution model A program is executed as grid of thread blocks. Each thread block and thread has unique {1,2,3}D thread block and thread identifier.

52 52 CUDA execution model A program is executed as grid of thread blocks. Each thread block and thread has unique {1,2,3}D thread block and thread identifier.

53 53 CUDA execution model A program is executed as grid of thread blocks. Each thread block and thread has unique {1,2,3}D thread block and thread identifier.

54 54 CUDA execution model A program is executed as grid of thread blocks. Each thread block and thread has unique {1,2,3}D thread block and thread identifier.

55 55 CUDA program structure Kernel: function executed on the device that can be called from the host. Host program: Device and memory initialization and execution management. C/C++/Fortran program.

56 56 CUDA Host Extensions Kernel invocation mykernel<<< grid_dim, block_dim >>>(in, out) Memory management cudamalloc(), cudamemcpy(), cudafree(),... Device management... cudagetdevicecount(), cudasetdevice(),...

57 57 CUDA Device Extensions Memory declaration shared float smem[1024] Synchronisation (barrier) synchthreads() Atomic operations atomicadd(), atomicexch(), atomicxor(),... Thread identifier threadidx, blockidx, blockdim,...

58 58 CUDA Program Flow 1. Upload data to process into device memory. 2. Define execution environment. 3. Launch kernel. Read input data into shared memory. Process data. Write result from shared to global memory. 4. Read result back to host memory.

59 59 CUDA Example Simple 1D convolution a[x] =b[x 1] + 3.0b[x]+b[x +1]

60 60 CUDA Example Simple 1D convolution a[x] =b[x 1] + 3.0b[x]+b[x +1] Assume infinite signals and ignore necessary block overlap

61 61 CUDA Example a Simple 1D convolution a[x] =b[x 1] + 3.0b[x]+b[x +1] Assume infinite signals and ignore necessary block overlap b

62 62 CUDA Example a Simple 1D convolution a[x] =b[x 1] + 3.0b[x]+b[x +1] Assume infinite signals and ignore necessary block overlap x b

63 63 CUDA Example a Simple 1D convolution a[x] =b[x 1] + 3.0b[x]+b[x +1] Assume infinite signals and ignore necessary block overlap x b

64 64 CUDA Example a Simple 1D convolution a[x] =b[x 1] + 3.0b[x]+b[x +1] Assume infinite signals and ignore necessary block overlap x b

65 65 CUDA Example a Simple 1D convolution a[x] =b[x 1] + 3.0b[x]+b[x +1] Assume infinite signals and ignore necessary block overlap x b

66 66 CUDA Example a Simple 1D convolution a[x] =b[x 1] + 3.0b[x]+b[x +1] Assume infinite signals and ignore necessary block overlap b block block block block

67 67 CUDA Example global void conv( float* in, float* out) {

68 68 CUDA Example global void conv( float* in, float* out) { shared float smem[1024];

69 69 CUDA Example global void conv( float* in, float* out) { shared float smem[1024]; tid = threadidx.x + blockdim.x * blockidx.x; smem[threadidx.x] = in[tid]; syncthreads();

70 70 CUDA Example global void conv( float* in, float* out) { shared float smem[1024]; tid = threadidx.x + blockdim.x * blockidx.x; smem[threadidx.x] = in[tid]; syncthreads(); float res = smem[threadidx.x-1]; res += 3.0 * smem[threadidx.x]; res += smem[threadidx.x+1];

71 71 CUDA Example global void conv( float* in, float* out) { shared float smem[1024]; tid = threadidx.x + blockdim.x * blockidx.x; smem[threadidx.x] = in[tid]; syncthreads(); float res = smem[threadidx.x-1]; res += 3.0 * smem[threadidx.x]; res += smem[threadidx.x+1]; } out[tid] = res;

72 72 CUDA Example host void run() { // read input data... uint mem_size = signal_size * sizeof(float);

73 73 CUDA Example host void run() { // read input data... uint mem_size = signal_size * sizeof(float); float* d_in, d_out; cudamalloc( (void**) &d_in, mem_size); cudamalloc( (void**) &d_out, mem_size); cudamemcpy( d_in, signal, mem_size, cudamemcpyhosttodevice);

74 74 CUDA Example // assume signal_size > 512 dim3 threads, blocks; threads = 512; blocks = (signal_size / 512) + 1;

75 75 CUDA Example // assume signal_size > 512 dim3 threads, blocks; threads = 512; blocks = (signal_size / 512) + 1; // run kernel conv<<<blocks,threads>>>( d_in, d_out);

76 76 CUDA Example } // assume signal_size > 512 dim3 threads, blocks; threads = 512; blocks = (signal_size / 512) + 1; // run kernel conv<<<blocks,threads>>>( d_in, d_out); cudathreadsynchronize(); // copy result back to the host array result cudamemcpy( result, d_out, mem_size, cudamemcpydevicetohost);

77 77 Final Remarks Data parallel co-processor: task parallel array of data parallel multiprocessor. Efficient programs require to keep hardware architecture in mind. - Execution model and organization of threads. - Memory hierarchy and latencies.

CUDA C Programming Mark Harris NVIDIA Corporation

CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment