High Performance Linear Algebra on Data Parallel Co-Processors I
2 A bit of history [Chart: GPU vs. CPU performance in GFLOPS, 2003-2006. NVIDIA GPUs (NV30, NV35, NV40, G70, G70-512, G71, G80) pull away from an Intel Core2 Duo 3.0 GHz. Original data: Mark Harris (NVIDIA), 2007.]
3 A bit of history [Diagram, built up over several slides: the classic graphics pipeline. The CPU feeds the GPU; data flows through Video Memory, Vertex Processor, Raster Unit, and Fragment Processor to the Render Target.]
9 Modern graphics card [Diagram: Host and Input Assembler feed Setup/Raster/ZCull and a Thread Scheduler, which dispatches work to an array of multiprocessors, each with shared memory (SMEM) and an L1 cache, connected through L2 caches to the memory partitions (MEM).]
10 Data parallel co-processors [Diagram: the same chip viewed as a compute device. Host, Input Assembler, and Thread Scheduler feed the multiprocessors with their SMEM and L1 caches, backed by L2 caches and memory partitions; the graphics-specific raster stages are gone.]
11 Array of multiprocessors The analogue of a multi-core chip, with each multiprocessor corresponding to one core, but the cores are vectorized. Task parallel execution with independent program counters. Between 2 and 16 multiprocessors on one chip.
12 Multiprocessor Data parallel processing unit. Thread block: a set of threads that is logically processed together on a multiprocessor. Thread warp: a set of threads that is physically processed together on a multiprocessor; its size is the width of the processor, typically 16 or 32. Thread: processes one data element; extremely lightweight.
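To make the thread hierarchy concrete, a minimal sketch of how block, warp and thread indices show up in a kernel; the kernel name and the warp size of 32 are illustrative assumptions, not taken from the slides.

__global__ void whoAmI( int* warp_of_thread )
{
    int tid  = threadIdx.x + blockDim.x * blockIdx.x;   // global thread index
    int warp = threadIdx.x / 32;                        // warp index within the block (assuming warp size 32)
    warp_of_thread[tid] = warp;                         // each lightweight thread handles one element
}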
25 Data parallel co-processors [Diagram: the L2 caches and memory partitions of the previous view are abstracted into a single Global Memory shared by all multiprocessors.]
27 Memory vs. Execution Task parallel array of multiprocessors: global memory, visible to host and device, very high latency. Thread block per multiprocessor: local memory (shared memory/L1), visible to the thread block only, low latency. Warp of threads / single thread: registers, visible to the thread only, very low latency.
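A small sketch of how the three levels of this hierarchy typically appear in device code; the kernel and the tile size of 256 are illustrative assumptions.

__global__ void levels( const float* in, float* out )      // in/out live in global memory (high latency)
{
    __shared__ float tile[256];                             // shared memory: thread block only, low latency
    int tid = threadIdx.x + blockDim.x * blockIdx.x;        // assumes blockDim.x <= 256 and a grid covering the input

    float x = in[tid];                                      // register: thread only, very low latency
    tile[threadIdx.x] = x;                                  // stage the value in shared memory
    __syncthreads();

    out[tid] = tile[threadIdx.x];                           // write back to global memory
}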
32 Thread level parallelism [Diagram, built up over several slides: the data array is processed element-wise by many parallel threads.] http://www.cs.berkeley.edu/~volkov/volkov10-gtc.pdf
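A hedged sketch of thread level parallelism in this sense: one thread per data element, so latency is hidden by having many threads in flight. The saxpy-style kernel is an illustration, not part of the slides.

__global__ void saxpy( int n, float a, const float* x, float* y )
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;   // one thread per element
    if (i < n)                                       // guard threads past the end of the data
        y[i] = a * x[i] + y[i];
}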
42 Instruction Level Parallelism
c = c * b + a;
d = d * b + a;
e = e * b + a;
...
Instructions issue without blocking; only dependent operations stall the multiprocessor.
46 Instruction Level Parallelism http://www.cs.berkeley.edu/~volkov/volkov10-gtc.pdf
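As a hedged sketch of the same idea inside one thread: the accumulators c, d and e form independent dependency chains, so their multiply-adds can be issued back to back instead of waiting on each other. The kernel name and the loop are illustrative assumptions.

__global__ void ilp( float a, float b, float* out, int iters )
{
    float c = 1.0f, d = 2.0f, e = 3.0f;   // three independent accumulators
    for (int i = 0; i < iters; ++i)
    {
        c = c * b + a;                     // chain 1
        d = d * b + a;                     // chain 2, independent of chain 1
        e = e * b + a;                     // chain 3, independent of chains 1 and 2
    }
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    out[tid] = c + d + e;                  // keep the results observable (assumes the grid matches out)
}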
47 Memory latency Volkov, V., and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. 2008.
49 Ideal program flow 1. Load data from global to shared memory. 2. Process data in shared memory. 3. Write data from shared to global memory.
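A bare skeleton of this three-step flow, as one possible reading of the slide; the convolution kernel later in the deck is a concrete instance. The tile size and the placeholder operation are assumptions.

__global__ void idealFlow( const float* in, float* out )
{
    __shared__ float tile[512];                       // assumes blockDim.x == 512
    int tid = threadIdx.x + blockDim.x * blockIdx.x;

    tile[threadIdx.x] = in[tid];                      // 1. load from global to shared memory
    __syncthreads();

    float result = 2.0f * tile[threadIdx.x];          // 2. process the data in shared memory (placeholder op)

    out[tid] = result;                                // 3. write the result back to global memory
}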
50 CUDA Compute Unified Device Architecture. Proprietary solution by NVIDIA. But very similar to OpenCL. Low-level software interface based on ANSI C. Explicit parallelization, memory handling,... Latest hardware has full C++ support.
51 CUDA execution model A program is executed as a grid of thread blocks. Each thread block and each thread has a unique {1,2,3}D identifier.
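To make the {1,2,3}D identifiers concrete, a hedged sketch of a 2D launch, e.g. over an image; the kernel name, image layout and block size are assumptions.

__global__ void fill2D( float* img, int width, int height )
{
    int x = threadIdx.x + blockDim.x * blockIdx.x;   // 2D thread identifier
    int y = threadIdx.y + blockDim.y * blockIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 1.0f;
}

// Host side: a 2D grid of 16x16 thread blocks covering the whole image.
// dim3 block(16, 16);
// dim3 grid( (width + 15) / 16, (height + 15) / 16 );
// fill2D<<<grid, block>>>( d_img, width, height );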
55 CUDA program structure Kernel: function executed on the device that can be called from the host. Host program: Device and memory initialization and execution management. C/C++/Fortran program.
56 CUDA Host Extensions Kernel invocation: mykernel<<< grid_dim, block_dim >>>(in, out) Memory management: cudaMalloc(), cudaMemcpy(), cudaFree(), ... Device management: cudaGetDeviceCount(), cudaSetDevice(), ...
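Since the example later in the deck only exercises the memory calls, a minimal host program for the device management calls; purely illustrative.

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);              // number of visible CUDA devices
    printf("%d device(s) found\n", count);
    if (count > 0)
        cudaSetDevice(0);                    // select device 0 for all subsequent calls
    return 0;
}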
57 CUDA Device Extensions Memory declaration: __shared__ float smem[1024] Synchronisation (barrier): __syncthreads() Atomic operations: atomicAdd(), atomicExch(), atomicXor(), ... Thread identifiers: threadIdx, blockIdx, blockDim, ...
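A hedged sketch that combines these device-side extensions: a shared-memory counter updated with atomicAdd and barriers around it. The kernel name and the counting task are assumptions.

__global__ void countPositive( const float* in, int* block_counts )
{
    __shared__ int count;
    if (threadIdx.x == 0) count = 0;                    // one thread initializes the shared counter
    __syncthreads();

    int tid = threadIdx.x + blockDim.x * blockIdx.x;    // assumes the grid exactly covers the input
    if (in[tid] > 0.0f)
        atomicAdd(&count, 1);                           // threads of the block update the counter atomically
    __syncthreads();

    if (threadIdx.x == 0)
        block_counts[blockIdx.x] = count;               // per-block result to global memory
}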
58 CUDA Program Flow 1. Upload data to process into device memory. 2. Define execution environment. 3. Launch kernel. Read input data into shared memory. Process data. Write result from shared to global memory. 4. Read result back to host memory.
59 CUDA Example Simple 1D convolution: a[x] = b[x-1] + 3.0*b[x] + b[x+1]. Assume infinite signals and ignore the necessary block overlap. [Diagram, built up over several slides: each output element a[x] is computed from three neighbouring elements of b; the input signal is split into blocks, one per thread block.]
67 CUDA Example
__global__ void conv( float* in, float* out )
{
    __shared__ float smem[1024];

    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    smem[threadIdx.x] = in[tid];
    __syncthreads();

    float res = smem[threadIdx.x-1];
    res += 3.0 * smem[threadIdx.x];
    res += smem[threadIdx.x+1];

    out[tid] = res;
}
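The kernel reads smem[threadIdx.x-1] and smem[threadIdx.x+1], which fall outside the tile for the first and last thread of each block; the slides deliberately ignore this overlap. A hedged sketch of one way to load the missing halo elements; the two extra shared-memory slots and the 512-thread block are assumptions, not part of the original example.

__global__ void conv_halo( const float* in, float* out )
{
    __shared__ float smem[512 + 2];                    // one halo element on each side
    int tid = threadIdx.x + blockDim.x * blockIdx.x;

    smem[threadIdx.x + 1] = in[tid];                   // interior elements, shifted by one
    if (threadIdx.x == 0)
        smem[0] = in[tid - 1];                         // left halo (still assumes an infinite signal)
    if (threadIdx.x == blockDim.x - 1)
        smem[blockDim.x + 1] = in[tid + 1];            // right halo
    __syncthreads();

    float res = smem[threadIdx.x];                     // b[x-1]
    res += 3.0 * smem[threadIdx.x + 1];                // 3.0 * b[x]
    res += smem[threadIdx.x + 2];                      // b[x+1]
    out[tid] = res;
}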
72 CUDA Example
__host__ void run()
{
    // read input data
    ...
    uint mem_size = signal_size * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc( (void**) &d_in, mem_size);
    cudaMalloc( (void**) &d_out, mem_size);

    cudaMemcpy( d_in, signal, mem_size, cudaMemcpyHostToDevice);
74 CUDA Example
    // assume signal_size > 512
    dim3 threads, blocks;
    threads = 512;
    blocks  = (signal_size / 512) + 1;

    // run kernel
    conv<<<blocks, threads>>>( d_in, d_out);
    cudaThreadSynchronize();

    // copy result back to the host array result
    cudaMemcpy( result, d_out, mem_size, cudaMemcpyDeviceToHost);
}
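One detail worth flagging: with blocks = (signal_size / 512) + 1, a signal whose length is an exact multiple of 512 launches one extra block whose threads read and write past the end of the arrays. A common alternative, shown here as an adjustment using the slide's own variable names rather than as part of the original code, is ceiling division plus a bounds check in the kernel.

    dim3 threads(512);
    dim3 blocks( (signal_size + threads.x - 1) / threads.x );   // round up without an extra block
    conv<<<blocks, threads>>>( d_in, d_out );

    // and inside the kernel, guard the last block:
    // if (tid < signal_size) { ... }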
77 Final Remarks Data parallel co-processor: a task parallel array of data parallel multiprocessors. Efficient programs require keeping the hardware architecture in mind. - Execution model and organization of threads. - Memory hierarchy and latencies.