Introduction to GPGPU and GPU-architectures
1 Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak
2 Contents 1. What is a GPU? 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks / Department of Electrical Engineering - ES group, 28 June 2012
3 1. What is a GPU? It's a bird... It's a plane... It's a GPU!
4 Graphics Processing Unit ref: "How GPUs Work"
5 Trends in chip design Frequency wall Energy wall ILP wall
6 Why multi/many-core? Running into the energy wall, frequency wall, ILP wall and memory wall. Chip area enabler: Moore's law goes well below 22 nm. What to do with all this area? Multiple processors fit easily on a single die. Cost effective re-use: just connect existing processors or processor cores.
7 How Do CPUs Spend Their Die Area? CPUs are designed for low latency instead of high throughput. Intel Penryn: half the area is spent on processing, the rest on large on-chip caches. Die photo of Intel Penryn (source: Intel).
8 How Do GPUs Spend Their Die Area? GPUs are designed to match the workload of 3D graphics. NVidia Kepler GPU: most area is spent on processing, with relatively small on-chip memories and huge off-chip memory latencies.
9 Why GPU? According to NVidia: "GPUs give a 1000x performance boost!" Jen-Hsun Huang, CEO NVidia.
10 Why GPU? The answer from Intel: "Actually, NVidia GPUs are only 2.5x faster than Intel CPUs." Paul Otellini, CEO Intel. Application performance benchmarked by Intel: Intel i7 960 vs. NVidia GTX280. Ref: Debunking the 100X GPU vs. CPU Myth, ISCA 2010.
11 Why GPU performance boost? "NVidia GPUs give a 1000x performance boost!" (Jen-Hsun Huang, CEO NVidia) vs. "Actually, NVidia GPUs are only 2.5x faster than Intel CPUs." (Paul Otellini, CEO Intel)
12 2. How to program a GPU?
13 NVIDIA GPU architecture
14 NVIDIA GPU architecture: Nvidia Tesla 8800GT (2006). [Diagram: processing clusters, each with an i-cache, scheduler, register file, SFUs and shared memory.]
15 GPU programming
Sequential: programming model C, architecture CPU
Data parallel: programming model Vector, architecture SIMD
Thread parallel: programming model CUDA, architecture SIMT
Let's start from C and the CPU.
16 Let's Start with C and CPU
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}
Assembly code of the inner loop:
lw   r0, 4(r1)
addi r0, r0, 1
sw   r0, 4(r1)
Programmer's view of RISC.
17 Most CPUs Have Vector SIMD Units Programmer's view of a vector SIMD, e.g. SSE.
18 Let's Program the Vector SIMD
Unroll the inner loop into a vector operation:
int A[2][4];
for(i=0;i<2;i++){
  movups xmm0, [ &A[i][0] ]   // load
  addps  xmm0, xmm1           // add 1
  movups [ &A[i][0] ], xmm0   // store
}
Looks like the previous scalar inner loop (lw / addi / sw), but each SSE instruction executes on 4 ALUs.
19 How Do Vector Programs Run?
int A[2][4];
for(i=0;i<2;i++){
  movups xmm0, [ &A[i][0] ]   // load
  addps  xmm0, xmm1           // add 1
  movups [ &A[i][0] ], xmm0   // store
}
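The execution model above can be sketched in plain C: a minimal host-side stand-in (not real SSE intrinsics; `vector_add_one` and `VLEN` are hypothetical names) where one "vector instruction" updates 4 lanes of a row at once while the outer loop stays scalar.

```c
#include <assert.h>

#define VLEN 4  /* SSE register width in 32-bit elements */

/* Hypothetical scalar stand-in for the movups/addps/movups sequence:
   one "vector instruction" updates 4 elements of a row at once. */
void vector_add_one(int *row)
{
    for (int lane = 0; lane < VLEN; lane++)  /* 4 ALUs work in lock-step */
        row[lane] += 1;
}

/* The outer loop over i stays scalar, exactly as on the slide. */
void increment_matrix(int A[2][4])
{
    for (int i = 0; i < 2; i++)
        vector_add_one(A[i]);  /* inner j-loop collapsed to one vector op */
}
```

The point of the vectorization is visible in the shape of the code: the j-loop disappears into a single call, mirroring how the four `addps` lanes replace four scalar `addi` instructions.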
20 Programmer's View of GPUs A GPU contains multiple SIMD Units.
21 Programmer's View of GPUs A GPU contains multiple SIMD Units. All of them can access global memory.
22 What Are the Differences? SSE vs. GPU. Let's start with two important differences: 1. GPUs use threads instead of vectors. 2. The "shared memory" spaces.
23 Let's Start Again from C
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}
Convert into CUDA:
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);   // define threads

__device__ kernelF(A){   // all threads run the same kernel
  i = blockIdx.x;        // each thread block has its id
  j = threadIdx.x;       // each thread has its id
  A[i][j]++;             // each thread has a different i and j
}
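The meaning of the `<<<(2,1),(4,1)>>>` launch can be emulated on the host with two plain loops; a sketch under the assumption of sequential execution (`kernelF` and `launch_kernelF` are illustrative names, and `blockIdx_x` / `threadIdx_x` stand in for CUDA's built-in variables).

```c
#include <assert.h>

/* Host-side emulation of the kernel body; blockIdx_x and threadIdx_x
   are hypothetical stand-ins for CUDA's blockIdx.x / threadIdx.x. */
void kernelF(int A[2][4], int blockIdx_x, int threadIdx_x)
{
    int i = blockIdx_x;   /* each thread block has its id    */
    int j = threadIdx_x;  /* each thread has its id          */
    A[i][j]++;            /* every thread gets different i,j */
}

/* The <<<(2,1),(4,1)>>> launch replaces the two loop nests:
   on a GPU all 8 "iterations" below run as parallel threads. */
void launch_kernelF(int A[2][4])
{
    for (int b = 0; b < 2; b++)      /* gridDim.x  = 2 */
        for (int t = 0; t < 4; t++)  /* blockDim.x = 4 */
            kernelF(A, b, t);
}
```

Each (block, thread) pair touches exactly one element, which is why the kernel body contains no loop at all.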
24 3. How does a GPU schedule all these threads?
25 Thread Hierarchy in CUDA A Grid contains Thread Blocks; a Thread Block contains Threads.
26 What Is the Thread Hierarchy? Thread 3 of block 1 operates on element A[1][3].
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);   // define threads

__device__ kernelF(A){   // all threads run the same kernel
  i = blockIdx.x;        // each thread block has its id
  j = threadIdx.x;       // each thread has its id
  A[i][j]++;             // each thread has a different i and j
}
27 How Are Threads Scheduled?
28 Executing many, many threads
int A[256][128];                   // was: int A[2][4];
kernelF<<<(256,1),(128,1)>>>(A);   // was: kernelF<<<(2,1),(4,1)>>>(A);

__device__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}
29 Warps as a group of threads A warp is a group of 32 threads from the same thread block. Warps are generated dynamically by the hardware. kernelF<<<(256,1),(128,1)>>>(A); gives 1 grid, 256 thread blocks, 128 threads per block = 4 warps per block.
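The warp arithmetic above can be written down directly; a small C sketch (function names are illustrative) of how the hardware's grouping into 32-thread warps turns the launch configuration into a warp count.

```c
#include <assert.h>

#define WARP_SIZE 32

/* Warps per thread block: the hardware groups threads into fixed
   chunks of 32, rounding up for a partial final warp. */
int warps_per_block(int threads_per_block)
{
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Total warps generated by a kernel launch. */
int total_warps(int num_blocks, int threads_per_block)
{
    return num_blocks * warps_per_block(threads_per_block);
}
```

For the launch on the slide, 128 threads per block gives 4 warps per block and 256 blocks give 1024 warps in total; a block of, say, 33 threads would still occupy 2 full warps.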
30-38 Scheduling warps Example, per cluster: 128 threads (4 warps), 16 processing cores. [Animation: the scheduler (i-cache, scheduler, register file) picks instructions from the warp pool round-robin; each warp instruction occupies the 16 cores for 2 clock cycles, since a warp's 32 threads share 16 cores.]
Clock cycle  0: warp 0, instruction 0 (threads 0-31)
Clock cycle  2: warp 1, instruction 0 (threads 32-63)
Clock cycle  4: warp 2, instruction 0 (threads 64-95)
Clock cycle  6: warp 3, instruction 0 (threads 96-127)
Clock cycle  8: warp 0, instruction 1
Clock cycle 10: warp 1, instruction 1
Clock cycle 12: warp 2, instruction 1
Clock cycle 14: warp 3, instruction 1
Clock cycle 16: warp 0, instruction 2
And so on...
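The round-robin issue pattern of the scheduling example reduces to two divisions; a minimal sketch assuming the slide's parameters (4 warps, 16 cores, one issue every 2 cycles; `schedule` is an illustrative name, not real hardware or API).

```c
#include <assert.h>

#define NUM_WARPS 4
#define CYCLES_PER_ISSUE 2  /* 32 threads on 16 cores: 2 cycles per issue */

/* Round-robin warp scheduler from the slides: for a given clock cycle,
   compute which warp issues and which instruction it issues. */
void schedule(int cycle, int *warp, int *instruction)
{
    int slot = cycle / CYCLES_PER_ISSUE;  /* issue slot number       */
    *warp = slot % NUM_WARPS;             /* rotate through the pool */
    *instruction = slot / NUM_WARPS;      /* same code for all warps */
}
```

Evaluating it at cycles 0, 6, 8 and 16 reproduces the sequence in the animation: warp 0/instr 0, warp 3/instr 0, warp 0/instr 1, warp 0/instr 2.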
39 Scheduling warps from multiple thread blocks Example, per cluster: 128 threads (4 warps), 16 processing cores. With two thread blocks (TB0, TB1) resident, the warp pool holds 8 warps and the scheduler issues round-robin over all of them:
TB0 warps 0-3, instruction 0; TB1 warps 0-3, instruction 0;
TB0 warps 0-3, instruction 1; TB1 warps 0-3, instruction 1;
And so on... (each block's warps cover its threads 0-31, 32-63, 64-95, 96-127)
40 Inside the warp scheduler [Diagram: i-cache, scheduler, register file.] Ref: Analyzing CUDA Workloads Using a Detailed GPU Simulator, ISPASS 2009 (GPGPU-Sim).
41 4. Performance bottlenecks How do I get the maximum out of my GPU?
42 Thread divergence Example, per cluster: 128 threads (4 warps), 16 processing cores.
__device__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  if(j%2 == 1) A[i][j]++;
  else         A[i][j]--;
}
With this condition every warp diverges: each of warps 0-3 issues the condition j%2==1, then A[i][j]++ with the even threads masked off, then A[i][j]-- with the odd threads masked off.
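The divergence cost can be counted on the host; a sketch (function names are illustrative) that checks whether both branch outcomes occur inside one 32-thread warp: a uniform warp issues the branch body once, a divergent warp twice.

```c
#include <assert.h>

#define WARP_SIZE 32

/* A warp serializes an if/else only when both outcomes occur among its
   32 threads: 1 issue pass if all threads agree, 2 if the warp diverges. */
int branch_passes(int first_thread, int (*pred)(int j))
{
    int taken = 0, not_taken = 0;
    for (int j = first_thread; j < first_thread + WARP_SIZE; j++) {
        if (pred(j)) taken++; else not_taken++;
    }
    return (taken > 0 && not_taken > 0) ? 2 : 1;
}

/* The slide's condition: odd threads take the ++ branch. */
int is_odd(int j) { return j % 2 == 1; }

/* A warp-aligned condition: whole warps agree, so no divergence. */
int in_upper_half(int j) { return j >= 64; }
```

The `j % 2` predicate diverges in every warp, while a warp-aligned predicate like `j >= 64` costs nothing extra, which is why divergence-free kernels try to branch on (threadIdx / warp size) rather than on fine-grained thread indices.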
43 GPU memory hierarchy
Register: per thread
Shared memory: per thread block
Global memory: per kernel
[Diagram: clusters, each with an i-cache, scheduler, register file, SFUs and shared memory.]
44 On-chip shared / local memory is divided into banks (8 threads, 8 banks in the figure). [Three access patterns: each thread to its own bank - no bank conflict; several threads to the same bank - bank conflict; a permutation over the banks - no bank conflict.]
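The bank-conflict rule in the figure can be checked mechanically; a sketch assuming the figure's 8 banks and word-interleaved mapping (`has_bank_conflict` is an illustrative name; real GPUs use 16 or 32 banks).

```c
#include <assert.h>

#define NUM_BANKS 8  /* 8 banks to match the 8-thread figure */

/* Shared memory is interleaved word-by-word across banks. A bank
   conflict occurs when two threads of one access hit the same bank. */
int has_bank_conflict(int num_threads, int stride)
{
    int used[NUM_BANKS] = {0};
    for (int t = 0; t < num_threads; t++) {
        int bank = (t * stride) % NUM_BANKS;  /* word index -> bank  */
        if (used[bank]) return 1;             /* second hit: conflict */
        used[bank] = 1;
    }
    return 0;
}
```

Stride-1 access maps each thread to its own bank; stride-2 sends threads 0 and 4 to the same bank (a conflict); an odd stride such as 3 merely permutes the banks and is again conflict-free.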
45 Off-chip global memory [Access patterns for threads 0-7: consecutive addresses - coalesced into one memory transaction; scattered or strided addresses - not coalesced.]
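The coalescing distinction in the figure can be expressed as a simple predicate; a deliberately simplified sketch (real hardware rules are more refined, and `is_coalesced` / `SEGMENT_BYTES` are assumptions chosen to match the 8-thread, 4-byte-word figure).

```c
#include <assert.h>

#define SEGMENT_BYTES 32  /* 8 threads x 4-byte words, matching the figure */

/* Simplified coalescing rule: the accesses of this mini-warp merge
   into one transaction iff all addresses fall in one aligned segment. */
int is_coalesced(const unsigned addr[], int n)
{
    unsigned segment = addr[0] / SEGMENT_BYTES;
    for (int t = 1; t < n; t++)
        if (addr[t] / SEGMENT_BYTES != segment)
            return 0;  /* needs extra transactions: not coalesced */
    return 1;
}
```

Eight consecutive 4-byte loads fit in one 32-byte segment and coalesce; a stride of two words already spills into a second segment and does not.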
46 CPU - GPU memory transfers go over PCI-e. [Timeline per frame: copy CPU to GPU, GPU compute, copy GPU to CPU, repeated every frame.]
47 Example: Matrix Multiplication
__global__ void matrixMul_naive(float* C, float* A, float* B, int wA, int wB)
{
  // Accumulate row i of A and column j of B
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  float accu = 0;
  for(int k = 0; k < wA; k++) {
    accu = accu + A[i * wA + k] * B[k * wB + j];
  }
  // Write the result to device memory; each thread writes one element
  C[i * wB + j] = accu;
}
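The kernel's flattened indexing can be verified with a host-side C reference; a sketch where the i/j loops stand in for the thread grid and the k-loop is exactly what each GPU thread runs (`matrixMul_ref` and the extra height parameter `hA` are assumptions for the host version).

```c
#include <assert.h>

/* Host C reference with the same flattened row-major indexing as the
   naive kernel: the k-loop is the per-thread work; the i/j loops
   enumerate what the thread grid would compute in parallel. */
void matrixMul_ref(float *C, const float *A, const float *B,
                   int hA, int wA, int wB)
{
    for (int i = 0; i < hA; i++)
        for (int j = 0; j < wB; j++) {
            float accu = 0;
            for (int k = 0; k < wA; k++)
                accu += A[i * wA + k] * B[k * wB + j];
            C[i * wB + j] = accu;  /* one element per (i, j) thread */
        }
}
```

Running both versions on the same input is the usual way to validate a GPU kernel during development.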
48 Example: Matrix Multiplication
__global__ void matrixMul_prefetch(float* C, float* A, float* B, int wA, int wB)
{
  // Block index
  int bx = blockIdx.x;
  int by = blockIdx.y;
  // Thread index
  int tx = threadIdx.x;
  int ty = threadIdx.y;

  // Shared memory arrays used to store (and prefetch) the sub-matrix of A
  __shared__ float As [BLOCK_SIZE * BLOCK_SIZE];
  __shared__ float As2[BLOCK_SIZE * BLOCK_SIZE];
  float *prefetch  = As;
  float *prefetch2 = As2;

  float cv[BLOCK_SIZE] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

  // Index of the first and last sub-matrix of A processed by the block
  int aBegin = wA * BLOCK_SIZE * by;
  int aEnd   = aBegin + wA - 1;
  // Step size used to iterate through the sub-matrices of A
  int aStep  = BLOCK_SIZE;

  // Index of the first sub-matrix of B processed by the block
  int bBegin = BLOCK_SIZE * VECTOR_SIZE * bx;
  // Step size used to iterate through the sub-matrices of B
  int bStep  = BLOCK_SIZE * wB;

  int cBegin = wB * BLOCK_SIZE * by + VECTOR_SIZE * BLOCK_SIZE * bx;

  // Prefetch the first sub-matrix of A into shared memory
  float *Ap = &A[aBegin + wA * ty + tx];
  float *ap = &prefetch[ty + BLOCK_SIZE * tx];
  #pragma unroll
  for(int i = 0; i < 16; i += 4){
    ap[i] = Ap[wA * i];
  }
  __syncthreads();

  // Loop over all the sub-matrices of A and B
  // required to compute the block sub-matrix
  for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
    // Prefetch the next sub-matrix of A from device memory to shared memory
    Ap = &A[a + aStep + wA * ty + tx];
    float *ap2 = &prefetch2[ty + BLOCK_SIZE * tx];
    #pragma unroll
    for(int i = 0; i < 16; i += 4){
      ap2[i] = Ap[wA * i];
    }

    ap = &prefetch[0];
    float *bp = &B[b + BLOCK_SIZE * ty + tx];
    #pragma unroll
    for(int i = 0; i < BLOCK_SIZE; i++){
      float bv = bp[0];
      cv[0]  += ap[0]  * bv;  cv[1]  += ap[1]  * bv;
      cv[2]  += ap[2]  * bv;  cv[3]  += ap[3]  * bv;
      cv[4]  += ap[4]  * bv;  cv[5]  += ap[5]  * bv;
      cv[6]  += ap[6]  * bv;  cv[7]  += ap[7]  * bv;
      cv[8]  += ap[8]  * bv;  cv[9]  += ap[9]  * bv;
      cv[10] += ap[10] * bv;  cv[11] += ap[11] * bv;
      cv[12] += ap[12] * bv;  cv[13] += ap[13] * bv;
      cv[14] += ap[14] * bv;  cv[15] += ap[15] * bv;
      ap += BLOCK_SIZE;
      bp += wB;
    }

    // Synchronize to make sure the next sub-matrix is loaded
    __syncthreads();

    // Swap As and As2
    float *prefetch_temp = prefetch;
    prefetch  = prefetch2;
    prefetch2 = prefetch_temp;
  }

  // Write the block sub-matrix to device memory;
  // each thread writes one element per row of the tile
  float *Cp = &C[cBegin];
  Cp += BLOCK_SIZE * ty + tx;
  int cStep = wB;
  #pragma unroll
  for(int i = 0; i < BLOCK_SIZE; i++){
    Cp[0] = cv[i];
    Cp += cStep;
  }
}
49 Example: Matrix Multiplication
50 Present: AMD vs. NVidia
                       AMD HD 7970 (Graphics Core Next)   NVidia GTX 680 (Kepler)
Clock frequency        925 MHz                            - GHz
Max. performance (sp)  3.9 Tflop/s                        3.1 Tflop/s
Max. performance (dp)  947 Gflop/s                        129 Gflop/s
Bandwidth              264 GB/s                           192 GB/s
Gflops/W               -                                  -
Prog. language         OpenCL                             CUDA & OpenCL
51 Future GPU and CPU on one chip AMD Fusion Intel i5 2500K
52 ARM & GPU Present: Carma (NVidia) = Tegra 3 ARM CPU + CUDA GPU. Future: embedded ARM CPU + embedded GPU (ARM Mali, PowerVR).
53 Summary Trends in chip design: more transistors, more cores. GPU architecture: SIMD vs. SIMT. Thread scheduling on GPU. Thread divergence & memory issues. Current GPUs on the market: Gflops/W. [Diagram: clusters with i-cache, scheduler, register file, SFUs and shared memory.]
54 Questions
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationCS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST
CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationGPU programming. Dr. Bernhard Kainz
GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling
More informationLecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
More informationOptimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as
More informationIntroduction to GPU Computing. Design and Analysis of Parallel Algorithms
Introduction to GPU Computing Design and Analysis of Parallel Algorithms Sources CUDA Programming Guide (3.2) CUDA Best Practices Guide (3.2) CUDA Toolkit Reference Manual (3.2) CUDA SDK Examples Part
More informationIntroduction to CUDA
Introduction to CUDA Oliver Meister November 7 th 2012 Tutorial Parallel Programming and High Performance Computing, November 7 th 2012 1 References D. Kirk, W. Hwu: Programming Massively Parallel Processors,
More informationProgramming in CUDA. Malik M Khan
Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationCode Optimizations for High Performance GPU Computing
Code Optimizations for High Performance GPU Computing Yi Yang and Huiyang Zhou Department of Electrical and Computer Engineering North Carolina State University 1 Question to answer Given a task to accelerate
More informationGPU Acceleration of CFD Codes and Optimizing for GPU Memory Hierarchies
GPU Acceleration of CFD Codes and Optimizing for GPU Memory Hierarchies Dr. Frank Mueller Nishanth Balasubramanian North Carolina State University Infrastructure: ARC cluster at NCSU l Hardware 2x AMD
More informationData Parallel Execution Model
CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013 Objective To understand the organization and scheduling
More informationGPU Computing with CUDA
GPU Computing with CUDA Dan Negrut Andrew Seidl Daniel Melanz Simulation-Based Engineering Lab Department of Mechanical Engineering University of Wisconsin-Madison Dan Negrut, 2012 UW-Madison Bucuresti
More informationOptimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationPPAR: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique PPAR
PPAR: CUDA basics Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr http://www.irisa.fr/alf/collange/ PPAR - 2018 This lecture: CUDA programming We have seen some GPU architecture
More informationProgramming Parallel Computers
ICS-E4020 Programming Parallel Computers Jukka Suomela Jaakko Lehtinen Samuli Laine Aalto University Spring 2016 users.ics.aalto.fi/suomela/ppc-2016/ New code must be parallel! otherwise a computer from
More informationIntroduction to Multicore architecture. Tao Zhang Oct. 21, 2010
Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)
More informationCS521 CSE IITG 11/23/2012
Hide vector width using scalar threads. 1 2 Review: How Do We Reach Here? NVIDIA Fermi, 512 Processing Elements (PEs) 3 TPC TPC TPC TPC Texture Cluster (TPC) Texture Unit Streaming Array SM SM SM TPC TPC
More information2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions
Administrative L6: Memory Hierarchy Optimization IV, Bandwidth Optimization Next assignment available Goals of assignment: simple memory hierarchy management block-thread decomposition tradeoff Due Tuesday,
More informationHigh Performance Linear Algebra on Data Parallel Co-Processors I
926535897932384626433832795028841971693993754918980183 592653589793238462643383279502884197169399375491898018 415926535897932384626433832795028841971693993754918980 592653589793238462643383279502884197169399375491898018
More information1/25/12. Administrative
Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:
More informationCSE 160 Lecture 24. Graphical Processing Units
CSE 160 Lecture 24 Graphical Processing Units Announcements Next week we meet in 1202 on Monday 3/11 only On Weds 3/13 we have a 2 hour session Usual class time at the Rady school final exam review SDSC
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose Joe Stam Optimization GPUs are very fast BUT Poor programming can lead to disappointing performance Squeaking out the most speed
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationProgramming Parallel Computers
ICS-E4020 Programming Parallel Computers Jukka Suomela Jaakko Lehtinen Samuli Laine Aalto University Spring 2015 users.ics.aalto.fi/suomela/ppc-2015/ Introduction Modern computers have high-performance
More information2/17/10. Administrative. L7: Memory Hierarchy Optimization IV, Bandwidth Optimization and Case Studies. Administrative, cont.
Administrative L7: Memory Hierarchy Optimization IV, Bandwidth Optimization and Case Studies Next assignment on the website Description at end of class Due Wednesday, Feb. 17, 5PM Use handin program on
More informationGPU Programming Using NVIDIA CUDA
GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationOverview: Graphics Processing Units
advent of GPUs GPU architecture Overview: Graphics Processing Units the NVIDIA Fermi processor the CUDA programming model simple example, threads organization, memory model case study: matrix multiply
More informationHands-on CUDA Optimization. CUDA Workshop
Hands-on CUDA Optimization CUDA Workshop Exercise Today we have a progressive exercise The exercise is broken into 5 steps If you get lost you can always catch up by grabbing the corresponding directory
More informationLecture 5. Performance programming for stencil methods Vectorization Computing with GPUs
Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationLecture 5. Performance Programming with CUDA
Lecture 5 Performance Programming with CUDA Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Today s lecture Matrix multiplication 2011 Scott B. Baden / CSE 262 / Spring 2011 3 Memory Hierarchy
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Review Secret behind GPU performance: simple cores but a large number of them; even more threads can exist live on the hardware (10k 20k threads live). Important performance
More informationCOSC 6374 Parallel Computations Introduction to CUDA
COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationMemory-level and Thread-level Parallelism Aware GPU Architecture Performance Analytical Model
Memory-level and Thread-level Parallelism Aware GPU Architecture Performance Analytical Model Sunpyo Hong Hyesoon Kim ECE School of Computer Science Georgia Institute of Technology {shong9, hyesoon}@cc.gatech.edu
More informationImage convolution with CUDA
Image convolution with CUDA Lecture Alexey Abramov abramov _at_ physik3.gwdg.de Georg-August University, Bernstein Center for Computational Neuroscience, III Physikalisches Institut, Göttingen, Germany
More information