Introduction to GPGPUs
1 Introduction to GPGPUs Sandra Wienke, M.Sc. PPCES 2012 Rechen- und Kommunikationszentrum (RZ)
2 Links
- General: GPGPU Community; GPU Computing Community
- CUDA: Nvidia CUDA Zone (Toolkit, Profiler, SDK, documentation, ...); PGI's CUDA Fortran; PGI's CUDA-x86
3 Links
- OpenCL: Khronos Group (Specification, Reference Pages, ...); OpenCL + Nvidia; OpenCL + AMD; OpenCL + Intel
- PGI Accelerator: Accelerator Model; User Forum
4 Books
- David Kirk and Wen-mei W. Hwu: Programming Massively Parallel Processors: A Hands-on Approach (2010)
- Jason Sanders and Edward Kandrot: CUDA by Example: An Introduction to General-Purpose GPU Programming (2010)
5 Books
- A. Munshi, B. Gaster, T. Mattson, J. Fung, D. Ginsburg: OpenCL Programming Guide (2011)
- B. Gaster, D. Kaeli, L. Howes, P. Mistry, D. Schaa: Heterogeneous Computing with OpenCL (2011)
6 Contents: Motivation; GPU Architecture (Fermi); Programming Model; Execution Model; Memory Model; Summary; Tools & Libs
7 Overview
GPGPUs = General Purpose Graphics Processing Units
History (a very brief overview):
- 80s - 90s: Development is mainly driven by games. Fixed-function 3D graphics pipeline. Graphics APIs like OpenGL and DirectX become popular.
- Since 2001: Programmable pixel and vertex shaders in the graphics pipeline (adjustments in OpenGL, DirectX). Researchers take notice of the performance growth of GPUs, but tasks must be cast into native graphics operations.
- Since 2006: Vertex/pixel shaders are replaced by a single processor unit. Support for the programming language C, synchronization, ... General-purpose use becomes possible.
8 Known: Parallelization on CPU level
- Shared-memory programming: OpenMP (parallel regions by pragmas, threads)
- Distributed-memory programming: MPI (message passing among processors)
- Performance metrics: FLOPS (floating point operations per second); memory bandwidth/throughput [GB/s]; latency [cycles]; speedup S = T_serial / T_parallel
- Trend towards multicore architectures: clock frequency is at its physical limit
9 Comparison CPU vs. GPU (figure: NVIDIA Corporation 2010)
- CPU: 8 cores
- GPU: massively parallel processors, manycore architecture
- GPU threads: thousands (only a few on the CPU); light-weight, little creation overhead; fast switching
10 Comparison CPU vs. GPU (figure: NVIDIA Corporation 2010)
Similar number of transistors but different design
- CPU: optimized for low latencies; huge caches; control logic for out-of-order and speculative execution
- GPU: optimized for data-parallel throughput; architecture tolerant of memory latency; more transistors dedicated to computation
11 Comparison CPU vs. GPU
Considerations for GPU parallelization:
- Hardware-related programming: knowledge of the hardware is essential
- Code restructuring is usually needed (kernel, data management, data transfer, tuning)
- Very small shared memory
- Global synchronization is not possible within one kernel
- The number of suitable problems is limited
Why GPGPUs?
12 Motivation for GPUs
- Performance: high rate of Flops achievable. Little overhead (threads), 1000s of threads
- (Massive) data parallelism in the application: independent data, uniform operations
- Heterogeneous computer architecture (CPU + GPU): asynchronous computations, overlapping; OpenMP/MPI + GPU parallelization
- Relatively low cost and power consumption ("GreenIT") compared to computers/clusters with similar performance
- A GPU is available in almost every computer
13 Some (programmable) GPU types
- NVIDIA: GeForce (8800GTX, GT220, GTX 470, ...); Quadro (6000, FX 4800, NVS 450, ...); Tesla (C870, C1060, C2050, ...)
- AMD: Radeon (HD 3870, HD 5850, ...); FirePro (3D V3800, 3D V9800, ...); FireStream (9350, ...)
Here we focus on NVIDIA GPUs. However, the fundamentals also apply to GPUs of other vendors.
14 Example SAXPY
SAXPY = Single-precision real Alpha X Plus Y: y = a*x + y
    #include <stdlib.h>

    void saxpycpu(int n, float a, float *x, float *y) {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    int main(int argc, const char* argv[]) {
        int n = 10240;
        float a = 2.0f;
        float* x = (float*) malloc(n * sizeof(float));
        float* y = (float*) malloc(n * sizeof(float));
        // Initialize x, y
        for (int i = 0; i < n; ++i) {
            x[i] = i;
            y[i] = 5.0f*i - 1.0f;
        }
        // Invoke serial SAXPY kernel
        saxpycpu(n, a, x, y);
        free(x); free(y);
        return 0;
    }
15 Example SAXPY
Outlook: SAXPY for GPUs (CUDA C)
    __global__ void saxpy_parallel(int n, float a, float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = a*x[i] + y[i];
        }
    }

    int main(int argc, char* argv[]) {
        int n = 10240;
        float *h_x, *h_y;   // Pointers to CPU memory
        // Allocate and initialize h_x and h_y
        float *d_x, *d_y;   // Pointers to GPU memory
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);
        // Invoke parallel SAXPY kernel
        dim3 threadsPerBlock(128);
        dim3 blocksPerGrid(n / threadsPerBlock.x);   // exact here: n is a multiple of 128
        saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0, d_x, d_y);
        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x); cudaFree(d_y);
        free(h_x); free(h_y);
        return 0;
    }
17 GPU architecture: Fermi (figure: NVIDIA Corporation 2010)
- 3 billion transistors
- Cores / streaming processors (SP): among others, a floating point & integer unit
- Streaming multiprocessors (SM, MP): each comprises 32 cores
- Memory hierarchy
- Peak performance: SP 1.03 TFlops, DP 515 GFlops
- ECC support
- Compute capability 2.0: defines features, e.g. double precision capability, memory access pattern
18-20 Processing flow (figure: NVIDIA Corporation 2010)
Host-directed execution model; host and device are connected via the PCI bus.
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
22 GPGPU paradigms
- CUDA C/C++ (NVIDIA): architecture + programming language, NVIDIA GPUs
- CUDA Fortran (PGI): NVIDIA's CUDA for Fortran, NVIDIA GPUs
- OpenCL C (Khronos Group): open standard, portable, CPU/GPU/...
- PGI Accelerator Model, C/Fortran (PGI): programming using pragmas (syntax similar to OpenMP), NVIDIA GPUs
- OpenACC, C/Fortran (PGI, Cray, CAPS, NVIDIA): directive-based accelerator programming, industry standard published in Nov 2011, (NVIDIA GPUs)
23 Paradigm: CUDA
CUDA = Compute Unified Device Architecture
- CUDA C/C++ (NVIDIA): based on industry standard C/C++. Extensions, e.g. built-in variables, function/variable type qualifiers. Restrictions, e.g. on recursion in kernel functions. Driver API (low level), Runtime API (higher level)
- CUDA Fortran (PGI): analogous to NVIDIA's CUDA C, with some additional features. Only available with the PGI compilers
- Brief timeline: Nov 06: introduction of CUDA, G80 GPU architecture; Jun 07: CUDA Toolkit 1.0; Jun 08: GT200 GPU architecture; March 10: Fermi GPU architecture; Jan 12: CUDA Toolkit 4.1
24 Programming model (figure: NVIDIA Corporation 2010)
Definitions
- Host: CPU, executes functions
- Device: usually GPU, executes kernels
- The parallel portion of the application is executed on the device as a kernel
- A kernel is executed as an array of threads; all threads execute the same code
- Threads are identified by IDs, which are used to select input/output data and to make control decisions:
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
25 Programming model
- Threads are grouped into blocks
- Blocks are grouped into a grid
26 Programming model
- A kernel is executed as a grid of blocks of threads
- Dimensions of blocks and grids: up to 3; threads and blocks are identified by ID tuples
[Figure: the host launches Kernel 1 on a 1D grid (Block 0 to Block 7) and Kernel 2 on a 2D grid (Block (0,0) to Block (1,3)); Block (1,3) is expanded into its threads, Thread (0,0,0) to Thread (2,1,0)]
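As a sketch of how the ID tuples are used in practice, the following CUDA C program launches a 2D grid of 2D blocks over a small matrix. The names `fill_ids` and `run_fill_ids`, the matrix size, and the 16x16 block shape are assumptions chosen for illustration; none of them come from the slides.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread derives its (row, col) from its block and thread ID tuples
// and stores its linear position in the matrix.
__global__ void fill_ids(int rows, int cols, int *out) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {
        out[row * cols + col] = row * cols + col;
    }
}

// Runs the kernel and returns the number of elements that hold the expected
// value, or -1 if a CUDA call fails. (Hypothetical helper.)
int run_fill_ids(int rows, int cols) {
    int n = rows * cols;
    int *d_out = NULL;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;

    // Round the grid size up so that partial tiles at the border are covered.
    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((cols + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (rows + threadsPerBlock.y - 1) / threadsPerBlock.y);
    fill_ids<<<blocksPerGrid, threadsPerBlock>>>(rows, cols, d_out);

    int *h_out = (int *) malloc(n * sizeof(int));
    int correct = -1;
    if (cudaMemcpy(h_out, d_out, n * sizeof(int),
                   cudaMemcpyDeviceToHost) == cudaSuccess) {
        correct = 0;
        for (int i = 0; i < n; ++i) {
            if (h_out[i] == i) ++correct;
        }
    }
    free(h_out);
    cudaFree(d_out);
    return correct;
}

int main() {
    int rows = 100, cols = 70;   // deliberately not multiples of 16
    int correct = run_fill_ids(rows, cols);
    printf("%d of %d elements correct\n", correct, rows * cols);
    return correct == rows * cols ? 0 : 1;
}
```

The bounds check in the kernel matters here because the rounded-up grid contains more threads than matrix elements.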
27 Programming model (CUDA C)
(grey = background information)
- Setup: GPU (e.g. driver, environment variables); download + install the CUDA Toolkit (cf. Links section)
- Compiling CUDA C (on our cluster):
    module load cuda
    nvcc -arch=sm_20 saxpy.cu
- nvcc: Nvidia's compiler for C/C++ GPU code
- -arch=sm_20: sets compute capability 2.0, which enables certain architecture features, e.g. double precision floating point operations
28 Programming model (CUDA C)
Kernel code
- Function qualifiers: __global__, __device__, __host__
- Built-in variables:
    gridDim: contains dimensions of grid (type dim3)
    blockDim: contains dimensions of block (type dim3)
    blockIdx: contains block index within grid (type uint3)
    threadIdx: contains thread index within block (type uint3)
- Compute unique IDs, e.g. global 1D index: gidx = blockIdx.x * blockDim.x + threadIdx.x
Kernel usage
- Compiling with nvcc (creating PTX code)
- Kernel arguments can be passed directly to the kernel
- Kernel invocation with execution configuration (chevron syntax): func<<<dimGrid, dimBlock>>>(parameter)
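The three function qualifiers can be seen side by side in a minimal sketch like the one below. The helper names `axpy`, `global_index`, `axpy_kernel`, and `run_axpy` are hypothetical, and the input values are chosen as small integers so that the CPU and GPU results agree exactly in single precision.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// __host__ __device__: compiled for both CPU and GPU, so the host can reuse
// it to verify the result.
__host__ __device__ float axpy(float a, float x, float y) {
    return a * x + y;
}

// __device__: callable only from device code.
__device__ int global_index() {
    return blockIdx.x * blockDim.x + threadIdx.x;
}

// __global__: a kernel, launched from the host with the chevron syntax.
__global__ void axpy_kernel(int n, float a, const float *x, float *y) {
    int gidx = global_index();
    if (gidx < n) {
        y[gidx] = axpy(a, x[gidx], y[gidx]);
    }
}

// Returns the number of mismatching elements, or -1 on a CUDA error.
int run_axpy(int n, float a) {
    size_t bytes = n * sizeof(float);
    float *h_x = (float *) malloc(bytes);
    float *h_y = (float *) malloc(bytes);
    for (int i = 0; i < n; ++i) { h_x[i] = (float) i; h_y[i] = 1.0f; }

    float *d_x = NULL, *d_y = NULL;
    int mismatches = -1;
    if (cudaMalloc(&d_x, bytes) == cudaSuccess &&
        cudaMalloc(&d_y, bytes) == cudaSuccess &&
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
        dim3 dimBlock(128);
        dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x);
        axpy_kernel<<<dimGrid, dimBlock>>>(n, a, d_x, d_y);
        if (cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            mismatches = 0;
            for (int i = 0; i < n; ++i) {
                // The original y[i] was 1.0f for every element.
                if (h_y[i] != axpy(a, h_x[i], 1.0f)) ++mismatches;
            }
        }
    }
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return mismatches;
}

int main() {
    int mismatches = run_axpy(1000, 2.0f);
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```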
29 Programming model (CUDA Fortran)
(blue = background information)
- Setup: GPU (e.g. driver); PGI compiler
- Compiling CUDA Fortran (on our cluster):
    module switch intel pgi[/version]
    pgf90 -Mcuda[=cc20,4.0] saxpy.cuf
- -Mcuda: enables CUDA Fortran extensions
- cc20: generates code for a device with compute capability 2.0
- 4.0: uses CUDA Toolkit 4.0
- .cuf: free-format CUDA Fortran program
- .CUF: program is processed by the preprocessor before being compiled
30 Programming model (CUDA Fortran)
Kernel code
- Subroutine/function qualifiers: attributes(global), attributes(device), attributes(host)
- Built-in variables:
    gridDim: contains dimensions of grid (type(dim3))
    blockDim: contains dimensions of block (type(dim3))
    blockIdx: contains block index within grid (type(dim3))
    threadIdx: contains thread index within block (type(dim3))
- Compute unique IDs, e.g. global 1D index (indices start at 1): gidx = (blockIdx%x - 1) * blockDim%x + threadIdx%x
Kernel usage
- Compiling with pgf90 or pgfortran (creating PTX code)
- Kernel arguments can be passed directly to the kernel
- Kernel invocation with execution configuration (chevron syntax): call func<<<dimGrid, dimBlock>>>(parameter)
31 Example SAXPY: Kernel usage
C/C++:
    __global__ void saxpy_parallel(int n, float a, float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = a*x[i] + y[i];
        }
    }

    int main(int argc, char* argv[]) {
        [..]
        // Invoke parallel SAXPY kernel
        dim3 threadsPerBlock(128);
        dim3 blocksPerGrid(n / threadsPerBlock.x);
        saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0, d_x, d_y);
        [..]
    }
Fortran:
    module saxpy
      use cudafor
    contains
      attributes(global) subroutine saxpy_parallel(n, a, x, y)
        integer, intent(in), value :: n
        real, intent(in), value :: a
        real, intent(in), device :: x(n)
        real, intent(inout), device :: y(n)
        integer :: i
        i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
        if (i <= n) then
          y(i) = a*x(i) + y(i)
        end if
      end subroutine saxpy_parallel
    end module saxpy

    program main
      use saxpy
      [..]
      ! Invoke parallel SAXPY kernel
      threadsPerBlock = dim3(128,1,1)
      blocksPerGrid = dim3(n/threadsPerBlock%x,1,1)
      call saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, a, d_x, d_y)
      [..]
    end program main
32 Programming model
Why blocks?
- Cooperation of threads within a block is possible: synchronization (barrier); sharing data/results using shared memory
- Scalability: fast communication between n threads is not feasible when n is large. But: blocks are executed independently, so blocks can be distributed across an arbitrary number of multiprocessors
Number of blocks (with #threads fixed)?
- Few blocks: many threads can communicate
- Many blocks: good scaling
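Cooperation within a block can be illustrated with a per-block sum: the threads of one block share data in shared memory and meet at a barrier after every step. The sketch below is one common way to write such a reduction, not code from the slides; `block_sum`, `gpu_sum_1_to_n`, and the block size of 128 are assumptions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 128   // must be a power of two for this reduction

// Each block sums its slice of `in` in shared memory and writes one partial sum.
__global__ void block_sum(int n, const int *in, int *partial) {
    __shared__ int cache[THREADS_PER_BLOCK];
    int gidx = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (gidx < n) ? in[gidx] : 0;
    __syncthreads();   // barrier: all loads are visible to the whole block

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        }
        __syncthreads();   // finish this step before starting the next one
    }
    if (threadIdx.x == 0) {
        partial[blockIdx.x] = cache[0];
    }
}

// Sums 1 + 2 + ... + n on the GPU; returns -1 on a CUDA error.
long long gpu_sum_1_to_n(int n) {
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    int *h_in = (int *) malloc(n * sizeof(int));
    int *h_partial = (int *) malloc(blocks * sizeof(int));
    for (int i = 0; i < n; ++i) h_in[i] = i + 1;

    int *d_in = NULL, *d_partial = NULL;
    long long total = -1;
    if (cudaMalloc(&d_in, n * sizeof(int)) == cudaSuccess &&
        cudaMalloc(&d_partial, blocks * sizeof(int)) == cudaSuccess &&
        cudaMemcpy(d_in, h_in, n * sizeof(int),
                   cudaMemcpyHostToDevice) == cudaSuccess) {
        block_sum<<<blocks, THREADS_PER_BLOCK>>>(n, d_in, d_partial);
        if (cudaMemcpy(h_partial, d_partial, blocks * sizeof(int),
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            // Blocks cannot synchronize with each other inside the kernel,
            // so the per-block results are combined on the host.
            total = 0;
            for (int b = 0; b < blocks; ++b) total += h_partial[b];
        }
    }
    cudaFree(d_in); cudaFree(d_partial);
    free(h_in); free(h_partial);
    return total;
}

int main() {
    printf("sum = %lld\n", gpu_sum_1_to_n(10000));
    return 0;
}
```

The final host-side loop reflects the trade-off on the slide: threads inside a block can communicate cheaply, while blocks stay independent and therefore scale across multiprocessors.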
33-35 Programming model: scalability (figures: NVIDIA Corporation)
Figure sequence showing how the blocks of a grid are distributed across the multiprocessors of three GPU generations:
- G84 (very old architecture)
- G80 (medium old architecture)
- GT200 (last architecture): some multiprocessors are marked as idle
37 Execution model
- Host-directed execution model: the main program runs on the host; certain code regions run on the device
- Execution configuration: <<<blocksPerGrid, threadsPerBlock>>>
- Warps: threads execute in groups of 32; threads in a warp share the same program counter; single instruction, multiple threads (SIMT)
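How a block is split into warps can be made visible with a small experiment. In this sketch, which assumes a 1D block, every thread records the index of its warp; `record_warps` and `warps_used` are hypothetical names. The position of a thread inside its warp (its lane) would be `threadIdx.x % warpSize`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Every thread records which warp of its block it belongs to.
// warpSize is a built-in device variable (32 on the GPUs discussed here).
__global__ void record_warps(int *warp_of_thread) {
    int t = threadIdx.x;
    warp_of_thread[t] = t / warpSize;
}

// Launches one block of `threads` threads and returns how many warps it
// occupied, or -1 on a CUDA error.
int warps_used(int threads) {
    size_t bytes = threads * sizeof(int);
    int *h_warp = (int *) malloc(bytes);
    int *d_warp = NULL;
    int warps = -1;
    if (cudaMalloc(&d_warp, bytes) == cudaSuccess) {
        record_warps<<<1, threads>>>(d_warp);
        if (cudaMemcpy(h_warp, d_warp, bytes,
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            // Warp index of the last thread, plus one.
            warps = h_warp[threads - 1] + 1;
        }
    }
    cudaFree(d_warp);
    free(h_warp);
    return warps;
}

int main() {
    int threads = 100;   // deliberately not a multiple of 32
    printf("%d threads occupy %d warps\n", threads, warps_used(threads));
    return 0;
}
```

A block size that is not a multiple of 32 leaves the last warp partially filled, which is one reason block sizes such as 128 are common choices.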
38 Execution model
- Thread -> Core: each thread is executed by a core
- Block -> Multiprocessor: each block is executed on a multiprocessor; several concurrent blocks can reside on one MP, depending on memory resources
- Grid (Kernel) -> Device: each kernel is executed on a device
40 Memory model
- Host + device memory = separate entities: no coherence between host + device; manual data synchronization/transfer
- Host: (de-)allocates device memory (global, constant, texture); triggers data transfers
- Device: works on device memory (hierarchy)
41 Memory model
- Thread: registers, local memory
- Block: shared memory, on-chip. Tesla C1060: 16 KB; Fermi: 16 KB shared + 48 KB L1 or 48 KB shared + 16 KB L1
- Grid/application: constant memory (64 KB; read-only; off-chip; cached); global memory (up to 6 GB; off-chip; Fermi: L2 cache)
- Host: host memory
[Figure: device with multiprocessors 1..n, each holding per-core registers (cores 1..m), shared memory and an L1 cache; all multiprocessors share the L2 cache and the global/constant memory; the host has its own host memory]
42 Memory model (CUDA C)
- Variable type qualifiers: __device__, __shared__, __constant__
- Memory management: cudaMalloc(pointerToGpuMem, size); cudaFree(pointerToGpuMem)
- Memory transfer (synchronous): cudaMemcpy(dest, src, size, direction), with direction one of cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
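The `__constant__` and `__device__` qualifiers declare variables that the host cannot access through ordinary pointers. A possible sketch is shown below; it uses the runtime calls `cudaMemcpyToSymbol` and `cudaMemcpyFromSymbol`, which are not mentioned on the slide, and the names `coeff`, `processed`, `scale_shift`, and `run_scale_shift` are made up for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__constant__ float coeff[2];   // read-only inside kernels, cached, set by the host
__device__ int processed;      // lives in global memory, visible to all threads

// y[i] = coeff[0] * x[i] + coeff[1]; also counts the elements it handled.
__global__ void scale_shift(int n, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = coeff[0] * x[i] + coeff[1];
        atomicAdd(&processed, 1);
    }
}

// Fills h_y (n floats) for x[i] = i and returns the element count reported
// by the device, or -1 on a CUDA error.
int run_scale_shift(int n, float scale, float shift, float *h_y) {
    float h_coeff[2] = { scale, shift };
    int zero = 0, count = -1;
    size_t bytes = n * sizeof(float);
    float *h_x = (float *) malloc(bytes);
    for (int i = 0; i < n; ++i) h_x[i] = (float) i;

    float *d_x = NULL, *d_y = NULL;
    if (cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff)) == cudaSuccess &&
        cudaMemcpyToSymbol(processed, &zero, sizeof(int)) == cudaSuccess &&
        cudaMalloc(&d_x, bytes) == cudaSuccess &&
        cudaMalloc(&d_y, bytes) == cudaSuccess &&
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
        scale_shift<<<(n + 127) / 128, 128>>>(n, d_x, d_y);
        if (cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost) != cudaSuccess ||
            cudaMemcpyFromSymbol(&count, processed, sizeof(int)) != cudaSuccess) {
            count = -1;
        }
    }
    cudaFree(d_x); cudaFree(d_y);
    free(h_x);
    return count;
}

int main() {
    float y[1000];
    int count = run_scale_shift(1000, 2.0f, 1.0f, y);
    printf("processed %d elements, y[3] = %.1f\n", count, y[3]);
    return count == 1000 ? 0 : 1;
}
```

Constant memory suits values such as coefficients that every thread reads but no thread writes; the `__shared__` qualifier, by contrast, is only meaningful inside a kernel.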
43 Memory model (CUDA Fortran)
- Variable type qualifiers: attributes(device), attributes(shared), attributes(constant), attributes(pinned), attributes(value)
- Memory management: cudaMalloc(pointerToGpuMem, size); cudaFree(pointerToGpuMem)
- Memory transfer (synchronous) by assignment statements:
    var_dev = var_host (CPU to GPU transfer)
    var_host = var_dev (GPU to CPU transfer)
    var1_dev = var2_dev (copy on GPU)
- Memory transfer (synchronous) by runtime routine: cudaMemcpy(dest, src, size, direction), with direction one of cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
44 Example SAXPY: Memory
C/C++:
    int main(int argc, char* argv[]) {
        float *h_x, *h_y;   // host pointers
        // Allocate and initialize h_x and h_y
        float *d_x, *d_y;   // device pointers
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);
        // Invoke parallel SAXPY kernel
        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x); cudaFree(d_y);
        free(h_x); free(h_y);
        return 0;
    }
Fortran:
    program main
      use saxpy
      real, allocatable :: h_x(:), h_y(:)            ! host arrays
      real, allocatable, device :: d_x(:), d_y(:)    ! device arrays
      allocate(h_x(n), h_y(n), d_x(n), d_y(n))
      ! Initialize h_x and h_y
      d_x = h_x
      d_y = h_y
      ! Invoke parallel SAXPY kernel
      h_y = d_y
      deallocate(h_x, h_y, d_x, d_y)
    end program main

    attributes(global) subroutine saxpy_parallel(n, a, x, y)
      integer, intent(in), value :: n
      real, intent(in), value :: a
      real, intent(in), device :: x(n)
      real, intent(inout), device :: y(n)
      [..]
    end subroutine saxpy_parallel
46 Summary
3 steps for a basic program with CUDA
    __global__ void saxpy_parallel(int n, float a, float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = a*x[i] + y[i];
        }
    }

    int main(int argc, char* argv[]) {
        int n = 10240;
        float *h_x, *h_y;   // Pointers to CPU memory
        // Allocate and initialize h_x and h_y
        float *d_x, *d_y;   // Pointers to GPU memory

        // 1. Allocate data on GPU + transfer data to GPU
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

        // 2. Launch kernel
        dim3 threadsPerBlock(128);
        dim3 blocksPerGrid(n / threadsPerBlock.x);
        saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0, d_x, d_y);

        // 3. Transfer data to CPU + free data on GPU
        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x); cudaFree(d_y);
        free(h_x); free(h_y);
        return 0;
    }
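For reference, the three steps can also be written as a complete program. The following is a sketch under a few assumptions that go beyond the slide: the host arrays are allocated with `malloc` and initialized as in the CPU example, the grid size is rounded up so that `n` need not be a multiple of the block size, and a hypothetical `CUDA_CHECK` macro reports runtime errors.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

__global__ void saxpy_parallel(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Runs SAXPY on the GPU for x[i] = i, y[i] = 5*i - 1 and returns the number
// of elements that differ from the CPU result.
int saxpy_mismatches(int n, float a) {
    size_t bytes = n * sizeof(float);
    float *h_x = (float *) malloc(bytes);
    float *h_y = (float *) malloc(bytes);
    float *h_ref = (float *) malloc(bytes);
    for (int i = 0; i < n; ++i) {
        h_x[i] = (float) i;
        h_y[i] = 5.0f * i - 1.0f;
        h_ref[i] = a * h_x[i] + h_y[i];   // CPU reference
    }

    // 1. Allocate data on GPU + transfer data to GPU
    float *d_x, *d_y;
    CUDA_CHECK(cudaMalloc(&d_x, bytes));
    CUDA_CHECK(cudaMalloc(&d_y, bytes));
    CUDA_CHECK(cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice));

    // 2. Launch kernel (grid size rounded up to cover all n elements)
    dim3 threadsPerBlock(128);
    dim3 blocksPerGrid((n + threadsPerBlock.x - 1) / threadsPerBlock.x);
    saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, a, d_x, d_y);
    CUDA_CHECK(cudaGetLastError());   // reports invalid launch configurations

    // 3. Transfer data to CPU + free data on GPU
    CUDA_CHECK(cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_x));
    CUDA_CHECK(cudaFree(d_y));

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_y[i] != h_ref[i]) ++mismatches;
    }
    free(h_x); free(h_y); free(h_ref);
    return mismatches;
}

int main() {
    int mismatches = saxpy_mismatches(10000, 2.0f);   // 10000 is not a multiple of 128
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

The exact comparison against the CPU reference works here only because all values are small integers that single precision represents exactly; for general data a tolerance would be the safer choice.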
47 Summary
- Processing flow: copy data from host to device; execute GPU code (kernel) in parallel; copy data from device to host
- A kernel executes a grid of blocks of threads
- Memory hierarchy on GPU: thread: registers, local; block: shared; grid: global
- Use GPUs properly! Launch many, many threads; uniform operations on data (selected via thread ID); use all available resources (GPU + CPU)
49 CUDA Tools
Debugger
- cuda-gdb: extended gdb (usable via ddd), NVIDIA (free of charge)
- cuda-memcheck: discovers memory access errors, NVIDIA (free of charge)
- TotalView: GUI (memory) debugger, Linux, RogueWave
- DDT: GUI (memory) debugger, Linux, Allinea
- Parallel Nsight: Windows, integrated in Visual Studio, NVIDIA (free of charge)
Profiling/tracing
- Visual Profiler: performance analysis w/ HW counters, NVIDIA (free of charge)
- VampirTrace: performance monitoring (tracing), TU Dresden
50 CUDA Libraries
NVIDIA
- cuBLAS: dense linear algebra (subset of BLAS)
- cuSPARSE: sparse linear algebra
- cuFFT: discrete Fourier transforms
- cuRAND: random number generation
- NPP: signal and image processing
- Thrust: STL/Boost-style template library (e.g. scan, sort, reduce, transform)
- math.h: basics, exponentials, trigonometry, ... (e.g. sin, ceil, round)
Third party
- CULA: dense/sparse linear algebra (subset of LAPACK)
- MAGMA: dense linear algebra (subset of BLAS, LAPACK)
- IMSL: Fortran numerical library, utilizes cuBLAS
- NAG: numeric libraries (e.g. RNGs)
- libjacket: math, signal processing, image processing, statistics
Open source
- CUDPP: data parallel primitives (e.g. scan, sort, reduction)
- CUSP: sparse linear algebra, graph computations
- OpenCurrent: partial differential equations
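To give an impression of what using one of these libraries looks like, here is a sketch of the SAXPY operation written with Thrust instead of a hand-written kernel. The functor name `saxpy_functor` and the helper `thrust_saxpy_sum` are assumptions for illustration; the Thrust calls themselves (`sequence`, `fill`, `transform`, `reduce`) are part of the library.

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/fill.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>

// Functor applied element-wise by thrust::transform.
struct saxpy_functor {
    float a;
    explicit saxpy_functor(float a_) : a(a_) {}
    __host__ __device__ float operator()(float x, float y) const {
        return a * x + y;
    }
};

// Computes y = a*x + y on the GPU with x = 0, 1, ..., n-1 and y = 1,
// then returns the sum of all elements of y.
float thrust_saxpy_sum(int n, float a) {
    thrust::device_vector<float> x(n), y(n);   // allocated in GPU memory
    thrust::sequence(x.begin(), x.end());      // x = 0, 1, 2, ...
    thrust::fill(y.begin(), y.end(), 1.0f);    // y = 1, 1, 1, ...
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                      saxpy_functor(a));
    return thrust::reduce(y.begin(), y.end(), 0.0f);
}

int main() {
    printf("sum(y) = %.1f\n", thrust_saxpy_sum(100, 2.0f));
    return 0;
}
```

Memory allocation, data transfer, and the choice of execution configuration are handled by the library here, which is the main appeal compared to the explicit three-step program.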
926535897932384626433832795028841971693993754918980183 592653589793238462643383279502884197169399375491898018 415926535897932384626433832795028841971693993754918980 592653589793238462643383279502884197169399375491898018
More informationCUDA C Programming Mark Harris NVIDIA Corporation
CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment
More informationModule 2: Introduction to CUDA C. Objective
ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding
More informationReal-time Graphics 9. GPGPU
Real-time Graphics 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing
More informationOpenMP and GPU Programming
OpenMP and GPU Programming GPU Intro Emanuele Ruffaldi https://github.com/eruffaldi/course_openmpgpu PERCeptual RObotics Laboratory, TeCIP Scuola Superiore Sant Anna Pisa,Italy e.ruffaldi@sssup.it April
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux
More informationHIGH PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS
HIGH PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS Timothy Lanfear, NVIDIA ? WHAT IS GPU COMPUTING? What is GPU Computing? x86 PCIe bus GPU Computing with CP PU + GPU Heterogeneous Computing Low Latency
More informationBasic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Basic Elements of CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of
More informationUniversity of Bielefeld
Geistes-, Natur-, Sozial- und Technikwissenschaften gemeinsam unter einem Dach Introduction to GPU Programming using CUDA Olaf Kaczmarek University of Bielefeld STRONGnet Summerschool 2011 ZIF Bielefeld
More informationG P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G
Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty
More informationGPU COMPUTING. Ana Lucia Varbanescu (UvA)
GPU COMPUTING Ana Lucia Varbanescu (UvA) 2 Graphics in 1980 3 Graphics in 2000 4 Graphics in 2015 GPUs in movies 5 From Ariel in Little Mermaid to Brave So 6 GPUs are a steady market Gaming CAD-like activities
More informationIntroduction to GPU programming. Introduction to GPU programming p. 1/17
Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance
More informationParallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer
Parallel Programming and Debugging with CUDA C Geoff Gerfin Sr. System Software Engineer CUDA - NVIDIA s Architecture for GPU Computing Broad Adoption Over 250M installed CUDA-enabled GPUs GPU Computing
More informationCUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.
Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationBasics of CADA Programming - CUDA 4.0 and newer
Basics of CADA Programming - CUDA 4.0 and newer Feb 19, 2013 Outline CUDA basics Extension of C Single GPU programming Single node multi-gpus programing A brief introduction on the tools Jacket CUDA FORTRAN
More informationGraph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.
Graph Partitioning Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Partition given graph G=(V,E) in k subgraphs of nearly equal
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationINTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro
INTRODUCTION TO GPU COMPUTING WITH CUDA Topi Siro 19.10.2015 OUTLINE PART I - Tue 20.10 10-12 What is GPU computing? What is CUDA? Running GPU jobs on Triton PART II - Thu 22.10 10-12 Using libraries Different
More informationCUDA Parallel Programming Model. Scalable Parallel Programming with CUDA
CUDA Parallel Programming Model Scalable Parallel Programming with CUDA Some Design Goals Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms not mechanics of
More informationIntroduction to GPGPUs and to CUDA programming model
Introduction to GPGPUs and to CUDA programming model www.cineca.it Marzia Rivi m.rivi@cineca.it GPGPU architecture CUDA programming model CUDA efficient programming Debugging & profiling tools CUDA libraries
More informationAccelerating image registration on GPUs
Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining
More informationCUDA Workshop. High Performance GPU computing EXEBIT Karthikeyan
CUDA Workshop High Performance GPU computing EXEBIT- 2014 Karthikeyan CPU vs GPU CPU Very fast, serial, Low Latency GPU Slow, massively parallel, High Throughput Play Demonstration Compute Unified Device
More informationCUDA Programming. Aiichiro Nakano
CUDA Programming Aiichiro Nakano Collaboratory for Advanced Computing & Simulations Department of Computer Science Department of Physics & Astronomy Department of Chemical Engineering & Materials Science
More informationGeneral-purpose computing on graphics processing units (GPGPU)
General-purpose computing on graphics processing units (GPGPU) Thomas Ægidiussen Jensen Henrik Anker Rasmussen François Rosé November 1, 2010 Table of Contents Introduction CUDA CUDA Programming Kernels
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationAdvanced OpenMP Features
Christian Terboven, Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group {terboven,schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Vectorization 2 Vectorization SIMD =
More informationGPU Architecture and Programming. Andrei Doncescu inspired by NVIDIA
GPU Architecture and Programming Andrei Doncescu inspired by NVIDIA Traditional Computing Von Neumann architecture: instructions are sent from memory to the CPU Serial execution: Instructions are executed
More informationCUDA Parallel Programming Model Michael Garland
CUDA Parallel Programming Model Michael Garland NVIDIA Research Some Design Goals Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms not mechanics of a parallel
More informationExotic Methods in Parallel Computing [GPU Computing]
Exotic Methods in Parallel Computing [GPU Computing] Frank Feinbube Exotic Methods in Parallel Computing Dr. Peter Tröger Exotic Methods in Parallel Computing FF 2012 Architectural Shift 2 Exotic Methods
More informationTutorial: Parallel programming technologies on hybrid architectures HybriLIT Team
Tutorial: Parallel programming technologies on hybrid architectures HybriLIT Team Laboratory of Information Technologies Joint Institute for Nuclear Research The Helmholtz International Summer School Lattice
More informationHybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS
+ Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics
More informationThreading Hardware in G80
ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &
More informationGPU Computing with CUDA. Part 2: CUDA Introduction
GPU Computing with CUDA Part 2: CUDA Introduction Dortmund, June 4, 2009 SFB 708, AK "Modellierung und Simulation" Dominik Göddeke Angewandte Mathematik und Numerik TU Dortmund dominik.goeddeke@math.tu-dortmund.de
More informationCUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)
CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration
More informationIntroduction to CUDA (1 of n*)
Agenda Introduction to CUDA (1 of n*) GPU architecture review CUDA First of two or three dedicated classes Joseph Kider University of Pennsylvania CIS 565 - Spring 2011 * Where n is 2 or 3 Acknowledgements
More informationGetting Started with CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator
Getting Started with CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator Heterogeneous Computing CPU GPU Once upon a time Past Massively Parallel Supercomputers Goodyear MPP Thinking Machine MasPar Cray 2 1.31
More informationParallel Hybrid Computing F. Bodin, CAPS Entreprise
Parallel Hybrid Computing F. Bodin, CAPS Entreprise Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous
More informationModule 2: Introduction to CUDA C
ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding
More informationRegister file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks.
Sharing the resources of an SM Warp 0 Warp 1 Warp 47 Register file A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks Shared A single SRAM (ex. 16KB)
More informationCSE 160 Lecture 24. Graphical Processing Units
CSE 160 Lecture 24 Graphical Processing Units Announcements Next week we meet in 1202 on Monday 3/11 only On Weds 3/13 we have a 2 hour session Usual class time at the Rady school final exam review SDSC
More informationHiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.
HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation
More informationHigh Performance Computing and GPU Programming
High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz
More informationCS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST
CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter
More informationParallel Systems Course: Chapter IV. GPU Programming. Jan Lemeire Dept. ETRO November 6th 2008
Parallel Systems Course: Chapter IV GPU Programming Jan Lemeire Dept. ETRO November 6th 2008 GPU Message-passing Programming with Parallel CUDAMessagepassing Parallel Processing Processing Overview 1.
More information