Introduction to GPGPUs


Slide 1: Introduction to GPGPUs
Sandra Wienke, M.Sc.
PPCES 2012, Rechen- und Kommunikationszentrum (RZ)

Slide 2: Links
General
- GPGPU Community
- GPU Computing Community
CUDA
- NVIDIA CUDA Zone (Toolkit, Profiler, SDK, documentation, ...)
- PGI's CUDA Fortran
- PGI's CUDA-x86

Slide 3: Links
OpenCL
- Khronos Group (Specification, Reference Pages, ...)
- OpenCL + NVIDIA
- OpenCL + AMD
- OpenCL + Intel
PGI Accelerator
- Accelerator Model
- User Forum

Slide 4: Books
- David Kirk and Wen-mei W. Hwu: Programming Massively Parallel Processors: A Hands-on Approach (2010)
- Jason Sanders and Edward Kandrot: CUDA by Example: An Introduction to General-Purpose GPU Programming (2010)

Slide 5: Books
- A. Munshi, B. Gaster, T. Mattson, J. Fung, D. Ginsburg: OpenCL Programming Guide (2011)
- B. Gaster, D. Kaeli, L. Howes, P. Mistry, D. Schaa: Heterogeneous Computing with OpenCL (2011)

Slide 6: Contents
- Motivation
- GPU Architecture (Fermi)
- Programming Model
- Execution Model
- Memory Model
- Summary
- Tools & Libs

Slide 7: Overview
GPGPUs = General Purpose Graphics Processing Units
History: a very brief overview
- 80s to 90s: Development is mainly driven by games
  - Fixed-function 3D graphics pipeline
  - Graphics APIs like OpenGL and DirectX are popular
- Since 2001: Programmable pixel and vertex shaders in the graphics pipeline (adjustments in OpenGL, DirectX)
  - Researchers take notice of the performance growth of GPUs
  - Tasks must be cast into native graphics operations
- Since 2006: Vertex/pixel shaders are replaced by a single processor unit
  - Support for the programming language C, synchronization, ...
  - General purpose

Slide 8: Known
Parallelization on CPU level
- Shared-memory programming: OpenMP, parallel regions by pragmas (threads)
- Distributed-memory programming: MPI, message passing among processors
Performance metrics
- FLOPS: floating point operations per second
- Memory bandwidth/throughput [GB/s]
- Latency [cycles]
- Speedup: S = T_serial / T_parallel
Trend towards multicore architectures
- Clock frequency at physical limit

Slide 9: Comparison CPU vs. GPU
Figure (NVIDIA Corporation 2010): CPU and GPU side by side, with the labels "8 cores", "Massively Parallel Processors" and "Manycore Architecture".
GPU threads
- Thousands (only a few on the CPU)
- Lightweight, little creation overhead
- Fast switching

Slide 10: Comparison CPU vs. GPU
Similar number of transistors but different design (figure: NVIDIA Corporation 2010)
CPU
- Optimized for low latencies
- Huge caches
- Control logic for out-of-order and speculative execution
GPU
- Optimized for data-parallel throughput
- Architecture tolerant of memory latency
- More transistors dedicated to computation

Slide 11: Comparison CPU vs. GPU
Considerations for GPU parallelization
- Hardware-related programming: knowledge of the hardware is essential
- Code restructuring is usually needed (kernel, data management, data transfer, tuning)
- Very small shared memory
- Global synchronization is not possible within one kernel
- The number of suitable problems is limited
Why GPGPUs?

Slide 12: Motivation for GPUs
- Performance: a high rate of Flops is achievable
  - Little overhead (threads), 1000s of threads
  - (Massive) data parallelism in the application: independent data, uniform operations
- Heterogeneous computer architecture (CPU + GPU)
  - Asynchronous computations, overlapping
  - OpenMP/MPI + GPU parallelization
- Relatively low cost and power consumption ("GreenIT"), compared to computers/clusters with similar performance
- A GPU is available in almost every computer

Slide 13: Some (programmable) GPU types
NVIDIA
- GeForce: 8800GTX, GT220, GTX 470, ...
- Quadro: 6000, FX 4800, NVS 450, ...
- Tesla: C870, C1060, C2050, ...
AMD
- Radeon: HD 3870, HD 5850, ...
- FirePro: 3D V3800, 3D V9800, ...
- FireStream: 9350, ...
Here we focus on NVIDIA GPUs. However, the fundamentals also apply to GPUs of other vendors.

Slide 14: Example SAXPY
SAXPY = Single-precision real Alpha X Plus Y: y = a*x + y

    #include <stdlib.h>

    void saxpycpu(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    int main(int argc, const char* argv[])
    {
        int n = 10240;
        float a = 2.0f;
        float* x;
        float* y;
        x = (float*) malloc(n * sizeof(float));
        y = (float*) malloc(n * sizeof(float));

        // Initialize x, y
        for (int i = 0; i < n; ++i) {
            x[i] = i;
            y[i] = 5.0*i - 1.0;
        }

        // Invoke serial SAXPY kernel
        saxpycpu(n, a, x, y);

        free(x);
        free(y);
        return 0;
    }

Slide 15: Example SAXPY
Outlook: SAXPY for GPUs (CUDA C)

    #include <stdlib.h>

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = a*x[i] + y[i];
        }
    }

    int main(int argc, char* argv[])
    {
        int n = 10240;
        float *h_x, *h_y;   // Pointer to CPU memory
        // Allocate and initialize h_x and h_y
        h_x = (float*) malloc(n * sizeof(float));
        h_y = (float*) malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) { h_x[i] = i; h_y[i] = 5.0*i - 1.0; }

        float *d_x, *d_y;   // Pointer to GPU memory
        cudaMalloc(&d_x, n*sizeof(float));
        cudaMalloc(&d_y, n*sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

        // Invoke parallel SAXPY kernel
        dim3 threadsperblock(128);
        dim3 blockspergrid(n/threadsperblock.x);   // exact here: n is a multiple of 128
        saxpy_parallel<<<blockspergrid, threadsperblock>>>(n, 2.0, d_x, d_y);

        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x); cudaFree(d_y);
        free(h_x); free(h_y);
        return 0;
    }


Slide 17: GPU architecture: Fermi (figure: NVIDIA Corporation 2010)
- 3 billion transistors
- Cores / streaming processors (SP): each includes, among other things, a floating point and an integer unit
- Streaming multiprocessors (SM, MP): each comprises 32 cores
- Memory hierarchy
- Peak performance: SP 1.03 TFlops, DP 515 GFlops
- ECC support
- Compute capability 2.0: defines features, e.g. double precision capability, memory access pattern
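
The peak figures quoted for Fermi can be reproduced with a short calculation. The sketch below assumes the Tesla C2050 values of 448 cores at 1.15 GHz and one fused multiply-add (2 floating point operations) per core and cycle; these numbers do not appear in the slide text and serve only as an illustration. The function name `peak_gflops` is hypothetical.

```c
#include <assert.h>

/* Theoretical peak in GFlops:
 * cores * clock [GHz] * floating point operations per core and cycle. */
double peak_gflops(int cores, double clock_ghz, double flops_per_cycle)
{
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_cycle > 0.0);
    return cores * clock_ghz * flops_per_cycle;
}
```

With the assumed values, `peak_gflops(448, 1.15, 2.0)` gives roughly 1030 GFlops in single precision. Halving the per-cycle rate for double precision gives roughly 515 GFlops, which agrees with the two figures on the slide.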



Slide 20: Processing flow (figure: NVIDIA Corporation 2010)
Host-directed execution model (CPU and GPU communicate over the PCI bus)
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
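
The three steps of the processing flow can be mimicked on the host alone. The following is a minimal sketch, not GPU code: two separately allocated buffers stand in for device memory, `memcpy` stands in for the host-to-device and device-to-host transfers, and an ordinary loop stands in for the kernel. The names `saxpy_flow` and `saxpy_on_device` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the kernel: on a GPU each loop iteration would be one thread. */
static void saxpy_on_device(int n, float a, const float *d_x, float *d_y)
{
    for (int i = 0; i < n; ++i)
        d_y[i] = a * d_x[i] + d_y[i];
}

/* Host-only illustration of the three-step flow.
 * Returns 0 on success, -1 if an allocation fails. */
int saxpy_flow(int n, float a, const float *h_x, float *h_y)
{
    assert(n > 0 && h_x != NULL && h_y != NULL);
    size_t bytes = (size_t)n * sizeof(float);

    /* Separate "device" buffers: host and device memory are distinct. */
    float *d_x = malloc(bytes);
    float *d_y = malloc(bytes);
    if (d_x == NULL || d_y == NULL) {
        free(d_x);
        free(d_y);
        return -1;
    }

    /* 1. Copy input data from CPU memory to "GPU" memory. */
    memcpy(d_x, h_x, bytes);
    memcpy(d_y, h_y, bytes);

    /* 2. Execute the "kernel" on the device buffers only. */
    saxpy_on_device(n, a, d_x, d_y);

    /* 3. Copy results from "GPU" memory back to CPU memory. */
    memcpy(h_y, d_y, bytes);

    free(d_x);
    free(d_y);
    return 0;
}
```

The point of the sketch is that the computation never touches `h_x` or `h_y` directly: results only become visible on the host after the explicit copy in step 3.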


Slide 22: GPGPU paradigms
- CUDA C/C++ (NVIDIA): architecture + programming language, NVIDIA GPUs
- CUDA Fortran (PGI): NVIDIA's CUDA for Fortran, NVIDIA GPUs
- OpenCL C (Khronos Group): open standard, portable, CPU/GPU/...
- PGI Accelerator Model, C/Fortran (PGI): programming using pragmas (syntax similar to OpenMP), NVIDIA GPUs
- OpenACC, C/Fortran (PGI, Cray, CAPS, NVIDIA): directive-based accelerator programming, industry standard published in Nov 2011 (NVIDIA GPUs)

Slide 23: Paradigm CUDA
CUDA = Compute Unified Device Architecture
CUDA C/C++ (NVIDIA)
- Based on industry-standard C/C++
- Extensions, e.g. built-in variables, function/variable type qualifiers
- Restrictions, e.g. recursion in kernel functions
- Driver API (low level), Runtime API (higher level)
CUDA Fortran (PGI)
- Analogous to NVIDIA's CUDA C, with some additional features
- Only available with the PGI compilers
Brief timeline
- Nov 06: Introduction of CUDA, G80 GPU architecture
- Jun 07: CUDA Toolkit 1.0
- Jun 08: GT200 GPU architecture
- March 10: Fermi GPU architecture
- Jan 12: CUDA Toolkit 4.1

Slide 24: Programming model (figure: NVIDIA Corporation 2010)
Definitions
- Host: CPU, executes functions
- Device: usually GPU, executes kernels
The parallel portion of the application is executed on the device as a kernel
- A kernel is executed as an array of threads
- All threads execute the same code
- Threads are identified by IDs, which are used to select input/output data and to make control decisions

    float x = input[threadid];
    float y = func(x);
    output[threadid] = y;

Slide 25: Programming model
- Threads are grouped into blocks
- Blocks are grouped into a grid

Slide 26: Programming model
- A kernel is executed as a grid of blocks of threads
- Dimensions of blocks and grids: up to 3
- ID tuples for threads and blocks
Figure: the host launches two kernels on the device. Kernel 1 runs on a 1D grid (Block 0 to Block 7). Kernel 2 runs on a 2D grid (Block (0,0) to Block (1,3)); Block (1,3) is shown enlarged with its threads, Thread (0,0,0) to Thread (2,1,0).
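
The ID tuples of the figure can be mapped to a single linear thread number inside a block. The sketch below uses the usual CUDA ordering, in which x varies fastest, then y, then z. It is plain host C: `dim3_t` is a hypothetical stand-in for CUDA's `dim3`/`uint3` types, and `flat_thread_id` is a hypothetical helper name.

```c
#include <assert.h>

/* Plain C stand-in for CUDA's dim3/uint3 (hypothetical name). */
typedef struct {
    unsigned x, y, z;
} dim3_t;

/* Linear ID of a thread within its block: x varies fastest, then y, then z. */
unsigned flat_thread_id(dim3_t idx, dim3_t dim)
{
    assert(idx.x < dim.x && idx.y < dim.y && idx.z < dim.z);
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}
```

For the 3 x 2 x 1 block shown in the figure, Thread (0,0,0) maps to 0 and Thread (2,1,0) maps to 5, so the six threads are numbered 0 to 5 row by row.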

Slide 27: Programming model (CUDA C)
Setup
- GPU (e.g. driver, environment variables)
- Download and install the CUDA Toolkit (cf. Links section)
Compiling

    module load cuda               # on our cluster
    nvcc -arch=sm_20 saxpy.cu

- nvcc: NVIDIA's compiler for C/C++ GPU code
- -arch=sm_20: sets compute capability 2.0, which enables certain architecture features, e.g. double precision floating point operations

Slide 28: Programming model (CUDA C)
Kernel code
- Function qualifiers: __global__, __device__, __host__
- Built-in variables:
  - gridDim: contains dimensions of grid (type dim3)
  - blockDim: contains dimensions of block (type dim3)
  - blockIdx: contains block index within grid (type uint3)
  - threadIdx: contains thread index within block (type uint3)
- Compute unique IDs, e.g. global 1D index: gidx = blockIdx.x * blockDim.x + threadIdx.x
Kernel usage
- Compiling with nvcc (creating PTX code)
- Kernel arguments can be passed directly to the kernel
- Kernel invocation with execution configuration (chevron syntax): func<<<dimGrid, dimBlock>>>(parameter)
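
How the global 1D index covers an array can be checked on the CPU. In the sketch below, two nested loops play the roles of `blockIdx.x` and `threadIdx.x`, and the guard `gidx < n` is the same one used in the SAXPY kernel. The grid size is rounded up so that sizes which are not a multiple of the block size are covered as well; this rounding is an addition of this sketch (the SAXPY example divides exactly because 10240 is a multiple of 128). The names `blocks_for` and `emulate_launch` are hypothetical.

```c
#include <assert.h>

/* Number of blocks needed to cover n elements (integer division, rounded up). */
int blocks_for(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* CPU emulation of a 1D kernel launch. The outer loop stands for blockIdx.x,
 * the inner loop for threadIdx.x. Every in-range global index increments
 * hits[gidx] once. Returns the total number of emulated threads. */
int emulate_launch(int n, int threads_per_block, int *hits)
{
    int grid_dim = blocks_for(n, threads_per_block);
    int launched = 0;

    for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (int thread_idx = 0; thread_idx < threads_per_block; ++thread_idx) {
            int gidx = block_idx * threads_per_block + thread_idx;
            ++launched;
            if (gidx < n)   /* surplus threads of the last block do nothing */
                ++hits[gidx];
        }
    }
    return launched;
}
```

For n = 1000 and 128 threads per block, 8 blocks with 1024 threads in total are emulated; the guard keeps the 24 surplus threads from touching memory beyond the array, and every element is visited exactly once.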

Slide 29: Programming model (CUDA Fortran)
Setup
- GPU (e.g. driver)
- PGI compiler
Compiling

    module switch intel pgi[/version]    # on our cluster
    pgf90 -Mcuda[=cc20,4.0] saxpy.cuf

- -Mcuda: enables CUDA Fortran extensions
- cc20: generates code for a device with compute capability 2.0
- 4.0: uses CUDA Toolkit 4.0
- .cuf: free-format CUDA Fortran program
- .CUF: program is processed by the preprocessor before being compiled

Slide 30: Programming model (CUDA Fortran)
Kernel code
- Subroutine/function qualifiers: attributes(global), attributes(device), attributes(host)
- Built-in variables:
  - gridDim: contains dimensions of grid (type(dim3))
  - blockDim: contains dimensions of block (type(dim3))
  - blockIdx: contains block index within grid (type(dim3))
  - threadIdx: contains thread index within block (type(dim3))
- Compute unique IDs, e.g. global 1D index: gidx = (blockIdx%x - 1) * blockDim%x + threadIdx%x
Kernel usage
- Compiling with pgf90 or pgfortran (creating PTX code)
- Kernel arguments can be passed directly to the kernel
- Kernel invocation with execution configuration (chevron syntax): call func<<<dimGrid, dimBlock>>>(parameter)
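
The only difference between the CUDA C and the CUDA Fortran index formula is the base of the block and thread indices (0 in C, 1 in Fortran). The following host-side C sketch puts both formulas next to each other so that the offset can be checked; the function names `gidx_c` and `gidx_fortran` are hypothetical.

```c
#include <assert.h>

/* CUDA C convention: block and thread indices start at 0. */
int gidx_c(int block_idx, int block_dim, int thread_idx)
{
    assert(block_idx >= 0 && thread_idx >= 0 && thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}

/* CUDA Fortran convention: block and thread indices start at 1. */
int gidx_fortran(int block_idx, int block_dim, int thread_idx)
{
    assert(block_idx >= 1 && thread_idx >= 1 && thread_idx <= block_dim);
    return (block_idx - 1) * block_dim + thread_idx;
}
```

The first thread of the first block yields 0 in the C convention and 1 in the Fortran convention, which matches the first valid array element in each language. This is also why the bounds check reads `i < n` in C and `i <= n` in Fortran.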

Slide 31: Example SAXPY: Kernel usage
C/C++

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = a*x[i] + y[i];
        }
    }

    int main(int argc, char* argv[])
    {
        [..]
        // Invoke parallel SAXPY kernel
        dim3 threadsperblock(128);
        dim3 blockspergrid(n/threadsperblock.x);
        saxpy_parallel<<<blockspergrid, threadsperblock>>>(n, 2.0, d_x, d_y);
        [..]
    }

Fortran

    module saxpy
      use cudafor
    contains
      attributes(global) subroutine saxpy_parallel(n, a, x, y)
        integer, intent(in), value :: n
        real, intent(in), value :: a          ! scaling factor is real, as in the C version
        real, intent(in), device :: x(n)
        real, intent(inout), device :: y(n)
        integer :: i
        i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
        if (i <= n) then
          y(i) = a*x(i) + y(i)
        end if
      end subroutine saxpy_parallel
    end module saxpy

    program main
      use saxpy
      [..]
      ! Invoke parallel SAXPY kernel
      threadsperblock = dim3(128,1,1)
      blockspergrid = dim3(n/threadsperblock%x,1,1)
      call saxpy_parallel<<<blockspergrid, threadsperblock>>>(n, a, d_x, d_y)
      [..]
    end program main

Slide 32: Programming model
Why blocks?
- Cooperation of threads within a block is possible
  - Synchronization (barrier)
  - Sharing data/results using shared memory
- Scalability
  - Fast communication between n threads is not feasible when n is large
  - But: blocks are executed independently
  - Blocks can be distributed across an arbitrary number of multiprocessors
Number of blocks (with the number of threads fixed)?
- Few blocks: many threads can communicate
- Many blocks: good scaling

Slide 33: Programming model: scalability
Figure (NVIDIA Corporation): G84 (very old architecture)

Slide 34: Programming model: scalability
Figure (NVIDIA Corporation): G80 (medium old architecture)

Slide 35: Programming model: scalability
Figure (NVIDIA Corporation): GT200 (last architecture); three slots are marked "Idle"


Slide 37: Execution model
Host-directed execution model
- Main program runs on the host
- Certain code regions run on the device
- Execution configuration: <<<blockspergrid, threadsperblock>>>
Warps
- Threads execute in groups of 32 (warps)
- Threads in a warp share the same program counter
- Single instruction, multiple threads (SIMT)
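
The grouping of threads into warps of 32 can be made concrete with a little integer arithmetic. The sketch below is plain host C under the assumption that threads are assigned to warps in order of their linear thread ID within the block; the helper names are hypothetical.

```c
#include <assert.h>

enum { WARP_SIZE = 32 };

/* Number of warps a block occupies; a partially filled last warp counts fully. */
int warps_per_block(int threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Warp that a thread belongs to, given its linear ID within the block. */
int warp_of_thread(int flat_thread_id)
{
    assert(flat_thread_id >= 0);
    return flat_thread_id / WARP_SIZE;
}

/* Position (lane) of the thread inside its warp, 0 to 31. */
int lane_of_thread(int flat_thread_id)
{
    assert(flat_thread_id >= 0);
    return flat_thread_id % WARP_SIZE;
}
```

A block of 128 threads, as used in the SAXPY example, consists of exactly 4 warps. A block of 100 threads also occupies 4 warps, with the last warp only partly filled, which is one reason why block sizes are commonly chosen as multiples of 32.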

Slide 38: Execution model
- Thread -> Core: each thread is executed by a core
- Block -> Multiprocessor: each block is executed on a multiprocessor; several concurrent blocks can reside on an MP, depending on memory resources
- Grid (kernel) -> Device: each kernel is executed on a device
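
How many blocks can reside on one multiprocessor at the same time can be estimated from its resource limits. The sketch below is a simplified host-side calculation that considers only three limits: resident blocks, resident threads and shared memory per multiprocessor. Register usage is ignored. The struct `mp_limits_t`, the function `resident_blocks` and the example limits (8 blocks, 1536 threads, 48 KB shared memory, taken here as typical values for compute capability 2.0) are assumptions of this sketch, not values from the slides.

```c
#include <assert.h>

/* Per-multiprocessor limits (hypothetical struct for illustration). */
typedef struct {
    int max_blocks;        /* maximum resident blocks */
    int max_threads;       /* maximum resident threads */
    int shared_mem_bytes;  /* shared memory available to resident blocks */
} mp_limits_t;

/* Estimated number of blocks that fit on one multiprocessor at once. */
int resident_blocks(mp_limits_t mp, int threads_per_block,
                    int shared_bytes_per_block)
{
    assert(threads_per_block > 0 && shared_bytes_per_block >= 0);

    int by_threads = mp.max_threads / threads_per_block;
    int blocks = mp.max_blocks < by_threads ? mp.max_blocks : by_threads;

    /* Shared memory only limits the result if the block actually uses some. */
    if (shared_bytes_per_block > 0) {
        int by_shared = mp.shared_mem_bytes / shared_bytes_per_block;
        if (by_shared < blocks)
            blocks = by_shared;
    }
    return blocks;
}
```

With the assumed limits, blocks of 128 threads without shared memory are capped at 8 by the block limit, while the same blocks using 16 KB of shared memory each are capped at 3 by the shared memory limit.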


Slide 40: Memory model
Host + device memory = separate entities
- No coherence between host and device
- Manual data synchronization/transfer
Host
- (De-)allocates device memory (global, constant, texture)
- Triggers data transfer
Device
- Works on device memory (hierarchy)

Slide 41: Memory model
Thread
- Registers
- Local memory
Block
- Shared memory, on-chip. Tesla C1060: 16 KB; Fermi: 16 KB shared + 48 KB L1 or 48 KB shared + 16 KB L1
Grid/application
- Constant memory: 64 KB; read-only; off-chip; cached
- Global memory: up to 6 GB; off-chip; Fermi: L2 cache
Figure: the device contains multiprocessors 1 to n. Each multiprocessor has cores 1 to m with their registers, plus shared memory and an L1 cache. All multiprocessors share the L2 cache and the global/constant memory. Host memory is separate.

Slide 42: Memory model (CUDA C)
Variable type qualifiers
- __device__, __shared__, __constant__
Memory management
- cudaMalloc(pointerToGpuMem, size)
- cudaFree(pointerToGpuMem)
Memory transfer (synchronous)
- cudaMemcpy(dest, src, size, direction)
- direction: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice

Slide 43: Memory model (CUDA Fortran)
Variable type qualifiers
- attributes(device), attributes(shared), attributes(constant), attributes(pinned), attributes(value)
Memory management
- cudaMalloc(pointerToGpuMem, size)
- cudaFree(pointerToGpuMem)
Memory transfer (synchronous)
- By assignment statements:
  - var_dev = var_host (CPU to GPU transfer)
  - var_host = var_dev (GPU to CPU transfer)
  - var1_dev = var2_dev (copy on GPU)
- By runtime routine: cudaMemcpy(dest, src, size, direction)
  - direction: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice

Slide 44: Example SAXPY: Memory
C/C++

    int main(int argc, char* argv[])
    {
        int n = 10240;
        float *h_x, *h_y;   // host pointer
        // Allocate and initialize h_x and h_y
        float *d_x, *d_y;   // device pointer
        cudaMalloc(&d_x, n*sizeof(float));
        cudaMalloc(&d_y, n*sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);
        // Invoke parallel SAXPY kernel
        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x); cudaFree(d_y);
        free(h_x); free(h_y);
        return 0;
    }

Fortran

    attributes(global) subroutine saxpy_parallel(n, a, x, y)
      integer, intent(in), value :: n
      real, intent(in), value :: a            ! scaling factor is real, as in the C version
      real, intent(in), device :: x(n)
      real, intent(inout), device :: y(n)
      [..]
    end subroutine saxpy_parallel

    program main
      use saxpy
      real, allocatable :: h_x(:), h_y(:)           ! host pointer
      real, allocatable, device :: d_x(:), d_y(:)   ! device pointer
      allocate(h_x(n), h_y(n), d_x(n), d_y(n))
      ! Initialize h_x and h_y
      d_x = h_x
      d_y = h_y
      ! Invoke parallel SAXPY kernel
      h_y = d_y
      deallocate(h_x, h_y, d_x, d_y)
    end program main


Slide 46: Summary
3 steps for a basic program with CUDA

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = a*x[i] + y[i];
        }
    }

    int main(int argc, char* argv[])
    {
        int n = 10240;
        float *h_x, *h_y;   // Pointer to CPU memory
        // Allocate and initialize h_x and h_y
        float *d_x, *d_y;   // Pointer to GPU memory

        // 1. Allocate data on GPU + transfer data to GPU
        cudaMalloc(&d_x, n*sizeof(float));
        cudaMalloc(&d_y, n*sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

        // 2. Launch kernel
        dim3 threadsperblock(128);
        dim3 blockspergrid(n/threadsperblock.x);
        saxpy_parallel<<<blockspergrid, threadsperblock>>>(n, 2.0, d_x, d_y);

        // 3. Transfer data to CPU + free data on GPU
        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x); cudaFree(d_y);

        free(h_x); free(h_y);
        return 0;
    }

Slide 47: Summary
Processing flow
1. Copy data from host to device
2. Execute GPU code (kernel) in parallel
3. Copy data from device to host
Kernel executes grid of blocks of threads
Memory hierarchy on GPU
- Thread: registers, local
- Block: shared
- Grid: global
Use GPUs properly!
- Launch many, many threads
- Uniform operations on data (-> thread ID)
- Use all available resources (GPU + CPU)


Slide 49: CUDA Tools
Debugger
- cuda-gdb: extended gdb (usable via ddd), NVIDIA (free of charge)
- cuda-memcheck: discovers memory access errors, NVIDIA (free of charge)
- TotalView: GUI (memory) debugger, Linux, RogueWave
- DDT: GUI (memory) debugger, Linux, Allinea
- Parallel Nsight: Windows, integrated in Visual Studio, NVIDIA (free of charge)
Profiling/tracing
- Visual Profiler: performance analysis with HW counters, NVIDIA (free of charge)
- VampirTrace: performance monitoring (tracing), TU Dresden

Slide 50: CUDA Libraries
NVIDIA
- cuBLAS: dense linear algebra (subset of BLAS)
- cuSPARSE: sparse linear algebra
- cuFFT: discrete Fourier transforms
- cuRAND: random number generation
- NPP: signal and image processing
- Thrust: STL/Boost style template library (e.g. scan, sort, reduce, transform)
- math.h: basics, exponentials, trigonometry, ... (e.g. sin, ceil, round)
Third party
- CULA: dense/sparse linear algebra (subset of LAPACK)
- MAGMA: dense linear algebra (subset of BLAS, LAPACK)
- IMSL: Fortran numerical library, utilizes cuBLAS
- NAG: numeric libraries (e.g. RNGs)
- libJacket: math, signal processing, image processing, statistics
Open source
- CUDPP: data parallel primitives (e.g. scan, sort, reduction)
- CUSP: sparse linear algebra, graph computations
- OpenCurrent: partial differential equations

Introduction to GPGPUs

Introduction to GPGPUs Introduction to GPGPUs using CUDA Sandra Wienke, M.Sc. wienke@itc.rwth-aachen.de IT Center, RWTH Aachen University May 28th 2015 IT Center der RWTH Aachen University Links PPCES Workshop: http://www.itc.rwth-aachen.de/ppces

More information

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [ jbalfour@nvidia.com] NVIDIA Research CUDA Programing system for machines with GPUs Programming Language Compilers Runtime Environments Drivers

More information

Parallel Programming Overview

Parallel Programming Overview Parallel Programming Overview Introduction to High Performance Computing 2019 Dr Christian Terboven 1 Agenda n Our Support Offerings n Programming concepts and models for Cluster Node Core Accelerator

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

Advanced CUDA Optimization 1. Introduction

Advanced CUDA Optimization 1. Introduction Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines

More information

HPC COMPUTING WITH CUDA AND TESLA HARDWARE. Timothy Lanfear, NVIDIA

HPC COMPUTING WITH CUDA AND TESLA HARDWARE. Timothy Lanfear, NVIDIA HPC COMPUTING WITH CUDA AND TESLA HARDWARE Timothy Lanfear, NVIDIA WHAT IS GPU COMPUTING? What is GPU Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing Low Latency or High Throughput?

More information

Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator

Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator What is CUDA? Programming language? Compiler? Classic car? Beer? Coffee? CUDA Parallel Computing Platform www.nvidia.com/getcuda Programming

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

GPU CUDA Programming

GPU CUDA Programming GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

Introduction to Parallel Computing with CUDA. Oswald Haan

Introduction to Parallel Computing with CUDA. Oswald Haan Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

Practical Introduction to CUDA and GPU

Practical Introduction to CUDA and GPU Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing

More information

Introduction to CUDA

Introduction to CUDA Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations

More information

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

HIGH-PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS

HIGH-PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS HIGH-PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS Timothy Lanfear, NVIDIA WHAT IS GPU COMPUTING? What is GPU Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing Low Latency or

More information

GPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways:

GPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways: COMP528 Multi-Core Programming GPU programming,ii www.csc.liv.ac.uk/~alexei/comp528 Alexei Lisitsa Dept of computer science University of Liverpool a.lisitsa@.liverpool.ac.uk Different ways: GPU programming

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software

More information

GPU Programming Using CUDA

GPU Programming Using CUDA GPU Programming Using CUDA Michael J. Schnieders Depts. of Biomedical Engineering & Biochemistry The University of Iowa & Gregory G. Howes Department of Physics and Astronomy The University of Iowa Iowa

More information

CUDA Architecture & Programming Model

CUDA Architecture & Programming Model CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New

More information

Lecture 1: an introduction to CUDA

Lecture 1: an introduction to CUDA Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Josef Pelikán, Jan Horáček CGG MFF UK Praha

Josef Pelikán, Jan Horáček CGG MFF UK Praha GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing

More information

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA

More information

CUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17

CUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 CUDA Lecture 2 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 manfred.liebmann@tum.de December 15, 2015 CUDA Programming Fundamentals CUDA

More information

GPU programming. Dr. Bernhard Kainz

GPU programming. Dr. Bernhard Kainz GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling

More information

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics GPU Programming Rüdiger Westermann Chair for Computer Graphics & Visualization Faculty of Informatics Overview Programming interfaces and support libraries The CUDA programming abstraction An in-depth

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel

More information

ECE 574 Cluster Computing Lecture 15

ECE 574 Cluster Computing Lecture 15 ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements

More information
