CUDA Introduction Part I PATC GPU Programming Course 2017


1 CUDA Introduction Part I PATC GPU Programming Course 2017 Andreas Herten, Forschungszentrum Jülich, 24 April 2017 Handout Version

2 Outline
Introduction: GPU History, Architecture Comparison, Jülich Systems, App Showcase
The GPU Platform: 3 Core Features (Memory, Asynchronicity, SIMT), High Throughput, Summary
Programming GPUs: Libraries, About CUDA, Alternatives, Directives, Thrust, CUDA C/C++

3 History of GPUs: 50 Shaders of Gray
1999: Graphics computation pipeline implemented in dedicated graphics hardware. Computations using the OpenGL graphics library [1]. "GPU" coined by NVIDIA [2]
2001: NVIDIA GeForce 3 with programmable shaders (instead of fixed pipeline) and floating-point support; 2003: DirectX 9 at ATI
2007: CUDA
2009: OpenCL
2016: Top 500: more than 1/10 with GPUs [3]; Green 500: 2/3 of top 50 with GPUs [4]

4 Status Quo Across Architectures: Theoretical Peak Performance, Double Precision
[Figure: GFLOP/sec vs. end of year for Intel Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and Intel Xeon Phis. Graphic: Rupp [5]]

5, 6 Status Quo Across Architectures: Memory Bandwidth
[Figure, shown on two consecutive slides: Theoretical Peak Memory Bandwidth Comparison, GB/sec vs. end of year for Intel Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and Intel Xeon Phis. Graphic: Rupp [5]]

7 JURECA: Jülich's Multi-Purpose Supercomputer
1872 nodes with Intel Xeon E5 CPUs (2 × 12 cores)
75 nodes with 2 NVIDIA Tesla K80 cards (look like 4 GPUs)
1.8 (CPU) + … (GPU) PFLOP/s peak performance (#70)
Dedicated visualization nodes
Mellanox EDR InfiniBand

8 JULIA / JURON
JURON: A Human Brain Project Prototype
18 nodes with IBM POWER8NVL CPUs (2 × 10 cores)
Per node: 4 NVIDIA Tesla P100 cards, connected via NVLink
GPU: 0.38 PFLOP/s peak performance
Dedicated visualization nodes

9 Getting GPU-Acquainted: Some Applications
TASK
Applications: Dot Product, GEMM, N-Body, Mandelbrot
Location of code: CudaBasics/exercises/tasks/getting_started/
See Instructions.rst for hints.

10 Getting GPU-Acquainted: Some Applications
TASK
[Figures: four benchmark plots.
DDot Benchmark: MFLOP/s vs. length of vector, CPU and GPU.
DGEMM Benchmark: GFLOP/s vs. size of square matrix, CPU and GPU.
N-Body Benchmark: GFLOP/s vs. number of particles, for 1, 2, and 4 GPUs in single precision (SP) and double precision (DP).
Mandelbrot Benchmark: MPixel/s vs. width of image, CPU and GPU.]

11 The GPU Platform

12 CPU vs. GPU: A matter of specialties
Transporting one [picture: kawasaki ninja, Lee [6]] vs. transporting many [picture: Shearings coach, Shearings Holidays [7]]

13 CPU vs. GPU: Chip
[Diagram of chip layouts. Labels: Control, ALU (four times), Cache, DRAM (twice)]

14 GPU Architecture Overview
Aim: Hide latency. Everything else follows:
SIMT, Asynchronicity, Memory


16, 17 Memory: GPU memory ain't no CPU memory
GPU: accelerator / extension card, a separate device from the CPU
Separate memory, but Unified Virtual Addressing (UVA) and Unified Memory (UM)
Memory transfers need special consideration! Do as little as possible!
Formerly: explicitly copy data to/from GPU. Now: done automatically (performance?)
Values for P100: 16 GB RAM, 720 GB/s
[Diagram: Host (Control, four ALUs, Cache, DRAM) connected to Device (DRAM). Slide 16: PCIe link, < 16 GB/s. Slide 17: NVLink, 80 GB/s; device memory HBM2, < 720 GB/s]
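To get a feeling for why transfers deserve "special consideration", the bandwidth figures quoted on the slide (PCIe < 16 GB/s, NVLink 80 GB/s, HBM2 < 720 GB/s) can be turned into rough time estimates. The following is a minimal host-side C++ sketch, not part of the original slides. The helper `transfer_seconds` and the constant names are illustrative assumptions, and the model ignores latency and protocol overhead, so real transfers will be slower.

```cpp
#include <cassert>

// Idealized time (in seconds) to move `gigabytes` of data over a link with
// peak bandwidth `gb_per_s`. Hypothetical helper for illustration only;
// it ignores latency and protocol overhead.
double transfer_seconds(double gigabytes, double gb_per_s) {
    assert(gb_per_s > 0.0);
    return gigabytes / gb_per_s;
}

// Peak bandwidths quoted on the slide, in GB/s (upper bounds).
const double kPCIe   = 16.0;   // host <-> device via PCIe
const double kNVLink = 80.0;   // host <-> device via NVLink
const double kHBM2   = 720.0;  // P100 GPU <-> its own HBM2 memory
```

Under these assumptions, filling the 16 GB of a P100 over PCIe takes at least `transfer_seconds(16.0, kPCIe)`, i.e. one second, while the GPU could read the same data from its own memory 45 times in that period (720 / 16). That gap is the reason to move data as rarely as possible.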

18-20 Processing Flow: CPU → GPU → CPU
1 Transfer data from CPU memory to GPU memory, transfer program
2 Load GPU program, execute on SMs, get (cached) data from memory; write back
3 Transfer results back to host memory
[Diagram: CPU with CPU memory, interconnect, GPU with scheduler, L2 cache, and DRAM]


22 Async: Following different streams
Problem: Memory transfer is comparatively slow
Solution: Do something else in the meantime (computation)! Overlap tasks
Copy and compute engines run separately (streams)
[Diagram: alternating copy and compute steps, overlapped in time]
GPU needs to be fed: schedule many computations
CPU can do other work while GPU computes; synchronization
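The benefit of overlapping copy and compute can be estimated with a simple two-stage pipeline model. The sketch below is host-side C++ written for this handout, not CUDA stream code. It assumes one copy engine and one compute engine, equal-sized chunks, and no other overheads; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Total time if every chunk is first copied and then computed,
// strictly one after another (no overlap).
double serial_time(int chunks, double copy, double compute) {
    assert(chunks >= 1);
    return chunks * (copy + compute);
}

// Total time if the copy of chunk i+1 overlaps the computation of chunk i
// (idealized pipeline: one copy engine, one compute engine).
double overlapped_time(int chunks, double copy, double compute) {
    assert(chunks >= 1);
    return copy + (chunks - 1) * std::max(copy, compute) + compute;
}
```

With four chunks that each need one time unit to copy and one to compute, the model gives 8 units without overlap and 5 with overlap. The slower of the two stages sets the pace, which is why the GPU "needs to be fed" with enough computation to cover the transfers.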


24-26 SIMT: Of threads and warps
CPU: Single Instruction, Multiple Data (SIMD); Simultaneous Multithreading (SMT)
GPU: Single Instruction, Multiple Threads (SIMT)
CPU core ≈ GPU multiprocessor (SM)
Working unit: set of threads (32, a warp)
Fast switching of threads (large register file)
Branching: if
[Diagrams: vector addition A_i + B_i = C_i for i = 0 to 3; threads and cores under SMT and under SIMT; Pascal GP100 block diagram and multiprocessor. Graphics: Nvidia Corporation [8]]

27 Low Latency vs. High Throughput: Maybe the GPU's ultimate feature
CPU: Minimizes latency within each thread
GPU: Hides latency with computations from other thread warps
[Diagram: CPU core (low latency) with threads T1 to T4; GPU streaming multiprocessor (high throughput) with warps W1 to W4. Legend: thread/warp processing, context switch, ready, waiting]

28 CPU vs. GPU: Let's summarize this!
CPU: optimized for low latency
+ Large main memory
+ Fast clock rate
+ Large caches
+ Branch prediction
+ Powerful ALU
- Relatively low memory bandwidth
- Cache misses costly
- Low performance per watt
GPU: optimized for high throughput
+ High bandwidth main memory
+ Latency tolerant (parallelism)
+ More compute resources
+ High performance per watt
- Limited memory capacity
- Low per-thread performance
- Extension card

29 Programming GPUs

30 Preface: CPU. A simple CPU program!
SAXPY: y = a·x + y, with single precision
Part of BLAS Level 1

    void saxpy(int n, float a, float * x, float * y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int a = 42;
    int n = 10;
    float x[n], y[n];
    // fill x, y
    saxpy(n, a, x, y);

31, 32 Libraries: The truth is out there!
Programming GPUs is easy: Just don't!
Use applications & libraries!
cuBLAS, cuSPARSE, cuFFT, cuRAND, CUDA Math, Theano
Wizard: Breazell [9]

33 cuBLAS: Parallel algebra
GPU-parallel BLAS (all 152 routines)
Single, double, complex data types
Constant competition with Intel's MKL
Multi-GPU support

34, 35 cuBLAS: Code example
(Uses the cuBLAS v2 API from cublas_v2.h: cublasSaxpy takes the handle and a pointer to the scalar a.)

    float a = 42;
    int n = 10;
    float x[n], y[n];
    // fill x, y

    // Initialize
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Allocate GPU memory
    float * d_x, * d_y;
    cudaMallocManaged(&d_x, n * sizeof(x[0]));
    cudaMallocManaged(&d_y, n * sizeof(y[0]));

    // Copy data to GPU
    cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);
    cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);

    // Call BLAS routine
    cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);

    // Copy result to host
    cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);

    // Finalize
    cudaFree(d_x); cudaFree(d_y);
    cublasDestroy(handle);

36 cuBLAS Task: Implement a matrix-matrix multiplication
TASK
Location of code: CudaBasics/excercises/tasks/cublas/
Look at Instructions.rst for instructions
1 Implement call to double-precision GEMM of cuBLAS
2 Build with make (CUDA needs to be loaded!)
3 Run with make run
  Or: bsub -I -R "rusage[ngpus_shared=1]" ./dgemm_um N, where N = 100, 200, …
Check the cuBLAS documentation for details on cublasDgemm()

JURON: Getting Started

    module load nvidia/cuda/8.0
    bsub -I -R "rusage[ngpus_shared=1]" hostname
    bsub -Is -R "rusage[ngpus_shared=1]" -tty /bin/bash
    bsub -Is -R "rusage[ngpus_shared=1]" -tty -XF nvvp

37 Programming GPUs: About CUDA, Alternatives

38 ! Parallelism
Libraries are not enough? You think you want to write your own GPU code?

39 Primer on Parallel Scaling: Amdahl's Law
Possible maximum speedup for N parallel processors
Total time: $t = t_\text{serial} + t_\text{parallel}$
N processors: $t(N) = t_s + t_p / N$
Speedup: $s(N) = t / t(N) = \dfrac{t_s + t_p}{t_s + t_p / N}$
[Figure: speedup vs. number of processors for parallel portions of 50%, 75%, 90%, 95%, and 99%]
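The speedup formula is easy to evaluate numerically. The following host-side C++ sketch normalizes the total single-processor time to 1, so that the parallel portion p equals t_p and the serial portion equals 1 - p; the function name `amdahl_speedup` is an illustrative choice, not something from the slides.

```cpp
#include <cassert>

// Amdahl's law: speedup on n processors when a fraction `parallel`
// (between 0 and 1) of the single-processor run time can be parallelized.
// With t = t_s + t_p normalized to 1:
//   s(n) = 1 / ((1 - parallel) + parallel / n)
double amdahl_speedup(double parallel, int n) {
    assert(parallel >= 0.0 && parallel <= 1.0);
    assert(n >= 1);
    return 1.0 / ((1.0 - parallel) + parallel / n);
}
```

For a parallel portion of 95%, the speedup stays below 1 / 0.05 = 20 no matter how many processors are used; the serial remainder dominates, which is why the curves in the figure flatten.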

40 ! Parallelism
Parallel programming is not easy! Things to consider:
Is my application computationally intensive enough?
What are the levels of parallelism?
How much data needs to be transferred?
Is the gain worth the pain?

41 Alternatives: The twilight
There are alternatives to CUDA C, which can ease the pain: OpenACC, Thrust, PyCUDA
Other alternatives (for completeness): CUDA Fortran, OpenMP, OpenCL

42 Programming GPUs: Directives

43 GPU Programming with Directives: Keepin' you portable
Annotate serial source code by directives

    #pragma acc loop
    for (int i = 0; i < 1; i++) {};

Also: generalized API functions

    acc_copy();

Compiler interprets directives and generates the corresponding instructions
Pro
+ Portability
  Other compiler? No problem! To it, it's a serial program
  Different target architectures from same code
+ Easy to program
Con
- Only few compilers
- Not all the raw power available
- Harder to debug
- Easy to program wrong

44 GPU Programming with Directives: The power of two.
OpenMP: Standard for multithread programming on CPU; GPU since 4.0, better since 4.5

    #pragma omp target map(tofrom:y), map(to:x)
    #pragma omp teams num_teams(10) thread_limit(10)
    #pragma omp distribute
    for ( ) {
        #pragma omp parallel for
        for ( ) {
            //
        }
    }

OpenACC: Similar to OpenMP, but more specifically for GPUs
For C/C++ and Fortran

45-47 OpenACC: Code example
TASK

    void saxpy_acc(int n, float a, float * x, float * y) {
        // The first version on the slides uses: #pragma acc kernels
        #pragma acc parallel loop copy(y[0:n]) copyin(x[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int a = 42;
    int n = 10;
    float x[n], y[n];
    // fill x, y
    saxpy_acc(n, a, x, y);

See JSC OpenACC course in October!

48 Programming GPUs: Thrust

49 Thrust: Iterators! Iterators everywhere!
Thrust is to CUDA what the STL is to C++
Template library
Based on iterators
Data-parallel primitives (scan(), sort(), reduce(), …)
Fully compatible with plain CUDA C (comes with CUDA Toolkit)

50 Thrust: Code example

    int a = 42;
    int n = 10;
    thrust::host_vector<float> x(n), y(n);
    // fill x, y

    // Copy to the device
    thrust::device_vector<float> d_x = x, d_y = y;

    using namespace thrust::placeholders;
    // d_y = a * d_x + d_y
    thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(), a * _1 + _2);

    // Copy the result back to the host
    y = d_y;
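Because Thrust mirrors the STL, the same SAXPY can be written for the CPU with `std::transform`, which may help when reading the Thrust call. This is a host-only C++ sketch and not Thrust code; the function name `saxpy_stl` is hypothetical, and a lambda takes the place of Thrust's `_1` and `_2` placeholders.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-only STL counterpart of the thrust::transform call:
// y[i] = a * x[i] + y[i], expressed with iterators. The binary lambda
// receives one element of x and the matching element of y.
void saxpy_stl(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    std::transform(x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
}
```

The argument order is the same as in the Thrust version: first input range, start of the second input, start of the output, then the operation. Only the containers (`thrust::device_vector` instead of `std::vector`) decide where the work runs.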

51 Thrust Task: Let's sort some randomness
TASK
Location of code: CudaBasics/excercises/tasks/thrust/
Look at Instructions.rst for instructions
1 Sort random numbers with Thrust on CPU and GPU
2 Build with make (CUDA needs to be loaded!)
3 Run with make run
  Or: bsub -I -R "rusage[ngpus_shared=1]" ./thrustsort
Check the Thrust documentation for details on thrust::sort()

52 Summary of Acceleration Possibilities
Application, accelerated via:
Libraries: drop-in acceleration
OpenACC directives: easy acceleration
Programming languages: flexible acceleration

53 Programming GPUs: CUDA C/C++

54 CUDA SAXPY: With runtime-managed data transfers

    __global__ void saxpy_cuda(int n, float a, float * x, float * y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int a = 42;
    int n = 10;
    float * x, * y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    // fill x, y

    saxpy_cuda<<<2, 5>>>(n, a, x, y);

    cudaDeviceSynchronize();
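The index arithmetic of the kernel can be checked on the host without a GPU. The sketch below is plain C++ that emulates a one-dimensional launch such as `<<<2, 5>>>` with two nested loops standing in for `blockIdx.x` and `threadIdx.x`. The names `saxpy_emulated` and `blocks_for` are assumptions made for this illustration; on a real GPU the threads run concurrently and in no guaranteed order.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of a 1D kernel launch with grid_dim blocks of
// block_dim threads each. Sequential loops replace the parallel threads.
void saxpy_emulated(int grid_dim, int block_dim, int n, float a,
                    const std::vector<float>& x, std::vector<float>& y) {
    assert(static_cast<int>(x.size()) >= n);
    assert(static_cast<int>(y.size()) >= n);
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            int i = block_idx * block_dim + thread_idx;  // global index
            if (i < n)  // guard: surplus threads do nothing
                y[i] = a * x[i] + y[i];
        }
    }
}

// Smallest number of blocks such that blocks * block_dim >= n.
int blocks_for(int n, int block_dim) {
    assert(n >= 0 && block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}
```

For n = 10 and blocks of 5 threads, `blocks_for` yields the 2 blocks used in the launch; for n = 11 a third block would be needed, and the `if (i < n)` guard keeps its four surplus threads from writing past the end of the arrays.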

55 CUDA C/C++: Warp the kernel, it's a thread!
Methods to exploit parallelism:
Thread → Block
Block → Grid
All in 3D
Threads run on individual CUDA cores
Lightweight: fast switching! 1000s of threads execute simultaneously
Warp: bundle of (currently) 32 threads executing at the exact same time on one multiprocessor; a hardware-only concept!
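Since the hardware schedules whole warps, the block size determines how many warps are occupied and how many thread slots in the last warp stay empty. A small host-side C++ sketch of that arithmetic follows; the warp size of 32 is taken from the slide, while the names `kWarpSize`, `warps_per_block`, and `idle_lanes` are illustrative assumptions.

```cpp
#include <cassert>

const int kWarpSize = 32;  // "currently" 32 threads per warp, per the slide

// Number of warps needed for a block of `threads_per_block` threads.
// A partially filled last warp still occupies a full warp.
int warps_per_block(int threads_per_block) {
    assert(threads_per_block >= 1);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Thread slots in the block's last warp that have no thread assigned.
int idle_lanes(int threads_per_block) {
    return warps_per_block(threads_per_block) * kWarpSize - threads_per_block;
}
```

A block of 5 threads, as in the SAXPY launch with `<<<2, 5>>>`, occupies one warp and leaves 27 of its 32 slots unused, whereas a block of 256 threads fills 8 warps completely. This is one reason block sizes are usually chosen as multiples of 32.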

56 CUDA C/C++: Kernels
Execution unit: kernel. Function executing in parallel on device

    __global__ void kernel(int a, float * b) { }

Access own ID by global variables threadIdx.x, blockIdx.y, …
Execution order non-deterministic! Might even be serial
Only threads in one warp (32 threads of block) can communicate reliably/quickly
All threads execute same code, but can take different paths (beware of slow branches!)
Further keywords: atomic operations (locking); shared memory; thread synchronization
→ CUDA SAXPY!

57 CUDA API?!? Jan!

58 Conclusions of Part 1
GPUs achieve performance by specialized hardware
Acceleration can be done by different means
Libraries are the easiest
Thrust, OpenACC can give first entry point
Full power with CUDA
CUDA parallelizes for GPUs with many threads
Thank you for your attention! a.herten@fz-juelich.de

59 Appendix: Glossary, References

60 Glossary I
API: A programmatic interface to software by well-defined functions. Short for application programming interface. (pages 43, 57, 60)
ATI: Canada-based GPU manufacturing company; bought by AMD. (page 60)
CUDA: Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++. (pages 2, 3, 41, 49, 53, 54, 55, 56, 57, 58, 60)
GCC: The GNU Compiler Collection, the collection of open source compilers, among others for C and Fortran. (page 60)

61 Glossary II
LLVM: An open source compiler infrastructure, providing, among others, Clang for C. (page 60)
NVIDIA: US technology company creating GPUs. (pages 3, 7, 8, 60)
NVLink: NVIDIA's communication protocol connecting CPU ↔ GPU and GPU ↔ GPU with 80 GB/s. PCI-Express: 16 GB/s. (pages 8, 60)
OpenACC: Directive-based programming, primarily for many-core machines. (pages 41, 44, 45, 46, 47, 60)
OpenCL: The Open Computing Language. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA. (pages 3, 41, 60)

62 Glossary III
OpenGL: The Open Graphics Library, an API for rendering graphics across different hardware architectures. (pages 3, 60)
OpenMP: Directive-based programming, primarily for multi-threaded machines. (pages 41, 44, 60)
P100: A large GPU with the Pascal architecture from NVIDIA. It employs NVLink as its interconnect and has fast HBM2 memory. (pages 8, 60)
SAXPY: Single-precision A·X + Y. A simple code example of scaling a vector and adding an offset. (pages 30, 54, 56, 60)
Tesla: NVIDIA's GPU product line for general-purpose computing. (pages 7, 8, 60)

63 Glossary IV
Thrust: A parallel algorithms library for (among others) GPUs. (pages 41, 49, 60)

64 References I
[1] Kenneth E. Hoff III et al. "Fast Computation of Generalized Voronoi Diagrams Using Graphics Hardware". In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. SIGGRAPH '99. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 1999 (page 3).
[2] Chris McClanahan. "History and Evolution of GPU Architecture". In: A Survey Paper (2010) (page 3).
[3] Jack Dongarra et al. TOP500. Nov. (page 3).

65 References II
[4] Jack Dongarra et al. Green500. Nov. (page 3).
[5] Karl Rupp. Pictures: CPU/GPU Performance Comparison (pages 4-6).
[6] Mark Lee. Picture: kawasaki ninja (page 12).
[7] Shearings Holidays. Picture: Shearings coach 636 (page 12).

66 References III
[8] Nvidia Corporation. Pictures: Pascal Blockdiagram, Pascal Multiprocessor. Pascal Architecture Whitepaper. URL: whitepaper/pascal-architecture-whitepaper.pdf (pages 25, 26).
[9] Wes Breazell. Picture: Wizard (pages 31, 32).


More information

Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015

Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

An Introduc+on to OpenACC Part II

An Introduc+on to OpenACC Part II An Introduc+on to OpenACC Part II Wei Feinstein HPC User Services@LSU LONI Parallel Programming Workshop 2015 Louisiana State University 4 th HPC Parallel Programming Workshop An Introduc+on to OpenACC-

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

More information

Parallel Computing. November 20, W.Homberg

Parallel Computing. November 20, W.Homberg Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better

More information

GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler

GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler Taylor Lloyd Jose Nelson Amaral Ettore Tiotto University of Alberta University of Alberta IBM Canada 1 Why? 2 Supercomputer Power/Performance GPUs

More information

Practical Introduction to CUDA and GPU

Practical Introduction to CUDA and GPU Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing

More information

VSC Users Day 2018 Start to GPU Ehsan Moravveji

VSC Users Day 2018 Start to GPU Ehsan Moravveji Outline A brief intro Available GPUs at VSC GPU architecture Benchmarking tests General Purpose GPU Programming Models VSC Users Day 2018 Start to GPU Ehsan Moravveji Image courtesy of Nvidia.com Generally

More information

Lecture 1: an introduction to CUDA

Lecture 1: an introduction to CUDA Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming

More information

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra) CS 470 Spring 2016 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Parallel Systems Shared memory (uniform global address space) Primary story: make faster computers Programming

More information

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute

More information

Massively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN

Massively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN Massively Parallel Computing with CUDA Carlos Alberto Martínez Angeles Cinvestav-IPN What is a GPU? A graphics processing unit (GPU) The term GPU was popularized by Nvidia in 1999 marketed the GeForce

More information

LECTURE ON PASCAL GPU ARCHITECTURE. Jiri Kraus, November 14 th 2016

LECTURE ON PASCAL GPU ARCHITECTURE. Jiri Kraus, November 14 th 2016 LECTURE ON PASCAL GPU ARCHITECTURE Jiri Kraus, November 14 th 2016 ACCELERATED COMPUTING CPU Optimized for Serial Tasks GPU Accelerator Optimized for Parallel Tasks 2 ACCELERATED COMPUTING CPU Optimized

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software

More information

Approaches to GPU Computing. Libraries, OpenACC Directives, and Languages

Approaches to GPU Computing. Libraries, OpenACC Directives, and Languages Approaches to GPU Computing Libraries, OpenACC Directives, and Languages Add GPUs: Accelerate Applications CPU GPU 146X 36X 18X 50X 100X Medical Imaging U of Utah Molecular Dynamics U of Illinois, Urbana

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

High Performance Computing and GPU Programming

High Performance Computing and GPU Programming High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz

More information

A case study of performance portability with OpenMP 4.5

A case study of performance portability with OpenMP 4.5 A case study of performance portability with OpenMP 4.5 Rahul Gayatri, Charlene Yang, Thorsten Kurth, Jack Deslippe NERSC pre-print copy 1 Outline General Plasmon Pole (GPP) application from BerkeleyGW

More information

ARCHER Champions 2 workshop

ARCHER Champions 2 workshop ARCHER Champions 2 workshop Mike Giles Mathematical Institute & OeRC, University of Oxford Sept 5th, 2016 Mike Giles (Oxford) ARCHER Champions 2 Sept 5th, 2016 1 / 14 Tier 2 bids Out of the 8 bids, I know

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

Introduction to Parallel Computing with CUDA. Oswald Haan

Introduction to Parallel Computing with CUDA. Oswald Haan Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries

More information

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra) CS 470 Spring 2018 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Aside (P3 related): linear algebra Many scientific phenomena can be modeled as matrix operations Differential

More information

Introduction to GPGPU and GPU-architectures

Introduction to GPGPU and GPU-architectures Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks

More information

GPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways:

GPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways: COMP528 Multi-Core Programming GPU programming,ii www.csc.liv.ac.uk/~alexei/comp528 Alexei Lisitsa Dept of computer science University of Liverpool a.lisitsa@.liverpool.ac.uk Different ways: GPU programming

More information

GPU for HPC. October 2010

GPU for HPC. October 2010 GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,

More information

Introduction to Runtime Systems

Introduction to Runtime Systems Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Advanced CUDA Optimization 1. Introduction

Advanced CUDA Optimization 1. Introduction Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines

More information

GPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh

GPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)

More information

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY. Peter Messmer

NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY. Peter Messmer NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY Peter Messmer pmessmer@nvidia.com COMPUTATIONAL CHALLENGES IN HEP Low-Level Trigger High-Level Trigger Monte Carlo Analysis Lattice QCD 2 COMPUTATIONAL

More information

Josef Pelikán, Jan Horáček CGG MFF UK Praha

Josef Pelikán, Jan Horáček CGG MFF UK Praha GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance

More information

Scientific discovery, analysis and prediction made possible through high performance computing.

Scientific discovery, analysis and prediction made possible through high performance computing. Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013

More information

Parallel Accelerators

Parallel Accelerators Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon

More information

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization

More information

Lecture 1: Introduction and Computational Thinking

Lecture 1: Introduction and Computational Thinking PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller

More information

By: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,

More information

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics GPU Programming Rüdiger Westermann Chair for Computer Graphics & Visualization Faculty of Informatics Overview Programming interfaces and support libraries The CUDA programming abstraction An in-depth

More information

Device Memories and Matrix Multiplication

Device Memories and Matrix Multiplication Device Memories and Matrix Multiplication 1 Device Memories global, constant, and shared memories CUDA variable type qualifiers 2 Matrix Multiplication an application of tiling runningmatrixmul in the

More information

WHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016

WHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve Larger Problems** Critical Path Analysis * HOOMD Blue v1.3.3 Lennard-Jones liquid

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber, HPC trends (Myths about) accelerator cards & more June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk Outline HPC & current architectures Performance: Programming models: OpenCL & OpenMP Some applications:

More information

High-Productivity CUDA Programming. Cliff Woolley, Sr. Developer Technology Engineer, NVIDIA

High-Productivity CUDA Programming. Cliff Woolley, Sr. Developer Technology Engineer, NVIDIA High-Productivity CUDA Programming Cliff Woolley, Sr. Developer Technology Engineer, NVIDIA HIGH-PRODUCTIVITY PROGRAMMING High-Productivity Programming What does this mean? What s the goal? Do Less Work

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Introduction to CELL B.E. and GPU Programming. Agenda

Introduction to CELL B.E. and GPU Programming. Agenda Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU

More information

Parallel Accelerators

Parallel Accelerators Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance

More information

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

Real-time Graphics 9. GPGPU

Real-time Graphics 9. GPGPU Real-time Graphics 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing

More information

ADVANCED ACCELERATED COMPUTING USING COMPILER DIRECTIVES. Jeff Larkin, NVIDIA

ADVANCED ACCELERATED COMPUTING USING COMPILER DIRECTIVES. Jeff Larkin, NVIDIA ADVANCED ACCELERATED COMPUTING USING COMPILER DIRECTIVES Jeff Larkin, NVIDIA OUTLINE Compiler Directives Review Asynchronous Execution OpenACC Interoperability OpenACC `routine` Advanced Data Directives

More information