CUDA Introduction Part I PATC GPU Programming Course 2017
|
|
- Belinda Ramsey
- 6 years ago
- Views:
Transcription
1 CUDA Introduction Part I PATC GPU Programming Course 2017 Andreas Herten, Forschungszentrum Jülich, 24 April 2017 Handout Version
2 Outline Introduction GPU History Architecture Comparison Jülich Systems App Showcase The GPU Platform 3 Core Features Memory Asynchronicity SIMT High Throughput Summary Programming GPUs Libraries About CUDA Alternatives Directives Thrust CUDA C/C++ Andreas Herten CUDA Introduction I 24 April 2017 # 2 45
3 History of GPUs 50 Shaders of Gray 1999 Graphics computation pipeline implemented in dedicated graphics hardware Computations using OpenGL graphics library [1]»GPU«coined by NVIDIA [2] 2001 NVIDIA GeForce 3 with programmable shaders (instead of fixed pipeline) and floating-point support; 2003: DirectX 9 at ATI 2007 CUDA 2009 OpenCL 2016 Top 500: > 1 /10 with GPUs [3], Green 500: 2 /3 of top 50 with GPUs [4] Andreas Herten CUDA Introduction I 24 April 2017 # 3 45
4 Status Quo Across Architectures Memory Bandwidth GFLOP/sec Theoretical Peak Performance, Double Precision Xeon Phi 7120 (KNC) INTEL Xeon CPUs Xeon Phi 7290 (KNL) HD 3870 HD 4870 HD 5870 HD 6970 Tesla C1060 HD 6970 Tesla K20 X5482 Tesla C1060 Tesla C2050 Tesla M2090 Tesla K20X FirePro W9100 X5492 HD 7970 GHz Ed. Tesla K40 FirePro S9150 W5590 HD 8970 Tesla P100 X5680 Tesla K40 X5690 E E v3 E v2 E v3 E v4 NVIDIA Tesla GPUs AMD Radeon GPUs INTEL Xeon Phis End of Year Andreas Herten CUDA Introduction I 24 April 2017 # 4 45 Graphic: Rupp [5]
5 Status Quo Across Architectures Memory Bandwidth GB/sec INTEL Xeon CPUs NVIDIA Tesla GPUs AMD Radeon GPUs INTEL Xeon Phis HD 3870 HD 4870 Tesla C1060 HD 5870 Theoretical Peak Memory Bandwidth Comparison HD 6970 HD 6970 HD 7970 GHz Ed. Tesla C1060 Tesla C2050 Tesla M2090 Tesla K20 HD 8970 FirePro W9100 Tesla K40 Xeon Phi 7120 (KNC) Tesla K20X E E v2 E v3 FirePro S9150 E v3 Tesla P100 Xeon Phi 7290 (KNL) E v4 Graphic: Rupp [5] X5482 X5492 W5590 X5680 X End of Year Andreas Herten CUDA Introduction I 24 April 2017 # 4 45
6 Status Quo Across Architectures Memory Bandwidth INTEL Xeon CPUs NVIDIA Tesla GPUs AMD Radeon GPUs INTEL Xeon Phis Theoretical Peak Memory Bandwidth Comparison Tesla P100 GB/sec Tesla K40 Xeon Phi 7120 (KNC) Tesla K20X Xeon Phi 7290 (KNL) HD 3870 HD 4870 HD 5870 X5482 Tesla C1060 HD 6970 HD 7970 GHz Ed. X5492 Tesla C1060 HD 6970 W5590 Tesla C2050 HD 8970 FirePro W9100 Tesla M2090 X5680 Tesla K20 FirePro S9150 X5690 E E v2 E v3 E v3 E v End of Year Andreas Herten CUDA Introduction I 24 April 2017 # 4 45 Graphic: Rupp [5]
7 JURECA Jülich s Multi-Purpose Supercomputer 1872 nodes with Intel Xeon E5 CPUs (2 12 cores) 75 nodes with 2 NVIDIA Tesla K80 cards (look like 4 GPUs) 1.8 (CPU) (GPU) PFLOP/s peak performance (#70) Dedicated visualization nodes Mellanox EDR InfiniBand Andreas Herten CUDA Introduction I 24 April 2017 # 5 45
8 JULIA JURON JURON A Human Brain Project Prototype 18 nodes with IBM POWER8NVL CPUs (2 10 cores) Per Node: 4 NVIDIA Tesla P100 cards, connected via NVLink. GPU: 0.38 PFLOP/s peak performance Dedicated visualization nodes Andreas Herten CUDA Introduction I 24 April 2017 # 6 45
9 Getting GPU-Acquainted Some Applications TASK Dot Product GEMM Location of Code: CudaBasics/exercises/tasks/getting_started/ See Instructions.rst for hints. N-Body Mandelbrot Andreas Herten CUDA Introduction I 24 April 2017 # 7 45
10 Getting GPU-Acquainted Some Applications CPU GPU DDot Benchmark CPU GPU DGEMM Benchmark TASK MFLOP/s GFLOP/s Length of Vector Size of Square Matrix GFLOP/s GPU SP 2 GPUs SP 4 GPUs SP 1 GPU DP 2 GPUs DP 4 GPUs DP N-Body Benchmark Number of Particles MPixel/s Mandelbrot Benchmark CPU GPU Width of Image Andreas Herten CUDA Introduction I 24 April 2017 # 7 45
11 The GPU Platform Andreas Herten CUDA Introduction I 24 April 2017 # 8 45
12 CPU vs. GPU Transporting one Andreas Herten CUDA Introduction I 24 April 2017 Graphics: Lee [6] and Shearings Holidays [7] A matter of specialties Transporting many # 9 45
13 CPU vs. GPU Chip Control ALU ALU ALU ALU Cache DRAM DRAM Andreas Herten CUDA Introduction I 24 April 2017 # 10 45
14 GPU Architecture Overview Aim: Hide Latency Everything else follows SIMT Asynchronicity Memory Andreas Herten CUDA Introduction I 24 April 2017 # 11 45
15 GPU Architecture Overview Aim: Hide Latency Everything else follows SIMT Asynchronicity Memory Andreas Herten CUDA Introduction I 24 April 2017 # 11 45
16 Memory GPU memory ain t no CPU memory Unified Virtual Addressing GPU: accelerator / extension card Separate device from CPU Separate memory, but UVA Memory transfers need special consideration! Do as little as possible! Formerly: Explicitly copy data to/from GPU Now: Done automatically (performance?) Host ALU ALU ALU ALU Control Cache DRAM PCIe <16 GB/s DRAM Device Andreas Herten CUDA Introduction I 24 April 2017 # 12 45
17 Memory GPU memory ain t no CPU memory Unified Memory GPU: accelerator / extension card Separate device from CPU Separate memory, but UVA and UM Memory transfers need special consideration! Do as little as possible! Formerly: Explicitly copy data to/from GPU Now: Done automatically (performance?) Values for P100: 16 GB RAM, 720 GB/s Host ALU ALU ALU ALU Control Cache DRAM NVLink 80 GB/s HBM2 <720 GB/s DRAM Device Andreas Herten CUDA Introduction I 24 April 2017 # 12 45
18 Processing Flow CPU GPU CPU Scheduler CPU... CPU Memory Interconnect 1 Transfer data from CPU memory to GPU memory, transfer program 2 Load GPU program, execute on SMs, get (cached) data from memory; write back L2 DRAM Andreas Herten CUDA Introduction I 24 April 2017 # 13 45
19 Processing Flow CPU GPU CPU Scheduler CPU... CPU Memory Interconnect 1 Transfer data from CPU memory to GPU memory, transfer program 2 Load GPU program, execute on SMs, get (cached) data from memory; write back L2 DRAM Andreas Herten CUDA Introduction I 24 April 2017 # 13 45
20 Processing Flow CPU GPU CPU Scheduler CPU... CPU Memory Interconnect 1 Transfer data from CPU memory to GPU memory, transfer program 2 Load GPU program, execute on SMs, get (cached) data from memory; write back 3 Transfer results back to host memory L2 DRAM Andreas Herten CUDA Introduction I 24 April 2017 # 13 45
21 GPU Architecture Overview Aim: Hide Latency Everything else follows SIMT Asynchronicity Memory Andreas Herten CUDA Introduction I 24 April 2017 # 14 45
22 Async Following different streams Problem: Memory transfer is comparably slow Solution: Do something else in meantime (computation)! Overlap tasks Copy and compute engines run separately (streams) Copy Compute Copy Compute Copy Compute Copy Compute GPU needs to be fed: Schedule many computations CPU can do other work while GPU computes; synchronization Andreas Herten CUDA Introduction I 24 April 2017 # 15 45
23 GPU Architecture Overview Aim: Hide Latency Everything else follows SIMT Asynchronicity Memory Andreas Herten CUDA Introduction I 24 April 2017 # 16 45
24 SIMT Of threads and warps Vector A 0 B 0 C 0 CPU: Single Instruction, Multiple Data (SIMD) Simultaneous Multithreading (SMT) GPU: Single Instruction, Multiple Threads (SIMT) CPU core GPU multiprocessor (SM) Working unit: set of threads (32, a warp) Fast switching of threads (large register file) Branching if A 1 B 1 C 1 + = A 2 A 3 B 2 B 3 C 2 C 3 Thread Core Thread Core SMT SIMT Core Core Andreas Herten CUDA Introduction I 24 April 2017 # 17 45
25 SIMT Of threads and warps Vector A 0 B 0 C 0 CPU: Single Instruction, Multiple Data (SIMD) Simultaneous Multithreading (SMT) GPU: Single Instruction, Multiple Threads (SIMT) CPU core GPU multiprocessor (SM) Working unit: set of threads (32, a warp) Fast switching of threads (large register file) Branching if A 1 B 1 C 1 + = A 2 A 3 B 2 B 3 C 2 C 3 Thread Core Thread Core SMT SIMT Core Core Graphics: Nvidia Corporation [8] Pascal GP100 Andreas Herten CUDA Introduction I 24 April 2017 # 17 45
26 SIMT Of threads and warps Multiprocessor A 0 Vector B 0 C 0 CPU: Single Instruction, Multiple Data (SIMD) Simultaneous Multithreading (SMT) GPU: Single Instruction, Multiple Threads (SIMT) CPU core GPU multiprocessor (SM) Working unit: set of threads (32, a warp) Fast switching of threads (large register file) Branching if A 1 B 1 C 1 + = A 2 A 3 B 2 B 3 C 2 C 3 Thread Core Thread Core SMT SIMT Core Core Graphics: Nvidia Corporation [8] Pascal GP100 Andreas Herten CUDA Introduction I 24 April 2017 # 17 45
27 Low Latency vs. High Throughput Maybe GPU s ultimate feature CPU Minimizes latency within each thread GPU Hides latency with computations from other thread warps CPU Core: Low Latency T 1 T 2 T 3 T 4 GPU Streaming Multiprocessor: High Throughput W 1 W2 W 3 W4 Thread/Warp Processing Context Switch Ready Waiting Andreas Herten CUDA Introduction I 24 April 2017 # 18 45
28 CPU vs. GPU Let s summarize this! Optimized for low latency Optimized for high throughput Large main memory Fast clock rate Large caches Branch prediction Powerful ALU Relatively low memory bandwidth Cache misses costly Low performance per watt + High bandwidth main memory + Latency tolerant (parallelism) + More compute resources + High performance per watt Limited memory capacity Low per-thread performance Extension card Andreas Herten CUDA Introduction I 24 April 2017 # 19 45
29 Programming GPUs Andreas Herten CUDA Introduction I 24 April 2017 # 20 45
30 Preface: CPU A simple CPU program! SAXPY: y = a x + y, with single precision Part of LAPACK BLAS Level 1 void saxpy(int n, float a, float * x, float * y) { } for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i]; int a = 42; int n = 10; float x[n], y[n]; // fill x, y saxpy(n, a, x, y); Andreas Herten CUDA Introduction I 24 April 2017 # 21 45
31 Libraries The truth is out there! Programming GPUs is easy: Just don t! Use applications & libraries! Wizard: Breazell [9] Andreas Herten CUDA Introduction I 24 April 2017 # 22 45
32 Libraries The truth is out there! Programming GPUs is easy: Just don t! Use applications & libraries! cublas cusparse Wizard: Breazell [9] cufft curand CUDA Math th ano Andreas Herten CUDA Introduction I 24 April 2017 # 22 45
33 cublas Parallel algebra GPU-parallel BLAS (all 152 routines) Single, double, complex data types Constant competition with Intel s MKL Multi-GPU support Andreas Herten CUDA Introduction I 24 April 2017 # 23 45
34 cublas Code example int a = 42; int n = 10; float x[n], y[n]; // fill x, y cublashandle_t handle; cublascreate(&handle); float * d_x, * d_y; cudamallocmanaged(&d_x, n * sizeof(x[0]); cudamallocmanaged(&d_y, n * sizeof(y[0]); cublassetvector(n, sizeof(x[0]), x, 1, d_x, 1); cublassetvector(n, sizeof(y[0]), y, 1, d_y, 1); cublassaxpy(n, a, d_x, 1, d_y, 1); cublasgetvector(n, sizeof(y[0]), d_y, 1, y, 1); cudafree(d_x); cudafree(d_y); cublasdestroy(handle); Andreas Herten CUDA Introduction I 24 April 2017 # 24 45
35 cublas Code example int a = 42; int n = 10; float x[n], y[n]; // fill x, y cublashandle_t handle; cublascreate(&handle); Initialize float * d_x, * d_y; cudamallocmanaged(&d_x, n * sizeof(x[0]); cudamallocmanaged(&d_y, n * sizeof(y[0]); cublassetvector(n, sizeof(x[0]), x, 1, d_x, 1); cublassetvector(n, sizeof(y[0]), y, 1, d_y, 1); cublassaxpy(n, a, d_x, 1, d_y, 1); Allocate GPU memory Copy data to GPU Call BLAS routine cublasgetvector(n, sizeof(y[0]), d_y, 1, y, 1); Copy result to host cudafree(d_x); cudafree(d_y); Finalize cublasdestroy(handle); Andreas Herten CUDA Introduction I 24 April 2017 # 24 45
36 cublas Task Implement a matrix-matrix multiplication TASK Location of code: CudaBasics/excercises/tasks/cublas/ Look at Instructions.rst for instructions 1 Implement call to double-precision GEMM of cublas 2 Build with make (CUDA needs to be loaded!) 3 Run with make run Or bsub -I -R "rusage[ngpus_shared=1]"./dgemm_um N, where N=100, 200, Check cublas documentation for details on cublasdgemm() JURON Getting Started module load nvidia/cuda/8.0 bsub -I -R "rusage[ngpus_shared=1]" hostname bsub -Is -R "rusage[ngpus_shared=1]" -tty /bin/bash bsub -Is -R "rusage[ngpus_shared=1]" -tty -XF nvvp Andreas Herten CUDA Introduction I 24 April 2017 # 25 45
37 Programming GPUs About CUDA Alternatives Andreas Herten CUDA Introduction I 24 April 2017 # 26 45
38 ! Parallelism Libraries are not enough? You think you want to write your own GPU code? Andreas Herten CUDA Introduction I 24 April 2017 # 27 45
39 Primer on Parallel Scaling Amdahl s Law Possible maximum speedup for N parallel processors Total Time t = t serial + t parallel N Processors t(n) = t s + t p /N Speedup s(n) = t/t(n) = t s+t p t s +t p /N Speedup Parallel Portion: 50% Parallel Portion: 75% Parallel Portion: 90% Parallel Portion: 95% Parallel Portion: 99% Number of Processors Andreas Herten CUDA Introduction I 24 April 2017 # 28 45
40 ! Parallelism Parallel programming is not easy! Things to consider: Is my application computationally intensive enough? What are the levels of parallelism? How much data needs to be transferred? Is the gain worth the pain? Andreas Herten CUDA Introduction I 24 April 2017 # 29 45
41 Alternatives The twilight There are alternatives to CUDA C, which can ease the pain OpenACC Thrust PyCUDA Other alternatives (for completeness) CUDA Fortran OpenMP OpenCL Andreas Herten CUDA Introduction I 24 April 2017 # 30 45
42 Programming GPUs Directives Andreas Herten CUDA Introduction I 24 April 2017 # 31 45
43 GPU Programming with Directives Keepin you portable Annotate serial source code by directives #pragma acc loop for (int i = 0; i < 1; i+*) {}; Also: Generalized API functions acc_copy(); Compiler interprets directives, creates according instructions Pro Portability Other compiler? No problem! To it, it s a serial program Different target architectures from same code Easy to program Con Only few compilers Not all the raw power available Harder to debug Easy to program wrong Andreas Herten CUDA Introduction I 24 April 2017 # 32 45
44 GPU Programming with Directives The power of two. OpenMP Standard for multithread programming on CPU, GPU since 4.0, better since 4.5 #pragma omp target map(tofrom:y), map(to:x) #pragma omp teams num_teams(10) num_threads(10) #pragma omp distribute for ( ) { #pragma omp parallel for for ( ) { // } } OpenACC Similar to OpenMP, but more specifically for GPUs For C/C++ and Fortran Andreas Herten CUDA Introduction I 24 April 2017 # 33 45
45 OpenACC Code example void saxpy_acc(int n, float a, float * x, float * y) { #pragma acc kernels for (int i = 0; i < n; i++) } y[i] = a * x[i] + y[i]; int a = 42; int n = 10; float x[n], y[n]; // fill x, y saxpy_acc(n, a, x, y); Andreas Herten CUDA Introduction I 24 April 2017 # 34 45
46 OpenACC Code example void saxpy_acc(int n, float a, float * x, float * y) { #pragma acc parallel loop copy(y) copyin(x) for (int i = 0; i < n; i++) } y[i] = a * x[i] + y[i]; int a = 42; int n = 10; float x[n], y[n]; // fill x, y saxpy_acc(n, a, x, y); Andreas Herten CUDA Introduction I 24 April 2017 # 34 45
47 OpenACC Code example TASK void saxpy_acc(int n, float a, float * x, float * y) { #pragma acc parallel loop copy(y) copyin(x) for (int i = 0; i < n; i++) } y[i] = a * x[i] + y[i]; int a = 42; int n = 10; See JSC OpenACC course in October! float x[n], y[n]; // fill x, y saxpy_acc(n, a, x, y); Andreas Herten CUDA Introduction I 24 April 2017 # 34 45
48 Programming GPUs Thrust Andreas Herten CUDA Introduction I 24 April 2017 # 35 45
49 Thrust Iterators! Iterators everywhere! Thrust CUDA = C++ STL Template library Based on iterators Data-parallel primitives (scan(), sort(), reduce(), ) Fully compatible with plain CUDA C (comes with CUDA Toolkit) Andreas Herten CUDA Introduction I 24 April 2017 # 36 45
50 Thrust Code example int a = 42; int n = 10; thrust::host_vector<float> x(n), y(n); // fill x, y thrust::device_vector d_x = x, d_y = y; using namespace thrust::placeholders; thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(), a * _1 + _2); x = d_x; Andreas Herten CUDA Introduction I 24 April 2017 # 37 45
51 Thrust Task Let s sort some randomness TASK Location of code: CudaBasics/excercises/tasks/thrust/ Look at Instructions.rst for instructions 1 Sort random numbers with Thrust on CPU and GPU 2 Build with make (CUDA needs to be loaded!) 3 Run with make run Or bsub -I -R "rusage[ngpus_shared=1]"./thrustsort Check Thrust documentation for details on thrust::sort() Andreas Herten CUDA Introduction I 24 April 2017 # 38 45
52 Summary of Acceleration Possibilities Application Libraries OpenACC Directives Programming Languages Drop-in Acceleration Easy Acceleration Flexible Acceleration Andreas Herten CUDA Introduction I 24 April 2017 # 39 45
53 Programming GPUs CUDA C/C++ Andreas Herten CUDA Introduction I 24 April 2017 # 40 45
54 CUDA SAXPY With runtime-managed data transfers global void saxpy_cuda(int n, float a, float * x, float * y) { int i = blockidx.x * blockdim.x + threadidx.x; if (i < n) y[i] = a * x[i] + y[i]; } int a = 42; int n = 10; float x[n], y[n]; // fill x, y cudamallocmanaged(&x, n * sizeof(float)); cudamallocmanaged(&y, n * sizeof(float)); saxpy_cuda<<<2, 5>>>(n, a, x, y); cudadevicesynchronize(); Andreas Herten CUDA Introduction I 24 April 2017 # 41 45
55 CUDA C/C++ Warp the kernel, it s a thread! Methods to exploit parallelism: Thread Block Block Grid All in 3D Threads run on individual CUDA cores Lightweight fast switchting! 1000s threads execute simultaneously Warp: Bundle of (currently) 32 threads executing at the exact same time on one multiprocessor only hardware concept! Andreas Herten CUDA Introduction I 24 April 2017 # 42 45
56 CUDA C/C++ Kernels Execution unit: kernel Function executing in parallel on device global kernel(int a, float * b) { } Access own ID by global variables threadidx.x, blockidx.y, Execution order non-deterministic! Might even be serial Only threads in one warp (32 threads of block) can communicate reliably/quickly All threads execute same code, but can take different paths (beware of slow branches!) Further keywords: Atomic operations (locking); shared memory; thread synchronization CUDA SAXPY! Andreas Herten CUDA Introduction I 24 April 2017 # 43 45
57 CUDA API?!? Jan! Andreas Herten CUDA Introduction I 24 April 2017 # 44 45
58 Conclusions of Part 1 GPUs achieve performance by specialized hardware Acceleration can be done by different means Libraries are the easiest Thrust, OpenACC can give first entry point Full power with CUDA CUDA parallelizes for GPUs with many threads Thank you for your attention! a.herten@fz-juelich.de Andreas Herten CUDA Introduction I 24 April 2017 # 45 45
59 Appendix Glossary References Andreas Herten CUDA Introduction I 24 April 2017 # 1 8
60 Glossary I API A programmatic interface to software by well-defined functions. Short for application programming interface. 43, 57, 60 ATI Canada-based GPUs manufacturing company; bought by AMD in , 60 CUDA Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++. 2, 3, 41, 49, 53, 54, 55, 56, 57, 58, 60 GCC The GNU Compiler Collection, the collection of open source compilers, among other for C and Fortran. 60 Andreas Herten CUDA Introduction I 24 April 2017 # 2 8
61 Glossary II LLVM An open Source compiler infrastructure, providing, among others, Clang for C. 60 NVIDIA US technology company creating GPUs. 3, 7, 8, 60 NVLink NVIDIA s communication protocol connecting CPU GPU and GPU GPU with 80 GB/s. PCI-Express: 16 GB/s. 8, 60 OpenACC Directive-based programming, primarily for many-core machines. 41, 44, 45, 46, 47, 60 OpenCL The Open Computing Language. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA. 3, 41, 60 Andreas Herten CUDA Introduction I 24 April 2017 # 3 8
62 Glossary III OpenGL The Open Graphics Library, an API for rendering graphics across different hardware architectures. 3, 60 OpenMP Directive-based programming, primarily for multi-threaded machines. 41, 44, 60 P100 A large GPU with the Pascal architecture from NVIDIA. It employs NVLink as its interconnect and has fast HBM2 memory. 8, 60 SAXPY Single-precision A X + Y. A simple code example of scaling a vector and adding an offset. 30, 54, 56, 60 Tesla The GPU product line for general purpose computing computing of NVIDIA. 7, 8, 60 Andreas Herten CUDA Introduction I 24 April 2017 # 4 8
63 Glossary IV Thrust A parallel algorithms library for (among others) GPUs. See 41, 49, 60 Andreas Herten CUDA Introduction I 24 April 2017 # 5 8
64 References I [1] Kenneth E. Hoff III et al. Fast Computation of Generalized Voronoi Diagrams Using Graphics Hardware. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. SIGGRAPH 99. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 1999, pp ISBN: DOI: / URL: (page 3). [2] Chris McClanahan. History and Evolution of GPU Architecture. In: A Survey Paper (2010). URL: (page 3). [3] Jack Dongarra et al. TOP500. Nov URL: (page 3). Andreas Herten CUDA Introduction I 24 April 2017 # 6 8
65 References II [4] Jack Dongarra et al. Green500. Nov URL: (page 3). [5] Karl Rupp. Pictures: CPU/GPU Performance Comparison. URL: (pages 4 6). [6] Mark Lee. Picture: kawasaki ninja. URL: (page 12). [7] Shearings Holidays. Picture: Shearings coach 636. URL: (page 12). Andreas Herten CUDA Introduction I 24 April 2017 # 7 8
66 References III [8] Nvidia Corporation. Pictures: Pascal Blockdiagram, Pascal Multiprocessor. Pascal Architecture Whitepaper. URL: whitepaper/pascal-architecture-whitepaper.pdf (pages 25, 26). [9] Wes Breazell. Picture: Wizard. URL: (pages 31, 32). Andreas Herten CUDA Introduction I 24 April 2017 # 8 8
GPU ACCELERATORS AT JSC OF THREADS AND KERNELS
GPU ACCELERATORS AT JSC OF THREADS AND KERNELS 28 May 2018 Andreas Herten Forschungszentrum Jülich Member of the Helmholtz Association Outline GPUs at JSC GPU Architecture Empirical Motivation Comparisons
More informationUnified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association
Unified Memory Notes on GPU Data Transfers Andreas Herten, Forschungszentrum Jülich, 24 April 2017 Handout Version Overview, Outline Overview Unified Memory enables easy access to GPU development But some
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationGPU Architecture. Alan Gray EPCC The University of Edinburgh
GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From
More informationIntroduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator
Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator What is CUDA? Programming language? Compiler? Classic car? Beer? Coffee? CUDA Parallel Computing Platform www.nvidia.com/getcuda Programming
More informationIntroduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research
Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [ jbalfour@nvidia.com] NVIDIA Research CUDA Programing system for machines with GPUs Programming Language Compilers Runtime Environments Drivers
More informationCSC573: TSHA Introduction to Accelerators
CSC573: TSHA Introduction to Accelerators Sreepathi Pai September 5, 2017 URCS Outline Introduction to Accelerators GPU Architectures GPU Programming Models Outline Introduction to Accelerators GPU Architectures
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationGPU ARCHITECTURE Chris Schultz, June 2017
GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA
More informationGPU CUDA Programming
GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationGPU ARCHITECTURE Chris Schultz, June 2017
Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE Problems Solved Over Time versus Why are they different? Complex
More informationOverview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
More informationOpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016
OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators
More informationUsing JURECA's GPU Nodes
Mitglied der Helmholtz-Gemeinschaft Using JURECA's GPU Nodes Willi Homberg Supercomputing Centre Jülich (JSC) Introduction to the usage and programming of supercomputer resources in Jülich 23-24 May 2016
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationMulti-Processors and GPU
Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationAccelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include
3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI
More informationIntroduction to CUDA Programming
Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationProgrammable Graphics Hardware (GPU) A Primer
Programmable Graphics Hardware (GPU) A Primer Klaus Mueller Stony Brook University Computer Science Department Parallel Computing Explained video Parallel Computing Explained Any questions? Parallelism
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationGPU programming. Dr. Bernhard Kainz
GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationIntroduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel
More informationProfiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015
Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationAn Introduc+on to OpenACC Part II
An Introduc+on to OpenACC Part II Wei Feinstein HPC User Services@LSU LONI Parallel Programming Workshop 2015 Louisiana State University 4 th HPC Parallel Programming Workshop An Introduc+on to OpenACC-
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationParallel Computing. November 20, W.Homberg
Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better
More informationGPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler
GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler Taylor Lloyd Jose Nelson Amaral Ettore Tiotto University of Alberta University of Alberta IBM Canada 1 Why? 2 Supercomputer Power/Performance GPUs
More informationPractical Introduction to CUDA and GPU
Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing
More informationVSC Users Day 2018 Start to GPU Ehsan Moravveji
Outline A brief intro Available GPUs at VSC GPU architecture Benchmarking tests General Purpose GPU Programming Models VSC Users Day 2018 Start to GPU Ehsan Moravveji Image courtesy of Nvidia.com Generally
More informationLecture 1: an introduction to CUDA
Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming
More informationCS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)
CS 470 Spring 2016 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Parallel Systems Shared memory (uniform global address space) Primary story: make faster computers Programming
More informationGPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum
GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute
More informationMassively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN
Massively Parallel Computing with CUDA Carlos Alberto Martínez Angeles Cinvestav-IPN What is a GPU? A graphics processing unit (GPU) The term GPU was popularized by Nvidia in 1999 marketed the GeForce
More informationLECTURE ON PASCAL GPU ARCHITECTURE. Jiri Kraus, November 14 th 2016
LECTURE ON PASCAL GPU ARCHITECTURE Jiri Kraus, November 14 th 2016 ACCELERATED COMPUTING CPU Optimized for Serial Tasks GPU Accelerator Optimized for Parallel Tasks 2 ACCELERATED COMPUTING CPU Optimized
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationOn Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationIntroduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series
Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014
More informationIntroduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series
Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software
More informationApproaches to GPU Computing. Libraries, OpenACC Directives, and Languages
Approaches to GPU Computing Libraries, OpenACC Directives, and Languages Add GPUs: Accelerate Applications CPU GPU 146X 36X 18X 50X 100X Medical Imaging U of Utah Molecular Dynamics U of Illinois, Urbana
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationHigh Performance Computing and GPU Programming
High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz
More informationA case study of performance portability with OpenMP 4.5
A case study of performance portability with OpenMP 4.5 Rahul Gayatri, Charlene Yang, Thorsten Kurth, Jack Deslippe NERSC pre-print copy 1 Outline General Plasmon Pole (GPP) application from BerkeleyGW
More informationARCHER Champions 2 workshop
ARCHER Champions 2 workshop Mike Giles Mathematical Institute & OeRC, University of Oxford Sept 5th, 2016 Mike Giles (Oxford) ARCHER Champions 2 Sept 5th, 2016 1 / 14 Tier 2 bids Out of the 8 bids, I know
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction
More informationIntroduction to Parallel Computing with CUDA. Oswald Haan
Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries
More informationCS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)
CS 470 Spring 2018 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Aside (P3 related): linear algebra Many scientific phenomena can be modeled as matrix operations Differential
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationGPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways:
COMP528 Multi-Core Programming GPU programming,ii www.csc.liv.ac.uk/~alexei/comp528 Alexei Lisitsa Dept of computer science University of Liverpool a.lisitsa@.liverpool.ac.uk Different ways: GPU programming
More informationGPU for HPC. October 2010
GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,
More informationIntroduction to Runtime Systems
Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationAdvanced CUDA Optimization 1. Introduction
Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines
More informationGPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh
GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationNOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY. Peter Messmer
NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY Peter Messmer pmessmer@nvidia.com COMPUTATIONAL CHALLENGES IN HEP Low-Level Trigger High-Level Trigger Monte Carlo Analysis Lattice QCD 2 COMPUTATIONAL
More informationJosef Pelikán, Jan Horáček CGG MFF UK Praha
GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance
More informationScientific discovery, analysis and prediction made possible through high performance computing.
Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationPROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec
PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization
More informationLecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationEvaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices
Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller
More informationBy: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,
More informationTechnische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics
GPU Programming Rüdiger Westermann Chair for Computer Graphics & Visualization Faculty of Informatics Overview Programming interfaces and support libraries The CUDA programming abstraction An in-depth
More informationDevice Memories and Matrix Multiplication
Device Memories and Matrix Multiplication 1 Device Memories global, constant, and shared memories CUDA variable type qualifiers 2 Matrix Multiplication an application of tiling runningmatrixmul in the
More informationWHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016
WHAT S NEW IN CUDA 8 Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve Larger Problems** Critical Path Analysis * HOOMD Blue v1.3.3 Lennard-Jones liquid
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationHPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,
HPC trends (Myths about) accelerator cards & more June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk Outline HPC & current architectures Performance: Programming models: OpenCL & OpenMP Some applications:
More informationHigh-Productivity CUDA Programming. Cliff Woolley, Sr. Developer Technology Engineer, NVIDIA
High-Productivity CUDA Programming Cliff Woolley, Sr. Developer Technology Engineer, NVIDIA HIGH-PRODUCTIVITY PROGRAMMING High-Productivity Programming What does this mean? What s the goal? Do Less Work
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationOpenACC Course. Office Hour #2 Q&A
OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle
More informationOpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer
OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance
More informationCUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA
CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes
More informationLecture 1: Gentle Introduction to GPUs
CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More informationTrends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationReal-time Graphics 9. GPGPU
Real-time Graphics 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing
More informationADVANCED ACCELERATED COMPUTING USING COMPILER DIRECTIVES. Jeff Larkin, NVIDIA
ADVANCED ACCELERATED COMPUTING USING COMPILER DIRECTIVES Jeff Larkin, NVIDIA OUTLINE Compiler Directives Review Asynchronous Execution OpenACC Interoperability OpenACC `routine` Advanced Data Directives
More information