CUDA Introduction Part I PATC GPU Programming Course 2017

Size: px

Start display at page:

Download "CUDA Introduction Part I PATC GPU Programming Course 2017"

Belinda Ramsey
6 years ago
Views:

1 CUDA Introduction Part I PATC GPU Programming Course 2017 Andreas Herten, Forschungszentrum Jülich, 24 April 2017 Handout Version

2 Outline Introduction GPU History Architecture Comparison Jülich Systems App Showcase The GPU Platform 3 Core Features Memory Asynchronicity SIMT High Throughput Summary Programming GPUs Libraries About CUDA Alternatives Directives Thrust CUDA C/C++ Andreas Herten CUDA Introduction I 24 April 2017 # 2 45

3 History of GPUs 50 Shaders of Gray 1999 Graphics computation pipeline implemented in dedicated graphics hardware Computations using OpenGL graphics library [1]»GPU«coined by NVIDIA [2] 2001 NVIDIA GeForce 3 with programmable shaders (instead of fixed pipeline) and floating-point support; 2003: DirectX 9 at ATI 2007 CUDA 2009 OpenCL 2016 Top 500: > 1 /10 with GPUs [3], Green 500: 2 /3 of top 50 with GPUs [4] Andreas Herten CUDA Introduction I 24 April 2017 # 3 45

4 Status Quo Across Architectures Memory Bandwidth GFLOP/sec Theoretical Peak Performance, Double Precision Xeon Phi 7120 (KNC) INTEL Xeon CPUs Xeon Phi 7290 (KNL) HD 3870 HD 4870 HD 5870 HD 6970 Tesla C1060 HD 6970 Tesla K20 X5482 Tesla C1060 Tesla C2050 Tesla M2090 Tesla K20X FirePro W9100 X5492 HD 7970 GHz Ed. Tesla K40 FirePro S9150 W5590 HD 8970 Tesla P100 X5680 Tesla K40 X5690 E E v3 E v2 E v3 E v4 NVIDIA Tesla GPUs AMD Radeon GPUs INTEL Xeon Phis End of Year Andreas Herten CUDA Introduction I 24 April 2017 # 4 45 Graphic: Rupp [5]

5 Status Quo Across Architectures Memory Bandwidth GB/sec INTEL Xeon CPUs NVIDIA Tesla GPUs AMD Radeon GPUs INTEL Xeon Phis HD 3870 HD 4870 Tesla C1060 HD 5870 Theoretical Peak Memory Bandwidth Comparison HD 6970 HD 6970 HD 7970 GHz Ed. Tesla C1060 Tesla C2050 Tesla M2090 Tesla K20 HD 8970 FirePro W9100 Tesla K40 Xeon Phi 7120 (KNC) Tesla K20X E E v2 E v3 FirePro S9150 E v3 Tesla P100 Xeon Phi 7290 (KNL) E v4 Graphic: Rupp [5] X5482 X5492 W5590 X5680 X End of Year Andreas Herten CUDA Introduction I 24 April 2017 # 4 45

6 Status Quo Across Architectures Memory Bandwidth INTEL Xeon CPUs NVIDIA Tesla GPUs AMD Radeon GPUs INTEL Xeon Phis Theoretical Peak Memory Bandwidth Comparison Tesla P100 GB/sec Tesla K40 Xeon Phi 7120 (KNC) Tesla K20X Xeon Phi 7290 (KNL) HD 3870 HD 4870 HD 5870 X5482 Tesla C1060 HD 6970 HD 7970 GHz Ed. X5492 Tesla C1060 HD 6970 W5590 Tesla C2050 HD 8970 FirePro W9100 Tesla M2090 X5680 Tesla K20 FirePro S9150 X5690 E E v2 E v3 E v3 E v End of Year Andreas Herten CUDA Introduction I 24 April 2017 # 4 45 Graphic: Rupp [5]

7 JURECA Jülich s Multi-Purpose Supercomputer 1872 nodes with Intel Xeon E5 CPUs (2 12 cores) 75 nodes with 2 NVIDIA Tesla K80 cards (look like 4 GPUs) 1.8 (CPU) (GPU) PFLOP/s peak performance (#70) Dedicated visualization nodes Mellanox EDR InfiniBand Andreas Herten CUDA Introduction I 24 April 2017 # 5 45

8 JULIA JURON JURON A Human Brain Project Prototype 18 nodes with IBM POWER8NVL CPUs (2 10 cores) Per Node: 4 NVIDIA Tesla P100 cards, connected via NVLink. GPU: 0.38 PFLOP/s peak performance Dedicated visualization nodes Andreas Herten CUDA Introduction I 24 April 2017 # 6 45

9 Getting GPU-Acquainted Some Applications TASK Dot Product GEMM Location of Code: CudaBasics/exercises/tasks/getting_started/ See Instructions.rst for hints. N-Body Mandelbrot Andreas Herten CUDA Introduction I 24 April 2017 # 7 45

10 Getting GPU-Acquainted Some Applications CPU GPU DDot Benchmark CPU GPU DGEMM Benchmark TASK MFLOP/s GFLOP/s Length of Vector Size of Square Matrix GFLOP/s GPU SP 2 GPUs SP 4 GPUs SP 1 GPU DP 2 GPUs DP 4 GPUs DP N-Body Benchmark Number of Particles MPixel/s Mandelbrot Benchmark CPU GPU Width of Image Andreas Herten CUDA Introduction I 24 April 2017 # 7 45

11 The GPU Platform Andreas Herten CUDA Introduction I 24 April 2017 # 8 45

12 CPU vs. GPU Transporting one Andreas Herten CUDA Introduction I 24 April 2017 Graphics: Lee [6] and Shearings Holidays [7] A matter of specialties Transporting many # 9 45

13 CPU vs. GPU Chip Control ALU ALU ALU ALU Cache DRAM DRAM Andreas Herten CUDA Introduction I 24 April 2017 # 10 45

14 GPU Architecture Overview Aim: Hide Latency Everything else follows SIMT Asynchronicity Memory Andreas Herten CUDA Introduction I 24 April 2017 # 11 45

15 GPU Architecture Overview Aim: Hide Latency Everything else follows SIMT Asynchronicity Memory Andreas Herten CUDA Introduction I 24 April 2017 # 11 45

16 Memory GPU memory ain t no CPU memory Unified Virtual Addressing GPU: accelerator / extension card Separate device from CPU Separate memory, but UVA Memory transfers need special consideration! Do as little as possible! Formerly: Explicitly copy data to/from GPU Now: Done automatically (performance?) Host ALU ALU ALU ALU Control Cache DRAM PCIe <16 GB/s DRAM Device Andreas Herten CUDA Introduction I 24 April 2017 # 12 45

17 Memory GPU memory ain t no CPU memory Unified Memory GPU: accelerator / extension card Separate device from CPU Separate memory, but UVA and UM Memory transfers need special consideration! Do as little as possible! Formerly: Explicitly copy data to/from GPU Now: Done automatically (performance?) Values for P100: 16 GB RAM, 720 GB/s Host ALU ALU ALU ALU Control Cache DRAM NVLink 80 GB/s HBM2 <720 GB/s DRAM Device Andreas Herten CUDA Introduction I 24 April 2017 # 12 45

18 Processing Flow CPU GPU CPU Scheduler CPU... CPU Memory Interconnect 1 Transfer data from CPU memory to GPU memory, transfer program 2 Load GPU program, execute on SMs, get (cached) data from memory; write back L2 DRAM Andreas Herten CUDA Introduction I 24 April 2017 # 13 45

19 Processing Flow CPU GPU CPU Scheduler CPU... CPU Memory Interconnect 1 Transfer data from CPU memory to GPU memory, transfer program 2 Load GPU program, execute on SMs, get (cached) data from memory; write back L2 DRAM Andreas Herten CUDA Introduction I 24 April 2017 # 13 45

20 Processing Flow CPU GPU CPU Scheduler CPU... CPU Memory Interconnect 1 Transfer data from CPU memory to GPU memory, transfer program 2 Load GPU program, execute on SMs, get (cached) data from memory; write back 3 Transfer results back to host memory L2 DRAM Andreas Herten CUDA Introduction I 24 April 2017 # 13 45

21 GPU Architecture Overview Aim: Hide Latency Everything else follows SIMT Asynchronicity Memory Andreas Herten CUDA Introduction I 24 April 2017 # 14 45

22 Async Following different streams Problem: Memory transfer is comparably slow Solution: Do something else in meantime (computation)! Overlap tasks Copy and compute engines run separately (streams) Copy Compute Copy Compute Copy Compute Copy Compute GPU needs to be fed: Schedule many computations CPU can do other work while GPU computes; synchronization Andreas Herten CUDA Introduction I 24 April 2017 # 15 45

23 GPU Architecture Overview Aim: Hide Latency Everything else follows SIMT Asynchronicity Memory Andreas Herten CUDA Introduction I 24 April 2017 # 16 45

24 SIMT Of threads and warps Vector A 0 B 0 C 0 CPU: Single Instruction, Multiple Data (SIMD) Simultaneous Multithreading (SMT) GPU: Single Instruction, Multiple Threads (SIMT) CPU core GPU multiprocessor (SM) Working unit: set of threads (32, a warp) Fast switching of threads (large register file) Branching if A 1 B 1 C 1 + = A 2 A 3 B 2 B 3 C 2 C 3 Thread Core Thread Core SMT SIMT Core Core Andreas Herten CUDA Introduction I 24 April 2017 # 17 45

25 SIMT Of threads and warps Vector A 0 B 0 C 0 CPU: Single Instruction, Multiple Data (SIMD) Simultaneous Multithreading (SMT) GPU: Single Instruction, Multiple Threads (SIMT) CPU core GPU multiprocessor (SM) Working unit: set of threads (32, a warp) Fast switching of threads (large register file) Branching if A 1 B 1 C 1 + = A 2 A 3 B 2 B 3 C 2 C 3 Thread Core Thread Core SMT SIMT Core Core Graphics: Nvidia Corporation [8] Pascal GP100 Andreas Herten CUDA Introduction I 24 April 2017 # 17 45

threads (32, a warp) Fast switching of threads (large register file) Branching if A 1 B 1 C 1 + = A 2 A 3 B 2 B 3 C 2 C 3

26 SIMT Of threads and warps Multiprocessor A 0 Vector B 0 C 0 CPU: Single Instruction, Multiple Data (SIMD) Simultaneous Multithreading (SMT) GPU: Single Instruction, Multiple Threads (SIMT) CPU core GPU multiprocessor (SM) Working unit: set of threads (32, a warp) Fast switching of threads (large register file) Branching if A 1 B 1 C 1 + = A 2 A 3 B 2 B 3 C 2 C 3 Thread Core Thread Core SMT SIMT Core Core Graphics: Nvidia Corporation [8] Pascal GP100 Andreas Herten CUDA Introduction I 24 April 2017 # 17 45

27 Low Latency vs. High Throughput Maybe GPU s ultimate feature CPU Minimizes latency within each thread GPU Hides latency with computations from other thread warps CPU Core: Low Latency T 1 T 2 T 3 T 4 GPU Streaming Multiprocessor: High Throughput W 1 W2 W 3 W4 Thread/Warp Processing Context Switch Ready Waiting Andreas Herten CUDA Introduction I 24 April 2017 # 18 45

Powerful ALU Relatively low memory bandwidth Cache misses costly Low performance per watt + High bandwidth main memory +

28 CPU vs. GPU Let s summarize this! Optimized for low latency Optimized for high throughput Large main memory Fast clock rate Large caches Branch prediction Powerful ALU Relatively low memory bandwidth Cache misses costly Low performance per watt + High bandwidth main memory + Latency tolerant (parallelism) + More compute resources + High performance per watt Limited memory capacity Low per-thread performance Extension card Andreas Herten CUDA Introduction I 24 April 2017 # 19 45

29 Programming GPUs Andreas Herten CUDA Introduction I 24 April 2017 # 20 45

30 Preface: CPU A simple CPU program! SAXPY: y = a x + y, with single precision Part of LAPACK BLAS Level 1 void saxpy(int n, float a, float * x, float * y) { } for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i]; int a = 42; int n = 10; float x[n], y[n]; // fill x, y saxpy(n, a, x, y); Andreas Herten CUDA Introduction I 24 April 2017 # 21 45

31 Libraries The truth is out there! Programming GPUs is easy: Just don t! Use applications & libraries! Wizard: Breazell [9] Andreas Herten CUDA Introduction I 24 April 2017 # 22 45

32 Libraries The truth is out there! Programming GPUs is easy: Just don t! Use applications & libraries! cublas cusparse Wizard: Breazell [9] cufft curand CUDA Math th ano Andreas Herten CUDA Introduction I 24 April 2017 # 22 45

33 cublas Parallel algebra GPU-parallel BLAS (all 152 routines) Single, double, complex data types Constant competition with Intel s MKL Multi-GPU support Andreas Herten CUDA Introduction I 24 April 2017 # 23 45

34 cublas Code example int a = 42; int n = 10; float x[n], y[n]; // fill x, y cublashandle_t handle; cublascreate(&handle); float * d_x, * d_y; cudamallocmanaged(&d_x, n * sizeof(x[0]); cudamallocmanaged(&d_y, n * sizeof(y[0]); cublassetvector(n, sizeof(x[0]), x, 1, d_x, 1); cublassetvector(n, sizeof(y[0]), y, 1, d_y, 1); cublassaxpy(n, a, d_x, 1, d_y, 1); cublasgetvector(n, sizeof(y[0]), d_y, 1, y, 1); cudafree(d_x); cudafree(d_y); cublasdestroy(handle); Andreas Herten CUDA Introduction I 24 April 2017 # 24 45

35 cublas Code example int a = 42; int n = 10; float x[n], y[n]; // fill x, y cublashandle_t handle; cublascreate(&handle); Initialize float * d_x, * d_y; cudamallocmanaged(&d_x, n * sizeof(x[0]); cudamallocmanaged(&d_y, n * sizeof(y[0]); cublassetvector(n, sizeof(x[0]), x, 1, d_x, 1); cublassetvector(n, sizeof(y[0]), y, 1, d_y, 1); cublassaxpy(n, a, d_x, 1, d_y, 1); Allocate GPU memory Copy data to GPU Call BLAS routine cublasgetvector(n, sizeof(y[0]), d_y, 1, y, 1); Copy result to host cudafree(d_x); cudafree(d_y); Finalize cublasdestroy(handle); Andreas Herten CUDA Introduction I 24 April 2017 # 24 45

36 cublas Task Implement a matrix-matrix multiplication TASK Location of code: CudaBasics/excercises/tasks/cublas/ Look at Instructions.rst for instructions 1 Implement call to double-precision GEMM of cublas 2 Build with make (CUDA needs to be loaded!) 3 Run with make run Or bsub -I -R "rusage[ngpus_shared=1]"./dgemm_um N, where N=100, 200, Check cublas documentation for details on cublasdgemm() JURON Getting Started module load nvidia/cuda/8.0 bsub -I -R "rusage[ngpus_shared=1]" hostname bsub -Is -R "rusage[ngpus_shared=1]" -tty /bin/bash bsub -Is -R "rusage[ngpus_shared=1]" -tty -XF nvvp Andreas Herten CUDA Introduction I 24 April 2017 # 25 45

37 Programming GPUs About CUDA Alternatives Andreas Herten CUDA Introduction I 24 April 2017 # 26 45

38 ! Parallelism Libraries are not enough? You think you want to write your own GPU code? Andreas Herten CUDA Introduction I 24 April 2017 # 27 45

39 Primer on Parallel Scaling Amdahl s Law Possible maximum speedup for N parallel processors Total Time t = t serial + t parallel N Processors t(n) = t s + t p /N Speedup s(n) = t/t(n) = t s+t p t s +t p /N Speedup Parallel Portion: 50% Parallel Portion: 75% Parallel Portion: 90% Parallel Portion: 95% Parallel Portion: 99% Number of Processors Andreas Herten CUDA Introduction I 24 April 2017 # 28 45

40 ! Parallelism Parallel programming is not easy! Things to consider: Is my application computationally intensive enough? What are the levels of parallelism? How much data needs to be transferred? Is the gain worth the pain? Andreas Herten CUDA Introduction I 24 April 2017 # 29 45

41 Alternatives The twilight There are alternatives to CUDA C, which can ease the pain OpenACC Thrust PyCUDA Other alternatives (for completeness) CUDA Fortran OpenMP OpenCL Andreas Herten CUDA Introduction I 24 April 2017 # 30 45

42 Programming GPUs Directives Andreas Herten CUDA Introduction I 24 April 2017 # 31 45

43 GPU Programming with Directives Keepin you portable Annotate serial source code by directives #pragma acc loop for (int i = 0; i < 1; i+*) {}; Also: Generalized API functions acc_copy(); Compiler interprets directives, creates according instructions Pro Portability Other compiler? No problem! To it, it s a serial program Different target architectures from same code Easy to program Con Only few compilers Not all the raw power available Harder to debug Easy to program wrong Andreas Herten CUDA Introduction I 24 April 2017 # 32 45

44 GPU Programming with Directives The power of two. OpenMP Standard for multithread programming on CPU, GPU since 4.0, better since 4.5 #pragma omp target map(tofrom:y), map(to:x) #pragma omp teams num_teams(10) num_threads(10) #pragma omp distribute for ( ) { #pragma omp parallel for for ( ) { // } } OpenACC Similar to OpenMP, but more specifically for GPUs For C/C++ and Fortran Andreas Herten CUDA Introduction I 24 April 2017 # 33 45

45 OpenACC Code example void saxpy_acc(int n, float a, float * x, float * y) { #pragma acc kernels for (int i = 0; i < n; i++) } y[i] = a * x[i] + y[i]; int a = 42; int n = 10; float x[n], y[n]; // fill x, y saxpy_acc(n, a, x, y); Andreas Herten CUDA Introduction I 24 April 2017 # 34 45

46 OpenACC Code example void saxpy_acc(int n, float a, float * x, float * y) { #pragma acc parallel loop copy(y) copyin(x) for (int i = 0; i < n; i++) } y[i] = a * x[i] + y[i]; int a = 42; int n = 10; float x[n], y[n]; // fill x, y saxpy_acc(n, a, x, y); Andreas Herten CUDA Introduction I 24 April 2017 # 34 45

47 OpenACC Code example TASK void saxpy_acc(int n, float a, float * x, float * y) { #pragma acc parallel loop copy(y) copyin(x) for (int i = 0; i < n; i++) } y[i] = a * x[i] + y[i]; int a = 42; int n = 10; See JSC OpenACC course in October! float x[n], y[n]; // fill x, y saxpy_acc(n, a, x, y); Andreas Herten CUDA Introduction I 24 April 2017 # 34 45

48 Programming GPUs Thrust Andreas Herten CUDA Introduction I 24 April 2017 # 35 45

49 Thrust Iterators! Iterators everywhere! Thrust CUDA = C++ STL Template library Based on iterators Data-parallel primitives (scan(), sort(), reduce(), ) Fully compatible with plain CUDA C (comes with CUDA Toolkit) Andreas Herten CUDA Introduction I 24 April 2017 # 36 45

50 Thrust Code example int a = 42; int n = 10; thrust::host_vector<float> x(n), y(n); // fill x, y thrust::device_vector d_x = x, d_y = y; using namespace thrust::placeholders; thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(), a * _1 + _2); x = d_x; Andreas Herten CUDA Introduction I 24 April 2017 # 37 45

51 Thrust Task Let s sort some randomness TASK Location of code: CudaBasics/excercises/tasks/thrust/ Look at Instructions.rst for instructions 1 Sort random numbers with Thrust on CPU and GPU 2 Build with make (CUDA needs to be loaded!) 3 Run with make run Or bsub -I -R "rusage[ngpus_shared=1]"./thrustsort Check Thrust documentation for details on thrust::sort() Andreas Herten CUDA Introduction I 24 April 2017 # 38 45

52 Summary of Acceleration Possibilities Application Libraries OpenACC Directives Programming Languages Drop-in Acceleration Easy Acceleration Flexible Acceleration Andreas Herten CUDA Introduction I 24 April 2017 # 39 45

53 Programming GPUs CUDA C/C++ Andreas Herten CUDA Introduction I 24 April 2017 # 40 45

54 CUDA SAXPY With runtime-managed data transfers global void saxpy_cuda(int n, float a, float * x, float * y) { int i = blockidx.x * blockdim.x + threadidx.x; if (i < n) y[i] = a * x[i] + y[i]; } int a = 42; int n = 10; float x[n], y[n]; // fill x, y cudamallocmanaged(&x, n * sizeof(float)); cudamallocmanaged(&y, n * sizeof(float)); saxpy_cuda<<<2, 5>>>(n, a, x, y); cudadevicesynchronize(); Andreas Herten CUDA Introduction I 24 April 2017 # 41 45

55 CUDA C/C++ Warp the kernel, it s a thread! Methods to exploit parallelism: Thread Block Block Grid All in 3D Threads run on individual CUDA cores Lightweight fast switchting! 1000s threads execute simultaneously Warp: Bundle of (currently) 32 threads executing at the exact same time on one multiprocessor only hardware concept! Andreas Herten CUDA Introduction I 24 April 2017 # 42 45

56 CUDA C/C++ Kernels Execution unit: kernel Function executing in parallel on device global kernel(int a, float * b) { } Access own ID by global variables threadidx.x, blockidx.y, Execution order non-deterministic! Might even be serial Only threads in one warp (32 threads of block) can communicate reliably/quickly All threads execute same code, but can take different paths (beware of slow branches!) Further keywords: Atomic operations (locking); shared memory; thread synchronization CUDA SAXPY! Andreas Herten CUDA Introduction I 24 April 2017 # 43 45

57 CUDA API?!? Jan! Andreas Herten CUDA Introduction I 24 April 2017 # 44 45

58 Conclusions of Part 1 GPUs achieve performance by specialized hardware Acceleration can be done by different means Libraries are the easiest Thrust, OpenACC can give first entry point Full power with CUDA CUDA parallelizes for GPUs with many threads Thank you for your attention! a.herten@fz-juelich.de Andreas Herten CUDA Introduction I 24 April 2017 # 45 45

59 Appendix Glossary References Andreas Herten CUDA Introduction I 24 April 2017 # 1 8

60 Glossary I API A programmatic interface to software by well-defined functions. Short for application programming interface. 43, 57, 60 ATI Canada-based GPUs manufacturing company; bought by AMD in , 60 CUDA Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++. 2, 3, 41, 49, 53, 54, 55, 56, 57, 58, 60 GCC The GNU Compiler Collection, the collection of open source compilers, among other for C and Fortran. 60 Andreas Herten CUDA Introduction I 24 April 2017 # 2 8

61 Glossary II LLVM An open Source compiler infrastructure, providing, among others, Clang for C. 60 NVIDIA US technology company creating GPUs. 3, 7, 8, 60 NVLink NVIDIA s communication protocol connecting CPU GPU and GPU GPU with 80 GB/s. PCI-Express: 16 GB/s. 8, 60 OpenACC Directive-based programming, primarily for many-core machines. 41, 44, 45, 46, 47, 60 OpenCL The Open Computing Language. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA. 3, 41, 60 Andreas Herten CUDA Introduction I 24 April 2017 # 3 8

62 Glossary III OpenGL The Open Graphics Library, an API for rendering graphics across different hardware architectures. 3, 60 OpenMP Directive-based programming, primarily for multi-threaded machines. 41, 44, 60 P100 A large GPU with the Pascal architecture from NVIDIA. It employs NVLink as its interconnect and has fast HBM2 memory. 8, 60 SAXPY Single-precision A X + Y. A simple code example of scaling a vector and adding an offset. 30, 54, 56, 60 Tesla The GPU product line for general purpose computing computing of NVIDIA. 7, 8, 60 Andreas Herten CUDA Introduction I 24 April 2017 # 4 8

63 Glossary IV Thrust A parallel algorithms library for (among others) GPUs. See 41, 49, 60 Andreas Herten CUDA Introduction I 24 April 2017 # 5 8

64 References I [1] Kenneth E. Hoff III et al. Fast Computation of Generalized Voronoi Diagrams Using Graphics Hardware. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. SIGGRAPH 99. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 1999, pp ISBN: DOI: / URL: (page 3). [2] Chris McClanahan. History and Evolution of GPU Architecture. In: A Survey Paper (2010). URL: (page 3). [3] Jack Dongarra et al. TOP500. Nov URL: (page 3). Andreas Herten CUDA Introduction I 24 April 2017 # 6 8

65 References II [4] Jack Dongarra et al. Green500. Nov URL: (page 3). [5] Karl Rupp. Pictures: CPU/GPU Performance Comparison. URL: (pages 4 6). [6] Mark Lee. Picture: kawasaki ninja. URL: (page 12). [7] Shearings Holidays. Picture: Shearings coach 636. URL: (page 12). Andreas Herten CUDA Introduction I 24 April 2017 # 7 8

66 References III [8] Nvidia Corporation. Pictures: Pascal Blockdiagram, Pascal Multiprocessor. Pascal Architecture Whitepaper. URL: whitepaper/pascal-architecture-whitepaper.pdf (pages 25, 26). [9] Wes Breazell. Picture: Wizard. URL: (pages 31, 32). Andreas Herten CUDA Introduction I 24 April 2017 # 8 8

GPU ACCELERATORS AT JSC OF THREADS AND KERNELS

GPU ACCELERATORS AT JSC OF THREADS AND KERNELS 28 May 2018 Andreas Herten Forschungszentrum Jülich Member of the Helmholtz Association Outline GPUs at JSC GPU Architecture Empirical Motivation Comparisons