AXEL KOEHLER GPU Computing Update

1 AXEL KOEHLER GPU Computing Update

2 Agenda
- Introduction GPU Computing
- Introduction into GPU Programming
- Kepler GPU Architecture
- GPU Applications
- Future Developments

3 NVIDIA: Parallel Computing Company GPUs: GeForce, Quadro, Tesla ARM SoCs: Tegra 3

4 Continued Demand for Compute Power, and the Power Crisis in (Super)Computing
Example workloads:
- Comprehensive Earth system model
- Coupled simulation of entire cells
- Simulation of combustion for new high-efficiency, low-emission engines
- Predictive calculations for supernovae
(Chart: performance levels and power draw. Gigaflop: 60 kW; Teraflop: 850 kW; Petaflop; Exaflop: 25 MW.)

5 Multi-core CPUs
- Industry has gone multi-core as a first response to power issues
  - Performance through parallelism, not frequency
- But CPUs are fundamentally designed for single-thread performance rather than energy efficiency
  - Fast clock rates with deep pipelines
  - Data and instruction caches optimized for latency
  - Superscalar issue with out-of-order execution
  - Lots of predictions and speculative execution
  - Lots of instruction overhead per operation
- Less than 2% of chip power today goes to flops.

6 Accelerated Computing Add GPUs: Accelerate Applications CPUs: designed to run a few tasks quickly. GPUs: designed to run many tasks efficiently. 6

7 Energy efficient GPU: Performance = Throughput
- Fixed function hardware
- Transistors are primarily devoted to data processing
- Less leaky cache
- SIMT thread execution
  - Groups of threads are formed into warps, which always execute the same instruction
  - Some threads become inactive when the code path diverges
- Cooperative sharing of units with SIMT
  - e.g. fetch an instruction on behalf of several threads, or read a memory location and broadcast it to several registers
- Lack of speculation reduces overhead
- Minimal overhead
  - Hardware-managed parallel thread execution and handling of divergence
(Block diagram labels: DRAM I/F, Giga Thread, Host I/F, L2)
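
The effect of divergence on a warp can be illustrated with a small host-side emulation. This is a minimal sketch in plain Python, not GPU code: `run_warp` and the even/odd branch are hypothetical names chosen for illustration, and it assumes a warp of 32 lanes that executes each side of a branch in a separate pass, with the non-participating lanes masked off.

```python
WARP_SIZE = 32  # assumed warp width for this illustration


def run_warp(values):
    """Emulate one warp running: y = x * 2 if x is even else x + 1.

    Returns the per-lane results and the number of passes the warp needed.
    All lanes step through the same instruction; lanes whose branch is not
    being executed are masked off for that pass.
    """
    assert len(values) == WARP_SIZE
    takes_if = [v % 2 == 0 for v in values]  # per-lane branch decision
    result = list(values)
    passes = 0

    if any(takes_if):  # "if" side: lanes holding odd values are inactive
        passes += 1
        for lane, active in enumerate(takes_if):
            if active:
                result[lane] = values[lane] * 2

    if not all(takes_if):  # "else" side: lanes holding even values are inactive
        passes += 1
        for lane, active in enumerate(takes_if):
            if not active:
                result[lane] = values[lane] + 1

    return result, passes


uniform_result, uniform_passes = run_warp([2] * WARP_SIZE)
divergent_result, divergent_passes = run_warp(list(range(WARP_SIZE)))
print(uniform_passes, divergent_passes)
```

When every lane takes the same path the warp needs a single pass; as soon as the lanes disagree, both sides of the branch are executed and part of the warp idles in each pass.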

8 CPU Pizza Delivery Process: Delivery truck delivers one pizza and then moves to next house Original Idea by Jedox 9

9 NVIDIA GPU Pizza Delivery Process: Many deliveries to many houses Original Idea by Jedox 10

10 Fastest, Most Energy Efficient Supercomputers
World's Fastest Open Science Supercomputer
- 18,688 Tesla K20X GPU accelerators
- 27 petaflops peak
- 90% of performance from GPUs
World's Most Energy Efficient Supercomputer
- 128 Tesla K20 GPU accelerators
- 3150 MFLOPS/Watt
- $100k energy & 300 tons of CO2 saving per year

11 Tesla Kepler Family
World's Fastest and Most Efficient HPC Accelerators

GPU    SP peak (SGEMM)     DP peak (DGEMM)     Memory  Bandwidth (ECC off)  System solution
K20X   3.95 TF (2.90 TF)   1.32 TF (1.22 TF)   6 GB    250 GB/s             Server only
K[…]   […] TF (2.61 TF)    1.17 TF (1.10 TF)   5 GB    208 GB/s             Server + Workstation
K[…]   […] TF              0.19 TF             8 GB    320 GB/s             Server only

Target workloads, first two rows: Weather & Climate, Physics, BioChemistry, CAE, Material Science
Target workloads, third row: Image, Signal, Video, Seismic

12 Introduction into GPU Programming 13

13 Minimum Change, Big Speed-up
(Diagram: the application code is split. Compute-intensive functions run on the GPU; the rest of the sequential code runs on the CPU.)

14 Parallel Computing Platform
Multiple Programming Approaches
- Libraries: drop-in acceleration
- OpenACC directives: easily accelerate applications
- Programming languages: maximum flexibility
Development Environment
- Parallel Nsight IDE (Linux, Mac and Windows): GPU debugging and profiling
- CUDA-GDB debugger
- NVIDIA Visual Profiler
- Third-party tools: DDT, TotalView, Vampir,
Compiler
- Open Compiler Tools: enables compiling new languages to the CUDA platform, and CUDA languages to other architectures
- OpenACC compiler
Hardware Capabilities
- SMX
- Dynamic Parallelism
- Hyper-Q
- GPUDirect

15 GPU Accelerated Libraries
Drop-in Acceleration for your Applications
- Linear Algebra (FFT, BLAS, SPARSE, Matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
- Numerical & Math (RAND, Statistics): NVIDIA Math Lib, NVIDIA cuRAND
- Data Struct. & AI (Sort, Scan, Zero Sum): GPU AI Board Games, GPU AI Path Finding
- Visual Processing (Image & Video): NVIDIA NPP, NVIDIA Video Encode

16 Rapid Parallel C++ Development
- Resembles C++ STL
- High-level interface
- Flexible
- Enhances developer productivity
- Enables performance portability between GPUs and multicore CPUs
- CUDA, OpenMP, and TBB backends
- Extensible and customizable
- Integrates with existing software
- Open source

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    // generate 32M random numbers on host
    thrust::host_vector<int> h_vec(32 << 20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to device (GPU)
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on device
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

17 Simple Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 18

18 Simple Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 2. Load GPU program and execute, caching data on chip for performance 19

19 Simple Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 2. Load GPU program and execute, caching data on chip for performance 3. Copy results from GPU memory to CPU memory 20

20 Heterogeneous Computing
Terminology:
- Host: the CPU and its memory (host memory)
- Device: the GPU and its memory (device memory)
A function which runs on a GPU is called a kernel.
Each parallel invocation of a function running on the GPU is called a block.
 - A block can identify itself by reading blockIdx.x
Each block is then broken up into threads.
 - A thread can identify itself by reading threadIdx.x
 - The total number of threads per block can be read with blockDim.x

21 GPUs: C, C++, Fortran, Python Programmable

Standard C Code

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C Code

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        // one thread per element: global index from block and thread IDs
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel w/ 256 threads/blk
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
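
The launch arithmetic of the parallel SAXPY can be checked on the host without a GPU. The following is a minimal sketch in plain Python that runs the blocks and threads one after another; `saxpy_kernel` and `launch` are hypothetical helper names, and the sequential loops only stand in for what the hardware would run in parallel.

```python
def saxpy_serial(n, a, x, y):
    """Reference loop, as in the standard C version."""
    for i in range(n):
        y[i] = a * x[i] + y[i]


def saxpy_kernel(block_idx, block_dim, thread_idx, n, a, x, y):
    """Body executed by one emulated thread."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < n:  # guard: the last block may be only partly used
        y[i] = a * x[i] + y[i]


def launch(n, a, x, y, threads_per_block=256):
    """Emulate a grid launch by visiting every (block, thread) pair in turn."""
    nblocks = (n + threads_per_block - 1) // threads_per_block  # round up
    for block_idx in range(nblocks):
        for thread_idx in range(threads_per_block):
            saxpy_kernel(block_idx, threads_per_block, thread_idx, n, a, x, y)
    return nblocks


n = 1000
x = [float(i) for i in range(n)]
y_serial = [1.0] * n
y_parallel = [1.0] * n
saxpy_serial(n, 2.0, x, y_serial)
nblocks = launch(n, 2.0, x, y_parallel)
print(nblocks, y_serial == y_parallel)
```

With n = 1000 and 256 threads per block, the rounded-up division yields 4 blocks (1024 threads), and the `i < n` guard keeps the 24 surplus threads from writing past the end of the arrays.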

22 NVIDIA Nsight Eclipse Edition
Power of GPU Computing + Productivity of Eclipse
CUDA-Aware Editor
- Automated CPU to GPU code refactoring
- Semantic highlighting of CUDA code
- Integrated code samples & docs
Nsight Debugger
- Simultaneous debugging of CPU and GPU
- Inspect variables across CUDA threads
- Use breakpoints & single-step debugging
Nsight Profiler
- Quickly identifies performance issues
- Integrated expert system
- Automated analysis
- Source line correlation
Available for Linux and Mac OS

23 OpenACC Directives
Easy, Open, Powerful
- Simple compiler hints
- Compiler parallelizes code
- Works on multicore CPUs & many-core GPUs
- Future integration into OpenMP standard planned

Your original Fortran or C code:

    Program myscience
      ... serial code ...
    !$acc region                  ! OpenACC compiler hint
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
    !$acc end region
      ...
    End Program myscience

24 Basic Concepts
(Diagram: CPU memory and GPU memory are connected over the PCI bus. Data is transferred between the two memories, and computation is offloaded from the CPU to the GPU.)
For efficiency, decouple data movement and compute off-load.

25 Jacobi Relaxation

    iter = 0
    ! iterate until converged
    do while ( err .gt. tol .and. iter .lt. iter_max )
      iter = iter + 1
      err = 0.0
      ! iterate across elements of matrix
      do j=1,m
        do i=1,n
          ! calculate new value from neighbours
          Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
          err = max( err, abs(Anew(i,j) - A(i,j)) )
        end do
      end do
      if( mod(iter,100).eq.0 .or. iter.eq.1 ) print*, iter, err
      A = Anew
    end do
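
For readers who want to run the relaxation without a Fortran compiler, here is a minimal serial sketch in plain Python. It assumes the loop is meant to continue while the error exceeds the tolerance and the iteration limit has not been reached, and that the outermost rows and columns are fixed boundary values; the function name `jacobi`, the default `tol` and `iter_max`, and the 8x8 test grid are illustrative choices, not taken from the slides.

```python
def jacobi(a, tol=1e-4, iter_max=1000):
    """Relax the interior of grid `a` (a list of rows); boundary cells stay fixed."""
    m, n = len(a), len(a[0])
    err = tol + 1.0  # force at least one sweep
    it = 0
    while err > tol and it < iter_max:
        it += 1
        err = 0.0
        anew = [row[:] for row in a]  # copy keeps the boundary values
        for j in range(1, m - 1):
            for i in range(1, n - 1):
                # new value is the average of the four neighbours
                anew[j][i] = 0.25 * (a[j][i + 1] + a[j][i - 1]
                                     + a[j - 1][i] + a[j + 1][i])
                err = max(err, abs(anew[j][i] - a[j][i]))
        a = anew
    return a, it, err


# 8x8 grid: top edge held at 1.0, everything else starts at 0.0
grid = [[1.0] * 8] + [[0.0] * 8 for _ in range(7)]
solution, iterations, final_err = jacobi(grid)
print(iterations, final_err)
```

Every interior update reads only values from the previous sweep, which is why the two inner loops can be handed to OpenMP or OpenACC without changing the result.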

26 OpenMP CPU Implementation

    iter = 0
    do while ( err .gt. tol .and. iter .lt. iter_max )
      iter = iter + 1
      err = 0.0
    ! parallelise code inside region
    !$omp parallel do shared(m,n,Anew,A) reduction(max:err)
      do j=1,m
        do i=1,n
          Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
          err = max( err, abs(Anew(i,j) - A(i,j)) )
        end do
      end do
    ! close off region
    !$omp end parallel do
      if( mod(iter,100).eq.0 ) print*, iter, err
      A = Anew
    end do

27 OpenACC GPU Implementation

    ! copy arrays into GPU memory within region
    !$acc data copy(A,Anew)
    iter = 0
    do while ( err .gt. tol .and. iter .lt. iter_max )
      iter = iter + 1
      err = 0.0
    ! parallelise code inside region
    !$acc parallel reduction( max:err )
      do j=1,m
        do i=1,n
          Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
          err = max( err, abs(Anew(i,j) - A(i,j)) )
        end do
      end do
    ! close off parallel region
    !$acc end parallel
      if( mod(iter,100).eq.0 ) print*, iter, err
      A = Anew
    end do
    ! close off data region, copy data back
    !$acc end data

28 Improved OpenACC GPU Implementation

    ! reduced data movement
    !$acc data copyin(A), copyout(Anew)
    iter = 0
    do while ( err .gt. tol .and. iter .lt. iter_max )
      iter = iter + 1
      err = 0.0
    !$acc parallel reduction( max:err )
      do j=1,m
        do i=1,n
          Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + &
                               A(i,j-1) + A(i,j+1) )
          err = max( err, abs(Anew(i,j) - A(i,j)) )
        end do
      end do
    !$acc end parallel
      if( mod(iter,100).eq.0 ) print*, iter, err
      A = Anew
    end do
    !$acc end data

29 More Parallelism

    ! Anew now only exists on GPU
    !$acc data copyin(A), create(Anew)
    iter = 0
    do while ( err .gt. tol .and. iter .lt. iter_max )
      iter = iter + 1
      err = 0.0
    ! find maximum over all iterations
    !$acc parallel reduction( max:err )
      do j=1,m
        do i=1,n
          Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + &
                               A(i,j-1) + A(i,j+1) )
          err = max( err, abs(Anew(i,j) - A(i,j)) )
        end do
      end do
    !$acc end parallel
      if( mod(iter,100).eq.0 ) print*, iter, err
    ! add second parallel region inside data region
    !$acc parallel
      A = Anew
    !$acc end parallel
    end do
    !$acc end data
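
The data clauses used across the OpenACC versions differ only in which direction each array is copied at the boundaries of the data region. The following minimal Python sketch counts the whole-array transfers they imply, assuming the usual OpenACC meaning of the clauses (`copy` moves data in at region entry and out at region exit, `copyin` only in, `copyout` only out, `create` neither). The table `CLAUSE_TRANSFERS` and the function `transfers` are hypothetical names for illustration, and the per-iteration traffic for the scalar `err` is ignored.

```python
# Transfers implied by each data clause:
# (host -> device at region entry, device -> host at region exit)
CLAUSE_TRANSFERS = {
    "copy": (1, 1),
    "copyin": (1, 0),
    "copyout": (0, 1),
    "create": (0, 0),
}


def transfers(data_region):
    """Count whole-array transfers for a data region given as {array: clause}."""
    to_device = sum(CLAUSE_TRANSFERS[c][0] for c in data_region.values())
    to_host = sum(CLAUSE_TRANSFERS[c][1] for c in data_region.values())
    return to_device + to_host


first = transfers({"A": "copy", "Anew": "copy"})
improved = transfers({"A": "copyin", "Anew": "copyout"})
gpu_only = transfers({"A": "copyin", "Anew": "create"})
print(first, improved, gpu_only)
```

Under these assumptions the three variants need four, two and one array transfers respectively, independent of the number of iterations, because the data region encloses the whole loop.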

30 More Performance

    !$acc data copyin(A), create(Anew)
    iter = 0
    do while ( err .gt. tol .and. iter .lt. iter_max )
      iter = iter + 1
      err = 0.0
    ! 30% faster than default schedule
    !$acc kernels loop reduction( max:err ), gang(32), worker(8)
      do j=1,m
        do i=1,n
          Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + &
                               A(i,j-1) + A(i,j+1) )
          err = max( err, abs(Anew(i,j) - A(i,j)) )
        end do
      end do
    !$acc end kernels loop
      if( mod(iter,100).eq.0 ) print*, iter, err
    !$acc parallel
      A = Anew
    !$acc end parallel
    end do
    !$acc end data

31 Additions for OpenACC 2.0 Procedure calls Separate compilation Nested parallelism Device-specific tuning, multiple devices Data management features and global data Multiple host thread support Loop directive additions Asynchronous behavior additions New API routines for target platforms (CUDA, OpenCL, Intel Coprocessor Offload Infrastructure) See 35

32 Get Started with GPU Programming Watch Explore Get CUDA Access Tools Learn with Tutorials Join the Community bit.ly/gpugetstarted developer.nvidia.com/get-started-parallel-computing 36

33 Develop on GeForce, Deploy on Tesla GeForce GTX Titan Tesla K20X/K20 Designed for Gamers & Developers 1+ Teraflop Double Precision Performance Dynamic Parallelism Hyper-Q for CUDA Streams Available Everywhere! Designed for Cluster Deployment ECC 24x7 Runtime GPU Monitoring Cluster Management GPUDirect-RDMA Hyper-Q for MPI 3 Year Warranty Integrated OEM Systems, Professional Support 37

34 Kepler GPU Architecture 38

35 Focus on Power Efficiency
(Roadmap chart, y-axis: DP GFLOPS per Watt. Generations shown: Tesla (CUDA), Fermi (FP), Kepler (Dynamic Parallelism), Maxwell (Unified Virtual Memory), Volta.)

36 Kepler GK110 Block Diagram 7.1B Transistors 15 SMX units 1.3 TFLOP FP MB L2 Cache 384-bit GDDR5 PCI Express Gen3 compliant 40

37 Kepler GK110 SMX vs Fermi SM 3x sustained perf/w Ground up redesign for perf/w 6x the SP FP units 4x the DP FP units Significantly slower FU clocks Processors are getting wider, not faster 41

38 Kepler Features Make GPU Coding Easier
- Hyper-Q: speedup for legacy MPI apps
  - Fermi: 1 work queue
  - Kepler: 32 concurrent work queues
- Dynamic Parallelism: less back-and-forth, simpler code
  - (Diagram compares CPU + Fermi GPU with CPU + Kepler GPU)

39 GPU Coding Made Easier & More Efficient
- Hyper-Q: 32 MPI jobs per GPU. Easy speed-up for legacy MPI apps.
  - (Chart: CP2K, quantum chemistry. Axis label "Speedup vs. Dual", plotted against number of GPUs; series: K20 with Hyper-Q, K20 without Hyper-Q.)
- Dynamic Parallelism: GPU generates its own work. Less effort, higher performance.
  - (Chart: Quicksort. Relative sorting performance against increasing problem size, # of elements in millions; series: without Dynamic Parallelism, with Dynamic Parallelism.)

40 Kepler Enables Full NVIDIA GPUDirect RDMA
(Diagram: Server 1 and Server 2 each contain a CPU with system memory and two GPUs with GDDR5 memory, attached via PCI-e to a network card; the two network cards are connected over the network.)

41 MVAPICH2 Performance with GPUDirect RDMA Bi-Directional Bandwidth Latency Slides courtesy of DK Panda 49

42 CUDA Compiler Contributed to Open Source LLVM
- Developers want to build front-ends for Java, Python, R, DSLs
- Target other processors like ARM, FPGA, GPUs, x86
(Diagram: CUDA C, C++, Fortran and new language support feed the LLVM compiler for CUDA, which targets NVIDIA GPUs, x86 CPUs and new processor support.)

43 MATLAB Parallel Computing
- Most popular math functions on GPUs: random number generation, FFT, matrix multiplications, solvers, convolutions, min/max, SVD, Cholesky and LU factorization
- Use GPU with MATLAB built-in functions (gpuArray, gather)
- Execute MATLAB functions elementwise on the GPU (arrayfun)
- Create kernels from existing CUDA code and PTX files
- MATLAB Compiler support (GPU acceleration without MATLAB installed)

44 Enabling More Programming Languages
CUDA Python: CUDA programming in Python

    import numpy as np
    # cuda, f8, uint8 and uint32 are assumed to come from the CUDA Python
    # package used on the slide; its import line is not part of the source.

    @cuda.jit(argtypes=[f8, f8, uint32], device=True)
    def mandel(x, y, max_iters):
        zr, zi = 0.0, 0.0
        for i in range(max_iters):
            newzr = (zr*zr-zi*zi)+x
            zi = 2*zr*zi+y
            zr = newzr
            if (zr*zr+zi*zi) >= 4:
                return i
        return 255

    @cuda.jit(argtypes=[uint8[:,:], f8, f8, f8, f8, uint32])
    def mandel_kernel(img, min_x, max_x, min_y, max_y, iters):
        x, y = cuda.grid(2)   # one thread per pixel
        if x < img.shape[0] and y < img.shape[1]:
            img[y, x] = mandel(min_x+x*((max_x-min_x)/img.shape[0]),
                               min_y+y*((max_y-min_y)/img.shape[1]), iters)

    gimage = np.zeros((1024, 1024), dtype = np.uint8)
    d_image = cuda.to_device(gimage)
    mandel_kernel[(32,32), (32,32)](d_image, -2.0, 1.0, -1.0, 1.0, 20)
    d_image.to_host()
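
A CPU-only reference is useful for validating the output of a GPU Mandelbrot kernel like this one. The sketch below is plain Python with no GPU dependency: `mandel` follows the escape-time loop from the slide, while `mandel_image` is a hypothetical helper that replaces the thread grid with two nested loops. The 64x64 image size is an arbitrary choice to keep the run short.

```python
def mandel(x, y, max_iters):
    """Escape-time count for the point x + iy, following the slide's device function."""
    zr, zi = 0.0, 0.0
    for i in range(max_iters):
        newzr = (zr * zr - zi * zi) + x
        zi = 2 * zr * zi + y
        zr = newzr
        if (zr * zr + zi * zi) >= 4:
            return i
    return 255


def mandel_image(width, height, min_x, max_x, min_y, max_y, iters):
    """Visit every pixel in turn; on the GPU each (x, y) pair would be one thread."""
    pixel_w = (max_x - min_x) / width
    pixel_h = (max_y - min_y) / height
    return [[mandel(min_x + x * pixel_w, min_y + y * pixel_h, iters)
             for x in range(width)]
            for y in range(height)]


image = mandel_image(64, 64, -2.0, 1.0, -1.0, 1.0, 20)
print(image[32][42], image[0][0])
```

Points that never escape within `max_iters` keep the value 255, so the interior of the set shows up as the brightest region of the 8-bit image.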

45 GPU Applications 58

46 Wide Adoption of Tesla GPUs
- Oil and gas: Reverse Time Migration, Kirchhoff Time Migration, Reservoir Sim
- Edu/Research: Astrophysics, Lattice QCD, Molecular Dynamics, Weather / Climate Modeling
- Government: Signal Processing, Satellite Imaging, Video Analytics, Synthetic Aperture Radar
- Life Sciences: Bio-chemistry, Bio-informatics, Material Science, Sequence Analysis, Genomics
- Finance: Risk Analytics, Monte Carlo Options Pricing, Insurance Modeling
- Manufacturing: Structural Mechanics, Computational Fluid Dynamics, Machine Vision, Electromagnetics

47 Recent Scientific Breakthroughs using GPUs
- Breakthrough in HIV research: discover the chemical structure of HIV's capsid to build more effective drugs. Run at NCSA Blue Waters (3000 GPUs).
- Fastest simulation for silicon for solar cells: more efficient & cost-effective solar cells. 1.87 Petaflop/sec performance on 7168 GPUs on Tianhe-1A.
- Gordon Bell Prize, stronger, lighter metals: lighter, stronger metals for more fuel-efficient cars. 4224 GPUs at Tokyo Tech, Japan.

48 GPU-Accelerated Applications 61

49 62

50 Overview of Life & Material Accelerated Apps
MD: all key codes are available
- AMBER, CHARMM, DESMOND, DL_POLY, GROMACS, LAMMPS, NAMD
- GPU-only codes: ACEMD, HOOMD-Blue
- Great multi-GPU performance
- Focus: scaling to large numbers of GPUs / nodes
QC: all key codes are ported/optimizing
- Active GPU acceleration projects: Abinit, BigDFT, CP2K, GAMESS, Gaussian, GPAW, NWChem, Quantum Espresso, VASP & more
- GPU-only code: TeraChem
Analytical instruments: actively recruiting
Bioinformatics: market development

51 Integration of Compute and Visualisation
- GPU Operation Mode All_On enables graphics capabilities for K20/K20X server GPUs
  - nvidia-smi --gom=0
- NVIDIA IndeX: scalable big data visualization
- Remote visualization tools like ParaView

52 GPUs for control systems GPUs will be used in many experiments for controlling Examples: Triggering and tracking for CERN experiments Signal processing for Lofar or Square Kilometre Array (SKA) 66

53 GPUs and Big Data GPUs Today Computational acceleration for Big Data Visualization Accelerating the Cloud + Mobile transformation GPUs Tomorrow Converged architecture for Big Data and Compute 67

54 The Future 68

55 NVIDIA Research
NVIDIA is doing exascale research and development in
- processor architecture
- circuits
- high-speed signaling
- programming models / algorithms
Fast Forward, Echelon
Concept is to use thousands of efficient, throughput-optimized cores to perform the bulk of the work, with a handful of latency-optimized cores to perform the serial computation.
Goal: all codes with high parallelism should map well onto future hybrid processors.

56 Which Takes More Energy?
- Performing a 64-bit floating-point FMA (the slide shows a worked numeric example beginning 893,500.288914668; its remaining digits are garbled in this transcript), or
- moving the three 64-bit operands 20 mm across the die?
Moving the operands takes over 4.7x the energy today (40 nm)! It's getting worse: in 10 nm, the relative cost will be 17x! Loading the data from off chip takes >> 100x the energy.

57 Communication Takes More Energy Than Arithmetic
(Figure: energy cost per operation. Labels in source order: 64-bit DP, 20 pJ; 20 mm; 26 pJ; 256 pJ; 16 nJ, DRAM Rd/Wr; 256-bit buses; 256-bit access, 8 kB SRAM; 50 pJ; 500 pJ; efficient off-chip link; 1 nJ.)

58 What is important for the future?
- It's not about the FLOPS
- It's about data movement
- Algorithms should be designed to perform more work per unit of data movement
- Programming systems should further optimize this data movement
- Architectures should facilitate this by providing an exposed hierarchy and efficient communication
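
"Work per unit data movement" can be made concrete by comparing an algorithm's arithmetic intensity with the ratio of peak compute to memory bandwidth of the device. The sketch below is a back-of-the-envelope calculation in Python under stated assumptions: it uses the K20X single precision peak (3.95 TF) and memory bandwidth (250 GB/s) from the Tesla Kepler Family slide, assumes 4-byte floats, and counts SAXPY as 2 flops and 3 memory accesses per element. The function names are illustrative only.

```python
def machine_balance(peak_flops, bandwidth_bytes):
    """Flops the device can perform per byte it can move to or from memory."""
    return peak_flops / bandwidth_bytes


def saxpy_intensity(bytes_per_float=4):
    """Flops per byte for SAXPY: 2 flops per element (multiply + add)
    and 3 floats moved per element (read x, read y, write y)."""
    return 2.0 / (3 * bytes_per_float)


# K20X: 3.95 TF single precision peak, 250 GB/s memory bandwidth (ECC off)
balance = machine_balance(3.95e12, 250e9)
intensity = saxpy_intensity()
print(round(balance, 1), round(intensity, 3), round(balance / intensity, 1))
```

Under these assumptions the device could perform roughly 16 flops for every byte it moves, while SAXPY offers only one sixth of a flop per byte, so the kernel is limited by memory bandwidth rather than by arithmetic. Algorithms that reuse data already on chip close that gap.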

59 Summary
- NVIDIA provides a powerful development platform for parallel computing
  - Compilers, libraries, integrated development environments (IDEs), profiler, debugger, Open Compiler SDK, 3rd-party tools
- Power is the main HPC constraint
  - The vast majority of work must be done by cores designed for efficiency
  - Data movement dominates the power
- GPU computing has a sustainable model
  - Aligned with technology trends, supported by consumer markets
- Start now to parallelize your code and to implement it on the available GPU hardware
  - Finding parallel algorithms is the most difficult and time-consuming part
  - The implementation is the easy part

GPU Computing. Axel Koehler Sr. Solution Architect HPC
