NVIDIA : FLOP WARS, ÉPISODE III François Courteille Ecole Polytechnique 4-June-13


1 NVIDIA: FLOP WARS, ÉPISODE III
François Courteille (fcourteille@nvidia.com), Ecole Polytechnique, 4-June-13

2 OUTLINE
- NVIDIA and GPU Computing Roadmap
- Inside Kepler Architecture: SMX, Hyper-Q, Dynamic Parallelism
- Computing and Visualizing: OpenGL support
- Programming GPUs: The Software Ecosystem (OpenACC, Libraries, Languages and Frameworks)
- Application porting examples: MiniFE & Enzo

3 NVIDIA - Core Technologies and Brands
- GPU: GeForce, Quadro, Tesla
- Mobile: Tegra
- Cloud: VGX, GeForce GRID
- Computer Graphics: founded 1993, invented the GPU in 1999


5 The March of GPUs
[Chart: Peak Double Precision FP in Gflops/s (axis up to 1400), NVIDIA GPU vs. x86 CPU. GPUs: M1060, Fermi M2070, Fermi+, Kepler. CPUs: Nehalem 3 GHz, Westmere 3 GHz, Sandy Bridge 3 GHz]
[Chart: Peak Memory Bandwidth in GBytes/s (axis up to 250), NVIDIA GPU (ECC off) vs. x86 CPU, same processors]


6 Tesla CUDA Architecture Roadmap
[Chart: DP GFLOPS per Watt by GPU generation]
- Tesla: CUDA
- Fermi: FP64
- Kepler: Dynamic Parallelism
- Maxwell: Unified Virtual Memory
- Volta: Stacked DRAM

7 NVIDIA Tesla GPUs for HPC

8 NVIDIA Tesla Series Products: Data Center, Workstation

9 Kepler GPU: Fastest, Most Efficient HPC Architecture Ever
- SMX: 3x Performance per Watt
- Hyper-Q: Easy Speed-up for Legacy MPI Apps
- Dynamic Parallelism: Parallel Programming Made Easier than Ever


10
- Supercomputing: Weather / Climate Modeling, Molecular Dynamics, Computational Physics
- Manufacturing: Structural Mechanics, Comp Fluid Dynamics (CFD), Electromagnetics
- Life Sciences: Biochemistry, Bioinformatics, Material Science
- Defense / Govt: Signal Processing, Image Processing, Video Analytics
- Oil and Gas: Reverse Time Migration, Kirchhoff Time Migration
- Timeline (Q2, Q3, Q4): Tesla M2090 and Tesla M2075 (Fermi), Tesla K10 (Kepler GK104), Tesla K20 (Kepler GK110)


11 Tesla K10: Same Power, 2x Performance of Fermi
Columns: M2090 | K10 (board) | K10 (per GPU)
- GPU architecture: Fermi | Kepler GK104
- # of GPUs: 1 | 2
- Single precision flops: 1.3 TF | 4.58 TF | 2.29 TF
- Double precision flops: 0.66 TF | [missing] | [missing]
- # CUDA cores: [missing]
- Memory size: 6 GB | 8 GB | 4 GB
- Memory BW (ECC off): [missing] | 320 GB/s | 160 GB/s
- PCI-Express: Gen 2, 8 GB/s | Gen 3, 16 GB/s
- Board power: 225 watts | 225 watts


12 Tesla K10 vs M2090: 2x Performance / Watt
[Chart: K10 performance relative to M2090 for Seismic Processing, LAMMPS, NAMD, AMBER*, Radio Astronomy Cross-Correlator, Nbody, Defense (Integer Ops)]
* 2 instances of AMBER running JAC


13 Tesla K20 Family: 3x Faster Than Fermi
Columns: Tesla K20X | Tesla K20
- # CUDA cores: [missing]
- Peak double precision (peak DGEMM): 1.32 TF (1.22 TF) | 1.17 TF (1.10 TF)
- Peak single precision (peak SGEMM): 3.95 TF (2.90 TF) | 3.52 TF (2.61 TF)
- Memory bandwidth: 250 GB/s | 208 GB/s
- Memory size: 6 GB | 5 GB
- Total board power: 235W | 225W
[Chart: Double Precision FLOPS (DGEMM) in TFLOPS for Xeon E5, Tesla M2090 and Tesla K20X; the K20X bar is labeled 1.22 TFLOPS]

14 Tesla K20X: Faster, Efficient
[Charts for Tesla K20X: Double Precision (DGEMM) in TFlops, 94% efficiency; Single Precision (SGEMM) in TFlops; Memory Bandwidth (STREAM Triad) in GB/s, 70% efficiency]
Source: Intel
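To make the STREAM Triad bandwidth figure more concrete, here is a minimal C sketch of the triad kernel and of the arithmetic that turns a timing into GB/s. The function names are illustrative and are not taken from the STREAM benchmark source; the loop is a plain CPU version, not the GPU code that was measured for the slide.

```c
#include <stddef.h>

/* STREAM Triad kernel: a[i] = b[i] + scalar * c[i]. */
void stream_triad(double *a, const double *b, const double *c,
                  double scalar, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];
}

/* Bytes moved by one triad pass: two loads and one store per element. */
double triad_bytes(size_t n)
{
    return 3.0 * (double)n * (double)sizeof(double);
}

/* Sustained bandwidth in GB/s (10^9 bytes per second) for one pass
 * over n elements that took `seconds`. */
double triad_gbps(size_t n, double seconds)
{
    return triad_bytes(n) / seconds / 1e9;
}
```

Dividing the sustained figure by the quoted peak bandwidth gives the kind of efficiency percentage shown on the chart.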

15 Up to 10x on Leading Applications
[Chart: Performance Across Science Domains, speedup vs. dual-socket CPUs (axis 0.0x to 20.0x), for WL-LSMS (Material Science), Chroma (Physics), SPECFEM3D (Earth Sciences), AMBER (Molecular Dynamics). Series: 1xCPU + 1xM2090 and 1xCPU + 1xK20X]
CPU: E5-2687w 3.10 GHz Sandy Bridge

16 Titan: World's Fastest Supercomputer
- 18,688 Tesla K20X GPUs
- 27 Petaflops Peak: 90% of Performance from GPUs
- Sustained Performance on Linpack: [missing] Petaflops
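The "90% of performance from GPUs" figure can be sanity-checked with a little arithmetic. The sketch below assumes the 1.31 TFLOPS double-precision peak per Tesla K20X that is quoted later in this deck; the helper names are hypothetical and exist only for this illustration.

```c
/* Aggregate peak of `count` accelerators in petaflops, given the
 * per-device peak in teraflops. */
double aggregate_pflops(long count, double tflops_each)
{
    return (double)count * tflops_each / 1000.0;
}

/* Share of a system's total peak contributed by its accelerators. */
double accelerator_share(long count, double tflops_each, double system_pflops)
{
    return aggregate_pflops(count, tflops_each) / system_pflops;
}
```

Calling `accelerator_share(18688, 1.31, 27.0)` gives roughly 24.5 petaflops from the GPUs out of 27, a share of about 0.91, which is consistent with the slide's rounded 90%.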


17 World's Most Energy Efficient Supercomputer
- CINECA Eurora: liquid-cooled Eurotech Aurora Tigon with 128 Tesla K20 Accelerators
- 3150 MFLOPS/Watt: greener than Xeon Phi, Xeon CPU
- $100k Energy Savings / Yr
- [missing] Tons of CO2 Saved / Yr
[Chart: MFLOPS/Watt for CINECA Eurora (Tesla K20), NICS Beacon (greenest Xeon Phi system), C-DAC (greenest CPU system)]

18 GPU Test Drive Double your Fermi Performance with Kepler GPUs

19 Tesla K20/K20X Details

20 Kepler GK110 Block Diagram
Architecture
- 7.1B Transistors
- 15 SMX units
- > 1 TFLOP FP64
- 1.5 MB L2 Cache
- 384-bit GDDR5
- PCI Express Gen2/Gen3

21 Kepler GK110 SMX vs Fermi SM
- 3x sustained perf/W
- Ground-up redesign for perf/W
- 6x the SP FP units
- 4x the DP FP units
- Significantly slower FU clocks
- Processors are getting wider, not faster

22 Hyper-Q

23 Hyper-Q Improves Concurrency
Example: Stream 1 (A -- B -- C), Stream 2 (P -- Q -- R), Stream 3 (X -- Y -- Z)
Fermi: single Hardware Work Queue
- Fermi allows 16-way concurrency
- Up to 16 grids can run at once
- But CUDA streams multiplex into a single queue
- Overlap only at stream edges
Kepler: Multiple Hardware Work Queues
- Kepler allows 32-way concurrency
- One work queue per stream
- Concurrency at full-stream level
- No inter-stream dependencies
- Streams are separate: [ABC], [PQR] and [XYZ] run concurrently, in any launch ordering

24 Hyper-Q: Max GPU Utilization, Slashes CPU Idle Time
[Charts: GPU Utilization % over Time]


25 Better Utilization with Hyper-Q
- FERMI: 1 Work Queue
- KEPLER: 32 Concurrent Work Queues
- Grid Management Unit selects most appropriate task from up to 32 hardware queues (CUDA streams)
- Improves scheduling of concurrently executed grids
- Particularly interesting for MPI applications when combined with Multi Process Server, but not limited to MPI applications

26 Hyper-Q with Multiple MPI Ranks with CP2K
- Hyper-Q with multiple MPI ranks leads to a 2.5X speedup over a single MPI rank using the GPU
- Blog post by Peter Messmer of NVIDIA

27 Dynamic Parallelism: Simpler Code, More General, Higher Performance
[Diagram: CPU and Kepler GPU; meshes that are too coarse, too fine, just right]
- Better load balancing for dynamic workloads when work-per-block is data-dependent (e.g. Adaptive Mesh CFD)
- Launch new kernels from the GPU
  - Dynamically: based on run-time data
  - Simultaneously: from multiple threads at once
  - Independently: each thread can launch a different grid


28 Dynamic Parallelism: Unified Runtime Interface
CPU code (main launches A, B, C):

    int main() {
        float *data;
        setup(data);

        A <<< ... >>> (data);
        B <<< ... >>> (data);
        C <<< ... >>> (data);

        cudaDeviceSynchronize();
        return 0;
    }

GPU code (B launches X, Y, Z):

    __global__ void B(float *data)
    {
        do_stuff(data);

        X <<< ... >>> (data);
        Y <<< ... >>> (data);
        Z <<< ... >>> (data);
        cudaDeviceSynchronize();

        do_more_stuff(data);
    }


29 Dynamic Parallelism: Better Aggregation of Small Tasks
Batched LU-Decomposition with Kepler
- Stellar Simulation: Supernova radial sections, 100s to 1000s of matrices per section
- [Diagram: a GPU Control Grid whose threads launch dswap(), dscal(), dtrsm() and dgemm() grids]
- Each GPU thread in grid controls one matrix (e.g. LU-Decomp)
- Each thread launches new GPU grids for BLAS operations
- No need to recode entire BLAS library to support batching

30 Dynamic Parallelism: Better Programming Model - Simpler Code

LU decomposition (Fermi)

  CPU Code:
    dgetrf(n, N) {
      for j=1 to N
        for i=1 to 64
          idamax<<<>>>
          memcpy
          dswap<<<>>>
          memcpy
          dscal<<<>>>
          dger<<<>>>
        next i
        memcpy
        dlaswap<<<>>>
        dtrsm<<<>>>
        dgemm<<<>>>
      next j
    }

  GPU Code:
    idamax(); dswap(); dscal(); dger(); dlaswap(); dtrsm(); dgemm();

LU decomposition (Kepler): CPU is Free

  CPU Code:
    dgetrf(n, N) {
      dgetrf<<<>>>
      synchronize();
    }

  GPU Code:
    dgetrf(n, N) {
      for j=1 to N
        for i=1 to 64
          idamax<<<>>>
          dswap<<<>>>
          dscal<<<>>>
          dger<<<>>>
        next i
        dlaswap<<<>>>
        dtrsm<<<>>>
        dgemm<<<>>>
      next j
    }
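For readers unfamiliar with the BLAS names in the pseudocode, the following is a minimal serial C sketch of what the inner sequence idamax, dswap, dscal, dger computes: an unblocked LU factorization with partial pivoting. It is an illustration only. It does not use CUDA or a BLAS library, it omits the blocked steps (dlaswap, dtrsm, dgemm), and the function name `lu_factor` and the row-major layout are assumptions made for this example.

```c
#include <stddef.h>

static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* In-place LU factorization with partial pivoting of an n x n row-major
 * matrix. On return, `a` holds U on and above the diagonal and the
 * multipliers of the unit-lower-triangular L below it; piv[k] is the row
 * that was swapped with row k at step k.
 * Returns 0 on success, -1 if a zero pivot is encountered. */
int lu_factor(double *a, size_t n, size_t *piv)
{
    for (size_t k = 0; k < n; ++k) {
        /* idamax: find the row with the largest |a[i][k]| for i >= k */
        size_t p = k;
        for (size_t i = k + 1; i < n; ++i)
            if (abs_d(a[i * n + k]) > abs_d(a[p * n + k]))
                p = i;
        piv[k] = p;
        if (a[p * n + k] == 0.0)
            return -1;

        /* dswap: exchange rows k and p */
        if (p != k)
            for (size_t j = 0; j < n; ++j) {
                double t = a[k * n + j];
                a[k * n + j] = a[p * n + j];
                a[p * n + j] = t;
            }

        /* dscal: divide the column below the pivot by the pivot */
        for (size_t i = k + 1; i < n; ++i)
            a[i * n + k] /= a[k * n + k];

        /* dger: rank-1 update of the trailing submatrix */
        for (size_t i = k + 1; i < n; ++i)
            for (size_t j = k + 1; j < n; ++j)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
    }
    return 0;
}
```

On Fermi each of these four steps is a separate kernel launch driven from the CPU, with memcpy calls in between; the Kepler version on the slide moves the whole loop nest onto the GPU.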

31 CUDA Dynamic Parallelism and Programmer Productivity

32 GPU Management: nvidia-smi
- Multi-GPU systems are widely available
- Different systems are set up differently
- Want to get quick information on
  - Approximate GPU utilization
  - Approximate memory footprint
  - Number of GPUs
  - ECC state
  - Driver version
- Inspect and modify GPU state

Sample output:

    Thu Nov 1 09:10:
    NVIDIA-SMI   Driver Version:
    GPU  Name        Bus-Id        Disp.   Volatile Uncorr. ECC
    Fan  Temp  Perf  Pwr:Usage/Cap   Memory-Usage        GPU-Util  Compute M.
    0    Tesla K20X  0000:03:00.0  Off     Off
    N/A  30C   P8    28W / 235W      0%  12MB / 6143MB   0%        Default
         Tesla K20X  0000:85:00.0  Off     Off
    N/A  28C   P8    26W / 235W      0%  12MB / 6143MB   0%        Default
    Compute processes:                                   GPU Memory
    GPU  PID  Process name                               Usage
    No running compute processes found

33 OpenGL and Tesla
- Tesla K20/K20X for high performance Compute
- Tesla K20/K20X for Graphics and Compute
- Use interop to mix OpenGL and Compute

34 NVIDIA IndeX
- Cluster-based graphics infrastructure
- Real-time manipulation of huge datasets
- Combine volume and surface rendering
- Project size scales with cluster size
- Interactive collaboration with global teams


35 HPC + Viz
- HPC: long running application w/ data
- Viz: readback of Viz frames of HPC results
- New apps: encoding, raytracing (iray, OptiX), RealityServer (CUDA)
- Desktop workstation running an ISV app, remoted / backracked to a server or rack / blade workstation
- Remoting: Citrix HDX, VMware, MS RemoteFX, NICE DCV, HP RGS, Dell Teradici
- GPUs: Tesla, NVIDIA GRID (passive thermal), MAXIMUS-QUADRO (active thermals)

36 NVIDIA Non-Tesla GPUs for HPC

37 Introducing GeForce GTX TITAN: The Ultimate CUDA Development GPU
Personal Supercomputer on Your Desktop
- 2688 CUDA Cores
- 4.5 Teraflops Single Precision
- 1.27 Teraflops Double Precision
- 288 GB/s Memory Bandwidth

38 Tesla Advantage
Performance
- Fastest DP of 1.31 TFLOPS on Tesla K20X
- Optimized for InfiniBand with NVIDIA GPUDirect
- Faster Shuffle instructions
- Tuning and Optimization Support from NVIDIA Experts
Reliability
- ECC protection
- Tested to run real-world workloads 24/7 at 100% utilization
- 3 year warranty and prioritized support for bugs/feature requests
- ISVs certify only on Tesla
- NVIDIA technical support
- Longer life cycle for continuity and cluster expansion
Built for HPC
- Integrated solutions from Tier 1 OEMs
- Hyper-Q for accelerating MPI based workloads
- Tools for GPU Management and Monitoring (Nvhealthmon, nvsmi/nvml)
- Enterprise OS support
- Solution expertise provided by CUDA engineers and technical staff
- Peta-scale designed, tested and optimized

39 Accelerated Computing: 10x Performance, 5x Energy Efficiency
- CPU: Optimized for Serial Tasks
- GPU Accelerator: Optimized for Many Parallel Tasks


40 GPU-Accelerated Apps Grow 60%
[Chart: # of Apps for 2010, 2011 and 2012, with a 40% increase followed by a 61% increase]
Top Supercomputing Apps (legend: Accelerated, In Development)
- Computational Chemistry: AMBER, CHARMM, GROMACS, LAMMPS, NAMD, DL_POLY
- Material Science: QMCPACK, Quantum Espresso, GAMESS, Gaussian, NWChem, VASP
- Climate & Weather: COSMO, GEOS-5, CAM-SE, NIM, WRF
- Physics: Chroma, Denovo, GTC, GTS, ENZO, MILC
- CAE: ANSYS Mechanical, MSC Nastran, SIMULIA Abaqus, ANSYS Fluent, OpenFOAM, LS-DYNA

41 200+ GPU-Accelerated Applications


43 Small Changes, Big Speed-up
- Application Code
- GPU: Compute-Intensive Functions (Use GPU to Parallelize)
- CPU: Rest of Sequential CPU Code


45 3 Ways to Accelerate Applications
- Libraries: Easiest Approach
- Directives (OpenACC)
- Programming Languages (CUDA, ..): Maximum Performance
- High Level Languages (Matlab, ..): No Need for Programming Expertise
CUDA Libraries are interoperable with OpenACC
CUDA Language is interoperable with OpenACC

46 OpenACC Directives
Your original Fortran or C code, with an OpenACC compiler hint:

    Program myscience
       ... serial code ...
    !$acc region
       do k = 1,n1
          do i = 1,n2
             ... parallel code ...
          enddo
       enddo
    !$acc end region
       ...
    End Program myscience

[Diagram: the serial code runs on the CPU, the marked region runs on the GPU]
Easy, Open, Powerful
- Simple Compiler hints
- Compiler Parallelizes code
- Works on multicore CPUs & many core GPUs
- Future Integration into OpenMP standard planned

47 Familiar to OpenMP Programmers
OpenMP (CPU):

    main() {
      double pi = 0.0; long i;

      #pragma omp parallel for reduction(+:pi)
      for (i=0; i<n; i++) {
        double t = (double)((i+0.5)/n);
        pi += 4.0/(1.0+t*t);
      }
      printf("pi = %f\n", pi/n);
    }

OpenACC (CPU + GPU):

    main() {
      double pi = 0.0; long i;

      #pragma acc kernels
      for (i=0; i<n; i++) {
        double t = (double)((i+0.5)/n);
        pi += 4.0/(1.0+t*t);
      }
      printf("pi = %f\n", pi/n);
    }

48 OpenACC: Easy and Portable
Serial Code: SAXPY

    do i = 1, 2560
       do j = 1,
          fa(i) = a * fa(i) + fb(i)
       end do
    end do

OpenACC: Runs on GPUs and Xeon Phi

    !$acc parallel loop
    do i = 1, 2560
    !dir$ unroll 1000
       do j = 1,
          fa(i) = a * fa(i) + fb(i)
       end do
    end do

Use 2 levels of hardware parallelism: thread blocks (Thread Block 0 to Thread Block N) and the threads within each block, indexed by threadid:

    float x = input[threadid];
    float y = func(x);
    output[threadid]
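As a C counterpart to the Fortran loop above, here is a minimal SAXPY sketch with an OpenACC directive. It uses the conventional SAXPY form y = a*x + y, which is an assumption (the slide's loop updates fa in place instead), and the data clauses are one plausible choice rather than the author's. A compiler without OpenACC support simply ignores the pragma and runs the loop serially.

```c
/* SAXPY: y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy(int n, float a, const float *x, float *y)
{
    /* Copy x to the device, copy y in and back out, and spread the
     * loop iterations across the accelerator's threads. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The same source file can therefore be built for a multicore CPU or for an accelerator, which is the portability point the slide makes.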

49 Additions for OpenACC 2.0
- Procedure calls
- Separate compilation
- Nested parallelism
- Device-specific tuning, multiple devices
- Data management features and global data
- Multiple host thread support
- Loop directive additions
- Asynchronous behavior additions
- New API routines for target platforms (CUDA, OpenCL, Intel Coprocessor Offload Infrastructure)


50 Applying OpenACC to Legacy Codes (from GTC 2013)
Exploit GPU with LESS effort; maintain ONE legacy source
Example: REAL-WORLD application tuning using directives (comparing CPU+GPU vs. multi-core)
- ELAN Computational Electro-Magnetics
  - Goals: optimize w/ less effort, preserve code base
  - Kernels 6.5X to 13X faster than 16-core Xeon
  - Overall speedup 3.2X
- COSMO Weather
  - Goal: preserve physics code (22% of runtime), augmenting dynamics kernels already in CUDA
  - Physics speedup 4.2X vs. multi-core Xeon
Results from EMGS, MeteoSwiss/CSCS


51 Small Effort. Real Impact.
- Large Oil Company: 3x in 7 days. Solving billions of equations iteratively for oil production at world's largest petroleum reservoirs
- Univ. of Houston, Prof. M.A. Kayali: 20x in 2 days. Studying magnetic systems for innovations in magnetic storage media and memory, field sensors, and [truncated]
- Uni. of Melbourne, Prof. Kerry Black: 65x in 2 days. Better understanding the complex reasons behind the lifecycles of snapper fish in Port Phillip Bay
- Ufa State Aviation, Prof. Arthur Yuldashev: 7x in 4 weeks. Generating stochastic geological models of oilfield reservoirs with borehole data
- GAMESS-UK, Dr. Wilkinson, Prof. Naidoo: 10x. Used for various fields such as investigating biofuel production and molecular sensors

52 Example: Jacobi Iteration
- Iteratively converges to the correct value (e.g. temperature) by computing new values at each point from the average of neighboring points
- Common, useful algorithm
- Example: solve the Laplace equation in 2D: $\nabla^2 f(x, y) = 0$
[Stencil: A(i,j) and its four neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)]
$$A_{k+1}(i,j) = \frac{A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1)}{4}$$

53 Jacobi Iteration Fortran Code

    do while ( err > tol .and. iter < iter_max )   ! Iterate until converged
      err = 0._fp_kind

      do j=1,m                                     ! Iterate across matrix elements
        do i=1,n
          ! Calculate new value from neighbors
          Anew(i,j) = .25_fp_kind * (A(i+1, j  ) + A(i-1, j  ) + &
                                     A(i  , j-1) + A(i  , j+1))
          ! Compute max error for convergence
          err = max(err, abs(Anew(i,j) - A(i,j)))
        end do
      end do

      do j=1,m-2                                   ! Swap input/output arrays
        do i=1,n-2
          A(i,j) = Anew(i,j)
        end do
      end do

      iter = iter + 1
    end do


54 Jacobi Iteration: OpenACC Fortran Code

    ! Copy A in at beginning of loop, out at end. Allocate Anew on accelerator
    !$acc data copy(A), create(Anew)
    do while ( err > tol .and. iter < iter_max )
      err = 0._fp_kind

    !$acc kernels
      do j=1,m
        do i=1,n
          Anew(i,j) = .25_fp_kind * (A(i+1, j  ) + A(i-1, j  ) + &
                                     A(i  , j-1) + A(i  , j+1))
          err = max(err, abs(Anew(i,j) - A(i,j)))
        end do
      end do
    !$acc end kernels
      ...
      iter = iter + 1
    end do
    !$acc end data
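The same pattern can be sketched in C. The version below is a minimal illustration, not the code behind the slides: the grid size, the one-cell boundary halo and the name `jacobi_solve` are assumptions made for this example. The data region keeps A and Anew on the accelerator for the whole while loop, mirroring the Fortran data directive; without an OpenACC compiler the pragmas are ignored and the code runs serially.

```c
#define NX 16   /* interior points per row (illustrative size) */
#define NY 16   /* interior points per column (illustrative size) */

/* Jacobi iteration for the 2D Laplace equation. A carries a one-cell
 * boundary halo that is held fixed. Iterates until the largest change
 * in a sweep is at most tol or iter_max sweeps have run; returns the
 * number of sweeps performed. */
int jacobi_solve(double A[NY + 2][NX + 2], double tol, int iter_max)
{
    static double Anew[NY + 2][NX + 2];
    double err = tol + 1.0;
    int iter = 0;

    /* Copy A in once and out once; Anew lives only on the device. */
    #pragma acc data copy(A[0:NY+2][0:NX+2]) create(Anew)
    while (err > tol && iter < iter_max) {
        err = 0.0;

        #pragma acc kernels
        {
            /* New value at each point is the average of its neighbors. */
            for (int j = 1; j <= NY; ++j)
                for (int i = 1; i <= NX; ++i) {
                    Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                         A[j - 1][i] + A[j + 1][i]);
                    double d = Anew[j][i] - A[j][i];
                    if (d < 0.0) d = -d;
                    if (d > err) err = d;   /* max error for convergence */
                }

            /* Copy the new values back for the next sweep. */
            for (int j = 1; j <= NY; ++j)
                for (int i = 1; i <= NX; ++i)
                    A[j][i] = Anew[j][i];
        }
        ++iter;
    }
    return iter;
}
```

With every boundary value set to 1.0 and the interior started at 0.0, the interior converges toward 1.0, which makes a convenient correctness check.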

55 3 Ways to Accelerate Applications
- Libraries: Drop-in Acceleration
- OpenACC Directives: Easily Accelerate Applications
- Programming Languages: Maximum Flexibility


56 Some GPU-accelerated Libraries
- NVIDIA cuBLAS
- NVIDIA cuRAND
- NVIDIA cuSPARSE
- NVIDIA NPP
- NVIDIA cuFFT
- Vector Signal Image Processing
- GPU Accelerated Linear Algebra
- Matrix Algebra on GPU and Multicore
- IMSL Library
- ArrayFire Matrix Computations
- Building-block Algorithms for CUDA
- Sparse Linear Algebra
- C++ STL Features for CUDA

57 Explore the CUDA (Libraries) Ecosystem
CUDA Tools and Ecosystem described in detail on NVIDIA Developer Zone: developer.nvidia.com/cuda-toolsecosystem

58 3 Ways to Accelerate Applications
- Libraries: Drop-in Acceleration
- OpenACC Directives: Easily Accelerate Applications
- Programming Languages: Maximum Flexibility

59 GPU Programming Languages
- Numerical analytics: MATLAB, Mathematica, LabVIEW
- Fortran: OpenACC, CUDA Fortran
- C: OpenACC, CUDA C
- C++: Thrust, CUDA C++
- Python: PyCUDA, Copperhead, NumbaPro (Continuum Analytics)
- C#: GPU.NET, Hybridizer (AltiMesh)

60 Get Started Today
These languages are supported on all CUDA-capable GPUs. You might already have a CUDA-capable GPU in your laptop or desktop PC!
- CUDA C/C++
- GPU.NET
- Thrust C++ Template Library
- CUDA Fortran
- PyCUDA (Python)
- MATLAB: matlab-gpu.html
- Mathematica: -in-8/cuda-and-opencl-support/

61 Easiest Way to Learn CUDA
- 50k Enrolled, 127 Countries
- Learn from the Best: Prof. John Owens (UC Davis), Dr. David Luebke (NVIDIA Research), Prof. Wen-mei W. Hwu (U of Illinois)
- Courses: Heterogeneous Parallel Programming; Introduction to Parallel Programming
- Anywhere, Any Time: Online, Worldwide, Self Paced
- It's Free! No Tuition, No Hardware, No Books
- Engage with an Active Community: Forums and Meetups
- Hands-on Projects

62 Thank You
NVIDIA Tesla Update, Supercomputing '12
Sumit Gupta, General Manager, Tesla Accelerated Computing
