GPU Computing for Compute-Intensive Applications
Axel Koehler, NVIDIA
GeForce | Quadro | Tegra | Tesla
Continued Demand for Ever Faster Supercomputers
- First-principles simulation of combustion for new high-efficiency, low-emission engines
- Predictive calculations for thermonuclear and core-collapse supernovae, allowing confirmation of theoretical models
- Comprehensive Earth System Model at 1 km scale, enabling modeling of cloud convection and ocean eddies
- Coupled simulation of entire cells at the molecular, genetic, chemical and biological levels
Power Crisis in Supercomputing
Power drawn by systems at each performance milestone:
- Gigaflop (1982): 60,000 Watts
- Teraflop (1996): 850,000 Watts
- Petaflop (2008): 8,200,000 Watts (Titan, ORNL); 12,600,000 Watts (K Computer)
- Exaflop (projected 2020): 25,000,000 Watts
Multi-core CPUs
- Industry has gone multi-core as a first response to power issues
- Performance through parallelism, not frequency
- But CPUs are fundamentally designed for single-thread performance rather than energy efficiency:
  - Fast clock rates with deep pipelines
  - Data and instruction caches optimized for latency
  - Superscalar issue with out-of-order execution
  - Heavy use of branch prediction and speculative execution
  - Lots of instruction overhead per operation
- Less than 2% of chip power today goes to flops
Accelerated Computing: Add GPUs to Accelerate Applications
- CPUs: designed to run a few tasks quickly
- GPUs: designed to run many tasks efficiently
Energy-Efficient GPU: Performance = Throughput
- Fixed-function hardware: transistors are primarily devoted to data processing; less leaky cache
- SIMT thread execution: groups of threads are formed into warps that always execute the same instruction
- Hardware-managed parallel thread execution and handling of divergence
- Cooperative sharing of units under SIMT, e.g. fetching an instruction on behalf of several threads, or reading a memory location and broadcasting it to several registers
- Lack of speculation reduces overhead
- Minimal overhead, energy efficient
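To make the SIMT model concrete, here is a minimal CUDA sketch (illustrative, not from the slides): each thread handles one element, threads are grouped into 32-wide warps that execute a common instruction stream, and the divergent branch is resolved by the hardware rather than the programmer.

    // Minimal SIMT illustration (hypothetical example, assumes the CUDA toolkit).
    // The 256 threads of each block execute as 8 warps of 32 threads; every thread
    // of a warp runs the same instruction, and the `if` divergence below is
    // handled by the hardware.
    #include <cuda_runtime.h>

    __global__ void clampScale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
        if (i < n) {
            if (data[i] < 0.0f)          // divergent branch, managed by SIMT hardware
                data[i] = 0.0f;
            else
                data[i] *= factor;
        }
    }

    int main()
    {
        const int n = 1 << 20;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        clampScale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);   // one instruction stream per warp
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }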
Tesla K20/K20X (Kepler GK110):
- Supercomputing
- Weather / Climate Modeling
- Molecular Dynamics
- Computational Physics
- Life Sciences: Biochemistry, Bioinformatics
- Material Science
Tesla K10 (Kepler GK104):
- Manufacturing: Structural Mechanics, Computational Fluid Dynamics (CFD), Electromagnetics
- Defense / Government: Signal Processing, Image Processing, Video Analytics
- Oil and Gas: Reverse Time Migration, Kirchhoff Time Migration
Tesla K10: 3072 Power-Efficient Cores
Product: M2090 | K10
GPU architecture: Fermi | Kepler GK104
Number of GPUs per board: 1 | 2
Single precision flops: 1.3 TF | 2.29 TF per GPU (4.58 TF per board)
Double precision flops: 0.66 TF | 0.095 TF per GPU (0.190 TF per board)
CUDA cores: 512 | 1536 per GPU (3072 per board)
Memory size: 6 GB | 4 GB per GPU (8 GB per board)
Memory BW (ECC off): 177.6 GB/s | 160 GB/s per GPU (320 GB/s per board)
PCI Express: Gen 2: 8 GB/s | Gen 3: 16 GB/s
Tesla K20X/K20: Same Power, 3x the Double-Precision Performance of Fermi
Product: M2090 | K20 | K20X
Peak single precision / peak SGEMM: 1.3 TF / 0.87 TF | 3.52 TF / 2.99 TF | 3.95 TF / 3.35 TF
Peak double precision / peak DGEMM: 0.66 TF / 0.43 TF | 1.17 TF / 1.0 TF | 1.32 TF / 1.12 TF
Memory size: 6 GB | 5 GB | 6 GB
Memory BW (ECC off): 177.6 GB/s | 208 GB/s | 250 GB/s
ECC features: DRAM, caches & register files (all three products)
New CUDA features: - | GPUDirect w/ RDMA, Hyper-Q, Dynamic Parallelism | GPUDirect w/ RDMA, Hyper-Q, Dynamic Parallelism
CUDA cores: 512 | 2496 | 2688
Total board power: 225 W | 225 W | 225 W
3x double precision; Hyper-Q and Dynamic Parallelism; target workloads: CFD, FEA, finance, physics
Kepler GK110 Block Diagram
- 7.1B transistors
- 15 SMX units
- 1.3 TFLOP FP64
- 1.5 MB L2 cache
- 384-bit GDDR5
- PCI Express Gen3 compliant
Kepler GK110 SMX vs. Fermi SM: 3x sustained perf/W
- Ground-up redesign for perf/W
- 6x the SP FP units, 4x the DP FP units, slower functional-unit clocks
- ~4x the overall instruction throughput
- 2x the register file size (64K registers)
- 2x the thread blocks (16) and 1.33x the threads (2K) per SMX
Hyper-Q: Easier Speedup of Legacy MPI Apps
- Fermi: 1 work queue; Kepler: 32 concurrent work queues
- CP2K (quantum chemistry), strong scaling on 864 water molecules; speedup vs. a dual-CPU node plotted against the number of GPUs (up to 20)
- K20 with Hyper-Q (16 MPI ranks per node) delivers roughly 2.5x the performance of K20 without Hyper-Q (1 MPI rank per node)
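As a rough illustration of what Hyper-Q enables, the following hedged CUDA sketch (not the CP2K code; kernel and sizes are illustrative) submits independent kernels on several streams. On Kepler each stream, or each MPI rank when the Hyper-Q proxy is used, gets its own hardware work queue, so these launches no longer serialize through the single queue of Fermi.

    // Hedged sketch: concurrent work submission via CUDA streams on a Kepler-class GPU.
    #include <cuda_runtime.h>

    __global__ void smallKernel(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 1.000001f + 0.5f;
    }

    int main()
    {
        const int nStreams = 8, n = 1 << 16;
        cudaStream_t streams[nStreams];
        float *buf[nStreams];
        for (int s = 0; s < nStreams; ++s) {
            cudaStreamCreate(&streams[s]);
            cudaMalloc(&buf[s], n * sizeof(float));
            // independent work per stream; with Hyper-Q these can execute concurrently
            smallKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
        }
        cudaDeviceSynchronize();
        for (int s = 0; s < nStreams; ++s) {
            cudaFree(buf[s]);
            cudaStreamDestroy(streams[s]);
        }
        return 0;
    }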
What is Dynamic Parallelism?
The ability to launch new kernels from the GPU:
- Dynamically: based on run-time data
- Simultaneously: from multiple threads at once
- Independently: each thread can launch a different grid
Fermi: only the CPU can generate GPU work. Kepler: the GPU can generate work for itself.
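A hedged sketch of what a device-side launch looks like (names and sizes are illustrative; requires compute capability 3.5+ and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true -lcudadevrt):

    // Dynamic parallelism sketch: the parent kernel decides at run time, on the GPU,
    // whether to launch a child grid; no round-trip to the CPU is needed.
    #include <cuda_runtime.h>

    __global__ void child(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    __global__ void parent(float *data, int n, int threshold)
    {
        // decision based on run-time data; each thread could launch a different grid
        if (threadIdx.x == 0 && blockIdx.x == 0 && n > threshold)
            child<<<(n + 255) / 256, 256>>>(data, n);   // launched from the GPU itself
    }

    int main()
    {
        const int n = 1 << 20;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        parent<<<1, 32>>>(d, n, 1024);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }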
NVIDIA GPUDirect Support for RDMA
Direct communication between GPUs and PCIe devices: the network card in one node can read from and write to GPU memory (GDDR5) directly and exchange data with a remote node's GPU without staging the transfer through CPU system memory.
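With a CUDA-aware MPI library built with GPUDirect RDMA support (for example recent MVAPICH2 or Open MPI builds; this is an assumption about the software stack, not something stated on the slide), device pointers can be handed directly to MPI and the adapter moves the data without a detour through host memory:

    // Hedged sketch: sending a GPU buffer between two ranks with CUDA-aware MPI.
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        float *d_buf;                           // device memory, never copied to the host here
        cudaMalloc(&d_buf, n * sizeof(float));

        if (rank == 0)
            MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }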
Minimum Change, Big Speed-up
Only the compute-intensive functions of the application code are moved to the GPU; the rest of the sequential code continues to run on the CPU.
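A minimal sketch of this offload pattern (hypothetical function and data, assuming a CUDA-capable build): only the hot loop becomes a GPU kernel, while the calling code stays as it was.

    // Offload sketch: the former CPU loop  y[i] += a * x[i]  moves to the GPU;
    // everything around this function remains ordinary sequential CPU code.
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] += a * x[i];
    }

    void accelerated_step(int n, float a, const float *h_x, float *h_y)
    {
        float *d_x, *d_y;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);
        saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x);
        cudaFree(d_y);
    }

    int main()
    {
        const int n = 1 << 20;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        accelerated_step(n, 3.0f, x.data(), y.data());   // the only GPU-aware call site
        return 0;
    }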
Ways to Accelerate Applications
- Libraries: easiest approach, no need for GPU programming expertise
- Directives (OpenACC)
- Programming languages (CUDA, ...): maximum performance
- High-level languages (MATLAB, ...)
CUDA libraries and the CUDA language are both interoperable with OpenACC.
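For the library path, a hedged sketch of a drop-in cuBLAS call (matrix sizes and values are illustrative): a CPU BLAS SGEMM is replaced by cublasSgemm and the math runs entirely on the GPU.

    // Library-based acceleration sketch: single-precision matrix multiply with cuBLAS.
    #include <vector>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main()
    {
        const int n = 512;
        std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

        float *dA, *dB, *dC;
        cudaMalloc(&dA, n * n * sizeof(float));
        cudaMalloc(&dB, n * n * sizeof(float));
        cudaMalloc(&dC, n * n * sizeof(float));
        cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        // C = alpha * A * B + beta * C, computed on the GPU
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);

        cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }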
MATLAB Parallel Computing: On-Ramp to GPU Computing
- Most popular math functions on GPUs: random number generation, FFT, matrix multiplication, solvers, convolutions, min/max, SVD, Cholesky and LU factorization
- MATLAB Compiler support (GPU acceleration without MATLAB installed)
- GPU features in the Communications Systems Toolbox
Wide Adoption of Tesla GPUs
- Oil and gas: Reverse Time Migration, Kirchhoff Time Migration, Reservoir Simulation
- Edu/Research: Astrophysics, Lattice QCD, Molecular Dynamics, Weather / Climate Modeling
- Government: Signal Processing, Satellite Imaging, Video Analytics, Synthetic Aperture Radar
- Life Sciences: Biochemistry, Bioinformatics, Material Science, Sequence Analysis, Genomics
- Finance: Risk Analytics, Monte Carlo, Options Pricing, Insurance Modeling
- Manufacturing: Structural Mechanics, Computational Fluid Dynamics, Machine Vision, Electromagnetics
Over 200 GPU-Accelerated Applications: http://www.nvidia.com/teslaapps/
ANSYS and NVIDIA Collaboration Roadmap
Release 13.0 (Dec 2010): ANSYS Mechanical — SMP, single GPU, sparse and PCG/JCG solvers | ANSYS Fluent — (none)
Release 14.0 (Dec 2011): ANSYS Mechanical — + Distributed ANSYS; + multi-node support | ANSYS Fluent — radiation heat transfer (beta)
Release 14.5 (Nov 2012): ANSYS Mechanical — + multi-GPU support; + hybrid PCG; + Kepler GPU support | ANSYS Fluent — + radiation HT; + GPU AMG solver (beta), single GPU
Release 15.0 (Q4 2013): ANSYS Mechanical — + CUDA 5 Kepler tuning | ANSYS Fluent — + multi-GPU AMG solver; + CUDA 5 Kepler tuning
ANSYS Mechanical 14.5 GPU Acceleration (jobs per day, higher is better)
Results for Distributed ANSYS 14.5 with 8-core CPUs and single GPUs. V14sp-5 model: turbine geometry, 2,100,000 DOF, SOLID187 finite elements, static nonlinear analysis, one iteration (final solution requires 25), direct sparse solver. Results from a Supermicro X9DR3-F with 64 GB memory.
- Westmere Xeon X5690 (3.47 GHz, 8 cores): 163 jobs/day; + Tesla C2075 (Fermi): 312 jobs/day, 1.9x acceleration
- Sandy Bridge Xeon E5-2687W (3.10 GHz, 8 cores): 213 jobs/day; + Tesla K20 (Kepler): 394 jobs/day, 1.9x acceleration
ANSYS Fluent 15.0 Multi-GPU Demonstration
Multi-GPU acceleration of a 16-core ANSYS Fluent simulation of external aerodynamics: 2.9x solver speedup with Xeon E5-2667 CPUs plus Tesla K20X GPUs. CPU configuration: a 16-core server node (2 x 8 cores). CPU + GPU configuration: the same node with four K20X GPUs (G1-G4) added.
Abaqus 6.12 Solution Performance/Cost: Westmere and Sandy Bridge CPUs with Tesla M2090
Factors of gain over 1 Westmere CPU (6 cores); performance speed-up vs. solution cost (hardware + GPUs + software tokens):
- 1 CPU + 1 GPU (Westmere, 6 cores + 1 Tesla): 1.76x performance, 1.13x cost
- 2 CPUs + 2 GPUs (Westmere, 12 cores + 2 Tesla): 2.51x performance, 1.50x cost
- 1 CPU + 1 GPU (Sandy Bridge, 8 cores + 1 Tesla): 2.21x performance, 1.21x cost
- 2 CPUs + 2 GPUs (Sandy Bridge, 16 cores + 2 Tesla): 2.86x performance, 1.57x cost
Solution cost basis: Abaqus license tokens (base: 5 tokens for 1 CPU core; 1 additional token unlocks 1 additional CPU core or an entire GPU), $8K for the server node, $2K for one Tesla 20-series GPU.
Performance basis: s4b engine block model, 5M DOF, solid finite elements, static nonlinear, five iterations, direct sparse solver. Baseline performance (1x) corresponds to 1 CPU (6 cores of Westmere X5670, 2.93 GHz); baseline hardware cost (1.0x) corresponds to 2 CPUs and 64 GB memory (the same cost is used for Westmere and Sandy Bridge CPUs).
MSC Nastran Solution Price-Performance Gain (based on MSC Nastran 2013 Alpha)
Results from a PSG cluster node: 2x Westmere 2.93 GHz, 96 GB memory, Linux/RHEL 6.2, 2x Tesla M2090, CUDA 4.2. Factors of gain over the base license (speed-up vs. solution cost):
- Nastran SMP license, 1 core: 1.0x (baseline), cost 1.00
- Nastran SMP, 4 cores: 2.6x, cost 1.00
- Nastran DMP, 8 cores: 5.0x, cost 1.20
- Nastran SMP + GPU license, 4 cores + 1 GPU: 4.8x, cost 1.11
- Nastran DMP + GPU license, 8 cores + 2 GPUs: 9.9x, cost 1.35
An extra ~15% cost yields roughly 2x the performance compared to 8 CPU cores.
Solution cost basis: Structures package (base SMP license), Exterior Acoustics package, Implicit HPC package (DMP network license), GPU license; $10K for the system, $4K for 2x Tesla 20-series; 1-year lease used for software pricing.
Performance basis: SOL108 model, 7.47M DOF, NVH analysis, trimmed car body.
AMBER: K20 Accelerates Simulations of All Sizes
Speedup compared to CPU only, running AMBER 12 (GPU support revision 12.1), SPFP precision with CUDA 4.2.9, ECC off. The CPU node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each GPU node contains the same CPUs plus 1x NVIDIA K20 GPU.
- CPU only: 1.00x (baseline)
- TRPCage (GB): 2.66x
- JAC (NVE, PME): 6.50x
- Factor IX (NVE, PME): 6.56x
- Cellulose (NVE, PME): 7.28x
- Myoglobin (GB): 25.56x
- Nucleosome (GB): 28.00x
Gain up to 28x throughput/performance by adding just one K20 GPU compared to dual-CPU performance.
AMBER: Replace 8 Nodes with 1 K20 GPU
Running AMBER 12 (GPU support revision 12.1), SPFP with CUDA 4.2.9, ECC off; DHFR benchmark. The eight CPU nodes each contain 2x Intel E5-2687W CPUs (8 cores, $2,000 per CPU); the GPU node contains 2x Intel E5-2687W CPUs plus 1x NVIDIA K20 GPU ($2,500 per GPU).
- 8 CPU nodes: 65.00 ns/day at a hardware cost of $32,000
- 1 node + 1x K20: 81.09 ns/day at a hardware cost of $6,500
Cut simulation costs to about a quarter while gaining higher performance.
NAMD: K20 Accelerates Simulations of All Sizes
Speedup compared to CPU only, running NAMD 2.9 with CUDA 4.0, ECC off. The CPU node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each GPU node contains the same CPUs plus 1x NVIDIA K20 GPU.
- CPU only: 1x (baseline)
- ApoA1 (Apolipoprotein A1): 2.6x
- F1-ATPase: 2.4x
- STMV: 2.7x
Gain roughly 2.5x throughput/performance by adding just one GPU compared to dual-CPU performance.
NAMD: Twice the Science per Watt with K20
Energy expended (kJ) in simulating 1 nanosecond of ApoA1, lower is better; Energy Expended = Power x Time. Running NAMD 2.9. The CPU-only node contains dual E5-2687W CPUs (95 W, 4 cores per CPU); the GPU node contains 2x Intel Xeon X5550 CPUs (95 W, 4 cores per CPU) and 2x NVIDIA K20 GPUs (225 W per GPU). Adding GPUs cuts energy usage roughly in half.
Titan: World's Fastest Supercomputer
- 18,688 Tesla K20X accelerators
- 27 petaflops peak: 90% of performance from GPUs
- 17.59 petaflops sustained performance on Linpack
Summary
- Power is the main HPC constraint; the vast majority of work must be done by cores designed for efficiency
- GPU computing has a sustainable model: aligned with technology trends and supported by consumer markets
- A growing number of open-source community and commercial ISV applications are already GPU-accelerated
- Good speedups compared to CPU-only runs
- Better total cost of ownership (TCO) and lower energy consumption
GPU Technology Conference 2013, March 18-21, San Jose, CA
The one event you can't afford to miss:
- Learn about leading-edge advances in GPU computing
- Explore the research as well as the commercial applications
- Discover advances in computational visualization
- Take a deep dive into parallel programming
Ways to participate:
- Speak: share your work and gain exposure as a thought leader
- Register: learn from the experts and network with your peers
- Exhibit/Sponsor: promote your company as a key player in the GPU ecosystem
www.gputechconf.com
GPU Computing for Compute-Intensive Applications
Axel Koehler, NVIDIA (akoehler@nvidia.com)