GPU Computing für rechenintensive Anwendungen (GPU Computing for Compute-Intensive Applications). Axel Koehler, NVIDIA


Transcription:

GPU Computing für rechenintensive Anwendungen. Axel Koehler, NVIDIA

NVIDIA product families: GeForce, Quadro, Tegra, Tesla

Continued Demand for Ever Faster Supercomputers
- First-principles simulation of combustion for new high-efficiency, low-emission engines
- Predictive calculations for thermonuclear and core-collapse supernovae, allowing confirmation of theoretical models
- Comprehensive Earth System Model at 1 km scale, enabling modeling of cloud convection and ocean eddies
- Coupled simulation of entire cells at the molecular, genetic, chemical, and biological levels

Power Crisis in Supercomputing (system power per performance milestone):
- Gigaflop (1982): 60,000 Watts
- Teraflop (1996): 850,000 Watts
- Petaflop (2008): Titan (ORNL) 8,200,000 Watts; K Computer 12,600,000 Watts
- Exaflop (projected 2020): 25,000,000 Watts
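The point of the chart is that efficiency, not just power draw, must improve: going from a gigaflop at 60 kW to an exaflop in a 25 MW budget demands a huge jump in flops per watt. A back-of-the-envelope check (the milestone figures are the slide's; the derived ratios are simple arithmetic):

```python
# Flops-per-watt implied by the milestones on this slide.
milestones = {
    "gigaflop (1982)": (1e9, 60_000),          # (flops, watts)
    "teraflop (1996)": (1e12, 850_000),
    "exaflop (2020 target)": (1e18, 25_000_000),
}

def flops_per_watt(flops, watts):
    """Floating-point operations per second delivered per watt of system power."""
    return flops / watts

giga = flops_per_watt(*milestones["gigaflop (1982)"])
exa = flops_per_watt(*milestones["exaflop (2020 target)"])

# An exaflop in 25 MW needs 40 Gflops/W, about 2.4 million times the
# efficiency of the 1982 gigaflop system (~16.7 kflops/W).
print(f"{exa:.1e} flops/W needed, {exa / giga:.1e}x over 1982")
```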

Multi-core CPUs
- Industry has gone multi-core as a first response to power issues: performance through parallelism, not frequency.
- But CPUs are fundamentally designed for single-thread performance rather than energy efficiency: fast clock rates with deep pipelines; data and instruction caches optimized for latency; superscalar issue with out-of-order execution; extensive branch prediction and speculative execution; lots of instruction overhead per operation.
- Less than 2% of chip power today goes to flops.

Accelerated Computing: Add GPUs to Accelerate Applications. CPUs are designed to run a few tasks quickly; GPUs are designed to run many tasks efficiently.

GPU Performance = Throughput
- Fixed-function hardware: transistors are primarily devoted to data processing; less leaky cache.
- SIMT thread execution: groups of threads are formed into warps which always execute the same instruction.
- Cooperative sharing of units with SIMT, e.g. fetching an instruction on behalf of several threads, or reading a memory location and broadcasting it to several registers.
- Lack of speculation reduces overhead: minimal overhead, energy efficient.
- Hardware-managed parallel thread execution and handling of divergence.
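The SIMT model described above can be illustrated with a toy simulation (purely illustrative Python, not how the hardware is programmed): one instruction stream drives a whole warp, and a divergent branch is handled by executing both sides under an active-lane mask.

```python
WARP_SIZE = 32  # threads per warp on Kepler-class GPUs

def run_warp_branch(values):
    """Toy model of SIMT divergence: the warp executes BOTH sides of a
    branch, each under a mask, so every lane gets the correct result."""
    assert len(values) == WARP_SIZE
    mask_if = [v % 2 == 0 for v in values]   # lanes taking the 'if' side
    results = list(values)
    # Pass 1: the whole warp runs the 'if' path; inactive lanes are masked off.
    for lane, active in enumerate(mask_if):
        if active:
            results[lane] = values[lane] // 2
    # Pass 2: the warp runs the 'else' path for the remaining lanes.
    for lane, active in enumerate(mask_if):
        if not active:
            results[lane] = values[lane] * 3 + 1
    return results

print(run_warp_branch(list(range(32)))[:4])  # lanes 0-3: [0, 4, 1, 10]
```

The cost of divergence is visible in the model: the warp spends time on both passes even though each lane only needs one of them, which is why divergence-free code runs fastest on real hardware.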

Tesla K20/K20X (Kepler GK110) targets:
- Supercomputing: weather/climate modeling, molecular dynamics, computational physics
- Life sciences: biochemistry, bioinformatics, material science
- Manufacturing: structural mechanics, computational fluid dynamics (CFD), electromagnetics
Tesla K10 (Kepler GK104) targets:
- Defense/government: signal processing, image processing, video analytics
- Oil and gas: reverse time migration, Kirchhoff time migration

Tesla K10: 3072 Power-Efficient Cores

Product Name             M2090        K10 (per GPU)   K10 (per board)
GPU Architecture         Fermi        Kepler GK104    2x GK104
# of GPUs                1            -               2
Single Precision Flops   1.3 TF       2.29 TF         4.58 TF
Double Precision Flops   0.66 TF      0.095 TF        0.190 TF
# CUDA Cores             512          1536            3072
Memory Size              6 GB         4 GB            8 GB
Memory BW (ECC off)      177.6 GB/s   160 GB/s        320 GB/s
PCI-Express              Gen 2: 8 GB/s; Gen 3: 16 GB/s
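Since a K10 board carries two GK104 GPUs, the per-board figures are simply twice the per-GPU figures; a quick consistency check on the slide's own numbers:

```python
# Per-GPU specs of one GK104 on the Tesla K10, from the slide.
per_gpu = {"sp_tf": 2.29, "dp_tf": 0.095, "cores": 1536, "bw_gbs": 160}

# The K10 board holds two GK104 GPUs, so board totals are exactly double.
board = {key: 2 * value for key, value in per_gpu.items()}

print(board)  # matches the per-board column: 4.58 TF, 0.19 TF, 3072 cores, 320 GB/s
```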

Tesla K20X/K20: Same Power, 3x DP Perf of Fermi

Product Name            M2090        K20          K20X
Peak Single Precision   1.3 TF       3.52 TF      3.95 TF
Peak SGEMM              0.87 TF      2.99 TF      3.35 TF
Peak Double Precision   0.66 TF      1.17 TF      1.32 TF
Peak DGEMM              0.43 TF      1.0 TF       1.12 TF
Memory Size             6 GB         5 GB         6 GB
Memory BW (ECC off)     177.6 GB/s   208 GB/s     250 GB/s
New CUDA Features       -            GPUDirect w/ RDMA, Hyper-Q, Dynamic Parallelism
ECC Features            DRAM, caches & register files (all products)
# CUDA Cores            512          2496         2688
Total Board Power       225 W        225 W        225 W

Highlights: 3x double precision; Hyper-Q; Dynamic Parallelism. Target workloads: CFD, FEA, finance, physics.

Kepler GK110 Block Diagram: 7.1B transistors, 15 SMX units, 1.3 TFLOP FP64, 1.5 MB L2 cache, 384-bit GDDR5, PCI Express Gen3 compliant.

Kepler GK110 SMX vs. Fermi SM: a ground-up redesign for perf/W, delivering 3x sustained perf/W.
- 6x the SP FP units and 4x the DP FP units, running at slower functional-unit clocks
- ~4x the overall instruction throughput
- 2x the register file size (64K registers)
- 2x the thread blocks (16) and 1.33x the threads (2K)

Hyper-Q: Easier Speedup of Legacy MPI Apps
Fermi exposes 1 work queue; Kepler exposes 32 concurrent work queues. CP2K (quantum chemistry), strong scaling on 864 water molecules, speedup vs. dual CPU across up to 20 GPUs: K20 with Hyper-Q (16 MPI ranks per node) delivers about 2.5x the speedup of K20 without Hyper-Q (1 MPI rank per node).
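A toy model (my own illustrative sketch, not NVIDIA code) shows why the number of hardware work queues matters when many MPI ranks share one GPU: with a single queue, work from different ranks serializes; with 32 queues, it can overlap.

```python
def total_time(kernel_times, num_queues):
    """Idealized model: kernels only overlap if they land in different
    hardware work queues (1 on Fermi, 32 on Kepler with Hyper-Q)."""
    queues = [0.0] * num_queues
    for i, t in enumerate(kernel_times):
        queues[i % num_queues] += t   # round-robin ranks onto work queues
    return max(queues)

ranks = [1.0] * 16           # 16 MPI ranks, each submitting 1 unit of GPU work
print(total_time(ranks, 1))   # single queue: 16.0, everything serializes
print(total_time(ranks, 32))  # 32 queues: 1.0, all ranks run concurrently
```

Real behavior also depends on whether the kernels are small enough to share the GPU, but the model captures why Hyper-Q lets legacy MPI codes keep one rank per core without starving the GPU.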

What is Dynamic Parallelism? The ability to launch new kernels from the GPU:
- Dynamically: based on run-time data
- Simultaneously: from multiple threads at once
- Independently: each thread can launch a different grid
On Fermi, only the CPU can generate GPU work; on Kepler, the GPU can generate work for itself.

NVIDIA GPUDirect: Support for RDMA
Direct communication between GPUs and PCIe devices: the network card can read from or write to GPU memory (GDDR5) directly over PCIe, without staging through CPU system memory, enabling GPU-to-GPU transfers between nodes. [Diagram: two nodes, each with a CPU, system memory, two GPUs, and a network card on PCIe.]

Minimum Change, Big Speed-up: within the application code, only the compute-intensive functions move to the GPU; the rest of the sequential code continues to run on the CPU.

Ways to Accelerate Applications
- Libraries: the easiest approach, no need for programming expertise
- Directives (OpenACC)
- Programming languages (CUDA, ...): maximum performance
- High-level languages (MATLAB, ...)
CUDA libraries and the CUDA language are interoperable with OpenACC.

MATLAB Parallel Computing: On-ramp to GPU Computing
The most popular math functions run on GPUs: random number generation, FFT, matrix multiplication, solvers, convolutions, min/max, SVD, Cholesky and LU factorization. MATLAB Compiler support (GPU acceleration without MATLAB installed); GPU features in the Communications Systems Toolbox.

Wide Adoption of Tesla GPUs
- Oil and gas: reverse time migration, Kirchhoff time migration, reservoir simulation
- Edu/research: astrophysics, lattice QCD, molecular dynamics, weather/climate modeling
- Government: signal processing, satellite imaging, video analytics, synthetic aperture radar
- Life sciences: biochemistry, bioinformatics, material science, sequence analysis, genomics
- Finance: risk analytics, Monte Carlo, options pricing, insurance modeling
- Manufacturing: structural mechanics, computational fluid dynamics, machine vision, electromagnetics

Over 200 GPU-Accelerated Applications: http://www.nvidia.com/teslaapps/

ANSYS and NVIDIA Collaboration Roadmap

Release          ANSYS Mechanical                                          ANSYS Fluent
13.0 (Dec 2010)  SMP, single GPU, sparse and PCG/JCG solvers               -
14.0 (Dec 2011)  + Distributed ANSYS; + multi-node support                 Radiation heat transfer (beta)
14.5 (Nov 2012)  + Multi-GPU support; + hybrid PCG; + Kepler GPU support   + Radiation HT; + GPU AMG solver (beta), single GPU
15.0 (Q4-2013)   + CUDA 5 Kepler tuning                                    + Multi-GPU AMG solver; + CUDA 5 Kepler tuning

ANSYS Mechanical 14.5 GPU Acceleration (jobs per day, higher is better)
Results for Distributed ANSYS 14.5 with 8-core CPUs and single GPUs:
- Westmere Xeon X5690 3.47 GHz, 8 cores: 163 jobs/day; + Tesla C2075 (Fermi): 312 jobs/day (1.9x acceleration)
- Sandy Bridge Xeon E5-2687W 3.10 GHz, 8 cores: 213 jobs/day; + Tesla K20 (Kepler): 394 jobs/day (1.9x acceleration)
V14sp-5 model: turbine geometry, 2,100,000 DOF, SOLID187 FEs, static, nonlinear; one iteration (the final solution requires 25); direct sparse solver. Results from a Supermicro X9DR3-F with 64 GB memory.
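The quoted accelerations follow directly from the jobs-per-day numbers on the slide (simple arithmetic check):

```python
# Jobs per day from the ANSYS Mechanical 14.5 benchmark slide:
# (CPU only, CPU + GPU) for each host CPU generation.
results = {
    "Westmere X5690 + Tesla C2075": (163, 312),
    "Sandy Bridge E5-2687W + Tesla K20": (213, 394),
}

for config, (cpu_only, cpu_gpu) in results.items():
    speedup = cpu_gpu / cpu_only
    # ~1.91x and ~1.85x, both quoted as roughly 1.9x on the slide.
    print(f"{config}: {speedup:.2f}x")
```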

ANSYS Fluent 15.0 Multi-GPU Demonstration
Multi-GPU acceleration of a 16-core ANSYS Fluent simulation of external aero: 2.9x solver speedup (Xeon E5-2667 CPUs + Tesla K20X GPUs). CPU configuration: a 16-core server node. CPU + GPU configuration: the same node (2x 8 cores) plus four K20X GPUs (G1-G4).

Abaqus 6.12 Solution Performance/Cost: Westmere and Sandy Bridge CPUs with Tesla M2090
Factors gain over 1 Westmere CPU (6 cores); baseline performance and solution cost = 1.0:
- 1 CPU + 1 GPU (Westmere, 6 cores + 1 Tesla): 1.76x performance, 1.13x solution cost
- 2 CPUs + 2 GPUs (Westmere, 12 cores + 2 Tesla): 2.51x performance, 1.50x solution cost
- 1 CPU + 1 GPU (Sandy Bridge, 8 cores + 1 Tesla): 2.21x performance, 1.21x solution cost
- 2 CPUs + 2 GPUs (Sandy Bridge, 16 cores + 2 Tesla): 2.86x performance, 1.57x solution cost
Solution cost basis: Abaqus license tokens (base: 5 tokens for 1 CPU core; 1 additional token unlocks 1 additional CPU core or an entire GPU); $8K server node; $2K per Tesla 20-series GPU (hardware + GPUs + software tokens).
Performance basis: s4b engine block model, 5M DOF, solid FEs, static, nonlinear, five iterations, direct sparse solver. Baseline (1x) corresponds to 1 CPU (6 cores of Westmere X5670, 2.93 GHz); baseline hardware cost corresponds to 2x CPUs with 64 GB memory (the same cost is used for Westmere and Sandy Bridge CPUs).
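A useful way to read the chart is performance per unit of solution cost; computing it from the slide's factors (my arithmetic on the slide's numbers) shows every GPU configuration gains more performance than it adds in cost:

```python
# (performance factor, solution-cost factor) relative to 1 Westmere CPU (6 cores).
configs = {
    "1 CPU + 1 GPU (Westmere)": (1.76, 1.13),
    "2 CPUs + 2 GPUs (Westmere)": (2.51, 1.50),
    "1 CPU + 1 GPU (Sandy Bridge)": (2.21, 1.21),
    "2 CPUs + 2 GPUs (Sandy Bridge)": (2.86, 1.57),
}

for name, (perf, cost) in configs.items():
    # Ranges from ~1.56x to ~1.83x performance per unit cost.
    print(f"{name}: {perf / cost:.2f}x performance per unit cost")
```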

MSC Nastran Solution Price-Performance Gain (NOTE: based on MSC Nastran 2013 alpha)
Results from a PSG cluster node: 2x Westmere 2.93 GHz, 96 GB memory, Linux/RHEL 6.2; 2x Tesla M2090, CUDA 4.2. Factors gain over the base license (Nastran SMP license, 1 core = 1.0 speedup, 1.00 solution cost):
- Nastran SMP, 4 cores: 2.6x speedup, 1.00x solution cost
- Nastran DMP, 8 cores: 5.0x speedup, 1.20x solution cost
- Nastran SMP + GPU license, 4 cores + 1 GPU: 4.8x speedup, 1.11x solution cost
- Nastran DMP + GPU license, 8 cores + 2 GPUs: 9.9x speedup, 1.35x solution cost
An extra ~15% cost yields ~200% performance (over 8 cores).
Solution cost basis: Structures package (base SMP license), Exterior Acoustics package, Implicit HPC package (DMP network license), GPU license; $10K system cost; $4K for 2x Tesla 20-series; 1-year lease for software pricing.
Performance basis: SOL108 model, 7.47M DOF, NVH analysis, trimmed car body.

AMBER: K20 Accelerates Simulations of All Sizes
Running AMBER 12, GPU support revision 12.1, SPFP with CUDA 4.2.9, ECC off. The CPU-only node contains 2x Intel E5-2687W CPUs (8 cores per CPU); the GPU node contains the same CPUs plus 1x NVIDIA K20 GPU. Speedup vs. CPU only (CPU = 1.00 for all molecules): TRPcage GB 2.66x; JAC NVE PME 6.50x; Factor IX NVE PME 6.56x; Cellulose NVE PME 7.28x; Myoglobin GB 25.56x; Nucleosome GB 28.00x. Gain up to 28x throughput/performance by adding just one K20 GPU compared to dual-CPU performance. (Image: nucleosome.)

AMBER: Replace 8 Nodes with 1 K20 GPU
Running AMBER 12, GPU support revision 12.1, SPFP with CUDA 4.2.9, ECC off; DHFR benchmark.
- Eight CPU nodes, each with 2x Intel E5-2687W CPUs (8 cores, $2,000 per CPU): 65.00 ns/day, $32,000
- One node with 2x Intel E5-2687W CPUs plus 1x NVIDIA K20 GPU ($2,500 per GPU): 81.09 ns/day, $6,500
Cut simulation costs to roughly 1/4 while gaining higher performance.
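The dollar figures follow directly from the per-part prices given on the slide; in fact the single GPU node is closer to a fifth of the cost, which the slide conservatively rounds to a quarter:

```python
CPU_PRICE = 2_000   # per Intel E5-2687W, from the slide
GPU_PRICE = 2_500   # per NVIDIA K20, from the slide

eight_cpu_nodes = 8 * 2 * CPU_PRICE         # 8 nodes, 2 CPUs each
one_gpu_node = 2 * CPU_PRICE + GPU_PRICE    # 1 node: 2 CPUs + 1 K20

print(eight_cpu_nodes, one_gpu_node)                        # 32000 6500
print(f"cost ratio: {one_gpu_node / eight_cpu_nodes:.2f}")  # ~0.20
print(f"throughput ratio: {81.09 / 65.00:.2f}")             # ~1.25x ns/day
```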

NAMD: Accelerates Simulations of All Sizes
Running NAMD 2.9 with CUDA 4.0, ECC off. The CPU-only node contains 2x Intel E5-2687W CPUs (8 cores per CPU); the GPU node contains the same CPUs plus 1x NVIDIA K20 GPU. Speedup vs. CPU only (CPU = 1.0): ApoA1 2.6x; F1-ATPase 2.4x; STMV 2.7x. Gain ~2.5x throughput/performance by adding just 1 GPU compared to dual-CPU performance. (Image: apolipoprotein A1.)

NAMD: Twice the Science per Watt with K20
Energy used in simulating 1 nanosecond of ApoA1 (lower is better); energy expended = power x time. Running NAMD 2.9. The CPU-only node contains dual E5-2687W CPUs (95 W, 4 cores per CPU); the GPU node contains 2x Intel Xeon X5550 CPUs (95 W, 4 cores per CPU) and 2x NVIDIA K20 GPUs (225 W per GPU). Result: the node with 2x K20 cuts energy usage roughly in half.
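The "energy expended = power x time" relation explains how a node that draws more power can still use less energy: the GPU run finishes much sooner. A hedged illustration (the node powers and runtimes below are hypothetical placeholders, not the slide's measured data):

```python
def energy_kj(power_watts, runtime_seconds):
    """Energy expended in kilojoules: power (W) x time (s) / 1000."""
    return power_watts * runtime_seconds / 1000.0

# Hypothetical example: a 400 W CPU node taking 10 hours vs. the same node
# plus two 225 W GPUs finishing the same simulation in 3 hours.
cpu_only = energy_kj(400, 10 * 3600)
cpu_gpu = energy_kj(400 + 2 * 225, 3 * 3600)

print(cpu_only, cpu_gpu)  # 14400.0 kJ vs 9180.0 kJ: faster AND lower energy
```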

Titan: World's Fastest Supercomputer
18,688 Tesla K20X accelerators; 27 petaflops peak, with 90% of performance coming from GPUs; 17.59 petaflops sustained on Linpack.
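The "90% of performance from GPUs" claim can be sanity-checked against the K20X peak double-precision figure quoted earlier in the deck (1.32 TF per board): 18,688 accelerators account for roughly 24.7 of the 27 petaflops peak.

```python
K20X_PEAK_DP_TF = 1.32   # peak double precision per K20X, from the K20X slide
N_GPUS = 18_688          # accelerators in Titan
TITAN_PEAK_PF = 27.0     # system peak in petaflops

gpu_peak_pf = N_GPUS * K20X_PEAK_DP_TF / 1000.0
# ~24.7 PF from GPUs, about 91% of peak: consistent with the ~90% quoted.
print(f"{gpu_peak_pf:.1f} PF from GPUs = {gpu_peak_pf / TITAN_PEAK_PF:.0%} of peak")
```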

Summary
- Power is the main HPC constraint: the vast majority of work must be done by cores designed for efficiency.
- GPU computing has a sustainable model, aligned with technology trends and supported by consumer markets.
- An increasing number of open-source community and commercial ISV applications are already implemented.
- Good speedup compared to CPU-only runs; better total cost of ownership (TCO); lower energy consumption.

GPU Technology Conference 2013, March 18-21, San Jose, CA: the one event you can't afford to miss.
Learn about leading-edge advances in GPU computing, explore the research as well as the commercial applications, discover advances in computational visualization, and take a deep dive into parallel programming.
Ways to participate:
- Speak: share your work and gain exposure as a thought leader
- Register: learn from the experts and network with your peers
- Exhibit/Sponsor: promote your company as a key player in the GPU ecosystem
www.gputechconf.com

GPU Computing für rechenintensive Anwendungen. Axel Koehler, NVIDIA (akoehler@nvidia.com)