GPU COMPUTING AND THE FUTURE OF HPC
Timothy Lanfear, NVIDIA
Exascale Computing Will Enable Transformational Science Results
- First-principles simulation of combustion for new high-efficiency, low-emission engines.
- Predictive calculations for thermonuclear and core-collapse supernovae, allowing confirmation of theoretical models.
- Comprehensive Earth System Model at 1 km scale, enabling modeling of cloud convection and ocean eddies.
- Coupled simulation of entire cells at the molecular, genetic, chemical, and biological levels.
Exaflop Expectations
[Chart: Top500 performance over time (Sum, N=1, and N=500 curves) on a log scale from 100 MF to 10 EF, projecting the first exaflop computer. Machines have grown from the CM-5 at ~200 kW to Titan at 8.2 MW: growing size, cost, and power.]
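To make the "growing power" worry concrete, one can naively scale Titan's figures (27 PF at 8.2 MW, quoted later in this deck) up to one exaflop; a sketch, assuming no efficiency improvement at all:

```python
# Naive exaflop power extrapolation from Titan's figures
# (27 PF at 8.2 MW, taken from this deck). Purely illustrative.
titan_perf_pf = 27.0    # petaflops
titan_power_mw = 8.2    # megawatts

exaflop_pf = 1000.0     # 1 EF = 1000 PF
scale = exaflop_pf / titan_perf_pf
exaflop_power_mw = titan_power_mw * scale
print(f"Naive exaflop machine: {exaflop_power_mw:.0f} MW")  # ~304 MW
```

Roughly 300 MW is far beyond any practical machine-room budget, which is why efficiency, not peak arithmetic, dominates the rest of the deck.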
Power: This Time It's Different
In the good old days, leakage was not important and voltage scaled with feature size:
  L' = L/2, V' = V/2
  E' = C'V'^2 = E/8 (since C scales with L)
  f' = 2f
  D' = 1/L'^2 = 4D
  P' = P
Halve L and get 4x the transistors and 8x the capability for the same power! That took us from MF to GF to TF, and almost to PF. Technology was giving us 68% per year in perf/W; processors realized ~50% per year (the rest was spent on single-thread performance).
The new reality: leakage has limited threshold voltage, largely ending voltage scaling. Halve L and get only 2x the capability for the same power. At constant voltage, technology gives us only 19% per year in perf/W.
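The scaling arithmetic on this slide can be checked directly; a small sketch, assuming (as the 68%/19% figures imply) that one halving of L corresponds to roughly four years of process generations:

```python
# Classic (Dennard) scaling: halve L, halve V.
# Capacitance C scales with L, so switching energy E = C*V^2 drops 8x.
transistors = 4      # density D = 1/L^2 -> 4x
frequency   = 2      # f ~ 1/L -> 2x
energy      = 1 / 8  # E = C*V^2 -> 1/8
power = transistors * frequency * energy
capability = transistors * frequency    # 8x throughput per L-halving
assert power == 1.0                     # same power, as the slide says

# Perf/W improvement per year, if L halves every ~4 years:
years_per_halving = 4
dennard = capability ** (1 / years_per_halving)   # ~1.68 -> 68%/yr
post_dennard = 2 ** (1 / years_per_halving)       # ~1.19 -> 19%/yr
print(f"{dennard:.2f}x/yr vs {post_dennard:.2f}x/yr")
```

The 8x-per-halving and 2x-per-halving cases reproduce the 68%/year and 19%/year figures exactly.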
The High Cost of Data Movement
Fetching operands costs more than computing on them. Energy figures for a 28 nm IC with a 20 mm die:
- 64-bit DP flop: 20 pJ
- 256-bit access to an 8 kB SRAM: 50 pJ
- Moving 256 bits on-chip: 26 pJ locally, 256 pJ across the die
- Off-chip link: ~1 nJ; an efficient off-chip link: 500 pJ
- DRAM read/write: 16 nJ
The relative cost grows with each generation: wire delay (ps/mm) is not improving.
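To see how lopsided these numbers are, compare the 20 pJ flop against the cost of fetching its operands; a sketch using the slide's figures (the pairing of 50 pJ with the SRAM access, 256 pJ with the cross-die wire, and 16 nJ with DRAM follows the standard version of this diagram):

```python
# Energy figures from the slide (28 nm process), in picojoules.
flop_dp        = 20     # one 64-bit DP flop
sram_256b      = 50     # 256-bit access to an 8 kB SRAM
wire_20mm_256b = 256    # moving 256 bits across a 20 mm die
dram_rdwr      = 16000  # DRAM read/write (16 nJ)

# A fused multiply-add reads three 64-bit operands (192 bits, roughly
# one 256-bit access). Cost of moving the data vs doing the math:
print(f"local SRAM fetch: {sram_256b / flop_dp:.1f}x the flop")  # 2.5x
print(f"across-die wire:  {wire_20mm_256b / flop_dp:.1f}x")      # 12.8x
print(f"DRAM access:      {dram_rdwr / flop_dp:.0f}x")           # 800x
```

Even an on-chip SRAM fetch costs more than the arithmetic it feeds, and a DRAM access costs nearly three orders of magnitude more; this is the quantitative case for the locality emphasis later in the deck.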
HPC Is Going Hybrid
- x86 CPU: fast single threads for serial work (e.g. Sandy Bridge, 32 nm, 690 pJ/flop)
- GPU: extreme power efficiency for throughput work (e.g. Kepler, 28 nm, 132 pJ/flop)
- Connected over PCIe
Do most of the work on cores optimized for extreme energy efficiency; still keep a few cores optimized for fast serial work. The same hybrid pattern appears in Xeon + Intel MIC and in AMD Fusion.
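The pJ/flop figures translate directly into machine power at scale; a sketch, assuming (unrealistically) that every flop costs exactly the quoted energy and nothing else:

```python
# Power = (flops per second) * (energy per flop).
# Illustrative only: real machines also spend heavily on data movement.
exaflop = 1e18  # flops per second
for name, pj_per_flop in [("Sandy Bridge CPU", 690), ("Kepler GPU", 132)]:
    watts = exaflop * pj_per_flop * 1e-12
    print(f"{name}: {watts / 1e6:.0f} MW")  # 690 MW vs 132 MW
```

Even in this best case, an exaflop of CPU arithmetic alone would draw 690 MW versus 132 MW for the GPU, which is why the throughput work has to move to the efficient cores.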
Kepler Generation of GPUs
- Tesla K10: dual GK104 GPUs, 3x single-precision performance. Video, signal processing, life sciences, seismic.
- Tesla K20: GK110 GPU, 3x double-precision performance. CFD, FEA, finance, physics, etc.
Overarching Goals for Tesla Power Efficiency Ease of Programming And Portability Application Space Coverage
GK110 GPU
Kepler: the world's fastest, most efficient HPC accelerator
- SMX (power efficiency)
- Hyper-Q and Dynamic Parallelism (programmability and application coverage)
Titan: World's #1 Supercomputer
18,688 Tesla K20X GPUs, 27 petaflops
Flagship Scientific Applications on Titan
- Material Science (WL-LSMS): role of material disorder, statistics, and fluctuations in nanoscale materials and systems.
- Climate Change (CAM-SE): answer questions about specific climate change adaptation and mitigation scenarios; realistically represent features like precipitation patterns/statistics and tropical storms.
- Biofuels (LAMMPS): a multiple-capability molecular dynamics code.
- Astrophysics (NRDF): radiation transport critical to astrophysics, laser fusion, combustion, atmospheric dynamics, and medical imaging.
- Combustion (S3D): combustion simulations to enable the next generation of diesel/biofuel engines to burn more efficiently.
- Nuclear Energy (Denovo): unprecedented high-fidelity radiation transport calculations for a variety of nuclear energy and technology applications.
Kepler GPU Performance Results
Dual-socket comparison: single-CPU+GPU node vs. dual-CPU node (CPU = 8-core Sandy Bridge E5-2687W, 3.10 GHz).
[Bar chart: relative performance, normalized to the dual-CPU node (= 1.00), for Chroma, SPECFEM3D, AMBER, WL-LSMS, and NAMD in three configurations: dual-CPU, single-CPU + M2090, and single-CPU + K20X. Speedups range from about 1.7x to 10.2x, with the K20X configuration fastest in every case.]
What Does The Future Hold?
The Future of HPC Programming
- Computers are not getting faster, just wider
- Need to structure all HPC apps as throughput problems
- Locality within nodes is much more important
- Need to expose locality (programming model) and explicitly manage the memory hierarchy (compiler, runtime, autotuner)
How can we enable programmers to code for future processors in a portable way?
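"Exposing locality" concretely means structuring loops so data is reused while it is still close to the processor. A minimal blocking (tiling) sketch, not from the talk, with hypothetical names:

```python
def matmul_tiled(A, B, T=32):
    """C = A @ B computed tile by tile: each T x T block of work reuses
    the same small tiles of A and B while they are still hot in fast
    memory, instead of streaming whole rows through the hierarchy."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, T):              # tile over rows of C
        for jj in range(0, n, T):          # tile over columns of C
            for kk in range(0, n, T):      # tile over the reduction
                for i in range(ii, min(ii + T, n)):
                    for k in range(kk, min(kk + T, n)):
                        a_ik = A[i][k]     # loaded once per tile row
                        for j in range(jj, min(jj + T, n)):
                            C[i][j] += a_ik * B[k][j]
    return C
```

The arithmetic is identical to the naive triple loop; only the iteration order changes, which is exactly the kind of transformation the slide wants programming models, compilers, and autotuners to express and manage.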
Evolution of GPUs This Decade
- Integration (memory, processor types, network)
- Further concentration on locality (both HW and SW)
- Reducing overheads (intra-node and inter-node)
- Continued convergence with consumer technology
GPU Roadmap
[Chart: DP GFLOPS per watt (log scale, 0.5 to 32) by year, 2008 to 2014: Tesla (CUDA, 2008), Fermi (FP64, 2010), Kepler (Dynamic Parallelism, 2012), then Maxwell (Unified Virtual Memory) and Volta (Stacked DRAM).]
The Future of HPC Is Green
- Power is the constraint: the vast majority of work must be done by cores designed for efficiency
- GPU computing has a sustainable model, aligned with technology trends and supported by consumer markets
- Future evolution will focus on integration (CPU, network, memory) and increased generality: efficiency on any code with high parallelism
This is simply how computers will be built.