ADVANCES IN EXTREME-SCALE APPLICATIONS ON GPU
Peng Wang, HPC Developer Technology, NVIDIA

SuperPhones to SuperComputers
Computers no longer get faster, just wider
Architectural Features Common to All Processors

Processor            | Pipelined multi-issue SIMD units (SM / CU / Core / Module) | SIMD width
NVIDIA Kepler        | 15 Streaming Multiprocessors | 32 threads (warp)
AMD Southern Islands | 32 Compute Units             | 64 threads (wavefront)
Intel Xeon Phi       | 60 Cores                     | 16-SIMD (512-bit vector)
Intel Sandy Bridge   | 8 Cores                      | 8-SIMD (256-bit vector)
AMD Bulldozer        | 8 Modules                    | 8-SIMD (256-bit vector)

[Diagram: a pipelined multi-issue SIMD unit, one control unit issuing pipelined instruction stages to multiple processing units.]
Evolution of GPUs: Codesign in Action!
Transistor counts, 1995-2012: RIVA 128 (3M), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), Kepler (7B).
Eras: fixed function, then programmable shaders, then CUDA.
By 2016 the video game market is expected to reach $82 billion.
Computer graphics require billions of parallel computations
Why are so many parallel operations needed?
Millions of triangles and millions of pixels per frame.
[Pipeline: input triangle, transform vertices, tessellate, project onto the image plane (camera), rasterize, shade.]
Scientific simulations require quadrillions of parallel computations per second
An Unlikely Symbiosis: scientific computing and gaming are going in the SAME direction!
1 PARTICLE SIMULATION
PARTICLE SIMULATION - HPC
Ribosome simulated by NAMD, visualized by VMD. [Diagram: forces on atoms and bonds.]
PARTICLE SIMULATION - GAMING
Hair simulation (NVIDIA Hair Demo).
2 CONVOLUTION
The center element of the convolution kernel is placed over the source pixel; the source pixel is then replaced with a weighted sum of itself and nearby pixels. [Diagram: a source image patch, a convolution kernel, and the resulting destination pixel value.]
CONVOLUTION - HPC
Reverse Time Migration (RTM): complex wave interaction near a salt tooth, propagated using AxRTM (Petroleum Geo-Services).
CONVOLUTION - GAMING
Depth of field (Halo 3, Bungie Studios).
3 SOLVING PARTIAL DIFFERENTIAL EQUATIONS (PDEs)
∂x/∂t = f(x, t)
SOLVING PDEs - HPC
"On the Development of a High-Order, Multi-GPU Enabled, Compressible Viscous Flow Solver for Mixed Unstructured Grids," P. Castonguay et al.
SOLVING PDEs - GAMING
Planetside 2 (Sony), Dark Void (Capcom), NVIDIA Turbulence Demo.
4 FAST FOURIER TRANSFORM (FFT)
[Diagram: two sinusoids summing into a composite signal.]
FFT - HPC
Turbulence simulation.
FFT - GAMING
Ocean simulation (NVIDIA Ocean Demo).
5 SPHERICAL HARMONICS
SPHERICAL HARMONICS - HPC
Weather prediction: close-up of a mid-latitude cyclone, created by the Gordon Bell Award-winning atmospheric model AFES using spherical harmonics.
SPHERICAL HARMONICS - GAMING
Indirect lighting: normal diffuse lighting vs. with indirect lighting (Robin Green, "Spherical Harmonic Lighting: The Gritty Details"). Examples: Team Fortress 2 (Valve), Halo 3 (Bungie).
HPC and Gaming: Similarities at a Fundamental Level
Memory-bandwidth bound. Gaming: ambient occlusion. HPC: sparse matrix-vector multiply.
HPC and Gaming: Similarities at a Fundamental Level
Math bound. Gaming: most vertex and pixel shaders (Team Fortress 2, Valve). HPC: simulation of proteins and lipids (blood coagulation factor IX simulated by AMBER).
LOOKING AHEAD - GAMING
Today vs. tomorrow?
LOOKING AHEAD - HPC
Today vs. tomorrow?
Same Fundamental Hardware Design Requirement: Power-Limited
Phone/tablet: ~1W. PC: ~200W. Supercomputer: ~20MW.
The common design constraint is energy efficiency.
NVIDIA Leverages the GPU Technology Across Multiple Industries: HPC = Incremental Investment
All built on the Kepler-architecture GPU:
- GeForce: consumer graphics
- Quadro: professional graphics
- Tegra: mobile computing
- Tesla: HPC
- GRID: cloud computing (GRID Visual Computing Appliance)
Platform for Parallel Computing
The CUDA platform is a foundation that supports a diverse parallel computing ecosystem.
GPU Computing Momentum

                        2008   | 2013
Compute-capable GPUs    100M   | 430M
CUDA Toolkit downloads  150K   | 1.6M
Supercomputers          1      | 50
University courses      60     | 640
Academic papers         4,000  | 37,000
Investing in the Future Enable More Developers More Performance per Watt Future Computing Platforms
Enabling More Programming Languages
- Developers want to build front-ends for Python, Java, R, and DSLs (new language support)
- CUDA C, C++, and Fortran feed the LLVM compiler for CUDA
- Back-ends target NVIDIA GPUs and x86 CPUs today, and can target other processors like ARM and FPGAs (new processor support)
OpenACC Directives

OpenMP (CPU only):

    main() {
        double pi = 0.; long i;
        #pragma omp parallel for reduction(+:pi)
        for (i = 0; i < n; i++) {
            double t = (double)((i + 0.5) / n);
            pi += 4. / (1. + t * t);
        }
        printf("pi = %f\n", pi / n);
    }

OpenACC (CPU and GPU):

    main() {
        double pi = 0.; long i;
        #pragma acc parallel loop reduction(+:pi)
        for (i = 0; i < n; i++) {
            double t = (double)((i + 0.5) / n);
            pi += 4. / (1. + t * t);
        }
        printf("pi = %f\n", pi / n);
    }
PGI: An NVIDIA Company CUDA Fortran OpenACC CUDA x86
Unified Memory
Software prototype working on Kepler in a future CUDA release; hardware support in Maxwell.
[Timeline: Tesla (CUDA, 2008), Fermi (FP64, 2010), Kepler (Dynamic Parallelism, 2012), Maxwell (Unified Virtual Memory, 2014).]
Explicit Memory Copies No Longer Required

With explicit copies:

    void sortfile(FILE *fp, int N) {
        char *data   = (char *)malloc(N);
        char *sorted = (char *)malloc(N);
        fread(data, 1, N, fp);

        char *d_data, *d_sorted;
        cudaMalloc(&d_data, N);
        cudaMalloc(&d_sorted, N);
        cudaMemcpy(d_data, data, N, ...);
        parallel_sort<<<...>>>(d_sorted, d_data, N);
        cudaMemcpy(sorted, d_sorted, N, ...);
        cudaFree(d_data);
        cudaFree(d_sorted);

        use_data(sorted);
        free(data); free(sorted);
    }

With unified memory:

    void sortfile(FILE *fp, int N) {
        char *data   = (char *)malloc(N);
        char *sorted = (char *)malloc(N);
        fread(data, 1, N, fp);

        parallel_sort<<<...>>>(sorted, data, N);

        use_data(sorted);
        free(data); free(sorted);
    }
Investing in the Future Enable More Developers More Performance per Watt Future Computing Platforms
More Performance per Watt
[Chart: DP GFLOPS per watt by generation, roughly doubling each step: Tesla (CUDA, 2008), Fermi (FP64, 2010), Kepler (Dynamic Parallelism, 2012), Maxwell (Unified Virtual Memory, 2014), Volta (Stacked DRAM).]
Investing in the Future Enable More Developers More Performance per Watt Future Computing Platforms
Kayla Development Platform
- CUDA 5, OpenGL 4.3
- Kick-starts the ARM + CUDA ecosystem
- NAMD ported in 2 days
- Quad ARM + Kepler GPU today; quad ARM + any CUDA GPU
- https://developer.nvidia.com/kayla-platform
OpenPOWER Consortium
Echelon Compute Node & System: 2018 Vision

- Node: 16 TF, 2 TB/s, 512+ GB. [Diagram: SMs and latency-optimized cores (LOCs) on a network-on-chip, with L2 slices, memory controllers, DRAM stacks and DIMMs, NV RAM, and a NIC onto the system interconnect.]
- Cabinet (256 nodes): 4 PF, 128 TB
- Echelon system: up to 1 EF

Key architectural features:
- Malleable memory hierarchy
- Hierarchical register files
- Hierarchical thread scheduling
- Place coherency/consistency
- Temporal SIMT and scalarization
- PGAS memory
- HW-accelerated queues
- Active messages
- AMOs everywhere
- Collective engines
- Streamlined LOC/TOC interaction