ADVANCES IN EXTREME-SCALE APPLICATIONS ON GPU. Peng Wang HPC Developer Technology

1 ADVANCES IN EXTREME-SCALE APPLICATIONS ON GPU Peng Wang HPC Developer Technology

2 NVIDIA SuperPhones to SuperComputers

3 Computers no longer get faster, just wider

4 Architectural Features Common to All Processors

Each processor is built from pipelined, multi-issue SIMD units (SM / CU / Core / Module). Diagram: a control unit (pipeline stages S1 S2 S3) feeding several processing units (stages S4 S5 S6).

NVIDIA Kepler: 15 Streaming Multiprocessors, 32 threads (Warp)
AMD Southern Islands: 32 Compute Units, 64 threads (Wavefront)
Intel Xeon Phi: 60 Cores, 16-SIMD (512-bit vector)
Intel Sandy Bridge: 8 Cores, 8-SIMD (256-bit vector)
AMD Bulldozer: 8 Modules, 8-SIMD (256-bit vector)

5 Evolution of GPUs: Codesign in Action! Kepler 7B xtors RIVA 128 3M xtors GeForce M xtors GeForce 3 6M xtors GeForce FX 25M xtors GeForce M xtors Fixed function Programmable shaders CUDA

6 By 2016 the video game market is expected to reach $82 billion

7 Computer graphics require billions of parallel computations

8 Why are so many parallel operations needed? Millions of triangles Millions of pixels Image plane Camera Input triangle Transform vertices Tessellate Projection Rasterize Shade

9 Scientific simulations require quadrillions of parallel computations per second

10 An Unlikely Symbiosis: scientific computing and gaming are going in the SAME direction!

11 1. PARTICLE SIMULATION

12 PARTICLE SIMULATION HPC Ribosome simulated by NAMD, visualized by VMD Bond Atom Forces

13 PARTICLE SIMULATION GAMING Hair simulation NVIDIA Hair Demo

14 2. CONVOLUTION Diagram: source pixel, convolution kernel, new pixel value (destination pixel). The center element of the kernel is placed over the source pixel. The source pixel is then replaced with a weighted sum of itself and nearby pixels.
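The weighted-sum step described on this slide can be sketched in plain C. This is a minimal illustration, not code from the talk: the function name `convolve2d`, the row-major grayscale layout, and the clamp-to-edge border handling are all assumptions made for the example. The kernel is applied without flipping, which gives the same result as a true convolution for symmetric kernels.

```c
/* Clamp an index into [0, n-1] so border pixels reuse the nearest edge pixel
 * (border policy is an assumption for this sketch). */
static int clamp_index(int i, int n) {
    if (i < 0) return 0;
    if (i >= n) return n - 1;
    return i;
}

/* Apply a ksize x ksize kernel (ksize odd) to a width x height grayscale
 * image stored row-major. Each destination pixel is the weighted sum of the
 * source pixel and its neighbours, with the kernel centred on the source pixel. */
void convolve2d(const float *src, float *dst, int width, int height,
                const float *kernel, int ksize) {
    int r = ksize / 2;  /* kernel radius */
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float sum = 0.0f;
            for (int ky = -r; ky <= r; ky++) {
                for (int kx = -r; kx <= r; kx++) {
                    int sy = clamp_index(y + ky, height);
                    int sx = clamp_index(x + kx, width);
                    sum += kernel[(ky + r) * ksize + (kx + r)]
                         * src[sy * width + sx];
                }
            }
            dst[y * width + x] = sum;
        }
    }
}
```

Every output pixel is computed independently of the others, which is why the operation maps naturally onto many parallel threads.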

15 CONVOLUTION HPC RTM Reverse Time Migration Petroleum Geo Services complex wave interaction near a salt tooth propagated using AxRTM

16 CONVOLUTION GAMING Depth of field Halo 3 Bungie Studios

17 3. SOLVING PARTIAL DIFFERENTIAL EQUATIONS (PDEs) $\partial x / \partial t = f(x, t)$
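Reading the slide's equation as a time-evolution problem, dx/dt = f(x, t) (an assumption, since the extracted text is ambiguous), the simplest way to advance it is a forward-Euler step. The sketch below uses a scalar state and a hypothetical right-hand side `decay_rhs`; neither name comes from the talk. In a real PDE solver x would be the vector of grid values and f the discretized spatial operator.

```c
/* Right-hand side: returns dx/dt for state x at time t. */
typedef double (*rhs_fn)(double x, double t);

/* Hypothetical example right-hand side: exponential decay, dx/dt = -x. */
double decay_rhs(double x, double t) {
    (void)t;  /* this example does not depend on time */
    return -x;
}

/* Advance x from t0 to t1 using `steps` forward-Euler updates:
 * x_{k+1} = x_k + dt * f(x_k, t_k). */
double euler_integrate(rhs_fn f, double x0, double t0, double t1, int steps) {
    double dt = (t1 - t0) / steps;
    double x = x0;
    double t = t0;
    for (int i = 0; i < steps; i++) {
        x += dt * f(x, t);
        t += dt;
    }
    return x;
}
```

For example, `euler_integrate(decay_rhs, 1.0, 0.0, 1.0, 1000)` approximates exp(-1); forward Euler is only first-order accurate, so production solvers typically use higher-order schemes.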

18 SOLVING PDEs HPC On the Development of a High-Order, Multi-GPU Enabled, Compressible Viscous Flow Solver for Mixed Unstructured Grids. P. Castonguay et al.

19 SOLVING PDEs GAMING Planetside 2 Sony Dark Void Capcom NVIDIA Turbulence Demo Dark Void Capcom

20 4. FAST FOURIER TRANSFORM (FFT)

21 FFT HPC Turbulence simulation

22 FFT GAMING Ocean Simulation NVIDIA Ocean Demo

23 5. SPHERICAL HARMONICS

24 SPHERICAL HARMONICS - HPC Weather Prediction Close up of a mid-latitude cyclone Created by Gordon Bell Award winning atmospheric model AFES using SPH

25 SPHERICAL HARMONICS - GAMING Indirect Lighting Normal Diffuse Lighting With Indirect Lighting Robin Green, Spherical Harmonic Lighting: The Gritty Details Team Fortress 2 Valve Halo 3 Bungie

26 HPC and Gaming: Similarities at a fundamental level. Memory bandwidth bound. Gaming: ambient occlusion. HPC: sparse matrix-vector multiply.
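To see why sparse matrix-vector multiply is limited by memory bandwidth, it helps to look at the inner loop. The sketch below assumes the common CSR (compressed sparse row) layout; the function name `spmv_csr` and the choice of CSR are assumptions for illustration, not details from the talk. Each nonzero costs one multiply and one add but requires loading a matrix value, a column index, and an element of x, so the loop spends most of its time waiting on memory.

```c
/* y = A * x for a sparse matrix A in CSR format.
 * row_ptr has num_rows + 1 entries; the nonzeros of row i are
 * values[row_ptr[i]] .. values[row_ptr[i + 1] - 1], with matching col_idx. */
void spmv_csr(int num_rows, const int *row_ptr, const int *col_idx,
              const double *values, const double *x, double *y) {
    for (int row = 0; row < num_rows; row++) {
        double sum = 0.0;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; k++) {
            /* 2 flops per nonzero, but three memory loads feed them */
            sum += values[k] * x[col_idx[k]];
        }
        y[row] = sum;
    }
}
```

Rows are independent, so the outer loop parallelizes easily; the achievable speed is then set by how fast the memory system can stream `values` and `col_idx`.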

27 HPC and Gaming: Similarities at a fundamental level. Memory bandwidth bound vs. math bound. Math bound, gaming: most vertex and pixel shaders (Team Fortress 2, Valve). Math bound, HPC: simulation of proteins and lipids (blood coagulation factor IX simulated by AMBER).

28 LOOKING AHEAD GAMES Today Tomorrow?

29 LOOKING AHEAD - HPC Today Tomorrow?

30 Same Fundamental Hardware Design Requirement Power-limited Phone/tablet: ~1W PC: ~2W Supercomputer: ~2MW Energy efficiency

31 NVIDIA Leverages the GPU Technology Across Multiple Industries. HPC = incremental investment. Products built on the Kepler GPU architecture: GRID (Visual Computing Appliance), GeForce (consumer graphics), Quadro (professional graphics), Tegra (mobile computing), Tesla (HPC), GRID (cloud computing).

32 Platform for Parallel Computing. The CUDA Platform is a foundation that supports a diverse parallel computing ecosystem.

33 GPU Computing Momentum M Compute Capable GPUs 43M Compute-Capable GPUs 15K CUDA Toolkit Downloads 1.6M CUDA Toolkit Downloads 1 Supercomputer 5 Supercomputers 6 University Courses 64 University Courses 4, Academic Papers 37, Academic Papers

34 Investing in the Future Enable More Developers More Performance per Watt Future Computing Platforms

35 Investing in the Future Enable More Developers More Performance per Watt Future Computing Platforms

36 Enabling More Programming Languages. LLVM Compiler for CUDA: front ends for CUDA C, C++, Fortran; back ends for NVIDIA GPUs and x86 CPUs. New language support: developers want to build front-ends for Python, Java, R, DSLs. New processor support: target other processors like ARM, FPGAs, GPUs, x86.

37 OpenACC Directives

OpenMP (CPU):

    int main(void) {
      double pi = 0.0; long i;
      #pragma omp parallel for reduction(+:pi)
      for (i = 0; i < n; i++) {   /* n: number of intervals, defined elsewhere */
        double t = (double)((i + 0.5) / n);
        pi += 4.0 / (1.0 + t * t);
      }
      printf("pi = %f\n", pi / n);
    }

OpenACC (CPU, GPU):

    int main(void) {
      double pi = 0.0; long i;
      #pragma acc parallel loop reduction(+:pi)
      for (i = 0; i < n; i++) {   /* n: number of intervals, defined elsewhere */
        double t = (double)((i + 0.5) / n);
        pi += 4.0 / (1.0 + t * t);
      }
      printf("pi = %f\n", pi / n);
    }
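The slide's loop is a midpoint-rule estimate of the integral of 4/(1+t^2) over [0, 1], which equals pi. A self-contained version of the OpenMP variant might look like the sketch below. The function name `estimate_pi` and passing the interval count as a parameter are assumptions added for the example; the slide leaves n undefined. Compiled without OpenMP support the pragma is ignored and the loop simply runs serially.

```c
/* Midpoint-rule estimate of pi = integral of 4 / (1 + t^2) over [0, 1],
 * using n equal-width intervals. */
double estimate_pi(long n) {
    double sum = 0.0;
    long i;
    /* Each thread accumulates a private partial sum; OpenMP combines them. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        double t = (i + 0.5) / (double)n;  /* midpoint of interval i */
        sum += 4.0 / (1.0 + t * t);
    }
    return sum / (double)n;
}
```

A caller would print the result with something like `printf("pi = %f\n", estimate_pi(1000000));`. Swapping the pragma for `#pragma acc parallel loop reduction(+:sum)` gives the OpenACC form shown on the slide.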

38 PGI: An NVIDIA Company CUDA Fortran OpenACC CUDA x86

39 Unified Memory. Software prototype working on Kepler, in a future CUDA release. Hardware support in Maxwell. Roadmap: Tesla (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (Unified Virtual Memory).

40 Explicit Memory Copies No Longer Required

With explicit device allocations and copies:

    void sortfile(FILE *fp, int N) {
      char *data = (char*)malloc(N);
      char *sorted = (char*)malloc(N);
      fread(data, 1, N, fp);

      /* device buffers and host-to-device copy */
      char *d_data, *d_sorted;
      cudaMalloc(&d_data, N);
      cudaMalloc(&d_sorted, N);
      cudaMemcpy(d_data, data, N, ...);

      parallel_sort<<<... >>>(d_sorted, d_data, N);

      /* device-to-host copy and device cleanup */
      cudaMemcpy(sorted, d_sorted, N, ...);
      cudaFree(d_data);
      cudaFree(d_sorted);

      use_data(sorted);
      free(data); free(sorted);
    }

With host pointers passed directly to the kernel:

    void sortfile(FILE *fp, int N) {
      char *data = (char*)malloc(N);
      char *sorted = (char*)malloc(N);
      fread(data, 1, N, fp);

      parallel_sort<<<... >>>(sorted, data, N);

      use_data(sorted);
      free(data); free(sorted);
    }

41 Investing in the Future Enable More Developers More Performance per Watt Future Computing Platforms

42 More Performance per Watt. Chart (y-axis: DP GFLOPS per watt): Tesla (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (Unified Virtual Memory), Volta (Stacked DRAM).

43 Investing in the Future Enable More Developers More Performance per Watt Future Computing Platforms

44 Kayla Development Platform CUDA 5 OpenGL 4.3 Kick starts ARM + CUDA Ecosystem NAMD Ported in 2 Days Quad ARM + Kepler GPU Quad ARM + Any CUDA GPU

45 OpenPOWER Consortium

46 2018 Vision: Echelon Compute Node & System

Diagram labels: LOC (up to LOC 7), SM (up to SM 255), L2 (256KB), DRAM stacks, DRAM DIMMs, NoC, MC, NV RAM, NIC, System Interconnect.

Node 0: 16 TF, 2 TB/s, 512+ GB, through Node 255
Cabinet 0: 4 PF, 128 TB, through Cabinet N-1
Echelon System: up to 1 EF

Key architectural features:
Malleable memory hierarchy
Hierarchical register files
Hierarchical thread scheduling
Place coherency/consistency
Temporal SIMT & scalarization
PGAS memory
HW accelerated queues
Active messages
AMOs everywhere
Collective engines
Streamlined LOC/TOC interaction
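The node, cabinet, and system figures on this slide are consistent with each other, which a few lines of arithmetic can confirm. The sketch below assumes 256 nodes per cabinet (inferred from the label "Node 255"), decimal units for flops, and binary units for memory; the function names are made up for the example.

```c
/* Figures taken from the slide. */
static const double NODE_TFLOPS = 16.0;   /* per-node peak, TF */
static const double NODE_MEM_GB = 512.0;  /* per-node memory, GB (slide says 512+) */

/* Assumption: nodes are numbered 0..255, so 256 nodes per cabinet. */
enum { NODES_PER_CABINET = 256 };

/* Cabinet peak in PF: 256 * 16 TF = 4096 TF, about 4 PF. */
double cabinet_pflops(void) {
    return NODES_PER_CABINET * NODE_TFLOPS / 1000.0;
}

/* Cabinet memory in TB: 256 * 512 GB = 128 TB. */
double cabinet_mem_tb(void) {
    return NODES_PER_CABINET * NODE_MEM_GB / 1024.0;
}

/* Number of cabinets needed to reach 1 EF (1000 PF). */
double cabinets_for_exaflop(void) {
    return 1000.0 / cabinet_pflops();
}
```

Under these assumptions a 1 EF system needs roughly 245 cabinets, or on the order of 60,000 nodes.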
