GPU Computing with NVIDIA's New Kepler Architecture
Axel Koehler, Sr. Solution Architect HPC
HPC Advisory Council Meeting, March 13-15, 2013, Lugano
NVIDIA: Parallel Computing Company
- GPUs: GeForce, Quadro, Tesla
- ARM SoCs: Tegra
- VGX
Tesla K20/K20X (Kepler GK110)
- Supercomputing: weather / climate modeling, molecular dynamics, computational physics
- Life sciences: biochemistry, bioinformatics
- Manufacturing: material science, structural mechanics, computational fluid dynamics (CFD), electromagnetics

Tesla K10 (Kepler GK104)
- Defense / government: signal processing, image processing, video analytics
- Oil and gas: reverse time migration, Kirchhoff time migration
| Product Name | K10 | K20 | K20X |
|---|---|---|---|
| GPU Architecture | Kepler GK104 | Kepler GK110 | Kepler GK110 |
| # of GPUs | 2 | 1 | 1 |
| Peak Single Precision Flops | 4.58 TF (2.3 TF per GPU) | 3.52 TF | 3.95 TF |
| Peak SGEMM | 2.98 TF | 2.61 TF | 2.90 TF |
| Peak Double Precision Flops | 0.19 TF (0.095 TF per GPU) | 1.17 TF | 1.32 TF |
| Peak DGEMM | 0.12 TF | 1.10 TF | 1.22 TF |
| Memory Size | 8 GB (4 GB per GPU) | 5 GB | 6 GB |
| Memory BW (ECC off) | 320 GB/s (160 GB/s per GPU) | 208 GB/s | 250 GB/s |
| New CUDA Features | GPUDirect w/ RDMA | GPUDirect (RDMA), Hyper-Q, Dynamic Parallelism | GPUDirect (RDMA), Hyper-Q, Dynamic Parallelism |
| ECC Features | External DRAMs only | DRAM, caches & register files | DRAM, caches & register files |
| # CUDA Cores | 3072 (1536 per GPU) | 2496 | 2688 |
| Total Board Power | 225 W | 225 W | 235 W |
| Board Type | PCIe passive | PCIe passive, active, SXM | PCIe passive, SXM |
Parallel Computing Platform
- Multiple programming approaches:
  - Libraries: drop-in acceleration
  - OpenACC directives: easily accelerate applications
  - Programming languages: maximum flexibility
- Development environment: Parallel Nsight IDE (Linux, Mac and Windows; GPU debugging and profiling), CUDA-GDB debugger, NVIDIA Visual Profiler; third-party tools such as DDT, TotalView and Vampir
- Open compiler tools: enable compiling new languages to the CUDA platform, and CUDA languages to other architectures; OpenACC compilers
- Hardware capabilities: SMX, Dynamic Parallelism, Hyper-Q, GPUDirect
Kepler Features Make GPU Coding Easier
- Hyper-Q: speeds up legacy MPI applications; Fermi exposed 1 work queue, Kepler exposes 32 concurrent work queues
- Dynamic Parallelism: less back-and-forth between CPU and GPU, simpler code; on Fermi the CPU must launch every kernel, on Kepler kernels can launch kernels directly on the device
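A minimal sketch of what Dynamic Parallelism looks like in code (the kernel and data names are illustrative, not from the talk; GK110 / compute capability 3.5 and `nvcc -arch=sm_35 -rdc=true` are assumed):

```cuda
#include <cuda_runtime.h>

// Illustrative child kernel: scales one tile of the data.
__global__ void child(float *data, int offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[offset + i] *= 2.0f;
}

// Parent kernel: with Dynamic Parallelism the device itself launches
// the dependent child kernels, so the CPU does not have to orchestrate
// each launch in a back-and-forth loop.
__global__ void parent(float *data, int tiles, int tile_size)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int t = 0; t < tiles; ++t)
            child<<<(tile_size + 255) / 256, 256>>>(data, t * tile_size, tile_size);
        cudaDeviceSynchronize();  // wait for the children before the parent finishes
    }
}

int main()
{
    const int tiles = 4, tile_size = 1024, n = tiles * tile_size;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    parent<<<1, 1>>>(d, tiles, tile_size);  // one host launch replaces many
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

On Fermi, the host loop over tiles and the four launches would all live on the CPU side; here the CPU issues a single launch.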
Proxy: A Multi-Process Runtime for MPI (e.g. CUDA MPI ranks 0-3 sharing one GPU)
- Why: speedups for MPI programs with low GPU utilization
- How: multiple CPU processes use a single GPU simultaneously
- Proxy server with a client-server architecture; client processes share the same CUDA context
- When: currently on Cray; production on Linux with the next CUDA version
Kepler Enables Full NVIDIA GPUDirect RDMA
(Diagram: two servers, each with a CPU, system memory, two GPUs with GDDR5 memory, and a network card on PCIe; RDMA moves data directly between GPU memory and the network card.)
- GPUDirect RDMA is a general approach and can also be used in conjunction with other PCIe devices (e.g. flash memory devices)
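With GPUDirect RDMA and a CUDA-aware MPI library (as offered by MVAPICH2 and Open MPI builds of that era), device pointers can be handed straight to MPI calls. A hedged sketch, assuming a CUDA-aware MPI and two ranks (buffer names are illustrative):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    // With a CUDA-aware MPI, the device pointer goes directly into
    // MPI_Send/MPI_Recv; with GPUDirect RDMA the network card then
    // reads/writes GPU memory over PCIe without staging the data
    // through host memory.
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Without GPUDirect, the same transfer would need an explicit `cudaMemcpy` to a host staging buffer on each side.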
CUDA Compiler Contributed to Open Source LLVM
- Developers want to build front ends for Java, Python, R and DSLs
- Target other processors such as ARM, FPGA, GPUs and x86
- CUDA C, C++ and Fortran feed the LLVM compiler for CUDA, which targets NVIDIA GPUs and x86 CPUs: new language support on the front end, new processor support on the back end
Open Compiler Architecture
- NVCC pipeline: .cu source → CUDA front end (libcuda.lang) → NVVM IR → LLVM optimizer → NVPTX code generator (libnvvm, open sourced) → PTX → PTXAS
- Host code goes through the host compiler; the result runs on the CUDA runtime and CUDA driver
- https://developer.nvidia.com/cuda-llvm-compiler
Scenarios for the Compiler SDK
- Building production-quality compilers: NVCC with libcuda.LANG front ends (e.g. CUDA Fortran, OpenACC) on top of libnvvm and the CUDA runtime
- Building domain-specific languages (DSLs): a DSL front end on top of libnvvm, the CUDA runtime and a DSL runtime (examples: JET, MATLAB, HALIDE)
- Enabling other platforms: an x86 LLVM backend targeting an x86 CUDA runtime
Enabling Research in GPU Computing
- Example: CU++ built on CLANG, with an x86 LLVM backend, libnvvm and a custom runtime
Proposed Additions for OpenACC 2.0
- Address ambiguities in the existing spec
- A list of 30+ features to be added, including:
  - Nested parallelism
  - Separate compilation
  - Function calls
  - Data directives for control, unstructured data, deep copy for C++ structures, non-contiguous memory
  - Multiple devices
  - Profiling interface
- Certification: OpenACC test suite
- http://www.openacc.org
Growing OpenACC Support
GPU-Accelerated Applications: www.nvidia.com/appscatalog
Summary
- The Kepler architecture focuses on performance, efficiency and programmability
- NVIDIA's parallel computing platform is evolving
- The open compiler architecture and Compiler SDK play an important role in broadening the GPU platform
- Strong growth in GPU-accelerated applications in academia and industry
Thank you. Questions?
Axel Koehler, Sr. Solution Architect HPC
akoehler@nvidia.com