GPU Computing with NVIDIA's new Kepler Architecture

Axel Koehler, Sr. Solution Architect HPC
HPC Advisory Council Meeting, March 13-15, 2013, Lugano

NVIDIA: Parallel Computing Company
GPUs: GeForce, Quadro, Tesla
ARM SoCs: Tegra
VGX

Tesla K20/K20X (Kepler GK110)
Supercomputing: Weather / Climate Modeling, Molecular Dynamics, Computational Physics
Life Sciences: Biochemistry, Bioinformatics, Material Science
Manufacturing: Structural Mechanics, Comp Fluid Dynamics (CFD), Electromagnetics

Tesla K10 (Kepler GK104)
Defense / Govt: Signal Processing, Image Processing, Video Analytics
Oil and Gas: Reverse Time Migration, Kirchhoff Time Migration

Product Name        | K10                         | K20                                            | K20X
GPU Architecture    | Kepler GK104                | Kepler GK110                                   | Kepler GK110
# of GPUs           | 2                           | 1                                              | 1
Peak Single Flops   | 4.58 TF (2.3 TF per GPU)    | 3.52 TF                                        | 3.95 TF
Peak SGEMM          | 2.98 TF                     | 2.61 TF                                        | 2.90 TF
Peak Double Flops   | 0.19 TF (0.095 TF per GPU)  | 1.17 TF                                        | 1.32 TF
Peak DGEMM          | 0.12 TF                     | 1.10 TF                                        | 1.22 TF
Memory size         | 8 GB (4 GB per GPU)         | 5 GB                                           | 6 GB
Memory BW (ECC off) | 320 GB/s (160 GB/s per GPU) | 208 GB/s                                       | 250 GB/s
New CUDA Features   | GPUDirect w/ RDMA           | GPUDirect (RDMA), Hyper-Q, Dynamic Parallelism | GPUDirect (RDMA), Hyper-Q, Dynamic Parallelism
ECC Features        | External DRAMs only         | DRAM, Caches & Reg Files                       | DRAM, Caches & Reg Files
# CUDA Cores        | 3072 (1536 per GPU)         | 2496                                           | 2688
Total Board Power   | 225 W                       | 225 W                                          | 235 W
Board Type          | PCI-e Passive               | PCI-e Passive, Active, SXM                     | PCI-e Passive, SXM
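The figures in the comparison above can be turned into a few derived ratios. The following is a minimal Python sketch, not part of the original slides: the dictionary keys and the function names `gemm_efficiency`, `dp_gflops_per_watt` and `bytes_per_dp_flop` are hypothetical, and the only inputs are the numbers quoted on the slide.

```python
# Figures copied from the product comparison above.
# TF = teraflops, GB/s = memory bandwidth with ECC off, W = total board power.
BOARDS = {
    "K10": {"sp_peak_tf": 4.58, "sgemm_tf": 2.98, "dp_peak_tf": 0.19,
            "dgemm_tf": 0.12, "mem_bw_gbs": 320.0, "power_w": 225.0},
    "K20": {"sp_peak_tf": 3.52, "sgemm_tf": 2.61, "dp_peak_tf": 1.17,
            "dgemm_tf": 1.10, "mem_bw_gbs": 208.0, "power_w": 225.0},
    "K20X": {"sp_peak_tf": 3.95, "sgemm_tf": 2.90, "dp_peak_tf": 1.32,
             "dgemm_tf": 1.22, "mem_bw_gbs": 250.0, "power_w": 235.0},
}


def gemm_efficiency(name):
    """Return (SGEMM / peak single, DGEMM / peak double) for one board."""
    b = BOARDS[name]
    return b["sgemm_tf"] / b["sp_peak_tf"], b["dgemm_tf"] / b["dp_peak_tf"]


def dp_gflops_per_watt(name):
    """Peak double-precision gigaflops per watt of board power."""
    b = BOARDS[name]
    return b["dp_peak_tf"] * 1000.0 / b["power_w"]


def bytes_per_dp_flop(name):
    """Memory bandwidth (ECC off) divided by peak double-precision rate."""
    b = BOARDS[name]
    return b["mem_bw_gbs"] / (b["dp_peak_tf"] * 1000.0)


if __name__ == "__main__":
    for name in BOARDS:
        sp_eff, dp_eff = gemm_efficiency(name)
        print(f"{name}: SGEMM {sp_eff:.0%} of peak, DGEMM {dp_eff:.0%} of peak, "
              f"{dp_gflops_per_watt(name):.2f} DP GF/W, "
              f"{bytes_per_dp_flop(name):.2f} bytes per DP flop")
```

With the slide's numbers, DGEMM reaches a little over 90% of peak double precision on the K20 and K20X but under two thirds on the K10, which is consistent with the split of application areas between the GK110 and GK104 boards.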

Parallel Computing Platform
Multiple Programming Approaches: Libraries (drop-in acceleration); OpenACC Directives (easily accelerate applications); Programming Languages (maximum flexibility)
Development Environment: Parallel Nsight IDE (Linux, Mac and Windows) for GPU debugging and profiling; CUDA-GDB debugger; NVIDIA Visual Profiler; third-party tools (DDT, TotalView, Vampir)
Open Compiler Tools: enable compiling new languages to the CUDA platform, and CUDA languages to other architectures; OpenACC compiler
Hardware Capabilities: SMX, Dynamic Parallelism, Hyper-Q, GPUDirect

Kepler Features Make GPU Coding Easier
Hyper-Q: speeds up legacy MPI apps. Fermi: 1 work queue. Kepler: 32 concurrent work queues.
Dynamic Parallelism: less back-and-forth, simpler code.
[Diagram: CPU and Fermi GPU compared with CPU and Kepler GPU]
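The effect of going from 1 work queue to 32 can be illustrated with a deliberately idealized model. This Python sketch is an assumption-laden toy, not a description of the hardware scheduler: it assumes identical tasks, that one queue runs one task at a time, and that tasks from different queues overlap perfectly as long as their combined utilization fits on the GPU. The function name `makespan` and its parameters are hypothetical.

```python
import math


def makespan(num_tasks, task_time, utilization_pct, work_queues):
    """Idealized time to finish `num_tasks` identical GPU tasks.

    Assumptions (illustrative only): each task needs `utilization_pct`
    percent of the GPU for `task_time`, one work queue runs one task at a
    time, and tasks in different queues overlap without any overhead.
    """
    if not 1 <= utilization_pct <= 100:
        raise ValueError("utilization_pct must be between 1 and 100")
    if work_queues < 1:
        raise ValueError("work_queues must be at least 1")
    # How many tasks fit on the GPU side by side.
    fit_on_gpu = 100 // utilization_pct
    # Concurrency is limited by queues, GPU capacity and available work.
    concurrent = max(1, min(work_queues, fit_on_gpu, num_tasks))
    rounds = math.ceil(num_tasks / concurrent)
    return rounds * task_time


if __name__ == "__main__":
    # Hypothetical workload: 8 MPI ranks, each submitting one 10 ms kernel
    # that keeps only 25% of the GPU busy.
    one_queue = makespan(8, 10.0, 25, work_queues=1)
    many_queues = makespan(8, 10.0, 25, work_queues=32)
    print(f"1 work queue:    {one_queue:.0f} ms")
    print(f"32 work queues:  {many_queues:.0f} ms")
```

In this model the benefit comes entirely from low per-task utilization: a task that already fills the GPU gains nothing from extra queues, which matches the slide's framing of Hyper-Q as a speedup for legacy MPI applications.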

Proxy: A Multi-Process Runtime for MPI
Why: speedups for MPI programs with low GPU utilization.
How: multiple CPU processes run on a single GPU simultaneously.
Proxy Server: client-server architecture; client processes share the same CUDA context.
When: currently on Cray; production on Linux with the next CUDA version.
[Diagram: CUDA MPI ranks 0 to 3 and one GPU]

Kepler Enables Full NVIDIA GPUDirect RDMA
[Diagram: Server 1 and Server 2, each with a CPU and system memory, GPU1 and GPU2 with GDDR5 memory, and a network card on PCI-e; the two network cards are connected over the network]
GPUDirect RDMA is a general approach and can also be used in conjunction with other PCIe devices (e.g. flash memory devices).

CUDA Compiler Contributed to Open Source LLVM
Developers want to build front-ends for Java, Python, R, DSLs.
Target other processors like ARM, FPGA, GPUs, x86.
[Diagram: CUDA C, C++, Fortran and new language support feed the LLVM Compiler for CUDA, which targets NVIDIA GPUs, x86 CPUs and new processor support]

Open Compiler Architecture
[Diagram: a .cu file enters NVCC; the CUDA FE (libcuda.lang) produces NVVM IR; libnvvm (LLVM Optimizer and NVPTX CodeGen, open sourced) produces PTX for PTXAS; further boxes: Host Compiler, CUDA Runtime, CUDA Driver]
https://developer.nvidia.com/cuda-llvm-compiler

Scenarios for the Compiler SDK
Building Production Quality Compilers
Building Domain Specific Languages (DSL)
Enabling Other Platforms
[Diagram: three tool-chain stacks built from NVCC, libcuda.lang, libnvvm and the CUDA Runtime, with boxes for CUDA Fortran, OpenACC, a DSL Front End, a DSL Runtime, an x86 LLVM Backend and an x86 CUDA Runtime]
Examples named on the slide: JET, MATLAB, HALIDE

Enabling Research in GPU Computing
[Diagram: CU++, CLANG, x86 LLVM Backend, libnvvm, Custom Runtime]

Proposed Additions for OpenACC 2.0
Address ambiguities in existing spec.
List of 30+ features to be added: nested parallelism; separate compilation; function calls; data directives for control, unstructured data, deep copy for C++ structures, noncontiguous memory; multiple devices; profiling interface.
Certification: OpenACC test suite.
http://www.openacc.org

Growing OpenACC Support

GPU-Accelerated Applications
www.nvidia.com/appscatalog

Summary
The Kepler architecture focuses on performance, efficiency and programmability.
NVIDIA's Parallel Computing Platform is evolving.
The Open Compiler Architecture and Compiler SDK play a very important role in broadening the GPU platform.
Strong growth in GPU-accelerated applications in academia and industry.

Thank you. Questions?
Axel Koehler, Sr. Solution Architect HPC
akoehler@nvidia.com