ADVANCES IN EXTREME-SCALE APPLICATIONS ON GPU. Peng Wang, HPC Developer Technology

NVIDIA SuperPhones to SuperComputers

Computers no longer get faster, just wider

Architectural Features Common to All Processors
Processor: pipelined multi-issue SIMD units (SM / CU / Core / Module); SIMD width
NVIDIA Kepler: 15 Streaming Multiprocessors; 32 threads (warp)
AMD Southern Islands: 32 Compute Units; 64 threads (wavefront)
Intel Xeon Phi: 60 Cores; 16-SIMD (512-bit vector)
Intel Sandy Bridge: 8 Cores; 8-SIMD (256-bit vector)
AMD Bulldozer: 8 Modules; 8-SIMD (256-bit vector)
[Diagram: a pipelined multi-issue SIMD unit, with a control unit (stages S1 to S3) feeding several processing units (stages S4 to S6)]

Evolution of GPUs: Codesign in Action!
1995: RIVA 128, 3M xtors
2000: GeForce 256, 23M xtors
2001: GeForce 3, 60M xtors
2003: GeForce FX, 250M xtors
2006: GeForce 8800, 681M xtors
2012: Kepler, 7B xtors
Eras: fixed function, programmable shaders, CUDA

By 2016 the video game market is expected to reach $82 billion

Computer graphics require billions of parallel computations

Why are so many parallel operations needed?
Millions of triangles, millions of pixels
Pipeline: input triangle, transform vertices, tessellate, projection, rasterize, shade
[Diagram: camera and image plane]

Scientific simulations require quadrillions of parallel computations per second

An Unlikely Symbiosis: Scientific computing and gaming are going in the SAME direction!

1 PARTICLE SIMULATION

PARTICLE SIMULATION HPC Ribosome simulated by NAMD, visualized by VMD Bond Atom Forces

PARTICLE SIMULATION GAMING Hair simulation NVIDIA Hair Demo

2 CONVOLUTION
The center element of the convolution kernel is placed over the source pixel. The source pixel is then replaced with a weighted sum of itself and nearby pixels.
[Figure: a grid of source pixel values, the convolution kernel, and the new pixel value (destination pixel)]
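The weighted-sum step described on this slide can be sketched in plain C. This is a minimal illustration, not code from the talk: the function name `convolve2d`, the row-major `float` image layout, and the zero-padding rule at the image border are all assumptions. As on the slide, the kernel is placed over the source pixel without being flipped.

```c
#include <assert.h>

/* Replace each pixel with a weighted sum of itself and its neighbors.
 * src and dst are row-major width x height images; kernel is ksize x ksize.
 * Assumption: pixels outside the image contribute zero. */
void convolve2d(const float *src, float *dst, int width, int height,
                const float *kernel, int ksize)
{
    assert(ksize % 2 == 1);            /* kernel needs a center element */
    int r = ksize / 2;

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float sum = 0.0f;
            for (int ky = -r; ky <= r; ky++) {
                for (int kx = -r; kx <= r; kx++) {
                    int sy = y + ky;
                    int sx = x + kx;
                    if (sy < 0 || sy >= height || sx < 0 || sx >= width)
                        continue;      /* outside the image: treated as zero */
                    sum += src[sy * width + sx] *
                           kernel[(ky + r) * ksize + (kx + r)];
                }
            }
            dst[y * width + x] = sum;  /* destination pixel */
        }
    }
}
```

Every destination pixel is computed independently of the others, which is why the two outer loops map naturally onto one GPU thread per pixel.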

CONVOLUTION HPC RTM Reverse Time Migration Petroleum Geo Services complex wave interaction near a salt tooth propagated using AxRTM

CONVOLUTION GAMING Depth of field Halo 3 Bungie Studios

3 SOLVING PARTIAL DIFFERENTIAL EQUATIONS (PDEs)
$\frac{\partial x}{\partial t} = f(x, t)$
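One simple way to advance an equation of this form in time is the forward Euler method, sketched below for a single scalar unknown. The slide does not say which scheme the featured solvers use, so the method, the names `euler_integrate` and `decay`, and the example right-hand side are all illustrative assumptions. In a real PDE solver x would be a field on a grid and f would include spatial derivatives, with each grid point updated in parallel.

```c
#include <assert.h>

/* Right-hand side f(x, t) of the evolution equation. */
typedef double (*rhs_fn)(double x, double t);

/* Advance x from time t0 by `steps` forward-Euler steps of size dt:
 * x_{n+1} = x_n + dt * f(x_n, t_n). */
double euler_integrate(rhs_fn f, double x0, double t0, double dt, int steps)
{
    assert(steps >= 0 && dt > 0.0);
    double x = x0;
    double t = t0;
    for (int i = 0; i < steps; i++) {
        x += dt * f(x, t);
        t += dt;
    }
    return x;
}

/* Hypothetical example right-hand side: exponential decay, f(x, t) = -x. */
double decay(double x, double t)
{
    (void)t;                           /* this f does not depend on t */
    return -x;
}
```

Calling `euler_integrate(decay, 1.0, 0.0, 0.001, 1000)` integrates the decay example from t = 0 to t = 1 and should land close to exp(-1).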

SOLVING PDEs HPC On the Development of a High-Order, Multi-GPU Enabled, Compressible Viscous Flow Solver for Mixed Unstructured Grids. P. Castonguay et al.

SOLVING PDEs GAMING Planetside 2 (Sony), Dark Void (Capcom), NVIDIA Turbulence Demo

4 FAST FOURIER TRANSFORM (FFT)

FFT HPC Turbulence simulation

FFT GAMING Ocean Simulation NVIDIA Ocean Demo

5 SPHERICAL HARMONICS

SPHERICAL HARMONICS - HPC Weather Prediction. Close-up of a mid-latitude cyclone, created by the Gordon Bell Award-winning atmospheric model AFES using SPH

SPHERICAL HARMONICS - GAMING Indirect Lighting. Normal diffuse lighting compared with indirect lighting. Robin Green, "Spherical Harmonic Lighting: The Gritty Details". Team Fortress 2 (Valve), Halo 3 (Bungie)

HPC and Gaming: Similarities at a fundamental level
Memory bandwidth bound. Gaming: ambient occlusion. HPC: sparse matrix-vector multiply.
Math bound. Gaming: most vertex and pixel shaders (Team Fortress 2, Valve). HPC: simulation of proteins and lipids (blood coagulation factor IX simulated by AMBER).
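To see why sparse matrix-vector multiply is listed as memory bandwidth bound, it helps to look at how little arithmetic each loaded value supports. The sketch below uses the compressed sparse row (CSR) layout, which is an assumption for illustration: the slide does not name a storage format, and `spmv_csr` is a hypothetical function name.

```c
#include <assert.h>

/* y = A * x, with A stored in CSR form:
 *   row_ptr[i] .. row_ptr[i+1]-1 index the nonzeros of row i,
 *   col_idx[k] is the column of nonzero k, val[k] its value. */
void spmv_csr(int rows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    assert(rows >= 0);
    for (int i = 0; i < rows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            /* Per nonzero: one multiply and one add (2 flops), against
             * loads of val[k], col_idx[k] and a gathered element of x. */
            sum += val[k] * x[col_idx[k]];
        }
        y[i] = sum;
    }
}
```

With 8-byte doubles and 4-byte ints (typical, but platform dependent), each nonzero streams at least 12 bytes from memory for only 2 floating-point operations, so throughput is set by memory bandwidth rather than by arithmetic rate.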

LOOKING AHEAD GAMES Today Tomorrow?

LOOKING AHEAD - HPC Today Tomorrow?

Same Fundamental Hardware Design Requirement: Power-limited
Phone/tablet: ~1W; PC: ~2W; Supercomputer: ~2MW
Energy efficiency

NVIDIA Leverages the GPU Technology Across Multiple Industries
HPC = Incremental Investment
Common base: Kepler GPU architecture
Products: GRID Visual Computing Appliance; GeForce (Consumer Graphics); Quadro (Professional Graphics); Tegra (Mobile Computing); Tesla (HPC); GRID (Cloud Computing)

Platform for Parallel Computing
The CUDA Platform is a foundation that supports a diverse parallel computing ecosystem.

GPU Computing Momentum
2008: 100M compute-capable GPUs; 150K CUDA Toolkit downloads; 1 supercomputer; 60 university courses; 4,000 academic papers
2013: 430M compute-capable GPUs; 1.6M CUDA Toolkit downloads; 50 supercomputers; 640 university courses; 37,000 academic papers

Investing in the Future Enable More Developers More Performance per Watt Future Computing Platforms

Enabling More Programming Languages
New language support: developers want to build front-ends for Python, Java, R, DSLs
Existing flow: CUDA C, C++, Fortran; LLVM Compiler for CUDA; NVIDIA GPUs, x86 CPUs
New processor support: target other processors like ARM, FPGAs, GPUs, x86

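OpenACC Directives

OpenMP (CPU):

    #include <stdio.h>

    int main(void) {
      const long n = 1000000;  /* step count assumed; the slide leaves n undefined */
      double pi = 0.0; long i;
      /* Split the loop across CPU threads and sum the partial results. */
      #pragma omp parallel for reduction(+:pi)
      for (i = 0; i < n; i++) {
        double t = (double)((i + 0.5) / n);
        pi += 4.0 / (1.0 + t * t);
      }
      printf("pi = %f\n", pi / n);
      return 0;
    }

OpenACC (CPU, GPU):

    #include <stdio.h>

    int main(void) {
      const long n = 1000000;  /* step count assumed; the slide leaves n undefined */
      double pi = 0.0; long i;
      /* Same loop; only the directive changes to target an accelerator. */
      #pragma acc parallel loop reduction(+:pi)
      for (i = 0; i < n; i++) {
        double t = (double)((i + 0.5) / n);
        pi += 4.0 / (1.0 + t * t);
      }
      printf("pi = %f\n", pi / n);
      return 0;
    }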

PGI: An NVIDIA Company. CUDA Fortran, OpenACC, CUDA x86

Unified Memory
Software prototype working on Kepler; in a future CUDA release
Hardware support in Maxwell
Roadmap: Tesla (CUDA), 2008; Fermi (FP64), 2010; Kepler (Dynamic Parallelism), 2012; Maxwell (Unified Virtual Memory), 2014

Explicit Memory Copies No Longer Required

With explicit copies:

    void sortfile(FILE *fp, int N) {
      char *data = (char*)malloc(N);
      char *sorted = (char*)malloc(N);
      fread(data, 1, N, fp);

      char *d_data, *d_sorted;
      cudaMalloc(&d_data, N);
      cudaMalloc(&d_sorted, N);
      cudaMemcpy(d_data, data, N, ...);

      parallel_sort<<<... >>>(d_sorted, d_data, N);

      cudaMemcpy(sorted, d_sorted, N, ...);
      cudaFree(d_data);
      cudaFree(d_sorted);

      use_data(sorted);
      free(data); free(sorted);
    }

With unified memory:

    void sortfile(FILE *fp, int N) {
      char *data = (char*)malloc(N);
      char *sorted = (char*)malloc(N);
      fread(data, 1, N, fp);

      parallel_sort<<<... >>>(sorted, data, N);

      use_data(sorted);
      free(data); free(sorted);
    }

Investing in the Future Enable More Developers More Performance per Watt Future Computing Platforms

More Performance per Watt
[Chart: DP GFLOPS per watt (log scale, 0.5 to 32) versus year (2008 to 2014), showing Tesla (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (Unified Virtual Memory) and Volta (Stacked DRAM)]

Investing in the Future Enable More Developers More Performance per Watt Future Computing Platforms

Kayla Development Platform
CUDA 5, OpenGL 4.3
Kick-starts the ARM + CUDA ecosystem
NAMD ported in 2 days
Quad ARM + Kepler GPU
Quad ARM + any CUDA GPU
https://developer.nvidia.com/kayla-platform

OpenPOWER Consortium

2018 Vision: Echelon Compute Node & System
[Diagram: a compute node with LOC, C, SM, L2 (256KB), NoC, MC, NIC, DRAM stacks, DRAM DIMMs and NV RAM blocks, attached to the system interconnect]
Node 0: 16 TF, 2 TB/s, 512+ GB (through Node 255)
Cabinet 0: 4 PF, 128 TB (through Cabinet N-1)
Echelon System: up to 1 EF
Key architectural features:
Malleable memory hierarchy
Hierarchical register files
Hierarchical thread scheduling
Place coherency/consistency
Temporal SIMT & scalarization
PGAS memory
HW accelerated queues
Active messages
AMOs everywhere
Collective engines
Streamlined LOC/TOC interaction
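The node and cabinet figures on this slide are mutually consistent if a cabinet holds 256 nodes. That count is an assumption read off the "Node 255" label (nodes numbered from 0), not something the slide states. The helper names below are hypothetical; the sketch only reproduces the arithmetic.

```c
#include <assert.h>

/* Assumption: 256 nodes per cabinet, inferred from the "Node 255" label. */
enum { NODES_PER_CABINET = 256 };

/* Peak throughput of one cabinet, in TFLOPS. */
double cabinet_tflops(double node_tflops)
{
    return NODES_PER_CABINET * node_tflops;
}

/* Memory capacity of one cabinet, in TB (1 TB taken as 1024 GB). */
double cabinet_mem_tb(double node_mem_gb)
{
    return NODES_PER_CABINET * node_mem_gb / 1024.0;
}

/* Smallest whole number of cabinets that reaches target_tflops. */
int cabinets_for(double target_tflops, double node_tflops)
{
    assert(node_tflops > 0.0);
    double per_cabinet = cabinet_tflops(node_tflops);
    int n = (int)(target_tflops / per_cabinet);
    if (n * per_cabinet < target_tflops)
        n++;                           /* round up to a whole cabinet */
    return n;
}
```

Under that assumption, `cabinet_tflops(16.0)` gives 4096 TF (about 4 PF), `cabinet_mem_tb(512.0)` gives 128 TB, and `cabinets_for(1.0e6, 16.0)` gives 245 cabinets for a 1 EF system.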