EXTENDING THE REACH OF PARALLEL COMPUTING WITH CUDA


Mark Harris, NVIDIA (@harrism, #NVSC14)

EXTENDING THE REACH OF CUDA

1. Machine Learning
2. Higher Performance
3. New Platforms
4. New Languages

GPUS: THE HOT MACHINE LEARNING PLATFORM

Image Recognition Challenge: 1.2M training images, 1,000 object categories.

GPU entries: 4 (2012), 60 (2013), 110 (2014)
Classification error rate: 28% (2010), 26% (2011), 16% (2012), 12% (2013), 7% (2014)

[Example classifications: person, car, helmet, motorcycle; person, dog, chair, bird, frog; person, hammer, flower pot, power drill]

GPU-ACCELERATED DEEP LEARNING (cuDNN)

High-performance routines for convolutional neural networks
Optimized for current and future NVIDIA GPUs
Integrated in major open-source frameworks: Caffe, Torch7, Theano
Flexible and easy-to-use API
Also available for ARM / Jetson TK1

Baseline Caffe compared to Caffe accelerated by cuDNN on a K40:
Caffe (CPU*): 1x; Caffe (GPU): 11x; Caffe (cuDNN): 14x
*CPU is a 24-core E5-2697v2 @ 2.4 GHz with Intel MKL 11.1.3

https://developer.nvidia.com/cudnn

EXTENDING THE REACH OF CUDA: 2. Higher Performance

NVLINK: HIGH-SPEED GPU INTERCONNECT

2014 (Kepler GPU): GPUs connect to x86, ARM64, and POWER CPUs over PCIe.
2016 (Pascal GPU): NVLink connects GPU to GPU; GPUs connect to x86 and ARM64 CPUs over PCIe, and to POWER CPUs over NVLink.

NVLINK UNLEASHES MULTI-GPU PERFORMANCE

NVLink is 5x faster than PCIe Gen3 x16. When next-generation GPUs are interconnected via NVLink rather than through a PCIe switch, applications see over 2x speedup.

Speedup vs. a PCIe-based server (up to 2.25x): ANSYS Fluent, Multi-GPU Sort, LQCD (QUDA), AMBER, 3D FFT.

3D FFT and ANSYS: 2-GPU configuration; all other apps: 4-GPU configuration. AMBER Cellulose (256x128x128); FFT problem size 256^3.

NVLINK + UNIFIED MEMORY: SIMPLER, FASTER

NVLink (80 GB/s, 5x faster than PCIe Gen3 x16) combined with Unified Memory lets applications share data structures at CPU memory speeds, not PCIe speeds, eliminating multi-GPU scaling bottlenecks.

EXTENDING THE REACH OF CUDA: 3. New Platforms

TESLA ACCELERATED COMPUTING PLATFORM

Data center infrastructure, development, and system solutions:
- GPU Accelerators: GPU Boost
- Interconnect: GPUDirect, NVLink
- System Management: NVML
- Compiler Solutions: LLVM
- Profile and Debug: CUDA Debugging API
- Libraries: cuBLAS

COMMON PROGRAMMING APPROACHES
Across a variety of heterogeneous systems (GPU and x86):
- Libraries: AmgX, cuBLAS
- Compiler Directives
- Programming Languages

EXTENDING THE REACH OF CUDA: 4. New Languages

MAINSTREAM PARALLEL PROGRAMMING

Enable more programmers to write parallel software
Give programmers the choice of language to use
Embrace and evolve key programming standards

C++ PARALLEL ALGORITHMS LIBRARY PROGRESS

    std::vector<int> vec = ...

    // previous standard sequential loop
    std::for_each(vec.begin(), vec.end(), f);

    // explicitly sequential loop
    std::for_each(std::seq, vec.begin(), vec.end(), f);

    // permitting parallel execution
    std::for_each(std::par, vec.begin(), vec.end(), f);

Complete set of parallel primitives: for_each, sort, reduce, scan, etc.
The ISO C++ committee voted unanimously to accept N3960 as an official technical specification working draft.
Technical Specification Working Draft: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3960.pdf
Prototype: https://github.com/n3554/n3554

LINUX GCC COMPILER TO SUPPORT GPU ACCELERATORS

Open source: GCC effort by Mentor Embedded
Pervasive impact: free to all Linux users
Mainstream: the most widely used HPC compiler
On track for GCC 5

"Incorporating OpenACC into GCC is an excellent example of open source and open standards working together to make accelerated computing broadly accessible to all Linux developers." (Oscar Hernandez, Oak Ridge National Laboratory)

NUMBA PYTHON COMPILER

Free and open-source JIT compiler for array-oriented Python. The numba.cuda module integrates CUDA directly into Python:

    @cuda.jit('void(float32[:], float32, float32[:], float32[:])')
    def saxpy(out, a, x, y):
        i = cuda.grid(1)
        if i < out.size:
            out[i] = a * x[i] + y[i]

    # Launch saxpy kernel on 100 blocks of 512 threads
    saxpy[100, 512](out, a, x, y)

NumbaPro: commercial extension of Numba with Python interfaces to CUDA libraries.
http://numba.pydata.org/

ACCELERATING JAVA 3 WAYS

1. Drop-in acceleration: accelerate Java SE libraries with CUDA, e.g. java.util.Arrays.sort(int[] a)
2. CUDA4J: CUDA programming via Java APIs
3. Accelerate pure Java

IBM Developer Kits for Java: ibm.com/java/jdk
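The drop-in path needs no new API: a GPU-enabled JVM (such as IBM's developer kits referenced above) can offload a plain Arrays.sort call with no source changes. A minimal sketch in standard Java, runnable on any JVM (the class name is illustrative):

```java
import java.util.Arrays;

public class DropInSort {
    public static void main(String[] args) {
        // Standard JDK call; on a GPU-enabled JVM, large int[] sorts
        // like this one may be offloaded to the GPU transparently.
        int[] data = {42, 7, 19, 3, 88, 1};
        Arrays.sort(data);
        System.out.println(Arrays.toString(data));  // [1, 3, 7, 19, 42, 88]
    }
}
```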

CUDA4J: GPU PROGRAMMING IN A JAVA API

Access the CUDA programming model with Java best practices: manage CUDA devices and kernels, transfer data easily between the Java heap and the CUDA device, and launch kernels simply and flexibly.

    void add(int[] a, int[] b, int[] c) throws CudaException, IOException {
        CudaDevice device = new CudaDevice(0);
        CudaModule module = new CudaModule(device,
                getClass().getResourceAsStream("ArrayAdder"));
        CudaKernel kernel = new CudaKernel(module, "Cuda_cuda4j_samples_adder");
        CudaGrid grid = new CudaGrid(512, 512);
        try (CudaBuffer aBuffer = new CudaBuffer(device, a.length * 4);
             CudaBuffer bBuffer = new CudaBuffer(device, b.length * 4)) {
            aBuffer.copyFrom(a, 0, a.length);
            bBuffer.copyFrom(b, 0, b.length);
            kernel.launch(grid, new CudaKernel.Parameters(aBuffer, bBuffer, a.length));
            aBuffer.copyTo(c, 0, a.length);
        }
    }

ACCELERATING PURE JAVA ON GPUS

Express computation as aggregate parallel operations on data streams, using Java 8 streams and lambda expressions:

    IntStream.range(0, N).parallel().forEach(i -> c[i] = a[i] + b[i]);

Benefits:
- Standard Java idioms, so no code changes required
- No knowledge of the GPU programming model required
- No low-level device manipulation: the Java implementation has the controls
- Future JIT smarts do not require application code changes
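The one-line idiom above can be wrapped into a self-contained program. On a stock JVM it runs across CPU cores; a GPU-enabled JIT can offload the same unchanged code. The class and method names here are illustrative:

```java
import java.util.stream.IntStream;

public class StreamVectorAdd {
    // Element-wise vector add expressed as an aggregate parallel operation.
    static int[] add(int[] a, int[] b) {
        int[] c = new int[a.length];
        IntStream.range(0, a.length).parallel().forEach(i -> c[i] = a[i] + b[i]);
        return c;
    }

    public static void main(String[] args) {
        int[] c = add(new int[]{1, 2, 3}, new int[]{10, 20, 30});
        System.out.println(java.util.Arrays.toString(c));  // [11, 22, 33]
    }
}
```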

JIT / GPU OPTIMIZATION OF LAMBDA EXPRESSIONS

The JIT recognizes this Java matrix multiplication and speeds it up when run on a GPU-enabled host (IBM POWER8 with an NVIDIA K40m GPU):

    public void multiply() {
        IntStream.range(0, COLS * COLS).parallel().forEach(id -> {
            int i = id / COLS;
            int j = id % COLS;
            int sum = 0;
            for (int k = 0; k < COLS; k++) {
                sum += left[i * COLS + k] * right[k * COLS + j];
            }
            output[i * COLS + j] = sum;
        });
    }
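A runnable sketch of the same pattern, with an illustrative driver that multiplies by the identity matrix to check the result. COLS and the flat row-major layout follow the slide; the class name and driver are assumptions:

```java
import java.util.stream.IntStream;

public class LambdaMatMul {
    static final int COLS = 4;

    // Square matrices stored row-major in flat arrays, as in the slide.
    static int[] multiply(int[] left, int[] right) {
        int[] output = new int[COLS * COLS];
        IntStream.range(0, COLS * COLS).parallel().forEach(id -> {
            int i = id / COLS;
            int j = id % COLS;
            int sum = 0;
            for (int k = 0; k < COLS; k++) {
                sum += left[i * COLS + k] * right[k * COLS + j];
            }
            output[i * COLS + j] = sum;
        });
        return output;
    }

    public static void main(String[] args) {
        // Multiplying by the identity matrix should return the input unchanged.
        int[] a = new int[COLS * COLS];
        int[] id = new int[COLS * COLS];
        for (int i = 0; i < COLS * COLS; i++) a[i] = i + 1;
        for (int i = 0; i < COLS; i++) id[i * COLS + i] = 1;
        int[] out = multiply(a, id);
        System.out.println(java.util.Arrays.equals(a, out));  // true
    }
}
```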

COMMON PROGRAMMING APPROACHES
Across a variety of heterogeneous systems (GPU and x86):
- Libraries: AmgX, cuBLAS
- Compiler Directives
- Programming Languages